Universität Wien

136041 SE Explainability for LLMs (2025W)

Continuous assessment of course work
ON-SITE

Registration/Deregistration

Note: The time of your registration within the registration period has no effect on the allocation of places (no first come, first served).

Details

max. 25 participants
Language: English

Lecturers

Benjamin Roth

Classes

07.10.2025 Introductory lecture 1 & assignment of first topics (Benjamin Roth)
14.10.2025 Introductory lecture 2 & assignment of remaining topics (Benjamin Roth)
21.10.2025 Introductory lecture 3 & hands-on session (Benjamin Roth)
28.10.2025 -- 20.01.2026 Presentations by participants
27.01.2026 Seminar closing, discussion of learnings & future directions of XAI

  • Tuesday 07.10. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 14.10. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 21.10. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 28.10. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 04.11. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 18.11. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 25.11. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 02.12. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 09.12. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 16.12. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 13.01. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 20.01. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3
  • Tuesday 27.01. 09:45 - 11:15 Hörsaal 2 Hauptgebäude, Tiefparterre Stiege 5 Hof 3

Information

Aims, contents and method of the course

In recent years, large language models (LLMs) have made remarkable progress and are now applied in a wide range of contexts. Yet, their outputs can often be surprising or puzzling, raising the question: why did the model respond this way?

The field of Explainable Artificial Intelligence (XAI) seeks to address such questions, focusing on the mechanisms and influences that shape a model's behavior:

What processes lead an LLM to generate a particular response?

Which factors play a role - such as the input query, training data, or model parameters?

How can these inner workings be communicated to users in a way that increases understanding and creates the right level of trust?

How can we evaluate the quality and usefulness of explanations?

In this seminar, we will examine both foundational and recent research on explainability in LLMs. Through reading, presenting, and group discussion, participants will explore current approaches, open challenges, and potential paths forward in making LLMs more transparent and interpretable.

The topic of explainability will be approached both from the perspectives of (a) philosophy, social science, and human-computer interaction, and (b) machine learning and artificial intelligence. Topics can be chosen from either of these areas, depending on participants' background and preferences.

Possible topics to be covered in the seminar:

What theories of "explanation" and "explaining" exist?

How can AI models be explained in terms of the input? (E.g., which were the most important words in a query? See the first sketch after this list.)

How can LLMs be explained in terms of the training data? (E.g., what were the most important training examples? See the second sketch after this list.)

How can LLMs be explained in terms of internal weights and activations? (How can they be summarized or represented graphically?)

How can the quality of explanations be measured? Do users just "like" or "prefer" an explanation, or do they actually "understand" or "learn" something about the explained LLM?
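
First sketch (explaining a prediction in terms of the input): a minimal, illustrative example, not part of the official course materials, of occlusion-based word importance. Each word of a query is removed in turn and the drop in the model's confidence in its original prediction is recorded; words with the largest drops are treated as the most important. It assumes the Hugging Face transformers library is installed and uses its default sentiment-analysis pipeline; the example query is made up.

from transformers import pipeline

# Any text-classification pipeline works; this downloads a small default
# sentiment model the first time it is created.
clf = pipeline("sentiment-analysis")

def occlusion_importance(text):
    """Score each word by how much the model's confidence in its original
    prediction drops when that word is removed (leave-one-out occlusion)."""
    words = text.split()
    base = clf(text)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        pred = clf(reduced)[0]
        # Confidence assigned to the *original* label after removing this word
        conf = pred["score"] if pred["label"] == base["label"] else 1.0 - pred["score"]
        scores.append((word, base["score"] - conf))
    return sorted(scores, key=lambda pair: -pair[1])

# Hypothetical query; words with the largest scores mattered most to the prediction.
print(occlusion_importance("The plot was dull but the acting was brilliant"))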
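
Second sketch (explaining a prediction in terms of the training data): a toy illustration of the gradient-tracing idea behind TracIn (Pruthi et al., 2020; see the reading list), in which the influence of a training example on a test prediction is approximated by the dot product of their loss gradients at a model checkpoint. The data and the small linear PyTorch model are made up, and a real analysis would sum over several checkpoints and scale by the learning rate; this only conveys the shape of the computation.

import torch

torch.manual_seed(0)
X_train = torch.randn(20, 4)                       # toy training inputs
y_train = (X_train[:, 0] > 0).long()               # toy labels
x_test, y_test = torch.randn(4), torch.tensor(1)   # one toy test point

model = torch.nn.Linear(4, 2)   # stands in for a trained model checkpoint
loss_fn = torch.nn.CrossEntropyLoss()

def loss_grad(x, y):
    """Flattened gradient of the loss at (x, y) w.r.t. the model parameters."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, tuple(model.parameters()))
    return torch.cat([g.flatten() for g in grads])

g_test = loss_grad(x_test, y_test)
influence = [(i, float(loss_grad(x, y) @ g_test))
             for i, (x, y) in enumerate(zip(X_train, y_train))]
# Training points with the largest positive influence "helped" this prediction most.
print(sorted(influence, key=lambda pair: -pair[1])[:5])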

Assessment and permitted materials

Participants will present one topic from the list of seminar topics, based on the suggested literature. The presentation should be roughly 25 minutes long (hard limits: min. 20 minutes, max. 30 minutes) and is followed by a Q&A session and discussion. Participants will also submit a written report describing the main contents of the presented paper(s) and putting them in a wider context.

Minimum requirements and assessment criteria

Your presentation will account for 45% of the grade, participation in discussions for 10%, and the written report for 45%.

Examination topics

The presented seminar topic and the associated literature; see the assessment criteria above.

Reading list

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608

Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? https://arxiv.org/abs/2004.03685

Ferrando, J., Sarti, G., Bisazza, A., & Costa-Jussà, M. R. (2024). A primer on the inner workings of transformer-based language models. https://arxiv.org/pdf/2405.00208

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., ... & Guo, J. (2024). A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594.

Hammoudeh, Z., & Lowd, D. (2024). Training data influence analysis and estimation: A survey. Machine Learning, 113(5), 2351-2403.

Hase, P., & Bansal, M. (2020). Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? arXiv preprint arXiv:2005.01831.

Hempel, C. G., & Oppenheim, P. (1948). Studies in the Logic of Explanation. Philosophy of Science, 15(2), 135-175.

Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions. In International Conference on Machine Learning (pp. 1885-1894). https://arxiv.org/abs/1703.04730

Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

Lyu, Q., Apidianaki, M., & Callison-Burch, C. (2024). Towards faithful model explanation in NLP: A survey. Computational Linguistics, 50(2), 657-723.

Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1-38.

Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., ... & Seifert, C. (2023). From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Computing Surveys, 55(13s), 1-42.

Pruthi, G., Liu, F., Kale, S., & Sundararajan, M. (2020). Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2002.08484

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).

Singh, C., Inala, J. P., Galley, M., Caruana, R., & Gao, J. (2024). Rethinking interpretability in the era of large language models. https://arxiv.org/abs/2402.01761

Sun, J., Atanasova, P., & Augenstein, I. (2025, April). Evaluating Input Feature Explanations through a Unified Diagnostic Evaluation Framework. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 10559-10577).

Vilone, G., & Longo, L. (2021). Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion.

Association in the course directory

S-DH Cluster I: Language and Literature

Last modified: Fri 26.09.2025 17:46