SM445/CS 707: Data and Web Science Seminar (HWS 2024)

This term's seminar: Inside Deep Learning: Understanding Neural Network Behavior.

Neural networks have become incredibly powerful but often operate as “black boxes”: their decision processes are typically difficult to understand. The field of interpretability investigates how and why these models make certain decisions. This seminar offers an opportunity to dive into the latest interpretability research, which aims to bridge the gap between model complexity and human understanding. By exploring how deep learning models process and generate information, students will gain insights into model behavior, learn how to enhance model performance, and learn how to ensure the ethical use of these models in real-world applications.

Organization

  • This seminar is organized by Prof. Dr. Rainer Gemulla and Jannik Brinkmann.
  • Available for up to 8 Master's students (4 ECTS) and up to 4 Bachelor's students (5 ECTS).
  • Prerequisites: solid background in machine learning.

Goals

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10 single-column pages + references)
  • Give two presentations about your topic (a 3-minute flash presentation and a 15-minute final presentation)
  • Moderate a scientific discussion about the topic of one of your fellow students
  • Review a (draft of a) report of a fellow student

Schedule

  • Register as described below.
  • Attend the kickoff meeting on Sep 11, 17:15 (room TBD).
  • Work individually throughout the semester according to the seminar schedule.
  • Meet your advisor for guidance and feedback.

Registration

Please register via Portal 2 by September 2.

If you are accepted into the seminar, provide at least 4 topic areas of your preference (your own and/or example topics; see below) by September 8 via email to Jannik Brinkmann. The actual topic assignment takes place soon afterwards; we will notify you via email. Our goal is to assign one of your preferred topic areas to you.

Topic areas and topics

You will be assigned a topic area in an active, relevant field of machine learning based on your preferences. Your goals in this seminar are to

  1. Provide a short, concise overview of this topic area (1/4). A good starting point may be a book chapter, survey paper, or recent research paper. Here you take a bird's-eye view and are expected to discuss the main goals, challenges, and relevance of your topic area. Topic areas are selected at the beginning of the seminar.
  2. Present a self-selected topic within this area in more detail (3/4). A good starting point is a recent or highly influential research paper. Here you dive deep into one particular topic and are expected to discuss and explain the concrete problem statement, the concrete solution or contribution, as well as your own thoughts. The actual topic is selected before the first tutor meeting.

You are generally free to propose your own topic area of interest as long as it aligns with the overall theme and objectives of the seminar.

For more information, please refer to recent surveys, including Ferrando et al. (2024) and Mueller et al. (2024), as well as Google Scholar.

Below are some examples of suitable topic areas, each with a potential reference paper (i.e., topic). However, you are free to choose any other paper to dive deep into.

0. Introduction to transformer interpretability (by us & by you at home)
Ferrando et al.
A Primer on the Inner Workings of Transformer-based Language Models
Preprint 2024

1. Causal mediation analysis for neural network interpretability (BSc students preferred)
Vig et al.
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
NeurIPS 2020

2. Fundamental limitations of model-agnostic interpretability approaches (BSc students preferred)
Geirhos et al.
Don't Trust Your Eyes: On the (Un)reliability of Feature Visualizations
ICML 2024

3. Causal interpretability: Challenges and opportunities (BSc students preferred)
Mueller
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
ICML 2024 Workshop on Mechanistic Interpretability

4. Localising information using probing classifiers (BSc students preferred)
Belinkov
Probing Classifiers: Promises, Shortcomings & Advances
ACL 2022

5. Model editing: Locating and editing factual associations in language models
Meng et al.
Locating and Editing Factual Associations in GPT
NeurIPS 2022

6. Steering vectors: Representation engineering for modifying model behavior
Zou et al.
Representation Engineering: A Top-Down Approach to AI Transparency
Preprint 2023

7. Transformer circuits: Finding computational subgraphs in neural networks
Wang et al.
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
ICLR 2023

8. Feature disentanglement: Dictionary learning using sparse autoencoders
Cunningham et al.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
ICLR 2024

9. Explanations for multimodal models: Understanding the connection between images and language
Gandelsman et al.
Interpreting CLIP’s Image Representation via Text-Based Decomposition
ICLR 2024

10. Interpretability agents: Towards automating interpretability research
Shaham et al.
A Multimodal Automated Interpretability Agent
ICML 2024

11. The emergence of linear representations in language models
Park et al.
The Linear Representation Hypothesis and the Geometry of Large Language Models
ICML 2024

12. Grokking: The dynamics of generalisation beyond overfitting
Nanda et al. 
Progress Measures for Grokking via Mechanistic Interpretability
ICLR 2023

Supplementary materials and references