TUM Master's Seminar: Advanced Topics in Vision-Language Models (SS 2025)

Content


The seminar explores cutting-edge advancements in Vision-Language Models (VLMs), focusing on topics central to their development and application. Through a deep dive into seminal papers and the latest research, students will gain an understanding of how models like CLIP, Llama, and Stable Diffusion work at an architectural and mathematical level. By the end of the seminar, students should have a comprehensive perspective on the current state and future potential of vision-language modeling and be equipped to evaluate new research, identify promising applications, and contribute meaningfully to the responsible development of this field.


This is a Master's-level course. Since the topics are complex, prior participation in at least one of the following lectures is required:

  • Introduction to Deep Learning (IN2346)
  • Machine Learning (IN2064)

Additionally, we recommend having taken at least one advanced deep learning lecture, for example:

  • AML: Deep Generative Models (CIT4230003)
  • Machine Learning for Graphs and Sequential Data (IN2323)
  • Computer Vision III: Detection, Segmentation, and Tracking (IN2375)
  • Machine Learning for 3D Geometry (IN2392)
  • Advanced Natural Language Processing (CIT4230002)
  • ADL4CV (IN2390)
  • ADL4R (IN2349)

or a related practical.

Organization


The preliminary meeting will take place at 2pm on Wednesday, 12th of February 2025 on Zoom. See Slides.


The seminar awards 5 ECTS credits and will take place in person at SAP Labs Munich on the Garching campus.


All students will be matched to one topic group consisting of a primary paper and two secondary papers. They are expected to give one short and one long presentation on their primary paper (from the perspective of an academic reviewer), as well as a one-slide summary of each secondary paper from two different perspectives (industry practitioner and academic researcher).


The tentative schedule of the seminar is as follows:

  • Online introductory session: April 9th (1-2:30pm), Zoom
  • Onsite short presentations: April 23rd and 30th (1-3pm)
  • Onsite long presentations: May 21st, May 28th, June 4th, June 11th, and July 2nd (1-3pm)

For questions, please contact yiran.huang@helmholtz-munich.de or sanghwan.kim@helmholtz-munich.de.


Topics to select from:

Foundation VLMs


  1. Learning Transferable Visual Models From Natural Language Supervision
  2. Visual Instruction Tuning
  3. High-Resolution Image Synthesis With Latent Diffusion Models
  4. JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

RLHF in Vision Language Models


  1. MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
  2. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
  3. KTO: Model Alignment as Prospect Theoretic Optimization
  4. Diffusion Model Alignment Using Direct Preference Optimization

Applications of T2I Models


  1. Adding Conditional Control to Text-to-Image Diffusion Models
  2. ReNoise: Real Image Inversion Through Iterative Noising
  3. DataDream: Few-shot Guided Dataset Generation
  4. DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations

Concept-based Explainability


  1. Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
  2. Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
  3. PDiscoNet: Semantically consistent part discovery for fine-grained recognition
  4. Sparse Autoencoders Find Highly Interpretable Features in Language Models

Compositionality


  1. When and why vision-language models behave like bags-of-words, and what to do about it?
  2. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
  3. Language-only Efficient Training of Zero-shot Composed Image Retrieval
  4. Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

Requirements


A successful participation in the seminar includes:

  • Active participation in the entire event: we have a 70% attendance policy for this seminar (you need to attend at least 5 of the 7 sessions).
  • Short presentation (10 minutes talk including questions)
  • Long presentation (20 minutes talk including questions)

Registration


Registration must be done through the TUM Matching Platform.