Seminar CS717: Seminar on Computer Vision (FSS 2024)

The Computer Vision seminar covers recent topics in computer vision. In FSS 2024, the seminar focuses on “Large Language Models and Foundation Models for Computer Vision”. These models enable high-quality open-world image classification, single-image depth estimation, and object tracking.


  • This seminar is organized by Prof. Dr.-Ing. Margret Keuper
  • Available for Master's students (2 SWS, 4 ECTS)
  • Prerequisites: solid background in machine learning
  • Maximum number of participants is 12 students


In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10 single-column pages + references)
  • Give two presentations about your topic (a 3-minute flash presentation and a 15-minute final presentation)
  • Moderate a scientific discussion about the topic of one of your fellow students
  • Review a (draft of a) report of a fellow student



Here you can find resources relevant for the seminar:


Each student works on a topic within the area of the seminar along with an accompanying reference paper. Your presentation and report should explore the topic with an emphasis on the reference paper, but not just the reference paper.

We strongly encourage you to explore the available literature and suggest a topic and reference paper of your own choice. Reference papers should be strong papers from a major venue; contact us if you are unsure.

We provide example topics and reference papers below.

Topic List:

[1] Gesture-Informed Robot Assistance via Foundation Models

[2] YOLO-World: Real-Time Open-Vocabulary Object Detection 

[3] Lumiere: A Space-Time Diffusion Model for Video Generation

[4] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data 

[5] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects 

[6] Tracking Everything Everywhere All at Once

[7] Meta-Transformer: A Unified Framework for Multimodal Learning

[8] LIMA: Less Is More for Alignment

[9] ImageBind: One Embedding Space To Bind Them All

[10] SEEM: Segment Everything Everywhere All at Once 

[11] Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks 

[12] SegGPT: Segmenting Everything In Context 

[13] Zero-1-to-3: Zero-shot One Image to 3D Object

[14] 3D-GPT: Procedural 3D Modeling with Large Language Models

Getting started

The following survey articles and tutorials are good starting points for an overview of the seminar's topics:

Multimodal Foundation Models: From Specialists to General-Purpose Assistants 

Vision-Language Models for Vision Tasks: A Survey