Seminar CS717: Seminar on Computer Vision (FSS 2024)

The Computer Vision seminar covers recent topics in computer vision. In FSS 2024, the seminar will be on “Large Language Models and Foundation Models for Computer Vision”, which enable tasks such as open-world image classification, single-image depth estimation, and high-quality object tracking.

Organization

  • This seminar is organized by Prof. Dr.-Ing. Margret Keuper
  • Available for Master's students (2 SWS, 4 ECTS)
  • Prerequisites: solid background in machine learning
  • Maximum number of participants is 12 students

Goals

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10 single-column pages + references)
  • Give two presentations about your topic (3-minute flash presentation, 15-minute final presentation)
  • Moderate a scientific discussion about a fellow student's topic
  • Review a draft of a fellow student's report

Schedule

Resources

Here you can find resources relevant for the seminar:

Topics

Each student works on a topic within the scope of the seminar, together with an accompanying reference paper. Your presentation and report should explore the topic with an emphasis on the reference paper, but should not be limited to it.

We strongly encourage you to explore the available literature and suggest a topic and reference paper of your own choice. Reference papers should be strong papers from a major venue; contact us if you are unsure.

We provide example topics and reference papers below.

Topic List:

[1] Gesture-Informed Robot Assistance via Foundation Models

[2] YOLO-World: Real-Time Open-Vocabulary Object Detection

[3] Lumiere: A Space-Time Diffusion Model for Video Generation

[4] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

[5] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

[6] Tracking Everything Everywhere All at Once

[7] Meta-Transformer: A Unified Framework for Multimodal Learning

[8] LIMA: Less Is More for Alignment

[9] IMAGEBIND: One Embedding Space To Bind Them All

[10] SEEM: Segment Everything Everywhere All at Once

[11] Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

[12] SegGPT: Segmenting Everything In Context

[13] Zero-1-to-3: Zero-shot One Image to 3D Object

[14] 3D-GPT: Procedural 3D Modeling with Large Language Models

Getting started

The following survey articles are good starting points for getting an overview of the topics of the seminar:

Multimodal Foundation Models: From Specialists to General-Purpose Assistants 

Vision-Language Models for Vision Tasks: A Survey