The Computer Vision seminar covers recent topics in computer vision. In FSS2024, the seminar will be on “Large Language Models and Foundation Models for Computer Vision”, which enable impressive capabilities such as open-world image classification, single-image depth estimation, and high-quality object tracking.
In this seminar, you will give a presentation and write a report on a topic in this area.
Here you can find resources relevant for the seminar:
Each student works on a topic within the area of the seminar along with an accompanying reference paper. Your presentation and report should explore the topic with an emphasis on the reference paper, but not just the reference paper.
We strongly encourage you to explore the available literature and suggest a topic and reference paper of your own choice. Reference papers should be strong papers from a major venue; contact us if you are unsure.
We provide example topics and reference papers below.
Topic List:
[1] Gesture-Informed Robot Assistance via Foundation Models
[2] YOLO-World: Real-Time Open-Vocabulary Object Detection
[3] Lumiere: A Space-Time Diffusion Model for Video Generation
[4] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
[5] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
[6] Tracking Everything Everywhere All at Once
[7] Meta-Transformer: A Unified Framework for Multimodal Learning
[8] LIMA: Less Is More for Alignment
[9] IMAGEBIND: One Embedding Space To Bind Them All
[10] SEEM: Segment Everything Everywhere All at Once
[11] SegGPT: Segmenting Everything In Context
[12] Zero-1-to-3: Zero-shot One Image to 3D Object
[13] 3D-GPT: Procedural 3D Modeling with Large Language Models
The following survey articles and tutorials are good starting points for getting an overview of the seminar topics:
Multimodal Foundation Models: From Specialists to General-Purpose Assistants