Seminar CS715: Solving Complex Tasks using Large Language Models (FSS 2024)

The seminar explores prompt engineering techniques that enable LLMs to handle complex tasks, as well as the use of LLMs to evaluate complex outputs. The seminar features both literature and experimental topics. The literature topics aim to summarize the state of the art concerning the application and evaluation of LLMs. The experimental topics aim to verify the utility of advanced prompt engineering techniques by applying them to tasks beyond those used in the respective papers for illustration and evaluation.

Organization

Goals

In this seminar, you will

  • read, understand, and explore scientific literature
  • critically summarize the state of the art concerning your topic
  • experimentally verify the utility of advanced prompt engineering methods
  • give a presentation about your topic (before the submission of the report)

Schedule

  1. Please register for the seminar via the centrally-coordinated seminar registration in Portal2.
  2. After you have been accepted into the seminar, please email us your three preferred topics from the list below.
    We will assign topics to students according to your preferences.
  3. Attend the kickoff meeting on February 29th at 15:30, in which we will discuss general requirements for the reports and presentations and answer initial questions about the topics.
  4. You will be assigned a mentor, who will provide guidance in one-to-one meetings.
  5. Work individually throughout the semester: explore the literature, perform experiments (if you are assigned an experimental topic), create a presentation, and write a report.
  6. Give your presentation in a block seminar on April 29th.
  7. Write and submit your seminar thesis by June 2024.

Topics

Prompt Engineering

1. Experimental Topic: From Self-Consistency to MedPrompt: Improving Results by Ensembling LLMs

  • Wang, et al.: Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 (2022)
  • Nori, Harsha, et al. “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine.” arXiv preprint arXiv:2311.16452 (2023).
  • Zhao, et al.: A Survey of Large Language Models. arXiv:2303.18223 (2023)
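
To give a concrete starting point for the experiments in this topic, the core idea of self-consistency can be sketched in a few lines: sample several chain-of-thought completions at a non-zero temperature, extract the final answer from each, and return the majority answer. The sketch below is a minimal illustration, not code from the papers above; `sample_completion(prompt)` is a placeholder for whatever LLM API you end up using.

    import re
    from collections import Counter

    def self_consistency(prompt, sample_completion, n_samples=10):
        """Sample several chain-of-thought completions and majority-vote on the answer.

        sample_completion: callable that sends `prompt` to an LLM with temperature > 0
        and returns the generated text (assumed to end with "Answer: <value>").
        """
        answers = []
        for _ in range(n_samples):
            completion = sample_completion(prompt)
            match = re.search(r"Answer:\s*(.+)", completion)
            if match:
                answers.append(match.group(1).strip())
        if not answers:
            return None
        # The most frequent final answer wins; ties are broken arbitrarily.
        return Counter(answers).most_common(1)[0][0]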

2. Experimental Topic: Prompt Search / Breeding

  • Fernando, Chrisantha, et al. “Promptbreeder: Self-referential self-improvement via prompt evolution.” arXiv preprint arXiv:2309.16797 (2023).
  • Liu, Pengfei, et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.” ACM Computing Surveys 55.9 (2023): 1–35.
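
A minimal prompt-search loop, as a starting point for this topic: keep a population of candidate task prompts, score each on a small development set, retain the best, and generate variations of the survivors (for example, by asking an LLM to rephrase them). This is a simplified sketch of evolutionary prompt search; Promptbreeder additionally evolves the mutation prompts themselves. `score` and `mutate` are caller-supplied placeholders.

    import random

    def evolve_prompts(seed_prompts, score, mutate, generations=5, population_size=8):
        """Simplified evolutionary prompt search.

        score(prompt)  -> quality of the prompt on a small development set
        mutate(prompt) -> a variation of the prompt, e.g. an LLM-generated rephrasing
        """
        population = list(seed_prompts)
        for _ in range(generations):
            ranked = sorted(population, key=score, reverse=True)
            survivors = ranked[: max(2, population_size // 2)]  # keep the fittest half
            children = [mutate(random.choice(survivors))
                        for _ in range(population_size - len(survivors))]
            population = survivors + children
        return max(population, key=score)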

3. Experimental Topic: Active Prompt

  • Diao, Shizhe, et al. “Active Prompting with Chain-of-Thought for Large Language Models.” arXiv, May 23, 2023.
  • Mavromatis, Costas, et al. “Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection.” arXiv, October 30, 2023.
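
The selection step at the heart of active prompting can be sketched as follows: sample several answers per question, measure how much the model disagrees with itself, and hand the most uncertain questions to a human for chain-of-thought annotation. The sketch uses the number of distinct answers as a simple disagreement measure (the paper also studies other uncertainty metrics); `sample_answer` is a placeholder for your LLM call.

    def select_for_annotation(questions, sample_answer, k=5, budget=8):
        """Rank questions by self-disagreement and return the most uncertain ones.

        sample_answer: callable that returns the model's short answer to a question,
        sampled with temperature > 0 so that repeated calls can differ.
        """
        def disagreement(question):
            answers = [sample_answer(question) for _ in range(k)]
            # Fraction of distinct answers in k samples: higher means more uncertain.
            return len(set(answers)) / k

        return sorted(questions, key=disagreement, reverse=True)[:budget]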

4. Experimental Topic: Contrastive Prompting

  • Chia, Yew Ken, et al. “Contrastive Chain-of-Thought Prompting.” arXiv preprint arXiv:2311.09277 (2023).
  • Paranjape, Bhargavi, et al. “Prompting contrastive explanations for commonsense reasoning tasks.” arXiv preprint arXiv:2106.06823 (2021).
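
For orientation, a contrastive chain-of-thought prompt pairs a correct demonstration with an explicitly wrong one, so the model sees both the reasoning to imitate and the reasoning to avoid. The template below is a hand-written illustration in the spirit of Chia et al., not an example taken from the paper.

    # Hypothetical contrastive chain-of-thought template; fill in {question} at query time.
    CONTRASTIVE_COT_TEMPLATE = """\
    Question: A shop sells pens for 2 euros each. How much do 4 pens cost?
    Correct explanation: 4 pens at 2 euros each cost 4 * 2 = 8 euros. The answer is 8.
    Wrong explanation: Adding the numbers gives 4 + 2 = 6 euros. The answer is 6.

    Question: {question}
    Correct explanation:"""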

5. Experimental Topic: Limitations of LLMs

  • Berglund, Lukas, et al. “The Reversal Curse: LLMs Trained on ‘A Is B’ Fail to Learn ‘B Is A.’” arXiv, September 22, 2023.
  • Kaddour, Jean, et al. “Challenges and Applications of Large Language Models.” arXiv, July 19, 2023. https://doi.org/10.48550/arXiv.2307.10169.

6. Literature Topic: LLM Self-Evaluation during Fine-tuning

  • Deutsch, Daniel, et al. “On the Limitations of Reference-Free Evaluations of Generated Text.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 10960–77.
  • Ouyang, Long, et al. “Training Language Models to Follow Instructions with Human Feedback.” arXiv, March 4, 2022.
  • Rafailov, Rafael, et al. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv, December 13, 2023.

Evaluation

7. Literature Topic: LLMs as Evaluation Metrics

  • Kocmi, Tom, et al. “Large Language Models Are State-of-the-Art Evaluators of Translation Quality.” arXiv, May 31, 2023.
  • Leiter, Christoph, et al. “The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics.” arXiv, October 30, 2023.
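
As a starting point, LLM-based evaluation in the style of direct assessment can be sketched as a single scoring prompt whose numeric reply is parsed out. The sketch below is loosely modelled on the setup described by Kocmi et al., with `sample_completion` again standing in as a placeholder for your LLM API.

    import re

    EVAL_PROMPT = (
        "Score the following translation from {src_lang} to {tgt_lang} on a scale "
        "from 0 (no meaning preserved) to 100 (perfect meaning and grammar).\n"
        "Source: {source}\nTranslation: {translation}\nScore:"
    )

    def llm_score(source, translation, sample_completion,
                  src_lang="German", tgt_lang="English"):
        """Ask an LLM to rate a translation and parse the numeric score from its reply."""
        reply = sample_completion(EVAL_PROMPT.format(
            src_lang=src_lang, tgt_lang=tgt_lang,
            source=source, translation=translation))
        match = re.search(r"\d+", reply)
        return int(match.group()) if match else None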

8. Literature Topic: Can LLMs Evaluate Themselves?

  • Deutsch, Daniel, et al. “On the Limitations of Reference-Free Evaluations of Generated Text.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 10960–77.
  • Ouyang, Long, et al. “Training Language Models to Follow Instructions with Human Feedback.” arXiv, March 4, 2022.
  • Rafailov, Rafael, et al. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv, December 13, 2023.

9. Experimental Topic: LLMs with Tools as Evaluation Metrics

  • Fernandes, Patrick, et al. “The Devil Is in the Errors: Leveraging Large Language Models for Fine-Grained Machine Translation Evaluation.” arXiv, August 14, 2023.
  • Kocmi, Tom, et al. “GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4.” arXiv, October 21, 2023.
  • Shu, Lei, et al. “Fusion-Eval: Integrating Evaluators with LLMs.” arXiv, November 15, 2023.

10. Literature Topic: Task Contamination

  • Li, Changmao, et al. “Task Contamination: Language Models May Not Be Few-Shot Anymore.” arXiv preprint arXiv:2312.16337 (2023).
  • Roberts, Manley, et al. “Data Contamination Through the Lens of Time.” arXiv preprint arXiv:2310.10628 (2023).
  • Jiang, et al.: Investigating Data Contamination for Pre-training Language Models. arXiv preprint arXiv:2401.06059 (2024).

11. Literature Topic: Evaluation of Code Writing Ability of LLMs

  • Chen, Mark, et al. “Evaluating large language models trained on code.” arXiv preprint arXiv:2107.03374 (2021).
  • Le, Triet HM, et al. “Deep learning for source code modeling and generation: Models, applications, and challenges.” ACM Computing Surveys (CSUR) 53.3 (2020): 1–38.
  • https://paperswithcode.com/task/code-generation

12. Experimental Topic: Evaluation Benchmark for Scientific Text Generation Models

  • Belouadi, Jonas, et al. “AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ.” arXiv, January 23, 2024.
  • Zerroug, Aimen, et al. “A Benchmark for Compositional Visual Reasoning.” Advances in Neural Information Processing Systems 35 (December 6, 2022): 29776–88.

Applications

13. Experimental Topic: WebAPI Query Planning Using LLMs

  • Chen, Zui, et al. “Symphony: Towards natural language query answering over multi-modal data lakes.” Conference on Innovative Data Systems Research, CIDR. 2023.
  • Urban, Matthias, et al. “CAESURA: Language Models as Multi-Modal Query Planners.” arXiv preprint arXiv:2308.03424 (2023).
  • Wang, et al.: A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432 (2023)
  • https://gorilla.cs.berkeley.edu/

14. Experimental Topic: Attribute Value Normalization Using LLMs

  • Jaimovitch-López, Gonzalo, et al. “Can language models automate data wrangling?.” Machine Learning 112.6 (2023): 2053–2082.
  • Bogatu, Alex, et al. “Towards automatic data format transformations: Data wrangling at scale.” Data Analytics: 31st British International Conference on Databases (BICOD 2017), 2017.

15. Experimental Topic: LLM for Literary Translation and Evaluation

  • Fonteyne, Margot, et al. “Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level.” In Proceedings of the Twelfth Language Resources and Evaluation Conference, 3790–98. Marseille, France, 2020.
  • Karpinska, Marzena, et al. “Large Language Models Effectively Leverage Document-Level Context for Literary Translation, but Critical Errors Persist.” arXiv, May 22, 2023.
  • Wang, Longyue, et al. “Document-Level Machine Translation with Large Language Models.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 16646–61. Singapore, 2023.

16. Experimental Topic: LLMs for Synthetic Training Data Generation

  • Piedboeuf, Frédéric, et al. “Is ChatGPT the Ultimate Data Augmentation Algorithm?” In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
  • Pal, Koyena, et al. “Generative Benchmark Creation for Table Union Search.” arXiv, August 7, 2023.

17. Experimental Topic: LLM-based Agents / OpenAI Assistants

18. Experimental Topic: Agent Cooperation

  • Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 2023.
  • Zhuge, Mingchen, et al. “Mindstorms in Natural Language-Based Societies of Mind.” arXiv preprint arXiv:2305.17066 (2023).
  • Suzgun and Kalai: Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. arXiv preprint arXiv:2401.12954 (2024).
  • Wang, et al.: A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432 (2023)
  • https://www.promptingguide.ai/research/llm-agents

Getting started

The following survey articles and tutorials are good starting points for getting an overview of the seminar topics: