Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab

Jun 2024

Paper Abstract

Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy

@article{2406.11695v1,
  author = {Opsahl-Ong, Krista and Ryan, Michael J and Purtell, Josh and Broman, David and Potts, Christopher and Zaharia, Matei and Khattab, Omar},
  title = {Optimizing Instructions and Demonstrations for Multi-Stage Language
    Model Programs},
  eprint = {2406.11695v1},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  year = {2024},
  month = jun,
  url = {http://arxiv.org/abs/2406.11695v1},
  file = {2406.11695v1.pdf},
  eprintnover = {2406.11695}
}

Three Important Things

1. MIPRO

This paper introduces MIPRO (Multi-prompt Instruction PRoposal Optimizer), an optimizer for multi-stage prompts. Multi-stage here means that solving the task requires a chain of LLM calls, usually because a complex task has been broken down into simpler sub-tasks.

The paper builds on OPRO and generalizes it to multiple stages. It also optimizes both instructions and few-shot examples, whereas OPRO optimizes instructions only.

2. Bayesian Surrogate Model

It was disappointing that the paper was not clear on how MIPRO works exactly, in particular how the Bayesian optimization part fits in.

From my understanding, MIPRO works as follows:

There are many modules to optimize (the stages of the LLM pipeline).
Each module has instructions and demonstrations that can be optimized.
It uses the Tree-structured Parzen Estimator as a surrogate model to decide which modules to optimize, and then proposes a partial assignment of new values for those variables.
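To make the variables in this description concrete, here is a minimal sketch of the search space as I picture it. It assumes each module comes with a fixed pool of candidate instructions and candidate demonstration sets. The `Module` class, the variable names, and the example prompts are all made up for illustration and are not taken from the paper or from DSPy.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Module:
    """One stage of the pipeline, with hypothetical candidate prompt parts."""
    name: str
    instructions: list[str]      # candidate free-form instructions
    demo_sets: list[list[str]]   # candidate sets of few-shot demonstrations


def variables(program):
    """Map each (module, prompt part) variable to its number of choices."""
    space = {}
    for m in program:
        space[f"{m.name}.instruction"] = len(m.instructions)
        space[f"{m.name}.demos"] = len(m.demo_sets)
    return space


def all_assignments(program):
    """Yield every full assignment: one chosen index per variable."""
    space = variables(program)
    names = list(space)
    for idx in product(*(range(space[n]) for n in names)):
        yield dict(zip(names, idx))


# Made-up two-stage program for illustration only.
program = [
    Module("generate_query",
           ["Write a search query.", "Write a query for the missing fact."],
           [[], ["demo A"], ["demo A", "demo B"]]),
    Module("answer",
           ["Answer the question.", "Answer using only the retrieved context."],
           [[], ["demo C"], ["demo C", "demo D"]]),
]

n_configs = sum(1 for _ in all_assignments(program))
print(n_configs)  # (2 * 3) ** 2 = 36
```

Even in this toy case the number of configurations grows multiplicatively with the number of modules and candidates, and each configuration would need a full run of the pipeline to score. That is why evaluating all of them is impractical.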

Why the surrogate model? In Bayesian optimization, the objective is assumed to be expensive to evaluate, and therefore we make use of a cheaper surrogate model which is a proxy of the objective function. The surrogate model thus helps to decide which point to evaluate next.
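The following sketch shows how a TPE-style surrogate picks the next point over categorical prompt choices. It is my own simplification and not the paper's algorithm: it assigns every variable at each trial, scores configurations with a made-up noisy `toy_score`, and replaces the Parzen estimator with smoothed counts per variable. The search space, the parameter `gamma`, and all function names are assumptions for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical search space: one categorical variable per (module, prompt part).
SPACE = {
    "retrieve.instruction": [0, 1, 2],
    "retrieve.demos": [0, 1, 2],
    "answer.instruction": [0, 1, 2],
    "answer.demos": [0, 1, 2],
}


def toy_score(cfg, rng):
    """Stand-in for an expensive, noisy program evaluation (made up)."""
    best = {"retrieve.instruction": 2, "retrieve.demos": 1,
            "answer.instruction": 0, "answer.demos": 2}
    hits = sum(cfg[k] == v for k, v in best.items())
    return hits / len(best) + rng.gauss(0, 0.05)


def density(values, choices):
    """Smoothed categorical density, the discrete analogue of a Parzen estimator."""
    counts = Counter(values)
    total = len(values) + len(choices)  # add-one smoothing
    return {c: (counts[c] + 1) / total for c in choices}


def tpe_suggest(history, rng, gamma=0.25, n_candidates=24):
    """Sample candidates from the 'good' density l and keep the one maximizing l/g."""
    ranked = sorted(history, key=lambda h: h[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good, bad = ranked[:n_good], ranked[n_good:]
    l = {k: density([c[k] for c, _ in good], ch) for k, ch in SPACE.items()}
    g = {k: density([c[k] for c, _ in bad], ch) for k, ch in SPACE.items()}

    def sample_from(dist):
        choices, weights = zip(*dist.items())
        return rng.choices(choices, weights=weights)[0]

    def log_ratio(c):
        return sum(math.log(l[k][c[k]] / g[k][c[k]]) for k in SPACE)

    candidates = [{k: sample_from(l[k]) for k in SPACE} for _ in range(n_candidates)]
    return max(candidates, key=log_ratio)


def optimize(n_trials=60, n_startup=10, seed=0):
    """Random warm-up trials, then surrogate-guided trials; returns the best (cfg, score)."""
    rng = random.Random(seed)
    history = []
    for t in range(n_trials):
        if t < n_startup:
            cfg = {k: rng.choice(ch) for k, ch in SPACE.items()}
        else:
            cfg = tpe_suggest(history, rng)
        history.append((cfg, toy_score(cfg, rng)))
    return max(history, key=lambda h: h[1])


best_cfg, best_score = optimize()
print(best_cfg, round(best_score, 3))
```

The only expensive call is `toy_score`, made once per trial. Choosing where to call it next costs nothing more than counting over past trials, which is the sense in which the surrogate is "cheaper" than the objective.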

However, since this is only a partial assignment, it was unclear how the assignments of the other modules were decided.

3. Results

They had a few variations of MIPRO, but the most important distinction is whether they optimize instructions only (0-shot), demonstrations only (few-shot), or both.

The paper's results table shows that on every task except HotPotQA Cond, optimizing demonstrations only outperforms optimizing instructions only, sometimes by a significant margin.

The only exception was HotPotQA Cond, a dataset the authors derived from HotPotQA with more complicated conditional logic, which was hard to elicit fully with examples alone.

Most Glaring Deficiency

The most interesting part of the paper was not well explained. I am not an expert in Bayesian optimization, and I was unable to confidently understand how their optimization framework works, even after re-reading the paper several times and going through the description of the algorithm in the appendix. An example run-through would have been helpful.

Conclusions for Future Work

Improving few-shot examples is more helpful for performance when the instructions are not too complicated.

When the instructions are complicated, optimizing the instructions themselves is more helpful.