Repository-Level Prompt Generation for Large Language Models of Code

With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using prompt proposals. The prompt proposals take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn’t require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our prompt proposals gives a remarkably high relative improvement of 36% over Codex, showing the quality of these proposals. Further, we show that when we train a model to predict a prompt proposal, we can achieve significant performance gains over Codex and other baselines. We release our code, data, and trained checkpoints at: https://github.com/shrivastavadisha/repo_level_prompt_generation.

Three Important Things

1. Repo-Level Prompt Generator

A limitation of code LLMs like Codex is that they perform poorly on large codebases, because not all of the code can fit into the context window.

The paper introduces the Repo-Level Prompt Generator (RLPG), a way of retrieving relevant context for the prompt (e.g. other related files or function definitions) from elsewhere in the repository. The hope is that a prompt carrying more useful context leads to better generated code. The resulting prompt is then fed into a Codex model.

In this paper, they call the place to infill generated code a “hole”.

RLPG works by generating prompt proposals, each of which is a combination of a prompt source and a prompt context type.

The prompt source identifies which file to get the relevant context from, with the following categories:

Current file (with respect to the hole)
Parent class
Import files
Sibling files
Files with similar names
Child class
Import of parent class
Import of sibling files
Import of files with similar names
Import of child class

Given a prompt source, the prompt context type determines what to extract from the prompt source, with the following categories:

Post Lines (all lines in the current file from the end of the hole - this only applies to current file)
Identifiers
Type identifiers
Field declarations
String literals
Method names
Method names and bodies
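
To make the "source plus context type" idea concrete, here is a minimal sketch that enumerates the pairs from the two lists above. The identifier names are my own, and the only restriction I apply is the one stated above (Post Lines applies to the current file only), so the exact set used in the paper may differ from what this produces.

```python
from itertools import product

# Hypothetical identifiers for the ten prompt sources listed above.
PROMPT_SOURCES = [
    "current_file", "parent_class", "import_files", "sibling_files",
    "similar_name_files", "child_class", "import_of_parent_class",
    "import_of_sibling_files", "import_of_similar_name_files",
    "import_of_child_class",
]

# Hypothetical identifiers for the seven prompt context types listed above.
CONTEXT_TYPES = [
    "post_lines", "identifiers", "type_identifiers", "field_declarations",
    "string_literals", "method_names", "method_names_and_bodies",
]


def enumerate_prompt_proposals():
    """Pair every prompt source with every context type it supports."""
    proposals = []
    for source, context_type in product(PROMPT_SOURCES, CONTEXT_TYPES):
        # Post Lines is defined relative to the hole, so it only makes
        # sense for the current file.
        if context_type == "post_lines" and source != "current_file":
            continue
        proposals.append((source, context_type))
    return proposals


proposals = enumerate_prompt_proposals()
print(len(proposals), "prompt proposals under this simplified rule")
```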

An overview of the RLPG framework is given below (I personally found this diagram more confusing than helpful, so skip it if you wish):

2. Prompt Proposal Classifier (PPC)

While the RLPG generates many prompt proposals, it would be expensive to query Codex with every one of them to see which works.

Instead, the authors introduce a Prompt Proposal Classifier (PPC), which, given a hole position, predicts which prompt proposal has the highest probability of success. Here success means that Codex, when fed that prompt, generates exactly the code that was in the hole.
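
The success criterion above can be sketched as a labelling step: for each proposal, build the prompt, ask the model for a completion, and record whether it matches the hole. Everything here is illustrative. `fake_complete` is a stand-in for a black-box LLM call, the example prompts are made up, and stripping surrounding whitespace before comparing is my assumption rather than something the summary specifies.

```python
def build_success_labels(target_line, prompts_by_proposal, complete):
    """Return {proposal: 1 or 0}, where 1 means the completion matches the hole."""
    labels = {}
    for proposal, prompt in prompts_by_proposal.items():
        prediction = complete(prompt)
        # Assumption: compare after trimming surrounding whitespace.
        labels[proposal] = int(prediction.strip() == target_line.strip())
    return labels


def fake_complete(prompt):
    """Hypothetical stand-in for a black-box code LLM."""
    if "width" in prompt:
        return "return width * height;"
    return "return 0;"


# Made-up prompts for two proposals targeting the same hole.
prompts = {
    ("current_file", "field_declarations"): "int width; int height;\nint area() {",
    ("sibling_files", "method_names"): "perimeter, scale\nint area() {",
}

labels = build_success_labels("return width * height;", prompts, fake_complete)
print(labels)
```

A classifier trained on labels like these only needs the model's outputs, which is consistent with the black-box setting the abstract describes.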

They introduce two variants of the PPC: RLPG-H and RLPG-R (it was unclear to me what H and R stand for here).

RLPG-H

In RLPG-H, they take the two lines before and the two lines after the hole, feed them into the encoder \(F_\phi\), and take the encoded [CLS] token as a representation of the surrounding context. This is then passed through a two-layer neural network with a final sigmoid activation to produce one prediction per prompt proposal:

\[\hat{y}_p^h =\operatorname{sigmoid}\left(W^2\left(\operatorname{relu}\left(W^1\left(F_\phi\left(H^h\right)\right)+b^1\right)\right)+b^2\right)\]

The idea is that the encoded representation of the hole context may contain features that are useful for predicting which prompt proposal will work best.
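
As a rough illustration of the RLPG-H formula, here is the two-layer head written in plain Python. The encoder \(F_\phi\) is not reproduced: the [CLS] embedding is just a random vector, the weights are random, and the dimensions (8, 4, and 3 proposals) are arbitrary choices of mine.

```python
import math
import random


def affine(W, x, b):
    """Compute W x + b for a matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def rlpg_h_head(cls_embedding, W1, b1, W2, b2):
    """sigmoid(W2 relu(W1 F(H) + b1) + b2): one score per prompt proposal."""
    hidden = relu(affine(W1, cls_embedding, b1))
    return sigmoid(affine(W2, hidden, b2))


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


embed_dim, hidden_dim, num_proposals = 8, 4, 3  # arbitrary toy sizes
cls_embedding = [rng.uniform(-1, 1) for _ in range(embed_dim)]  # stand-in for F(H)
W1, b1 = random_matrix(hidden_dim, embed_dim), [0.0] * hidden_dim
W2, b2 = random_matrix(num_proposals, hidden_dim), [0.0] * num_proposals

scores = rlpg_h_head(cls_embedding, W1, b1, W2, b2)
best = max(range(num_proposals), key=lambda i: scores[i])
print(scores, "-> pick proposal", best)
```

Because each output passes through its own sigmoid, the scores are independent success probabilities rather than a distribution over proposals, which fits the fact that several proposals can succeed for the same hole.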

RLPG-R

RLPG-R goes a little further than RLPG-H. On top of obtaining the [CLS] token as the representation of the hole context \(H^h\), it takes advantage of the attention mechanism by treating that representation as the query and encoding the prompt context \(C_p^h\) produced by each proposal as the keys and values:

\[Q^h=F_\phi\left(H^h\right), \quad K_p^h=F_\phi\left(C_p^h\right), \quad V_p^h=F_\phi\left(C_p^h\right)\]

They then apply a multi-head attention module to these, and pass the result through a similar two-layer neural network:

\[\hat{y}_p^h =\operatorname{sigmoid}\left(W_p G\left(\text { MultiHead }\left(Q^h, K_p^h, V_p^h\right)\right)+b_p\right)\]
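
To show the shape of the RLPG-R computation, here is a simplified sketch that swaps the multi-head module for single-head scaled dot-product attention and leaves out the intermediate network \(G\). The vectors are toy values standing in for encoder outputs, and `w_p`, `b_p` are hypothetical parameters for one proposal.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def score_proposal(query, context_vectors, w_p, b_p):
    """sigmoid(w_p . Attention(Q, K, V) + b_p), with K = V = encoded prompt context."""
    attended, _ = attend(query, context_vectors, context_vectors)
    logit = dot(w_p, attended) + b_p
    return 1.0 / (1.0 + math.exp(-logit))


query = [1.0, 0.0]                      # stand-in for the encoded hole context
context_vectors = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the encoded prompt context
_, weights = attend(query, context_vectors, context_vectors)
score = score_proposal(query, context_vectors, w_p=[0.5, -0.5], b_p=0.0)
print(weights, score)
```

Unlike the RLPG-H head, this score depends on the contents of the prompt context itself, so it has to be computed once per proposal.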

We might think this may perform better, since it incorporates the attention mechanism between the prompt and the hole context (and RLPG-H does not even have access to the prompt contents), but surprisingly RLPG-H performs slightly better on evaluation:

3. Performance of Prompt Proposals

With respect to the previous figure, we also notice that simply using any prompt proposal performs better than default Codex. In fact, the authors noted that many of the prompts look like a mess of mashed-up contexts, and still performed better than default Codex. This suggests that adding some repository context is worthwhile even when it is not carefully ordered.

It is also interesting to note which prompt proposals had the highest success rates:

Unsurprisingly, having additional context from the current file helped the most. This was then followed by sibling files, and then files with similar names, and then the imports.

Most Glaring Deficiency

The prompt proposals are not mutually exclusive, and it seems likely that they could be combined to achieve even better results, especially when a single proposal does not exhaust the context window. This direction was not explored in the paper. The authors leave it to future work, although they state that they have promising initial results.

For instance, we might expect that combining the relevant headers from an imported library with the relevant function definitions and bodies from a sibling file would perform even better, since this is the kind of information a developer would want when writing the same line.
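
One naive way to combine proposals, which is my own illustration and not something from the paper, is to rank the candidate contexts by their predicted scores and greedily pack them into whatever token budget remains. The scores and snippets below are made up, and counting tokens by whitespace splitting is a crude assumption.

```python
def combine_contexts(scored_contexts, budget_tokens):
    """Greedily pack the highest-scoring contexts that still fit the budget.

    scored_contexts: list of (predicted_score, context_text) pairs.
    Returns the combined context text and the number of tokens used.
    """
    chosen = []
    used = 0
    for _, text in sorted(scored_contexts, key=lambda pair: pair[0], reverse=True):
        cost = len(text.split())  # assumption: whitespace tokens as a rough proxy
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return "\n".join(chosen), used


# Made-up scores and contexts for three proposals.
candidates = [
    (0.9, "int width; int height;"),
    (0.7, "double scale(double factor) { return factor * width; }"),
    (0.4, "perimeter area scale"),
]

combined, used = combine_contexts(candidates, budget_tokens=8)
print(used, "tokens used")
print(combined)
```

A real system would need the model's own tokenizer and would have to decide how to order the packed contexts, which is exactly the part the paper leaves open.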

Conclusions for Future Work

When there is more information available than fits in the context window, prompt generation becomes key to selecting the information that is most helpful to the LLM for the task at hand. We can use techniques similar to those in this paper to augment our prompts with additional information, and note that even simple methods like randomly choosing additional context could already improve performance.