OpenAI's techniques for improving reliability
date
Apr 16, 2023
slug
openai_improve_stable_techs
status
Published
tags
openai
summary
If you give GPT-3 a task that is too complex to complete in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, much as with humans, that doesn't necessarily mean the model is incapable of the task; with some time and space to reason things out, it may still answer reliably. Split complex tasks into simpler subtasks; structure the instruction to keep the model on task; prompt the model to explain before answering; ask for justifications of many possible answers and then synthesize; generate many outputs and use the model to pick the best one; fine-tune custom models to maximize performance. Although these techniques vary in approach, they share the goal of improving reliability on complex tasks, mainly by decomposing unreliable operations into smaller, more reliable operations (e.g., selection-inference prompting) and by using multiple steps or multiple relationships to make the system's reliability greater than that of any individual component (e.g., maieutic prompting).
type
Post
When GPT-3 fails on a task, what should you do?
- Search for a better prompt that elicits more reliable answers?
- Invest in thousands of examples to fine-tune a custom model?
- Assume the model is incapable of the task, and move on?
There is no simple answer - it depends. However, if your task involves logical reasoning or complexity, consider trying the techniques in this article to build more reliable, high-performing prompts.
Why GPT-3 fails on complex tasks
If you were asked to multiply 13 by 17, would the answer pop immediately into your mind? For most of us, probably not. Yet, that doesn't mean humans are incapable of two-digit multiplication. With a few seconds, and some pen and paper, it's not too taxing to work out that 13 x 17 = 130 + 70 + 21 = 221.
Similarly, if you give GPT-3 a task that's too complex to do in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, akin to humans, that doesn't necessarily mean the model is incapable of the task. With some time and space to reason things out, the model still may be able to answer reliably.
As an example, if you ask text-davinci-002 the following math problem about juggling balls, it answers incorrectly:
Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A:
There are 8 blue golf balls.
Does this mean that GPT-3 cannot do simple math problems? No; in fact, it turns out that by prompting the model with Let's think step by step, the model solves the problem reliably:
Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.
There are 16 balls in total.
Half of the balls are golf balls.
That means that there are 8 golf balls.
Half of the golf balls are blue.
That means that there are 4 blue golf balls.
Of course, it's hard to tell from only a single example whether this Let's think step by step trick actually works in general or just got lucky on this particular problem. But it really does work. On a benchmark of word math problems, the Let's think step by step trick raised GPT-3's solve rate massively, from a worthless 18% to a decent 79%!
Model capabilities depend on context
When learning to work with GPT-3, one common conceptual mistake is to believe that its capabilities are fixed across all contexts. E.g., if GPT-3 gets a simple logic question wrong, then it must be incapable of simple logic.
But as the Let's think step by step example illustrates, apparent failures of GPT-3 can sometimes be remedied with a better prompt that helps the model steer itself toward the correct output.
How to improve reliability on complex tasks
The rest of this article shares techniques for improving the reliability of large language models on complex tasks. Although some of the techniques are specific to certain types of problems, many of them are built upon general principles that can be applied to a wide range of tasks, e.g.:
- Give clearer instructions
- Split complex tasks into simpler subtasks
- Structure the instruction to keep the model on task
- Prompt the model to explain before answering
- Ask for justifications of many possible answers, and then synthesize
- Generate many outputs, and then use the model to pick the best one
- Fine-tune custom models to maximize performance
Split complex tasks into simpler tasks
One way to give a model more time and space to think is to break tasks into simpler pieces.
As an example, consider a task where we ask the model a multiple-choice question about some text - in this case, a game of Clue. When asked directly, text-davinci-002 isn't able to put clues 3 & 5 together, and answers incorrectly:
Use the following clues to answer the following multiple-choice question.
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
Solution:
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into a correct answer of (a) Yes.
However, instead of asking for the answer directly, we can split the task into three pieces:
- First, go through the clues one by one and consider whether the clue is potentially relevant
- Second, combine the relevant clues to reason out the answer to the question
- Third, write the final answer: either (a), (b), or (c)
Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:
Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.
(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.
(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick
By giving the model more time and space to think, and guiding it along a reasoning plan, it's able to figure out the correct answer of (a) Yes.
Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask.
For example, if we ask text-davinci-002 to summarize a text in its original language, the model can lapse back into English:
Summarize the text using the original language of the text. The summary should be one sentence long.
Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""
Summary:
The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability.
However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable:
First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long.
Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""
Language: Spanish
La estadística es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad.
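This split can also be made explicit in code, with one API call per subtask and the language detected in the first step fed into the summarization prompt of the second. Below is a minimal sketch assuming the legacy (pre-1.0) openai Python package, text-davinci-002, and an OPENAI_API_KEY in the environment; the helper names and prompt wording are illustrative rather than taken from the article.

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def complete(prompt: str) -> str:
    """One deterministic completion call."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()


def summarize_in_original_language(text: str) -> str:
    # Subtask 1: identify the language (a small, reliable step).
    language = complete(
        "What language is the following text written in? "
        "Answer with the name of the language only.\n\n"
        f'Text:\n"""\n{text}\n"""\n\nLanguage:'
    )
    # Subtask 2: summarize, explicitly naming the language found in step 1.
    return complete(
        f"Summarize the following text in {language}, in one sentence.\n\n"
        f'Text:\n"""\n{text}\n"""\n\nSummary:'
    )
```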
Prompt the model to explain before answering
Another powerful technique for improving the reliability of answers is to prompt the model to gradually reason out the answer rather than jumping immediately to the final answer. By 'thinking aloud' the model can be far more likely to arrive at the correct answer.
Zero-shot
Method
Published by Takeshi Kojima et al. in 2022, the easiest way to prompt a model to reason out the answer is to simply prepend answers with Let's think step by step. Figure 2 illustrates an example:
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022)
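In API terms, the trick is nothing more than appending the trigger phrase to the prompt. A minimal sketch, assuming the legacy (pre-1.0) openai Python package and an OPENAI_API_KEY in the environment:

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment

question = (
    "Q: A juggler has 16 balls. Half of the balls are golf balls "
    "and half of the golf balls are blue. How many blue golf balls are there?\n"
)

# The only change from a plain prompt is the appended trigger phrase.
prompt = question + "A: Let's think step by step."

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=256,
    temperature=0,
)
print(prompt + response["choices"][0]["text"])
```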

Results
Applying this simple trick to the MultiArith math dataset, the authors found Let's think step by step quadrupled the accuracy, from 18% to 79%!
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022)

Implications
Although the Let's think step by step trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either.
Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022)

To learn more, read the full paper.
If you apply this technique to your own tasks, don't be afraid to experiment with customizing the instruction. Let's think step by step is rather generic, so you may find better performance with instructions that hew to a stricter format customized to your use case. For example, you can try more structured variants like First, think step by step about why X might be true. Second, think step by step about why Y might be true. Third, think step by step about whether X or Y makes more sense. And you can even give the model an example format to help keep it on track, e.g.:
Using the IRS guidance below, answer the following questions using this format:
(1) For each criterion, determine whether it is met by the vehicle purchase
- {Criterion} Let's think step by step. {explanation} {yes or no, or if the question does not apply then N/A}.
(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."
IRS guidance:
"""
You may be eligible for a federal tax credit under Section 30D if you purchased a car or truck that meets the following criteria:
- Does the vehicle have at least four wheels?
- Does the vehicle weigh less than 14,000 pounds?
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source?
- Was the vehicle purchased in a year before 2022?
- If so, has the manufacturer sold less than 200,000 qualifying vehicles? (Tesla and GM have sold more than 200,000 qualifying vehicles.)
- Was the vehicle purchased in a year after 2022?
- If so, is the vehicle present in the following list of North American-assembled vehicles? (The only electric vehicles assembled in North America are the Audi Q5, BMW 330e, BMW X5, Chevrolet Bolt EUV, Chevrolet Bolt EV, Chrysler Pacifica PHEV, Ford Escape PHEV, Ford F Series, Ford Mustang MACH E, Ford Transit Van, GMC Hummer Pickup, GMC Hummer SUV, Jeep Grand Cherokee PHEV, Jeep Wrangler PHEV, Lincoln Aviator PHEV, Lincoln Corsair Plug-in, Lucid Air, Nissan Leaf, Rivian EDV, Rivian R1S, Rivian R1T, Tesla Model 3, Tesla Model S, Tesla Model X, Tesla Model Y, Volvo S60, BMW 330e, Bolt EV, Cadillac Lyriq, Mercedes EQS SUV, and Nissan Leaf.)
"""
Question: Can I claim a federal tax credit for my Toyota Prius Prime bought in 2021?
Solution:
(1) For each criterion, determine whether it is met by the vehicle purchase
- Does the vehicle have at least four wheels? Let's think step by step. The Toyota Prius Prime has four wheels, so the answer is yes.
- Does the vehicle weigh less than 14,000 pounds? Let's think step by step. The Toyota Prius Prime weighs less than 14,000 pounds, so the answer is yes.
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source? Let's think step by step. The Toyota Prius Prime has a battery with at least 4 kilowatt hours that may be recharged from an external source, so the answer is yes.
- Was the vehicle purchased in a year before 2022? Let's think step by step. The Toyota Prius Prime was purchased in 2021, which is before 2022, so the answer is yes.
- Was the vehicle purchased in a year after 2022? N/A
- If so, is the vehicle present in the following list of North American-assembled vehicles? N/A
(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."
Because the Toyota Prius Prime meets all of the criteria for a federal tax credit, the answer is likely yes.
Few-shot examples
Method
Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ('few-shot'), as studied by Jason Wei and Denny Zhou et al. from Google. Here's an example few-shot chain-of-thought prompt:
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

More demonstrations of reasoning chains written by human labelers:
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)
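In code, a few-shot chain-of-thought prompt is simply the worked exemplars concatenated ahead of the new question. The sketch below assumes the legacy (pre-1.0) openai Python package; the single exemplar is an illustration in the spirit of the paper's demonstrations, and real use would include several:

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment

# One hand-written exemplar with its reasoning chain; real prompts would include several.
EXEMPLARS = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
"""


def few_shot_cot(question: str) -> str:
    prompt = EXEMPLARS + f"\nQ: {question}\nA:"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
        stop=["\nQ:"],  # stop before the model starts inventing a new question
    )
    return response["choices"][0]["text"].strip()
```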

Results
Testing on grade school math problems, the authors found that chain of thought prompting tripled the solve rate, from 18% to 57%.
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

In addition to math problems, chain of thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (fewer than 8 or so).
Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

To learn more, read the full paper.
Implications
One advantage of the few-shot example-based approach relative to the Let's think step by step technique is that you can more easily specify the format, length, and style of reasoning that you want the model to perform before landing on its final answer. This can be particularly helpful in cases where the model isn't initially reasoning in the right way or depth.
Fine-tuned
Method
In general, to eke out maximum performance on a task, you'll need to fine-tune a custom model. However, fine-tuning a model using explanations may take thousands of example explanations, which are costly to write.
In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and only keep the explanations that produce the correct answer. Then, to get additional explanations for some of the incorrect answers, retry the few-shot prompt but with correct answers given as part of the question. The authors called their procedure STaR (Self-taught Reasoner):
Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations.
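A rough sketch of the bootstrapping loop is below, assuming the legacy (pre-1.0) openai Python package. The few-shot exemplars, the crude substring check for answer correctness, and the prompt/completion JSONL format are all simplifications of the paper's actual setup:

```python
import json

import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = "..."  # a few worked question -> explanation -> answer exemplars (omitted here)


def generate_explanation(question: str, hint: str = "") -> str:
    """Ask for a step-by-step explanation; the hint is used for rationalization retries."""
    prompt = f"{FEW_SHOT}\nQ: {question}\n{hint}A: Let's think step by step."
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=256, temperature=0.7
    )
    return response["choices"][0]["text"].strip()


def build_star_dataset(problems, out_path="star_finetune.jsonl"):
    """problems: iterable of (question, correct_answer) string pairs."""
    with open(out_path, "w") as f:
        for question, answer in problems:
            explanation = generate_explanation(question)
            if answer not in explanation:  # crude check; real use should parse the final answer
                # Rationalization: retry with the correct answer given as part of the question.
                explanation = generate_explanation(
                    question, hint=f"(The correct answer is {answer}.)\n"
                )
                if answer not in explanation:
                    continue  # skip examples the model cannot explain even with the answer
            f.write(json.dumps({"prompt": f"Q: {question}\nA:", "completion": " " + explanation}) + "\n")
```

The resulting JSONL file can then be used as the training data for a fine-tune.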
Results
When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%):
Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

To learn more, read the full paper.
Implications
Using a few-shot prompt to extend or modify a fine-tuning dataset is an idea that can be generalized beyond explanation writing. For example, if you have large quantities of unstructured text that you want to train on, you may find opportunities to use a prompt to extract a structured dataset from your unstructured text, and then fine-tune a custom model on that structured dataset.
Extensions to chain-of-thought prompting
A number of extensions of chain-of-thought prompting have been published as well.
Selection-inference prompting
Method
Published by Antonia Creswell et al., one extension of the chain-of-thought technique is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text (the 'selection prompt'). Then, a second prompt infers a conclusion from the selected facts (the 'inference prompt'). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea in the following figure:
Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)
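The alternating loop can be sketched as two prompts called in turn, with each new inference appended to a growing context. The prompt wording below is an illustrative paraphrase rather than the authors' prompts, and the legacy (pre-1.0) openai Python package is assumed:

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def complete(prompt: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=128, temperature=0
    )
    return response["choices"][0]["text"].strip()


def selection_inference(facts: str, question: str, max_steps: int = 3) -> str:
    derived = ""  # inferences accumulated so far
    for _ in range(max_steps):
        # Selection prompt: pick the facts relevant to the next reasoning step.
        selection = complete(
            f"Facts:\n{facts}\nInferences so far:\n{derived or '(none)'}\n"
            f"Question: {question}\n"
            "Select the facts most relevant to answering the question:"
        )
        # Inference prompt: draw one new conclusion from the selected facts.
        inference = complete(
            f"Selected facts:\n{selection}\n"
            "Therefore, one new conclusion that follows is:"
        )
        derived += inference + "\n"
    # Final answer from the accumulated reasoning.
    return complete(f"Facts:\n{facts}\nReasoning:\n{derived}Question: {question}\nAnswer:")
```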

Results
When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbi and Proof Writer benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined selection-inference prompting with fine-tuning.
Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)

Implications
Although the gains on these benchmarks were large, these benchmarks were specifically chosen because they required longer sequences of reasoning. On problems that don't require reasoning with many steps, the gains are likely smaller.
The results highlight a couple of general lessons for working with large language models. One, splitting up complex tasks into smaller tasks is a great way to improve reliability and performance; the more atomic the task, the less room there is for the model to err. Two, getting maximum performance often means combining fine-tuning with whatever approach you've chosen.
To learn more, read the full paper.
Faithful reasoning architecture
A few months after publishing the selection-inference prompting technique, the authors extended the technique in a follow-up paper, with ideas for:
- figuring out when the selection-inference cycle should stop or continue
- adding a value function to help search over multiple reasoning paths
- reducing hallucination of fake facts by fine-tuning a model to reason about sentence labels (e.g., sen1) rather than writing out the sentences themselves
Method
In the original selection-inference technique, specialized 'selection' and 'inference' prompts are alternated to select facts and make inferences from those facts, combining to generate a sequence of reasoning steps.
The authors extend this technique with two additional components.
First, the authors add a 'halter' model that, after each inference step, is asked whether the inferences thus far are sufficient to answer the question. If yes, then the model generates a final answer. (A minimal sketch of this kind of halting check appears after the figures below.)
The halter model brings a couple of advantages:
- it can tell the selection-inference process to stop or keep going, as necessary
- if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess
Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)
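In the paper the halter is a fine-tuned model; as a rough stand-in, the same check can be approximated with a yes/no prompt after each inference step (illustrative wording; legacy (pre-1.0) openai Python package assumed):

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def sufficient_to_answer(question: str, reasoning_so_far: str) -> bool:
    """A 'halter'-style check: after each inference step, ask whether we can stop."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=(
            f"Question: {question}\nReasoning so far:\n{reasoning_so_far}\n"
            "Is this reasoning sufficient to answer the question? Answer Yes or No:"
        ),
        max_tokens=3,
        temperature=0,
    )
    return response["choices"][0]["text"].strip().lower().startswith("yes")
```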

Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability; instead of generating a single answer from the model, generate a set of answers and then use some type of value function / discriminator / verifier model to pick the best one.
Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context.
Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

Results
The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and EntailmentBankQA (shown). The technique increased accuracy substantially, especially on harder reasoning problems.
Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022), https://arxiv.org/abs/2208.14271

In addition, their sentence label manipulation trick essentially eliminated hallucination!
Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022), https://arxiv.org/abs/2208.14271

Implications
This paper illustrates a number of helpful lessons for improving the reliability of large language models:
- Split complex tasks into smaller, more reliable subtasks
- Generate your answer in a step-by-step fashion, evaluating it along the way
- Generate many possible answers and use another model or function to pick the ones that look best
- Reduce hallucination by constraining what the model can say (e.g., by using sentence labels instead of sentences)
- Maximize performance of models by fine-tuning them on specialized tasks
To learn more, read the full paper.
Least-to-most prompting
In addition to doing poorly on long reasoning chains (where selection-inference shines), chain-of-thought prompting can especially struggle when the examples are short but the task is long.
Method
Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like "To solve {question}, we need to first solve:". Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced.
Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)
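A simplified version of the loop, with a fixed number of subproblems rather than a model-decided stopping point, might look like the sketch below (legacy (pre-1.0) openai Python package assumed; prompt wording illustrative):

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def complete(prompt: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=256, temperature=0
    )
    return response["choices"][0]["text"].strip()


def least_to_most(question: str, max_subproblems: int = 3) -> str:
    context = ""
    for _ in range(max_subproblems):
        # Elicit the next subproblem to solve.
        subproblem = complete(f'{context}To solve "{question}", we need to first solve:')
        # Solve it, then fold the solution back into the context.
        solution = complete(f"{context}Q: {subproblem}\nA:")
        context += f"Q: {subproblem}\nA: {solution}\n"
    # Answer the original question given the accumulated subproblem solutions.
    return complete(f"{context}Q: {question}\nA:")
```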

Results
When applied to benchmarks involving long reasoning chains using code-davinci-002 (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%!
Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)



Implications
Although the above gains from least-to-most prompting are impressive, they are measured on a very narrow set of tasks that require long reasoning chains.
Still, they illustrate a common theme: increase reliability by (a) breaking complex tasks into smaller subtasks and (b) giving the model more time and space to work out the answer.
To learn more, read the full paper.
Related ideas
Maieutic prompting
Method
In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct and incorrect), and then analyze their relationships to guess which set is correct. This technique was coined maieutic prompting by Jaehun Jung et al. in May 2022 (maieutic means relating to the Socratic method of asking questions to elicit ideas).
The method is complicated, and works as follows:
- First, build a maieutic tree, where each node is a statement that could be true or false:
    - Start with a multiple-choice question or true/false statement (e.g., War cannot have a tie)
    - For each possible answer to the question, use the model to generate a corresponding explanation (with a prompt like War cannot have a tie? True, because)
    - Then, prompt the model with the question and the generated explanation, and ask it to produce the answer. If reversing the explanation (with a prefix like It is wrong to say that {explanation}) reverses the answer, then the explanation is considered 'logically integral.'
    - If an explanation is not logically integral, then repeat the above process recursively, with each explanation turned into a True or False question, and generate more explanations for each new question.
    - After all of the recursive explaining is done, you end up with a tree of explanations, where each leaf on the tree has the property that reversing the explanation reverses the model's answer.
- Second, convert the tree into a graph of relations:
    - For each node in the tree, calculate the model's relative belief in each node (inferred from the probability of getting an answer of True given an explanation)
    - For each pair of nodes in the tree, use the model to identify whether they are entailed (implied) or contradicted
- Third, find the most consistent set of beliefs and take those to be true:
    - Specifically, using the strength of belief in each node and the logical relationships between them, formulate the problem as a weighted maximum satisfiability problem (MAX-SAT)
    - Use a solver to find the most self-consistent set of beliefs, and take those as true
Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)


Results
Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)

Implications
Beyond the complexity, one limitation of this method is that it appears to only apply to questions that can be posed as multiple-choice.
To learn more, read the full paper.
Extensions
Self-consistency
Method
For tasks with a discrete set of answers, one simple way to improve reliability is to sample multiple explanations & answers from the model (using a positive temperature) and then pick the final answer that appears most often.
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)
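As a sketch, the whole technique is a sampling loop plus a majority vote. The answer extraction below (taking the last number in each completion) is a crude stand-in for proper parsing, and the legacy (pre-1.0) openai Python package is assumed:

```python
import re
from collections import Counter

import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def self_consistent_answer(question: str, n: int = 10) -> str:
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Q: {question}\nA: Let's think step by step.",
        max_tokens=256,
        temperature=0.7,  # positive temperature so the samples differ
        n=n,              # sample n independent reasoning paths in one request
    )
    answers = []
    for choice in response["choices"]:
        # Crude final-answer extraction: take the last number in the completion.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", choice["text"])
        if numbers:
            answers.append(numbers[-1])
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```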

Results
This technique lifted accuracies by anywhere from 1 to 24 percentage points on a suite of math and reasoning benchmarks. (Plotted below are results from Google's LaMDA model; using Google's larger PaLM model, the baselines were higher but the gains were a bit smaller.)
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)

Implications
Although this technique is simple to implement, it can be costly. Generating a set of 10 answers will increase your costs by 10x.
Also, as with many of these techniques, it applies only to tasks with a limited set of answers. For open-ended tasks where each answer is unique (such as writing a poem), it's not obvious what it would mean to pick the most common answer.
Lastly, this technique ought to be most beneficial when there are multiple paths or phrasings to reach an answer; if there's only one path, then the technique may not help at all. An extreme example: if the task were to generate a single-token answer, then taking the most common token from 100 generations would be no different than taking the token with the highest logprobs (which you can get with a single generation at temperature=0).
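For reference, those per-token log probabilities are available directly from the completions endpoint via the logprobs parameter (legacy (pre-1.0) openai Python package shown; prompt illustrative):

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment

response = openai.Completion.create(
    model="text-davinci-002",
    prompt="Q: What is the capital of France?\nA:",
    max_tokens=1,
    temperature=0,
    logprobs=5,  # also return the top 5 candidate tokens and their log probabilities
)
# The most likely first token, plus the alternatives the model considered.
print(response["choices"][0]["text"])
print(response["choices"][0]["logprobs"]["top_logprobs"][0])
```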
Verifiers
Another key technique for improving task performance is to train a verifier or discriminator model to evaluate the outputs of the main generative model. If the discriminator rejects the output, then you can resample the generative model until you get an acceptable output. In many cases, it's easier to judge an answer than it is to create an answer, which helps explain the power of this method.
Method
In 2021, OpenAI researchers applied this technique to grade school math problems, using the following procedure (a rough sketch of the generate-and-score step appears after the figure below):
- First, they fine-tuned a model on questions and solutions
- For each problem in the training set, they generated 100 solutions
- Each of those 100 solutions was automatically labeled as either correct or incorrect, based on whether the final answer was correct
- Using those solutions, with some labeled correct and some labeled incorrect, they fine-tuned a verifier model to classify whether a question and candidate solution was correct or incorrect
- Finally, at test time, the generative model creates 100 solutions to each problem, and the one with the highest score according to the verifier model is picked as the final answer
Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)
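A rough sketch of the test-time generate-and-score step is below. Note that the paper trains a dedicated verifier model; here a zero-shot grading prompt and the log probability of its "Yes" token stand in for that verifier, which is a substantial simplification (legacy (pre-1.0) openai Python package assumed):

```python
import openai  # legacy (pre-1.0) SDK; assumes OPENAI_API_KEY is set in the environment


def best_of_n(question: str, n: int = 10) -> str:
    # Step 1: sample n candidate solutions at a positive temperature.
    candidates = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Q: {question}\nA: Let's think step by step.",
        max_tokens=256,
        temperature=0.7,
        n=n,
    )["choices"]

    # Step 2: score each candidate with a grading prompt (a stand-in for a trained verifier).
    def score(solution: str) -> float:
        graded = openai.Completion.create(
            model="text-davinci-002",
            prompt=(
                f"Q: {question}\nProposed solution:{solution}\n"
                "Is this solution correct? Answer Yes or No:"
            ),
            max_tokens=1,
            temperature=0,
            logprobs=5,
        )
        top = graded["choices"][0]["logprobs"]["top_logprobs"][0]
        # Log probability that the grader's first token is "Yes" (with or without a leading space).
        return max(top.get(" Yes", float("-inf")), top.get("Yes", float("-inf")))

    # Step 3: return the candidate the grader scores highest.
    return max(candidates, key=lambda c: score(c["text"]))["text"]
```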

Results
With a 175B GPT-3 model and 8,000 training examples, this technique substantially lifted grade school math accuracy from ~33% to ~55%.
Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)

Implications
Similar to the self-consistency technique, this method can get expensive, as generating, say, 100 solutions per task will increase your costs by roughly ~100x.
Theories of reliability
Although the techniques above vary in their approach, they all share the goal of improving reliability on complex tasks. Mainly they do this by:
- decomposing unreliable operations into smaller, more reliable operations (e.g., selection-inference prompting)
- using multiple steps or multiple relationships to make the system's reliability greater than any individual component (e.g., maieutic prompting)
Probabilistic graphical models
This paradigm of trying to build a reliable system out of less reliable components is reminiscent of probabilistic programming, and many of the analysis techniques of that field can be applied to this one.
In the paper Language Model Cascades, David Dohan et al. interpret the above techniques in the paradigm of probabilistic graphical models:
Chain of thought prompting

Fine-tuned chain of thought prompting / Self-taught reasoner

Selection-inference prompting

Verifiers

Implications
Although formulating these techniques as probabilistic graphical models may not be immediately useful for solving any particular problem, the framework may be helpful in selecting, combining, and discovering new techniques.
Closing thoughts
Research into large language models is very active and evolving rapidly. Not only do researchers continue to improve the models, they also continue to improve our understanding of how to best employ the models. To underscore the pace of these developments, note that all of the papers shared above were published within the past 12 months (as I write in Sep 2022).
In the future, expect better models and better techniques to be published. Even if the specific techniques here are eclipsed by future best practices, the general principles behind them will likely remain a key part of any expert user's toolkit.
Bibliography
| Lesson | Paper | Date |
| --- | --- | --- |
| Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users) | | 2021 Oct |
| You can improve output by generating many candidates, and then picking the one that looks best | Training Verifiers to Solve Math Word Problems | 2021 Oct |
| On reasoning tasks, models do better when they reason step-by-step before answering | Chain of Thought Prompting Elicits Reasoning in Large Language Models | 2022 Jan |
| You can improve step-by-step reasoning by generating many explanation-answer outputs, and picking the most popular answer | Self-Consistency Improves Chain of Thought Reasoning in Language Models | 2022 Mar |
| If you want to fine-tune a step-by-step reasoner, you can do it with multiple-choice question & answer data alone | STaR: Bootstrapping Reasoning With Reasoning | 2022 Mar |
| The step-by-step reasoning method works great even with zero examples | Large Language Models are Zero-Shot Reasoners | 2022 May |
| You can do better than step-by-step reasoning by alternating a 'selection' prompt and an 'inference' prompt | Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning | 2022 May |
| On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally | Least-to-most Prompting Enables Complex Reasoning in Large Language Models | 2022 May |
| You can have the model analyze both good and bogus explanations to figure out which set of explanations are most consistent | Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations | 2022 May |
| You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components | Language Model Cascades | 2022 Jul |
| You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt | Faithful Reasoning Using Large Language Models | 2022 Aug |