Original paper: https://arxiv.org/abs/2406.14283
I have read two reports from 量子位 (QbitAI), both about improving large language model performance and about Q*, so I went back and carefully reread the paper mentioned in the second report. I once debated this topic with another user in the comment section of @贯一智能科技, and I still firmly believe that AGI is a goal humanity can achieve. Since I have been following the topic of correlation not implying causation, I also agree with Yann LeCun's view that today's AI lacks common sense and the ability to understand the world correctly. On one point, though, I disagree: claiming that whatever an AI generates under a prompt fails to match reality is nitpicking at the dreams of an agent; once that agent wakes up, the situation will be reversed.
Below are some excerpts I consider important (interleaved with a few toy sketches of my own to make the ideas concrete):
On the other hand, solving complex reasoning problems requires more in-depth, deliberative and logical thinking steps, i.e., the “System 2" mode [15]. Taking solving math word problems as an example, any incorrect intermediate reasoning step (e.g., calculation errors, mis-interpretations) can potentially lead to incorrect final answers. Prior attempts [25, 26, 27, 28] for enhancing “System 2” reasoning capability includes performing deliberation with basic tree search algorithms (e.g., BFS or DFS), Monte Carlo Tree Search (MCTS) [29], and A* [30]. Nonetheless, the utility functions used in these methods often require laborious expertise to design for each specific task, which are difficult to be extended to new scenarios. Furthermore, deliberation with MCTS would require significant number of rollouts before finding high-quality responses when solving the problems with many reasoning steps, which substantially slows down the overall decoding process.
In my previous article I argued that MCTSr might strengthen LLMs' logical reasoning; it now appears that, in the eyes of researchers in the field, there are even more efficient search algorithms for this problem.
In light of this, we propose Q*, a general, versatile and agile framework for improving the multi-step reasoning capability of LLMs with deliberative planning. Different from the existing deliberation methods, our method does not rely on domain knowledge to design the heuristic function. Besides, by leveraging plug-and-play Q-value models as heuristic function, our Q* can effectively solve various tasks via guiding LLMs to select the most promising next step without fine-tuning LLMs beforehand, which avoids the significant computational overhead and potential risk of performance degeneration in other tasks. Finally, Q* considers only one single step when performing deliberation, which is much cheaper than completing rollouts in MCTS. Specifically, the main contributions of our work are summarized as follows:
• We formalize the multi-step reasoning of LLMs as a Markov Decision Process (MDP) where the state is the concatenation of input prompt and the reasoning steps generated so far, the action is the next reasoning step and the reward measures how well the task is solved.
• We present several general approaches to estimate the optimal Q-value of state-action pairs, i.e., offline reinforcement learning, the best sequence from rollouts, and completion with stronger LLMs. It is noteworthy that our methods only need the ground-truth of training problems and can be easily applied to various reasoning tasks without modification.
• We cast solving multi-step reasoning tasks as a heuristic search problem, where the objective is to find the most proper reasoning trace with maximum utility. Built upon A* search, our deliberation framework, Q*, leverages plug-and-play Q-value models as heuristic function and guides LLMs to select the most promising next reasoning step in best-first fashion.
• We conduct extensive experiments on math reasoning and code generation tasks, demonstrating that Q* can significantly improve the multi-step reasoning capability of existing open-sourced LLMs.
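To make these pieces concrete for myself, here is a minimal, self-contained Python sketch of the pipeline as I understand it: a state is the question plus the reasoning steps generated so far, Q-value labels can be taken from the best of several rollouts, and decoding always expands the frontier state with the largest f = g + h, where g is the aggregated utility of the partial trace and h comes from the plug-and-play Q-value model. All function names here (`propose_steps`, `process_reward`, `q_value`, `rollout`) are my own hypothetical stand-ins for the paper's policy LLM, PRM and QVM, not its actual code.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """MDP state: the question plus the reasoning steps generated so far."""
    question: str
    steps: Tuple[str, ...] = ()

    def extend(self, step: str) -> "State":
        # Taking an action = appending the next reasoning step.
        return State(self.question, self.steps + (step,))


def q_label_from_rollouts(state: State, action: str,
                          rollout: Callable[[State], float],
                          n_rollouts: int = 4) -> float:
    """Label Q(s, a) with the best final reward among a few random rollouts
    continuing from (s, a) -- one of the label-construction strategies
    the paper lists for training the Q-value model."""
    next_state = state.extend(action)
    return max(rollout(next_state) for _ in range(n_rollouts))


def best_first_decode(question: str,
                      propose_steps: Callable[[State], List[str]],   # policy LLM: top-K next steps
                      process_reward: Callable[[State], float],      # g: aggregated utility so far
                      q_value: Callable[[State, str], float],        # h: learned Q-value heuristic
                      is_terminal: Callable[[State], bool],
                      max_expansions: int = 100) -> State:
    """A*-style deliberation: always expand the frontier state with the
    largest f = g + h, one reasoning step at a time (no full rollouts)."""
    start = State(question)
    tie = itertools.count()                        # unique tie-breaker for the heap
    # heapq is a min-heap, so push -f in order to pop the highest-f state first.
    frontier = [(-process_reward(start), next(tie), start)]
    best = start
    while frontier and max_expansions > 0:
        _, _, state = heapq.heappop(frontier)
        best = state
        if is_terminal(state):
            return state
        max_expansions -= 1
        for step in propose_steps(state):
            child = state.extend(step)
            f = process_reward(child) + q_value(state, step)
            heapq.heappush(frontier, (-f, next(tie), child))
    return best
```

The point of the sketch is only the control flow: at decoding time nothing beyond the candidate next steps is expanded, which is where the claimed cost advantage over MCTS rollouts comes from.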
Enhancing LLMs with planning.
Tree-of-thoughts (ToT) [25] improves the LLMs’ reasoning capability by exploring the intermediate steps towards problem solving with basic tree-search algorithms. In the same vein, A* search and MCTS are applied to serve as a planning technique to enhance the performance of LLMs when solving challenging complex reasoning problems [26, 27, 28, 34]. Unfortunately, the utility function used in these methods is either constructed from LLMs’ feedback (e.g., [25, 27]), which could be highly-inaccurate in complex problems, or specific to each individual task (e.g., [28, 34]). Moreover, planning with MCTS often requires to perform costly rollout, which can significantly slow down the overall decoding process. In contrast, our Q* solely relies on training a Q-value model to guide LLMs to select the most promising next reasoning step and the pipeline can be easily applied to various reasoning tasks without modification. Besides, we consider only a single step each time in Q*, which is much cheaper than complete rollout in MCTS-based methods.
A* [30] is an important heuristic search algorithm in deliberative planning [38], multi-agent pathfinding [39], and constraint reasoning [40]. Originally, A* is proposed for finding the shortest path from source 𝑠 to goal 𝑔 in path planning problems. It associates each frontier vertex 𝑛 with a value 𝑓(𝑛)=𝑔(𝑛)+ℎ(𝑛), where 𝑔(𝑛) is the accumulated path cost from source 𝑠 and ℎ(𝑛) is a heuristic value that estimates the cost of the shortest path from 𝑛 to goal 𝑔. The algorithm adopts a best-first search strategy, i.e., in each iteration it always picks the vertex with minimum 𝑓-value to explore until reaching the goal. When the heuristic ℎ(⋅) is admissible [41], A* guarantees to find the optimal path.
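Since the paper grounds Q* in plain A*, a tiny worked example helps keep f(n) = g(n) + h(n) straight. The sketch below runs A* on a small grid with the Manhattan distance as an admissible heuristic; it only illustrates the textbook algorithm quoted above, nothing specific to the paper.

```python
import heapq

def a_star_grid(start, goal, walls, width, height):
    """Shortest path on a 4-connected grid with unit step cost.
    f(n) = g(n) + h(n), with Manhattan distance as an admissible heuristic."""
    def h(n):
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]   # entries: (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)   # best-first: smallest f
        if node == goal:
            return path                              # optimal since h never overestimates
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nb = (nx, ny)
            if 0 <= nx < width and 0 <= ny < height and nb not in walls:
                ng = g + 1
                if ng < best_g.get(nb, float("inf")):
                    best_g[nb] = ng
                    heapq.heappush(frontier, (ng + h(nb), ng, nb, path + [nb]))
    return None

# Example: a 5x5 grid with a short wall in the middle.
print(a_star_grid((0, 0), (4, 4), {(2, 1), (2, 2), (2, 3)}, 5, 5))
```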
The excerpts above mention planning, which coincides with Yann LeCun's view: he also believes that improving AI today requires planning-related strategies. That suggests this line of research is on the right track, and the results later in the paper confirm that the method does improve LLM performance.
GSM8K. For the comparison on GSM8K dataset, we select Llama-2-7b [45] as our base model, whose accuracy can achieve 65.2% after finetuning on MetaMath [5]. Then, we treat Llama-2-7b finetuned on MetaMath as policy 𝜋𝜃, and perform random rollout to collect Q-value labels for training Q-value model (QVM). For utility aggregation, we train a process reward model (PRM) on PRM800K [22] to provide intermediate signal for each reasoning step. With PRM and QVM in hand, traditional methods tend to treat either of them as a verifier to select the Best-of-𝑁 trajectory or utilize them to perform PPO training of RLHF. As the results shown in Table 2, we can find that with the same PRM/QVM, using it for verification performs significantly better than using it for alignment. Further, in the comparison of planning-based methods, we can find that with the same QVM, Q* method with constant aggregated utility can still outperform Best-of-𝑁 method. With the PRM trained on PRM800K determining whether the intermediate reasoning steps are correct, Q* method that combines PRM and QVM achieves the best performance among all methods based on the same LLM, helping Llama-2-7b surpass the performance of close-sourced ChatGPT-turbo [46] and reaching an accuracy of 80.8%.
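For the Best-of-N baseline mentioned here, the control flow is simply: sample N complete solutions from the policy and let the PRM/QVM act as a verifier that keeps the highest-scoring one. A minimal sketch, where `sample_solution` and `score` are hypothetical stand-ins for the fine-tuned policy and the verifier:

```python
import random
from typing import Callable, List

def best_of_n(question: str,
              sample_solution: Callable[[str], str],   # stand-in for the policy LLM
              score: Callable[[str, str], float],      # stand-in for the PRM/QVM verifier
              n: int = 16) -> str:
    """Sample n complete reasoning traces and return the one the verifier scores highest."""
    candidates: List[str] = [sample_solution(question) for _ in range(n)]
    return max(candidates, key=lambda sol: score(question, sol))

# Toy usage with dummy stand-ins, just to show the control flow.
if __name__ == "__main__":
    dummy_policy = lambda q: f"solution with quality {random.random():.3f}"
    dummy_verifier = lambda q, sol: float(sol.rsplit(" ", 1)[-1])
    print(best_of_n("What is 12 * 7?", dummy_policy, dummy_verifier, n=8))
```

The contrast with Q* is that the same QVM is used to steer every intermediate step, rather than only ranking finished samples.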
MATH. As the results shown in Table 3, considering the weak performance of Llama-2-7b fine-tuned with MetaMath for the MATH dataset, we seek for two other stronger LLMs to evaluate the effectiveness of our Q* method. One is Llama-2-7b fine-tuned on Synthetic Data [47], which is constructed following the instruction of scaling up the SFT data, and achieves 41.9% accuracy on MATH dataset, approaching the performance of GPT-4 [48]. The other base model is DeepSeek-Math-7b [49], which could be the most powerful open-source 7b model for math reasoning on MATH dataset, achieving 50.8% accuracy in our evaluation. From the results shown in the second and third blocks of Table 3, we can find that Q* can still lead to further performance improvement compared to the Best-of-N method on either of base models. Additionally, it is noteworthy that the performance of DeepSeek-Math-7b enhanced with Q* has already surpassed a series of closed-source models on the leaderboard of MATH dataset, such as Gemini Ultra (4-shot), reaching an accuracy of 55.4%.
This part genuinely surprised me: a 7B-parameter model can get close to GPT-4's level on math, which shows how effective the method is. According to speculation from users abroad, GPT-4 is probably on the order of one trillion parameters, i.e., 1000B; after all, Nvidia has trained a 340B model, so a 1-trillion-parameter GPT-4 is plausible. With this research method in play, the following seems very likely to become achievable:
In addition, OpenAI's CTO mentioned in an interview that we will see AI reaching PhD level in specific domains. If so, the speculation above becomes even more plausible: AGI could be achievable in the short term, within three to five years.
Finally, my expectations for superintelligence: I think that chain-of-thought combined with DAG-based causal inference, together with the Q* approach above, could make ASI attainable. That is probably why Ilya spoke of this concept with such confidence; perhaps he also has other, more specialized methods for closing the gap.