By adding rewards for logical reasoning steps, a new training framework aims to make large language models more robust and reliable.
When asked to solve a tricky problem, some large language models (LLMs) today offer step-by-step explanations of their reasoning. However, these ‘thought processes’ are often riddled with errors or outright fabrications. Cleaning up this messy logic usually requires human annotators, or the model itself, to review each step, which costs additional time and resources. So, how might a model be taught to reason more reliably on its own?
“One idea we had was to teach LLMs to evaluate their reasoning with heuristic, or ‘good enough’, real-time search methods,” said Fangkai Jiao, a fourth-year PhD student at Nanyang Technological University (NTU) and the A*STAR Institute for Infocomm Research (A*STAR I2R). “At the same time, we wondered if—like a chess player studying past games—models could learn from their own offline reasoning histories to identify promising steps, without relying on human annotation or real-time search.”
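To illustrate the offline, search-free idea in the quote above, the sketch below shows one way a reasoning step’s promise could be estimated from previously collected traces: score each intermediate step by the fraction of stored continuations that ended in a correct answer. This is a minimal illustration only; the trace format, field names and function here are hypothetical and not taken from the team’s paper.

```python
from collections import defaultdict

def estimate_step_values(offline_traces):
    """Score each intermediate reasoning step by the fraction of stored
    continuations (collected offline, with no live search) that reached a
    correct final answer. `offline_traces` is a list of dicts with
    hypothetical fields: 'steps' (list of str) and 'correct' (bool)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for trace in offline_traces:
        prefix = ()
        for step in trace["steps"]:
            prefix = prefix + (step,)
            totals[prefix] += 1
            hits[prefix] += int(trace["correct"])
    # Value of a step prefix = empirical success rate of its continuations.
    return {prefix: hits[prefix] / totals[prefix] for prefix in totals}

# Example: two offline traces share a first step; the second step that led
# to a correct answer receives the higher estimated value.
traces = [
    {"steps": ["parse the question", "apply rule A"], "correct": True},
    {"steps": ["parse the question", "apply rule B"], "correct": False},
]
print(estimate_step_values(traces))
```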
With this twofold approach in mind, Jiao and A*STAR I2R colleagues, including Principal Scientist Nancy Chen and Lead Research Engineer Zhengyuan Liu, collaborated with NTU and Salesforce Research Singapore to develop a novel offline LLM training framework based on process-supervised Direct Preference Optimisation (pDPO). Their latest work builds on a previous study on self-supervised, logic-enhanced training for LLMs, which focused on activating LLM reasoning through in-context learning...
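For readers curious how process-level preferences might plug into Direct Preference Optimisation, below is a minimal PyTorch sketch of a step-level DPO loss: given a preferred and a dispreferred reasoning step sampled at the same point in a solution, the policy is nudged toward the preferred step relative to a frozen reference model. The tensor names, β value and pairing scheme are illustrative assumptions, not the exact pDPO recipe from the paper.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective applied to a pair of reasoning *steps*
    (log-probabilities summed over each step's tokens). All inputs are
    1-D tensors of shape [batch]; beta is an illustrative choice."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Maximise the log-odds that the preferred step out-scores the other.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of two step pairs.
loss = step_dpo_loss(
    policy_logp_chosen=torch.tensor([-12.3, -9.8]),
    policy_logp_rejected=torch.tensor([-14.1, -11.0]),
    ref_logp_chosen=torch.tensor([-12.0, -10.1]),
    ref_logp_rejected=torch.tensor([-13.5, -10.7]),
)
print(loss.item())
```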
Read the full article: https://bit.ly/4rN0fQc