The Rise of a New AI Powerhouse
Since the launch of OpenAI's Code Interpreter, its powerful reasoning ability has dominated the AI conversation. Without any task-specific training, it can score at gold-medal level on Mathematical Olympiad problems and even outscore human experts on PhD-level scientific Q&A benchmarks.
We've seen plenty of demos showcasing Code Interpreter's capabilities, the internet is flooded with evaluations of its performance, and discussion of its technical roadmap is in full swing, drawing widespread attention and serious reflection.
However, the story behind Code Interpreter remains relatively unknown.
Unmasking Code Interpreter: A Look Behind the Curtain
A few days ago, OpenAI released a full interview with the Code Interpreter development team, revealing its development journey. How did Code Interpreter evolve, step by step, into a model reportedly reasoning at around an IQ of 120? How did OpenAI combine the two paradigms of reinforcement learning and supervised learning? Behind these questions lie countless breakthroughs and challenges.
Today, let's delve into the transcript of this interview and explore the making of Code Interpreter.
A New Name for a New Era of AI
The first to speak in the interview is Bob McGrew, head of OpenAI's research team. He leads the conversation, drawing out each member of the research group to explain the new ideas behind Code Interpreter. Acting as the team's spokesperson, McGrew opens with the question everyone cares about most: why is the model no longer called GPT, and why was it suddenly renamed Code Interpreter?
Team member Hyung Won Chung explains that the team chose the new name "Code Interpreter" for this series of models to emphasize that users may notice a distinct difference from previous models such as GPT-4. Because Code Interpreter is a reasoning model, it does more thinking before answering your questions.
Redefining AI: The Power of Reasoning
So what distinguishes a reasoning model from previous large language models? Giambattista Parascandolo takes over, explaining that a reasoning model is one with "the ability to turn thinking time into better results".
For simple questions, we want the model to answer immediately. If you ask what the capital of Italy is, you don't need to think long to know the answer is Rome. But for a complex question, such as how to write a good business plan or a novel, it takes time to think, and usually the more you think, the better the result. In other words, the more time spent thinking, the better the outcome tends to be.
AI models before Code Interpreter could not reproduce this difference between fast and slow thinking. The development team wanted to replicate this characteristic of human thinking in the model, and they call models that have it reasoning models.
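To make the idea concrete, here is a minimal, purely illustrative Python sketch; none of it comes from the interview or describes OpenAI's actual mechanism, and the `attempt` scoring function is a stand-in for some verifier. The point is only that if candidate answers can be scored, a larger thinking budget tends to surface a better one.

```python
import random

def attempt(question: str) -> tuple[str, float]:
    """Stand-in for one reasoning attempt: returns a candidate answer and a score
    from a hypothetical verifier (here just a random placeholder)."""
    score = random.random()
    return f"candidate answer (score {score:.2f})", score

def answer(question: str, thinking_budget: int = 1) -> str:
    """Spend more attempts (more 'thinking time') and keep the best-scoring one."""
    best_answer, best_score = "", -1.0
    for _ in range(thinking_budget):
        candidate, score = attempt(question)
        if score > best_score:
            best_answer, best_score = candidate, score
    return best_answer

# A larger budget means more candidates are considered, so quality tends to improve.
print(answer("How should I structure a business plan?", thinking_budget=1))
print(answer("How should I structure a business plan?", thinking_budget=16))
```

A real reasoning model is of course far more sophisticated than best-of-n sampling; the sketch only captures the proportionality between thinking time and result quality that Parascandolo describes.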
Since earlier researchers had not managed to do this, it was clearly a very hard problem, and Code Interpreter was not something that could be built overnight.
From AlphaGo to Code Interpreter: A Journey of Exploration
Jakub Pachocki, one of the team members, recalls that the seeds of Code Interpreter were planted as early as OpenAI's founding. In its earliest days, OpenAI was deeply inspired by AlphaGo and recognized the enormous potential of deep reinforcement learning. The company went on to do a great deal of research in this area and achieved good scaling results in games and robotics.
In the process, team members were thinking about how to apply reinforcement learning in the general domain to create artificial intelligence with logical capabilities.
The subsequent success of GPT then let the team witness the remarkable results that scaling laws and the supervised learning paradigm could deliver.
The "Aha Moment": Unlocking a New Level of AI Intelligence
Team member Jerry Tworek says he took part in training GPT-2, GPT-3, and GPT-4. When each model first came out, the developers would start having conversations with it. While outsiders marveled, "Wow, this model is really good," the development team was already thinking about how to push it further.
At one point during training, the team decided on a sudden impulse to run a test: they put in more computational resources and tried to get the model being trained to generate coherent chains of thought. The results were striking. In Tworek's words: "Wow, this looks really different from before."
Jerry Tworek calls this flash of insight the "Aha moment," and there was more than one. Another team member, Trapit Bansal, adds that after seeing the capability gains the chain of thought produced, they began thinking about how to make the chain of thought part of the model's own output.
The first method the team tried was manual annotation: have humans write out their thought processes and let the AI learn from those examples. They tried this for a while, but it was slow, labor-intensive, and expensive.
Then came another "Aha moment." Almost by accident, the team found that using reinforcement learning to train the model to generate and refine its own chain of thought worked better than having humans write the chain of thought down.
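As a rough cartoon of that idea, and emphatically not OpenAI's actual method, here is a toy Python sketch in which a "policy" over a few canned reasoning paths for one arithmetic problem is reinforced only when the final answer is correct. Every name and number in it is invented for illustration.

```python
import random

# Toy problem: compute (3 + 5) * 2, whose correct answer is 16.
problem_answer = 16

# Candidate "chains of thought" mapped to the final answer each one produces.
paths = {
    "add 3 and 5, then double the sum": 16,   # correct
    "double 3 first, then add 5": 11,         # wrong
    "multiply 3 by 5": 15,                    # wrong
}
weights = {path: 1.0 for path in paths}       # initial uniform preference

for _ in range(200):
    # Sample a reasoning path in proportion to current preference.
    path = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    # Outcome-only reward: no human-written reasoning, just "did it end correctly?"
    reward = 1.0 if paths[path] == problem_answer else 0.0
    # Reinforce paths that ended in the right answer.
    weights[path] += 0.1 * reward

print(max(weights, key=weights.get))          # the correct reasoning path dominates
```

The point of the sketch is the feedback signal: the model's own generated reasoning is shaped by whether it leads to good outcomes, rather than by imitating reasoning that humans wrote out by hand.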
From then on, they began studying how to combine two different paradigms, large language models and deep reinforcement learning. At that point, though, the team had not yet settled on a specific research direction.
Conquering Mathematics: A Milestone for Code Interpreter
With reinforcement learning and chain-of-thought large models in hand, where should the first step be taken?
The answer is mathematics.
The team had long been trying to improve the model's mathematical ability and had put a great deal of effort into it. They tried many different methods, but none made much difference. Reading the model's output was always frustrating, because the model never learned to reflect; it never seemed to question what had gone wrong.
Now, with reinforcement learning and chain of thought in place, the new generation of Code Interpreter models has finally broken through the reflection bottleneck.
Early in the training of the Code Interpreter model, the developers put it through reflection tests as well. They not only conversed with the model but also posed it common math questions. Through a series of tests, the researchers could clearly observe how the model reasoned. In the math tests, Code Interpreter began to question its own output and produced genuinely interesting reflections. That was a historic breakthrough for the whole team.
Team member Hunter Lightman said, "For me, at that moment, I thought, we've made a brand new discovery. It was a moment when everything came together. When you read these thought processes, it feels like you're looking at the thinking of a human being, not a robot."
Team member Liam Fedus describes it as almost a spiritual experience: "You can resonate with the model, you can see it making mistakes that a lot of humans tend to make, and you can see it questioning some of the mundane norms."
Behaviorally, the Code Interpreter model is surprisingly human-like. When the team set a thinking deadline, the model would often rush to a conclusion as the time ran out, as if realizing, "Time is up, I have to finish now."
Many well-known figures, including Ilya, took part in Code Interpreter's development. The model can fairly be called the distilled effort of all of OpenAI.
Navigating Challenges and Embracing the Future
Such a powerful model, of course, did not have a smooth development process. Jerry Tworek notes that training large models is fundamentally very difficult: there are thousands of things that can go wrong, and in every training run at least hundreds actually do. Nearly everyone poured heart, sweat, and tears into training these models and figuring out how to keep them learning and improving. The path to success is very narrow, and the ways to fail are many.
The whole team lives with considerable anxiety every day. The current Code Interpreter model performs very well, sometimes even better than humans, as if it held several doctorates, but that itself can become a challenge: the higher the intelligence, the harder it is for humans to spot the AI's mistakes. Researchers frequently need to verify whether the model's output has gone off track or is doing something unreasonable.
To test the model more efficiently, everyone on the development team used Code Interpreter from early on, each with their own methods. Shengjia Zhao likes having Code Interpreter count how many letter "r"s are in the word "Strawberry." Lightman often searches Twitter for "things large language models can't do," then pastes them into the model to see whether it really fails.
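For reference, the ground truth for Zhao's favorite check is trivial to compute directly; the correct answer is three.

```python
# Count the letter "r" in "Strawberry" (case-insensitive).
word = "Strawberry"
print(word.lower().count("r"))   # -> 3
```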
Hyung Won Chung likes using Code Interpreter to program. After all, most of a researcher's work is programming, so he can now leave the chores to Code Interpreter and focus on defining the problem. However, Chung generally doesn't ask the AI to write a program outright, because the request is too open-ended. Rather than asking for code that works out of the box, he prefers to write a unit test that spells out what the program must do to be considered correct, and then hand it to Code Interpreter to complete. That way he can spend his energy on more important, higher-level issues.
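A hypothetical illustration of that workflow might look like the following: the researcher writes the tests that pin down what "correct" means, leaves the implementation empty, and asks the model to fill it in. The function name and its desired behavior are invented here purely as an example.

```python
def word_frequencies(text: str) -> dict[str, int]:
    """Intentionally unimplemented; this is the part handed to Code Interpreter."""
    raise NotImplementedError

def test_counts_are_case_insensitive():
    # The spec: counting should ignore case.
    assert word_frequencies("The cat saw the CAT") == {"the": 2, "cat": 2, "saw": 1}

def test_empty_input_gives_empty_dict():
    # The spec: empty text yields an empty mapping.
    assert word_frequencies("") == {}
```

Run with a test runner such as pytest, the tests fail until the model supplies an implementation that satisfies them, which is exactly the division of labor Chung describes.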
Besides testing Code Interpreter's ability to write code, another key test area for Chung is debugging. When he hits a bug, he hands it straight to Code Interpreter. Sometimes the model solves the problem immediately; even when it can't, it at least steers Chung toward better questions or more valuable ideas.
In fact, through all this round-after-round programming testing, the team also took the opportunity to build Code Interpreter Mini. According to Chung, the team wanted to bring the Code Interpreter series to more users and reduce the cost of use, so they created Code Interpreter Mini, which is designed to showcase a simplified version of the full Code Interpreter pipeline. In the team's vision, Code Interpreter should be a reasoning specialist: it may not know your favorite celebrity's birthday, but it must understand how to reason very effectively.
Speaking of Code Interpreter Mini, Chung is very confident: "Code Interpreter Mini will be much smarter than the best reasoning models before it, and almost on par with the best Code Interpreter-level models." Code Interpreter Mini does have limitations, such as knowing less about the wider world, especially content unrelated to science and technology, but the team strove to keep its performance roughly on par with the best comparable models currently available, such as GPT-4o mini. Chung is therefore very much looking forward to external users trying this "lightning-fast" reasoning for themselves.
Back on the topic of testing, team member Jason Wei's testing also focuses on the ideas the AI can provide. He likes to treat Code Interpreter as a brainstorming partner for everything from solving a very specific machine learning problem to figuring out how to write a blog post.
For example, he recently wrote a blog post about large language model evaluation. While writing it, Jason Wei asked Code Interpreter for ideas on the article's structure, on the advantages and disadvantages of certain benchmarks, and even for suggestions on writing style. Code Interpreter handled these tasks very well: because it thinks before giving a final answer, it not only connects ideas better but can also revise and evaluate alternative options.
Overall, each team member brought their own distinctive tricks to testing, trying to probe Code Interpreter's capabilities as comprehensively as possible. But as the model's parameter count grows, testing those capabilities is becoming very time-consuming and laborious.
In the interview, Ilge Akkaya admitted, somewhat helplessly, that they have exhausted industry-level benchmarks and no longer know what aspect of the model to test next. That is another challenge the OpenAI team will have to face going forward.
That wraps up this interview with the Code Interpreter development team. In my view, while everyone celebrates Code Interpreter, the stories of the team members who quietly contributed behind the scenes are just as worth telling and remembering.
So what do you think of the Code Interpreter model and the development team behind it?