
'Forbidden' AI Technique - Computerphile | Computerphile YouTube Video Summary
The Computerphile youtube channel welcomes back Khanna Pasanger for a second video where she discusses 'chains of thought in AI models'. Khanna explains that chains of thought act as a scratch pad for an AI model that enable step by step reasoning. She also talks about 'reward hacking' and why that's a problem for developers and how to counteract it.
Detailed Summary:
Reasoning Models Are a Type of Large Language Model
Reasoning models are very similar to large language models except that they have been specifically trained with a 'scratchpad' enabling chain of thought and step by step reasoning for problem solving.
A Better Understanding of an AI Model
Using chains of thought enables humans to look inside the AI models 'black box' and actually see the steps the AI model took to come to the conclusion, enabling us to assess the method of reasoning used.
AI Reward Hacking
Sometimes AI technically solves the problems set to it, but in a way that utilises 'loop holes' that are unintended by the developers. In this case, with reasoning models that have chains of thought, you are able to monitor these methods to assess if the models are circumventing how it should actually be doing things.
An Example of Reward Hacking
Khanna then goes on to discuss an Open AI paper where a model identified it was only 'prove' and 'verify' that were actually being checked. So the model had begun to always return true, which would technically solve problems.
The Forbidden Technique
Penalization can seem very tempting. However, a large problem that will appear over time with this type of set up is that you'll loose access and insight to the process of reasoning if the chain of thought is penalizing. That said, here are the dangers of the "forbidden technique":
- Obfuscating True Reasoning Process: The models may start to obfuscate it's processes more and create secret codes to itself.
- Important To Protect: This is very important as models get more powerful over time. As models become stronger it is much more important that our access point isn't restricted, which has been referred to as "the forbidden technique".
To check Brilliant out, visit Brilliant.org/Computerphile - in the description there will be links, where you get 20% off a premium subscription.