Here’s a peek into the intricacies of ChatGPT’s functioning 👇
ChatGPT uses Reinforcement Learning (RL) powered by human feedback to align its outputs with what people actually want.
Before this approach, “algorithmic bias” was a common challenge when training language models: they would reproduce real-world biases in their outputs, which limited both creativity and relevance.
The solution? A mathematical reward function that rewards or penalizes model outputs based on how well they align with desired outcomes.
This is the idea behind Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback to fine-tune language models.
⚡Here’s how ChatGPT is using RLHF:
– Supervised fine-tuning of the base model
– Training a reward model on human preferences
– Fine-tuning the model against the reward model with RL
So, initially, prompts are sampled from sources such as user conversations and API requests, and human labelers write high-quality demonstration responses; this data is used to fine-tune the base model with supervised learning.
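Here’s a minimal sketch of that supervised fine-tuning step. The tiny embedding-plus-linear “language model” and the random demonstration tokens are toy stand-ins I’ve assumed for illustration; the real pipeline uses a large transformer and human-written responses.

```python
# Sketch of supervised fine-tuning on demonstration data (toy model, toy data).
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

# Toy stand-in for a pretrained language model: embedding -> linear head.
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical demonstration data: token ids for (prompt + labeler response).
demo_tokens = torch.randint(0, vocab_size, (8, 16))  # 8 sequences, length 16

for _ in range(3):  # a few passes over the demonstrations
    inputs, targets = demo_tokens[:, :-1], demo_tokens[:, 1:]
    logits = model(inputs)                              # (batch, seq-1, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```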
But since it isn’t viable to have humans label every output at the scale LLMs need, a reward model is trained instead: labelers rank several responses to the same prompt, and the reward model learns to predict those rankings so it can score new responses automatically.
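Here’s a sketch of how such a reward model could be trained with a pairwise ranking loss, so preferred responses get higher scores than rejected ones. The mean-pooled scorer and the random “chosen/rejected” pairs are assumptions for illustration, not ChatGPT’s actual architecture or data.

```python
# Sketch of reward-model training on human preference pairs (toy setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 100, 32

# Toy reward model: embed tokens, average-pool, map to a single scalar score.
class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical comparison data: for each prompt, a preferred and a rejected response.
chosen   = torch.randint(0, vocab_size, (8, 16))
rejected = torch.randint(0, vocab_size, (8, 16))

for _ in range(3):
    # Pairwise ranking loss: push the chosen response's score above the rejected one's.
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```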
Now, based on the reward, the model gets updated. To do this iteratively, the Proximal Policy Optimization (PPO) algorithm is used: it updates the policy to increase reward while keeping each step close to the previous policy, so the changes aren’t too drastic.
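To make that concrete, here’s a sketch of one PPO-style clipped update. The log-probabilities, advantages, and clipping value below are made-up placeholders standing in for real policy outputs and reward-model scores, not ChatGPT’s actual numbers.

```python
# Sketch of a single PPO clipped-objective update on a toy rollout batch.
import torch

clip_eps = 0.2  # PPO clipping range (a typical value, not ChatGPT's actual setting)

# Hypothetical per-response quantities from a rollout batch:
old_log_probs = torch.tensor([-1.2, -0.8, -2.0])                       # policy that generated the responses
new_log_probs = torch.tensor([-1.0, -0.9, -1.5], requires_grad=True)   # policy being updated
advantages    = torch.tensor([0.5, -0.3, 1.2])                          # reward-model scores minus a baseline

# Probability ratio between the new and old policy for each response.
ratio = torch.exp(new_log_probs - old_log_probs)

# Clipped surrogate objective: take the more pessimistic of the unclipped
# and clipped terms, so large policy shifts get no extra credit.
unclipped = ratio * advantages
clipped   = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
ppo_loss  = -torch.min(unclipped, clipped).mean()

ppo_loss.backward()  # gradients would then be applied by an optimizer step
```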
That’s what goes on under the hood of ChatGPT, and its success needs no explanation: just look at all the talk about AI replacing jobs.