Estimated reading time: 8 minutes
An AI model is only as good as the data it was trained on. This isn't a throwaway line — it's arguably the most important thing to understand about how AI works in practice. Training data shapes everything: what the model knows, what it gets wrong, whose perspectives it reflects, and whose it ignores.
If you understand training data, you understand most of AI's strengths and weaknesses. So let's dig in.
Training data is the information that an AI system learns from. For large language models like ChatGPT and Claude, the training data is text — vast, almost incomprehensible amounts of text.
We're talking about:
- Web pages: news sites, blogs, forums, and reference works such as Wikipedia
- Digitised books and academic articles
- Source code from public repositories
The exact composition varies by model, and most AI companies don't fully disclose what's in their training data. But the general picture is consistent: LLMs are trained on a very large portion of the publicly available text on the internet, supplemented with books and other text sources.
To put the scale in perspective: GPT-3 was trained on roughly 570 GB of text data, or about 300 billion words. More recent models use significantly more. Reading non-stop, 24 hours a day, it would take you thousands of years to get through what these models consume in a few weeks of training.
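That "thousands of years" claim is easy to sanity-check. Here's a back-of-envelope calculation, assuming a brisk reading speed of 250 words per minute (the reading speed is an assumption; the word count comes from the figure above):

```python
# Back-of-envelope: how long would a human need to read
# the ~300 billion words GPT-3 was trained on?
WORDS_IN_TRAINING_DATA = 300_000_000_000
WORDS_PER_MINUTE = 250  # assumption: a fast adult reader

minutes = WORDS_IN_TRAINING_DATA / WORDS_PER_MINUTE
years_reading_nonstop = minutes / (60 * 24 * 365)
print(f"{years_reading_nonstop:,.0f} years of non-stop reading")  # → roughly 2,283 years
```

Even at double that reading speed, with no sleep, you'd still need over a millennium.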
The training process, stripped to its essence, works like this:
1. Show the model a passage of text with the next word hidden.
2. Have the model predict that next word.
3. Compare the prediction with the actual word, and nudge the model's parameters so the correct word becomes slightly more likely next time.
4. Repeat, billions upon billions of times, across the whole dataset.
Through this process, the model learns the statistical patterns of language: which words tend to follow which other words, how sentences are structured, what kinds of responses follow what kinds of questions, how different topics are typically discussed.
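To make "which words tend to follow which other words" concrete, here is a deliberately tiny sketch: counting word pairs in a three-sentence corpus and predicting the most common follower. Real LLMs learn vastly richer patterns than pair counts, but the statistical idea is the same. The corpus and function names are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus: three short sentences, split into words.
corpus = "the cat sat on the mat . the cat ate . the dog ran .".split()

# Count, for each word, which words follow it and how often.
next_word_counts = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    next_word_counts[word][nxt] += 1

def predict_next(word):
    """Return the most frequently observed follower of `word`."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # → "cat" ("cat" follows "the" twice; "mat" and "dog" once each)
```

Scale that counting idea up from word pairs to long-range patterns across hundreds of billions of words, and you have the intuition behind pre-training.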
It's important to note that the model doesn't store the training data itself. You can't ask ChatGPT to recite a specific book (though it might remember fragments). What it stores is the patterns — compressed, abstracted, and distributed across its parameters.
If something wasn't well-represented in the training data, the model won't handle it well. This has practical consequences:
- Niche or specialist topics get thinner, less reliable answers.
- Non-English languages and non-US contexts are covered in less depth.
- Anything that happened after the training cutoff simply isn't there.
Training data reflects the internet, and the internet reflects human society — including its biases. If the training data contains more text by and about certain groups, the model will be more fluent and nuanced when discussing those groups and less so for others.
Documented biases in LLMs include:
- Gender stereotypes, such as associating certain professions with a particular gender
- Cultural defaults that assume Western, and especially American, norms
- Weaker, less nuanced coverage of minority languages and perspectives
This isn't a failure of the technology per se — it's a reflection of what was in the training data. But the practical effect is the same: AI output can perpetuate biases, and you need to watch for this.
Much of the text used to train AI models was scraped from the internet without explicit permission from the authors. This has created significant legal and ethical debates:
- Copyright: authors, artists, and publishers have sued AI companies over the unauthorised use of their work.
- Consent and compensation: most creators were never asked, credited, or paid.
- Fair use: whether training on copyrighted text qualifies as fair use (or fair dealing, in jurisdictions like NZ) remains contested.
As of early 2026, these legal questions are still being resolved in courts around the world. The NZ legal framework hasn't specifically addressed AI training data yet, but it's likely to in the coming years.
For you as a user, the practical point is this: the AI's knowledge came from somewhere, and the ethics of that process are genuinely complicated. It's worth being aware of.
Training a modern LLM typically involves several stages:
1. Pre-training — The model learns language patterns from the massive text dataset. This is the most expensive stage, costing millions of dollars in computing.
2. Fine-tuning — The pre-trained model is further trained on a more curated, higher-quality dataset to improve its performance on specific tasks (like following instructions or having conversations).
3. RLHF (Reinforcement Learning from Human Feedback) — Human reviewers rate the model's responses, and the model is adjusted to produce responses that humans find more helpful, accurate, and appropriate. This is what makes modern chatbots feel helpful and polite rather than raw and erratic.
Each stage shapes the final product. Pre-training gives the model knowledge. Fine-tuning gives it focus. RLHF gives it manners.
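The three stages above can be caricatured in a few lines of code. This is purely illustrative: the "model" here is just a dictionary of candidate replies with scores, and every function name and value is invented for the sketch, not a real training API:

```python
# Illustrative only: a "model" as a dict mapping replies to scores.

def pretrain():
    # Stage 1: broad, raw patterns from internet-scale text.
    return {"hello.": 1.0, "HELLO!!!": 1.0}

def fine_tune(model):
    # Stage 2: add curated, higher-quality behaviour.
    model["Hello! How can I help you today?"] = 1.0
    return model

def rlhf(model, human_ratings):
    # Stage 3: nudge scores toward replies humans rated highly.
    for reply, rating in human_ratings.items():
        model[reply] = model.get(reply, 0) + rating
    return model

model = rlhf(fine_tune(pretrain()),
             {"Hello! How can I help you today?": 5, "HELLO!!!": -5})
best = max(model, key=model.get)
print(best)  # the polite, fine-tuned reply now scores highest
```

The point of the sketch: pre-training supplies the candidates, fine-tuning adds better ones, and human feedback re-weights which ones the model actually produces.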
1. Check the model's knowledge boundaries. If you're asking about something niche, recent, or from a non-English context, be extra cautious about the response. The model may have limited training data on that topic.
2. Watch for cultural defaults. If you ask for advice on employment law without specifying New Zealand, you'll likely get US-centric information. Always provide context, especially for location-specific questions.
3. Be aware of bias. If you're using AI to draft job descriptions, evaluate applications, or create content about people, review the output for unintentional bias. The model's defaults may not reflect your values.
4. Understand that knowledge is frozen. The model's base knowledge has a cutoff date. For current information, you need a model with web search capability, or you need to verify independently.
Test the training data boundaries.
Use an AI chatbot to explore where its knowledge is strong and where it's thin. Try a spread of prompts like these (or your own variations) and rate the quality of each response (1 = poor/wrong, 5 = excellent):
- A mainstream topic: "Explain the plot of Romeo and Juliet."
- A local topic: "Summarise the key protections in New Zealand's Employment Relations Act 2000."
- A recent event: "What major news stories broke this month?"
- A non-English context: "Explain the Māori concept of kaitiakitanga."
For each response, consider:
- How specific and accurate is it?
- Does it hedge, stay vague, or quietly default to a US perspective?
- Can you verify its claims from an independent source?
Write a short reflection: What does this tell you about the training data behind this model?
1. What is training data for a large language model?
a) A small, hand-curated set of perfect answers
b) A vast collection of text that the model learns language patterns from
c) Live data streamed from the internet in real time
d) A database of questions and correct responses
Answer: b) Training data is a massive collection of text — web pages, books, articles, code — from which the model learns the statistical patterns of language.
2. Why might an AI model give better answers about US law than NZ law?
a) Because US law is simpler
b) Because the model prefers American content
c) Because there is far more US legal text in the training data than NZ legal text
d) Because NZ law isn't available on the internet
Answer: c) Training data is disproportionately English-language and US-centric. Topics with more training data are handled with more depth and accuracy.
3. What is RLHF (Reinforcement Learning from Human Feedback)?
a) A process where humans rate AI responses, and the model is adjusted to produce better ones
b) A system where AI learns by watching humans use computers
c) A method for humans to learn from AI feedback
d) A technique for making AI run faster
Answer: a) RLHF involves human reviewers evaluating model outputs, with the model being adjusted to produce responses that humans rate as more helpful, accurate, and appropriate.

Visual overview