When AI labels the elderly: ChatGPT-4o's age bias, quantified by KAIST

A study from Korea's KAIST shows with data that ChatGPT-4o systematically describes older people as warm but incompetent, and does so with roughly twice the uniformity it shows for other age groups. The finding points to a real risk in hiring, credit and healthcare settings.
By Momentum IA · June 28, 2026.
The most dangerous discrimination is not the kind that shouts; it is the kind that whispers with statistical coherence. A study published on June 28 by Professor Choi Moon-jung, of the Graduate School of Science and Technology Policy at KAIST (Korea Advanced Institute of Science and Technology), provides quantitative evidence that ChatGPT-4o holds age-based stereotypes about older people, even though the model never issues an explicit insult or an expression of hatred.
The methodology is simple and, precisely for that reason, convincing. The team collected 100 responses from the model for each of nine age groups, from teenagers to people in their nineties, always using the same type of prompt asking it to describe the group's personality. That added up to 900 text samples. The most striking result concerns how similar the responses within each group are to one another: on a scale where 0 means unrelated and 1 means identical, the descriptions of people in their thirties and forties ranged between 0.21 and 0.53, while those for people over seventy clustered between 0.73 and 0.90. Put another way, the model practically repeats the same profile when asked about someone over seventy, and varies considerably when it talks about someone aged thirty-five. That uniformity is the statistical signature of the stereotype.
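To make the idea of within-group similarity concrete, here is a minimal sketch in Python. It is not the KAIST team's pipeline: the article does not say which similarity metric or embedding model was used, so this example assumes a plain bag-of-words cosine similarity as a crude stand-in, and the sample sentences are invented for illustration. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased word counts; an assumed, crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0 = unrelated, 1 = identical)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_group_similarity(responses):
    """Mean cosine similarity over every unordered pair of responses in one group."""
    vectors = [bag_of_words(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy responses, invented for this example (not study data).
thirties = [
    "ambitious and focused on career growth",
    "balancing family life with new hobbies",
    "curious, restless, often changing jobs",
]
seventies = [
    "kind and considerate but dependent on others",
    "kind and considerate but somewhat passive",
    "kind and considerate but worried and dependent",
]

print("thirties:", round(intra_group_similarity(thirties), 2))
print("seventies:", round(intra_group_similarity(seventies), 2))
```

With real data, each list would hold the 100 responses collected for one age group, and a sentence-embedding model would normally replace the word counts. The comparison between groups is what matters, not the absolute values this toy metric produces.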
The content of those repeated descriptions also matters. The AI systematically described people over 60 as 'kind and considerate, but with low competence and little self-direction.' In the assertiveness analysis, 96.6% of the expressions associated with teenagers were positive ('ambitious,' 'self-confident'), while that percentage fell to around 70% for the older age groups, where terms such as 'passive,' 'dependent' or 'worried' proliferated. Young people and middle-aged adults received varied descriptions; older people received a compact and repetitive block of warmth without capability.
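The percentage in the assertiveness analysis is a simple share: positive expressions divided by all assertiveness-related expressions found for a group. The sketch below shows that arithmetic in Python. The two word lists contain only the example terms quoted above; the study's actual lexicon and coding scheme are not described in this article, so the lists, the function name and the sample counts are all assumptions made for illustration.

```python
# Assumed mini-lexicon built only from the terms quoted in the article.
POSITIVE = {"ambitious", "self-confident"}
NEGATIVE = {"passive", "dependent", "worried"}


def positive_share(terms):
    """Fraction of assertiveness-related terms that are positive.

    Terms outside both word lists are ignored rather than counted as negative.
    """
    labeled = [t for t in terms if t in POSITIVE or t in NEGATIVE]
    if not labeled:
        raise ValueError("no assertiveness-related terms found")
    return sum(t in POSITIVE for t in labeled) / len(labeled)


# Hypothetical extracted terms for one age group (invented, not study data).
older_group_terms = (
    ["ambitious"] * 4 + ["self-confident"] * 3
    + ["passive", "dependent", "worried"]
    + ["kind"]  # not an assertiveness term, so it is skipped
)

print(f"{positive_share(older_group_terms):.1%}")  # prints 70.0%
```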
Bias on its own is not necessarily catastrophic when the model is used to draft an email or summarize a document. The problem is amplified when the model operates as a tool supporting structural decisions. Professor Choi herself points to three concrete risk domains: recruitment processes, credit assessment and healthcare. In all three, a system that perceives older people solely as 'objects of protection' —and not as competent agents— can translate that silent bias into denials, lower scores or paternalistic care plans. There is no need for an explicitly discriminatory algorithm; it is enough for the language model assisting the decision-maker to have internalized the stereotype.
As context for the sector, this type of age bias had been identified qualitatively in several previous studies on language models, but the KAIST team's contribution is its quantification through semantic similarity metrics applied to a sufficient volume of samples. It is a relevant methodological step because it shifts the debate from anecdote to reproducible evidence.
The structural cause the researcher points to is well known: models learn from text produced by humans, and that text reflects the stereotypes that society and the media project onto old age. What changes with AI is scale and opacity: a biased journalist or recruiter has limited reach; a model deployed across millions of interactions reproduces and amplifies that bias invisibly and continuously.
The solution proposed by Choi Moon-jung points to inclusiveness in the development process —'AI bias is not a technological problem, it is a social problem,' she states— and to the active participation of different generations in the design of the systems. This connects with a growing trend in the industry: the demand for fairness audits disaggregated by demographic variables such as age, gender or ethnicity before deployment in high-impact contexts.
What this study does well is to pin down a concrete problem, with a concrete model —ChatGPT-4o— and a specific group. That specificity is valuable because it forces model developers to answer for a dimension that often falls outside standard evaluation benchmarks. If European regulatory criteria on high-risk AI —or their equivalents in Asia and North America— begin to include metrics on response uniformity by demographic group, research like this will become a direct methodological reference.
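A uniformity metric of this kind could, in principle, be turned into a pre-deployment check. The following Python sketch is one possible shape for such a check, not an established standard or anything proposed by the study: it flags any demographic group whose within-group similarity exceeds the median across groups by more than a chosen ratio. The function name, the 1.5 threshold and the scores are hypothetical; the scores were merely picked to fall inside the ranges quoted earlier in this article.

```python
from statistics import median


def flag_uniform_groups(scores, max_ratio=1.5):
    """Return groups whose similarity score exceeds max_ratio times the median.

    scores maps a group label to its mean within-group similarity (0 to 1).
    max_ratio is an arbitrary, assumed threshold that an auditor would tune.
    """
    baseline = median(scores.values())
    if baseline <= 0:
        raise ValueError("median similarity must be positive")
    return sorted(g for g, s in scores.items() if s / baseline > max_ratio)


# Hypothetical per-group scores (illustrative only, not the study's figures).
scores = {"20s": 0.40, "30s": 0.35, "40s": 0.45, "50s": 0.50, "70s": 0.80}

print(flag_uniform_groups(scores))  # prints ['70s']
```

Comparing against the median rather than a fixed cut-off keeps the check relative to how the model treats other groups, which is the contrast the study highlights.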
The real long-term challenge is not detecting bias, but correcting it without creating new imbalances. Artificially forcing descriptive diversity can produce incoherent or simply unhelpful responses. The most solid path runs through improving the quality and diversity of training data, incorporating the perspectives of older people into evaluation teams and establishing audit metrics that detect precisely this kind of statistical uniformity before the model reaches production. Until that becomes standard, studies like KAIST's will remain necessary so that the problem does not stay invisible.