Sensitive medical data from the UK Biobank, a major medical research project holding records for 500,000 British volunteers, has been repeatedly exposed online due to inadequate security practices. A Guardian investigation reveals that researchers approved to access this data have inadvertently published files containing detailed patient information on public platforms like GitHub.
While these datasets lack direct identifiers such as names and addresses, the sheer volume of exposed records – including hospital diagnoses, dates of procedures, and demographic details – raises serious privacy concerns. The risk is amplified by advancements in AI and data aggregation, which make re-identification increasingly simple.
The Scale of the Problem
Between July and December 2025, UK Biobank issued 80 legal notices to GitHub requesting the removal of leaked data, yet significant portions remain accessible. One dataset alone contained diagnoses for over 413,000 participants, along with birthdates and sex. The Guardian tested the risk by giving a data scientist minimal personal information about a volunteer. Using only the volunteer’s birth month and year and the date of a surgery, the data scientist matched the volunteer’s medical history with near-certainty.
“It sent shivers down my spine to even open… It was very detailed and felt like a gross invasion of privacy even to glance at.” – A data expert reviewing the leaked files.
Biobank’s Response and Criticisms
UK Biobank defends its security, stating that no names or addresses were provided to researchers. CEO Prof Sir Rory Collins claims no re-identification has occurred. However, experts argue this stance is unrealistic, given the ease of cross-referencing data in the digital age.
“Are these people aware that the internet exists?” asked Prof Felix Ritchie, an economist at the University of the West of England. “The idea that they can rely on their volunteers never putting any other information out there about themselves is an entirely unreasonable thing to expect.”
Dr Luc Rocher, of the Oxford Internet Institute, points out that even partial data – such as birthdates and injury dates – can be sufficient to pinpoint individuals. Once identified, these records can reveal deeply sensitive information, including psychiatric diagnoses or HIV status.
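The kind of cross-referencing Rocher describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are synthetic and invented for the example, the field layout and the function names `matching_ids` and `unique_fraction` are hypothetical, and nothing here reflects UK Biobank's actual file format or the method used in the Guardian's test. It shows how each extra detail known about a person shrinks the set of candidate records.

```python
from collections import Counter

# Synthetic toy records, entirely made up for illustration:
# (record_id, birth month as "YYYY-MM", sex, procedure date)
RECORDS = [
    ("r1", "1956-03", "F", "2011-06-14"),
    ("r2", "1956-03", "F", "2013-01-09"),
    ("r3", "1956-03", "M", "2011-06-14"),
    ("r4", "1961-11", "M", "2011-06-14"),
    ("r5", "1961-11", "M", "2011-06-14"),
]


def matching_ids(records, birth_month=None, sex=None, procedure_date=None):
    """Return the IDs of records consistent with the details already known."""
    hits = []
    for rec_id, rec_birth, rec_sex, rec_date in records:
        if birth_month is not None and rec_birth != birth_month:
            continue
        if sex is not None and rec_sex != sex:
            continue
        if procedure_date is not None and rec_date != procedure_date:
            continue
        hits.append(rec_id)
    return hits


def unique_fraction(records):
    """Share of records whose (birth month, procedure date) pair occurs only once."""
    counts = Counter((birth, date) for _, birth, _, date in records)
    unique = sum(1 for _, birth, _, date in records if counts[(birth, date)] == 1)
    return unique / len(records)
```

In this toy set, knowing only the birth month "1956-03" leaves three candidate records. Adding the procedure date "2013-01-09" leaves exactly one. In a real dataset the outcome would depend on how many people share the same combination of details.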
The Tension Between Research and Privacy
The leaks stem from a policy that, until late 2024, allowed researchers to download data directly onto their own systems, combined with increasing pressure to publish research code. Some researchers accidentally included Biobank datasets in these uploads. Biobank has introduced further training but admits the problem persists.
The situation highlights an inherent conflict between driving research with large datasets and protecting individual privacy. Despite Biobank’s efforts to remove leaked data, copies remain archived online, which calls into question whether full control is achievable. The scale of the leaks, with dozens of takedown notices issued in six months alone, suggests systemic failures rather than isolated errors.
The leaks underscore the growing challenges of data security in the age of open science and AI-driven re-identification. While Biobank’s work remains valuable, these repeated exposures raise doubts about whether its current safeguards are sufficient to protect participant data.
