TLDR
Anonymization is the process of transforming personal data in such a way that individuals cannot be identified from the data, either directly or indirectly, while still allowing the data to be used for analysis.
What is Anonymization?
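Common anonymization techniques include data masking, generalization, suppression, and noise injection. Pseudonymization is often listed alongside them, but on its own it does not anonymize data. Properly anonymized data falls outside the scope of GDPR, which allows broader use for analytics, research, and machine learning. That exemption holds only while re-identification is not reasonably likely, so residual risk still has to be assessed.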
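As a rough illustration of three of these techniques, the Python sketch below masks an email address, generalizes an exact age into a band, and adds Laplace noise to a count. The function names (mask_email, generalize_age, noisy_count), the 10-year band width, and the epsilon parameter are assumptions made for this example, not part of any standard. The noise sampler is a teaching sketch and is not hardened for production differential privacy.

```python
import random

def mask_email(email: str) -> str:
    """Data masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Noise injection: a counting query has sensitivity 1, so scale = 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

print(mask_email("alice@example.com"))  # a***@example.com
print(generalize_age(34))               # 30-39
print(noisy_count(100, epsilon=1.0, rng=random.Random()))  # varies per run
```

None of these steps guarantees anonymity in isolation. A masked email or an age band can still single out a person when combined with other columns.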
Anonymization vs. Pseudonymization
The distinction is critical from a legal standpoint. Pseudonymization replaces direct identifiers (names, IDs) with tokens that can be re-linked to the original data using a separately held key or lookup table. Pseudonymized data therefore remains personal data under GDPR. True anonymization is irreversible: no one, including the data controller, can re-identify individuals, even with additional information. Regulatory guidance (EDPB, ICO) sets a high bar: pure aggregation or simple identifier removal rarely qualifies as anonymous if the result, combined with auxiliary data, could re-identify subjects.
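To make the re-linking point concrete, here is a minimal Python sketch of keyed pseudonymization. The Pseudonymizer class, its in-memory vault, and the 16-character token length are hypothetical choices for illustration. The sketch uses HMAC-SHA256 from the standard library; a real deployment would store the key and mapping in a separately secured system.

```python
import hashlib
import hmac

class Pseudonymizer:
    """Replace identifiers with keyed tokens while keeping a re-linking table."""

    def __init__(self, secret_key: bytes) -> None:
        self._key = secret_key
        self._vault = {}  # token -> original identifier (the "separate key")

    def tokenize(self, identifier: str) -> str:
        digest = hmac.new(self._key, identifier.encode("utf-8"), hashlib.sha256)
        token = digest.hexdigest()[:16]  # truncated for readability only
        self._vault[token] = identifier
        return token

    def reidentify(self, token: str) -> str:
        """Anyone holding the vault can reverse the mapping."""
        return self._vault[token]

p = Pseudonymizer(secret_key=b"example-key-not-for-production")
token = p.tokenize("alice@example.com")
print(p.reidentify(token))  # alice@example.com
```

Because reidentify succeeds for whoever holds the vault, the tokenized dataset is still linkable to a person, which is why it stays within the scope of personal data.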
Re-identification Risk
Multiple studies have shown that individuals in high-dimensional datasets (location traces, browsing histories, genomic data) can be re-identified even after typical anonymization steps, usually by linking the released records to auxiliary data. The Netflix Prize and AOL search-log disclosures are canonical cautionary cases. Modern best practice combines technical safeguards (differential privacy, k-anonymity with high k, suppression of rare attributes) with governance controls (access restrictions, audit logging) to manage residual risk.
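One way to picture k-anonymity and suppression is the Python sketch below. It measures k as the size of the smallest group of records sharing the same quasi-identifier values, then drops groups smaller than a chosen threshold. The function names, the column names, and the four toy records are invented for this example.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest quasi-identifier group (0 if empty)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

def suppress_rare(records, quasi_identifiers, k):
    """Suppression: drop records whose quasi-identifier group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] >= k]

# Made-up toy records; age_band and zip3 act as quasi-identifiers.
records = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "diabetes"},
]
qi = ["age_band", "zip3"]

print(k_anonymity(records, qi))       # 1: the 40-49/100 record is unique
released = suppress_rare(records, qi, k=3)
print(k_anonymity(released, qi))      # 3
```

A high k limits linkage on the chosen quasi-identifiers only. It does not protect against attributes left out of that list, which is one reason the governance controls mentioned above still matter.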