The European Union’s Artificial Intelligence Act (EU AI Act) establishes a framework to regulate AI, particularly for “high-risk” systems — those that could impact health, safety, or fundamental rights. One element of this framework is Article 10, which focuses on data and data governance. This article mandates strict standards for the datasets used in training, validating, and testing high-risk AI systems to prevent issues like bias, errors, or discrimination.
If you’re an AI provider, or just curious about AI regulation on data and data governance, understanding Article 10 is important. In this post, I will conceptualize the data and data governance requirements as outlined in the Act. We’ll explore what data governance means, its key elements, and why it matters for compliance.
What is Data Governance in the Context of AI?
Data governance refers to the set of practices, policies, and processes that ensure data is handled accurately, responsibly, and in line with ethical and legal standards. For high-risk AI systems, poor data practices can lead to amplified biases or unreliable outcomes, which is why the AI Act emphasizes governance to mitigate risks and ensure systems perform as intended.
Think of data governance as a conceptual framework:
- It covers everything from how data is collected and prepared to how biases are detected and corrected.
- The goal? To make AI systems not just functional, but also fair and compliant with regulations like the General Data Protection Regulation (GDPR) and others.
- In Article 10, this governance applies specifically to training, validation, and testing datasets, ensuring they’re suitable for the AI’s purpose and free from flaws that could harm users.
The Five Pillars of Data Governance
Article 10 is structured around five main paragraphs (as conceptualized in this post and seen on the figure below), each building on the last to create a robust data management ecosystem. They apply to datasets for high-risk AI systems, with some exceptions for non-training-based systems. Let’s dive into each one.
1. Data Governance and Management Practices (Article 10(2))
Datasets must undergo appropriate governance and management practices tailored to the AI system’s intended purpose. It’s not a one-size-fits-all approach; practices should reflect the system’s design and real-world application.
Key elements include:
- Design Choices: Strategic decisions during development to align the AI with its goals. This involves selecting technical, procedural, and organizational elements, incorporating stakeholder input, and adhering to data principles like minimization, adequacy, necessity, and proportionality. Regular reviews ensure the system stays on track throughout its lifecycle.
- Data Collection Processes: Document the origins of data, how it was gathered, and (for personal data) its original purpose. Transparency here prevents misuse and builds trust.
- Data Preparation Operations: Handle tasks like annotation, labeling, cleaning, updating, enrichment, and aggregation to maintain high quality.
- Formulation of Assumptions: Clearly define what the data represents and measures — avoid vague interpretations that could lead to errors.
- Assessment of Data Suitability: Evaluate if datasets are available, sufficient in quantity, and fit for purpose.
- Bias Examination: Scrutinize data for biases that could affect health, safety, fundamental rights, or cause discrimination, especially in feedback loops where outputs influence future inputs.
- Bias Mitigation: Implement measures to detect, prevent, and correct biases.
- Addressing Data Gaps and Shortcomings: Identify and fix any deficiencies that might hinder compliance with the AI Act.
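In practice, many of these elements boil down to keeping a structured, auditable record per dataset. As a minimal sketch — the class, field names, and readiness rule below are my own illustrative invention, not anything prescribed by the Act — such a record might look like this in Python:

```python
from dataclasses import dataclass, field

@dataclass
class GovernanceRecord:
    """Per-dataset record of Article 10(2)-style elements; all names are illustrative."""
    dataset_name: str
    intended_purpose: str                    # design choices: what the system is for
    collection_origin: str                   # provenance and collection process
    preparation_steps: list = field(default_factory=list)  # labeling, cleaning, enrichment...
    assumptions: list = field(default_factory=list)        # what the data represents and measures
    biases_found: list = field(default_factory=list)       # results of the bias examination
    mitigations: list = field(default_factory=list)        # bias mitigation measures taken
    open_gaps: list = field(default_factory=list)          # shortcomings not yet addressed

    def review_ready(self) -> bool:
        """Minimal gate: every identified bias has a mitigation and no gaps remain open."""
        return len(self.mitigations) >= len(self.biases_found) and not self.open_gaps

record = GovernanceRecord(
    dataset_name="loan_applications_2024",
    intended_purpose="credit-risk scoring",
    collection_origin="internal CRM export, collected with consent for credit assessment",
    biases_found=["age skew toward applicants aged 30-50"],
    mitigations=["re-weighted samples outside the 30-50 band"],
)
print(record.review_ready())  # True: one bias, one mitigation, no open gaps
```

Keeping such records per dataset also makes the “regular reviews” element concrete: a reviewer can query which datasets still fail the readiness gate.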
2. Dataset Characteristics (Article 10(3))
Once governance practices are in place, the datasets themselves must meet quality benchmarks. They need to be:
- Relevant and Sufficiently Representative: Mirror the real-world scenarios where the AI will be deployed, capturing diverse populations or contexts to avoid skewed results.
- Free of Errors and Complete: To the greatest extent possible, minimize inaccuracies, duplicates, or missing values that could distort AI performance.
- Statistically Appropriate: Ensure the data’s statistical properties align with the target population or group the AI serves, promoting reliability and generalizability.
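Two of these benchmarks — completeness and representativeness — lend themselves to simple automated checks. Here is a minimal sketch; the field names, target shares, and 5% tolerance are illustrative assumptions, not thresholds from the Act:

```python
def dataset_quality_report(rows, group_key, target_shares, tolerance=0.05):
    """Flag incomplete records and groups whose share deviates from the target
    population by more than `tolerance`. Thresholds and names are illustrative."""
    n = len(rows)
    missing = sum(1 for r in rows if any(v is None for v in r.values()))
    counts = {}
    for r in rows:
        counts[r[group_key]] = counts.get(r[group_key], 0) + 1
    skewed = {
        group: round(counts.get(group, 0) / n, 2)
        for group, target in target_shares.items()
        if abs(counts.get(group, 0) / n - target) > tolerance
    }
    return {"missing_rate": missing / n, "skewed_groups": skewed}

rows = [
    {"region": "urban", "income": 42_000},
    {"region": "urban", "income": 55_000},
    {"region": "rural", "income": None},    # incomplete record
    {"region": "urban", "income": 38_000},
]
# Target population is an even urban/rural split; this sample is 75/25.
print(dataset_quality_report(rows, "region", {"urban": 0.5, "rural": 0.5}))
# {'missing_rate': 0.25, 'skewed_groups': {'urban': 0.75, 'rural': 0.25}}
```

A real pipeline would go further (duplicate detection, statistical tests against the target distribution), but even a check this simple surfaces the kind of skew Article 10(3) is aimed at.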
3. Contextual Considerations (Article 10(4))
Data doesn’t exist in a vacuum. This paragraph requires datasets to be customized to the AI’s specific geographical, behavioral, functional, or contextual settings. Why? To ensure the AI operates effectively, fairly, and safely in its intended environment.
Benefits and rationale:
- Promotes Fairness and Non-Discrimination: Representative data reduces biases that could disadvantage certain groups.
- Enhances Accuracy and Integrity: Tailored data improves completeness and reliability.
- Aligns with Legal Standards: Complies with GDPR principles like data minimization and purpose limitation.
- Reduces Risks: Matches data to operational contexts, avoiding mismatches that could lead to failures (e.g., historical issues like inaccuracies in Google’s Gemini AI).
- Compliance Workflow: Providers must assess the AI’s purpose, curate relevant data, balance fairness with accuracy, document decisions, and conduct regular evaluations for ongoing bias mitigation.
4. Processing Special Categories of Personal Data (Article 10(5))
Special categories of personal data — think health records, biometric info, or racial/ethnic details — are highly sensitive. Providers can only process them exceptionally, and only for bias detection and correction when absolutely necessary (and when alternatives like synthetic or anonymized data won’t suffice).
Strict conditions must all be met:
- No viable alternative data exists for the task.
- Technical limitations on reuse, combined with top-tier security and privacy-preserving measures.
- Effective access controls, full documentation, and confidentiality obligations.
- Data must not be transferred or accessed by third parties.
- Delete the data once the bias is fixed or the retention period ends (whichever comes first).
- Processing records must explain why special data was essential and why other options weren’t feasible.
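The “whichever comes first” deletion condition is essentially a two-input rule, which a retention job could encode directly. A minimal sketch (the function name and dates are illustrative, and a real system would of course also log the deletion):

```python
from datetime import date

def must_delete(bias_corrected: bool, retention_end: date, today: date) -> bool:
    """Whichever-comes-first rule from the conditions above: delete the
    special-category data once the bias is corrected OR retention has ended."""
    return bias_corrected or today >= retention_end

# Bias already corrected, retention still running -> delete now.
print(must_delete(True, date(2026, 12, 31), date(2026, 1, 15)))   # True
# Bias not yet corrected and retention still running -> keep, under the safeguards above.
print(must_delete(False, date(2026, 12, 31), date(2026, 1, 15)))  # False
```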
These safeguards, layered on top of GDPR and related directives, protect fundamental rights while allowing limited use for critical improvements.
5. Testing Datasets for Non-Training Systems (Article 10(6))
Not all high-risk AI systems rely on machine learning models that “train” on data. For those that don’t, the full governance requirements (Paragraphs 2–5) apply only to testing datasets. This streamlines compliance without skimping on quality for evaluation phases.
Why Does This Matter? The Bigger Picture
Article 10 isn’t just regulatory fine print; it’s a blueprint for building trustworthy AI. By enforcing rigorous data governance, the EU AI Act helps prevent AI from perpetuating inequalities or causing unintended harm. For providers, compliance means investing in robust processes — but the payoff is AI that’s more innovative, trustworthy, and market-ready.
If you’re building AI, start auditing your data practices against these pillars. As AI integrates deeper into society, remember: Great AI starts with great data governance.
What challenges have you faced with data in AI projects? Share in the comments — I’d love to hear your thoughts!
EU AI Act: Understanding Data and Data Governance in Article 10 was originally published in Coinmonks on Medium, where people are continuing the conversation by highlighting and responding to this story.