Cybersecurity Report Highlights Best Practices for Managing AI Training Data
June 10, 2025
Cybersecurity agencies in the US, UK, Australia, and New Zealand recently collaborated on a new Cybersecurity Information Sheet (CIS). The CIS discusses cybersecurity risks and best practices for managing data sets used to train AI. Here are five of the best practices the report identifies, as summarized in a recent A&O Shearman memo:
- “Source reliable data: use trusted, authoritative data sources and implement tracking systems to identify the origin of any data being used. Cryptographic tools can be used to check for corruption of data sets.
- Use trusted computing systems: use a computing system that does not automatically trust any user or device (even if they come from within a known network).
- Data encryption: use sophisticated encryption protocols for data (including when data is at rest, being transported and during processing). AES-256 encryption is the industry standard as it is considered to be quantum-resistant.
- Preserve privacy: use data anonymisation and depersonalisation techniques to protect sensitive information, such as data masking, which involves replacing sensitive data with other (realistic) information, or federated learning, which allows AI systems to be trained over several datasets.
- Carry out data security risk assessments: Data security should be regularly assessed using NIST frameworks.”
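The first practice's mention of cryptographic tools for detecting data-set corruption can be illustrated with a simple integrity check: record a SHA-256 digest when a dataset is acquired, then re-verify it before training. This is our own minimal sketch (the function names are illustrative and do not come from the CIS):

```python
import hashlib

def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dataset(path: str, expected_digest: str) -> bool:
    """True if the file still matches the digest recorded at acquisition time.

    A mismatch signals silent corruption or tampering since the data
    was sourced.
    """
    return sha256_digest(path) == expected_digest
```

In practice the expected digest would be published by the data provider or stored in a tamper-evident provenance log alongside the tracking metadata the report recommends.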
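The privacy bullet names data masking as one technique. One common approach, sketched below, is deterministic pseudonymisation: each sensitive value is replaced with a realistic-looking stand-in derived from a hash, so records can still be joined across tables without exposing the original value. The helper names and token format here are our illustration, not the report's, and a production system would use a keyed hash rather than a bare SHA-256:

```python
import hashlib

def mask_email(email: str, domain: str = "example.com") -> str:
    # Deterministic pseudonym: the same input always maps to the same
    # stand-in, so joins still work, but the real address is not stored.
    token = hashlib.sha256(email.strip().lower().encode()).hexdigest()[:10]
    return f"user_{token}@{domain}"

def mask_records(records: list[dict], field: str = "email") -> list[dict]:
    """Return copies of the records with the sensitive field replaced."""
    return [{**r, field: mask_email(r[field])} for r in records]
```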
In addition to identifying best practices for data management, the CIS provides mitigation strategies for emerging risks in three key areas:
- Data supply chain – Threats associated with data sourcing and data alterations made before a developer acquires the data. These include risks from large-scale data collection and data collected and curated by third parties.
- Maliciously modified data – Deliberately manipulated data fed into AI systems to degrade their accuracy and security.
- Data drift – The statistical properties of input data change gradually, diverging from the original training dataset. Data drift arises in many ordinary and predictable ways and makes AI systems less accurate over time.
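Because data drift is gradual and predictable, it can be monitored automatically. A minimal sketch, assuming a numeric feature with a recorded baseline: flag drift when the incoming data's mean shifts by more than a chosen number of baseline standard deviations. Real monitoring would use distributional tests (e.g. population stability index or Kolmogorov–Smirnov); the function names and threshold here are our illustration:

```python
import statistics

def drift_score(baseline: list[float], current: list[float]) -> float:
    """Shift of the current mean, measured in baseline standard deviations."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.fmean(current) - mu) / sigma

def has_drifted(baseline: list[float], current: list[float],
                threshold: float = 2.0) -> bool:
    """Flag a batch whose mean has moved more than `threshold` sigmas."""
    return drift_score(baseline, current) > threshold
```

A check like this, run on each incoming batch, turns the report's observation that drift degrades accuracy over time into an actionable alert.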