Free Text and Data Protection

Collections of free text – whether in database fields, documents or email archives – present a challenge both for operations and under data protection law. They may contain personal data, but that data is hard to find, whether you are trying to use it, to ensure compliance with the data protection principles, or to allow data subjects to exercise their legal rights. Some level of risk is unavoidable in these collections, but there are ways to reduce it.

Provide structured fields wherever possible. If you know that a helpdesk ticket will contain the requester’s name and e-mail address, ensure those fields exist in the database. This makes the information much easier to find for operational purposes, and much easier to cover with appropriate deletion/anonymisation policies.
Set policies for using those structured fields, for when and how personal data may be entered into unstructured fields, and which personal data (e.g. sensitive) should never be entered there. Some of the data may be entered by people not under your control (e.g. if someone describes their health problems in a website comment field), but at least those who are under your control should know how to do the right thing. Knowing the source of unstructured data and the ways in which it is collected should also help in the subsequent assessment of how great a risk it represents.
Set appropriate retention periods for both structured and unstructured fields. With structured fields it should be relatively easy to define when personal data are no longer needed for the purpose and the content of a single field can be deleted or over-written. For unstructured fields this is harder, since both the utility and risk of long retention are unknown. Deciding on an appropriate period to retain unstructured information is likely to involve balancing the benefit and the risk, taking account of the uncertainty of both.
For high-risk situations or activities, it may be worth using either humans or computers to scan unstructured data for personal content. The choice involves a trade-off: humans are likely to be more accurate but also more expensive, whereas a computer may spot a name and redact it without realising that it was critical to the meaning and purpose of the record (though in that case it should, perhaps, have been in a structured field anyway).
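
As a rough illustration of the retention and scanning points above, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than a recommendation: the Ticket record, the function names, the 90-day and 365-day periods, and the deliberately simple e-mail pattern, which will miss some addresses and cannot recognise names or other kinds of personal data.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Illustrative pattern only: real e-mail addresses are more varied than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed retention periods; an organisation would set its own.
FREE_TEXT_RETENTION = timedelta(days=90)
STRUCTURED_RETENTION = timedelta(days=365)


@dataclass
class Ticket:
    """Hypothetical helpdesk ticket with structured and unstructured fields."""
    created: datetime
    requester_name: Optional[str]
    requester_email: Optional[str]
    description: Optional[str]  # unstructured free text


def redact_emails(text: str) -> Tuple[str, int]:
    """Replace e-mail addresses in free text.

    Returns the redacted text and the number of addresses replaced.
    """
    return EMAIL_RE.subn("[REDACTED EMAIL]", text)


def apply_retention(ticket: Ticket, now: datetime) -> Ticket:
    """Clear fields whose (assumed) retention period has passed."""
    age = now - ticket.created
    if age > FREE_TEXT_RETENTION:
        # Free text is dropped first: its risk and utility are both unknown.
        ticket.description = None
    if age > STRUCTURED_RETENTION:
        ticket.requester_name = None
        ticket.requester_email = None
    return ticket
```

The structured fields can be cleared individually and on their own schedule, while the free-text field can only be kept or dropped as a whole unless something like redact_emails is run over it first. The match count it returns could feed into the assessment of how much personal data a collection holds.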

Databases and other collections should, of course, also be secured using technical means. Where appropriate to the purpose, access controls can ensure that only authorised users can see the content, while encrypting that content at rest and in transmission can protect against those with physical access.

Finally, the organisation should assess the remaining risk – it is very unlikely to be possible to eliminate it – and ensure that this is justified by the benefits of storing and processing the data. The General Data Protection Regulation’s requirement to demonstrate accountability for processing of personal data probably means this assessment (and particularly the reasons why possible risk-reduction options were not taken) should be documented, at least for large collections of information.

By Andrew Cormack
