Processing personal data of users while respecting privacy protection laws

When a project collects personally identifiable information (PII) about its users, complying with the applicable privacy protection laws, such as GDPR or CCPA, is a legal obligation. However, compliance does not have to mean that teams and business units without a valid reason to access PII are blocked from using the data altogether.

Here are a few techniques that I have used or come across for staying legally compliant without giving up on innovating with the data.

Hash to anonymize the data

Hashing is a one-way function that converts a given string into another string of fixed length. If you have worked on B2C packaged products, or any other digital package whose integrity needs to be verified, you will have come across algorithms like SHA, MD5 and BLAKE.

We can employ the same algorithms to hash the PII fields in the data, so that anyone looking at the data can tell that two records refer to the same user, the same address or the same email, but cannot tell who the user is, what city they are from or what their address is.
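As a rough illustration of the idea, here is a minimal Python sketch. It uses a keyed hash (HMAC-SHA256 from the standard library) rather than a bare hash, because a bare hash of a guessable value such as an email address can be matched by hashing candidate values and comparing. The names `SECRET_KEY`, `PII_FIELDS`, `pseudonymize` and `mask_record` are my own assumptions for this example, and lowercasing and trimming the value before hashing is one possible normalization, not a requirement.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager,
# not from source code.
SECRET_KEY = b"replace-with-a-secret-from-your-vault"

# Hypothetical list of the PII columns in this example dataset.
PII_FIELDS = ("email", "address")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a normalized PII value."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Copy a record, replacing each PII field with its digest."""
    return {
        name: pseudonymize(value) if name in PII_FIELDS else value
        for name, value in record.items()
    }


orders = [
    {"email": "Jane@Example.com", "address": "1 Main St", "amount": 30},
    {"email": "jane@example.com ", "address": "1 Main St", "amount": 12},
]
masked = [mask_record(order) for order in orders]
```

Both masked orders end up with the same `email` digest, so they can still be grouped per user, while the non-PII `amount` column is left untouched.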

Masking the data with hashes keeps it readily usable for purposes like training ML models and analytics.

You can then ensure that only the deployed ML model has access to the actual PII, and not the teams who develop it. Hashing can be done either on read or on write, depending on the use case and on how performant the chosen hashing algorithm is.

Encryption

Encryption is another technique you can employ to store PII and give controlled access to it only to the entities that request it. Unlike hashing, encryption is two-way: encrypted data can be decrypted to reveal the actual data.

The whole process works much the same as the hashing-based technique, except that the encryption keys for each user or each dataset are stored securely. Any team or person that needs the actual PII is granted permission to decrypt the data with the key, without the key itself being exposed to them.

Nowadays, most cloud service providers offer column-level encryption in their data warehouses, and access to the encryption keys can be managed using IAM roles.

Encryption is mostly applied on write, and the data is decrypted, if need be, on read. If a user requests deletion of their data, you would have to destroy that user's encryption key.
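To make the per-user key idea concrete, here is a small Python sketch. It assumes the third-party `cryptography` package is installed and uses its Fernet recipe for the encryption itself. The `UserKeyVault` class is a hypothetical in-memory stand-in for a managed key service; a real setup would keep the keys in a KMS and gate the `decrypt` call behind IAM.

```python
from cryptography.fernet import Fernet


class UserKeyVault:
    """Hypothetical in-memory stand-in for a managed key service."""

    def __init__(self) -> None:
        # Maps user_id -> that user's Fernet key (bytes).
        self._keys = {}

    def encrypt(self, user_id: str, plaintext: str) -> bytes:
        # Create the user's key on first write; callers never see the key.
        if user_id not in self._keys:
            self._keys[user_id] = Fernet.generate_key()
        return Fernet(self._keys[user_id]).encrypt(plaintext.encode("utf-8"))

    def decrypt(self, user_id: str, token: bytes) -> str:
        key = self._keys.get(user_id)
        if key is None:
            raise KeyError(f"no key for {user_id}; the data is unrecoverable")
        return Fernet(key).decrypt(token).decode("utf-8")

    def forget_user(self, user_id: str) -> None:
        # Destroying the key leaves every ciphertext of this user unreadable.
        self._keys.pop(user_id, None)


vault = UserKeyVault()
stored_email = vault.encrypt("user-42", "jane@example.com")  # on write
readable = vault.decrypt("user-42", stored_email)            # on read
vault.forget_user("user-42")                                 # deletion request
```

After `forget_user`, the ciphertext can stay where it is in the warehouse or in backups, but any further `decrypt` call for that user fails because the key no longer exists.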

Data split

If your use case does not need PII but requires all the other data surrounding it, the simplest technique is to split the data during processing and write each split to a different target protected by role-based access controls.

Data deletion is easier with this approach, as you can simply delete whole rows or partitions without having to process them.

However, this technique adds an additional maintenance overhead and might make processes like replaying the data more complex than required.
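A minimal sketch of such a split in Python could look like the following. The two in-memory stores stand in for separately protected targets, and `PII_FIELDS`, `split_record`, `ingest` and `delete_pii` are hypothetical names chosen for this example. A generated `record_id` is written to both sides so that authorized readers can still join them.

```python
import uuid

# Hypothetical set of PII columns for this example.
PII_FIELDS = {"name", "email", "address"}

pii_store = {}     # stand-in for the restricted target, keyed by record_id
events_store = []  # stand-in for the broadly readable target


def split_record(record: dict) -> tuple:
    """Split one record into a PII part and a non-PII part sharing an id."""
    record_id = str(uuid.uuid4())
    pii = {"record_id": record_id}
    non_pii = {"record_id": record_id}
    for name, value in record.items():
        target = pii if name in PII_FIELDS else non_pii
        target[name] = value
    return pii, non_pii


def ingest(record: dict) -> str:
    """Write each split to its own target and return the shared id."""
    pii, non_pii = split_record(record)
    pii_store[pii["record_id"]] = pii
    events_store.append(non_pii)
    return pii["record_id"]


def delete_pii(record_id: str) -> None:
    """Drop the whole PII row; the non-PII side is not touched."""
    pii_store.pop(record_id, None)


record_id = ingest(
    {"name": "Jane", "email": "jane@example.com", "plan": "pro", "amount": 30}
)
```

Calling `delete_pii(record_id)` removes the PII row as a whole, while the entry in `events_store` remains available for analytics.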

Column level access control

This is a very straightforward approach that leverages the power of your tools to control access to the data. Data is written to the target with no additional processing or logic; instead, you apply access control at the column level in the target layer to keep unauthorized users from accessing the data.

It is similar to the approach described under encryption, except that it assumes you store data in a tabular format at your target.
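In practice the data warehouse itself enforces these rules, so no application code is needed. Purely to illustrate what the check amounts to, here is a small Python sketch; the role names, the `COLUMN_GRANTS` mapping and the `read_rows` function are all made up for this example and do not correspond to any particular product's API.

```python
# Hypothetical grants: which columns each role may read.
COLUMN_GRANTS = {
    "analyst": {"order_id", "amount", "country"},
    "support": {"order_id", "email", "amount", "country"},
}


def read_rows(rows: list, role: str, columns: list) -> list:
    """Return the requested columns, or refuse if any is not granted."""
    allowed = COLUMN_GRANTS.get(role, set())
    denied = [column for column in columns if column not in allowed]
    if denied:
        raise PermissionError(f"role {role!r} may not read columns: {denied}")
    return [{column: row[column] for column in columns} for row in rows]


table = [
    {"order_id": 1, "email": "jane@example.com", "amount": 30, "country": "DE"},
]

analyst_view = read_rows(table, "analyst", ["order_id", "amount"])
support_view = read_rows(table, "support", ["order_id", "email"])
```

Here a request from the `analyst` role for the `email` column raises `PermissionError`, while the same table serves the non-PII columns to everyone with a grant.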

These are some of the ways in which you can manage access to personal data, and process it, on your data platform.