From the course: Privacy, Governance, and Compliance: Data Sharing
Techniques to minimize privacy risk
- [Instructor] So far, we have seen the scale of data sharing and the privacy risks it creates. The good news is that there are several best practices and techniques available to help make data sharing more privacy-focused. When it comes to data collection, I have built architectures around a key principle: the more precisely identifiable the data, the shorter the retention period should be. I will repeat that: the more precisely identifiable the data, the shorter the retention period should be. We saw this exact principle in action in the data classification course as well; precision and retention should have an inverse correlation. This theme applies to data sharing as well. So now let's look at some best practices for implementing this principle. As an app designer, you should ask vendors and partners to document their retention and deletion policies for each type of data being collected and shared. And on this slide, I have some specific recommendations on how to obfuscate the data before sharing. We will look at some of these concepts in more detail in upcoming slides, but here is the key takeaway once again: the more precisely the data can identify someone, the shorter its retention period should be. Just as there is an inverse correlation between data precision and retention, there should be a similar inverse relationship between precision and availability. When you share data with a partner, you should insist that they anonymize data in memory, especially if the data you share with them is very granular. Some techniques include not persisting data used solely for aggregation purposes, keeping individual-level data in memory only, and persisting only processed, aggregated data to disk. This means that precise data is short-lived and therefore less accessible, while more aggregated data is available to more people, since it is on disk, where you can also manage access much more proactively and effectively. Here are some more best practices. 
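To make the inverse correlation between precision and retention concrete, here is a minimal sketch of a retention schedule keyed by precision tier. The tier names and periods are illustrative assumptions, not prescriptions from the course:

```python
from datetime import timedelta

# Hypothetical retention schedule: the more precisely identifiable the
# data, the shorter its retention period. Tier names and durations are
# examples only; your own classification scheme will differ.
RETENTION_BY_PRECISION = {
    "precise_identifier": timedelta(days=30),   # e.g., passport number, exact GPS point
    "pseudonymous": timedelta(days=180),        # e.g., hashed internal ID
    "aggregated": timedelta(days=365 * 3),      # e.g., daily counts per region
}

def retention_for(precision_tier: str) -> timedelta:
    """Return the maximum retention period for a given precision tier."""
    return RETENTION_BY_PRECISION[precision_tier]
```

The point is simply that the schedule is monotonic: each step down in identifiability allows a longer retention window.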
In order to prevent personal identification, you should remove or replace any identifiers that uniquely identify someone. You will want to do this before sharing the data, or have the vendor do it as soon as they receive the data, and only then complete the mapping at their end. On this slide, there are also two additional bullet points that show you how to do exactly that. Key tip: you may also have some use cases where you actually want to re-identify somebody whose data you have shared. For those use cases, you may want to create another table that links their external identifiers, like passport numbers, for example, to internal identifiers that are customized for your company alone. However, you should carefully manage access to this linking table to prevent any potential privacy issues. Let's look at an example on the next slide. In this slide, Table A, on the far left, is the internal table with your user data linked to their passport numbers. As you know, a passport number uniquely identifies someone without any doubt as to who they are, which is why, when you share this user's data externally, you will want to create an ID specific to that vendor or partner. You will also want to create a new table with the data you wish to share, but linked to this custom internal ID, not the passport number. That way, if this data were ever to leak, it would be much harder to associate it back to that specific user and their passport number. This is the table on the far right, AKA Table C. And the table in the middle, Table B, is the one that links passport numbers to the custom IDs. This is the table that will require tight, audited access, since it is the bridge between the full data and the anonymized data. That way, you can share anonymously and identify internally. Real-world scenarios will probably have a lot more complexity, but this example gives you a logical foundation to build upon. 
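The Table A / B / C pattern above can be sketched in a few lines of code. This is a simplified illustration under assumed data (the passport numbers, names, and field names are made up); in practice the linking table would live in a tightly access-controlled store, not in application memory:

```python
import secrets

# Table A: internal user data keyed by passport number (never shared).
table_a = {
    "P1234567": {"name": "Alice", "trips": 12},
    "P7654321": {"name": "Bob", "trips": 4},
}

# Table B: the linking table (passport number -> custom ID).
# This is the bridge between full and anonymized data; restrict access.
table_b = {}

def build_shareable_table(internal_table: dict) -> dict:
    """Build Table C: shareable records keyed by a custom ID only."""
    table_c = {}
    for passport_no, record in internal_table.items():
        # Reuse an existing custom ID, or mint a random one.
        custom_id = table_b.setdefault(passport_no, secrets.token_hex(8))
        # Share only non-identifying fields under the custom ID.
        table_c[custom_id] = {"trips": record["trips"]}
    return table_c

shared = build_shareable_table(table_a)
```

If the shared table (Table C) ever leaks, it contains no passport numbers or names; only someone with access to Table B can walk the data back to a real person.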
But do we need a custom ID for each vendor, or can we just create one internal ID for all external sharing? This debate is a privacy version of soda versus pop, or of what is more harmful in your diet, fats or sugar. There is no clear consensus on this issue. I've been part of these debates for a long time, so let me give you the trade-offs and let you decide what works best in your case. Choice 1: you could create one internal ID and hash it; this is indicated by the entries on the left-hand side. The goal of hashing is to prevent IDs that are tied to you from appearing in the wild. Hashing could help ensure that each vendor gets an individual ID, so that vendors cannot connect the dots amongst each other, and if there were ever a breach, you could identify the vendor that was impacted. The other upside is that any salted hashing algorithm may help achieve this goal, although your security teams may have more specific recommendations. The downside is that you will need to decide where to store the salt, and determine whether the hashing algorithm is susceptible to brute force or other attacks. Choice 2: you could create an internal ID per vendor. This is indicated by the list on the right-hand side. This approach may negate the need to hash the ID and the complexity attached to hashing. Your security team, again, may have other ideas. The downside of this approach is that you may end up with too many IDs and too many mapping tables, and as a result, too much complexity and too little privacy. The reason is that if anybody can start sharing data by creating a new internal ID, you may see promiscuous data sharing with no central ability to regulate it. On your screen right now are even more techniques you can use when it comes to sharing location and time data. I will let you read these techniques and apply them at your own convenience. Let me explain how the measures we just looked at could help protect privacy. 
Let's assume you have two trips: Trip A starts at 12:22 PM and ends at 1:09 PM; Trip B starts at 12:24 PM and ends at 1:11 PM. If you need to share this data with a third party for analysis purposes, it could pose a privacy risk: since there are just two trips, someone could combine the start and end times with other public data to identify who took which trip. However, with the techniques we saw in the last slide, the two trips could be shared as follows: Trip A starts at 12:30 PM and ends at 1:00 PM; Trip B starts at 12:30 PM and ends at 1:00 PM. This makes the people on the trips less identifiable without hurting the aggregate data analysis. Now, this example may be a bit simplistic, but the key message remains the same: avoid blindly sharing data that individually identifies the people it describes. When I look at data sharing protocols, I make sure that once I share data with a partner, they manage who on their side can access that data. Here are some techniques I've used. I ask that the partner be judicious with the availability of their APIs to those who want access to the data. My teams have often implemented tools to check, on an ongoing basis, whether engineers and scientists who have access to sensitive encrypted data still actually need that access. We routinely sample the data and check when it was last decrypted. Often, we found that teams had requested access to the data but hardly, or never, used the keys to access it. In those cases, we swapped the keys to see if the engineers ever complained. 75% of the time, we never heard back and nobody complained. What this means is that people often think they will need more access to data than they actually do, and even if they don't use the data, their ability to access it poses a privacy risk. Remember, credentials often get lost. You can use all of these techniques and instincts internally as well as externally when you share data. 
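The trip example above amounts to generalizing timestamps before sharing. One way to sketch it, assuming rounding to the nearest 30-minute boundary (the bucket size is a parameter you would tune to your own data):

```python
from datetime import datetime, timedelta

def round_to_bucket(ts: datetime, bucket_minutes: int = 30) -> datetime:
    """Round a timestamp to the nearest bucket boundary before sharing.

    12:22 PM -> 12:30 PM, 1:09 PM -> 1:00 PM, and so on. Coarser
    timestamps make individual trips harder to single out while
    preserving aggregate-level analysis.
    """
    bucket_seconds = bucket_minutes * 60
    elapsed = (ts - datetime.min).total_seconds()
    rounded = round(elapsed / bucket_seconds) * bucket_seconds
    return datetime.min + timedelta(seconds=rounded)
```

Applied to the two trips, both start times round to 12:30 PM and both end times to 1:00 PM, which is exactly the shared form shown above.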
I've shaped privacy principles for many, many organizations as an employee and a consultant, but these two are the closest to my heart. During my early days as a privacy engineer, I followed President Reagan's famous "Trust, but verify" principle. When it comes to data sharing in this world, however, the principle I follow is one cited by Intel cofounder Andy Grove: "Only the paranoid survive."