What is the best way to generate synthetic data while maintaining privacy?


For a project on which we are working as third-party contractors, we need a way for the company to share some datasets that can be used for data science. It is not possible for the company to share the real data, as that would raise privacy issues.

We are exploring ways for the company either to share the data while preserving privacy, or to generate fake data that matches the statistics/demographics of the actual data.

We are currently looking at a couple of options:

  • Using differential privacy to add noise to the data and then sharing the transformed data with us. Can this approach lead to any privacy issues? I am concerned about reverse engineering. Does a "privacy budget" apply here, and how should it be handled?
  • Using an encoder-decoder neural network to learn vector embeddings of the real data. Once the embeddings are learned, the decoder can be destroyed and the encoder's outputs can be shared with us.
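To make the first option concrete, here is a minimal sketch of the Laplace mechanism, the standard way to add differentially private noise to a numeric aggregate. The function name and toy data are my own, not from any specific library:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy version of a numeric aggregate under epsilon-DP."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = sensitivity / epsilon  # noise scale grows as epsilon shrinks
    return true_value + rng.laplace(0.0, scale)

# Example: privately release the count of users over 40.
ages = np.array([23, 45, 31, 52, 60, 38, 41])
true_count = int((ages > 40).sum())  # adding/removing one person changes this by at most 1
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

The `sensitivity` argument is the most one individual's record can change the true value (1 for a count); smaller `epsilon` means stronger privacy but noisier releases.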

Is there any other approach that can be used to generate synthetic data that resembles the actual data in terms of demography and statistics? Or else, what would be the best way to access the real data without violating privacy?

Vivek Maskara

Posted 2020-07-17T18:38:34.417

Reputation: 121



We solved this problem by using NER (named-entity recognition). Using spaCy or similar alternatives, entities can be detected and replaced with xxx. This way, identification of company names, currencies, etc. becomes difficult or impossible.

After this, synthetic data generation techniques such as multiplication, paraphrasing, or NLG can be applied.

Sandeep Bhutani

Posted 2020-07-17T18:38:34.417

Reputation: 633


If you are trying to hide the actual data values, one standard way to make private data available publicly is to process the dataset through PCA or a similar algorithm. Also use one-hot encoding or embeddings for categorical/text data, and rename the columns. Reverse engineering this data exactly would be very difficult, maybe even impossible. There may be ways to recover similar data, but you can minimize even that risk by performing a 2nd step:

  1. reducing the number of output features, which obscures the original data
  2. applying another dimensionality reduction method after PCA, such as SVD, LDA, etc.

After this process, the data is not quite the same, but is usually similar enough to the original dataset to be useful for most use cases.
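The two steps above can be sketched with scikit-learn; the toy dataset here is a random stand-in for the private data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the private numeric dataset: 100 rows, 6 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))

# PCA mixes every original column into each output component, so no
# shared column corresponds to a single original feature; keeping
# fewer components than features also discards information.
pca = PCA(n_components=4)
X_shared = pca.fit_transform(X)
print(X_shared.shape)  # (100, 4)

# Exact reconstruction is no longer possible from the shared matrix.
X_approx = pca.inverse_transform(X_shared)
print(np.allclose(X_approx, X))  # False
```

The shared matrix can then be handed over with generic column names; a further transform (SVD, LDA) on `X_shared` would correspond to the 2nd step.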

Donald S

Posted 2020-07-17T18:38:34.417

Reputation: 1 493

I don't see how dimensionality reduction protects privacy, since you're not changing the mappings between any variables or between any variables and the output. It's functionally the same as having collected the same dataset, but just with fewer features. In that regard, the original data can be considered a subset of some larger dataset that you didn't collect. Reporting a subset of the data doesn't protect anything about the data that remains. – Nuclear Hoagie – 2020-07-21T17:06:11.547

The feature reduction is the 2nd step. If you use PCA and then apply feature reduction, as I suggested, some of the information that could be used for reverse engineering is lost, making that attack much less likely to succeed. By using PCA, the values themselves are changed by the algorithm, not by a simple scaling factor but by combining multiple features and taking their orthogonal projections, thereby masking the original data. Hope this is clearer. – Donald S – 2020-07-21T17:47:13.787


Whichever method you choose is fine, but assuming that you wish to mitigate inference attacks, something like differential privacy is required for either approach.

Formally speaking, differential privacy provides some of the strongest guarantees against reverse engineering. Specifically, it promises that any attacker, regardless of attack methodology or available computing power, will be unable to conclude with certainty whether or not any individual has contributed data to a dataset. This is because the results of differentially-private methods are ambiguous up to the addition or removal of the input contributions of any individual. In essence, every individual gets deniability about their participation (or non-participation) in the input.

The problem with synthetic data is that it is generated from a model that is fit to real data. This means that the model parameters are aggregate functions of the real data. This is problematic because it is often possible to make inferences from aggregates of data, or estimates thereof (this is the motivation for differential privacy in the first place), and the parameters of the generative model can often be estimated from the synthetic data. I am happy to give an example of such an attack if there is interest. Further, this reasoning also implies that white-box exchange of the model is at least as risky, and comes with additional concerns, such as whether the network has memorized training data. A straightforward mitigation is to apply differential privacy when building the generative model for the synthetic data.

In regards to privacy budgets, one can interpret the budget (often referred to as $\varepsilon$, and also called the privacy loss) as the amount of information (say, in bits) that an adversary with access to differentially private results can infer about any individual. Perhaps surprisingly, it can, and ideally should, be much less than 1. If there are future releases that reference the same individuals, then one has to worry about how much individual information can be inferred from the aggregate collection of releases. There is a straightforward composition theorem (See e.g. Sect 3.5) that follows directly from the definition of differential privacy. It states that the aggregate privacy loss is at most the sum of the individual privacy losses of the constituent releases. In other words, it is additive in the worst case. It may also be helpful to know that when the inputs are disjoint, the aggregate loss behaves like the maximum rather than the sum.
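The two composition rules above amount to trivial budget arithmetic; the epsilon values below are arbitrary examples:

```python
# Sequential composition: releases computed over the SAME individuals.
# Total privacy loss is at most the sum of the per-release epsilons.
epsilons = [0.1, 0.25, 0.05]
total_epsilon = sum(epsilons)        # worst case: 0.4

# Parallel composition: if each release touches a DISJOINT subset of
# individuals, the total loss is only the maximum epsilon.
parallel_epsilon = max(epsilons)     # 0.25
```

This is why a data holder typically fixes a total budget up front and deducts from it with every release, refusing further queries once it is exhausted.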

Alfred Rossi

Posted 2020-07-17T18:38:34.417

Reputation: 101

Thank you for the detailed response. Does differential privacy also work in cases where there's a single central repository for the data? To add some context: a company already has all the data of its users in the original form. Now, if the company wants to share this data with a third party, can differential privacy algorithms be applied while sharing the data? I think it can be done, but it would be great if you could elaborate more on it and probably cite some sources for detailed reading. – Vivek Maskara – 2020-07-21T20:58:24.530

Also, as far as synthetic data generation is concerned, I tried using neural networks (generative adversarial networks) for generating synthetic data. The idea is to destroy the discriminator once the network is fine-tuned for generating real-like data. Will it still have privacy concerns? It would be great if you could share an example of the possible attack that you mentioned. Also, it would be great if you could elaborate more on how differential privacy can be applied in building a generative model. For context, here's the Python notebook I used for my network. https://bit.ly/30CWuBb

– Vivek Maskara – 2020-07-21T21:04:19.520