Synthetic Data

Synthetic data generation takes an original data source (a dataset) and creates from it new, artificial data with similar statistical properties.

Keeping the statistical properties means that anyone analysing the synthetic data, a data analyst for example, should be able to draw the same statistical conclusions from a synthetic dataset as they would from the real (original) data.

The process of creating synthetic data, called synthesis, involves the use of generative models. Foster, D. [2] explains how a generative model works as follows: “Suppose we have a dataset containing images of horses. We may wish to build a model that can generate a new image of a horse that has never existed but still looks real because the model has learned the general rules that govern the appearance of a horse. First, we require a dataset consisting of many examples of the entity we are trying to generate. This is known as the training data, and one such data point is called an observation. It is our goal to build a model that can generate new sets of features that look as if they have been created using the same rules as the original data”.
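As a minimal sketch of the idea in the quotation, the Python example below uses a two-column Gaussian as the generative model: it learns the "rules" (means, standard deviations and correlation) from a toy dataset and then samples new observations that never existed. The age/income columns, the function names fit_gaussian and sample, and the Gaussian assumption itself are illustrative choices made for this example, not something the source prescribes; real synthesis tools rely on far richer models.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Learn the 'rules' of a two-column dataset: means, spreads, correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n new observations from the fitted bivariate Gaussian."""
    rho = model["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + resid * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)
# Hypothetical "real" data: (age, income) pairs, invented for this example.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 1000 + 50 * age + rng.gauss(0, 300)))

model = fit_gaussian(real)               # training on the observations
synthetic = sample(model, 2000, rng)     # synthesis: new, artificial rows
synth_model = fit_gaussian(synthetic)    # what an analyst would measure

print(f"mean age   real={model['mx']:.1f}  synthetic={synth_model['mx']:.1f}")
print(f"corr       real={model['rho']:.2f}  synthetic={synth_model['rho']:.2f}")
```

Under these assumptions, no synthetic row is a copy of a real one, yet an analyst computing the mean age or the age/income correlation on either dataset should reach close to the same figures.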

The use of synthetic data is growing in many fields, from the training of artificial intelligence models in the health sector to computer vision, image recognition and robotics.

Positive foreseen impacts on data protection:

  • Less privacy-intrusive training of artificial intelligence models: synthetic data allows artificial intelligence models to be trained in a manner that is less privacy-intrusive for individuals, because the data used in the training process does not directly refer to an identified or identifiable person.
  • Enhanced privacy in data transfers: synthetic data can be considered a Privacy Enhancing Technology (PET) and, in that sense, it might be used as a supplementary measure for data transfers outside the European Union or within organisations that do not need to identify a specific person.


Negative foreseen impacts on data protection:

  • Risk of reidentification: synthetic data generation implies a compromise between privacy and utility. The more a synthetic dataset mimics the real data, the more utility it will have for analysts but, at the same time, the more it may reveal about real people, with risks to privacy and other human rights. 
  • Lack of clarity on other risks: it is unclear at this stage whether the transfer of generative models, which would allow other parties to generate synthetic data on their own, might bring further risks to privacy.
  • Possible risk of membership inference attacks: synthetic data seems to share the same caveats as other forms of anonymisation regarding the risk of membership inference attacks (i.e., the possibility for an attacker to infer whether a given data sample is in the training dataset of the target classifier), especially when it comes to outlier records (i.e., data with characteristics that stand out among other records).
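The membership inference risk for outlier records can be illustrated with a toy Python sketch. It assumes a deliberately overfitted synthesiser (the hypothetical memorising_generator, which simply copies each training record with a little noise) and a naive attacker who guesses "member" whenever some synthetic record lies close to the target. The function names, the noise level and the distance threshold are all assumptions made for this example; this is not the attack evaluated in the readings listed below.

```python
import math
import random


def memorising_generator(train, rng, noise=0.05):
    """Hypothetical overfitted synthesiser: copies each record plus tiny noise."""
    return [(x + rng.gauss(0, noise), y + rng.gauss(0, noise)) for x, y in train]


def nearest_distance(target, records):
    """Distance from the target record to its closest synthetic record."""
    return min(math.dist(target, r) for r in records)


def infer_membership(target, synthetic, threshold=1.0):
    """Attacker's guess: 'member' if some synthetic record is very close."""
    return nearest_distance(target, synthetic) < threshold


rng = random.Random(1)
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)  # a record that stands out from all the others

# Two worlds: the outlier was in the training data, or it was not.
synthetic_in = memorising_generator(population + [outlier], rng)
synthetic_out = memorising_generator(population, rng)

d_in = nearest_distance(outlier, synthetic_in)
d_out = nearest_distance(outlier, synthetic_out)
guess_in = infer_membership(outlier, synthetic_in)
guess_out = infer_membership(outlier, synthetic_out)
print(f"outlier in training data:  distance={d_in:.2f}  guess={guess_in}")
print(f"outlier not in training:   distance={d_out:.2f}  guess={guess_out}")
```

In this sketch, raising the noise parameter makes the outlier harder to spot but also distorts every other record, which is one simple way to see the compromise between privacy and utility described in the first bullet.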


Further readings:

  • Synthesis Tutorials, Replica Analytics
  • D. Foster, Generative deep learning: teaching machines to paint, write, compose, and play. O'Reilly Media, 2019
  • T. Stadler, B. Oprisanu, C. Troncoso, Synthetic Data - A Privacy Mirage. arXiv preprint, 2020 - arXiv:2011.07018

Tech Champion: Vítor Bernardo