
Synthetic Data

Synthetic data generation takes an original data source (dataset) and creates new, artificial data with statistical properties similar to those of the original data.

Preserving the statistical properties means that anyone analysing the synthetic data, a data analyst for example, should be able to draw the same statistical conclusions from a synthetic dataset as they would from the real (original) data.

The process to create synthetic data, called synthesis, involves the use of generative models. The way a generative model works is explained by Foster, D. [2] in the following way: “Suppose we have a dataset containing images of horses. We may wish to build a model that can generate a new image of a horse that has never existed but still looks real because the model has learned the general rules that govern the appearance of a horse. First, we require a dataset consisting of many examples of the entity we are trying to generate. This is known as the training data, and one such data point is called an observation. It is our goal to build a model that can generate new sets of features that look as if they have been created using the same rules as the original data”.
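The idea of learning the rules behind a dataset and then generating new observations can be sketched with a deliberately simple generative model. The example below is only an illustration under stated assumptions: the two-column "real" data is made up, the model is a plain two-dimensional Gaussian (real synthesis tools use far richer models), and the helper names `fit_gaussian`, `sample_gaussian` and `correlation` are hypothetical.

```python
import math
import random


def fit_gaussian(rows):
    """Learn the 'rules' of two-column data: means, variances and covariance."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sxx, syy, sxy


def sample_gaussian(params, n, rng):
    """Generate n new observations from the fitted Gaussian."""
    mx, my, sxx, syy, sxy = params
    # Cholesky factor of the 2x2 covariance matrix, written out by hand
    a = math.sqrt(sxx)
    b = sxy / a
    c = math.sqrt(max(syy - b * b, 0.0))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mx + a * z1, my + b * z1 + c * z2))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    _, _, sxx, syy, sxy = fit_gaussian(rows)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)

# Toy "real" training data (invented for this sketch): y depends on x plus noise
real = []
for _ in range(5000):
    x = rng.gauss(40, 10)
    real.append((x, 1.5 * x + rng.gauss(0, 5)))

params = fit_gaussian(real)                      # training on the observations
synthetic = sample_gaussian(params, 5000, rng)   # synthesis

print("mean of x   real vs synthetic:", fit_gaussian(real)[0], fit_gaussian(synthetic)[0])
print("correlation real vs synthetic:", correlation(real), correlation(synthetic))
```

None of the synthetic rows is copied from the real data, yet the means and the correlation between the two columns should come out close, which is the sense in which an analyst could reach similar statistical conclusions from either dataset.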

The use of synthetic data is growing in many fields, from training artificial intelligence models in the health sector to computer vision, image recognition, and robotics.

Positive foreseen impacts on data protection:

  • Less privacy-intrusive training of artificial intelligence models: synthetic data allows artificial intelligence models to be trained in a manner that is less privacy-intrusive for individuals, because the input data does not directly refer to an identified or identifiable person.
  • Enhanced privacy in data transfers: synthetic data can be considered a Privacy Enhancing Technology (PET) and, in this sense, it might be used as a supplementary measure for data transfers outside the European Union or within organisations that do not need to identify a specific person.

Negative foreseen impacts on data protection:

  • Risk of reidentification: synthetic data generation implies a compromise between privacy and utility. The more closely a synthetic dataset mimics the real data, the more useful it is for analysts but, at the same time, the more it may reveal about real people, which may increase the risks to individuals’ privacy and other human rights.
  • Lack of clarity on other risks: it is unclear at this stage whether the transfer of generative models, which would allow other parties to generate synthetic data on their own, may bring further risks to privacy.
  • Possible risk of membership inference attacks: synthetic data seems to share the same caveats as other forms of anonymisation regarding membership inference attacks (i.e., the possibility for an attacker to infer whether a given data sample was part of the target model's training dataset), especially when it comes to outlier records (i.e., records with characteristics that stand out from the other records).
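The membership inference risk described above can be illustrated with a toy distance-based attack. This is a minimal sketch under strong assumptions, not a description of any real attack tool: the data is invented, the "synthesizer" is a hypothetical over-fitted one that copies its training records with tiny noise, and the attacker's rule and `threshold` value are chosen purely for illustration.

```python
import math
import random


def nearest_distance(record, dataset):
    """Euclidean distance from `record` to its closest neighbour in `dataset`."""
    return min(math.dist(record, other) for other in dataset)


def guess_membership(candidate, synthetic, threshold):
    """Attacker's guess: 'member' if the candidate sits very close to a synthetic record."""
    return nearest_distance(candidate, synthetic) < threshold


rng = random.Random(1)


def draw(n):
    """Draw n made-up two-feature records from the same population."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


members = draw(200)      # records the synthesizer was trained on
non_members = draw(200)  # same population, never seen by the synthesizer

# Hypothetical over-fitted synthesizer: near-copies of the training records
synthetic = [(x + rng.gauss(0, 0.01), y + rng.gauss(0, 0.01)) for x, y in members]

threshold = 0.05  # illustrative value, not a recommended setting
hits_members = sum(guess_membership(r, synthetic, threshold) for r in members)
hits_non_members = sum(guess_membership(r, synthetic, threshold) for r in non_members)

print("flagged as members:", hits_members, "of", len(members), "true members")
print("flagged as members:", hits_non_members, "of", len(non_members), "non-members")
```

When the synthetic data mimics the training records this closely, the attacker flags true members far more often than non-members. An outlier would be even easier to single out with such a rule, since few other records lie near it to cause a false match.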

 

Further readings:

 

Tech Champion: Vítor Bernardo