Synthetic data can protect privacy

Synthetic data generated by artificial intelligence can be used to draw statistical inferences about, for example, the personal data of real people. Research on the topic currently shows both promise and challenges.

The use of artificial intelligence could reduce the privacy risks associated with the processing of personal data. This could be achieved with the help of synthetic data created by computers. Synthetic data is often created using generative AI, which generates, for example, text or images.

Synthetic data generated by artificial intelligence can resemble, for example, the health data or banking details of real people. Artificial intelligence can also be harnessed to generate data resembling data on human location or mobility.

“At its best, synthetic data can be completely anonymous, in which case it reveals nothing about actual individuals,” says Professor of Data Science Antti Honkela at the University of Helsinki.

At the same time, anonymity creates a problem of its own. To replace original data, synthetic data must be sufficiently similar to the personal data of real people.

“How do you combine similarity to the original and anonymity? This is what researchers are now seeking solutions for,” Honkela says.

Differential privacy helps preserve anonymity

A research group led by Honkela, who works at the University of Helsinki and the Finnish Center for Artificial Intelligence (FCAI), has been designing a way of generating and analysing synthetic data based on personal data.

“While the synthetic data generated by our method is anonymous, we were able to use it to produce statistically valid inferences from the original data. Previous methods have not enabled reliable statistical inference with anonymous synthetic data,” Honkela says.

Instead of a single synthetic dataset, the method developed by Honkela’s group generates several datasets that are suitably different from one another. When each dataset is analysed and the results are appropriately combined, reliable inferences can be made about the original data, with the uncertainty associated with the inference correctly assessed.
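To make the idea of combining results concrete, the sketch below pools estimates from several synthetic datasets using the classic multiple-imputation form of Rubin's rules. This is an illustrative assumption: the article does not say which combining rule the group uses, and variants tailored to synthetic data use a different variance formula. The function name `combine_estimates` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances):
    """Pool per-dataset results with the classic form of Rubin's rules.

    estimates: one point estimate per synthetic dataset
    variances: the estimated variance of each of those point estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    pooled = mean(estimates)        # average of the point estimates
    within = mean(variances)        # average within-dataset variance
    between = variance(estimates)   # sample variance across datasets

    # The between-dataset term carries the extra uncertainty that comes
    # from the datasets differing from one another.
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from analysing three synthetic datasets.
pooled, total = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The point of the between-dataset term is that analysing only one synthetic dataset would hide this source of uncertainty and make the result look more precise than it is.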

The provable anonymity of the results is based on what is known as differential privacy, a framework in which the level of anonymity can be adjusted.

“This comes at a price: strict anonymity reduces the accuracy of results and increases related uncertainty. However, our method is able to take into account the effect of this inaccuracy on the outcome,” Honkela says.
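The trade-off between anonymity and accuracy can be illustrated with the Laplace mechanism, a textbook way of achieving differential privacy for a simple count. This is not the method developed by Honkela's group; it is a minimal sketch under the assumption that the adjustable level of anonymity corresponds to the usual privacy parameter epsilon, where a smaller epsilon means stricter privacy. The function names are hypothetical, and the code is for illustration only, not hardened for production use.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity defaults to 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def laplace_std(epsilon, sensitivity=1.0):
    """Standard deviation of the added noise: sqrt(2) * sensitivity / epsilon."""
    return (2 ** 0.5) * sensitivity / epsilon


rng = random.Random(0)
for epsilon in (0.1, 1.0, 10.0):
    released = private_count(1000, epsilon, rng)
    print(f"epsilon={epsilon}: released={released:.1f}, "
          f"noise std={laplace_std(epsilon):.2f}")
```

Because the noise scale is known in advance, an analysis can account for it when reporting uncertainty, which is the general idea behind noise-aware inference.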

According to Honkela, the findings expand the uses of synthetic data in research, as the technique makes it possible to carry out at least preliminary statistical analyses.

“In certain cases, the analysis result may remain inaccurate, making it necessary to repeat the analysis with original, real data, if possible, to gain a more accurate result.”

Honkela and his colleagues have published research-based tools for anonymous synthetic data in an open source software package.

Who generates synthetic data?

According to Honkela, the generation of synthetic data would primarily be the responsibility of the controllers of various registry data. They could share anonymous synthetic data with researchers similarly to open data. Researchers could use the data, for example, for teaching and software development as well as for preliminary statistical analyses.

Honkela considers synthetic data a promising way of reducing data protection issues associated with the use of personal data in research.

“The topic is prominent in the social welfare and healthcare sector, at Statistics Finland and in the European discussion on health data. I believe the technique we have developed will significantly expand the use of synthetic data.”

Honkela points out that synthetic data does not solve the challenges of privacy once and for all.

“In generating it, part of the information in the original data is always lost. However, when used correctly, it can be part of a solution that enables the safe use of personal data.”

Watch the Finnish-language stream from the event “Synteettinen data yhteiskunnassa – Voiko väestöä simuloida?” (Synthetic data in society: can a population be simulated?), organised by DataLit.

An article on the method enabling statistical inference:

Ossi Räisä, Joonas Jälkö, Samuel Kaski and Antti Honkela. Noise-Aware Statistical Inference with Differentially Private Synthetic Data. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), 2023.

The project has received support from the Data Literacy for Responsible Decision-Making (DataLit) project funded by the Strategic Research Council established within the Research Council of Finland.