Ossi Räisä defends his PhD thesis on Theory and Algorithms for Usable Synthetic Data and Efficient Privacy-Preserving Computation

On Friday the 14th of November 2025, M.Sc. Ossi Räisä defends his PhD thesis on Theory and Algorithms for Usable Synthetic Data and Efficient Privacy-Preserving Computation. The thesis is related to research done in the Department of Computer Science and in the Trustworthy Machine Learning group.

M.Sc. Ossi Räisä defends his PhD thesis "Theory and Algorithms for Usable Synthetic Data and Efficient Privacy-Preserving Computation" on Friday the 14th of November 2025 at 13:00 in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Senior Researcher Aurélien Bellet (Inria, University of Montpellier, France) and the custos is Professor Antti Honkela (University of Helsinki). The defence will be held in English.

The thesis of Ossi Räisä is part of the research done in the Department of Computer Science and in the Trustworthy Machine Learning group at the University of Helsinki. His supervisor has been Professor Antti Honkela (University of Helsinki).

Theory and Algorithms for Usable Synthetic Data and Efficient Privacy-Preserving Computation

As more and more data is collected about people, the importance of ensuring the privacy of these people increases. Limiting access to private data is the easiest way to ensure privacy, but it also limits the benefits that come from analysing the data. Differential privacy is a mathematical definition of privacy that provides a principled compromise between the extremes of completely open data and prohibiting the use of sensitive data. However, differential privacy is a very strict notion of privacy, and applying it can significantly reduce the utility of data analysis.
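For reference, the standard textbook formulation of differential privacy can be written as below. The notation is an assumption introduced here for illustration and is not taken from the thesis: M is a randomised mechanism, D and D' are any two datasets that differ in one person's record, S is any set of possible outputs, and epsilon and delta are the privacy parameters. Smaller epsilon and delta mean a stricter guarantee, which is where the utility loss comes from.

```latex
% (epsilon, delta)-differential privacy, standard definition.
% M, D, D', S are illustrative notation, not the thesis's own.
\[
  \Pr[\mathcal{M}(D) \in S]
  \;\le\;
  e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
  \qquad \text{for all neighbouring } D, D' \text{ and all measurable } S .
\]
```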

In this thesis, we study several ways to reduce the utility loss. The focus is on synthetic data: a dataset that has been generated to resemble a real dataset and is used in place of the real data. When combined with differential privacy, synthetic data allows an arbitrary number of analyses, which would not otherwise be possible under differential privacy. Synthetic data can also be used without differential privacy to obtain higher utility, at the cost of losing the formal privacy guarantee and risking vulnerability to attacks that disclose private information.

In particular, we study statistical inference from synthetic data in the first two publications. This problem is non-trivial, since using synthetic data introduces extra uncertainty that must be accounted for in uncertainty estimates like confidence intervals. This extra uncertainty is made even worse if differential privacy is used. As a solution, in the first publication we develop a synthetic data generation method that can represent this uncertainty in the form of multiple synthetic datasets. For frequentist inference problems, an existing method can use these datasets to give valid inferences, and in the second publication we show that another existing method gives valid Bayesian inferences under appropriate conditions.

In the third publication, we theoretically study a technique of using multiple synthetic datasets for supervised machine learning, which previous work has empirically found to be useful. In particular, we derive a bias-variance decomposition for this technique, which yields several practical insights, such as guidance on selecting the number of synthetic datasets.

In the fourth and fifth publications, we move from synthetic data to efficient privacy-preserving computations. In the fourth publication, we study the use of simulated data to optimise a privacy mechanism that makes probabilistic predictions with limited real data. In this work, we use meta-learning to train a model on the simulated data, which is then capable of adapting to the real data under differential privacy in a single forward pass of a neural network.

Finally, in the fifth publication, we seek a theoretical explanation of the benefit of large batch sizes in differentially private stochastic optimisation, which has been observed empirically in previous work. We find the explanation in the variance of the noise added to the gradients: the effective variance decreases with larger batch sizes.
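As a rough illustration of where batch size enters the noise, the following minimal sketch mimics one gradient step of the commonly used differentially private SGD recipe: clip each per-example gradient, sum, add Gaussian noise, and divide by the batch size. It is not the analysis from the thesis. The function names, the parameter names `clip_norm` and `noise_multiplier`, and the choice to hold the noise multiplier fixed are all assumptions made here. In practice, the noise multiplier needed for a given privacy budget itself depends on the batch size, which this sketch does not model.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average.

    Hypothetical helper for illustration; the noise standard deviation on the
    sum is noise_multiplier * clip_norm, as in the usual DP-SGD recipe.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    noise_std = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / batch_size for v in noisy]


def noise_std_of_mean(batch_size, clip_norm, noise_multiplier):
    """Per-coordinate noise standard deviation after dividing by batch size."""
    return noise_multiplier * clip_norm / batch_size


if __name__ == "__main__":
    # Assumption: the noise multiplier is held fixed while the batch size varies.
    for batch_size in (16, 64, 256):
        print(batch_size, noise_std_of_mean(batch_size, 1.0, 1.0))
```

With the noise multiplier held fixed, the noise added to the averaged gradient shrinks in proportion to one over the batch size. The thesis studies the effective variance, which this simplified sketch does not reproduce.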

Availability of the dissertation

An electronic version of the doctoral dissertation will be available in the University of Helsinki open repository Helda at .

Printed copies will be available on request from Ossi Räisä: .

3.11.2025

Ossi Räisä, Pirjo Moen

