AI systems process private data, which is why they must not remember too much

Professor of Data Science Antti Honkela investigates privacy-preserving artificial intelligence. When AI models are trained on sensitive data, it is important to ensure that they do not retain and later disclose that information.

What are your research topics?

My group’s research focuses on the trustworthiness of artificial intelligence. This involves many aspects; our particular focus is on data protection and privacy, as well as the management of uncertainty. We investigate how well AI systems tackle these challenges and develop techniques that improve their performance.

Among other things, we have developed methods that can prevent AI systems from revealing the confidential data used in their training. We have also developed methods for producing anonymous synthetic data, which can be used in many applications as a safe alternative to personal data.
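To illustrate the general idea behind private synthetic data (a textbook construction, not the group's specific methods), one can perturb summary statistics of the real data with calibrated noise and then sample new records from the noisy result. The sketch below, assuming NumPy and a single numeric attribute, applies the Laplace mechanism to a histogram:

```python
# A minimal sketch of differentially private synthetic data: add Laplace
# noise to a histogram of the real data, then sample from the noisy
# histogram. Illustrative only; real methods are more sophisticated.
import numpy as np

rng = np.random.default_rng(0)

def dp_synthetic_data(data, bins, epsilon, n_samples):
    """Draw synthetic samples from an epsilon-DP histogram of `data`."""
    counts, edges = np.histogram(data, bins=bins)
    # One record changes one bin count by 1, so the sensitivity is 1 and
    # Laplace noise with scale 1/epsilon yields epsilon-DP.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    # Pick bins according to the noisy counts, then draw uniformly
    # within each chosen bin.
    idx = rng.choice(len(probs), size=n_samples, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])

# Hypothetical usage: 'ages' stands in for some sensitive attribute.
ages = rng.normal(45, 12, size=10_000)
synthetic = dp_synthetic_data(ages, bins=30, epsilon=1.0, n_samples=10_000)
```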

Where and how does the topic of your research have an impact?

In recent years, AI systems have spread to a growing number of fields, and in many of them questions about trustworthiness are essential.

One important example is the use of health data in training artificial intelligence systems. Data from mental health patients, for instance, can be used to train a model that assists the staff of mental health chat services by suggesting themes and responses appropriate to the ongoing discussion.

Under no circumstances should such a model reveal the content of its training data to anyone. My group is developing techniques for both analysing and preventing the memorisation of data.
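One common way to analyse memorisation, sketched in broad strokes below, is to compare a model's per-example loss on records it saw during training against fresh records: a large gap means training records are suspiciously easy for the model, a classic membership-inference signal. Here `model_loss`, `train_records`, and `fresh_records` are hypothetical stand-ins, not the group's actual tooling.

```python
# A minimal memorisation probe: if the model fits its training records
# much better than comparable unseen records, it has likely memorised them.
import numpy as np

def memorisation_gap(model_loss, train_records, fresh_records):
    """Average loss gap between fresh and training records."""
    train_losses = np.array([model_loss(r) for r in train_records])
    fresh_losses = np.array([model_loss(r) for r in fresh_records])
    # Near-zero gap: the model treats seen and unseen data alike.
    # Large positive gap: a warning sign of memorisation.
    return fresh_losses.mean() - train_losses.mean()
```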

Similar challenges arise in many other situations where sensitive personal data are to be utilised.

What is particularly inspiring in your field right now?

I’m inspired by how the results of research by my group and the wider community are starting to show up in day-to-day life and to attract broad interest.

The tendency of AI systems to memorise their training data is directly linked to the discussion on copyright violations in the output of generative AI, which has attracted a lot of attention. The use of health data in the training of AI systems and related risks of disclosure have also elicited public discussion. 

The increased interest also stems from genuine progress: training methods that guarantee the privacy of the training data have become quite effective. In many applications, they can already combine sufficient accuracy with privacy.
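The best-known family of such methods is differentially private training in the style of DP-SGD: each example's gradient is clipped to bound its influence, and Gaussian noise is added before the parameters are updated. The sketch below shows a single update step in plain NumPy; a real implementation would also track the accumulated privacy budget over all training steps.

```python
# A minimal DP-SGD-style update: per-example gradient clipping plus
# Gaussian noise. One step only, for illustration.
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0):
    """One privacy-preserving gradient update (illustrative only)."""
    clipped = []
    for g in per_example_grads:
        # Clip each example's gradient so no single record dominates.
        scale = min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        clipped.append(g * scale)
    total = np.sum(clipped, axis=0)
    # Noise calibrated to the clipping norm hides any one example's
    # contribution; noise_multiplier sets the privacy level.
    total = total + rng.normal(scale=noise_multiplier * clip_norm,
                               size=total.shape)
    return params - lr * total / len(per_example_grads)
```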