Rafael Savvides defends his PhD thesis on Statistical methods for uncertainties in data exploration and model building

On the 11th of October 2024, M.Sc. (Tech) Rafael Savvides defends his PhD thesis on Statistical methods for uncertainties in data exploration and model building. The thesis is related to research done in the Department of Computer Science and the Exploratory Data Analysis group.

M.Sc. (Tech) Rafael Savvides defends his doctoral thesis "Statistical methods for testing visual patterns, selecting models, and bounding model errors" on Friday the 11th of October 2024 at 13:00 in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Professor Pauli Miettinen (University of Eastern Finland), and the custos is Professor Kai Puolamäki (University of Helsinki). The defence will be held in English.

The thesis of Rafael Savvides is part of the research done in the Department of Computer Science and in the Exploratory Data Analysis group at the University of Helsinki. His supervisor has been Professor Kai Puolamäki (University of Helsinki).

Statistical methods for testing visual patterns, selecting models, and bounding model errors

Data science involves analyzing data and building models based on data. The analyses and models produced by data scientists are used to make decisions and to create products that affect our lives. However, real-world data are imperfect, which introduces uncertainties into analyses and models based on data. These uncertainties propagate to downstream decisions and products, which can be negatively affected if the uncertainties are not made explicit.

This thesis introduces three computational methods that help manage uncertainties when analyzing data and building models based on data. The methods provide formal statistical guarantees for visual patterns observed during data exploration, for models selected based on their expected errors, and for model errors at specific data points. The methods concern fundamental problems in machine learning and therefore have wide applicability.

The first method concerns patterns observed during data exploration. Data scientists typically explore data using visualizations that reveal patterns in the data. Since data are noisy, an observed visual pattern may be due to noise rather than a true effect in the data. Whether an observation reflects a true pattern or a random occurrence is traditionally decided using statistical testing. The method we developed is a statistical procedure for testing visual patterns during visual data exploration. The procedure allows analysts to measure whether what they see is compatible with their accumulated knowledge of the data.
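As an illustration of the general idea (a minimal sketch, not the procedure developed in the thesis), the following Python snippet tests a visual pattern against a simple permutation null hypothesis: the pattern statistic observed in a scatterplot is compared with statistics computed on surrogate data in which the relationship has been destroyed. The function and variable names are hypothetical; the thesis procedure additionally accounts for the analyst's accumulated background knowledge of the data.

import numpy as np

def empirical_p_value(x, y, statistic, n_surrogates=1000, rng=None):
    """Test whether an observed pattern statistic could arise under a
    simple null hypothesis (here: no dependence between x and y),
    by comparing it against surrogate datasets made by permuting y."""
    rng = np.random.default_rng(rng)
    observed = statistic(x, y)
    # Count how many surrogate datasets show a pattern at least as strong.
    exceed = sum(
        statistic(x, rng.permutation(y)) >= observed
        for _ in range(n_surrogates)
    )
    # Add-one correction keeps the empirical p-value valid and positive.
    return (exceed + 1) / (n_surrogates + 1)

# Example: is the apparent linear trend in a scatterplot just noise?
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)
p = empirical_p_value(x, y, lambda a, b: abs(np.corrcoef(a, b)[0, 1]))
print(f"empirical p-value: {p:.3f}")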

The second method concerns selecting between machine learning models. When faced with a prediction task, data scientists use data to train and validate multiple models. In the end, only one model will be used, which is selected based on some criterion. The criterion usually relates to the average error on new data, which is estimated using data not seen during training. However, it is not clear how much data should be used to estimate the performance; more data is always better, but there is a finite amount for both training models and validating them. Using less data leads to a more uncertain selection, but it is not clear how to quantify that uncertainty. The method we developed is a model selection algorithm that automatically decides how much data to use for selecting between models. The algorithm uses as much data as required to ensure a formal confidence guarantee that the selected model's loss is within a specified tolerance of the best model's loss.
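A minimal sketch of how such a sequential selection could look is given below, assuming scikit-learn-style classifiers and a 0/1 loss bounded in [0, 1] so that Hoeffding's inequality applies. It is illustrative rather than the thesis algorithm; in particular, a fully rigorous version would also adjust the confidence level for the repeated looks at the validation data.

import numpy as np

def select_model(models, X_val, y_val, tol=0.05, delta=0.05, batch=20):
    """Consume validation data in batches until, with confidence 1 - delta,
    the chosen model's mean loss is within `tol` of every candidate's.
    Assumes X_val and y_val are NumPy arrays and losses lie in [0, 1]."""
    k = len(models)
    losses = [[] for _ in range(k)]
    n_total = len(y_val)
    used = 0
    while used < n_total:
        # Evaluate the next batch of validation points on every model.
        sl = slice(used, min(used + batch, n_total))
        for i, m in enumerate(models):
            losses[i].extend((m.predict(X_val[sl]) != y_val[sl]).astype(float))
        used = min(used + batch, n_total)
        means = np.array([np.mean(l) for l in losses])
        # Hoeffding radius with a union bound over the k models (illustrative).
        r = np.sqrt(np.log(2 * k / delta) / (2 * used))
        best = int(np.argmin(means))
        # Stop when the apparent best is provably within `tol` of all others:
        # upper bound on its loss <= lower bound on any other loss + tol.
        if all(means[best] + r <= means[j] - r + tol for j in range(k)):
            break
    return best, used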

The third method concerns estimating prediction errors at specific data points. The performance of a machine learning model is typically evaluated using an average prediction error. However, since the model predicts on individual points, it can often be more important to estimate the (unknown) error at a specific test point, which can differ significantly from the average error. The method we developed provides an upper bound for the prediction error of any regression model at a given test point. The bound is based on Gaussian processes, a powerful framework for quantifying uncertainty. The bound improves upon existing Gaussian-process-based methods by requiring less information from the user, applying to a large class of kernels and to any regressor, and being computationally faster.
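One way such a pointwise bound could be sketched, assuming the data-generating function is well modeled by a Gaussian process and using scikit-learn for the GP fit, is to combine the GP's predictive uncertainty at the test point with the disagreement between the GP mean and the regressor's prediction. This is an illustrative assumption-laden sketch, not the construction used in the thesis.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_error_bound(regressor, X_train, y_train, x_test, z=2.0):
    """Illustrative upper bound on |y(x_test) - regressor(x_test)|:
    model the data-generating function with a GP, then add the GP's
    predictive uncertainty at x_test to the disagreement between the
    GP mean and the regressor's prediction (triangle inequality)."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_train, y_train)
    mean, std = gp.predict(np.atleast_2d(x_test), return_std=True)
    pred = regressor.predict(np.atleast_2d(x_test))
    # |y - pred| <= |y - gp_mean| + |gp_mean - pred|; the first term is
    # bounded (with high probability under the GP model) by z * std.
    return float(abs(mean[0] - pred[0]) + z * std[0])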

Availability of the dissertation

An electronic version of the doctoral dissertation will be available in the University of Helsinki open repository Helda at http://urn.fi/URN:ISBN:978-952-84-0695-2.

Printed copies will be available on request from Rafael Savvides: rafael.savvides@helsinki.fi