Juhani Kivimäki defends his PhD thesis on Confidence-based Performance Estimation in Machine Learning Model Monitoring

On Friday the 22nd of May 2026, M.Sc. Juhani Kivimäki defends his PhD thesis on Confidence-based Performance Estimation in Machine Learning Model Monitoring. The thesis is related to research done in the Department of Computer Science and in the Empirical Software Engineering group.

M.Sc. Juhani Kivimäki defends his PhD thesis "Confidence-based Performance Estimation in Machine Learning Model Monitoring" on Friday the 22nd of May 2026 at 13:00 in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Professor Thomas Schön (Uppsala University, Sweden) and the custos is Professor Jukka K. Nurminen (University of Helsinki). The defence will be held in English.

The thesis of Juhani Kivimäki is part of the research done in the Department of Computer Science and in the Empirical Software Engineering group at the University of Helsinki. His supervisor has been Professor Jukka K. Nurminen (University of Helsinki).

Confidence-based Performance Estimation in Machine Learning Model Monitoring

Machine learning models deployed in production often operate under evolving data distributions and delayed or missing ground truth labels, making traditional performance monitoring infeasible and risking undetected degradation. This dissertation develops a principled framework for confidence-based performance estimation that enables label-free monitoring of machine learning models, with theoretical guarantees and practical algorithms designed for real-world settings.

In the first article, we present an industrial case study on extracting calibrated, document-level confidence scores from a 2D information extraction (IE) pipeline in a deployed commercial AI system that processes invoices. We develop an unobtrusive auxiliary confidence branch that attaches to the IE pipeline, leveraging its internal representations and metadata to produce calibrated invoice-level confidence scores. The approach substantially improves ranking quality, calibration, and the coverage of automatically processed invoices without increasing the error rate.

In the second article, we establish a theoretical foundation for our methodology by examining Average Confidence (AC), a widely used yet previously ad hoc estimator that takes the mean of confidence scores to estimate accuracy. We prove that under perfect calibration, AC is an unbiased and consistent estimator of accuracy. We extend AC to a probabilistic setting to derive valid batch-level confidence intervals by modeling correctness counts with the Poisson binomial distribution. Experiments on synthetic covariate shift scenarios show AC is competitive with or outperforms more complex baselines, with estimation error closely tied to calibration error.
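As a minimal sketch of the idea above: under perfect calibration, the mean confidence estimates batch accuracy, and the count of correct predictions follows a Poisson binomial distribution with the confidences as success probabilities. The function names here are illustrative, and for the interval we use the normal approximation to the Poisson binomial (mean Σpᵢ, variance Σpᵢ(1−pᵢ)) rather than the exact distribution used in the thesis.

```python
import math

def average_confidence(confidences):
    """AC estimator: the mean confidence. Under perfect calibration it is
    an unbiased, consistent estimator of batch accuracy."""
    return sum(confidences) / len(confidences)

def ac_interval(confidences, z=1.96):
    """Approximate 95% CI for batch accuracy.

    The number of correct predictions is Poisson binomial with success
    probabilities equal to the confidences; this sketch uses its normal
    approximation instead of the exact distribution.
    """
    n = len(confidences)
    mean = sum(confidences)                               # E[#correct]
    var = sum(p * (1.0 - p) for p in confidences)         # Var[#correct]
    half = z * math.sqrt(var)
    return max(0.0, (mean - half) / n), min(1.0, (mean + half) / n)
```

On a batch with confidences [0.9, 0.8, 0.95, 0.7, 0.85], AC estimates accuracy at 0.84, with the interval quantifying the batch-level uncertainty.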

In the third article, we generalize from unsupervised accuracy estimation to unsupervised performance estimation by introducing Confidence-based Performance Estimation (CBPE). CBPE models each confusion-matrix element (TP, FP, TN, FN) as a Poisson binomial random variable parameterized by per-instance confidences, yielding full distributions and valid confidence intervals for any confusion-matrix-derived classification metric (e.g., accuracy, precision, recall, F-score). It is non-invasive and computationally efficient. We show strong theoretical guarantees for CBPE and demonstrate its effectiveness under mild distributional shifts.
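To illustrate the mechanism, consider binary classification with calibrated scores s = P(y = 1 | x). Each confusion-matrix cell is a Poisson binomial variable; this hypothetical sketch computes only the cells' expected values and the point estimates they imply, omitting the variances and full distributions that CBPE uses for its confidence intervals.

```python
def cbpe_point_estimates(scores, threshold=0.5):
    """Label-free point estimates of precision/recall/F1 from calibrated
    scores s = P(y=1|x). Each confusion-matrix cell is the mean of a
    Poisson binomial variable; intervals are omitted in this sketch."""
    tp = fp = tn = fn = 0.0
    for s in scores:
        if s >= threshold:      # predicted positive
            tp += s             # actually positive with probability s
            fp += 1.0 - s
        else:                   # predicted negative
            fn += s
            tn += 1.0 - s
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For scores [0.9, 0.8, 0.3, 0.1] this gives an expected TP of 1.7 and FP of 0.3 among the predicted positives, hence an estimated precision of 0.85, all without any ground-truth labels.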

Finally, in the fourth article, we address calibration degradation under covariate shift by proposing Probabilistic Adaptive Performance Estimation (PAPE), which adapts calibration to a target distribution via density ratio estimation and then applies CBPE on the calibrated scores. Through extensive evaluations on real tabular datasets, we demonstrate that PAPE establishes a new state-of-the-art in unsupervised performance estimation.
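The recalibration step above can be sketched as follows. The density ratio w(x) = p_prod(x)/p_ref(x) can be obtained from a probabilistic classifier trained to distinguish production from reference data; those weights then refit a calibration map on labeled reference data so that it reflects the shifted distribution. Both function names are hypothetical, and the histogram-binning calibrator stands in for the more refined calibration method used by PAPE.

```python
def density_ratio_weights(domain_probs, eps=1e-6):
    """Importance weights from a domain classifier's P(production | x)
    evaluated on reference points: w = p / (1 - p) for balanced samples."""
    return [min(p, 1.0 - eps) / max(1.0 - p, eps) for p in domain_probs]

def weighted_bin_calibrator(ref_conf, ref_correct, weights, n_bins=10):
    """Fit a histogram-binning calibration map on labeled reference data,
    weighting each point by its density ratio. The returned function maps
    a raw confidence to the weighted accuracy of its bin; the calibrated
    production scores would then feed AC/CBPE."""
    sums = [0.0] * n_bins
    hits = [0.0] * n_bins
    for c, y, w in zip(ref_conf, ref_correct, weights):
        b = min(int(c * n_bins), n_bins - 1)
        sums[b] += w
        hits[b] += w * y
    table = [hits[b] / sums[b] if sums[b] else None for b in range(n_bins)]

    def calibrate(c):
        b = min(int(c * n_bins), n_bins - 1)
        return table[b] if table[b] is not None else c  # empty bin: pass through
    return calibrate
```

Reweighting shifts each bin's accuracy toward what it would be on the production distribution, so the downstream performance estimate tracks the target data rather than the stale reference set.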

Overall, this dissertation advances confidence-based monitoring from heuristic practice to a theoretically grounded, practically deployable methodology. It provides label-free estimation with uncertainty quantification for the metrics that matter in production, bridging the gap between academic evaluation and operational model monitoring under distribution shift.

Availability of the dissertation

An electronic version of the doctoral dissertation will be available in the University of Helsinki open repository Helda at .

Printed copies will be available on request from Juhani Kivimäki: .