M.Sc. (Tech) Otto Nyberg defends his doctoral thesis "Uplift Modeling with Click-Stream Data" on Friday the 29th of September 2023 at 13:00 in the University of Helsinki Language Centre building, Festive Hall (Fabianinkatu 26, 3rd floor). His opponent is Professor Szymon Jaroszewicz (Polish Academy of Sciences, Poland) and the custos is Associate Professor Arto Klami (University of Helsinki). The defence will be held in English.
The thesis of Otto Nyberg is a part of research done in the Department of Computer Science and in the Multi-Source Probabilistic Intelligence group at the University of Helsinki. His supervisor has been Associate Professor Arto Klami (University of Helsinki).
Uplift Modeling with Click-Stream Data
Uplift modeling is a form of causal inference on an individual level using machine learning methods. It is used in a handful of fields, namely e-commerce, social sciences, and medical sciences, and could potentially be used in multiple others. Essentially, uplift modeling answers the question "will this treatment produce the desired effect for this individual?"
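As a rough illustration of the question above, uplift for a group of similar individuals can be estimated as the difference between the conversion rate of those who received the treatment and of those who did not. The sketch below assumes a randomized experiment and uses made-up counts; the function names `conversion_rate` and `uplift` are hypothetical and are not taken from the thesis.

```python
def conversion_rate(conversions: int, total: int) -> float:
    """Fraction of observations in a group that converted (e.g. bought)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return conversions / total


def uplift(treated_conv: int, treated_total: int,
           control_conv: int, control_total: int) -> float:
    """Estimated uplift: treated conversion rate minus control conversion rate.

    Assumes treatment was assigned at random, so that the two groups are
    comparable. All inputs here are illustrative.
    """
    return (conversion_rate(treated_conv, treated_total)
            - conversion_rate(control_conv, control_total))


# Made-up counts for one segment of visitors.
estimate = uplift(treated_conv=30, treated_total=1000,
                  control_conv=20, control_total=1000)
print(f"estimated uplift: {estimate:.3f}")
```

A positive estimate suggests that the treatment increases the conversion probability for that segment, a negative one that it decreases it. An uplift model replaces the fixed segment with a prediction conditioned on each individual's features.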
This thesis focuses on uplift modeling using click-stream data. The need for uplift modeling was raised by the author's employer at the time, who was collecting vast amounts of click-stream data (data generated by users' actions while visiting online stores and other sites) and looking for new ways to utilize it. In this thesis, we focus on the issues that arise specifically in this setting, although the results are by no means limited to it. Click-stream data usually exhibits high class-imbalance, i.e. positive observations (e.g. buyers) make up only a tiny fraction of all observations. This has been shown to cause major issues in classification, and we show that it also causes issues in the calibration of classifiers. We propose a novel calibration method that both quantifies and controls for the resulting uncertainty.
High class-imbalance also causes issues for uplift modeling that are more complex to understand than the issues in classification. We are the first to analyze this problem and propose solutions to it. We show that high class-imbalance needs to be dealt with in the context of uplift modeling and propose a number of solutions based on undersampling of the majority class. We show how undersampling causes distortions in the data and propose a few calibration methods to correct for this. Lastly, we introduce two novel ways of quantifying the uncertainty of an uplift estimate. We argue that this uncertainty is a vital piece of information, especially if uplift modeling is to become more common in fields with much smaller amounts of data than e-commerce.
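To give a sense of the distortion mentioned above, the following sketch applies a well-known odds correction from the classification literature. It is not necessarily one of the calibration methods proposed in the thesis. It assumes that negative observations were kept at random with a known probability `keep_rate`, the same in the treatment and control groups, while all positives were kept. The function name and the example numbers are hypothetical.

```python
def correct_for_undersampling(p_sampled: float, keep_rate: float) -> float:
    """Map a probability learned on undersampled data back to the original scale.

    Keeping each negative with probability keep_rate multiplies the odds of
    the positive class by 1 / keep_rate, so the correction multiplies the
    odds by keep_rate again.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be in [0, 1]")
    return keep_rate * p_sampled / (keep_rate * p_sampled + (1.0 - p_sampled))


# Illustrative predictions from models trained on undersampled data.
keep_rate = 0.01          # 1 in 100 negatives kept (assumed)
p_treated_sampled = 0.6   # made-up prediction under treatment
p_control_sampled = 0.5   # made-up prediction without treatment

naive_uplift = p_treated_sampled - p_control_sampled
corrected_uplift = (correct_for_undersampling(p_treated_sampled, keep_rate)
                    - correct_for_undersampling(p_control_sampled, keep_rate))

print(f"uplift on the undersampled scale: {naive_uplift:.4f}")
print(f"uplift after correction:          {corrected_uplift:.4f}")
```

In this made-up example the difference taken directly on the undersampled scale overstates the uplift by more than an order of magnitude, which is why some form of correction is needed before the estimates can be interpreted as probabilities.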
Availability of the dissertation
An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-9440-4.
Printed copies will be available on request from Otto Nyberg: otto.nyberg@helsinki.fi.