M.Sc. (Tech) Ananth Mahadevan defends his doctoral thesis "Scaling and Maintaining Machine Learning Pipelines" on Thursday the 12th of December 2024 at 12 o'clock in the University of Helsinki Athena building, Hall 107 (Siltavuorenpenger 3 A, 1st floor). His opponent is Associate Professor Hong-Linh Truong (Aalto University) and the custos is Associate Professor Michael Mathioudakis (University of Helsinki). The defence will be held in English.
The thesis of Ananth Mahadevan is part of the research conducted in the Department of Computer Science and in the Algorithmic Data Science group at the University of Helsinki. His supervisor has been Associate Professor Michael Mathioudakis (University of Helsinki).
Scaling and Maintaining Machine Learning Pipelines
Machine learning (ML) systems have become indispensable across diverse domains such as social media and e-commerce, where they are employed to enhance decision-making, automate complex tasks, and extract insights from massive datasets. As the volume and velocity of data in these systems grow exponentially, the associated ML pipelines -- which handle data management, model training, deployment, and monitoring -- face significant issues. Scaling these pipelines from gigabytes to terabytes of data introduces computational bottlenecks. Additionally, evolving data distributions in dynamic environments degrade model performance over time. Addressing these issues is crucial for developing scalable and maintainable ML pipelines.
This thesis tackles these issues by exploring four key research themes: efficient training, robustness, updatability, and cost awareness in ML pipelines. We begin by studying each theme individually to answer specific research questions. For efficient training, we identify and mitigate the computational bottlenecks in scaling JANE, a novel node classification algorithm, to large graph datasets. For robustness, we analyze the resilience of WM-SKETCH, a compressed linear classifier, to adversarial attacks. For updatability, we evaluate different methods to efficiently forget deleted training data points. Finally, for cost awareness, we develop algorithms to make retraining decisions that balance computational costs and performance.
In this thesis, we further provide an end-to-end perspective, exemplified by the development of ReceptionReader, a real-world system designed for the large-scale analysis of historical documents. This system serves as a practical case study, highlighting how challenges related to each theme manifest in a real-world context. We discuss our solutions to these practical challenges, ranging from the robust preprocessing of millions of noisy digitized documents to cost-aware implementation choices that optimize the system. These solutions result in a scalable and maintainable ML pipeline for the ReceptionReader system, enabling researchers to conduct novel analyses of historical documents.
In summary, this thesis presents research insights and practical solutions for scaling and maintaining ML pipelines. By exploring computational efficiency, robustness, and operational cost management, it provides a comprehensive roadmap for handling the complex trade-offs inherent in large-scale ML systems. The strategies discussed not only enhance the efficiency of ML processes but also ensure adaptability and resilience in evolving environments, thereby facilitating scalable and cost-effective deployment.
Availability of the dissertation
An electronic version of the doctoral dissertation will be available in the University of Helsinki open repository Helda at http://urn.fi/URN:ISBN:978-952-84-0744-7.
Printed copies will be available on request from Ananth Mahadevan: ananth.mahadevan@helsinki.fi.