Themes for MSc Thesis and Internship
Below are themes for potential MSc theses and internships in our research group.

If you are interested in an MSc thesis with our group, please read these instructions on how to start and complete your thesis.

[e2eML] Efficient End-to-End Machine Learning

Many researchers and practitioners target the predictive performance of Machine Learning (ML) models, e.g., they aim to develop ML models that have high accuracy on certain predictive tasks. By contrast, within this theme, you will work on aspects of computational efficiency, i.e., study and potentially improve the running time of ML pipelines. To do that, you will consider the various stages of ML pipelines, end-to-end (e.g., data acquisition, data sampling and processing, model training and validation, model deployment and retraining) and study how the individual and joint efficiency of these stages can be improved. For example, your thesis may study how big a data sample is necessary and sufficient for ML training to reach high accuracy in a short time; or how often an ML model should be updated in order to maintain good accuracy without redundant training.
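As an illustration of the sample-size question, here is a minimal sketch (not a method from the group or from the articles below). It trains a toy nearest-centroid classifier on growing samples of synthetic two-class data and records test accuracy and training time for each sample size. The data generator, the classifier, and all names (`make_data`, `fit_centroids`, `predict`, `learning_curve`) are assumptions made only for this example.

```python
import random
import time


def make_data(n, rng):
    """Synthetic 2-D points: class 0 around (0, 0), class 1 around (3, 3)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 3.0 * label
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def fit_centroids(train):
    """Toy 'training' step: compute the mean point of each class."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    # Skip a class that happens to be absent from a very small sample.
    return {c: (sx / k, sy / k) for c, (sx, sy, k) in sums.items() if k > 0}


def predict(centroids, point):
    """Assign the point to the class with the nearest centroid."""
    x, y = point
    return min(
        centroids,
        key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
    )


def learning_curve(sizes, pool_size=20000, test_size=2000, seed=0):
    """Return (sample_size, test_accuracy, training_seconds) for each size."""
    rng = random.Random(seed)
    pool = make_data(pool_size, rng)
    test = make_data(test_size, rng)
    results = []
    for n in sizes:
        sample = rng.sample(pool, n)
        start = time.perf_counter()
        centroids = fit_centroids(sample)
        elapsed = time.perf_counter() - start
        correct = sum(predict(centroids, p) == label for p, label in test)
        results.append((n, correct / len(test), elapsed))
    return results


if __name__ == "__main__":
    for n, acc, secs in learning_curve([10, 100, 1000, 10000]):
        print(f"n={n:>6}  accuracy={acc:.3f}  train_time={secs:.6f}s")
```

On data this easy, accuracy typically levels off at small sample sizes while training time keeps growing; a thesis would ask the same question for realistic pipelines and models.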

Background articles:

  • Xin, D., Miao, H., Parameswaran, A., & Polyzotis, N. (2021). Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21) (pp. 2639–2652). ACM.
  • Mahadevan, A., & Mathioudakis, M. (2022). Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study. Machine Learning and Knowledge Extraction, 4(3), 591–620.
  • Wang, Y., Fabbri, F., & Mathioudakis, M. (2021). Fair and Representative Subset Selection from Data Streams. In Proceedings of the Web Conference 2021 (WWW '21) (pp. 1340–1350). ACM.
  • Wang, Y., Mathioudakis, M., Li, Y., & Tan, K-L. (2021). Minimum Coresets for Maxima Representation of Multidimensional Data. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '21) (pp. 138–152). ACM.

[biggraph] Massive-Scale Graph Processing

Within this theme, you will focus on the scalability of graph processing tasks for large graphs. One direction is to consider common graph-processing tasks (e.g., computation of connected components, clusters, distances, embeddings, etc.) and study how existing state-of-the-art algorithms for these tasks perform on massive graphs (i.e., with more than 100 million nodes or edges). Another direction is to consider a specific massive graph (e.g., a dataset that you are interested in or that is available in the research group) and study how to perform tasks efficiently on this particular graph.
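To make one of the tasks above concrete, here is a minimal sketch of computing connected components with a union-find (disjoint-set) structure. It is a textbook baseline chosen for illustration, not an algorithm taken from the articles below; the function name `connected_components` and the assumption that nodes are numbered 0 to `num_nodes - 1` are choices made for this example. The edges are read once, in any order, so they could be streamed from disk while only two arrays of size `num_nodes` are kept in memory.

```python
def connected_components(num_nodes, edges):
    """Return one component label per node, using union-find.

    `edges` is any iterable of (u, v) pairs with 0 <= u, v < num_nodes.
    """
    parent = list(range(num_nodes))  # parent[u] == u means u is a root
    size = [1] * num_nodes           # component size, valid at roots only

    def find(u):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # already in the same component
        # Attach the smaller tree under the larger one.
        if size[root_u] < size[root_v]:
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u
        size[root_u] += size[root_v]

    # Two nodes share a label exactly when they are connected.
    return [find(u) for u in range(num_nodes)]


if __name__ == "__main__":
    labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
    print(len(set(labels)), "components")
```

Whether such a single-machine baseline remains practical beyond 100 million nodes or edges, and how it compares with parallel or external-memory alternatives, is the kind of question this theme studies.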

Background articles:

  • Merchant, A., Mathioudakis, M., & Wang, Y. (2023). Graph Summarization via Node Grouping: A Spectral Algorithm. In The 16th ACM International Conference on Web Search and Data Mining.
  • Merchant, A., Gionis, A., & Mathioudakis, M. (2022). Succinct Graph Representations as Distance Oracles: An Experimental Evaluation. Proceedings of the VLDB Endowment, 15(11), 2297–2306.
  • Merchant, A., Mahadevan, A., & Mathioudakis, M. (2022). Scalably Using Node Attributes and Graph Structure for Node Classification. Entropy, 24(7), 511-522. [906].

Completed Theses

See here: /en/researchgroups/algorithmic-data-science/people#section-98775.