Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
String data is ubiquitous, and string similarity search and join are critical to applications in information retrieval, data integration, data cleaning, and big data analytics. To support these operations, many techniques have been proposed independently in the database and machine learning areas. More precisely, in the database research area, there are techniques based on the filtering-and-verification framework that not only achieve high performance but also provide guaranteed result quality for given similarity functions. In the machine learning research area, string similarity processing is modeled as a problem of identifying similar text records; specifically, deep learning approaches use embedding techniques that map text to a low-dimensional continuous vector space.
In this tutorial, we review a large number of studies of string similarity search and join in these two research areas. We divide the studies in each area into different categories. For each category, we provide a comprehensive review of the relevant works, and present the details of these solutions. We conclude this tutorial by pinpointing promising directions for future work to combine techniques in these two areas.
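To make the filtering-and-verification framework mentioned above concrete, the following is a minimal sketch of a prefix-filtering self-join under Jaccard similarity on whitespace tokens. It is an illustration under stated assumptions, not any specific system covered in the tutorial: the function names (`jaccard`, `prefix_length`, `similarity_self_join`), the tokenization, and the sample records are all hypothetical.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_length(size, threshold):
    """Prefix length that guarantees no qualifying pair is missed.

    The small epsilon guards against floating-point round-up
    (e.g. 0.7 * 10 evaluating slightly above 7).
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_self_join(records, threshold):
    """Return sorted (i, j, sim) with i < j and Jaccard >= threshold."""
    token_sets = [set(r.lower().split()) for r in records]

    # Global token order: rare tokens first, ties broken alphabetically.
    freq = defaultdict(int)
    for tokens in token_sets:
        for t in tokens:
            freq[t] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in token_sets]

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = []
    for i, tokens in enumerate(ordered):
        if not tokens:
            continue
        # Filtering step: only records sharing a prefix token are candidates.
        candidates = set()
        for t in tokens[:prefix_length(len(tokens), threshold)]:
            candidates.update(index[t])
            index[t].append(i)
        # Verification step: compute the exact similarity for each candidate.
        for j in candidates:
            sim = jaccard(token_sets[i], token_sets[j])
            if sim >= threshold:
                results.append((j, i, sim))
    return sorted(results)


if __name__ == "__main__":
    sample = [
        "apple inc cupertino",
        "apple inc cupertino ca",
        "microsoft corp redmond",
    ]
    print(similarity_self_join(sample, 0.7))
```

The filter only prunes pairs that cannot reach the threshold, so the result matches an exhaustive pairwise comparison; this is the sense in which such methods provide guaranteed result quality for a given similarity function.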
Figure 1. A timeline of literature on string similarity processing with database and machine learning techniques
1.1 A brief overview of the history of string similarity search and join.
1.2 The real-world applications of string similarity search and join.
1.3 The formal problem definition and necessary background.
2.1 Background (15')
-- Motivation of performing string similarity joins in DB
-- Problem definition
-- Overview of the history of string similarity join algorithms
2.2 Categorization of Approaches (5')
-- Syntactic-based string similarity joins
-- Semantic-based string similarity joins
2.3 Syntactic-based string similarity joins (30')
-- String similarity joins in a single-machine environment [SIGMOD'04, VLDB'06, ICDE'06, VLDB'11, SIGMOD'15, SIGMOD'18]
-- String similarity joins in a distributed environment [SIGMOD'10, VLDB'12, ICDE'14, ICDE'17]
2.4 Semantic-based string similarity joins (20')
-- Synonym-based string similarity joins [ICDE'08, SIGMOD'13, VLDB'17, EDBT'19]
-- Taxonomy-based string similarity joins [CIKM'18]
2.5 Summary (5')
3.1 Background (20')
-- Motivation to use ML for string matching
-- Brief overview of different machine learning algorithms
-- Word Embedding
-- Deep Neural Network
3.2 Categorization of Approaches (5')
-- Supervised ML
-- Deep Learning
3.3 Supervised ML based Approach (10')
-- Early approaches: based on SVM, Bayesian networks, active learning, clustering, etc. [multiple references]
-- Falcon [Das et al. SIGMOD 2017]
3.4 Deep Learning based Approach (10')
-- LinkNBed [Trivedi et al. ACL 2018]
-- DeepER [Ebraheem et al. PVLDB 2018]
-- Hybrid [Mudgal et al. SIGMOD 2018]
3.5 Comprehensive Approach (5')
-- Magellan [Konda et al. SIGMOD 2016]
-- Smurf [Suganthan et al. VLDB 2019]
3.6 More related topics in NLP (10')
-- Question Answering
-- Natural Language Inference
-- Text Matching
Sanjib Das, Paul Suganthan G. C., AnHai Doan, Jeffrey F. Naughton, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, Youngchoon Park: Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services. SIGMOD Conference 2017: 1431-1446
Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, Nan Tang: Distributed Representations of Tuples for Entity Resolution. PVLDB 11(11): 1454-1467 (2018)
Part 4 Open Problems
4.1 General-purpose pipeline for string similarity search and join.
4.2 Accelerating the machine learning-based approaches.
4.3 Combining human-in-the-loop techniques with machine learning approaches for better performance.
Jiaheng Lu is an Associate Professor at the University of Helsinki, Finland. His main research interests lie in big data management and database systems, specifically in the challenge of efficient data processing over massive real-life data repositories and the Web. He has written four books on Hadoop and NoSQL databases and has published more than 70 journal and conference papers in venues such as SIGMOD, VLDB, TODS, and TKDE.
Chunbin Lin is a software engineer at Amazon Web Services (AWS), where he works on AWS Redshift. He completed his Ph.D. in computer science at the University of California, San Diego (UCSD) in 2018. His research interests are database management and big data management. He has published more than 20 journal and conference papers in venues such as SIGMOD, VLDB, VLDB J, and TODS.
Jin Wang is a fourth-year PhD student at the University of California, Los Angeles. Before joining UCLA, he obtained his master's degree in computer science from Tsinghua University in 2015. His research interests lie mainly in the field of data management and text analytics. He has published more than 10 papers in top-tier conferences and journals such as ICDE, IJCAI, EDBT, TKDE, and VLDB Journal.
Chen Li is a professor in the Department of Computer Science at UC Irvine. He received his Ph.D. degree in Computer Science from Stanford University. His research interests are in the field of data management, including data-intensive computing, query processing and optimization, visualization, and text analytics.