Abstract:
String data is ubiquitous and string similarity search and join are critical to the applications of information retrieval, data integration, data cleaning, and also big data analytics. To support these operations, many techniques in the database and machine learning areas have been proposed independently. More precisely, in the database research area, there are techniques based on the filtering-and-verification framework that can not only achieve a high performance, but also provide guaranteed quality of results for given similarity functions. In the machine learning research area, string similarity processing is modeled as a problem of identifying similar text records; Specifically, the deep learning approaches use embedding techniques that map text to a low-dimensional continuous vector space.
In this tutorial, we review a large number of studies of string similarity search and join in these two research areas. We divide the studies in each area into different categories. For each category, we provide a comprehensive review of the relevant works, and present the details of these solutions. We conclude this tutorial by pinpointing promising directions for future work to combine techniques in these two areas.
Figure 1. A timeline of literature on string similarity processing with database and machine learning techniques
Outline
Part 1
1.1 A brief overview of the history of string similarity search and join.
1.2 The real-world applications of string similarity search and join.
1.3 The formal problem definition and necessary background.
Part 2
2.1 Motivation & Problem Definition (5')
2.2 String Similarity Measures (15')
-- Syntactic similarity measures
Token-based, Character-based, and Hybrid Similarity
-- Semantic similarity measures
Synonym-based, Taxonomy-based, and Unified Similarity
2.3 String Similarity Search (25')
-- Threshold-based Search
-- Top-k Search
2.4 String Similarity Join (35')
-- Threshold-based Join
-- Top-k Join
-- Distributed Join
Pertinent references:
[VLDB/GravanoIJKMS01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491-500
[VLDB/ArasuGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929
[VLDB/LiWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007: 303-314
[ICDE/ArasuCK08] Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik: Transformation-based Framework for Record Matching. ICDE 2008: 40-49
[ICDE/LiLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266
[WWW/XiaoWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW 2008: 131-140
[PVLDB/XiaoWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1): 933-944 (2008)
[ICDE/XiaoWLS09] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927
[SIGMOD/ZhangHOS10] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. SIGMOD Conference 2010: 915-926
[PVLDB/WangLF10] Jiannan Wang, Guoliang Li, Jianhua Feng: Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints. PVLDB 3(1): 1219-1230 (2010)
[SIGMOD/VernicaCL10] Rares Vernica, Michael J. Carey, Chen Li: Efficient parallel set-similarity joins using MapReduce. SIGMOD Conference 2010: 495-506
[SIGMOD/QinWLXL11] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin: Efficient exact edit similarity query processing with the asymmetric signature scheme. SIGMOD Conference 2011: 1033-1044
[ICDE/WangLF11] Jiannan Wang, Guoliang Li, Jianhua Feng: Fast-join: An efficient method for fuzzy token matching based string similarity join. ICDE 2011: 458-469
[PVLDB/LiDWF11] Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng: PASS-JOIN: A Partition-based Method for Similarity Joins. PVLDB 5(3): 253-264 (2011)
[PVLDB/MetwallyF12] Ahmed Metwally, Christos Faloutsos: V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. PVLDB 5(8): 704-715 (2012)
[SIGMOD/LuLWLW13] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384
[ICDE/DengLFL13] Dong Deng, Guoliang Li, Jianhua Feng, Wen-Syan Li: Top-k string similarity search with edit-distance constraints. ICDE 2013: 925-936
[PVLDB/WangDTZ13] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang: Efficient and Effective KNN Sequence Search with Approximate n-grams. PVLDB 7(1): 1-12 (2013)
[ICDE/DengLHWF14] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng: MassJoin: A mapreduce-based method for scalable string similarity joins. ICDE 2014: 340-351
[TODS/LuLWLX15] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao: Boosting the Quality of Approximate String Matching by Synonyms. TODS. 40(3): 15:1-15:42 (2015)
[ICDE/WangLDZF15] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng: Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. ICDE 2015: 519-530
[ICDE/RongLSWLD17] Chuitian Rong, Chunbin Lin, Yasin N. Silva, Jianguo Wang, Wei Lu, Xiaoyong Du: Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics. ICDE 2017: 1059-1070
[ICDE/ShangLLF17 ] Zeyuan Shang, Yaxiao Liu, Guoliang Li, Jianhua Feng: K-Join: Knowledge-Aware Similarity Join. ICDE 2017: 23-24
[SEMWEB/SlabbekoornHH12] Kristian Slabbekoorn, Laura Hollink, Geert-Jan Houben: Domain-Aware Ontology Matching. International Semantic Web Conference (1) 2012: 542-558
[PVLDB/TaoDS17] Wenbo Tao, Dong Deng, Michael Stonebraker: Approximate String Joins with Abbreviations. PVLDB 11(1): 53-65 (2017)
[PVLDB/DengKMS17] Dong Deng, Albert Kim, Samuel Madden, Michael Stonebraker: SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints. PVLDB 10(10): 1082-1093 (2017)
[CIKM/XuL18] Pengfei Xu, Jiaheng Lu: Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint. CIKM 2018: 1563-1566
[PVLDB/XuL19] Pengfei Xu, Jiaheng Lu: Towards a Unified Framework for String Similarity Joins. PVLDB 12(11): 1289-1302 (2019)
[ICDE/WangLZ19] Jin Wang, Chunbin Lin, Carlo Zaniolo: MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering. ICDE 2019: 386-397
[EDBT/WangLLZ19] Jin Wang, Chunbin Lin, Mingda Li, Carlo Zaniolo: An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms. EDBT 2019: 109-120
Part 3
3.1 Background (25')
-- Motivation to use ML for string matching
-- Brief overview of ML algorithms
-- Word Embedding [MSC 2013],[LM 2014],[PSM 2014]
-- Deep Neural Network [HS 1997][CMG 2014]
-- Categorization of Approaches
3.3 Traditional ML based Approach (5')
-- Basics of ML for string matching
-- Different approaches: rule-based, supervised ML, unsupervised ML
3.4 Deep Learning based Approach (10')
-- LinkNBed [TSD 2018]
-- DeepER [ETJ 2018]
-- Hybrid [MLR 2018]
3.5 Comprehensive Approach (10')
-- Magellan [KDS 2016]
-- Falcon [DSD 2017]
-- Smurf [SAD 2018]
3.6 More related topics in NLP (10')
-- Information Retrieval [HHG 2013], [WLG 2016]
-- Question Answering [WJ 2017], [SKF 2017], [YDL 2018]
-- Paraphrase Identification [HL 2016], [HLL 2014]
Pertinent references:
[CMG 2014] Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.
[DHM 2005] Reference reconciliation in complex information spaces. In SIGMOD, pages 85-96, 2005.
[DSD 2017] Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, pages 1431–1446, 2017.
[ETJ 2018] Distributed representations of tuples for entity resolution. PVLDB, 11(11):1454–1467, 2018.
[HS 1997] Long Short-Term Memory. Neural Computation 9(8): 1735-1780 (1997)
[KDS 2016] Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.
[HHG 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM 2013, pages: 2333-2338.
[HL 2016] Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In NAACL, pages 937-948, 2016.
[HLL 2014] Convolutional Neural Network Architectures for Matching Natural Language Sentences. In NIPS, pages 2042-2050, 2014.
[LM 2014] Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
[MLR 2018] Deep learning for entity matching: A design space exploration. In SIGMOD, pages 19–34, 2018.
[MCC 2013] Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 2013.
[MSC 2013] Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[PSM 2014] Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[SAD 2018] Smurf: Self-service string matching using random forests. PVLDB, 12(3):278–291, 2018.
[SKF 2017] Bidirectional Attention Flow for Machine Comprehension. In ICLR 2017.
[TSD 2018] LinkNBed: Multi-Graph Representation Learning with Entity Linkage. In ACL pages 252-262, 2018.
[WJ 2017] Machine Comprehension Using Match-LSTM and Answer Pointer. In ICLR 2017.
[WLG 2016] A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. AAAI 2016: 2835-2841.
[YDL 2018] QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In ICLR 2018.
Part 4 Open Problem
4.1 General purposed pipeline for string similarity search and join.
4.2 Accelerating the machine learning-based approaches.
4.3 Combining Human-in-the-loop with machine learning approaches for better performance.
Presenters
Jiaheng Lu is an Associate Professor at the University of Helsinki, Finland. His main research interests lie in the big data management and database systems, and specifically in the challenge of efficient data processing from real-life, massive data repository and Web. He has written four books on Hadoop and NoSQL databases, and more than 70 journal and conference papers published in SIGMOD, VLDB, TODS, and TKDE, etc.
Chunbin Lin is a software engineer at Amazon Web Services (AWS) and he is working on AWS Redshift. He completed his Ph.D. in computer science at the University of California, San Diego (UCSD) in 2018. His research interests are database management and big data management. He has more than 20 journal and conference papers published in SIGMOD, VLDB, VLDB J, and TODS, etc.
Jin Wang is a fourth year PhD student in University of California, Los Angeles. Before joining UCLA, he obtained his master degree in computer science from Tsinghua University in the year 2015. His research interest mainly lies in the field of data management and text analytics. He has published more than 10 papers in top-tier conferences and journals like ICDE, IJCAI, EDBT, TKDE and VLDB Journal.
Chen Li is a professor in the Department of Computer Science at UC Irvine. He received his Ph.D. degree in Computer Science from Stanford University. His research interests are in the field of data management, including data-intensive computing, query processing and optimization, visualization, and text analytics.