Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue why EM can be viewed as a special class of DS problems and thus can benefit from system building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, sophisticated on-premise tools for power users and self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.
Entity matching (EM) finds data instances that refer to the same real-world entity, such as tuples (David Smith, UW-Madison) and (D. Smith, UWM). This problem, also known as entity resolution, record linkage, deduplication, data matching, et cetera, has been a long-standing challenge in the database, AI, KDD, and Web communities.2,6
As data-driven applications proliferate, EM will become even more important. For example, to analyze raw data for insights, we often integrate multiple raw data sets into a single unified one, before performing the analysis, and such integration often requires EM. To build a knowledge graph, we often start with a small graph and then expand it with new data sets, and such expansion requires EM. When managing a data lake, we often use EM to establish semantic linkages among the disparate data sets in the lake.
Given the growing importance of EM, in the summer of 2015, together with industrial partners, we started the Magellan project at the University of Wisconsin-Madison, to develop EM solutions.9 Numerous works have studied EM, but most of them develop EM algorithms for isolated steps in the EM workflow. In contrast, we seek to build EM systems, as we believe such systems are critical for advancing the EM field. Among others, they help evaluate EM algorithms, integrate R&D efforts, and make practical impacts, the same way systems such as System R, Ingres, Apache Hadoop, and Apache Spark have helped advance the fields of relational database management systems (RDBMSs) and Big Data.
Of course, Magellan is not the first project to build EM systems. Many such systems have been developed.9,2 However, as far as we can tell, virtually all of them have been built as stand-alone monolithic EM systems, or parts of larger monolithic systems that perform data cleaning and integration.2,6,9 These systems often employ the RDBMS building template. That is, given an EM workflow composed of logical operators (specified declaratively or via a GUI by a user), they compile this workflow into one consisting of physical operators and then optimize and execute the compiled workflow.
In contrast, Magellan develops a radically different system building template for EM, by leveraging ideas from the field of data science (DS). Although DS is still "young," several common themes have emerged.
- For many DS tasks, there is a general consensus that it is not possible to fully automate the two stages of developing and productionizing DS workflows. So users must "be in the loop," and many step-by-step guides that tell users how to execute the above two stages have been developed.
- Many "pain points" in these guides, that is, steps that are time-consuming for users, have been identified, and (semi)-automated tools have been developed to reduce user effort.
- Users often use multiple execution environments (EE), such as on-premise, cloud, and mobile, switching among them. So tools have been developed for all of these EEs.
- Finally, within each EE, tools have been designed to be atomic and interoperable, forming a growing ecosystem of DS tools. Examples include PyData, the ecosystem of 184,000+ interoperable Python packages (as of June 2019), R, tidyverse, and many others.4
We observed that EM bears strong similarities to many DS tasks.9 As a result, we leveraged the above ideas to build a new kind of EM system. Specifically, we develop guides that tell users how to perform EM step by step, identify the "pain points" in the guides, and then develop tools to address these pain points. We develop tools for multiple execution environments (EEs), such that within each EE, tools interoperate and build upon existing DS tools in that EE.
Thus, the notion of "system" in Magellan has changed. It is no longer a stand-alone monolithic system such as RDBMSs or most current EM systems. Instead, this new "system" spans multiple EEs. Within each EE, it provides a growing ecosystem of interoperable EM tools, situated in a larger ecosystem of DS tools. Finally, it provides detailed guides that tell users how to use these tools to perform EM.
Since the summer of 2015, we have pursued the above EM agenda and developed small ecosystems of EM tools for on-premise and cloud EEs. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. Developing them has also raised many research challenges.4
In this paper, we describe the above progress, focusing on the system aspects. The next section discusses the EM problem and related work. Section 3 discusses the main system building themes of data science and the Magellan agenda. Sections 4–5 discuss PyMatcher and CloudMatcher, two current thrusts of Magellan. Section 6 discusses the application of Magellan tools to real-world EM problems. Section 7 discusses lessons learned and ongoing work. Section 8 concludes by exploring how to apply the Magellan template to other tasks in data exploration, cleaning, and integration. More information about Magellan can be found at sites.google.com/site/anhaidgroup/projects/magellan.
2. The Entity Matching Problem
Entity matching, also known as entity resolution, record linkage, data matching, et cetera, has received enormous attention.2, 6, 5, 13 A common EM scenario finds all tuple pairs that match, that is, refer to the same real-world entity, between two tables A and B (see Figure 1). Other EM scenarios include matching tuples within a single table, matching into a knowledge graph, matching XML data, et cetera.2
When matching two tables A and B, considering all pairs in A × B often takes too long. So users often execute a blocking step followed by a matching step.2 The blocking step employs heuristics to quickly remove obviously nonmatching tuple pairs (e.g., persons residing in different states). The matching step applies a matcher to the remaining pairs to predict matches.
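The two-step structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy tables, the state-based blocker, and the last-name matcher are all hypothetical and are far simpler than the blockers and matchers used in practice.

```python
from itertools import product

# Hypothetical toy tables; each tuple is (id, name, state).
table_a = [
    ("a1", "David Smith", "WI"),
    ("a2", "Joe Wilson", "CA"),
]
table_b = [
    ("b1", "D. Smith", "WI"),
    ("b2", "Dave Smith", "NY"),
    ("b3", "Joseph Wilson", "CA"),
]


def block_on_state(a_rows, b_rows):
    """Blocking: keep only pairs whose state agrees (a cheap heuristic)."""
    return [(a, b) for a, b in product(a_rows, b_rows) if a[2] == b[2]]


def last_name_matcher(a, b):
    """Matching: a deliberately naive rule that compares last names."""
    return a[1].split()[-1].lower() == b[1].split()[-1].lower()


# Blocking shrinks the 6 pairs of A x B to a small candidate set;
# the matcher is then applied only to the surviving pairs.
candidates = block_on_state(table_a, table_b)
matches = [(a[0], b[0]) for a, b in candidates if last_name_matcher(a, b)]
```

Note that the blocker here also discards the pair (a1, b2), which shows why a blocking heuristic has to be chosen with care: an overly aggressive blocker removes true matches before the matcher ever sees them.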
The vast body of work in EM falls roughly into three groups: algorithmic, human-centric, and system. Most EM works develop algorithmic solutions for blocking and matching, exploiting rules, learning, clustering, crowdsourcing, external data, et cetera.2,6,5 The focus is on improving accuracy, minimizing runtime, and minimizing cost (e.g., crowd-sourcing fee), among others.13,6
A smaller but growing body of EM work (e.g., HILDA1) studies human-centric challenges, such as crowdsourcing, effective user interaction, and user behavior during the EM process.
The third group of EM work develops EM systems. In 2016, we surveyed 18 noncommercial systems (e.g., D-Dupe, Febrl, Dedoop, and Nadeef) and 15 commercial ones (e.g., Tamr, Informatica, and IBM InfoSphere).9,12 Most of these systems are stand-alone monoliths, built using the RDBMS template. Specifically, such a system has a set of logical operations (e.g., blocking and matching) with multiple physical implementations. Given an EM workflow (composed of these operations) specified by the user using a GUI or a declarative language, the system translates the workflow into an execution plan and then optimizes and executes this plan.
3. The Magellan Agenda
We now discuss system building ideas in the field of data science (DS). Then we argue that EM is very similar in nature to DS and thus can benefit from these ideas. Finally, we suggest a system building agenda for Magellan.
System Building Ideas of Data Science: Although the DS field has been growing rapidly, we are not aware of any explicit description of its "system template." But our examination reveals the following important ideas.
First, many DS tasks distinguish between two stages, development and production, as these stages raise different challenges. The development stage finds an accurate DS workflow, often using data samples. This raises challenges in data exploration, profiling, understanding, cleaning, model fitting and evaluation, et cetera. The production (a.k.a. deployment) stage executes the discovered DS workflow on the entirety of data, raising challenges in scaling, logging, crash recovery, monitoring, et cetera.
Second, DS developers do not assume that the above two stages can be automated. Users often must be "in the loop" and often do not know what to do, how to start, et cetera. As a result, developers provide detailed guides that tell users how to solve a DS problem, step by step. Numerous guides have been developed, described in books, papers, Jupyter notebooks, training camps, blogs, tutorials, et cetera.
It is important to note that such a guide is not a user manual on how to use a tool. Rather, it is a step-by-step instruction to the user on how to start, when to use which tools, and when to do what manually, in order to solve the DS task end to end. Put differently, it is an (often complex) algorithm for the user to follow. (See Section 4 for an example.)
Third, even without tools, users should be able to follow a guide and manually execute all the steps to solve a DS task. But some of the steps can be very time-consuming. DS developers have identified such "pain point" steps and developed (semi-)automatic tools to reduce the human effort.
Fourth, these tools target not just power users, but also lay users (as such users increasingly also need to work on the data), and use a variety of techniques, for example, machine learning (ML), RDBMS, visualization, effective user interaction, Big Data scaling, and cloud technologies.
Fifth, it is generally agreed that users will often use multiple execution environments (EEs), such as on-premise, cloud, and mobile, switching among these EEs as appropriate, to execute a DS task. As a result, tools have been developed for all of these EEs.
Finally, within each EE, tools have been designed to be atomic (i.e., each tool does just one thing) and interoperable, forming a growing ecosystem of DS tools. Popular examples of such ecosystems include PyData, R, tidyverse, and many others.4
The Similarities between EM and DS: We argue that EM bears strong similarities to many DS tasks. EM often shares the same two stages: development, where users find an accurate EM workflow using data samples, and production, where users execute the workflow on the entirety of data (see Section 4 for an example).
The above two EM stages raise challenges that are remarkably similar to those of DS tasks, for example, data understanding, model fitting, scaling, et cetera. Moreover, there is also an emerging consensus that it is not possible to fully automate the above two stages for EM. Similar to DS, this also raises the need for step-by-step guides that tell users how to be "in the loop," as well as the need for identifying "pain points" in the guide and developing tools for these pain points (to reduce user effort). Finally, these tools also have to target both power and lay users, and use a variety of techniques, for example, ML, RDBMS, visualization, scaling, et cetera.
Thus, we believe EM can be viewed as a special class of DS problems, which focuses on finding the semantic matches, for example, "(David Smith, UWM) = (D. Smith, UW-Madison)." As such, we believe EM can benefit from the system building ideas in DS.
Our Agenda: Using the above "system template" of DS, we developed the following agenda for Magellan. First, we identify common EM scenarios. Next, we develop how-to guides to solve these scenarios end to end, paying special attention to telling the user exactly what to do. Then we identify the pain points in the guides and develop (semi-)automatic tools to reduce user effort. We design tools to be atomic and interoperable, as a part of a growing ecosystem of DS tools. Developing these tools raises research challenges, which we address. Finally, we work with users (e.g., domain scientists, companies, and students) to evaluate our EM tools.
In the past few years, we have been developing EM tools for two popular execution environments: on-premise and cloud. Specifically, PyMatcher is a small ecosystem of on-premise EM tools for power users, built as a part of the PyData ecosystem of DS tools, and CloudMatcher is a small ecosystem of cloud EM tools for lay users, built as a part of the AWS ecosystem of DS tools. The next two sections briefly describe these ecosystems.
4. PyMatcher
We now describe PyMatcher, an EM system developed for power users in the on-premise execution environment.
Problem Scenarios: In this first thrust of Magellan, we consider an EM scenario that commonly occurs in practice, where a user U wants to match two tables (e.g., see Figure 1), with as high matching accuracy as possible, or with accuracy exceeding a threshold. U is a "power user" who knows programming, EM, and ML.
Developing How-to Guide: We developed an initial guide based on our experience and then kept refining it based on user feedback and on watching how real users do EM. As of Nov 2018, we have developed a guide for the above EM scenario, which consists of two smaller guides for the development and production stages, respectively. Here, we focus on the guide for the development stage (briefly discussing the guide for the production stage at the end of this section).
This guide (which is illustrated in Figure 2) heavily uses ML. To explain it, suppose user U wants to match two tables A and B, each having 1 million tuples. Trying to find an accurate workflow using these two tables would be too time-consuming, because they are too big. Hence, U will first "down sample" the two tables to obtain two smaller tables A' and B', each having 100K tuples, say (see the figure).
Next, suppose the EM system provides two blockers X and Y. Then, U experiments with these blockers (e.g., executing both on Tables A' and B' and examining their output) to select the blocker judged the best (according to some criterion). Suppose U selects blocker X. Then next, U executes X on Tables A' and B' to obtain a set of candidate tuple pairs C.
Next, the user takes a sample S from C and labels the pairs in S as "match"/"no-match" (see the figure). Let the labeled set be G, and suppose the EM system provides two learning-based matchers U and V (e.g., decision trees and logistic regression). Then, the user uses the labeled set G to perform cross validation for matchers U and V. Suppose matcher V produces higher matching accuracy (such as an F1 score of 0.93; see the figure). Then, the user selects V as the matcher and applies it to the set C to predict "match"/"no-match," shown as "+" or "-" in the figure. Finally, the user may perform a quality check (by examining a sample of the predictions and computing the resulting accuracy) and then go back and debug and modify the previous steps as appropriate. This continues until the user is satisfied with the accuracy of the EM workflow.
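The matcher-selection step can be illustrated with a small, self-contained Python sketch. It is not PyMatcher code: the labeled set, the feature names name_sim and addr_sim, and the one-feature threshold "matchers" are hypothetical stand-ins for the decision trees and logistic regression mentioned above. Only the k-fold cross-validation logic is the point.

```python
def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_threshold(rows, feature):
    """'Train' a one-feature matcher: pick the cut-off with the best F1."""
    labels = [r["label"] for r in rows]
    cuts = sorted({r[feature] for r in rows})
    return max(cuts, key=lambda c: f1_score([r[feature] >= c for r in rows], labels))


def cross_validate(rows, feature, k=3):
    """Average held-out F1 over k folds (fold i holds rows i, i+k, ...)."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        cut = fit_threshold(train, feature)
        scores.append(f1_score([r[feature] >= cut for r in test],
                               [r["label"] for r in test]))
    return sum(scores) / k


# Hypothetical labeled set G: similarity features plus a match label.
G = [
    {"name_sim": 0.9, "addr_sim": 0.2, "label": True},
    {"name_sim": 0.1, "addr_sim": 0.8, "label": False},
    {"name_sim": 0.8, "addr_sim": 0.9, "label": True},
    {"name_sim": 0.2, "addr_sim": 0.1, "label": False},
    {"name_sim": 0.7, "addr_sim": 0.3, "label": True},
    {"name_sim": 0.3, "addr_sim": 0.7, "label": False},
]

# Select the candidate matcher with the higher cross-validated F1.
cv_scores = {f: cross_validate(G, f) for f in ("name_sim", "addr_sim")}
best_feature = max(cv_scores, key=cv_scores.get)
```

With a labeled set this small the individual fold scores are noisy, which is one reason the guide asks the user to follow selection with a separate quality check.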
Developing Tools for the Steps of the Guide: Over the past 3.5 years, 13 developers have developed tools for the steps of the above guide (see Govind et al.8). As of September 2019, PyMatcher consists of 6 Python packages with 37K lines of code and 231 commands (and is open sourced9). It is built on top of 16 different packages in the PyData ecosystem (e.g., pandas and scikit-learn). As far as we can tell, PyMatcher is the most comprehensive open-source EM system today, in terms of the number of features it supports.
Principles for Developing Tools & Packages: In PyMatcher, each tool is roughly equivalent to a Python command, and tools are organized into Python packages. We adopted five principles for developing tools and packages:
- They should interoperate with one another, and with existing PyData packages.
- They should be atomic, that is, each does only one thing.
- They should be self-contained, that is, they can be used by themselves, not relying on anything outside.
- They should be customizable.
- They should be efficient for both humans and machines.
We now illustrate these principles. As an example of facilitating interoperability among the commands of different packages, we use only generic well-known data structures such as Pandas DataFrame to hold tables (e.g., the two tables A and B to match and the output table after blocking).
Designing each command, that is, tool, to be "atomic" is somewhat straightforward. Designing each package to be so is more difficult. Initially, we designed just one package for all tools of all steps of the guide. Then, as soon as it was obvious that a set of tools form a coherent stand-alone group, we extracted it as a new package. However, this extraction is not always easy to do, as we will discuss soon.
Ignoring self-containment for now, to make tools and packages highly customizable, we expose all possible "knobs" for the user to tweak and provide easy ways for him/her to do so. For example, given two tables A and B to match, PyMatcher can automatically define a set of features (e.g., jaccard(3gram(A.name), 3gram(B.name))). We store this set of features in a global variable F. We give users ways to delete features from F and to declaratively define more features and then add them to F.
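To make the feature example concrete, here is a minimal sketch of a Jaccard-over-3-grams feature and an editable feature set. It is plain Python, not the PyMatcher API: the function names, the dictionary standing in for F, and the sample tuples are assumptions made for illustration.

```python
def three_grams(text):
    """Return the set of character 3-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(x, y):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Hypothetical feature set F: feature name -> function over a tuple pair.
F = {
    "name_jaccard_3gram": lambda a, b: jaccard(three_grams(a["name"]),
                                               three_grams(b["name"])),
    "age_exact": lambda a, b: float(a["age"] == b["age"]),
}

# The "knobs": a user can delete a feature or add one of their own.
del F["age_exact"]
F["name_exact"] = lambda a, b: float(a["name"] == b["name"])

a = {"name": "David Smith", "age": 40}
b = {"name": "Dave Smith", "age": 40}
feature_vector = {name: f(a, b) for name, f in F.items()}
```

Keeping features as named entries in an ordinary container is what makes them easy to inspect, delete, and extend, at the cost of introducing shared state that every command must agree on.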
As an example of making a tool, that is, a command, X efficient for a user, we can make X easy to remember and specify (i.e., it does not require the user to enter many arguments). Often, this also means that we provide multiple variations for X, because each user may best remember a particular variation.
Command X is efficient for the machine if it minimizes runtime and space. For instance, let A and B be two tables with schema (id,name,age). Suppose X is a blocker command that when applied to A and B produces a set of tuple pairs C. Then, to save space, X should not use (A.id, A.name, A.age, B.id, B.name, B.age), but only (A.id, B.id) as the schema of C.
If so, we need to store the "metadata information" that there is a key-foreign key (FK) relationship between tables A, B, and C. Storing this metadata in the tables themselves is not an option if we have already elected to store the tables using pandas DataFrames (which cannot store such metadata, unless we redefine the DataFrame class). So we can use a stand-alone catalog Q to store such metadata for the tables.
But this raises a problem. If we use a command Y of some other package to remove a tuple from table A, Y is not even aware of catalog Q and so will not modify the metadata stored in Q. As a result, the metadata is now incorrect: Q still claims that an FK relationship exists between tables A and C. But this is no longer true.
To address this problem, we can design the tools to be self-contained. For example, if a tool Z is about to operate on table C and needs the metadata "there is an FK constraint between A and C" to be true, it will first check that constraint. If the constraint is still true, then Z will proceed normally. Otherwise, Z outputs a warning that the FK constraint is no longer correct and then stops or proceeds (depending on the nature of the command). Thus, Z is self-contained in that it does not rely on anything outside to ensure the correctness of the metadata that it needs.
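The self-containment idea can be sketched as follows. This is an assumption-laden illustration rather than PyMatcher's implementation: the tables are plain lists of dictionaries instead of DataFrames, and the catalog layout and the names fk_holds and count_pairs are hypothetical.

```python
# Hypothetical stand-alone catalog: maps a candidate-set name to metadata.
catalog = {"C": {"left": "A", "right": "B",
                 "left_fk": "a_id", "right_fk": "b_id", "key": "id"}}

tables = {
    "A": [{"id": 1, "name": "David Smith"}, {"id": 2, "name": "Joe Wilson"}],
    "B": [{"id": 7, "name": "D. Smith"}],
    "C": [{"a_id": 1, "b_id": 7}, {"a_id": 2, "b_id": 7}],  # ids only
}


def fk_holds(cand_name):
    """Re-check that every id pair in the candidate set points at live tuples."""
    meta = catalog[cand_name]
    left_ids = {row[meta["key"]] for row in tables[meta["left"]]}
    right_ids = {row[meta["key"]] for row in tables[meta["right"]]}
    return all(p[meta["left_fk"]] in left_ids and p[meta["right_fk"]] in right_ids
               for p in tables[cand_name])


def count_pairs(cand_name, strict=True):
    """A self-contained tool: verify its own metadata before operating."""
    if not fk_holds(cand_name):
        if strict:
            raise ValueError("FK constraint in catalog no longer holds")
        print("warning: FK constraint in catalog no longer holds")
    return len(tables[cand_name])


ok_before = fk_holds("C")
# An outside command removes a tuple from A without touching the catalog.
tables["A"] = [row for row in tables["A"] if row["id"] != 2]
ok_after = fk_holds("C")
```

The check costs a pass over the candidate set on every call, which is the price of not trusting the catalog. Whether a tool should stop or merely warn is left here to a strict flag, mirroring the "stops or proceeds" choice described above.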
Trade-Offs Among the Principles: It should be clear by now that the above principles often interact and conflict with one another. For example, as discussed, to make commands interoperate, we may use pandas DataFrames to hold the tables, and to make commands efficient, we may need to store metadata such as FK constraints. But this means the constraints should be stored in a global catalog. This makes extracting a set of commands to create a new package difficult, because the commands need access to this global catalog.
There are many examples such as this, which together suggest that designing an "ecosystem" of tools and packages that follow the above principles requires making trade-offs. We have made several such trade-offs in designing PyMatcher. But obtaining a clear understanding of these trade-offs and using it to design a better ecosystem is still ongoing work.
The Production Stage: So far, we have focused on the development stage for PyMatcher and have developed only a basic solution for the production stage. Specifically, we assume that after the development stage, the user has obtained an accurate EM workflow W, which is captured as a Python script (of a sequence of commands). We have developed tools that can execute these commands on a multicore single machine, using customized code or Dask (which is a Python package developed by Anaconda that can be used to quickly modify a Python command to run on multiple cores, among others). We have also developed a how-to guide that tells the user how to scale using these tools.
5. CloudMatcher
We now describe CloudMatcher, an EM system developed for lay users in the cloud environment.
Problem Scenarios: We use the term "lay user" to refer to a user who does not know programming, ML, or EM, but understands what it means to be a match (and thus can label tuple pairs as match/no-match). Our goal is to build a system that such lay users can use to match two tables A and B. We call such systems self-service EM systems.
Developing an EM System for a Single User: In a recent work,3 we have developed Falcon, a self-service EM system that can serve a single user. As CloudMatcher builds on Falcon, we begin by briefly describing Falcon.
To match two tables A and B, like most current EM solutions, Falcon performs blocking and matching, but it makes both stages self-service (see Figure 3). In the blocking stage (Figure 3a), it takes a sample S of tuple pairs (Step 1) and then performs active learning with the lay user on S (in which the user labels tuple pairs as match/no-match) to learn a random forest F (Step 2), which is a set of n decision trees. The forest F declares a tuple pair p a match if at least αn trees in F declare p a match (where α is prespecified).
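The voting rule of the forest can be written down directly. The sketch below assumes each tree is simply a Python callable that returns True for "match"; the three toy trees and the feature names are hypothetical, and no learning is shown.

```python
def forest_predicts_match(trees, pair, alpha=0.5):
    """Declare a match if at least a fraction alpha of the trees vote 'match'."""
    votes = sum(1 for tree in trees if tree(pair))
    return votes / len(trees) >= alpha


# Hypothetical trees over precomputed features of a pair of book tuples.
trees = [
    lambda p: p["isbn_match"],
    lambda p: p["isbn_match"] and p["pages_match"],
    lambda p: p["title_sim"] > 0.8,
]

pair = {"isbn_match": True, "pages_match": False, "title_sim": 0.9}
lenient = forest_predicts_match(trees, pair, alpha=0.5)  # 2 of 3 votes suffice
strict = forest_predicts_match(trees, pair, alpha=0.9)   # needs all 3 votes
```

Raising α trades recall for precision: with α = 0.9 this pair, on which only two of the three trees agree, is no longer declared a match.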
In Step 3, Falcon extracts all tree branches from the root of a tree (in random forest F) to a "No" leaf as candidate blocking rules. For example, the tree in Figure 4a predicts that two book tuples match only if their ISBNs match and their numbers of pages match. Figure 4b shows two blocking rules extracted from this tree.
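Rule extraction amounts to enumerating root-to-"No" paths. The following sketch encodes a tree like the one described above as nested tuples and collects those paths. The encoding and the function names are assumptions for illustration, not Falcon's data structures.

```python
# A hypothetical encoding of a tree like the one described above:
# an internal node is (feature_name, subtree_if_false, subtree_if_true);
# a leaf is the string "Yes" (match) or "No" (no match).
tree = ("isbn_match", "No", ("pages_match", "No", "Yes"))


def extract_blocking_rules(node, path=()):
    """Collect every root-to-"No" path as a list of (feature, value) tests."""
    if node == "No":
        return [list(path)]
    if node == "Yes":
        return []
    feature, if_false, if_true = node
    return (extract_blocking_rules(if_false, path + ((feature, False),))
            + extract_blocking_rules(if_true, path + ((feature, True),)))


def is_blocked(pair, rules):
    """A pair is dropped if it satisfies all tests of at least one rule."""
    return any(all(pair[f] == v for f, v in rule) for rule in rules)


rules = extract_blocking_rules(tree)
# rules[0]: ISBNs do not match                      -> no match
# rules[1]: ISBNs match but page counts do not      -> no match
```

Each extracted rule is only a candidate. Because the tree was learned from a sample, a rule may be imprecise on the full tables, which is why the rules are evaluated before they are used.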
Falcon enlists the lay user to evaluate the extracted blocking rules and retains only the precise rules. In Step 4, Falcon executes these rules on tables A and B to obtain a set of candidate tuple pairs C. This completes the blocking stage (Figure 3a). In the matching stage (Figure 3b), Falcon performs active learning with the lay user on C to obtain another random forest G and then applies G to C to predict matches (Steps 5 and 6).
Falcon is well suited for lay users, who only have to label tuple pairs as match/no-match. We implemented Falcon as CloudMatcher 0.1 and deployed it as shown in Figure 5, with the goal of providing self-service EM to domain scientists at UW. Any scientist wanting to match two tables A and B can go to the homepage of CloudMatcher, upload the tables, and then label a set of tuple pairs (or ask crowd workers, say on Mechanical Turk, to do so). CloudMatcher uses the labeled pairs to block and match, as described earlier, and then returns the set of matches between A and B.
Developing an EM System for Multiple Users: We soon recognized, however, that CloudMatcher 0.1 does not scale, because it can execute only one EM workflow at a time. So we designed CloudMatcher 1.0, which can efficiently execute multiple concurrent EM workflows (e.g., submitted by multiple scientists at the same time). Developing CloudMatcher 1.0 was highly challenging.7 Our solution was to break each submitted EM workflow into multiple DAG fragments, where each fragment performs only one kind of task, for example, interaction with the user, batch processing of data, crowdsourcing, et cetera. Next, we execute each fragment on an appropriate execution engine. We developed three execution engines: a user interaction engine, a crowd engine, and a batch engine. To scale, we interleave the execution of DAG fragments coming from different EM workflows and coordinate all of the activities using a "metamanager." See Govind et al.7 for more details.
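A rough sketch of the fragmenting idea is shown below. It makes strong simplifying assumptions: each workflow is a linear sequence rather than a general DAG, the task names are invented, and the dependency tracking and coordination performed by the metamanager are omitted entirely. It only shows how tasks of one kind can be grouped into fragments and routed to per-kind engine queues.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical linearized workflows: (task name, kind of task).
workflows = {
    "w1": [("sample", "batch"), ("gen_features", "batch"),
           ("label_pairs", "user"), ("apply_rules", "batch")],
    "w2": [("sample", "batch"), ("label_pairs", "crowd"),
           ("apply_matcher", "batch")],
}


def split_into_fragments(tasks):
    """Group consecutive tasks of the same kind into one fragment."""
    return [(kind, [name for name, _ in group])
            for kind, group in groupby(tasks, key=lambda t: t[1])]


# Route each fragment to the queue of the engine that handles its kind.
engine_queues = defaultdict(list)
for wf_id, tasks in workflows.items():
    for kind, names in split_into_fragments(tasks):
        engine_queues[kind].append((wf_id, names))
```

Because fragments from different workflows land in the same queue, the batch engine can work on w2 while w1 is waiting for a user to finish labeling, which is the intuition behind interleaving.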
Providing Multiple Basic Services: CloudMatcher 1.0 implemented only the above rigid Falcon EM workflow. As we interacted with real users, however, we observed that many users want to flexibly customize and experiment with different EM workflows. For example, a user may already know a blocking rule, so he or she wants to skip the step of learning such rules. Yet another user may want to use CloudMatcher just to label tuple pairs (e.g., to be used in PyMatcher).
So we developed CloudMatcher 2.0, which extracts a set of basic services from the Falcon EM workflow and makes them available on CloudMatcher, and then allows users to flexibly combine them to form different EM workflows (such as the original Falcon one). Appendix C of Govind et al.8 shows the list of services that we currently provide. Basic services include uploading a data set, profiling a data set, editing the metadata of a data set, sampling, generating features, training a classifier, et cetera. We have combined these basic services to provide composite services, such as active learning, obtaining blocking rules, and Falcon. For example, the user can invoke the "Get blocking rules" service to ask CloudMatcher to suggest a set of blocking rules that he/she can use. As another example, the user can invoke the "Falcon" service to execute the end-to-end Falcon EM workflow.
6. Real-World Applications
We now discuss real-world applications of PyMatcher and CloudMatcher, as well as their typical usage patterns. In the discussion here, we measure EM accuracy using precision, the fraction of predicted matches that are correct, and recall, the fraction of true matches that are returned in the set of predicted matches.
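These two measures, together with the F1 score used earlier, can be computed from sets of tuple pairs as in the following sketch. The pair identifiers are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for sets of predicted and true match pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical predicted and true matches, as (id in A, id in B) pairs.
predicted = {("a1", "b1"), ("a2", "b3"), ("a3", "b9")}
actual = {("a1", "b1"), ("a2", "b3"), ("a4", "b4"), ("a5", "b5")}

p, r = precision_recall(predicted, actual)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * p * r / (p + r) if p + r else 0.0
```

Here two of the three predicted matches are correct (precision 2/3) and two of the four true matches are found (recall 1/2).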
Applications of PyMatcher: PyMatcher has been successfully applied to multiple real-world EM applications in both industry and domain sciences. It has been pushed into production in most of these applications and has attracted significant funding (e.g., $950K from UW-Madison, $1.1M from NSF, and $480K from industry). It has also been used by 400+ students in 5 data science classes at UW-Madison. Finally, it has resulted in multiple publications, both in the database field and in domain sciences.4
Table 1 summarizes the real-world applications. The first column shows that PyMatcher has been used in a variety of companies and domain sciences. The second column shows that PyMatcher has been used for three purposes: debugging an EM pipeline in production (Walmart), building a better EM pipeline than an existing one (economics and land use), and integrating disparate data sets (e.g., Recruit, Marshfield Clinic, and limnology).
The third column shows the main results. This column shows that PyMatcher found EM workflows that were significantly better than the EM workflows in production in three cases: Walmart, Economics (UW), and Land Use (UW). The fourth column indicates that, based on those results, PyMatcher has been put into production in 6 out of 8 applications. This is defined as either (a) PyMatcher is used in a part of an EM pipeline in production or (b) the data resulting from using PyMatcher has been pushed into production, that is, sent to and consumed by real-world customers.
The fifth column shows that in all cases that we know of, PyMatcher does not require a large team to work on it (and the teams are only part-time). The final column lists additional notable results. (Note that funding from UW came from highly selective internal competitions.) More details about these applications can be found in Govind et al.8 and Konda et al.10
Applications of CloudMatcher: CloudMatcher has been successfully applied to multiple EM applications and has attracted commercial interest. It has been in production at American Family Insurance since the summer of 2018 and is being considered for production at two other major companies.
Table 2 summarizes CloudMatcher's performance on 13 real-world EM tasks. The first two columns show that CloudMatcher has been used in 5 companies, 1 nonprofit, and 1 domain science group, for a variety of EM tasks. The next two columns show that CloudMatcher was used to match tables of varying sizes, from 300 to 4.9M tuples.
Ignoring the next two columns on accuracy, let us zoom in on the three columns under "Cost" in Table 2. The first column ("Questions") lists the number of questions CloudMatcher had to ask, that is, the number of tuple pairs to be labeled. This number ranges from 160 to 1200 (the upper limit for the current CloudMatcher).
In the next column ("Crowd"), a cell such as "$72" indicates that for the corresponding EM task, CloudMatcher used crowd workers on Mechanical Turk to label tuple pairs, and it cost $72. A cell "-" indicates that the task did not use crowdsourcing. It used a single user instead, typically the person who submitted the EM task, to label, and thus incurred no monetary cost.
In the third column ("Compute"), a cell such as "$2.33" indicates that the corresponding EM task used AWS, which charged $2.33. A cell such as "-" indicates that the EM task used a local machine owned by us, and thus incurred no monetary cost.
Turning our attention to the last three columns under "Time," the first column ("User/Crowd") lists the total labeling time, either by a single user or by the Mechanical Turk crowd. We can see that when a single user labeled, it was typically quite fast, with time from 9m to 2h. When a crowd labeled, time was from 22h to 36h (this does not mean crowd workers labeled nonstop and took that long; it just means Mechanical Turk took that long to finish the labeling task). These results suggest that CloudMatcher can execute a broad range of EM tasks with very reasonable labeling time from both users and crowd workers. The next two columns under "Time" show the machine time and the total time.
We now zoom in on the accuracy. The columns "Precision" and "Recall" show that in all cases except three, CloudMatcher achieves high accuracy, often in the 90% range. The three cases of limited accuracy are "Vehicles," "Addresses," and "Vendors." A domain expert at American Family Insurance (AmFam) labeled tuple pairs for "Vehicles." But the data was so incomplete that even he was uncertain in many cases whether a tuple pair matched. At some point, he realized that he had incorrectly labeled a set of tuple pairs, but CloudMatcher provided no way for him to "undo" the labeling, hence the low accuracy. This EM task is currently being re-executed at AmFam.
For "Vendors," it turned out that the portion of data that consists of Brazilian vendors is simply incorrect: the vendors entered some generic addresses instead of their real addresses. As a result, even users cannot match such vendors. Once we removed such vendors from the data, the accuracy significantly improved (see the row for "Vendors (no Brazil)"). It turned out that "Addresses" had similar dirty data problems, which explained the low recall of 76–81%.
Typical Usage Patterns: We observed the following patterns of using CloudMatcher. When working with enterprise customers, a common scenario is that the EM team, which typically consists of only a few developers, is overwhelmed with numerous EM tasks sent in by many business teams across the enterprise.
To address this problem, the EM team asks business teams to use CloudMatcher to solve their EM tasks (in a self-service fashion), contacting the EM team only if CloudMatcher does not reach the desired EM accuracy. In those cases, the EM team builds on the results of CloudMatcher but uses PyMatcher to debug and improve the accuracy further.
We found that the EM team also often uses CloudMatcher to solve their own EM tasks, because it can be used to quickly solve a large majority of EM tasks, which tend to be "easy," allowing the EM team to focus on solving the small number of more difficult EM tasks using PyMatcher.
For domain sciences at UW, some teams used only CloudMatcher, either because they do not have EM and ML expertise or they found the accuracy of CloudMatcher acceptable. Some other teams preferred PyMatcher, as it gave them more customization options and higher EM accuracies.
Finally, some customers used both and switched between them. For example, a customer may use PyMatcher to experiment and create a set of blocking rules and then use CloudMatcher to execute these rules on large tables.
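As a rough illustration of what a blocking rule does, the sketch below keeps only those tuple pairs that could plausibly match and discards the rest. It is plain Python written for this illustration, not the PyMatcher or CloudMatcher API; the rule, the schema (id, name, state), and the records are all hypothetical.

```python
from itertools import product

def survives_blocking(a, b):
    """Hypothetical blocking rule: keep a pair only if the two records
    have the same state and share at least one lowercase name token."""
    if a["state"] != b["state"]:
        return False
    return bool(set(a["name"].lower().split()) & set(b["name"].lower().split()))

def block(table_a, table_b):
    """Return the candidate pairs (id_a, id_b) that survive the rule.
    This scans all pairs, so it is only suitable for small samples."""
    return [(a["id"], b["id"])
            for a, b in product(table_a, table_b)
            if survives_blocking(a, b)]

table_a = [{"id": "a1", "name": "Dave Smith", "state": "WI"},
           {"id": "a2", "name": "Joe Wilson", "state": "CA"}]
table_b = [{"id": "b1", "name": "David Smith", "state": "WI"},
           {"id": "b2", "name": "Joe Welson", "state": "WI"},
           {"id": "b3", "name": "Joe Wilson", "state": "CA"}]

candidates = block(table_a, table_b)
print(candidates)  # prints [('a1', 'b1'), ('a2', 'b3')]
```

A rule like this can be tuned interactively on a small sample. Running it over large tables requires avoiding the all-pairs scan (for example, by indexing on the blocking attributes), which is where a scalable execution environment helps.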
We now discuss lessons learned and ongoing work.
The Need for How-to Guides: Our work makes clear that it is very difficult to fully automate the EM process. The fundamental reason is that at the start, the user often does not fully understand the data, the match definition, or even what he or she wants. For example, in a recent case study with PyMatcher,10 we found that the users repeatedly revised their match definition during the EM process, as they gained a better understanding of the data.
This implies that the user must "be in the loop" and that a guide is critical for telling the user what to do, step by step. In addition, we found that these guides provide assurance to our customers that we can help them do EM end to end. The guides provide a common vocabulary and roadmap for everyone on the team to follow, regardless of their background. Even for the EM steps where we currently do not have tools, the guide still helps enormously, because it tells the customers what to do, and they can do it manually or find some external tools to help with it. Such guides, however, are completely missing from most current EM solutions and systems.
Difficulties in Developing How-to Guides: Surprisingly, we found that developing clear how-to guides is quite challenging. For example, the current guide for PyMatcher is still quite preliminary. It does not provide detailed guidance for many steps such as how to help users converge to a match definition, how to collaboratively label effectively, and how to debug learning-based matchers, among others. Developing detailed guidance for such steps is ongoing work.
Focusing on Reducing User Effort: Many existing EM works focus on automating the EM process. In Magellan, our focus switched to developing a step-by-step guide that tells users how to execute the EM process, identifying "pain points" of the guide and then developing tools to reduce the user effort in the pain points. We found this new perspective to be much more practical. It allows us to quickly develop end-to-end EM solutions that we can deploy with real users on Day 1 and then work with them closely to gradually improve these solutions and reduce their effort.
Many New Pain Points: Existing EM work has largely focused on blocking and matching. Our work makes clear that there are many pain points that current work has ignored or not been aware of. Examples include how to quickly converge to a match definition, how to label collaboratively, how to debug blockers and matchers, and how to update an EM workflow if something (e.g., data and match definition) has changed. We believe that more effort should be devoted to addressing these real pain points in practice.
Monolithic Systems vs. Ecosystems of Tools: We found that EM is much messier than we thought. Fundamentally, it is a "trial and error" process, where users keep experimenting until they find a satisfactory EM workflow. As a result, users try all kinds of workflows, customization, data processing, et cetera (e.g., see Konda et al.10).
Because EM is so messy and users want to try so many different things, we found that an ecosystem of tools is ideal. For every new scenario that users want to try, we can quickly put together a set of tools and a mini how-to guide that they can use. This gives us a lot of flexibility.
Many "trial" scenarios require only a part of the entire EM ecosystem. Having an ecosystem allows us to very quickly pull out the needed part, and popular parts end up being used everywhere. For example, several string matching packages in PyMatcher are so useful in many projects (not just in EM) that they ended up being installed on Kaggle, a popular data science platform.
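To give a sense of the kind of functionality such string matching packages provide, here is a minimal sketch of one common similarity measure, the Jaccard score over character q-grams. It is written from scratch for illustration and does not reproduce the actual API of those packages; the function names and the padding characters are assumptions.

```python
def qgrams(s, q=3):
    """Return the set of lowercase q-grams of s.
    The string is padded with '#' and '$' so that prefixes and
    suffixes also contribute q-grams."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def jaccard(x, y, q=3):
    """Jaccard similarity of the q-gram sets of x and y, in [0, 1]."""
    a, b = qgrams(x, q), qgrams(y, q)
    if not a and not b:
        return 1.0  # two empty q-gram sets are treated as identical
    return len(a & b) / len(a | b)

# Similar names score strictly between 0 and 1; unrelated strings score 0.
print(jaccard("David Smith", "D. Smith"))
print(jaccard("abc", "xyz"))
```

Because a measure like this is useful well beyond EM (for example, in fuzzy joins or typo-tolerant search), packaging it as a small stand-alone tool makes partial reuse easy.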
Extensibility is also much easier with an ecosystem. For example, recently, we have developed a new matcher that uses deep learning to match textual data.11 We used PyTorch, a new Python library, to develop it, released it as a new Python package in the PyMatcher ecosystem, and then extended our guide to show how to use it. This smoothly extended PyMatcher with relatively little effort.
Clearly, we can try to achieve the above three desirable traits (flexibility/customizability, partial reuse, and extensibility) with monolithic stand-alone systems for EM, but our experience suggests it would be significantly harder to do so. Finally, we found that it is easier for academic researchers to develop and maintain (relatively small) tools in an ecosystem, than large monolithic systems.
Using Multiple Execution Environments (EEs): We found that users often want to use multiple EEs for EM. For example, a user may want to work on-premise using his or her desktop to experiment and find a good EM workflow and then upload and execute the workflow on a large amount of data on the cloud. While working on-premise, if the user has to perform a computation-intensive task, such as executing a blocker, he or she may opt to move that task to the cloud and execute it there. Similarly, collaborative tasks such as labeling and data cleaning are typically executed on the cloud, using Web interfaces, or on mobile devices, say, while the user is taking the bus.
This raises two challenges. First, we need to develop an ecosystem of EM tools for each EE, for example, Python packages for the on-premise EE, containerized apps for the cloud, and mobile apps for smart phones. Second, we need to develop ways to quickly move data, models, and workflows across the EEs, to allow users to seamlessly switch among the EEs. In Magellan, we have taken some initial steps to address these two challenges. But clearly a lot more remains to be done.
Serving Both Lay Users and Power Users: In Magellan, we have developed PyMatcher as a solution for power users and CloudMatcher as a self-service solution for lay users. Serving both kinds of users is important, as suggested by our experience with EM teams and business teams at enterprises, as well as with domain scientists at UW (see Section 6).
Support for Easy Collaboration: We found that in many EM settings there is actually a team of people wanting to work on the problem. Most often, they collaborate to label a data set, debug, clean the data, et cetera. However, most current EM tools are rudimentary in helping users collaborate easily and effectively. As users often sit in different locations, it is important that such tools are cloud-based, to enable easy collaboration.
Managing Machine Learning "in the Wild": Our work makes clear that ML can be very beneficial to EM, mainly because it provides an effective way to capture complex matching patterns in the data and to capture domain experts' knowledge about such patterns. ML is clearly at the heart of EM workflows supported by CloudMatcher. In many real-world applications we have worked with, ML helps significantly improve recall while retaining high precision, compared to rule-based EM solutions.
Yet to our surprise, deploying even traditional ML techniques to solve EM problems already raises many challenges, such as labeling, debugging, coping with new data, et cetera. Our experience using PyMatcher also suggests that the most accurate EM workflows are likely to involve a combination of ML and rules. More generally, we believe ML must be used effectively in conjunction with hand-crafted rules, visualization, good user interaction, and Big Data scaling, in order to realize its full potential.
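One simple way to combine ML and rules, sketched below, is to let hand-crafted rules decide the cases they cover and to fall back on the learned model otherwise. This is an illustrative assumption about how such a combination might look, not the workflow used in Magellan; the schema (zip_a, zip_b, tax_id_a, tax_id_b), the rules, and the threshold are hypothetical.

```python
def hybrid_match(pair, ml_score, threshold=0.5):
    """Combine hand-crafted rules with a learned match score.

    pair: dict holding attributes of the two records (hypothetical schema).
    ml_score: match probability produced by any trained classifier.
    """
    # Negative rule: records with different non-empty ZIP codes never match.
    if pair["zip_a"] and pair["zip_b"] and pair["zip_a"] != pair["zip_b"]:
        return False
    # Positive rule: records with identical non-empty tax IDs always match.
    if pair["tax_id_a"] and pair["tax_id_a"] == pair["tax_id_b"]:
        return True
    # No rule fired: defer to the learned model.
    return ml_score >= threshold

pair = {"zip_a": "53706", "zip_b": "53706", "tax_id_a": "", "tax_id_b": ""}
print(hybrid_match(pair, ml_score=0.82))  # prints True (decided by the model)
```

The order in which rules are applied is itself a design choice: here the negative rule takes precedence over the positive one, which favors precision over recall.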
Cannot Work on EM in Isolation: It turned out that when working on EM, users often perform a wide variety of non-EM tasks, such as exploring the data (to be matched), understanding it, cleaning it, extracting structures from it, et cetera. Users also often perform many so-called DS tasks, such as visualization, analysis, et cetera, by invoking DS tools (e.g., calling Matplotlib or running a clustering algorithm in scikit-learn). Worse, users often interleave these non-EM and DS tasks with the steps of the EM process. For example, if the accuracy of the current EM workflow is low, users may want to clean the data, then retrain the EM matcher, then clean the data some more, et cetera.
As a result, building different ecosystems of tools for different tasks (e.g., EM, schema matching, cleaning, exploration, and extraction) is suboptimal, because constant switching among them creates a lot of overhead. Rather, we believe it is important to build unified ecosystems of tools. That is, for the on-premise EE, build one (or several) ecosystems that provide tools not just for EM, but also for exploration, understanding, cleaning, et cetera. Then, repeat for the cloud and mobile EEs. Further, these ecosystems should "blend in" seamlessly with DS ecosystems of tools, by being built on top of those.
Going forward, we are continuing to develop both the on-premise and cloud-hosted ecosystems of EM tools. In particular, we are paying special attention to the cloud-hosted ecosystem, where in addition to CloudMatcher, we are developing many other cloud tools to label, clean, and explore the data. We are also working on ways for users to seamlessly move data, workflows, and models across these two ecosystems. Finally, we are looking for more real-world applications to "test drive" Magellan.
We have described Magellan, a project to build EM systems. The key distinguishing aspect of Magellan is that unlike current EM systems, which follow the monolithic stand-alone system template of RDBMSs, Magellan borrows ideas from the data science field to build ecosystems of interoperable EM tools. Our experience with Magellan in the past few years suggests that this new "system template" is highly promising for EM. Moreover, we believe that it can also be highly promising for other non-EM tasks in data integration, such as data cleaning, data extraction, and schema matching, among others.
1. Workshop on Human-In-the-Loop Data Analytics, http://hilda.io/.
3. Das, S., P.S.G.C., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD'17 (New York, NY, USA, 2017), ACM, 1431–1446.
* Additional authors are Sanjib Das (Google), Erik Paulson (Johnson Control), Palaniappan Nagarajan (Amazon), Han Li (UW-Madison), Sidharth Mudgal (Amazon), Aravind Soundararajan (Amazon), Jeffrey R. Ballard (UW-Madison), Haojun Zhang (UW-Madison), Adel Ardalan (Columbia Univ.), Amanpreet Saini (UW-Madison), Mohammed Danish Shaikh (UW-Madison), Youngchoon Park (Johnson Control), Marshall Carter (American Family Ins.), Mingju Sun (American Family Ins.), Glenn M. Fung (American Family Ins.), Ganesh Krishnan (WalmartLabs), Rohit Deep (WalmartLabs), Vijay Raghavendra (WalmartLabs), Jeffrey F. Naughton (Google), Shishir Prasad (Instacart), and Fatemah Panahi (Google).
The original version of this paper is entitled "Entity Matching Meets Data Science: A Progress Report from the Magellan Project" and was published in Proceedings of the 2019 SIGMOD Conference.
©2020 ACM 0001-0782/20/8