Technical Perspective: Entity Matching with Magellan

Ferdinand Magellan was a Portuguese explorer who launched a Spanish expedition that completed the first circumnavigation of the Earth. It is in this spirit that Magellan was chosen as the name of the end-to-end entity matching system developed at the University of Wisconsin.

Entity matching (also known as entity resolution, reference reconciliation, or deduplication) is a major task in the larger problem of data integration, a problem that is pervasive in many organizations. Despite being a subject of extensive research for many years, the entity matching problem is surprisingly simple to describe and understand: determine whether two different representations refer to the same real-world entity. For example, do the two tuples (J. Doe, UWisc) and (John Doe, Univ. of Wisconsin) refer to the same person?
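To make the problem statement concrete, here is a minimal sketch of a naive matcher for the two tuples above. It is not Magellan's method: the character-level similarity from Python's standard difflib module, the averaging of per-attribute scores, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(t1, t2, threshold=0.5):
    """Declare a match when the mean per-attribute similarity
    reaches the threshold (an arbitrary, illustrative cutoff)."""
    scores = [similarity(x, y) for x, y in zip(t1, t2)]
    return sum(scores) / len(scores) >= threshold


doe_short = ("J. Doe", "UWisc")
doe_long = ("John Doe", "Univ. of Wisconsin")
someone_else = ("Alice Zhang", "MIT")  # hypothetical non-matching tuple

print(is_match(doe_short, doe_long))
print(is_match(someone_else, doe_long))
```

A fixed threshold like this is brittle in practice, which is one reason real systems learn a matcher from labeled examples instead.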

Perhaps more surprisingly, most prior systems for entity matching are stand-alone systems, sometimes built for specific applications, and are difficult to integrate into the larger data integration setting, which often involves a composition of various other tasks such as data acquisition, preparation, transformation, cleaning, and schema matching, in addition to entity matching. For example, the two tuples above may have been extracted from acquired PDFs or text files and transformed into the format above before being matched. Different tasks need different libraries and techniques, and these must interoperate before an end-to-end entity matching or data integration pipeline can be executed successfully. Magellan is able to provide all of the above.

Magellan's key insight is that a successful entity matching system must offer a versatile system-building paradigm that can be easily adapted to different application needs. Furthermore, it must also be easy to "plug-and-play" entity matching into data integration pipelines or other systems. There already exist vibrant ecosystems of data science libraries and tools (for example, those in Python and R), which are heavily used by data scientists to solve many data integration tasks. By developing entity matching tools within such ecosystems, Magellan makes it easy for data scientists to exploit the tools (including Magellan) in the ecosystems and, in turn, makes such ecosystems better at solving various data integration problems. In sum, Magellan distinguishes itself by making it easy to develop entity matching tools that incorporate advanced entity matching techniques. In addition, it allows researchers to "connect" to and exploit the vast ecosystems of data science tools and to build entity matching tools directly into those ecosystems.

In Magellan, there are two entity matching tools developed for two widely used execution environments: (1) PyMatcher is an entity matching tool developed as part of the PyData ecosystem. This allows users to leverage the rich set of Python libraries to carry out the entire entity matching pipeline, which may involve subtasks such as data cleaning and visualization, in addition to blocking and matching. (2) CloudMatcher is a cloud-based entity matching tool that is part of the Amazon Web Services ecosystem. PyMatcher is intended for "power users" who possess knowledge of entity matching, programming, and basic machine learning, while CloudMatcher is targeted at "lay users" who may not know how to program or possess machine learning knowledge.


PyMatcher provides how-to guides that describe how to approach the development of entity matching workflows. These guides describe how to develop a solution on a small sample of data (by downsampling, blocking, and training a matcher) and how to scale the solution to work with production data. The entity matching workflow for CloudMatcher is similar to that of PyMatcher, except that CloudMatcher actively learns from the user how to block tuples. Afterwards, it executes the learned blocking rules to obtain a set of candidate tuple pairs, and again actively learns from the user which candidate pairs match and which do not, before deriving a model that can be applied to match tuples across two tables.
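The block-then-match workflow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not PyMatcher or CloudMatcher code: the tables, the shared-name-token blocking rule, and the first-initial matching rule are all hypothetical, and the hand-written rule stands in for the matcher that the real tools would train or actively learn from labeled pairs.

```python
from itertools import product

# Hypothetical toy tables; real inputs would be much larger.
table_a = [
    {"id": "a1", "name": "J. Doe", "org": "UWisc"},
    {"id": "a2", "name": "M. Smith", "org": "MIT"},
]
table_b = [
    {"id": "b1", "name": "John Doe", "org": "Univ. of Wisconsin"},
    {"id": "b2", "name": "Mary Smith", "org": "Mass. Inst. of Technology"},
    {"id": "b3", "name": "Ann Lee", "org": "Stanford"},
    {"id": "b4", "name": "Robert Doe", "org": "Univ. of Wisconsin"},
]


def tokens(text):
    """Lowercased word tokens with surrounding periods and commas removed."""
    return {t.strip(".,").lower() for t in text.split() if t.strip(".,")}


def block(a_rows, b_rows):
    """Blocking: keep only pairs whose names share at least one token,
    so the matcher never sees obviously unrelated pairs."""
    return [
        (a, b)
        for a, b in product(a_rows, b_rows)
        if tokens(a["name"]) & tokens(b["name"])
    ]


def first_initial(name):
    """First letter of the first name token, lowercased."""
    return name.split()[0].strip(".").lower()[0]


def match(a, b):
    """Stand-in matcher: a fixed rule where a trained model would go."""
    return first_initial(a["name"]) == first_initial(b["name"])


candidates = block(table_a, table_b)
matches = [(a["id"], b["id"]) for a, b in candidates if match(a, b)]

print(len(candidates), "candidate pairs out of", len(table_a) * len(table_b))
print(matches)
```

Blocking matters because the number of possible pairs grows with the product of the table sizes; a cheap rule that discards most pairs lets the more expensive matcher run only on plausible candidates.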

In short, Magellan makes it easy to develop an entity matching solution and easy to interoperate with other tools to form a bigger data integration pipeline that solves larger problems. It is a showcase for practical software development tools that originate from data management research. It has been successfully applied to multiple real-world entity matching problems, is used in production by many data science groups and companies, and is being commercialized, demonstrating that using data science ideas to build entity matching systems is highly promising. For more details, see Magellan's website at https://sites.google.com/site/anhaidgroup/projects/magellan.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/3405476

August 2020 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
