Can we simulate name-like numbers?
Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades.
It is a basic tool in data integration, where it is used to combine multiple datasets. Most ER research, however, has focused on pairwise comparisons. In our work, we are interested in the global matching problem for large datasets. Good test datasets in this domain are rare, and those that exist are often much smaller than the scale we want to study.
Simulation is one way to generate datasets for testing. However, the global matching problem does not require simulating the many individual details of identification keys. In our current work we are looking at how to simulate simple vectors in a space that approximates the properties of names, which are commonly used as identification keys. This is one step towards generating large simulated datasets for large-scale testing of global matching techniques.
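As a purely illustrative sketch of the idea, and not the method developed in this work, one could draw a random vector for each entity and then generate noisy copies of it to stand in for differently recorded versions of the same name. The function names, the uniform distribution, the Gaussian noise model, and the Euclidean distance below are all assumptions made for the example.

```python
import math
import random


def simulate_records(n_entities, dim, noise_sd, copies=2, seed=0):
    """Return (records, labels): noisy copies of random entity vectors.

    Hypothetical helper: every modelling choice here is an assumption.
    """
    rng = random.Random(seed)
    records, labels = [], []
    for entity_id in range(n_entities):
        # True "name" of the entity: a point drawn uniformly from the unit cube.
        truth = [rng.random() for _ in range(dim)]
        for _ in range(copies):
            # Observed key: the true point plus Gaussian noise,
            # a stand-in for typos and spelling variants.
            records.append([x + rng.gauss(0.0, noise_sd) for x in truth])
            labels.append(entity_id)
    return records, labels


def distance(a, b):
    """Euclidean distance, playing the role of a name-dissimilarity measure."""
    return math.dist(a, b)
```

Here `labels` records which entity each vector came from, so any global matching technique run on `records` can be scored against a known ground truth.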
Samudra Dilrukshi Herath