Linking Entity Resolution and Risk

Eastern Economic Journal

Abstract

A major emerging problem among consumer finance institutions is that customers who are not well recognized might be riskier than customers who are fully recognized. Fortunately, financial institutions have access to external vendor databases that indicate the level of recognition of their customers. However, this information is normally presented as features with partial scores that must be aggregated into an overall matching accuracy score. This score indicates how similar a record is to a master database that contains the best available public information about a specific customer. In addition, the information management and risk management departments of financial institutions may have very different models. Hence, it is necessary to connect the customer recognition information with risk models. This paper studies this problem in two parts: (1) generation of a matching accuracy score to quantify the status of entity resolution between consumer records of a major financial company and an external database, and (2) evaluation of the relationship between the matching accuracy score and several risk segments. As a final result, an overall matching accuracy score is obtained for every customer using the most current account information and a learning algorithm. The matching accuracy score is an indicator of the level of customer recognition and is correlated with the FICO score. (FICO is a risk score generated by the company Fair Isaac & Co.; its maximum value is 850. In this paper, values above 720 are considered Superprime, values of 661–719 are Prime, values of 600–660 are Near Prime, and values below 600 or not available are Subprime.)
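
The risk segments described in the abstract can be illustrated with a minimal Python sketch. The function name fico_segment is hypothetical and does not come from the paper. The abstract leaves a score of exactly 720 unassigned ("above 720" versus "661–719"), so the sketch assumes 720 belongs to Superprime; that choice is an assumption, not the author's stated rule.

```python
from typing import Optional


def fico_segment(fico: Optional[int]) -> str:
    """Map a FICO score to one of the four risk segments named in the abstract.

    A missing score (None) is treated as Subprime, as the abstract states.
    """
    if fico is None or fico < 600:
        return "Subprime"
    if fico <= 660:
        return "Near Prime"
    if fico < 720:
        return "Prime"
    # Assumption: a score of exactly 720 is grouped with Superprime,
    # because the abstract does not assign that value to any segment.
    return "Superprime"
```

For example, fico_segment(640) falls in the Near Prime band and fico_segment(None) in the Subprime band. Comparing such segments with the matching accuracy score is the second part of the study, which this sketch does not attempt to reproduce.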

Notes

  1. Coreferent means that two or more records or expressions refer to the same entity.

  2. Intuitively, a weak learner is an algorithm whose performance is at least slightly better than random guessing.

  3. Mapping x to {0, 1} instead of {−1, +1} increases the flexibility of the weak learner. Zero can be interpreted as “no prediction” [Freund and Schapire 1997].

  4. The experts were external consultants with analytical experience in consumer finance companies.

  5. Last update time is the number of seconds from a reference date (January 1, 1960) to the last update of a particular account.

  6. See Churi et al. [2009] for the original description of how to generate the matching accuracy scores.
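
The last update time defined in note 5 can be converted to and from a calendar date with a short Python sketch. The names REFERENCE_DATE, last_update_to_datetime, and datetime_to_last_update are hypothetical, and the sketch assumes timezone-naive timestamps with no leap-second handling, which the note does not specify.

```python
from datetime import datetime, timedelta

# Reference date given in note 5 (January 1, 1960, assumed to be midnight).
REFERENCE_DATE = datetime(1960, 1, 1)


def last_update_to_datetime(seconds: float) -> datetime:
    """Convert seconds elapsed since the reference date to a datetime."""
    return REFERENCE_DATE + timedelta(seconds=seconds)


def datetime_to_last_update(moment: datetime) -> float:
    """Convert a timezone-naive datetime to seconds since the reference date."""
    return (moment - REFERENCE_DATE).total_seconds()
```

A value of 86400 seconds, for instance, corresponds to one day after the reference date. Larger values indicate more recently updated accounts.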

References

  • Baeza-Yates, Ricardo A., and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

  • Bhattacharya, Indrajit, and Lise Getoor. 2007. Collective Entity Resolution in Relational Data. ACM Transactions on Knowledge Discovery from Data, 1 (1): 1–36.

  • Bilenko, Mikhail, and Raymond J. Mooney. 2003. Adaptive Duplicate Detection Using Learnable String Similarity Measures, KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 39–48.

  • Breiman, Leo. 1996. Bagging Predictors. Machine Learning, 24: 123–140.

  • Breiman, Leo. 1998. Arcing Classifiers. Annals of Statistics, 26 (3): 801–849.

  • Chaudhuri, Surajit, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. 2003. Robust and Efficient Fuzzy Match for Online Data Cleaning, SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 313–24.

  • Churi, Prashant, Germán Creamer, Sara Tresch, and Mary Weissman. 2009. Methods, Systems and Computer Programming Products for Generating Data Quality Indicators for Relationships in a Database, patent no. US 2009/0094237 A1.

  • Cohen, William W., and Jacob Richman. 2002. Learning to Match and Cluster Large High-dimensional Data Sets for Data Integration, KDD ’02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 475–80.

  • Cohen, William W., Pradeep Ravikumar, and Stephen Fienberg. 2003. A Comparison of String Distance Metrics for Name-matching Tasks, The IJCAI Workshop on Information Integration on the Web (IIWeb), 73–78.

  • Even, Adir, and Ganesan Shankaranarayanan. 2007. Understanding Impartial Versus Utility-driven Quality Assessment in Large Datasets, 12th International Conference on Information Quality (ICIQ), 15–44.

  • Even, Adir, Ganesan Shankaranarayanan, and Paul D. Berger. 2010. Inequality in the Utility of Data and Its Implications for Data Management. Journal of Database Marketing and Customer Strategy Management, 17: 19–35.

  • Fellegi, Ivan, and Alan Sunter. 1969. A Theory for Record Linkage. Journal of the American Statistical Association, 64: 1183–1210.

  • Freund, Yoav, and Llew Mason. 1999. The Alternating Decision Tree Learning Algorithm, Machine Learning: Proceedings of the Sixteenth International Conference. San Francisco: Morgan Kaufmann Publishers Inc, 124–33.

  • Freund, Yoav, and Robert E. Schapire. 1997. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55 (1): 119–139.

  • Freund, Yoav, and Robert E. Schapire. 1998. Discussion of the Paper “Arcing Classifiers” by Leo Breiman. Annals of Statistics, 26 (3): 824–832.

  • Gusfield, Dan. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York, NY, USA: Cambridge University Press.

  • Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2003. The Elements of Statistical Learning. New York: Springer.

  • Madnick, Stuart, Richard Wang, and Xiang Xian. 2002. A Framework for Corporate Householding, 8th International Conference on Information Quality (ICIQ), 36–46.

  • Madnick, Stuart, Richard Wang, and Xiang Xian. 2003. The Design and Implementation of a Corporate Householding Knowledge Processor to Improve Data Quality. Journal of Management Information Systems, 20 (3): 41–69.

  • Monge, Alvaro E., and Charles P. Elkan. 1996. The Field Matching Problem: Algorithms and Applications, Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 267–70.

  • Navarro, Gonzalo. 2001. A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33 (1): 31–88.

  • Newcombe, H.B., J.M. Kennedy, S.J. Axford, and A.P. James. 1959. Automatic Linkage of Vital Records. Science, 130: 954–959.

  • Quinlan, John R. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

  • Ristad, Eric Sven, and Peter N. Yianilos. 1998. Learning String-edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20 (5): 522–532.

  • Sarawagi, Sunita, and Anuradha Bhamidipaty. 2002. Interactive Deduplication Using Active Learning, KDD ’02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 269–78.

  • Singla, Parag, and Pedro Domingos. 2006. Entity Resolution with Markov Logic, Proceedings of the Sixth IEEE International Conference on Data Mining. Hong Kong: IEEE Computer Society Press, 572–82.

  • Talburt, John R., Charles G. Morgan, Terry Talley, and Ken Archer. 2005. Using Commercial Data Integration Technologies to Improve the Quality of Anonymous Entity Resolution in the Public Sector, in Proceedings of the 10th International Conference on Information Quality (ICIQ-2005), edited by Felix Naumann, Michael Gertz and Stuart E. Madnick. Cambridge, MA: MIT, 133–144.

  • Tejada, Sheila, Craig A. Knoblock, and Steven Minton. 2001. Learning Object Identification Rules for Information Integration. Information Systems, 26 (8): 607–633.

  • Tibshirani, Robert. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B, 58 (1): 267–288.

  • Wang, Y. Richard, and Stuart E. Madnick. 1989. The Inter-Database Instance Identification Problem in Integrating Autonomous Systems, Proceedings of the Fifth International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society, 46–55.

  • Winkler, William E. 2006. Overview of Record Linkage and Current Research Directions, Tech. rept. Statistics 2006-2. Statistical Research Division, U.S. Bureau of the Census, Washington, DC.

Acknowledgements

I thank participants of the Eastern Economic Association meeting of 2010, Jason Barr, Leanne Ussher, Troy Tassier, Sara Tresch, Eddie Alvarez, Joyce Jacobsen, the editor, and three anonymous referees for suggestions and informal discussions about the matching accuracy algorithm, and Patrick Jardine and Michelle Crilly for proofreading the paper. The opinions presented are the exclusive responsibility of the author.

Cite this article

Creamer, G. Linking Entity Resolution and Risk. Eastern Econ J 37, 150–164 (2011). https://doi.org/10.1057/eej.2010.63
