Duplicate detection is an important part of data cleaning: it is the process of detecting multiple representations of the same real-world object in data sources. A number of solutions are available for detecting duplicates in XML data. One novel method for XML duplicate detection is XMLDup, which uses a Bayesian network to evaluate the probability that two XML elements are duplicates. In addition, a network pruning strategy is used to make the evaluation of the Bayesian network more efficient. A DOM tree construction algorithm for building the tree of the input XML data is proposed. Using the DOM tree construction algorithm yields higher efficiency in detecting similar identities in XML documents.
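The pipeline described above (build a DOM tree, score a pair of elements, prune hopeless comparisons early) can be illustrated with a minimal sketch. This is not the XMLDup formulation or the paper's DOM tree construction algorithm: the function names `leaf_values` and `duplicate_score`, the averaged string similarity used in place of Bayesian network inference, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from xml.dom import minidom


def leaf_values(xml_text):
    """Parse XML into a DOM tree and map each leaf element's tag to its text.

    Simplification (assumption): repeated leaf tags overwrite one another.
    """
    root = minidom.parseString(xml_text).documentElement
    values = {}

    def visit(node):
        children = [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
        if not children:
            text = "".join(
                c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE
            )
            values[node.tagName] = text.strip().lower()
        for child in children:
            visit(child)

    visit(root)
    return values


def duplicate_score(xml_a, xml_b, threshold=0.8):
    """Return a similarity score in [0, 1], or None if the pair was pruned.

    The score is the mean string similarity over all leaf tags. This is a
    stand-in for the Bayesian network evaluation, not the actual model.
    """
    a, b = leaf_values(xml_a), leaf_values(xml_b)
    tags = sorted(set(a) | set(b))
    if not tags:
        return 0.0
    total = 0.0
    for i, tag in enumerate(tags):
        if tag in a and tag in b:
            total += SequenceMatcher(None, a[tag], b[tag]).ratio()
        # Pruning: assume every remaining tag matches perfectly. If even
        # this optimistic bound is below the threshold, stop early.
        remaining = len(tags) - i - 1
        if (total + remaining) / len(tags) < threshold:
            return None
    return total / len(tags)
```

Calling `duplicate_score` on two XML strings returns a score when the pair could still reach the threshold and `None` when the comparison was abandoned early. The early exit mirrors the idea behind network pruning: a pair is dropped as soon as its best possible score falls below the threshold.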
Duplicate Detection, XML, DOM, Bayesian network, data cleaning.
[Miss Amita Fulsundar, Dr.K.V.Metre (2015), Detection of Similar Identities in XML Documents, International Journal of Innovative Research in Computer Science & Technology (IJIRCST), Vol-3, Issue-3, Page No-134-138], (ISSN 2347 - 5552). www.ijircst.org
Miss Amita Fulsundar
Computer Department, MET BKCIOE Savitribai Phule Pune University, Nasik, India