Abstract
Electronic data play an important role in business applications and decision-making processes. Data quality can be degraded by many factors, such as duplicates, errors, and missing values. This work focuses on finding fuzzy duplicates in complex hierarchical structures such as XML data, where duplicates are classified as exact duplicates, partial duplicates, and sets of duplicates. A novel method for XML duplicate detection, called XMLDUP, uses a Bayesian network to determine the probability that two XML elements are duplicates, considering two things: the information within the elements and the way that information is structured. The hierarchical data are classified into parent nodes, child nodes, and their values, and new conditional and prior probabilities are then applied to make duplicates in XML data easier to identify. A node ordering technique, which orders the contents of the data according to the features of the data, is used to improve the efficiency of duplicate detection. An automatic pruning factor is then derived to improve the effectiveness of duplicate detection: the pruning factor is a threshold, and data that reach it are assumed to be duplicates. To further improve efficiency, a network pruning strategy is used, which is capable of significant gains over the unoptimized version of the algorithm. Experiments on several data sets indicate that high precision and recall scores can be achieved.
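The interaction between node ordering and threshold-based pruning can be illustrated with a minimal Python sketch. It is not the XMLDUP algorithm: it assumes flat records instead of a full XML hierarchy, uses the standard-library `difflib.SequenceMatcher` as a stand-in string similarity, and combines field similarities by a simple product in place of the Bayesian network's conditional probabilities. The names `field_similarity`, `duplicate_score`, `field_order`, and `threshold` are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not XMLDUP's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_score(rec_a, rec_b, field_order, threshold):
    """Multiply field similarities in the given order.

    Every factor is at most 1, so the running product is an upper bound
    on the final score. Once it drops below the threshold the pair can
    no longer qualify as duplicates and the comparison stops early.
    Returns (score, fields_evaluated, pruned).
    """
    score = 1.0
    evaluated = 0
    for name in field_order:
        score *= field_similarity(rec_a.get(name, ""), rec_b.get(name, ""))
        evaluated += 1
        if score < threshold:
            return score, evaluated, True  # pruned: remaining fields skipped
    return score, evaluated, False


# Hypothetical records standing in for two XML elements.
movie_a = {"title": "The Matrix", "year": "1999", "director": "Wachowski"}
movie_b = {"title": "The Matrix", "year": "2003", "director": "Wachowski"}

# Comparing the discriminating field first lets pruning trigger sooner.
early = duplicate_score(movie_a, movie_b, ["year", "title", "director"], 0.5)
late = duplicate_score(movie_a, movie_b, ["title", "director", "year"], 0.5)
print(early, late)
```

In this sketch both orderings reject the pair, but the ordering that compares the differing field first does so after fewer comparisons, which is the kind of efficiency gain that ordering and pruning aim for.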