(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Computer Research (IJACR)

ISSN (Print):2249-7277    ISSN (Online):2277-7970
Volume-1 Issue-2 December-2011
Paper Title : The Study of Detecting Replicate Documents Using MD5 Hash Function
Author Name : Pushpendra Singh Tomar, Maneesh Shreevastava
Abstract :

A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats (HTML, PDF, and text) for different audiences. Documents may also be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. The removal of replicate documents is necessary, not only to reduce runtime, but also to improve search accuracy. Today, search engine crawlers are retrieving billions of unique URLs, of which hundreds of millions are replicates of some form. Thus, identifying replicates quickly expedites indexing and searching. One vendor’s analysis of 1.2 billion URLs found 400 million exact replicates using an MD5 hash. Reducing collection sizes by tens of percentage points yields great savings in indexing time and reduces the amount of hardware required to support the system. Last, and probably most significant, users benefit from the elimination of replicate results. By efficiently presenting only unique documents, user satisfaction is likely to increase.
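
As a rough illustration of exact-replicate detection with an MD5 hash, the following Python sketch keeps the first document seen for each digest and drops later copies. It is not taken from the paper: the function names `md5_digest` and `remove_exact_replicates` and the (identifier, text) input layout are assumptions made for this example.

```python
import hashlib


def md5_digest(text: str) -> str:
    """Return the hexadecimal MD5 digest of a document's text (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def remove_exact_replicates(documents):
    """Keep only the first document for each distinct MD5 digest.

    `documents` is an iterable of (doc_id, text) pairs; this layout is a
    hypothetical choice for the sketch, not one prescribed by the paper.
    """
    seen = set()      # digests of documents already kept
    unique = []       # documents whose digest has not been seen before
    for doc_id, text in documents:
        digest = md5_digest(text)
        if digest in seen:
            continue  # same digest as an earlier document: treat as a replicate
        seen.add(digest)
        unique.append((doc_id, text))
    return unique
```

A digest comparison of this kind only catches byte-identical text. The same document served as HTML and as PDF, or a copy that differs by a single space, produces a different digest, so near-replicates would need normalization or a similarity-based method on top of this step.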

Keywords : Unique documents, detecting replicate, replication, search engine.
Cite this article : Pushpendra Singh Tomar, Maneesh Shreevastava, "The Study of Detecting Replicate Documents Using MD5 Hash Function", International Journal of Advanced Computer Research (IJACR), Volume-1, Issue-2, December-2011, pp. 14-17.