A method is described for identifying near-duplicate pages in a linked database, in which pages can have incoming and outgoing links. A first page and a second page are selected, and the number of outgoing links is determined for each. The two pages are marked as near duplicates based on the number of outgoing links they have in common.
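
The comparison described above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: the linked database is modeled as a dictionary mapping each page to the set of pages it links to, and the decision rule (the share of common outgoing links relative to the larger out-degree, compared against a hypothetical `threshold` of 0.75) is one possible criterion chosen for illustration. The names `LinkGraph`, `common_outgoing_links`, `are_near_duplicates`, and `example_graph` are likewise hypothetical.

```python
from typing import Dict, Set

# Assumed representation of the linked database: page -> set of link targets.
LinkGraph = Dict[str, Set[str]]


def common_outgoing_links(graph: LinkGraph, first: str, second: str) -> int:
    """Count the outgoing links that the two pages share."""
    return len(graph.get(first, set()) & graph.get(second, set()))


def are_near_duplicates(
    graph: LinkGraph, first: str, second: str, threshold: float = 0.75
) -> bool:
    """Mark two pages as near duplicates from their common outgoing links.

    The ratio-based rule and the default threshold are illustrative
    assumptions, not the criterion claimed by the source.
    """
    out_first = len(graph.get(first, set()))
    out_second = len(graph.get(second, set()))
    # Pages without outgoing links give no evidence either way.
    if out_first == 0 or out_second == 0:
        return False
    common = common_outgoing_links(graph, first, second)
    return common / max(out_first, out_second) >= threshold


# Toy data: pages "a" and "b" share four of their outgoing links,
# while "c" shares only one link with "a".
example_graph: LinkGraph = {
    "a": {"x", "y", "z", "w"},
    "b": {"x", "y", "z", "w", "u"},
    "c": {"x", "q", "r"},
}
```

With this toy graph, `are_near_duplicates(example_graph, "a", "b")` marks the pair as near duplicates, while the pair `"a"` and `"c"` is not marked. Dividing by the larger out-degree is a design choice that keeps a small page from matching a much larger one merely because all of its few links also appear there.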