Wikipedia Search

Basic stats

  • Wikipedia snapshot on March 12, 2008 from here.
  • Single file enwiki-20080312-pages-articles.xml—uncompressed 16 GB, compressed 3.5 GB
  • 6,552,490 articles
  • Line count: 259,427,121 as reported by wc -l (takes 12.784u 11.432s 2:32.61 time)
  • 84,267,421 valid links, 38,998,582 invalid links → 13 outgoing edges on the average
  • Different pages have different indegrees: here is the exact histogram of indegrees
  • There are 9M distinct words, when we sample 1 in 4 words (words are just determined by text strings)
    • Takes 1980 seconds to process articles (link structure and index), 3.5GB resident (5.1GB virtual).
    • Takes 113 seconds to print index (only the words in the index and their frequency)
    • Distributuion of words

Performance

C2D: 4gb RAM running 64bit linux (Debian 4.1: 2.6.18-6-amd64) at 3.15 Ghz. Two 640GB WD Drives. Compiled with -g

  • Run time for Wiki_HashArticleTitleToIntId is 35 sec
    • This function maps the 6M page titles to line and articles ids (ints)
  • Run time for Wiki_ProcessEntireFile is 485 sec
    • This function creates the maps from articles to the links appearing in articles for the entire Wiki
  • Run time for printTopMissingLinks is 10 sec
    • This functon scans throught the missing links array and sorts it, prints out the top missing links
  • Run time for printTopLinks is 8 sec
    • This function identifies the top links (creates an array of articles to number of times it appears, sorts it, prints out)
  • Run time for pageRank is 836 sec
    • 157 iterations → 5.3 seconds/iteration
    • Each iteration is multiplying a 6M dim vector into a (very sparse) 6Mx6M matrix

On entire wiki: top reports 3.47 GB

With -O3 get 35 sec for HashArticleTitleToIntId, 1618 for ProcessEntireFile, with all optimizations turned off get 33 sec for HashArticleTitleToInId (10% slower)

Rankings

Source code

The code is in C and can be downloaded from here. I tested it on Debian 2.6.18-6-amd64 Linux. (64-bit addressing is essential, since the uncompressed Wikipedia file is 16 GB → individual bytes cannot be addressed with 32 bits)

Key components include

  • search functionality—reverse index, page rank, query lookup in src/srch/wiki.[ch],
  • main, code for exporting search functions to Tcl in src/nm/nm*.[ch],
  • utilities for resizable arrays, hashes, timing, etc. are in src/util/*.[ch],
  • a Makefile, and
  • test-cases in the tests subdirectory.

I've only included small tests cases, you will need to download Wikipedia source from the link above.



Issues

  • According to wc -l there are 4,888,841 lines in enwiki-20080312-all-titles-in-ns0, whereas I counted over 6M titles (using Perl to look for <title>Foo</title>
  • How to make sense of the huge number of 40M missing links?
    • Redirect file enwiki-20080312-redirect.sql is huge, hard to parse
  • Why do pages like Portal:Contents/Categorical index show up in the pagerank results but not the highest indegree results? (It's number 1 in pagerank, but not even in the top 100 indegree)
    • Grepping for shows the above link shows a page with this title, and only 11 links to it
    • TODO time to run grep
  • What is the SCC DAG structure like?
  • What is the diameter of the graph?
 
wikipedia/start.txt · Last modified: 2008/07/10 13:37 by adnan
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki