enwiki-20080312-pages-articles.xml: uncompressed 16 GB, compressed 3.5 GB
wc -l on this file takes 12.784u 11.432s 2:32.61 (user, system, elapsed)
C2D: 4 GB RAM, running 64-bit Linux (Debian 4.1: 2.6.18-6-amd64) at 3.15 GHz. Two 640 GB WD drives. Compiled with -g.
Wiki_HashArticleTitleToIntId is 35 sec
Wiki_ProcessEntireFile is 485 sec
printTopMissingLinks is 10 sec
printTopLinks is 8 sec
pageRank is 836 sec
On the entire wiki, top reports 3.47 GB of memory in use.
With -O3 I get 35 sec for HashArticleTitleToIntId and 1618 sec for ProcessEntireFile; with all optimizations turned off I get 33 sec for HashArticleTitleToIntId (10% slower).
The code is in C and can be downloaded from here. I tested it on Debian 2.6.18-6-amd64 Linux. 64-bit addressing is essential: the uncompressed Wikipedia file is 16 GB, so its individual bytes cannot be addressed with 32 bits.
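To make the 32-bit limit concrete, here is a minimal sketch of the arithmetic. The function names fits_in_32_bits and dump_size_bytes are hypothetical and are not part of the downloadable code, and the dump size is taken as exactly 16 * 2^30 bytes, which is only an approximation of the real file.

```c
#include <stdint.h>

/* Hypothetical helper, not from the original code:
 * returns 1 if a byte offset can be held in an unsigned 32-bit index. */
int fits_in_32_bits(uint64_t byte_offset)
{
    return byte_offset <= UINT32_MAX;   /* UINT32_MAX is 2^32 - 1 */
}

/* Assumed size of the uncompressed dump: 16 GB, taken as 16 * 2^30 bytes. */
uint64_t dump_size_bytes(void)
{
    return 16ULL * 1024 * 1024 * 1024;
}
```

A 32-bit index reaches only the first 4 GiB, so fits_in_32_bits(dump_size_bytes() - 1), the offset of the last byte, returns 0. Any offset stored for later random access into the file therefore needs a 64-bit type such as uint64_t.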
Key components include:
src/srch/wiki.[ch], main, code for exporting search functions to Tcl in src/nm/nm*.[ch], src/util/*.[ch], Makefile, and the tests subdirectory.
I've included only small test cases; you will need to download the Wikipedia source from the link above.
wc -l reports 4,888,841 lines in enwiki-20080312-all-titles-in-ns0, whereas I counted over 6M titles (using Perl to look for <title>Foo</title>).
enwiki-20080312-redirect.sql is huge and hard to parse.
Why does Portal:Contents/Categorical index show up in the pagerank results but not in the highest-indegree results? (It's number 1 in pagerank, but not even in the top 100 by indegree.)
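The title count mentioned above was done in Perl. For comparison, a rough C equivalent might look like the sketch below. The names line_has_title and count_titles are hypothetical and do not come from the downloadable code. The sketch assumes each <title>...</title> pair sits on a single line shorter than the buffer, which is how the dump is normally laid out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the line contains <title>
 * followed later on the same line by </title>, 0 otherwise. */
int line_has_title(const char *line)
{
    const char *open = strstr(line, "<title>");
    if (open == NULL)
        return 0;
    /* Search for the closing tag only after the opening tag. */
    return strstr(open + strlen("<title>"), "</title>") != NULL;
}

/* Counts title lines in a stream, reading one line at a time so that
 * memory use stays constant regardless of file size.
 * Assumption: lines longer than the buffer are read in pieces, so a
 * title split across two pieces would be missed. */
long count_titles(FILE *in)
{
    char line[8192];
    long count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (line_has_title(line))
            count++;
    }
    return count;
}
```

Calling count_titles on a FILE * opened with fopen on the uncompressed dump gives a number to compare against the wc -l count of the all-titles file. It counts every page that has a title, without filtering by namespace.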