How to download Wikipedia

So you're looking for some dummy data? Well, how about downloading Wikipedia?!

There are over 2 million pages on Wikipedia. Don't try to crawl the site; it won't let you. No robots allowed!

Go to https://download.wikipedia.org and you'll see a list of all the databases. If you're looking for the English one, it's "enwiki". There is a whole bunch of stuff you can download, but the file you generally want is "pages-articles.xml.bz2". It contains the current versions of article content and is the archive most mirror sites will probably want. The latest version at the time of writing is 1.7GB.
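Once the archive is on disk, you probably don't want to unpack all of it just to look at a few articles. Here is a minimal Python sketch that streams pages straight out of the compressed file. It assumes the dump follows the usual MediaWiki XML export layout (a `page` element containing a `title` and a revision `text`); the function names `iter_pages` and `local_name` are made up for this example and are not part of any Wikipedia tooling.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, text) for each <page> in a bz2-compressed XML dump.

    Assumes the MediaWiki export layout: <page> holds a <title> and a
    <revision> whose <text> is the article body.
    """
    with bz2.open(dump_path, "rb") as stream:
        # iterparse reads incrementally, so the file is never fully loaded.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # Drop this page's children to keep memory use down.
            elem.clear()
```

To try it, loop over `iter_pages("pages-articles.xml.bz2")` (substitute whatever the downloaded file is actually called) and feed each title and text into your indexer. Note that the text is raw wiki markup, not rendered HTML.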

Now you can run some decent content through your search engine or proof-of-concept application!

Comments

  • Anonymous
    October 05, 2006
    Thank you for highlighting this. This is so cool!
  • Anonymous
    October 05, 2006
    Not a problem, Hannes :)
  • Anonymous
    December 02, 2006
    Very cool. Don't forget you can use DataDude to generate data too... I believe it's now out as CTP 7
  • Anonymous
    December 02, 2006
    Update: DataDude is now RTM 1.0 :)