
After my last post, I promptly got several hits from Google searches for terms like “scrape infobox” and “scraping wikipedia infobox,” so I thought I would post something useful for those people.

Here is a Python script that does some simple parsing of Wikipedia infoboxes. Beyond the standard library of Python 2.5+, it requires only PyYAML.

It can be run at the command line as “python scrape_infobox.py [ARTICLE] [BOX]” and prints a YAML-formatted dictionary whose keys are the infobox item titles and whose values are the corresponding data. [BOX] is an optional argument: if you leave it off, the script parses the first infobox on the page; otherwise it looks for an infobox with the given name.

For example, “python scrape_infobox.py "Femoral artery" Artery” and “python scrape_infobox.py "Femoral artery"” both yield:

{BranchFrom: '[[external iliac artery]]', BranchTo: '[[Superficial epigastric artery]]
[[Superficial
iliac circumflex artery|Superficial iliac circumflex]]
[[Superficial external
pudendal artery|Superficial external pudendal]]
[[Deep external pudendal artery|Deep
external pudendal]]
[[Deep femoral artery]]', Caption: 'Structures passing
behind the [[inguinal ligament]]. (Femoral artery labeled at upper right.)', Caption2: 'Femoral
artery and its major branches – right thigh, anterior view.', DorlandsPre: a_61,
DorlandsSuf: '12154275', GrayPage: '623', GraySubject: '157', Image: Gray546.png,
Image2: Gray548.png, Latin: arteria femoralis, MeshName: Femoral+Artery, MeshNumber: A07.231.114.35,
Name: Femoral artery, Supplies: '[[anterior compartment of thigh]]', Vein: '[[femoral
vein]]'}
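
For readers who want to see the general idea, here is a minimal sketch of this kind of parsing. It is not the actual scrape_infobox.py: the function names, the regular expression, and the abridged, hand-written sample wikitext are all assumptions made for illustration, and it only handles infoboxes that contain no nested templates.

```python
import re

# Hand-written, abridged sample; NOT the real article source.
SAMPLE = """Intro text.
{{Infobox Artery
| Name = Femoral artery
| Latin = arteria femoralis
| BranchTo = [[Superficial iliac circumflex artery|Superficial iliac circumflex]]
| Vein = [[femoral vein]]
}}
More text."""


def find_infobox(wikitext, box=None):
    """Return the body of the first infobox, or of the one named `box`.

    Assumes the infobox contains no nested {{templates}}, so the first
    '}}' after the opening braces is taken as the end of the box.
    """
    name = re.escape(box) if box else r"[^|}\n]*"
    pattern = r"\{\{\s*Infobox[ _]+(" + name + r")\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, re.DOTALL | re.IGNORECASE)
    return match.group(2) if match else None


def parse_fields(body):
    """Turn 'key = value | key = value ...' into a dict."""
    fields = {}
    # Split on pipes that are not inside a [[wikilink|label]].
    for part in re.split(r"\|(?![^\[\]]*\]\])", body):
        key, sep, value = part.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


body = find_infobox(SAMPLE)
if body is not None:
    for key, value in sorted(parse_fields(body).items()):
        print(key + ": " + value)
```

Passing a second argument, as in find_infobox(SAMPLE, "Artery"), mirrors the optional [BOX] argument: the lookup only succeeds if an infobox with that name is present.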

Note that the script will fail on certain more complex infoboxes that have templates inside them (i.e. anything with pipe characters in places other than wikilinks).
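
To illustrate why nested templates are a problem, and one possible way around it, here is a small sketch. It is not part of the script described above; the split_top_level helper and the sample field string are hypothetical. The idea is to track how deeply nested the current position is inside {{...}} or [[...]] and to split only on pipes at depth zero.

```python
# Hand-written example fields; the values are placeholders for illustration.
BODY = "a = 1| b = {{start date|1845|8|28}}| c = [[X|y]]"


def split_top_level(body):
    """Split `body` on '|' characters that sit outside any {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            # Entering a template or wikilink.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            # Leaving a template or wikilink.
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A field separator: close the current part.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


print(BODY.split("|"))        # naive split cuts the template and the link apart
print(split_top_level(BODY))  # depth-aware split keeps each field whole
```

This is only a sketch: it does not handle triple braces, HTML comments, or nowiki sections, all of which can appear in real wikitext.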


5 Comments

  1. That’s awesome!

  2. Ej,
    The YAML produced by python scrape_infobox.py "Scientific_American" "Magazine" is not great:

    Not all values are uniformly quoted (among others country, frequency and history), which makes it difficult to parse further with spyc.

  3. Hi cirkusbanjaluka,

    I just tried it, and you’re right, the quoting seems inconsistent, but PyYAML doesn’t seem to have any problem with that. It was able to parse the output with no problem.

    Sounds like the problem is either with PyYAML’s output or spyc’s parsing. You might be able to add a keyword argument to the call to yaml.dump in the source code as described in these two places to force it to format the output in a way that you like better: http://pyyaml.org/wiki/PyYAMLDocumentation#DumpingYAML http://dpinte.wordpress.com/2008/10/31/pyaml-dump-option/

    I should also mention that the code is hereby released under the BSD license, so you’re free to alter/redistribute/etc it however you want.

  4. Has anyone done some cool mashup of infobox data? There would be some interesting stuff in there…

  5. Hello, I tried your script like this: python scrape_infobox.py "Bye Bye Bangkok". But it is not working. I’m getting this error message:
    found unknown escape character ‘/’
    in “”, line 1, column 585:
    … = [[Swastika Mukherjee]][[Kanchan Mullick]][[Sh …
    ^
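
On the quoting question from comments 2 and 3, here is a hedged sketch of the kind of keyword arguments that could be passed to yaml.dump. The dictionary is a hand-written placeholder, not real scraper output, and whether spyc accepts the result is an assumption that has not been tested here. default_flow_style=False asks PyYAML for block style (one key per line) instead of the inline {…} form, and default_style='"' asks it to double-quote every scalar, so the quoting is uniform.

```python
import yaml  # PyYAML

# Placeholder values for illustration only.
data = {
    "country": "United States",
    "frequency": "Monthly",
    "history": "1845 to present",
}

# Block style, with every key and value double-quoted.
text = yaml.dump(data, default_flow_style=False, default_style='"')
print(text)

# The uniformly quoted output still loads back to the same dictionary.
print(yaml.safe_load(text) == data)
```

Note that default_style applies to keys as well as values, so the keys come out quoted too.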

