Robust Wikipedia Table Parser
Robust HTML table parser code for tables with rowspan and/or colspan attributes
In scouring the web, it was to the embarrassment of humanity that I could not find a parser that could accurately handle the tables found on Wikipedia pages such as Wisconsin_political_power; I thus wrote my own.
The issue is that existing parsers for Wikipedia tables could not handle rowspan and colspan attributes, so when they faced such tables they either errored out or returned garbled tables and entries. Because the Wikipedia article I needed data from had tables with these attributes, I had to figure out how to parse them myself. If you encounter a similar issue, I hope this saves you time.
Below, I include both the functions and a few examples. If you wish to see just the examples, you can scroll past the Parser functions section.
Parser functions
The functions below parse HTML tables (BeautifulSoup objects) into pandas DataFrames. I also include a convenience function, fetch_website, for downloading a page's HTML; for usage details, I suggest looking through the examples.
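Here is a minimal sketch of the three helpers. It reconstructs the approach from the call signatures and outputs in the examples below (estimate the grid dimensions, then copy each cell's text into every position its rowspan/colspan covers) and should be read as an illustration rather than a drop-in implementation. In particular, the column count is estimated as the raw cell count of the widest row, and cells that would spill past that estimate are clipped.

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs


def fetch_website(url):
    """Download a page's raw HTML (a browser-like User-Agent avoids 403s)."""
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    return resp.text


def pre_process_table(table):
    """Take a BeautifulSoup <table> element and return (rows, num_rows, num_cols),
    where num_cols is estimated as the cell count of the widest row."""
    rows = table.find_all('tr')
    num_rows = len(rows)
    num_cols = max(len(row.find_all(['th', 'td'])) for row in rows)
    return rows, num_rows, num_cols


def process_rows(rows, num_rows, num_cols):
    """Expand rowspan/colspan attributes into a dense num_rows x num_cols grid
    and return it as a DataFrame."""
    grid = np.full((num_rows, num_cols), None, dtype=object)
    for i, row in enumerate(rows):
        j = 0
        for cell in row.find_all(['th', 'td']):
            # Skip positions already claimed by a rowspan from an earlier row.
            while j < num_cols and grid[i, j] is not None:
                j += 1
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            text = cell.get_text(strip=True)
            # Duplicate the text over the whole spanned block, clipping at
            # the grid edges.
            for r in range(i, min(i + rowspan, num_rows)):
                for c in range(j, min(j + colspan, num_cols)):
                    grid[r, c] = text
            j += colspan
    return pd.DataFrame(grid)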
Example 1 - the essence of the game
s = """
<table>
<tr>
<th rowspan="2">year</th>
<th>col1</th>
<th colspan="2">col2</th>
</tr>
<tr>
<td rowspan="2">aaa</td>
<td rowspan="2">bbb</td>
<td rowspan="1">ccc</td>
</tr>
<tr>
<td>2017</td>
<td rowspan="2">ddd</td>
</tr>
<tr>
<td>2018</td>
<td>col1</td>
<td colspan="1">col2</td>
</tr>
</table>
"""
table = bs(s, 'lxml').find('table')
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)
df
|   | 0    | 1    | 2    |
|---|------|------|------|
| 0 | year | col1 | col2 |
| 1 | year | aaa  | bbb  |
| 2 | 2017 | aaa  | bbb  |
| 3 | 2018 | col1 | col2 |
Note how the spanned cells (year, aaa, bbb) are duplicated across every row they cover, while ccc and ddd, which would spill into a fourth column, are clipped out of the three-column grid.
Example 2 - Political Parties in Wisconsin
Here, I will parse the HTML table linked in the description.
## Go to the link and download the page's html code:
url = "https://en.wikipedia.org/wiki/Political_party_strength_in_Wisconsin"
site = fetch_website(url)
soup = bs(site, 'lxml')
## Find tables on the page and locate the desired one:
## Caveat: note that I explicitly search for a wikitable!
tables = soup.find_all("table", class_='wikitable')
## I want table 3, i.e. the one that contains years 2000-2018
table = tables[3]
## run the above functions to extract the data
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)
## print the result
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| 0 | Year | Executive offices | Executive offices | Executive offices | Executive offices | Executive offices | State Legislature | State Legislature | United States Congress | United States Congress | United States Congress | Electoral College votes |
| 1 | Year | Governor | Lieutenant Governor | Secretary of State | Attorney General | Treasurer | State Senate | State Assembly | U.S. Senator (Class I) | U.S. Senator (Class III) | U.S. House | Electoral College votes |
| 2 | 2000 | Tommy Thompson (R) | Scott McCallum (R) | Doug La Follette (D) | Jim Doyle (D) | Jack Voight (R) | 17D, 16R | 55R, 44D | Herb Kohl (D) | Russ Feingold (D) | 5D, 4R | Gore/Lieberman (D) N |
| 3 | 2001 | Scott McCallum (R) | Margaret Farrow (R) | Doug La Follette (D) | Jim Doyle (D) | Jack Voight (R) | 18D, 15R | 56R, 43D | Herb Kohl (D) | Russ Feingold (D) | 5D, 4R | Gore/Lieberman (D) N |
| 4 | 2002 | Scott McCallum (R) | Margaret Farrow (R) | Doug La Follette (D) | Jim Doyle (D) | Jack Voight (R) | 18D, 15R | 56R, 43D | Herb Kohl (D) | Russ Feingold (D) | 5D, 4R | Gore/Lieberman (D) N |
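A caveat on the hardcoded tables[3]: the index shifts whenever editors add or remove a table above it. If you know a string that only the target table contains, such as a year, you can search by content instead. The helper below, select_table_containing, is a hypothetical convenience of mine, not one of the parser functions above.

def select_table_containing(soup, keyword):
    """Return the first wikitable whose text mentions `keyword`, else None."""
    for candidate in soup.find_all("table", class_="wikitable"):
        if keyword in candidate.get_text():
            return candidate
    return None

## e.g. pick out the table that mentions 2018 instead of trusting its position
table = select_table_containing(soup, "2018")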
Example 3 - Cricket
Stack Overflow had a Wikipedia-parsing question, so I thought I would use the page in question.
url = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
site = fetch_website(url)
soup = bs(site, 'lxml')
tables = soup.find_all("table", class_='wikitable')
## the data of interest is in the fourth wikitable on this page
table = tables[3]
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)
## print the result
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Rank | Score | Opposition | Venue | Place | Season | Result |
| 1 | 1 | 45 | v Australia | Sydney Cricket Ground | Sydney, Australia | 1886/87 | Won |
| 2 | 2 | 46 | v West Indies | Queen's Park Oval | Port of Spain, Trinidad and Tobago | 1994 | Lost |
| 3 | 3 | 51 | v West Indies | Sabina Park | Kingston, Jamaica | 2009 | Lost |
| 4 | 4 | 52 | v Australia | The Oval | London, England | 1948 | Lost |
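The parsed frame still carries the header as data row 0. A possible clean-up step, assuming pandas is imported as pd and df is the frame above: promote that row to the column labels and coerce Score to numbers so that sorting behaves.

## promote the parsed header row to proper column labels
df.columns = df.iloc[0]
df = df.drop(0).reset_index(drop=True)
## Score was parsed as strings; make it numeric
df['Score'] = pd.to_numeric(df['Score'])
df.nsmallest(3, 'Score')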
Example 4 - Political Parties in DC
url = "https://en.wikipedia.org/wiki/Political_party_strength_in_Washington,_D.C."
site = fetch_website(url)
soup = bs(site, 'lxml')
tables = soup.find_all("table", class_='wikitable')
## here the first wikitable on the page is the one I want
table = tables[0]
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)
## print the result
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Year | Executive office | Executive office | District Council | District Council | U.S. Congress | U.S. Congress | U.S. Congress | U.S. Congress | Presidential electoral college votes |
| 1 | Year | Mayor | Attorney General | Chair | Composition | U.S. House Delegate | Shadow U.S. Senator (Seat 1) | Shadow U.S. Senator (Seat 2) | Shadow Representative | Presidential electoral college votes |
| 2 | 1801\n...\n1870 | [1] | [1] | [1] | [1] | [1] | [1] | [1] | [1] | [2] |
| 3 | 1871 | [3] | [3] | [3] | [3] | Norton P. Chipman (R) | [4] | [4] | [4] | [2] |
| 4 | 1872 | [3] | [3] | [3] | [3] | Norton P. Chipman (R) | [4] | [4] | [4] | [2] |
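Tables like the Wisconsin and D.C. ones parse with a two-row header, since the spans duplicate Year and the group labels into both rows. One way to keep both levels, sketched here against the D.C. frame above, is to fold the first two rows into a MultiIndex:

## build a two-level column index from the first two parsed rows
header = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1]])
data = df.drop([0, 1]).reset_index(drop=True)
data.columns = header
## columns are now addressed by (group, office) pairs, e.g.:
data[('Executive office', 'Mayor')].head()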