While working on a side project, I needed to scrape some data from Wikipedia into CSVs. As I wrote the script, I ran into three hurdles: handling multiple tables in an article, data cells that span multiple rows, and removing footnotes from cells.
You can find the finished script on GitHub.
Handling multiple tables in an article
Not every table on a Wikipedia page has valuable data for analysis. For example, the last table on the volcano height article only has 3 data points, one of which is ‘over 9,000 meters’.
Luckily, there are two CSS classes Wikipedia uses only on larger tables – wikitable and plainrowheaders. Beautiful Soup makes it easy to filter for just these tables in Python. You can test the code below with Python 3 after installing the requests and beautifulsoup4 packages from pip.
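A minimal sketch of that filtering step. The inline HTML below stands in for a real page (in practice you would fetch it with requests.get(url).text), and the class-based CSS selector does the filtering:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched Wikipedia page: one data table, one navbox-style
# table that we want to skip.
html = """
<table class="wikitable plainrowheaders"><tr><th>Year</th></tr></table>
<table class="nowraplinks"><tr><td>navigation box, skip me</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# The compound selector keeps only tables carrying both classes.
tables = soup.select("table.wikitable.plainrowheaders")
print(len(tables))  # 1
```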
Data cells that span multiple rows
Sometimes a data cell spans multiple rows in a table. In theory this is a presentation-level detail, and the underlying tabular data would have the value duplicated across those rows. It’s a nice theory, but HTML decided to go a different route.
Examine this table in the article on highest-grossing films by year. The year 1925 has data from two films, so the Year cell is stretched across two rows to reflect this. This is the underlying HTML table structure of those two rows:
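The original snippet isn't reproduced here, but the structure is roughly this (extra columns trimmed for illustration; Wikipedia's actual attributes differ):

```html
<tr>
  <td rowspan="2">1925</td>
  <td>The Big Parade</td>
</tr>
<tr>
  <!-- no Year cell here: the 1925 cell above spans into this row -->
  <td>Ben-Hur</td>
</tr>
```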
So when you parse this table, some rows have 5 columns, and some rows have 4 columns. Great.
To solve this, create a list for each table being parsed to track these pesky rowspan elements. Use the first row of the table to define the number of columns.
As the script parses each row, check the row for any cells with a rowspan attribute. If such a cell exists, save its value and the number of rows it spans to a dictionary, and store that dictionary in the saved_rowspans list at the cell's column position.

(I chose a dictionary because the rows_left value decrements over time, so it needs to be mutable. A Counter-like class would also be a good choice.)
To use the saved rowspan elements, the loop checks whether the number of cells in the current row matches the total number of columns in the table. If it doesn't, it inserts the value from the matching cell in saved_rowspans and decrements its rows_left. If this is the last row the cell spans into, it sets the saved_rowspans entry for that cell back to None.

Now our table actually has the proper number of columns in each row, with values in the previously blank cells. Stupid HTML.
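Pulling the two halves together, here is a condensed sketch of that loop. The table is a toy two-column version of the 1925 rows; names like saved_rowspans and rows_left follow the description above, while the full script on GitHub handles more edge cases:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Year</th><th>Title</th></tr>
  <tr><td rowspan="2">1925</td><td>The Big Parade</td></tr>
  <tr><td>Ben-Hur</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# The first row defines the number of columns.
n_columns = len(rows[0].find_all(["th", "td"]))
saved_rowspans = [None] * n_columns

parsed_rows = []
for row in rows:
    cells = row.find_all(["th", "td"])

    # Record any cell that spans into later rows.
    for index, cell in enumerate(cells):
        if cell.has_attr("rowspan"):
            saved_rowspans[index] = {
                "rows_left": int(cell["rowspan"]),
                "value": cell.get_text(strip=True),
            }

    values = [cell.get_text(strip=True) for cell in cells]

    # A short row means a saved rowspan cell is missing here.
    if len(values) != n_columns:
        for index, saved in enumerate(saved_rowspans):
            if saved and saved["rows_left"] > 1:
                values.insert(index, saved["value"])
                saved["rows_left"] -= 1
                if saved["rows_left"] == 1:
                    # Last row the cell spans into: clear the slot.
                    saved_rowspans[index] = None

    parsed_rows.append(values)

print(parsed_rows)
# [['Year', 'Title'], ['1925', 'The Big Parade'], ['1925', 'Ben-Hur']]
```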
Removing footnotes from cells
The final step was to remove footnotes, references, and other garbage from the text. These are important pieces of information for the broader analysis, but do not belong in the tabular data.
I passed each cell of HTML parsed with BeautifulSoup through a few cleanup steps to remove these items.
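The exact steps aren't reproduced here, so this is a sketch of the kind of cleanup involved: dropping sup footnote markers and hidden sortkey spans at the HTML level, then stripping any bracketed leftovers from the text. The clean_cell helper is mine, not the script's:

```python
import re
from bs4 import BeautifulSoup

def clean_cell(cell):
    """Strip footnotes and references from one parsed table cell."""
    # Drop <sup> footnote markers like [1] or [note 2].
    for sup in cell.find_all("sup"):
        sup.decompose()
    # Drop the invisible sort keys some sortable tables embed.
    for span in cell.find_all("span", class_="sortkey"):
        span.decompose()
    text = cell.get_text(strip=True)
    # Remove any leftover bracketed references in the text itself.
    return re.sub(r"\[[^\]]*\]", "", text).strip()

cell = BeautifulSoup("<td>Ben-Hur<sup>[2]</sup></td>", "html.parser").td
print(clean_cell(cell))  # Ben-Hur
```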
After that, each cell is surrounded with double quotes and joined into a comma-delimited string, then written to the CSV file.
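That quote-and-join step is simple enough to show inline (the row values are made up; the stdlib csv module would also handle the quoting and escaping for you):

```python
# Surround each cleaned cell with double quotes, then join with commas
# to form one line of the output CSV.
row = ["1925", "Ben-Hur", "$1,000,000"]
csv_line = ",".join('"{}"'.format(value) for value in row)
print(csv_line)  # "1925","Ben-Hur","$1,000,000"
```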
The final script can be found on GitHub, along with the CSV outputs from four articles that I used for testing (e.g. top-grossing films). It's been reliable for my needs so far, but I'll update it as I run into new bugs.
If you enjoyed this article, please consider contributing to or starring my project on GitHub.