A brief tutorial on how to extract, transform, and load data from Wikipedia with web scraping and pandas
Photo by Marco Lastella on Unsplash
Motivation
The motivation for writing this article came from a documentary I watched a while ago about how the first subway stations were built and what motivated cities to create an underground transit system in the first place. At the time (in 1870), many people believed the idea of creating a railway underground was crazy. To them, the proposal equated to disturbing the land of the dead, and they believed malicious spirits would rise from below and dwell there.
In reality, the man with this so-called “crazy” idea of an “underground transportation wagon” was a visionary of his time. His name was Alfred Ely Beach, and he was a publisher, editor, and inventor, who patented the idea of an underground transit system in New York in 1869.
By Moses King — scan, Public Domain, https://commons.wikimedia.org/w/index.php?curid=108133952
By the time the first oil-powered underground train was designed and implemented, newer underground systems (this time powered by coal) were beginning to be manufactured. In either case, it was not a good situation for the first subway riders’ lungs!
In 1894, the New York subway system would be approved, its construction beginning just six years later. Large cities around the world quickly followed suit, and today you’ll be hard-pressed to find a major city without one.
The Data
The full dataset that brings this information to us is available on Wikipedia. Though a commonly used source, Wikipedia does not always organize information as we would like it, and sometimes lacks a level of granularity we might require.
In this tutorial, I will show readers a relatively simple way of extracting data from Wikipedia by using Python within a Jupyter Notebook. I will then demonstrate how to store that data as CSV files in Google Drive, and visualize it in Tableau for further analysis.
This article will be split into two parts. Here, in part one, I will cover the Extract and Transform phases of the ETL process, and in part two, I will explore how to Load that data for further analysis in Tableau.
Web scraping data from Wikipedia
We will first need to import the packages and libraries that allow us to scrape data. I usually work with [requests](https://2.python-requests.org/en/master/user/quickstart/#make-a-request) to fetch the data and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to parse the extracted HTML, keeping the built-in [json](https://docs.python.org/3/library/json.html) module on hand for serialization. I will also use the ever-handy [pandas](https://pandas.pydata.org/) and [numpy](https://numpy.org/) libraries for data manipulation. Since we will also be working with datetime objects, I’ve imported the [datetime](https://docs.python.org/3/library/datetime.html) module as well.
Image by Author
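If you are following along in your own notebook, those imports boil down to something like the sketch below (the exact code lives in the notebook linked later in this article):

```python
# Standard-library helpers
import json
from datetime import datetime

# Third-party libraries for scraping and data manipulation
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
```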
Extraction stage
First, we begin the Extraction stage of the ETL process. Here, we’ll need to request the data as an HTML document from Wikipedia, and then parse that data to convert it to a tabular format:
Image by Author
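In rough terms, that request-and-parse step looks like the sketch below. Note that the exact URL is an assumption on my part, based on the tables described next; the notebook in my repo contains the code I actually ran.

```python
import requests
from bs4 import BeautifulSoup

# Assumed source page: Wikipedia's "List of metro systems" article
url = "https://en.wikipedia.org/wiki/List_of_metro_systems"

response = requests.get(url)
response.raise_for_status()  # stop early if the request failed

# Parse the raw HTML so we can search it for the tables we need
soup = BeautifulSoup(response.text, "html.parser")
```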
The Wikipedia article we are scraping contains three different tables, each identified in the HTML document by the `class="wikitable sortable jquery-tablesorter"` attribute. We can also identify an individual table by looking at the headers it contains in the HTML doc.
Image by Author
The second table in the article lists all the systems by country:
Image by Author
The third table in the article lists the subway systems that are currently under construction:
Image by Author
In the code written to support this tutorial, you will notice that after creating the BeautifulSoup object with the `html.parser` parser, we look for all the objects in the HTML doc identified by `'class': "wikitable"`. As previously explained, the Wikipedia page contains three such tables, so every table that we store inside the `tables` variable is assigned an index, beginning with `0`.
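Continuing from the sketch above, that lookup might look like this:

```python
# Collect every table tagged with the "wikitable" class; this page has three of them
tables = soup.find_all("table", {"class": "wikitable"})
print(len(tables))  # expect 3
```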
We use `parse_tables`, a function specifically designed to parse the HTML table data into the notebook, passing as arguments the object we would like to parse, `tables`, followed by the index of the table we would like to access (in this case, the first table).
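The exact implementation lives in the notebook; a simplified sketch of what such a function might do is shown here (it reads the header row and every data row, and does not handle merged cells):

```python
import pandas as pd

def parse_tables(tables, index):
    """Turn one of the scraped HTML tables into a pandas DataFrame (simplified sketch)."""
    table = tables[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    # The first row holds the column headers; the rest are data rows
    return pd.DataFrame(rows[1:], columns=rows[0])

# Parse the first table on the page (index 0)
df = parse_tables(tables, 0)
df.head()
```

(For a one-liner alternative, `pandas.read_html` can parse the same tables directly, but a hand-rolled function keeps the parsing explicit.)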
Image by Author
As you can see, there is some data in this table that we won’t need for our analysis. We will address this in the next phase of our ETL pipeline: Transformation.
Transformation stage
We won’t be needing the data in the second and third extracted tables, so we’ll focus our attention solely on the first table:
Image by Author
Image by Author
Next, we will rename the columns and extract just the numerical characters from the columns containing `Length` and `Year`. For more details on the code, please feel free to clone this notebook in my GitHub repo.
Image by Author
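That step boils down to something like the following sketch. The original Wikipedia header names and the regular expressions here are assumptions on my part; the notebook has the exact mapping.

```python
# Hypothetical rename mapping -- the left-hand names depend on Wikipedia's headers at scrape time
df = df.rename(columns={
    "Service opened": "Year_Opened",
    "Last expanded": "Year_of_Last_Expansion",
    "System length": "System_Length",
    "Annual ridership (millions)": "Annual_Ridership",
})

# Keep only the numeric part of the Year and Length values, dropping units and footnote markers
df["Year_Opened"] = df["Year_Opened"].str.extract(r"(\d{4})", expand=False)
df["Year_of_Last_Expansion"] = df["Year_of_Last_Expansion"].str.extract(r"(\d{4})", expand=False)
df["System_Length"] = df["System_Length"].str.extract(r"([\d.]+)", expand=False)
```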
Once the columns are renamed and the information in them is extracted, we inspect our DataFrame:
Image by Author
As you can see, all columns are currently `object` data types. This makes sense for some of the columns, but we’ll need to change this for all numerical columns. Let’s first replace the NaN values in the `Year_of_Last_Expansion` and `Annual_Ridership` columns with 0, and then convert the `Year_Opened`, `Year_of_Last_Expansion`, `Stations`, `System_Length`, and `Annual_Ridership` columns to numeric data types (integer and float):
Image by Author
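Here is a sketch of that cleanup under the same assumed column names: missing values are filled first, then everything is coerced to numbers (stripping any thousands separators along the way), and the count-like columns are cast to integers.

```python
import pandas as pd

# Replace the NaN values in the two sparsest columns with 0
df[["Year_of_Last_Expansion", "Annual_Ridership"]] = (
    df[["Year_of_Last_Expansion", "Annual_Ridership"]].fillna(0)
)

# Coerce everything that should be numeric, stripping commas (thousands separators) first
numeric_cols = ["Year_Opened", "Year_of_Last_Expansion", "Stations",
                "System_Length", "Annual_Ridership"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(",", ""), errors="coerce")

# Years and station counts read more naturally as integers (assuming no missing values remain);
# lengths and ridership stay as floats
int_cols = ["Year_Opened", "Year_of_Last_Expansion", "Stations"]
df[int_cols] = df[int_cols].astype(int)

df.dtypes  # the numeric columns should now show int64 / float64
```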
Now we have converted the numerical columns to integer and float data types, so that they can be read by Tableau for further analysis!
I hope that you’ve found this tutorial useful.
Be sure to check out my second installment, Subway Data ETL Pipeline: Part II, where I cover the “load” stage of ETL, as well as a bit of further visualization of the data!
Cheers!
Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.
Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.
If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.