Subway Data ETL Pipeline: Part II

A brief tutorial on how to extract, transform, and load data from Wikipedia with Python, Excel, and Tableau

Image from Wikipedia

In this article, we will continue where we left off in Subway Data ETL Pipeline: Part I (that is, at the L of ETL) by loading the data we extracted from Wikipedia. We’ll also do a little extra by creating some visualizations in Tableau.

Before we begin, let’s briefly recap the complete ETL process in the following diagram:

Image by Author

Tableau Public

Let’s start with a brief introduction to the visualization platform we will be using: Tableau Public is a free platform to explore, create, and publicly share data visualizations online. Visualizations created in Tableau Public (“vizzes”) can be embedded into web pages and blogs, shared via social media or email, and made available for download and exploration by other users.

Image from Tableau Public

Possibly the best part of Tableau is that no programming skills are required! You can also create vizzes and publish them to Tableau Public from either the Professional or Public edition of Tableau Desktop.

While it is also possible (and common!) to load the data into a SQL database, for the sake of simplicity, in this article we’ll load the data into .xlsx files and store them in Google Drive.

Google Drive storage

In order to store the extracted data in your personal Google Drive storage, I encourage you to clone the GitHub repository and then install Google Drive for desktop, which syncs your Google Drive storage to your machine, so you’ll see the following drives locally:

Image by Author

Once that’s done, create a folder called analysis in your drive (online or locally; they will be synced either way, so whatever you create locally will also appear online, and vice versa). Inside that folder, I also recommend cloning the repository, which contains the folder structure as well as the code. The folder structure should look as follows:

Image by Author
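In rough outline, the layout looks like this (the repository folder name below is a placeholder for whatever the cloned repo is called on your machine):

```
Google Drive/
└── analysis/
    └── <cloned-repo>/                # placeholder for the repository folder
        ├── etl_pipeline-subways.ipynb
        └── data/                     # the generated .xlsx files land here
```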

Inside the folder of your Google account, right-click the file etl_pipeline-subways.ipynb. If Google Colab isn’t already connected to your Google account, click Open with >> Connect more applications, then search for Google Colab.

Loading

Following the ETL process, we proceed to load the extracted data into Excel files using the xlsxwriter module (https://xlsxwriter.readthedocs.io/getting_started.html):

Image by Author

The above code will store three Excel files in the data folder, which we will later connect to Tableau Public to generate some “vizzes.”

Image by Author
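If you’d rather copy code than a screenshot, here is a minimal sketch of that loading step, assuming the extraction in Part I left you with pandas DataFrames (the variable, sheet, and file names below are hypothetical). One common pattern writes through pandas with xlsxwriter as the engine:

```python
import os

import pandas as pd

# Hypothetical stand-in for one of the DataFrames extracted from Wikipedia
# in Part I; in the notebook, these come from the scraped and cleaned tables.
df_metro_systems = pd.DataFrame({
    "City": ["London", "Beijing", "New York City"],
    "Country": ["United Kingdom", "China", "United States"],
    "Year opened": [1863, 1971, 1904],
})

os.makedirs("data", exist_ok=True)  # make sure the data folder exists

# Write the DataFrame to an .xlsx file using xlsxwriter as the engine;
# the notebook repeats this for each of the three extracted tables.
with pd.ExcelWriter("data/metro_systems.xlsx", engine="xlsxwriter") as writer:
    df_metro_systems.to_excel(writer, sheet_name="metro_systems", index=False)
```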

These Excel files will have the same structure as a locally stored Excel spreadsheet, but will be stored in the cloud!

Image by Author
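As an optional sanity check (not part of the original notebook), you can read one of the generated files back with pandas to confirm it round-trips correctly:

```python
import pandas as pd

# Read the generated file back to verify its contents.
df_check = pd.read_excel("data/metro_systems.xlsx")
print(df_check.head())
```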


Visualization

If you don’t have a Tableau Public account, you can simply create one for free on the Tableau Public site:

Image from Tableau Public

Once your account is created, you can edit your profile (recommended) and start building visualizations by clicking “Create Viz” on the main screen of your account.

Again, for the sake of simplicity, we will be creating these visualizations online, but you can also install the Tableau Desktop version, which has more connector options (e.g., various on-premises and cloud databases, text and Excel files, etc.). Tableau Public can only connect to locally stored files (CSV or Excel) or files stored in your Google Drive.

Image by Author

From here, connecting to the data in your Google account will be very straightforward. Simply go to the folder containing the data and select the files you need to connect to Tableau:

Image by Author

Once you click on the file you are about to connect to Tableau, you should see a screen like this one:

1. Connections: details about the connected data source.

2. The main canvas: where you can drag and drop additional tables and join them to your data by a key field. The canvas has both a logical layer and a physical layer.

3. Field details: the data types of the fields contained in the table.

4. A preview of the data, similar to an Excel spreadsheet.

Image by Author

Right there in Tableau, we can create data visualizations in a variety of very flexible ways.

When we load geographical data (e.g., countries and cities) into Tableau, it automatically generates latitude and longitude for those locations, so we can quickly and easily build attractive map visualizations like the one below.

Image by Author

In the visualization above, we can also create filters on the geographic locations to display when each subway system was opened.

It might not be a surprise that the country with the most subway systems currently open is China.

Image by Author

We can also continue to play around with this data. We can try joining it with more datasets that expand our understanding of the context, or we can create new features and dimensions.
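For example, here is a hypothetical sketch of deriving a “system age” feature with pandas before loading; once written to the Excel file, Tableau would pick it up as a new measure (the column names are assumptions):

```python
from datetime import date

import pandas as pd

# Hypothetical input; in practice, this would be the extracted table.
df = pd.DataFrame({
    "City": ["London", "Beijing"],
    "Year opened": [1863, 1971],
})

# Derive a new feature: how many years each system has been open.
df["System age (years)"] = date.today().year - df["Year opened"]
print(df)
```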

The visualizations I created in Tableau for this tutorial are all available on my Tableau Public profile. Please feel free to play around with the data, or even add more data to it through your own analyses and dashboards.

In this two-part tutorial series, I have covered the end-to-end ETL process, from web scraping to data loading, and even visualization. I encourage you to clone the Python code that covers the ETL process, run it yourself, and extract the data into your own Google Drive storage. That way, you can connect the data to the online Tableau tool and keep playing with it at home!

Enjoy, and happy coding!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.