Data Extraction from APIs: From ETL to Visualization



Introduction

Traditionally, APIs are tools that make a website’s data digestible for a computer. Through an API, a computer can view and edit data, just as a person can by loading pages and submitting forms.

In this tutorial, you will both learn about and put into practice the extraction of data from APIs. You’ll see where that data lives and how it is extracted and consumed by the vast majority of websites we interact with every day.

Quick Overview of APIs

Image by Author. Communication between servers and APIs

It is as simple as this: one side is the “server” that provides the API, which runs alongside the website it serves. The API may be part of the same program that handles web traffic, or it can be a completely separate one. In either case, it sits waiting for others to ask it for data.

The other side is the “client.” This is a separate program that knows what data is available through the API and can manipulate it, typically at the request of a user.

So the key terms for APIs are as follows:

  • Server: A powerful computer that runs an API
  • API: The “hidden” portion of a website that is meant for computer consumption
  • Client: A program that exchanges data with a server through an API

The set of rules that makes this kind of communication between servers and clients possible is called a “protocol.” On the web, the main protocol is the Hypertext Transfer Protocol, better known by its acronym, HTTP.

Image by Author. Request-Response cycle

Communication through HTTP centers around the Request-Response Cycle. The client sends the server a request to do something. The server, in turn, sends the client a response saying whether or not the server could do what the client asked.

HTTP Requests

In order to receive a response, the client needs to include four things in a request:

  1. URL (Uniform Resource Locator)
  2. Method
  3. List of Headers
  4. Body

Image by Author. Structure of an HTTP request.

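If it helps to see those four pieces concretely, here is a minimal Python sketch using the requests library (which we install later in this tutorial) against httpbin.org, a public echo service used purely as an example endpoint; it is not part of this tutorial’s API:

```python
import requests

# The four pieces of an HTTP request, illustrated with a POST.
url = "https://httpbin.org/post"                 # 1. URL: where the resource lives
headers = {"Content-Type": "application/json"}   # 3. Headers: metadata about the request
body = '{"symbol": "AAPL"}'                      # 4. Body: the data sent to the server

# 2. Method: the requests library exposes one function per HTTP method
response = requests.post(url, headers=headers, data=body)
print(response.status_code)  # e.g. 200 if the server accepted the request
```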

In order to get further details on the different methods, please review this documentation.

Status Codes

Status codes are three-digit numbers, each with a unique meaning. Perhaps the best known among them are 200 (a successful response), 400 (a client error response), and 500 (a server error response). For more information on statuses, please follow this link.

Data Formats

HTTP requests and responses also carry data in specific formats. The most common formats found in modern APIs are JSON (JavaScript Object Notation) and XML (Extensible Markup Language).

The most popular among APIs is JSON, a very simple format built from two pieces: keys and values.

Image by Author. Response data format (JSON)

Each key is an attribute of the object, and each value is the element linked to that key; a value can be a single element, a list of values for one attribute, or even a nested object.
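For example, here is a tiny JSON document and how Python reads it; the field names are invented purely for illustration:

```python
import json

# A small JSON document: keys on the left, values on the right.
raw = '{"symbol": "AAPL", "prices": [157.96, 154.51], "meta": {"currency": "USD"}}'

data = json.loads(raw)           # parse the JSON string into a Python dict
print(data["symbol"])            # 'AAPL' -> a single value
print(data["prices"])            # [157.96, 154.51] -> a list of values for one key
print(data["meta"]["currency"])  # 'USD' -> a value nested inside another object
```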

That should be enough information for a basic understanding of what APIs are and how they work. For further information on APIs, review this very complete documentation.

Python Packages for API Requests

There are many different types of requests, as we saw previously, but the most commonly used one is the GET request, which retrieves data from an API endpoint. We’ll only be retrieving data here, so let’s focus on making GET requests.

When we make a request, the response from the API comes with a response code that tells us whether or not our request was successful.

Requests

To make a GET request, we’ll use the [requests.get()](https://2.python-requests.org/en/master/user/quickstart/#make-a-request) function, which is provided by the Python requests library. But first, let’s install the library in our Python environment.

I am using the Anaconda installation and a Jupyter notebook, which has all the necessary Python modules pre-installed. To install the Anaconda distribution in your local environment, follow these quick steps.

Once Anaconda is installed, just run pip install requests. If you get an error when running import requests, you can also install the library with conda: conda install requests.
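As a quick sanity check that everything is working, you can make a first GET request and inspect the response code; httpbin.org here is again just a public test endpoint, not part of our later workflow:

```python
import requests

# Verify that requests is installed and that we can reach an endpoint.
response = requests.get("https://httpbin.org/get")

print(response.status_code)  # 200 -> the request succeeded
print(response.ok)           # True for any 2xx status code
```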

Getting connected to API endpoints

I have attached the Jupyter Notebook so you can follow the step-by-step API call to the Alpha Vantage API, but first you’ll need to complete two simple steps in order to get the API key that connects you to the API.


Get an API key: Once you get to the Alpha Vantage page, you’ll need to get an API key in order to connect to the different endpoints.

Review the documentation: Once you have your API key, quickly review the documentation and you’ll notice that there are different endpoints for retrieving data. In any case, I’ll explain how to connect.

alphavantage.co

Making our first API connection

Once you have set up your Anaconda (Python) environment to start performing requests from APIs, you’ll need to import three libraries in your notebook:

Image by Author

requests, pandas, and configparser. We’ll need pandas to manipulate and format the JSON output we get from the API, while configparser is a configuration-file module that lets us read .ini files and store our “secrets” (passwords, API keys, etc.) in a secure structure. You’ll therefore need to create a config.ini file and save it in the same directory as your notebook. That .ini file will have the following structure:

Image by Author
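As a minimal sketch, the file might look like this; the key name api_key is my assumption, and any name works as long as you read it back with the same name:

```ini
[tokens]
api_key = YOUR_ALPHAVANTAGE_API_KEY
```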

[tokens] is the section tag that identifies the key inside the file, followed by your secret API key; that’s all. After that, we’ll read the API key from the config file and set the different parameters for the API call to Alpha Vantage:

Image by Author
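The screenshot isn’t reproducible as text, but a sketch along those lines, importing the three libraries mentioned above and assuming the TIME_SERIES_DAILY function from the Alpha Vantage docs plus the config.ini sketched earlier, might look like this:

```python
import configparser

import pandas as pd
import requests

# Read the API key back from config.ini; the [tokens] section and the
# api_key name match the sketch of the file above.
config = configparser.ConfigParser()
config.read("config.ini")
api_key = config["tokens"]["api_key"]

# Parameters for the Alpha Vantage call. TIME_SERIES_DAILY is an assumption;
# the notebook may use another function from the documentation.
base_url = "https://www.alphavantage.co/query"
params = {
    "function": "TIME_SERIES_DAILY",
    "symbol": "AAPL",
    "outputsize": "compact",  # the latest ~100 daily data points
    "apikey": api_key,
}
```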

So there we are connecting to the base URL of the Alpha Vantage API, using one of the functions mentioned in the documentation, and retrieving data for AAPL (Apple) stock.

Image by Author
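Continuing the sketch above, the request itself and the status check might look like this:

```python
# Send the GET request with the parameters defined above
# and confirm we got a successful response.
response = requests.get(base_url, params=params)
print(response.status_code)  # we expect 200
```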

After running this chunk of code, we expect to get a successful response from the API. Next, we’ll need to convert our output, which is JSON-formatted, into a more tabular one: first storing the JSON in an object, and then writing a function that formats it into a pandas DataFrame:

Image by Author
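A function along those lines, again assuming the TIME_SERIES_DAILY response (whose payload is keyed by “Time Series (Daily)”), might look like this:

```python
# Store the JSON body in an object, then reshape it into a pandas DataFrame.
data = response.json()

def to_dataframe(payload):
    # 'Time Series (Daily)' is the key Alpha Vantage uses for the
    # TIME_SERIES_DAILY function assumed earlier; other functions differ.
    series = payload["Time Series (Daily)"]
    df = pd.DataFrame.from_dict(series, orient="index")
    df.index.name = "date"
    df = df.reset_index()
    # Rename columns like '1. open' to plain 'open'
    df.columns = [c.split(". ")[-1] for c in df.columns]
    return df

df = to_dataframe(data)
df.head()
```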

After applying that function to the JSON stored in the data object, we’ll get a tabular DataFrame containing AAPL prices for the last 5 months:

Image by Author

Data manipulation and EDA

Once we have all the data we need from the API, we need to do some EDA (Exploratory Data Analysis) and a little feature engineering as well. In our dataset we have a ‘date’ corresponding to each value; the dates run from 2021-12-17 to the day before I wrote this article (2022-05-11):

Image by Author

So it would be useful to have year, month, and day columns for further analysis, and also to change some data types that should be int or float instead of object. I made those changes to the columns as follows:

Image by Author
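A sketch of those transformations, using the column names from the DataFrame built above:

```python
# Parse the date strings, derive year/month/day, and fix the numeric dtypes
# (every column arrives from the API as a string, i.e. object dtype).
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# Price columns should be float, volume should be int.
for col in ["open", "high", "low", "close"]:
    df[col] = df[col].astype(float)
df["volume"] = df["volume"].astype(int)

df.dtypes  # the new columns are integers; the price columns are now floats
```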

Now, checking the new columns in the DataFrame, we’ll see two new int64 columns, while some others were changed to float type:

Image by Author

Now we need to filter those records to get the market trend for one complete month, though we could also analyze the trends for the whole extracted dataset:

Image by Author
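A sketch of that filter, taking April 2022 (the month discussed in the conclusion below) as the complete month:

```python
# Keep only one complete month, e.g. April 2022.
april = df[(df["year"] == 2022) & (df["month"] == 4)]
april.head()
```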

Visualization and Insights from extracted data

In order to visualize our dataset, we’ll use the seaborn library for Python, which generates very polished plots. Of course, we always need to keep in mind that not every kind of plot applies to every kind of data; to help with that, here is a link that can guide your decision.

So seaborn, which was imported above in the Jupyter Notebook via import seaborn as sns, accepts multiple parameters depending on the visualization we need. In this case: data (the dataset that contains the data we need to visualize) and the x and y axes.
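The plotting code likely resembles the following sketch; lineplot and the close column are my assumptions, since the article only names the data, x, and y parameters:

```python
import seaborn as sns
import matplotlib.pyplot as plt  # seaborn draws onto matplotlib figures

# Line plot of the closing price across the filtered month.
sns.lineplot(data=april, x="date", y="close")
plt.xticks(rotation=45)  # keep the date labels readable
plt.show()
```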

Image by Author

So here we notice that the AAPL stock was going down during April 2022.

In this article, we have covered the whole road from the concept and infrastructure of APIs to the way they are built and their different elements.

We have also learned how to extract value (data) from them. To achieve that goal, there are plenty of tools and languages we can use, among them Python libraries, .NET, Java, and C/C++.

I hope this piece has given you an introductory overview of APIs and how to use them in the real world. Remember that you also have access to the analysis notebooks I used on my GitHub.

Cheers.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.