Take advantage of PySpark and PyData libraries by building an app that analyzes data with Spark
Introduction
In this article, we will continue analyzing data with Spark. The goal is to teach the basics of building an app that gathers data from social networks, then extracts it and properly formats it for further analysis.
The ideas for this article are mainly inspired by the book Spark for Python Developers, which you can get from Amazon or download as a digital copy from several sources on Google. Also for your reference, here is a repo that has a complete copy of the book.
The basic architecture of Streaming apps
In the previous article in this series (if you missed it, the link is below), we gave a quick overview of how to set up a Jupyter Notebook for data analysis using Spark, and explained some basic features of PySpark.
[Starting Data Analisys with PySpark](https://felixvidalgu.medium.com/starting-data-analisys-with-pyspark-ab6869b360e7)
We discovered that Spark is an extremely efficient distributed computing framework, and that in order to exploit its full power, it is necessary to architect data solutions in accordance with its particular features. Spark is unique in that it allows batch processing and streaming analytics on the same unified platform.
Spark data solutions must guarantee four main features:
- Latency: Slow and fast processing must be combined in the same architecture. Slow processing is done on historical data in batch mode, also called data at rest; this phase builds precomputed models and data patterns that will be used in the fast processing phase (also known as real-time analysis, or data in motion), where continuous data is fed into the system. The essential difference between data at rest and data in motion is that the former is processed with a longer latency, while the latter involves streaming computation on data ingested in real time.
- Scalability: Spark is natively linearly scalable through its distributed in-memory computing framework. Databases and data stores interacting with Spark also need to be able to scale linearly as data volume grows.
- Fault tolerance: If a failure occurs due to hardware, software, or network problems, the architecture should be resilient enough to remain available throughout the whole process.
- Flexibility: The data pipelines put in place in this architecture can be adapted and retrofitted very quickly depending on the use case.
Getting data from Twitter
This time we will dive into a pipeline that puts Spark's data at rest paradigm into action. For this we will connect to the Twitter API to collect some data in JSON format, and then correct, collect, compose, consume, and control it: a five-step process that will be executed iteratively. To get started, we need to create a new app in the Twitter Developer platform.
Image by author
Once the app is created, you will get the four credentials necessary to connect to the API:
CONSUMER_KEY = 'GetYourKey@Twitter'
CONSUMER_SECRET = 'GetYourKey@Twitter'
OAUTH_TOKEN = 'GetYourToken@Twitter'
OAUTH_TOKEN_SECRET = 'GetYourToken@Twitter'
Using these credentials, we will establish a programmatic connection that activates our OAuth access to Twitter data and lets us tap into the Twitter API, subject to its rate limits on GET requests.
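In code, the connection boils down to passing these four values to the OAuth helper of the `twitter` library we will install shortly. This is a minimal sketch of what the `TwitterAPI` class below does in its constructor:

```python
import twitter

# build an OAuth object from the four credentials and open an API client;
# the argument order is (token, token_secret, consumer_key, consumer_secret)
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
api = twitter.Twitter(auth=auth)
```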
By now you should have set up your conda environment with PySpark installed, and you should be able to activate your environment (which we named `pyspark` in the previous tutorial). You should then start a Jupyter notebook by running the following command in your working directory:
jupyter notebook
Image from Author
The first thing you should do is install the `twitter` Python library by running:
pip install twitter
Installing twitter library from Jupyter Notebook
Then we will import three libraries in order to retrieve Twitter data via OAuth, and define some base methods that will authenticate, search, and parse the extracted data:
import twitter
import urllib.parse
from pprint import pprint as pp
The full code examples, including the code I will explain in this article, are available online by doing a simple search for the book's name. This is one of the code repositories, which seems to be one of the official ones. However, as it was written in Python 2, I will be adapting it to Python 3 (and adding some spice as well!) in the repository available on my GitHub.
## Python Twitter API class and its base methods for authentication, searching, and parsing the results.
# `config` is assumed to be a configparser.ConfigParser loaded from a local
# credentials file with an [auth] section holding the four values shown above,
# e.g. config = configparser.ConfigParser(); config.read('twitter.cfg')

class TwitterAPI(object):
    """
    TwitterAPI class allows the connection to Twitter via OAuth
    once you have registered with Twitter and received the
    necessary credentials.
    """

    # initialize and get the Twitter credentials
    def __init__(self):
        consumer_key = config.get('auth', 'consumer_key')
        consumer_secret = config.get('auth', 'consumer_secret')
        access_token = config.get('auth', 'access_token')
        access_secret = config.get('auth', 'access_secret')
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_secret = access_secret
        # authenticate the credentials with Twitter using OAuth
        self.auth = twitter.oauth.OAuth(access_token, access_secret,
                                        consumer_key, consumer_secret)
        # create the authenticated Twitter API client
        self.api = twitter.Twitter(auth=self.auth)

    # search Twitter with query q and a maximum number of results
    def searchTwitter(self, q, max_res=10, **kwargs):
        search_results = self.api.search.tweets(q=q, count=10, **kwargs)
        statuses = search_results['statuses']
        max_results = min(1000, max_res)
        for _ in range(10):
            try:
                next_results = search_results['search_metadata']['next_results']
            except KeyError:
                break
            # next_results is a query string such as '?max_id=...&q=...';
            # parse it into keyword arguments for the next page request
            next_results = urllib.parse.parse_qsl(next_results[1:])
            kwargs = dict(next_results)
            search_results = self.api.search.tweets(**kwargs)
            statuses += search_results['statuses']
            if len(statuses) > max_results:
                break
        return statuses

    # parse each collected tweet to extract id, creation date, user id,
    # user name, tweet text, and the expanded URLs it contains
    def parseTweets(self, statuses):
        return [(status['id'],
                 status['created_at'],
                 status['user']['id'],
                 status['user']['name'],
                 status['text'],
                 url['expanded_url'])
                for status in statuses
                for url in status['entities']['urls']]
Basically, the methods of this class set up the authentication required to connect to the Twitter API, run a search for the query the user is interested in, and finally let us visualize the JSON output:
Image from Author
First we instantiate the `TwitterAPI()` class, then we run a search with the `searchTwitter` method for tweets related to “ChampionsLeague,” visualize the JSON using the `pprint` library, and lastly keep only the fields we are interested in with the `parseTweets` method.
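As a rough sketch of that flow (the query string and result count here are just illustrative):

```python
# instantiate the API wrapper and run a search
twitter_api = TwitterAPI()
statuses = twitter_api.searchTwitter(q='ChampionsLeague', max_res=100)

# inspect the raw JSON of the first tweet
pp(statuses[0])

# keep only the fields we care about: id, creation date, user id,
# user name, tweet text, and expanded URL
parsed = twitter_api.parseTweets(statuses)
pp(parsed[:5])
```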
Serializing and storing data
As we are harvesting data from web APIs, we need to persist that data, whether in a database such as MongoDB or in a widely used file format such as CSV or JSON.
Serializing a Python object converts it into a stream of bytes so it can be transferred over a TCP network or stored in persistent storage.
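As a tiny illustration of what that means in practice (the record below is made up), a Python dict can be serialized to bytes with the standard `json` module and restored later:

```python
import json

# a tweet-like record represented as a plain Python dict (illustrative values)
record = {'id': 1, 'user_name': 'someone', 'text': 'Hello, Spark!'}

payload = json.dumps(record).encode('utf-8')    # Python object -> bytes
restored = json.loads(payload.decode('utf-8'))  # bytes -> Python object
assert restored == record
```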
In order to serialize the extracted data, two IO methods have been written. The first saves the data in CSV format, which is lightweight, human-readable, and easy to use, with delimited text columns and an inherent tabular schema. The second uses the JSON Python library; JSON is one of the most popular data formats for Internet-based applications, virtually all APIs deal with it, and it is relatively lightweight and human-readable compared to XML. As opposed to the CSV format, where all records follow exactly the same tabular structure, JSON records can vary in their structure, and for this reason are considered semi-structured.
JSON IO method, taken from https://resources.oreilly.com/examples/9781784399696/-/blob/master/B03986_Code/B03986_03_code/Spark4python_Code_Chapter03.py
CSV IO method, taken from https://resources.oreilly.com/examples/9781784399696/-/blob/master/B03986_Code/B03986_03_code/Spark4python_Code_Chapter03.py
Let’s go through the code and explain a little bit about what it is doing:
The `save` method of both classes uses a Python named tuple and the header fields of the file (CSV or JSON) in order to impart a schema while persisting the rows of the file. If the file already exists, it will be appended to, not overwritten; otherwise, it will be created.

The `load` method of each class also uses a Python named tuple and the header fields of the file in order to retrieve the data with a consistent schema. The load method is a memory-efficient generator, so it avoids loading a huge file into memory by using `yield` instead of `return`.
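The linked gists contain the actual classes; as a minimal sketch of the pattern they follow (class, path, and field names here are illustrative, not the book's exact code), a JSON-lines variant might look like this:

```python
import json
from collections import namedtuple

# Minimal sketch of the JSON serialization pattern described above.
# Class, file, and field names are illustrative, not the book's exact code.
class IoJson(object):
    def __init__(self, filepath, filename, filesuffix='json'):
        self.fullpath = f'{filepath}/{filename}.{filesuffix}'

    def save(self, rows, fields):
        # impart a schema to each row via a named tuple, then append
        # one JSON record per line (the file is created if it is missing)
        Record = namedtuple('Record', fields)
        with open(self.fullpath, 'a') as f:
            for row in rows:
                f.write(json.dumps(Record(*row)._asdict()) + '\n')

    def load(self, fields):
        # memory-efficient generator: yield named tuples one line at a time
        # instead of loading the whole file into memory
        Record = namedtuple('Record', fields)
        with open(self.fullpath) as f:
            for line in f:
                yield Record(**json.loads(line))
```

For the tweets parsed earlier, `fields` could be something like `('id', 'created_at', 'user_id', 'user_name', 'tweet_text', 'url')`.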
Conclusion
Through this article we have learned about and put into practice the basic architecture of an application built on Python libraries that extracts, transforms, and stores data from APIs.
In the next article we will set up MongoDB in our Spark environment, learn how it works, and retrieve some of the data stored in Mongo or JSON. We will also attempt to give it some meaning (that is, generate some analytics by using SparkSQL), so stay tuned, and as always feel free to clone the GitHub repo used in this article!
Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.
Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.
If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.