Starting Data Analysis with PySpark

Image from https://www.opensourceforu.com/2018/12/apache-spark-and-developing-applications-using-spark-streaming/

Introduction

This is the second part of a series on setting up a data engineering environment. In case you missed the first part, be sure to check out the link below:

[Setting up WSL and PySpark Data Engineering environment](https://felixvidalgu.medium.com/setting-up-wsl-and-pyspark-data-engineering-environment-f717f1340115)

To summarize the previous tutorial: I installed and configured WSL2 on Windows 10. If you are a Linux user, you can skip the WSL setup and follow the remaining steps without any problem. Once Python and Spark are up and running in your environment, you can start manipulating and analyzing your data using Python + Spark, or even better, using PySpark! But before we get ahead of ourselves, what even is Spark?

What is Apache Spark?

Apache Spark is an open-source analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. It was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation.

Apache Spark is a polyglot framework: it is written in Scala and Java, with APIs for Python and R. It has three main interfaces:

  • Spark — Default interface for Scala and Java
  • PySpark — Python interface for Spark
  • SparklyR — R interface for Spark

Spark libraries

Spark includes some powerful libraries, such as:

  • SparkSQL: This provides the SQL-like ability to interrogate structured data and interactively explore large datasets.
  • SparkMLLIB: This provides major algorithms and a pipeline framework for machine learning.
  • Spark Streaming: This is for near-realtime analysis of data using micro-batches and sliding windows on incoming streams of data.
  • Spark GraphX: This is for graph processing and computation on complex connected entities and relationships.

A key concept in Spark is the RDD (Resilient Distributed Dataset). We’ll break this down into three main keywords:

  • Dataset: simply a collection of elements.
  • Distributed: the dataset is partitioned across the nodes of the Spark cluster.
  • Resilient: the dataset, or part of it, can be lost without major harm to the computation in progress, because Spark re-computes it from the lineage of operations it keeps in memory. That lineage is recorded as a DAG (short for Directed Acyclic Graph) of operations.

Basically, Spark keeps a snapshot of the RDD's state in its in-memory cache. If one of the computing machines crashes during an operation, Spark rebuilds the lost RDDs from the cached data and the DAG of operations. In other words, RDDs recover from node failure.

There are two types of operation on RDDs:

  • Transformations: A transformation takes an existing RDD and returns a new, transformed RDD. RDDs are immutable by default, so each transformation creates a new RDD. Transformations are lazily evaluated, meaning they are executed only when an action occurs; in the case of failure, the data lineage of transformations is used to rebuild the original RDD.
  • Actions: An action on an RDD triggers a Spark job and yields a value. An action causes Spark to execute the (lazy) transformation operations that are required to compute the RDD returned by the action. The action results in a DAG of operations, which is compiled into stages, and each stage is executed as a series of tasks; a task is the fundamental unit of work (see the short sketch after this list).
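
To make the lazy-evaluation point concrete, here is a minimal sketch, assuming an already-created SparkContext named sc (we create one later in this tutorial). The map transformation only records an entry in the lineage; nothing runs until the reduce action is called.

rdd = sc.parallelize(range(10))             # build an RDD from a Python range
squared = rdd.map(lambda x: x * x)          # transformation: recorded in the lineage, nothing executes yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers a Spark job and returns 285
print(total)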

Without further ado, and after that quick theoretical recap on Spark, let's start juggling some data by putting PySpark into action!


Juggling data with Spark

A basic first structure where we can begin our analysis is a DataFrame. If you’ve spent any time with pandas or R programming, these structures will be familiar to you.

If you followed the first tutorial ([Setting up WSL and PySpark](https://felixvidalgu.medium.com/setting-up-wsl-and-pyspark-data-engineering-environment-f717f1340115)), you will already have an environment to start with. Additionally, it is strongly recommended to work inside a conda environment. To create one, follow these simple steps:

conda create -n pyspark
conda activate pyspark
conda install -c conda-forge pyspark
conda install -c conda-forge jupyter notebook

The first command creates a conda environment named pyspark in your Anaconda installation, the second activates it, and the remaining commands install PySpark and Jupyter Notebook.

You should see the following screen during the installation, which lists all the modules and dependencies installed in your PySpark environment:

Image by Author

Once everything is installed, just run jupyter notebook in your terminal or cmd and the notebook server will open in your browser:

Image from Author

The first thing we do in Spark is start a SparkSession; after that, we get a SparkContext, which will help us manage Spark jobs and RDDs:

Image from Author
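
The code in that screenshot is not reproduced here, but a minimal sketch of that first notebook cell might look like this (the application name is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imdb-analysis").getOrCreate()  # entry point for DataFrames and SQL
sc = spark.sparkContext  # the SparkContext, used to manage jobs and work with RDDs
sc                       # displaying sc in a notebook shows the Spark UI link, version, and master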

Clicking the Spark UI link that appears when you evaluate the sc (SparkContext) variable opens the Spark application UI, where you can inspect Jobs, Executors, and other important properties of your environment:

Image from Author

Here, I will analyze the IMDB dataset that I have stored in my GitHub datasets folder. You can access this, or the Jupyter notebook of this project, in my GitHub.

Instead of reading the CSV directly with Spark, we will first load it into a pandas DataFrame and then convert that pandas df into a Spark DataFrame:

Image by Author
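
A rough sketch of that step; the raw-CSV URL below is only a placeholder for the actual file in my datasets folder:

import pandas as pd

csv_url = "https://raw.githubusercontent.com/<user>/<repo>/main/datasets/imdb.csv"  # placeholder path
pd_df = pd.read_csv(csv_url)             # load the CSV into a pandas DataFrame
spark_df = spark.createDataFrame(pd_df)  # convert the pandas DataFrame into a Spark DataFrame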

We can display the pandas DataFrame stored in the pd_df object and notice that the Spark DataFrame's output is pretty ugly in comparison:

Image from Author

But beauty is not directly correlated with efficiency in Spark terms. In contrast with pandas DataFrames, PySpark DataFrames are lazily evaluated and are implemented on top of RDDs, which basically means that Spark does not transform data immediately, but plans how to compute later, only once an action is called.

When you start using its functions, you will notice some similarities with pandas. For example, .show(5) produces the same result as .head(5) in pandas:

Image from Author
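
For example, something along these lines:

pd_df.head(5)     # pandas: first five rows, rendered as a rich table in the notebook
spark_df.show(5)  # PySpark: first five rows, printed as plain ASCII-formatted output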

Or, the .columns attribute:

Image from Author
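
This attribute works the same way in both libraries:

pd_df.columns     # pandas Index of column names
spark_df.columns  # plain Python list of column names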

There are other functions, methods, and attributes that are native to Spark, like the one below. It is useful when we need the rows to be displayed vertically because they are too long to show horizontally.

A vertically arranged DataFrame can also be achieved in pandas by writing a UDF (User Defined Function); there is one in my repo that generates a similar output:

Image from Author
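
The built-in way to get that vertical layout in Spark is the vertical flag of show(); a short sketch (the row count is arbitrary):

spark_df.show(n=2, truncate=False, vertical=True)  # prints each row as a block of column-name / value pairs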

You can also display the spark_df schema by using the .printSchema() method, which is essentially similar to the pandas .info() function. It shows the data type of each of the columns in the DataFrame:

Image from Author
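
For instance:

spark_df.printSchema()  # column names and inferred data types
pd_df.info()            # pandas equivalent: dtypes, non-null counts, and memory usage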

Using pandas API for PySpark

Pandas UDFs are user-defined functions that Spark executes using Apache Arrow to transfer data and pandas to operate on it, which allows vectorized operations. A Pandas UDF is defined by using pandas_udf() as a decorator or to wrap the function, and no additional configuration is required.

In the following example, I have imported the required package to write a Pandas UDF. The function takes a string column from spark_df as a pd.Series and then returns that single column:

Image from Author
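
A sketch of such a pass-through Pandas UDF; the column name Title is only an assumption about the IMDB file:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def pass_through(s: pd.Series) -> pd.Series:
    # receives a chunk of the column as a pandas Series and returns it unchanged
    return s

spark_df.select(pass_through(spark_df["Title"])).show(5)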

Grouping data

PySpark DataFrames also provide a way of handling grouped data using the common split-apply-combine strategy: it groups the data by a certain condition, applies a function to each group, and then combines the results back into a DataFrame.

The following example takes one of the columns of the spark_df (Genre, across its 1,000 records) and calculates the average Rating for each Genre. As you may notice, the syntax is pretty similar to the pandas groupby function:

Image from Author
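
A sketch of that aggregation, assuming the columns are named Genre and Rating as described above:

from pyspark.sql import functions as F

spark_df.groupBy("Genre").agg(F.avg("Rating").alias("avg_rating")).show(5)
# pandas equivalent, for comparison:
# pd_df.groupby("Genre")["Rating"].mean().head(5)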

Working with Spark SQL

DataFrames and Spark SQL share the same execution engine, so they can be used interchangeably and seamlessly. Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries into code transformations, resulting in a very powerful tool.

In the following example, I first create a temporary view of my spark_df in Spark SQL by passing a view name to the createOrReplaceTempView function. That gives us a temporary SQL table called mySQLtable, and we can start running ANSI SQL queries over it. Pretty cool, huh?

Image from Author
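
A sketch of those two steps, reusing the Genre and Rating columns from the grouping example:

spark_df.createOrReplaceTempView("mySQLtable")  # register the DataFrame as a temporary SQL view

spark.sql("""
    SELECT Genre, AVG(Rating) AS avg_rating
    FROM mySQLtable
    GROUP BY Genre
    ORDER BY avg_rating DESC
""").show(5)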

Final thoughts

In conclusion, you can think of Spark as a distributed computing engine for when your data is so large that it doesn't fit into memory on one machine. By design, pandas runs its operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application with large datasets, PySpark is often the best fit because it can process those workloads many times faster than pandas.

PySpark is very efficient for processing large datasets. Additionally, it is easy to convert a Spark DataFrame to a pandas one after preprocessing and data exploration, and then continue training machine learning models using scikit-learn. Now that's really the best of both worlds!

That being said, if the computation is complex enough that it could benefit from a lot of parallelization, you could also see an efficiency boost by using PySpark. Honestly, I still feel more comfortable with pandas, but in the near future I might end up using PySpark's APIs instead, because they will boost the efficiency of my day-to-day work with data.

Sources

[1]https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html

[2]https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
