Setting Up a WSL and PySpark Data Engineering Environment

A full-code tutorial on setting up your Linux environment

Image by author

Motivation

This article kicks off a series of tutorials on how to set up and use several data engineering tools in a Linux environment (or WSL on Windows 10), a setup commonly used by developers alongside macOS and native Ubuntu.

Linux-based environments are especially convenient for data engineers because they make it easy to install tools like Apache Spark and Hadoop, both of which depend on the Java JDK, as well as Airflow.

In order to follow this tutorial, it is recommended to have at least a basic understanding of Linux file systems and the most commonly used commands. If you need a guide to help get you started, try this one below:

[The Linux Commands Handbook, by Flavio Copes](https://www.dbooks.org/the-linux-commands-handbook-5613994259/pdf/)

Setting up WSL dependencies

The first thing that you will need is to set up your WSL environment. This is a very straightforward process, but for assistance, just follow this guide:

[Install Ubuntu on WSL2 on Windows 10 | Ubuntu](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)

Next, you will need to install Java JDK. This is a very easy three-step process. Once you have completely set up your WSL, just run these commands in Ubuntu:

sudo apt update
sudo apt install openjdk-11-jdk
java --version
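
Some tools also expect the JAVA_HOME environment variable to be set. Here is a minimal sketch you could append to your ~/.bashrc, assuming the default install location used by the openjdk-11-jdk package on Ubuntu (you can confirm the exact path on your machine with readlink -f $(which java)):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH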

If you feel like you need a guide for this step, you can follow this one:

[Install Java on Windows 10 Linux subsystem](https://medium.com/@pierre.viara/install-java-on-windows-10-linux-subsystem-875f1f286ee8)

From there, after a couple of recommended restarts, you can customize your environment according to your needs. You will see the Ubuntu distro among your installed applications in Windows:

Image by author

Customization is one of the features developers like most about Linux, and with Ubuntu there is a whole world of possibilities here. First of all, you will be able to install Windows Terminal on your system, which lets you keep multiple command prompts open in the same environment. You can have Windows PowerShell, the Windows Command Prompt (cmd), and the Ubuntu terminal all in the same window. Pretty cool, right?

Image by author

You can customize the appearance of your terminals in the configuration file of the Windows Terminal. Here, you can also change the names, fonts, and more:

Image by author

In your Ubuntu terminal, you can create your own aliases for your most used commands (like ls -ltr or cd ..) in the .bashrc config file by running:

nano ~/.bashrc
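
For example, a few aliases you could append to the end of ~/.bashrc (the names here are purely illustrative, so pick whatever you find memorable):

# detailed listing, sorted by modification time
alias ll='ls -ltr'
# jump up one directory
alias ..='cd ..'
# reload the shell config after editing it
alias reload='source ~/.bashrc'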

Image by author

In the image above, my WSL is configured with Oh My Zsh, which produces the nice blue highlighting of the username in my terminal. You can follow this guide to install it:

[Oh My Zsh - a delightful & open source framework for Zsh](https://ohmyz.sh/)

Installing Anaconda and Spark in WSL

Installing Anaconda in your WSL Ubuntu distribution is just a matter of going to the repo of archived Anaconda distributions and choosing a Linux release. This tutorial uses version 5.3.1, so running the following command will download the corresponding installation script:

wget https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh

Image by author

Then just execute that .sh file. The installer's remaining prompts are pretty straightforward:

bash Anaconda3-5.3.1-Linux-x86_64.sh
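
Once the installer finishes, open a new terminal (or reload your shell config) and confirm that conda is available; a quick check, assuming you let the installer initialize Anaconda in your ~/.bashrc:

source ~/.bashrc
conda --version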

In order to install Spark, start by running the following command from your home directory:

wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

Image by author

Next, create a directory to hold the installation files you just downloaded. The following commands will create the folder and extract the downloaded archive into it:

mkdir ~/hadoop/spark-3.3.0
tar -xvzf spark-3.3.0-bin-hadoop3.tgz -C ~/hadoop/spark-3.3.0 --strip 1
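
If the extraction worked, the new folder should contain Spark's usual subdirectories (bin, conf, jars, python, and so on), which you can confirm with:

ls ~/hadoop/spark-3.3.0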

Image by author

Once all files are unzipped in the hadoop/spark-3.3.0 folder, add the following lines to the end of your .bashrc or .zshrc:

export SPARK_HOME=~/hadoop/spark-3.3.0
export PATH=$SPARK_HOME/bin:$PATH

This tells Ubuntu where to locate the Spark binaries on your system. After saving the file, apply the changes with:

source ~/.bashrc

or

source ~/.zshrc
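
To double-check that the variables took effect, a quick sanity check (the exact output will depend on your setup):

echo $SPARK_HOME
which spark-shell
spark-submit --version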

Next, run the following command to create a Spark default config file:

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Then open the file and add this line to the end of spark-defaults.conf:

nano $SPARK_HOME/conf/spark-defaults.conf

spark.driver.host localhost

Image by author

Then run spark-shell and voila!

Image by author
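
If you want an end-to-end check beyond the interactive shell, you can also submit the SparkPi example that ships with the distribution. A minimal sketch, assuming the examples jar name from the default Scala 2.12 build of spark-3.3.0-bin-hadoop3:

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.0.jar 10

The final argument is the number of partitions used for the estimation; the job should finish in a few seconds and print an approximation of pi among the log output.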

Stay tuned because in the next article I will teach you how to start setting up a data engineering project using PySpark in an Ubuntu environment!

Cheers!

Sources

[1] https://kontext.tech/article/1066/install-spark-330-on-linux-or-wsl
