A full-code tutorial on setting up your Linux environment
Image by author
Motivation
This article kicks off a series of tutorials on setting up and using several data engineering tools in a Linux environment (or WSL on Windows 10), a setup that, alongside macOS, is common among developers.
Linux-based environments are especially convenient for data engineers because they make it easy to install tools like Apache Spark, Hadoop, and Airflow; Spark and Hadoop in particular depend on the Java JDK.
In order to follow this tutorial, it is recommended to have at least a basic understanding of Linux file systems and the most commonly used commands. If you need a guide to help get you started, try this one below:
[The Linux Commands Handbook, by Flavio Copes](https://www.dbooks.org/the-linux-commands-handbook-5613994259/pdf/)
Setting up WSL dependencies
The first thing that you will need is to set up your WSL environment. This is a very straightforward process, but for assistance, just follow this guide:
[Install Ubuntu on WSL2 on Windows 10 | Ubuntu](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)
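On recent Windows 10 builds, the whole process can usually be reduced to a single command in an elevated PowerShell prompt (this assumes your build supports the `wsl --install` flow; older builds need the manual steps from the guide above):
wsl --install -d Ubuntu
After a reboot, Ubuntu finishes its setup and asks you to pick a username and password.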
Next, you will need to install Java JDK. This is a very easy three-step process. Once you have completely set up your WSL, just run these commands in Ubuntu:
sudo apt update    # refresh the package lists
sudo apt install openjdk-11-jdk    # install OpenJDK 11
java --version    # confirm the installation
If you feel like you need a guide for this step, you can follow this one:
[Install Java on Windows 10 Linux subsystem](https://medium.com/@pierre.viara/install-java-on-windows-10-linux-subsystem-875f1f286ee8)
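Spark and Hadoop also read the JAVA_HOME environment variable, so it is worth setting it now. The path below assumes Ubuntu's default install location for OpenJDK 11 on x86_64; you can verify yours with `readlink -f $(which java)`:
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc    # path assumes the default Ubuntu openjdk-11 location
source ~/.bashrc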
From there, after a couple of recommended restarts, you can customize your environment according to your needs. You will see the Ubuntu distro among your installed applications in Windows:
Image by author
Customization is one of the features developers like most about Linux, and with Ubuntu there is a whole world of possibilities for just this. First of all, you will be able to install Windows Terminal on your system, which allows you to have multiple command prompts open in the same environment. You can also have Windows PowerShell, the Windows Command Prompt (cmd), and the Ubuntu terminal all in the same window. Pretty cool, right?
Image by author
You can customize the appearance of your terminals in the configuration file of the Windows Terminal. Here, you can also change the names, fonts, and more:
Image by author
In your Ubuntu terminal, you can create your own aliases for your most-used commands (like `ls -ltr` or `cd ..`) in the `.bashrc` config file by running:
nano ~/.bashrc
Image by author
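For example, you could append a few lines like these to the end of the file (the alias names are just illustrative; pick whatever suits your workflow):
alias ll='ls -ltr'    # long listing, sorted by modification time
alias ..='cd ..'      # go up one directory
alias update='sudo apt update && sudo apt upgrade'    # one-shot system update
Then run `source ~/.bashrc` so the new aliases take effect in your current session.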
In the image above, my terminal shows the username with a nice blue highlight. That comes from Oh My Zsh, and you can follow this guide to install it:
[Oh My Zsh - a delightful & open source framework for Zsh](https://ohmyz.sh/)
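In short, the installation boils down to two commands, the second being the installer from the official Oh My Zsh docs (it will offer to make zsh your default shell):
sudo apt install zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Keep in mind that once you switch to zsh, your aliases and exports belong in `.zshrc` instead of `.bashrc`.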
Installing Anaconda and Spark in WSL
Installing Anaconda in your WSL Ubuntu distribution is just a matter of going to the repo of archived Anaconda distributions and choosing the latest Linux release. At the time of writing this article, that was version 5.3.1, so running the following command will download the corresponding installation script:
wget https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh
Image by author
Then just execute that `.sh` file:
bash Anaconda3-5.3.1-Linux-x86_64.sh
The next steps are pretty straightforward, but if you need a guide, you can follow this one.
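Once the installer finishes, reopen your terminal (or run `source ~/.bashrc`) and, assuming you let the installer initialize your shell, confirm that everything went well:
conda --version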
In order to install Spark, start by running the following command from your home directory:
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
Image by author
Next, create a directory where you will store the installation files you just downloaded. The following commands will extract the archive and place its contents in that folder:
mkdir -p ~/hadoop/spark-3.3.0
tar -xvzf spark-3.3.0-bin-hadoop3.tgz -C ~/hadoop/spark-3.3.0 --strip 1
Image by author
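If the extraction worked, listing the folder should show Spark's usual layout (bin, conf, jars, and so on):
ls ~/hadoop/spark-3.3.0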
Once all files are extracted into the `~/hadoop/spark-3.3.0` folder, add the following lines to the end of your `.bashrc` or `.zshrc`:
export SPARK_HOME=~/hadoop/spark-3.3.0
export PATH=$SPARK_HOME/bin:$PATH
This will tell Ubuntu where to locate Spark's binary files on your system. After saving the changes to your file, apply them with:
source ~/.bashrc
or
source ~/.zshrc
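You can verify that the variable is set and that the Spark binaries are on your PATH:
echo $SPARK_HOME
which spark-shell    # should point inside ~/hadoop/spark-3.3.0/bin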
Next, run the following command to create a Spark default config file:
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
Then open the file and add this line to the end of `spark-defaults.conf` (pointing the driver at localhost avoids hostname-resolution issues that can occur under WSL):
nano $SPARK_HOME/conf/spark-defaults.conf
spark.driver.host localhost
Image by author
Then run `spark-shell` and voilà!
Image by author
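As an extra sanity check, you can run one of the examples that ship with the distribution. This one estimates Pi and should print a line like "Pi is roughly 3.14" near the end:
$SPARK_HOME/bin/run-example SparkPi 10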
Stay tuned because in the next article I will teach you how to start setting up a data engineering project using PySpark in an Ubuntu environment!
Cheers!
Sources
[1] https://kontext.tech/article/1066/install-spark-330-on-linux-or-wsl