Learning how to obtain and manipulate data right in the command line with Unix Power Tools
All images by author
In this tutorial, we will look at efficient and productive ways of using command-line tools, and at how data scientists, data engineers, and other data practitioners can leverage the power of the Linux terminal for data wrangling.
Nowadays, people working in data positions can choose from an overwhelming collection of exciting technologies and programming languages for data wrangling, manipulation, and interpretation, such as Python, R, Julia, and Apache Spark. With so many options, the following questions might come to mind:
Why should you still care about an old technology developed back in the seventies for doing these data-intensive tasks?
What does the command line have to offer that these other technologies and programming languages do not?
Data analytics is an exciting field that often requires cutting-edge technology to “refine the oil” properly. Unfortunately, many people and companies believe they need the newest technology to tackle tasks like data wrangling and data manipulation, even though a lot can be accomplished with the command line instead. Not only that, but the command line takes minimal setup and can sometimes get the job done far more efficiently.
You only need to dive a little into Linux commands to realize that the command line is not just for installing software, configuring systems, and searching files. There are amazing and useful tools, such as curl, head, and sort, that can help you get, organize, and manipulate data just as you would in Pandas or Spark. These are examples of command-line tools that take data as input, do something to it, and print the result. Ubuntu comes with quite a few of them pre-installed. In this article, we will also put into action a command-line tool that is a Python package you can install like any other through pip: csvkit.
Obtaining Data
Getting data and then wrangling it is very time-consuming, so why not reduce it to its simplest expression by doing it in the terminal? We are going to use curl to download fake data from the Random User Generator API, and then we will use a very nice command-line tool called bat, a cat clone that gives you syntax highlighting and some other cool features. We will also use jq, a powerful tool for working with JSON.
The command to get the random data is as follows:
curl -s "https://randomuser.me/api/1.2/?results=5&seed=foobar" > users.json
As you may have noticed, curl downloads the raw JSON returned by the API. No interpretation is being done, and the response contents are immediately printed on standard output. In this case, we are using the redirection operator (>) to save the response in JSON format.
A very nice alternative to man pages is tldr, which gives you concise context on a specific Linux command along with examples of its possible usage.
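For instance, assuming tldr is installed on your system, you could ask it about curl itself:
# Show short, example-driven help for curl
tldr curl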
Next, we are going to inspect the JSON file we previously generated by using the bat command. Here's the output:
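As a minimal sketch, the command that produces that output would be:
# Pretty-print users.json with syntax highlighting and line numbers
bat users.json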
Instead of redirecting the API response to be saved to the users.json file, we can add a pipe to jq and visualize the JSON file in a nicer output:
curl -s "https://randomuser.me/api/1.2/?results=5&seed=foobar" | jq .
Instead of repeating those previous steps to fetch the dummy data again, we can use the redirection operator to feed our extracted JSON file into jq and then get the emails in that JSON:
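A minimal sketch of that extraction, assuming the API's usual results/email structure, could look like this:
# Redirect the saved file into jq and pull out every email address
< users.json jq '.results[].email'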
Data Manipulation
For data manipulation purposes, the dataset we have parsed and converted to CSV format seems a bit small, so we are going to download another one from an external repository using curl:
Image from Author
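The exact source appears in the screenshot above; as a rough sketch, and with a purely hypothetical URL and file name, the download step looks something like this:
# Quietly download the dataset and save it under a local file name (placeholder URL)
curl -s -o crime.xlsx "https://example.com/path/to/crime.xlsx"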
Then inspect the first lines of our CSV file:
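A minimal sketch of that inspection (the crime.csv file name is an assumption carried over from the later steps):
# Print the header row plus the first few records
head -n 5 crime.csv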
To generate some statistics over this CSV file, we will be using a very handy command-line tool called csvkit. It can be installed like any other Python package from your terminal by running the following:
sudo pip install csvkit
The first useful feature of csvkit is in2csv, which allows you to convert files in Excel format to CSV format right in the command line:
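A minimal sketch of that conversion, with assumed file names:
# Convert the downloaded Excel file to CSV and save the result
in2csv crime.xlsx > crime.csv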
In the screenshot above, we are converting an Excel file into CSV and then saving the output to a CSV file. Now that we have some data to play with, we might be tempted to open it in Excel or Google Docs, but wouldn't it be nice if we could take a look right in the command line? For this purpose, we have csvlook. With it, we can pipe the output to the less pager and get a nice view of the file we previously downloaded:
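Under the same file-name assumption, that could look like:
# Render the CSV as an aligned table and page through it without wrapping lines
csvlook crime.csv | less -S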
We can also combine the csvcut and csvstat tools to get some handy summary statistics for our CSV dataset:
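A minimal sketch, where the selected column positions are arbitrary assumptions:
# Select a few columns by position and compute summary statistics on them
csvcut -c 2,3,4 crime.csv | csvstat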
What if, in this dataset of 1,261 rows, we needed to see only the records corresponding to a specific city? For that, we can use csvkit's csvgrep tool.
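As a sketch, assuming the dataset has a city column and using an arbitrary value:
# Keep only the rows whose city column matches Boston, and display them nicely
csvgrep -c city -m "Boston" crime.csv | csvlook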
Querying CSV Data
csvkit has a command-line tool called sql2csv. It can work with many databases through a common interface, including MySQL, Oracle, PostgreSQL, SQLite, Microsoft SQL Server, and Sybase. As its name suggests, sql2csv outputs query results in CSV format.
We will download a sample SQLite database using the wget command-line tool and then unzip its contents in the same directory. After that, we will have a .db file to analyze with sql2csv:
wget sqlitetutorial.net/wp-content/uploads/2018/..
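After downloading, the archive is extracted in place; the archive name below is an assumption based on the sample database hosted there:
# Extract the archive to obtain the sample .db file
unzip chinook.zip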
This database contains 11 tables of data for a music store, which we will explore by retrieving SQL query results directly in our terminal, without needing to connect a SQL client. Pretty cool, isn't it?
Image from Big Data and Business Intelligence, Packt
We need to pass two parameters to sql2csv, --db and --query, just like we would be doing in an SQL editor. As we noticed in the ERD, there are the artists and albums tables:
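A minimal sketch of a first query against the artists table, assuming the unzipped database file is named chinook.db:
# List the first few artists straight from the SQLite file, printed as CSV
sql2csv --db sqlite:///chinook.db --query "SELECT * FROM artists LIMIT 5"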
We can join both on ArtistId, the primary key of artists, and get a table that lists each Artist together with its corresponding Titles:
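A sketch of that join; the column names follow the usual schema of this sample database and are assumptions here:
# Pair each artist name with its album titles
sql2csv --db sqlite:///chinook.db --query "SELECT artists.Name, albums.Title FROM albums JOIN artists ON albums.ArtistId = artists.ArtistId LIMIT 10"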
The csvsql tool allows you to create a database table from a CSV file by running the following line:
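A minimal sketch of that step; note that this only prints the generated SQL and does not touch a database yet:
# Infer column types from crime.csv and print a CREATE TABLE statement for SQLite
csvsql -i sqlite crime.csv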
The above syntax will only generate a CREATE TABLE statement for our CSV file; what will really create the database table and load the data is the following:
csvsql --db sqlite:///crime.db --insert crime.csv
Final Thoughts
In this tutorial, we have learned about multiple useful command-line tools that data professionals can use in their day-to-day tasks. They are powerful and simple to install, so mastering the Linux command line is a plus in every data scientist's and data engineer's career path.
In the following link, I leave you a terrific resource to learn much more about command-line tools:
[GitHub - jeroenjanssens/data-science-at-the-command-line: Data Science at the Command Line](https://github.com/jeroenjanssens/data-science-at-the-command-line)