Understanding Awk for Text Processing

A practical guide on pattern scanning with a text-processing language

Introduction

Awk is a powerful data processing programming language included by default in every *nix system.

Awk was created in 1977 following the success of other line-processing tools like sed and grep. AWK stands for the surnames of its authors, Aho, Weinberger, and Kernighan, and it started as an experiment into whether text-processing tools could be extended to deal with numbers. While grep lets you search for lines, and sed lets you do replacements in lines, awk was designed to let you do calculations on lines, and additionally to search files for lines (or other units of text) that contain certain patterns.
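
To make that comparison concrete: when no action is given, awk's default action is to print the matching line, so the following two commands produce the same output (somefile here stands for any text file):

grep 'pattern' somefile
awk '/pattern/' somefile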

In this tutorial, we will dive into how to use this powerful tool for data processing on the command line. But, you may ask, “with data tools like Pandas and Jupyter Notebooks, which provide users with a more interactive and user-friendly experience, why should I be using command-line tools for these purposes?”

The reason could be that using the command line saves time, or that you’re particularly passionate about it. But the most important reason is this: imagine you’re using a Linux server with no GUI. What could you do to quickly manipulate and analyze some large CSV files on that server? The answer is three letters: AWK. Just look at the simple idea behind this powerful language:

The biggest reason to learn Awk is that it’s on pretty much every single Linux distribution. You might not have Perl or Python. You will have Awk. Only the most minimal of minimal Linux systems will exclude it.

Understanding Basic Operations

Programs in awk are different from programs in most other languages, because awk programs are data-driven: you describe the data you want to work with and then what to do when you find it. Most other languages are procedural (or at least support procedural syntax), meaning you have to describe, in detail, every successive step a given program should take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to read and write.

When using awk, you specify a program that tells awk what to do. This program consists of a series of rules (and, optionally, function definitions); each rule specifies a pattern to search for and an action to perform upon finding that pattern.

Awk is specified by the Portable Operating System Interface (POSIX) standard, which means it comes preinstalled on your MacBook or Linux server. Several versions of Awk have been released since its inception, so the first thing you will do is run the following command to determine which version you’re using:

awk --version

The best way to get documentation regarding Awk is from Linux man pages, by running the following from the command line:

man awk

If you keep scrolling through the man page (by pressing Enter or the space bar) you will find the section on AWK PROGRAM EXECUTION, which states:

An AWK program consists of a sequence of optional directives, pattern-action statements, and optional function definitions.

If the program is short, the easiest thing to do is to include the actions to be performed by awk in the command line itself:

awk 'pattern { action }' input-file

But when the program is long, it is usually more convenient to put the program in a file and run it with a command like this:

awk -f program-file input-file

Let’s see a few examples of both cases. First, we’ll get some data to play with. We will be using the Amazon Product Reviews dataset, which contains 142.8 million product reviews spanning May 1996 to July 2014. They are stored in a public S3 bucket, and the data is available for several categories.

For this tutorial I’ll be downloading a shorter version of the sports reviews (the complete file is over 2 GB). To do this, I’ll pipe curl’s output through gunzip and head in order to keep just the first 10,000 lines:

curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz \
  | gunzip -c \
  | head -n 10000 \
  > sportsreviews.tsv

The first command that we will use shows the seventh column of the dataset in the console:

awk '{ print $7 }' sportsreviews.tsv | head

Here we are extracting the first records of the seventh column of the dataset. But what if we want to see the first full rows of the dataset, like the head command in Pandas? For this, we pass the parameter $0, which holds the entire record, to the print statement of the command:

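Such a command, reusing the shape of the previous one, looks like this:

awk '{ print $0 }' sportsreviews.tsv | head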

With this we can get an idea of what data is inside the dataset and what its headers are. By default, Awk assumes that the fields in the dataset are whitespace-delimited, but we can change the field separator by passing the -F parameter. Our file is separated by tabs, so we will pass '\t' as the argument to -F, as in the example below.
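
For instance, here is the seventh column again, this time split strictly on tabs (so multi-word fields are no longer broken apart by the spaces inside them):

awk -F '\t' '{ print $7 }' sportsreviews.tsv | head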

Since we already know the headers of the dataset, we can make use of NF. NF is a built-in variable that holds the number of fields in the current record, but we can also use its value as an index: $NF is the last column and $(NF-2) is the third-to-last.

NR, another built-in variable, stands for number of records and holds the current line number. Here we print it alongside the last and third-to-last columns, with | as the output separator:

awk -F '\t' '{ print NR "|" $NF "|" $(NF-2) }' sportsreviews.tsv | head

If you’re used to GUI tools, this may seem a bit complicated. But in my opinion, command-line tools can be applied to nearly every data-related task. If you don’t believe me, just look at this:

[Chapter 1 Introduction | Data Science at the Command Line, 2e
This book is about doing data science at the command line. My aim is to make you a more efficient and productive data…datascienceatthecommandline.com](https://datascienceatthecommandline.com/2e/chapter-1-introduction.html "datascienceatthecommandline.com/2e/chapter-..")

Understanding Regular Expression Syntax

Regular expressions allow you to write simple or complex descriptions of patterns hidden in texts. Understanding the individual elements of the regex is not so hard. But what makes writing regular expressions difficult (and also interesting!) is the complexity of its application and the variety of occurrences or contexts in which a pattern appears. This complexity is inherent in language itself in the same way that you can’t always understand an expression by looking up each word in the dictionary.

Basically, the process of writing a regular expression involves three simple steps (we’ll apply them to our dataset right after the list):

1. Knowing what it is you want to match and how it might appear in the text.

2. Writing a pattern to describe what you want to match.

3. Testing the pattern to see what it matches.
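
Applying these steps to our dataset: say we want to match reviews of jerseys, knowing the word may or may not be capitalized in the product title (step 1), so we describe both forms with a character class (step 2) and eyeball what it matches (step 3). The word “jersey” here is just an illustrative guess about what the data contains:

awk -F '\t' '/[Jj]ersey/ { print $6 }' sportsreviews.tsv | head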

If you need some guidance on basic regex, you can follow this article, also published on Medium:

[Regular Expressions Demystified: RegEx isn’t as hard as it looks
Are you one of those people who stays away from regular expressions because it looks like a foreign language? I was…medium.com](https://medium.com/free-code-camp/regular-expressions-demystified-regex-isnt-as-hard-as-it-looks-617b55cf787 "medium.com/free-code-camp/regular-expressio..")

So, the real power of Awk comes from pattern matching. As in other programming languages, you can give Awk a pattern to match against each line of the text. In this example, we print the product ID, title, and star rating of the first 15 reviews that match the pattern /Green Bay Packers/:

awk -F '\t' '/Green Bay Packers/ { print $4, $6, $8 }' sportsreviews.tsv | head --lines=15

But as you may notice, the output includes multiple products related to the Green Bay Packers pattern, so instead of a pattern we can select a specific product ID for our output:

awk -F '\t' '$4 == "B0088DS76K" { print $6,$8 }' sportsreviews.tsv | head --lines=15

BEGIN and END actions in Awk

According to the awk manual, BEGIN and END are not used to match any input, but rather to supply start-up and clean-up actions for the awk script.

In our example we will use NR to get the total number of records, and the END action to run once at the end of processing. But first, let’s see how accumulation works by applying it to the star rating, which is the 8th column of the dataset. The variable total is declared inside the script and sums the ratings line by line, so dividing it by NR prints a running average of that column:

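A command along those lines would be the following (note that the non-numeric header row simply counts as zero in awk arithmetic):

awk -F '\t' '{ total = total + $8; print total/NR }' sportsreviews.tsv | head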

Then, instead of printing on every line, we can use the END action to print the average of the 8th column for the whole dataset, which corresponds to the average star rating of the sports products:

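Again as a sketch, this time skipping the header row with NR > 1 so that it does not dilute the result:

awk -F '\t' 'NR > 1 { total = total + $8 } END { print total/(NR-1) }' sportsreviews.tsv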

I can also use BEGIN to run an action before Awk starts processing records:

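For example, a command like this prints a label before any records are read:

awk -F '\t' 'BEGIN { print "Average star rating:" } NR > 1 { total = total + $8 } END { print total/(NR-1) }' sportsreviews.tsv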

AWK scripting

Scripting is as crucial for Awk programming as for any other programming language. Awk programs are able to span multiple lines, but as they become bigger and more complex, you may consider putting them into a separate file.

But why use Awk scripting if we have some other powerful scripting languages such as Python?

The first reason is that if you are writing what is essentially a glorified loop over some input, with limited control flow, Awk can be more concise than Python.

Second, the transition from an Awk program to a Python one will rarely involve more than 100 lines of code, and it will be very straightforward.

Third, why not? Learning a new tool can be fun.

The syntax for invoking awk has two basic forms:

awk [-v var=value] [-Fre] [--] 'pattern { action }' var=value datafile(s)

awk [-v var=value] [-Fre] -f scriptfile [--] var=value datafile(s)

An awk command line consists of the command, the script, and lastly an input filename. Input is read from the file specified on the command line. If there is no input file (or if “-” is specified), then standard input is read.

The -F option sets the field separator. Then the script (consisting of pattern and action) on the command line should be specified and surrounded by single quotes. Alternatively, you can place the script in a separate file and specify the name of the script file on the command line with the -f option.

You need to take note of the absolute path of your Awk installation, so that you can run your script no matter which directory your terminal is in. You can run which awk to show this location; on most recent Linux distributions (like Ubuntu) it is /usr/bin/awk. You can then point your script at it with a shebang by starting the script with the line #!/usr/bin/awk -f. In my case, I created a folder in my home directory called scripts to hold it.

Once that’s done, you can run Awk by passing the -f parameter, the path to the script file, and then the input file to be processed by the awk script:

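As a sketch, a script reusing our average-rating logic could look like this (the filename average.awk is just an example):

#!/usr/bin/awk -f
# average.awk: print the average star rating (column 8) of the reviews TSV,
# skipping the header row.
BEGIN { FS = "\t" }
NR > 1 { total = total + $8 }
END { print "Average star rating:", total / (NR - 1) }

Then you would run it with:

awk -f scripts/average.awk sportsreviews.tsv

Or, after making it executable with chmod +x scripts/average.awk, simply as ./scripts/average.awk sportsreviews.tsv.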

Conclusion

As we have seen in this tutorial, Awk is a very powerful option for text processing, and it is available in every single Linux distro. I encourage you to give it a try and start from the basics, or if you want to go further, review this book:

[sed & awk
Preface Chapter 1: Power Tools for Editing Chapter 2: Understanding Basic Operations Chapter 3: Understanding Regular…docstore.mik.ua](https://docstore.mik.ua/orelly/unix/sedawk/index.htm "docstore.mik.ua/orelly/unix/sedawk/index.htm")

Cheers and happy coding!

Awk coding examples:

[examples / sed awk 2nd Edition · GitLab
O'Reilly Resourcesresources.oreilly.com](https://resources.oreilly.com/examples/9781565922259 "resources.oreilly.com/examples/9781565922259")
