Is there a Data Engineering Roadmap?

Image from Author

When we think of the field of Data Engineering, the first thing that comes to mind, for someone unfamiliar with the term or outside the data world, is the word Data. Data is the gasoline that drives the whole machinery of the IT industry, and its value depends on its type and on how it is classified and structured.

The second thing that comes to mind is Engineering. From a traditional point of view, engineering refers to the use of scientific principles to design and build machines and structures that connect physical spaces; it is also about the necessary tools and how to use them efficiently to accomplish such a task.

When we put these two concepts together we arrive at the definition of Data Engineering: the discipline in the “Data World” that drives us to design and build things such as data pipelines, which transform and transport data into the specific format required by Data Scientists, Business Analysts, Data Analysts, Data Managers, or other end users. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth.

Source: https://www.future-processing.com/blog/business-intelligence-developer-and-data-engineer-are-they-the-same-person-and-why-do-you-need-both-in-your-team/

Data Engineers are highly skilled technical people who create software solutions around data, usually using Java, Scala, or Python. They should possess extensive knowledge of creating data products and data pipelines, that is, of the way data is extracted, processed, and used to deliver a certain value.

Do you need an undergraduate degree?

From my perspective, experience, and Google research, the answer is that I don’t think so. There is no undergraduate degree (as far as I know) called “Bachelor’s in Data Engineering”, nor is one a prerequisite for getting into the field.

What you do need is some basic knowledge of Computer Science, a programming language, and SQL, plus a strong desire to keep learning and improving yourself and to keep incorporating new skills into your toolbox and tech stack.

For example, in my case, being an economist, I could be comfortable doing tasks related to the violet circle in the Venn diagram below:

Image from https://www.pngfind.com/

But that wasn’t enough for me; I was looking for something a bit more like this:

Image by Author

Beginning the path of mastering this tech stack could grant you these superpowers:

  • Assembling large and complex sets of data that meet business requirements.
  • Identifying, designing, and implementing internal process improvements that include re-designing the infrastructure for greater scalability, optimizing data delivery, and automating manual processes, among others.
  • Building data infrastructure for optimal extraction, transformation and loading of data from various data sources using AWS and SQL technologies.
  • Designing and implementing analytical tools to utilize the data pipeline, providing actionable insight into key business performance metrics including operational efficiency and customer acquisition.
  • Working with stakeholders, including the Executive, Product, Data, and Design teams, to support their data infrastructure needs and assist them with data-related technical issues.

Required Tech Stack

If you search for the technologies commonly used in Data Engineering, you will probably find the results a bit overwhelming. How could anybody master all of them? I think it is difficult, but not impossible, and it requires a lot of study, hands-on experience, and preparation.

The requirements vary from company to company, depending on the seniority being hired for, the organization’s level of data maturity, and its staffing levels.

The required technologies

Data Engineering Basic Skills

CS Fundamentals

Knowing how a computer works and how version control with Git (hosted on GitHub, Bitbucket, or GitLab) is used to track changes in source code will give you a good start in this field.

It is very helpful to know how to use, at least at a basic level, the command prompt, terminal, or whatever shell your OS provides. Knowledge of Linux shell scripting, cron jobs, and Vim will add even more value to your skill set, and a basic understanding of structured and unstructured data, APIs, and data structures and algorithms is a plus.
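
As a tiny illustration of that last point, a few lines of Python are enough to pull structured data from an HTTP API and walk the resulting data structures. The endpoint URL and field names below are purely hypothetical placeholders.

```python
# Minimal sketch: fetch structured (JSON) data from a hypothetical API
# and aggregate it with basic Python data structures.
import requests

response = requests.get("https://api.example.com/v1/orders", timeout=10)
response.raise_for_status()      # fail fast on HTTP errors
orders = response.json()         # JSON payload -> list of dicts

# Group order totals by customer using a plain dictionary
totals = {}
for order in orders:
    totals[order["customer_id"]] = totals.get(order["customer_id"], 0) + order["amount"]

print(totals)
```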

Programming languages

Python is one of the most commonly used programming languages in data-related fields because of its versatility and ease of learning. We can do data analysis and data science with Python, but on the engineering side there are other languages to keep in mind. They are not as easy to learn as Python, but they play an important role in this field: Java and Scala.

Airflow, which is also an important element of the Data Engineering toolbox, is written in Python as well. Most of the popular big data tools offer Python APIs; for example Spark, which is also crucial in DE, has PySpark, its dedicated Python API.

Scala is a statically typed language used in major technology companies. It runs on the JVM and supports Java libraries. Apache Kafka, Apache Spark, and Apache Flink are also key technologies in this ecosystem, and all of them are written in Java and Scala.
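
To give a taste of what working with these APIs looks like, here is a minimal PySpark sketch that reads a CSV file, aggregates it, and writes Parquet. The file paths and column names are placeholders, not a prescription.

```python
# Minimal PySpark sketch: CSV in, daily revenue aggregate out as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales_etl").getOrCreate()

# Read a raw extract (path and columns are illustrative)
sales = spark.read.csv("s3://my-bucket/raw/sales.csv", header=True, inferSchema=True)

# Aggregate revenue per day
daily_revenue = (
    sales
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Write the curated output in a columnar format
daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")
```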

Relational and non-relational databases

Data Engineers need to know how to design, create, and manage databases, whether relational or non-relational, how the data in them can be normalized, and how the Entity-Relationship model, normalization, and scaling patterns work. The most commonly used relational databases are Postgres, MySQL, MariaDB, and Amazon Aurora.

DynamoDB is a highly scalable NoSQL database provided by AWS, while PostgreSQL and MySQL are relational databases.

Cassandra is one of the most widely used highly available NoSQL databases.

When it comes to non-relational databases the offering is wider: there are Document, Wide-column, Graph, and Key-value databases.

Image from Author

Mastering just one or two of each type is generally enough to build a good understanding of this area of data engineering.
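
To ground the relational side, here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are purely illustrative, and the same normalization idea carries over to Postgres, MySQL, or any other relational engine.

```python
# Minimal sketch of a normalized schema: customers stored once,
# orders referencing them via a foreign key.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
)""")
cur.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
)""")

cur.execute("INSERT INTO customers VALUES (1, 'Ada')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)])

# Join the two tables back together for an aggregate view
cur.execute("""SELECT c.name, SUM(o.amount)
               FROM customers c JOIN orders o USING (customer_id)
               GROUP BY c.name""")
print(cur.fetchall())   # [('Ada', 65.0)]
conn.close()
```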

Unit Testing

Unit tests are automated tests created to ensure that the data coming through your data pipeline is what you expect it to be. Data unit tests are useful in Data Engineering for detecting when upstream data changes and for preventing bad data from entering your pipelines.

A data unit test is also a good way of documenting what the dataset should look like, which you can then show a teammate or stakeholder to get on the same page quickly.

Commonly used tools for data unit testing are dbt tests, pytest, or plain SQL checks, depending on your data pipeline.
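
As a simple illustration, here is a minimal pytest-style data unit test written with pandas; the loader function and column names are stand-ins for whatever your pipeline actually produces.

```python
# Minimal data unit tests: assert basic expectations about a pipeline's output.
import pandas as pd


def load_orders() -> pd.DataFrame:
    # Stand-in for the real extraction step of your pipeline.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.0]})


def test_order_ids_are_unique():
    df = load_orders()
    assert df["order_id"].is_unique


def test_amounts_are_positive_and_not_null():
    df = load_orders()
    assert df["amount"].notna().all()
    assert (df["amount"] > 0).all()
```

Running pytest against a file like this will fail the run as soon as the data stops matching those expectations.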

Data Engineering Advanced Skills

Data Warehouses

A data warehouse is basically a database that has all your company’s historical data and is used to run analytical queries.

In this category you can find a wide range of vendors, starting with the most widely used: Snowflake, Amazon Redshift, and Google BigQuery. Other warehouses and query engines in use are Presto, Apache Hive, Apache Impala, Azure Synapse, and ClickHouse.
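
As one example of how these warehouses are queried from code, here is a short sketch using Google BigQuery’s official Python client; the project, dataset, and table names are placeholders, and credentials are assumed to be configured in your environment.

```python
# Minimal sketch: run an analytical query against a warehouse (here BigQuery).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT order_date, SUM(amount) AS revenue
    FROM `my_project.analytics.orders`   -- placeholder table
    GROUP BY order_date
    ORDER BY order_date
"""

for row in client.query(sql).result():
    print(row.order_date, row.revenue)
```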

Object Storage

Object Storage is designed to store unstructured information as object files. Object files contain data, metadata, and unique identifiers, and they are highly customizable, durable, and rich in data. In Data Engineering you can use object storage to keep files such as static content, media, and data backups in the form of objects. All of this data is stored to serve as a data lake or staging layer, which in later stages will be used for batch processing.

In this category the main choices are three vendors: AWS S3, Azure Blob Storage, and Google Cloud Storage.
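
Using S3 as a staging layer from Python, for instance, might look roughly like this; the bucket and key names are placeholders.

```python
# Minimal sketch: land a raw extract in object storage and list what is staged.
import boto3

s3 = boto3.client("s3")

# Upload a local file into the "raw" zone of the lake
s3.upload_file("sales_2024_01_15.csv", "my-data-lake", "raw/sales/2024/01/15/sales.csv")

# List everything staged so far for that source
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/sales/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```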

Cluster Computing

Hadoop is the idea behind many of these technologies, and studying it will help you understand basic concepts like scalability, replication, fault tolerance, and partitioning.

HDFS is the storage part of Hadoop, which is a distributed file system.
MapReduce, a batch processing algorithm published in 2004, was subsequently implemented in various open-source data systems, including Hadoop, MongoDB, CouchDB, etc.

Although the importance of MapReduce is declining, it is worth understanding because it provides a clear picture of why and how batch processing is useful.
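
To make the idea concrete, here is a toy, single-machine word count written as explicit map, shuffle, and reduce phases in plain Python; real MapReduce runs the same phases distributed across a cluster.

```python
# Toy illustration of the MapReduce pattern: map -> shuffle -> reduce.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}
```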

Messaging

This category covers storing real-time data with fast access.
Azure Service Bus, AWS SNS & SQS, and Google Cloud Pub/Sub are among the most commonly used tools for ingestion with event streaming and messaging.
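
As a small sketch, sending and receiving messages with AWS SQS via boto3 looks roughly like this; the queue URL and message contents are placeholders.

```python
# Minimal sketch: publish an event to a queue and poll it from a consumer.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events"  # placeholder

# Producer side: publish an event
sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "order_created", "order_id": 42}')

# Consumer side: poll for events and delete them once processed
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5)
for message in response.get("Messages", []):
    print(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```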

Workflow scheduling

These tools are used to orchestrate data pipelines: Apache Airflow, Google Cloud Composer, Astronomer, and AWS Step Functions. All pipelines need to be scheduled and their dependencies need to be defined. Apache Airflow is a Python-based workflow engine that creates pipelines dynamically.

Google Cloud Composer and Astronomer are Airflow-based managed solutions, while AWS Step Functions is provided by AWS and does not depend on Airflow.
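
A minimal Airflow DAG, assuming Airflow 2.4 or later and with purely illustrative task logic, looks roughly like this: two Python tasks run daily, with the transform step depending on the extract step.

```python
# Minimal Airflow DAG sketch: extract -> transform, scheduled daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def transform():
    print("clean and aggregate the extracted data")


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task   # declare the dependency
```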

Monitoring data pipelines

Monitoring data pipelines is a challenge in every data-related process, because many of the important metrics are unique to them. For example, you need to understand the throughput of the pipeline, how long it takes data to flow through it, and whether your data pipeline is resource-constrained.

The Four Golden Signals are commonly used as metrics for data pipelines:

  1. Latency — The time it takes for your service to fulfill a request
  2. Traffic — How much demand is directed at your service
  3. Errors — The rate at which your service fails
  4. Saturation — A measure of how close to fully utilized the service’s resources are

Prometheus, Datadog, Sentry, and StatsD are the most commonly used monitoring tools.
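
As a sketch of how a pipeline step might expose these signals, here is an example using the prometheus_client library; the metric names, port, and batch logic are illustrative only.

```python
# Minimal sketch: expose latency, traffic, errors, and saturation metrics
# from a pipeline step for Prometheus to scrape.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

RECORDS_PROCESSED = Counter("pipeline_records_total", "Records processed (traffic)")
PROCESSING_ERRORS = Counter("pipeline_errors_total", "Failed records (errors)")
BATCH_LATENCY = Histogram("pipeline_batch_seconds", "Batch duration (latency)")
QUEUE_DEPTH = Gauge("pipeline_queue_depth", "Records waiting to be processed (saturation)")


def process_batch(batch):
    with BATCH_LATENCY.time():
        for record in batch:
            try:
                # ... real transformation logic would go here ...
                RECORDS_PROCESSED.inc()
            except Exception:
                PROCESSING_ERRORS.inc()


if __name__ == "__main__":
    start_http_server(8000)        # serves the /metrics endpoint
    while True:
        QUEUE_DEPTH.set(0)         # stand-in for a real backlog check
        process_batch(range(100))
        time.sleep(5)
```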

Data Processing

The following tools ingest raw data into Data Engineering platforms in streaming mode: Apache Kafka, AWS Kinesis, and Apache Storm.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. It provides much stronger durability guarantees than queues like RabbitMQ or Redis Queue.
AWS Kinesis is provided by AWS, while Cloud Pub/Sub is provided by Google Cloud.
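
A minimal producer/consumer sketch with the kafka-python client looks like this; the broker address and topic name are placeholders.

```python
# Minimal Kafka sketch: publish JSON events to a topic and read them back.
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer side: publish click events
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "page": "/pricing"})
producer.flush()

# Consumer side: read events from the beginning of the topic
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```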

The following tools are used for processing data in hybrid mode (batch and streaming): Apache Flink, Spark Streaming, Apache Beam, and Apache NiFi.

Apache Flink is a fast-growing solution for real-time data with minimal latency. Spark Streaming is also widely used, but it processes data in very small windows (micro-batches). Apache Spark is the most widely used engine for ETL operations; it has APIs in Python, Scala, Java, and R, and as an in-memory compute engine it has a lot of benefits over MapReduce.

These tools are used for batch processing: Apache Pig, Apache Arrow, and dbt.

Apache Arrow started as a columnar in-memory representation of data that every processing engine can use. Arrow by itself is not a storage or execution engine but serves as a language-agnostic standard for in-memory processing.
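
A tiny pyarrow sketch shows what that shared columnar representation looks like in practice; the data itself is made up.

```python
# Minimal Apache Arrow sketch: build a columnar in-memory table,
# hand it to pandas, and persist it as Parquet.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.0],
})

print(table.schema)                  # the columnar schema other engines can share
df = table.to_pandas()               # hand the same data to pandas
pq.write_table(table, "orders.parquet")
```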

dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.

The image below illustrates the three categories of data processing:

Image by Author

Infrastructure as Code

Wikipedia defines IaC as follows:

“Infrastructure as code is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.”

Basically this means you manage your IT infrastructure using configuration files.

In Data Engineering you can manage the infrastructure of your own projects using tools like Docker, Kubernetes, Docker Swarm, Terraform, and Pulumi.
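
As one hedged example, Pulumi lets you write those definition files in Python itself. The sketch below (using the pulumi and pulumi_aws packages) declares an S3 bucket for a staging layer; the resource names are chosen only for illustration and AWS credentials are assumed to be configured.

```python
# Minimal IaC sketch with Pulumi: declare a staging bucket in code.
import pulumi
import pulumi_aws as aws

staging_bucket = aws.s3.Bucket(
    "data-lake-staging",
    tags={"team": "data-engineering", "layer": "staging"},
)

pulumi.export("staging_bucket_name", staging_bucket.id)
```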

CI/CD

Continuous Integration and Continuous Delivery (CI/CD) is a software development approach in which all developers work together on a shared repository of code and changes are built, tested, and shipped automatically. GitHub Actions and Jenkins will help you with this for your DE projects.

Nice to have skills

Data Engineers often work very closely with Data Scientists, Data Analysts, Business Analysts and Machine Learning Engineers, so having a basic or even good understanding of the tools they use is considered a plus.

Machine Learning Fundamentals

Data Engineers need to know the basic terminology of ML: terms like supervised and unsupervised learning, classification, and regression must be familiar, as well as the tools used to work with them, such as Scikit-Learn, TensorFlow, Keras, and PyTorch.
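
A quick scikit-learn sketch is enough to ground that vocabulary: features and labels, a train/test split, fitting a classifier, predicting, and scoring.

```python
# Minimal supervised classification sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```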

Data Visualization Tools

We also need to handle some data visualization tools in order to build dashboards that monitor workflows and visualize the data involved in DE processes. BI and visualization tools like Tableau, Looker, Grafana, Jupyter Notebooks, and Power BI are commonly used by Data Engineers for this purpose.
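
For quick ad-hoc checks, for example inside a Jupyter Notebook, a few lines of pandas and matplotlib go a long way; the numbers below are made up to show a suspicious dip in daily row counts.

```python
# Minimal sketch: plot daily row counts to spot anomalies at a glance.
import matplotlib.pyplot as plt
import pandas as pd

daily_counts = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=7),
    "rows_loaded": [1200, 1180, 1250, 90, 1230, 1190, 1210],  # note the dip on day 4
})

daily_counts.plot(x="date", y="rows_loaded", kind="line", marker="o",
                  title="Rows loaded per day")
plt.tight_layout()
plt.show()
```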

Final words

I hope this article serves as a guide, giving you both a basic and an advanced view of the technologies you need to master if you want to start a learning path in this exciting field that is Data Engineering.

These are a few free online resources you can follow whether you are a beginner, intermediate, or advanced in DE. There is a lot of information and there are many learning resources, so you might feel overwhelmed, but doing a quick review of all of them and then taking just what you need to fill your knowledge gaps will be enough.

Data Engineering Cookbook

Awesome Data Engineering

Useful resources for Data Engineers

Data Engineering Projects

A very Long never ending Learning around Data Engineering & Machine Learning

Remember the three keywords (or phrases, in this case): learn by doing, practice, and happy coding!
