How OLAP cubes and workloads have evolved into the diverse ecosystem we call AWS
Introduction
When computer-based data analytics started picking up in the early ’90s, we began hearing about “data cubes” as a way of running OLAP workloads. But what the heck is OLAP, and what does it have to do with the cube in the picture above?
If you’ve built your career in data analytics anytime within the past three decades, chances are high that you have at least a basic knowledge of OLAP techniques. Although they have dominated business intelligence for the past fifty years, the recent rise of massive tabular databases and data warehouses brings even more possibilities for development, at low cost and with a wealth of cloud products and services.
Online Analytical Processing (or OLAP) started out as just a fancy way to describe a certain class of database applications. The term was originally coined by a legend of the database world, Edgar F. Codd, in his 1993 paper Providing OLAP to User-Analysts: An IT Mandate.
One year before Codd released his study, Arbor Software brought to market a software product called Essbase, which eventually ended up with Oracle (through its acquisition of Hyperion) and was renamed Oracle Essbase. This was the cutting-edge technology of the time in the MDBMS (multidimensional DBMS) field. If you feel like exploring the term further, see its Wikipedia page:
[Essbase - Wikipedia](https://en.wikipedia.org/wiki/Essbase)
So what is OLAP?
Maybe the easiest way to explore the term is by contrasting two scenarios of business application usage: in the first, an application records the day-to-day operations of the business one transaction at a time (think of a point-of-sale system writing a row for every purchase); in the second, an analyst aggregates that same data to answer questions like “how did sales trend last quarter?”
(Source: https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-cube/)
The first scenario is known as Online Transaction Processing aka OLTP. The second is known as Online Analytical Processing aka OLAP.
In simple terms, each can be described as follows:
- OLTP: using a database to run your business
- OLAP: using a database to understand your business
Basically, the idea behind OLAP cubes is to grab only the data you need out of your relational database and store it in a ‘cube’: an efficient in-memory structure built for fast, multidimensional data manipulation.
Business intelligence practitioners realized pretty early on that it was a bad idea to use SQL databases for large OLAP workloads, and the fact that computers weren’t particularly powerful back in 1995 made the situation even worse.
The OLAP cube structure
But why the cube? Here we’ll explore the structure and behavior of these data cubes:
OLAP cubes aggregate a metric (often called a fact, stored in a fact table) such as “total sales” or “number of purchases” over dimensions. In the picture below, we have some cities in the Locations dimension, which is aggregated over the Time dimension by quarters, and finally over the Product dimension.
Image by Author
If we would like to know how many laptops were sold between April and September in Canada, we would aggregate the dimensions {Locations[Toronto, Vancouver], Time[Q2, Q3], Products[Laptops]} (in an Essbase interpreter, for example), and it would tell us the answer we are looking for: 1,360 laptops.
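To make that lookup concrete, here is a minimal Python sketch using pandas. The fact table, its column names, and the per-row figures are invented for illustration; only their total, 1,360, matches the example above.

```python
import pandas as pd

# Hypothetical fact table; the per-row figures are invented so that the
# filtered total matches the example of 1,360 laptops.
facts = pd.DataFrame({
    "city":    ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "quarter": ["Q2", "Q2", "Q3", "Q3"],
    "product": ["Laptops"] * 4,
    "units":   [350, 310, 380, 320],
})

# {Locations[Toronto, Vancouver], Time[Q2, Q3], Products[Laptops]}
mask = (facts["city"].isin(["Toronto", "Vancouver"])
        & facts["quarter"].isin(["Q2", "Q3"])
        & (facts["product"] == "Laptops"))

print(facts.loc[mask, "units"].sum())  # -> 1360
```

A real OLAP engine would answer this from precomputed aggregates rather than scanning rows, but the shape of the question is the same.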
Data cube operations
These operations manipulate the cube to meet the needs of users, helping them select particular data for analysis. There are five main operations (a short code sketch follows the list):
- Roll-up: aggregates data along a dimension, moving to a coarser level of granularity. For the example above, if we would like to know yearly phone sales, a roll-up over the Time dimension combines the quarterly figures into that number.
- Drill-down: the reverse of the roll-up operation. It takes particular information and subdivides it further for finer-grained analysis, zooming into more detail. For example, if we would like to know the models of laptops sold in Q4 in New York, a drill-down splits Laptops into individual models and displays the required information.
- Slicing: selects only the desired data, filtering out the unnecessary portions. Suppose a user needs only one particular attribute in a given dimension: for example, city = “Vancouver” displays only data about Vancouver.
- Dicing: performs a multidimensional cut that, rather than fixing a single value on one dimension, selects a range of values across several dimensions. The result looks like a sub-cube carved out of the whole cube: for example, the annual sales of phones for all cities.
- Pivot: very important from a viewing perspective. It rotates the view of the data cube (for example, swapping rows and columns) while preserving the data present in the cube.
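Here is a small pandas sketch of these operations on an invented fact table; the data and column names are illustrative, and a real OLAP engine would serve most of these from precomputed aggregates.

```python
import pandas as pd

# Invented fact table with three dimensions and one measure.
sales = pd.DataFrame({
    "city":    ["Toronto", "Vancouver"] * 8,
    "quarter": ["Q1"] * 4 + ["Q2"] * 4 + ["Q3"] * 4 + ["Q4"] * 4,
    "product": ["Phones", "Phones", "Laptops", "Laptops"] * 4,
    "units":   [100, 80, 90, 70, 120, 90, 350, 310,
                150, 110, 380, 320, 130, 95, 110, 100],
})

# The "cube": units aggregated over city x quarter x product.
cube = sales.groupby(["city", "quarter", "product"])["units"].sum()

# Roll-up: aggregate away the Time dimension to get yearly totals.
# (Drill-down is simply the reverse: moving from `yearly` back to `cube`,
# or further down to per-model rows if the table carried that detail.)
yearly = cube.groupby(["city", "product"]).sum()

# Slice: fix one dimension to a single value (city = "Vancouver").
vancouver = cube.xs("Vancouver", level="city")

# Dice: keep a sub-cube (all cities, Q2-Q3, laptops only).
subcube = cube.loc[(slice(None), ["Q2", "Q3"], "Laptops")]

# Pivot: rotate the view (quarters as rows, city/product as columns)
# without changing the underlying data.
view = sales.pivot_table(index="quarter", columns=["city", "product"],
                         values="units", aggfunc="sum")
```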
In the pre-cloud days, businesses hoping to perform BI had to spend millions to purchase the data warehouse hardware and software from a limited number of companies. The makers of these systems could (and did) often charge a premium, and because of their high cost, businesses had to be very selective about what they chose to put into their warehouses.
In order to run queries as fast as possible, these data warehouse systems also needed to “pre-materialize” the data into OLAP cubes. This pre-materialization was slow and resource-intensive, but the situation has been vastly improved by what we now know as “the cloud.” Today we enjoy an extensive number of products and services that solve these problems while being relatively cheap, flexible, accessible, scalable, and reliable.
What is AWS Redshift?
Amazon Redshift, like many other AWS services, is a fully managed and scalable data warehouse service in the cloud. This means that low-level, time-consuming administrative tasks like OS patching, backups, replacing failed hardware, and software upgrades are handled automatically and transparently.
With Amazon Redshift, in contrast to the old data warehouse solutions, users simply provision a cluster, load it with their data, and start executing queries. It is as simple as that.
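As a rough sketch of that workflow using boto3, the AWS SDK for Python, the provisioning step might look like the following; the cluster identifier, node type, and credentials are placeholders, not recommendations.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a small cluster (identifiers and credentials are placeholders).
redshift.create_cluster(
    ClusterIdentifier="demo-cluster",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!",
    DBName="dev",
)

# Block until the cluster is ready to accept connections.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="demo-cluster")
```

From there, data is typically loaded with the COPY command from S3, and queries can be issued through any PostgreSQL-compatible driver pointed at the cluster endpoint.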
All data is continuously, incrementally, and automatically backed up in the highly durable and efficient S3 service. This means disaster recovery across regions can be accomplished with just a few clicks.
Image from http://novacontext.com/an-introduction-to-amazon-redshift/
In Amazon Redshift, the data columns are stored physically adjacent on disk, meaning that typical OLAP queries, which touch only a subset of columns, run particularly fast. Additionally, Amazon Redshift uses 10 Gigabit Ethernet interconnects and specialized EC2 instances (with between three and 24 spindles per node) to achieve high throughput and low latency. For even faster queries, Amazon Redshift allows customers to use column-level compression to both greatly reduce the amount of data that needs to be stored and reduce the amount of disk I/O.
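For illustration, column-level compression is declared per column in the table DDL. The sketch below is hypothetical: the table, columns, encodings, and connection details are all invented, and psycopg2 is used simply because Redshift speaks the PostgreSQL wire protocol.

```python
import psycopg2

# Placeholder connection details for an existing cluster endpoint.
conn = psycopg2.connect(
    host="demo-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="ChangeMe123!",
)

with conn, conn.cursor() as cur:
    # Each column gets its own compression encoding (choices are illustrative).
    cur.execute("""
        CREATE TABLE sales (
            sale_id BIGINT      ENCODE az64,
            city    VARCHAR(64) ENCODE lzo,
            quarter CHAR(2)     ENCODE bytedict,
            units   INTEGER     ENCODE delta
        );
    """)
```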
A really cool feature of Amazon Redshift is the ability to scale a provisioned cluster up and down in a matter of minutes, something that in a traditional DW system would typically involve weeks of preparation, days of downtime, and a hefty six- or seven-figure bill. The scaling process in Redshift is typically transparent to the end user: when a resize is requested, data is copied in parallel from the source cluster (which continues working in read-only mode) to a new cluster. Once all the data has been migrated, DNS is flipped to the new cluster and the old cluster is automatically disabled. This allows customers to easily scale up and down, and each scaling event nicely re-stripes the data across the new cluster for a balanced workload.
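A resize can also be requested programmatically; here is a minimal, hypothetical boto3 call (the identifier and node count are placeholders). The classic resize path follows the copy-and-re-stripe process described above.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Classic resize: data is re-striped onto a new, larger cluster while the
# source cluster stays available in read-only mode.
redshift.resize_cluster(
    ClusterIdentifier="demo-cluster",
    ClusterType="multi-node",
    NumberOfNodes=4,
    Classic=True,
)
```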
AWS Redshift features and capabilities
Redshift offers mature, native, and tunable security, and integrates with many other AWS services, some of which are shown below:
Image by Author
What AWS Redshift is not…
Now that we know what Redshift is, let’s also see what it is not:
- It is not a NoSQL engine.
- It’s not a suitable solution for searching through large collections of text documents.
- It is not a general-purpose RDBMS, nor is it intended to serve OLTP workloads for external customers.
- It is not a real-time analysis engine.
- It is not a good place to store anything but structured data.
- It is not the fastest way to analyze data, nor is it the cheapest way to store it.
It is definitely a cloud-based enterprise data warehouse (EDW) that allows internal users to quickly perform business analytics on large collections of both rolled-up and granular data.
Conclusion
In this article, we have learned a little bit of the history of OLAP cubes and workloads, what this technology has represented, and how it has contributed to what we now know as BI analytics. We have also seen how it all evolved into the diverse ecosystem of services that we know as AWS.
Sources:
[1] https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-cube/
[2] http://novacontext.com/an-introduction-to-amazon-redshift/
[3] https://docs.aws.amazon.com/redshift/latest/mgmt/overview.html
[4] https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html