Choosing a data warehouse


 

photo-1468664093569-795a12e8b31cPostgres

Postgres is a relational database, meaning that it stores rows of data. It is fantastic for transactional application workloads, like selecting a single record from a users table, or updating the status of a couple of orders.

Analytical queries tend to look very different than these “transactional” queries. Instead, you might do a count(distinct) on a whole column of data with no filters applied, or join four tables together and group by half a dozen fields, aggregating the rest. Postgres is not designed to handle queries like this, and you’ll probably find that queries take forever to return unless indexes are liberally added to the columns that you join on.

To Postgres’s credit, it will serve you well for analytical queries until your tables contain millions of records. My thinking here is generally that your data will grow in volume. You can either migrate to a more capable analytical warehouse early (when it’s not strictly necessary yet), and provide yourself a whole lot of runway. Or, you can wait until your data (and team!) outgrows Postgres, and you are forced to migrate. I personally would prefer to be in the former situation, as migrating warehouses is very possible, but can be very not fun.

Redshift

Amazon Redshift is a columnar 3 database that is based on Postgres. Postgres and Redshift feel pretty similar – they both use the same protocol, and share (mostly) identical syntaxes.

Redshift stores data in columns (not rows), meaning that a given query will only read the columns required for its execution. This means that operations like full table scans 2 are not only doable, but totally reasonable. You would probably have a bad time doing a full table scan on a large Postgres table.

A Redshift database is a cluster of worker database nodes. These nodes are able to split up a query, execute its constituent parts in parallel, then combine the results. As a result, you can horizontally scale performance by adding nodes to your cluster. If you check out the Redshift pricing page 3 you’ll see that adding a node will give you more processing power, as well as more storage.

This is sort of a problem though: sometimes you need a lot of storage space, but you don’t need the extra processing power. The converse may also be true. This coupling of “storage” and “compute” is one of the main drawbacks of Redshift compared to the other databases listed below. As data teams push the limits of Redshift, they find themselves “resizing” their cluster to add another node every couple of months. This can be a painful and expensive process, and it gets pretty old pretty quickly.

Redshift comes with a bunch of configuration to optimize query performance like sort keys and dist keys. You can also configure compression for your columns, or perform table maintenance tasks like vacuum 1 and analyze. It ends up being close to a full-time job to keep Redshift humming along, and honestly, I think that time could be better spent actually doing analytics. The databases below don’t have as many bells and whistles, but you end up not really missing them. Maintaining Redshift can be a real chore.

Redshift is pretty easy to set up and is integrated deeply into the AWS ecosystem. You can natively pipe in data from S3, Kinesis, GLUE, etc. If using these tools is high on your priority list, then Redshift might be a good choice for your organization.

BigQuery

BigQuery is Google’s take on a distributed analytical database. Whereas in Redshift you might have six or eight compute nodes, BigQuery will throws hundreds or thousands of nodes at you query. Further, storage on BigQuery is effectively infinite, and you just pay for how much data you load into and query in the warehouse. This means that truly insane amounts of data can be queried with pretty astounding query latency. This video 5 is the best resource I’ve found on BigQuery under the hood; definitely give it a quick look if you’re interested in using BQ.

So, BigQuery can process really big datasets really quickly, but it of course comes with some caveats. BigQuery is really, very bad at doing joins. The recommended way to to handle one-to-many relationships is through repeated records 10. Repeated records are a radically new data storage pattern that Google just decided to make up. These repeated records are incredibly useful and they work very well, but you’ll need to internalize this feature of BigQuery to make effective use of its power.

I have a pretty good intuition for SQL and how databases work, but really, it just doesn’t translate to BigQuery. I think that BQ is a great choice for a database, as long as the folks in your organization are willing to rewire their brains to take advantage of its features and access patterns.

BigQuery has some annoying artificial limitations. Date partitioning is crucial for performant querying, but only something like 2,300 date partitions can be created at a given time. My bet is that limitations like these will become more flexible (or disappear) over time, but keep in mind that GMail was in Beta for five years, and Google’s gonna do Google things.

Snowflake

Snowflake is situated as a sort of happy medium between Redshift and BigQuery. It generally looks and acts how you would expect a database to behave. If you have existing intuition around analytics, it will probably translate to Snowflake pretty readily. Check out this article 3 for more information on migrating from Redshift to Snowflake.

Snowflake is more similar in architecture to Redshift than BigQuery. Crucially though, its storage is decoupled from its compute. With Snowflake you pay for 1) storage space used and 2) amount of time spent querying data. Snowflake also has a notion of a “logical warehouse” which is the “compute” aspect of the database. These warehouses can be scaled up or down to deliver different grades of performance. You can also configure the number of compute nodes to parallelize query execution. These warehouses can be configured to “pause” when you’re not using them. As a result, you can have a super beefy warehouse for BI queries that’s only running when people are using your BI tools, while your background batch jobs can use cheaper hardware.

Snowflake doesn’t have all the performance optimization bells and whistles of Redshift, but you end up not really needing or missing them. Snowflake’s query optimizer is substantially better than Redshift’s, and I find myself not really thinking about performance considerations when writing SQL for Snowflake.

Final thoughts

I prefer using Snowflake and BigQuery over Redshift because I can mostly just write the logic I want to implement without regard for how the database will execute the query behind the scenes. On Redshift, you have to think about column compression, sort keys, dist keys, disk utilization, or vacuum/analyze statements. ERROR: Disk Full is a thing you have to deal with on Redshift, but that’s just not a class of error that can occur on BigQuery or Snowflake.

It is absolutely possible to migrate warehouses, but it can definitely be a big pain. The best advice I can give is: conduct analytics in a way that preserves your optionality. Avoid writing custom scripts to pipe data into Redshift; instead, use a tool like Singer 19 that will pipe data to any of the warehouses listed above. Be judicious when using non-standard features of a warehouse (like Postgres GIS plugins, or BigQuery javascript UDFs). To be sure, you should use your warehouse to its fullest potential, but definitely be mindful of how these decisions impact your switching cost. Today, Redshift feels old and crufty, but it was the best-in-class analytical warehouse only 6 years ago. This industry moves fast!

The good news is that technology in the data space is becoming more modular 3, and databases are continuing to coalesce around a standard-ish version of SQL. If you’re just getting started with a warehouse, really any of Redshift, Snowflake, or BigQuery are likely appropriate. I think that choosing Snowflake or BigQuery is probably a more forward-thinking decision, and they’re my preference absent any other business-specific considerations.

Here’s some further reading:

published here –

 


Leave a Reply

Your email address will not be published. Required fields are marked *