Comparing cloud storage

Cloud Datastore is best for semi-structured application data used in App Engine applications.

Bigtable is best for analytical data with heavy read/write events, like AdTech, financial, or IoT data.

Cloud Storage is best for structured and unstructured, binary or object data like images, large media files, and backups.

Cloud SQL is best for web frameworks and existing applications, like storing user credentials and customer orders.

Cloud Spanner is best for large-scale database applications that are larger than two terabytes, for example financial trading and e-commerce use cases.

Possible Virtual Machines for Machine Learning (AWS)

These are a few VM options for machine learning on AWS.

Source: "Understanding ML virtual servers" – Amazon Web Services Machine Learning Essential Training

 

The first is software as a service. It's called Databricks. It's a third-party service, so you pay them and then you access the underlying Amazon resources. They are well known for their implementation of Spark as a service.

Their implementation includes their own version of a notebook. It is a Jupyter-like notebook, but it's a Databricks notebook. They also provide an optimized edition of a Spark cluster and the ability to install additional libraries.

The next level is platform as a service, and Amazon's offering there is Elastic MapReduce (EMR), which is managed Hadoop and Spark. It comes with the ability to install common libraries, such as Spark, Hive, and Pig, just by clicking when you set it up (or by setting a flag if you're doing it via script), and you can also optionally install additional machine learning libraries such as TensorFlow and MXNet with bootstrap actions. Interestingly, Amazon has already installed environments called Sparkmagic in SageMaker notebooks, so a connection to an external Spark cluster, including EMR, can be made easily.

Now the third possibility is infrastructure as a service. Most people would say, "Well, you can use an EC2 machine learning or deep learning AMI, or you can just use plain EC2." And yes, I think you can use a machine learning AMI: it's optimized for deep learning and all the libraries are already pre-installed. I would not recommend plain EC2, because you must manually install and configure all the language runtimes and machine learning libraries, and I have seen this task take people days or even weeks to set up a cluster of machines.

Choosing a data warehouse

 

Postgres

Postgres is a relational database, meaning that it stores rows of data. It is fantastic for transactional application workloads, like selecting a single record from a users table, or updating the status of a couple of orders.

Analytical queries tend to look very different than these “transactional” queries. Instead, you might do a count(distinct) on a whole column of data with no filters applied, or join four tables together and group by half a dozen fields, aggregating the rest. Postgres is not designed to handle queries like this, and you’ll probably find that queries take forever to return unless indexes are liberally added to the columns that you join on.

To Postgres’s credit, it will serve you well for analytical queries until your tables contain millions of records. My thinking here is generally that your data will grow in volume. You can either migrate to a more capable analytical warehouse early (when it’s not strictly necessary yet), and provide yourself a whole lot of runway. Or, you can wait until your data (and team!) outgrows Postgres, and you are forced to migrate. I personally would prefer to be in the former situation, as migrating warehouses is very possible, but can be very not fun.

Redshift

Amazon Redshift is a columnar database that is based on Postgres. Postgres and Redshift feel pretty similar – they both use the same protocol, and share (mostly) identical syntaxes.

Redshift stores data in columns (not rows), meaning that a given query will only read the columns required for its execution. This means that operations like full table scans are not only doable, but totally reasonable. You would probably have a bad time doing a full table scan on a large Postgres table.

A Redshift database is a cluster of worker database nodes. These nodes are able to split up a query, execute its constituent parts in parallel, then combine the results. As a result, you can horizontally scale performance by adding nodes to your cluster. If you check out the Redshift pricing page you'll see that adding a node will give you more processing power, as well as more storage.

This is sort of a problem though: sometimes you need a lot of storage space, but you don’t need the extra processing power. The converse may also be true. This coupling of “storage” and “compute” is one of the main drawbacks of Redshift compared to the other databases listed below. As data teams push the limits of Redshift, they find themselves “resizing” their cluster to add another node every couple of months. This can be a painful and expensive process, and it gets pretty old pretty quickly.

Redshift comes with a bunch of configuration to optimize query performance, like sort keys and dist keys. You can also configure compression for your columns, or perform table maintenance tasks like vacuum and analyze. It ends up being close to a full-time job to keep Redshift humming along, and honestly, I think that time could be better spent actually doing analytics. The databases below don't have as many bells and whistles, but you end up not really missing them. Maintaining Redshift can be a real chore.

Redshift is pretty easy to set up and is integrated deeply into the AWS ecosystem. You can natively pipe in data from S3, Kinesis, Glue, etc. If using these tools is high on your priority list, then Redshift might be a good choice for your organization.

BigQuery

BigQuery is Google's take on a distributed analytical database. Whereas in Redshift you might have six or eight compute nodes, BigQuery will throw hundreds or thousands of nodes at your query. Further, storage on BigQuery is effectively infinite, and you just pay for how much data you load into and query in the warehouse. This means that truly insane amounts of data can be queried with pretty astounding query latency. This video is the best resource I've found on BigQuery under the hood; definitely give it a quick look if you're interested in using BQ.

So, BigQuery can process really big datasets really quickly, but it of course comes with some caveats. BigQuery is really, very bad at doing joins. The recommended way to handle one-to-many relationships is through repeated records. Repeated records are a radically new data storage pattern that Google just decided to make up. These repeated records are incredibly useful and they work very well, but you'll need to internalize this feature of BigQuery to make effective use of its power.

I have a pretty good intuition for SQL and how databases work, but really, it just doesn’t translate to BigQuery. I think that BQ is a great choice for a database, as long as the folks in your organization are willing to rewire their brains to take advantage of its features and access patterns.

BigQuery has some annoying artificial limitations. Date partitioning is crucial for performant querying, but only something like 2,300 date partitions can be created at a given time. My bet is that limitations like these will become more flexible (or disappear) over time, but keep in mind that Gmail was in beta for five years, and Google's gonna do Google things.

Snowflake

Snowflake is situated as a sort of happy medium between Redshift and BigQuery. It generally looks and acts how you would expect a database to behave. If you have existing intuition around analytics, it will probably translate to Snowflake pretty readily. Check out this article for more information on migrating from Redshift to Snowflake.

Snowflake is more similar in architecture to Redshift than BigQuery. Crucially though, its storage is decoupled from its compute. With Snowflake you pay for 1) storage space used and 2) amount of time spent querying data. Snowflake also has a notion of a “logical warehouse” which is the “compute” aspect of the database. These warehouses can be scaled up or down to deliver different grades of performance. You can also configure the number of compute nodes to parallelize query execution. These warehouses can be configured to “pause” when you’re not using them. As a result, you can have a super beefy warehouse for BI queries that’s only running when people are using your BI tools, while your background batch jobs can use cheaper hardware.

Snowflake doesn’t have all the performance optimization bells and whistles of Redshift, but you end up not really needing or missing them. Snowflake’s query optimizer is substantially better than Redshift’s, and I find myself not really thinking about performance considerations when writing SQL for Snowflake.

Final thoughts

I prefer using Snowflake and BigQuery over Redshift because I can mostly just write the logic I want to implement without regard for how the database will execute the query behind the scenes. On Redshift, you have to think about column compression, sort keys, dist keys, disk utilization, or vacuum/analyze statements. ERROR: Disk Full is a thing you have to deal with on Redshift, but that’s just not a class of error that can occur on BigQuery or Snowflake.

It is absolutely possible to migrate warehouses, but it can definitely be a big pain. The best advice I can give is: conduct analytics in a way that preserves your optionality. Avoid writing custom scripts to pipe data into Redshift; instead, use a tool like Singer that will pipe data to any of the warehouses listed above. Be judicious when using non-standard features of a warehouse (like Postgres GIS plugins, or BigQuery javascript UDFs). To be sure, you should use your warehouse to its fullest potential, but definitely be mindful of how these decisions impact your switching cost. Today, Redshift feels old and crufty, but it was the best-in-class analytical warehouse only 6 years ago. This industry moves fast!

The good news is that technology in the data space is becoming more modular, and databases are continuing to coalesce around a standard-ish version of SQL. If you're just getting started with a warehouse, really any of Redshift, Snowflake, or BigQuery are likely appropriate. I think that choosing Snowflake or BigQuery is probably a more forward-thinking decision, and they're my preference absent any other business-specific considerations.


 

purrr – mapping pmap functions to data

In the functional programming paradigm, map is used to map a set of values to another set of values, based on the function supplied.

 


In a general sense, a function should map one value to another; the map function applies this to a list of inputs to produce a new set of outputs. It takes two inputs:

  • Function
  • Sequence of values

It produces a new sequence of values to which the function has been applied.
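
For example, a minimal sketch (the doubling function here is just for illustration, not from the original post):

```r
library(purrr)

# A function that maps one value to another
double_it <- function(x) x * 2

# Apply the function to each element of a sequence
map(1:3, double_it)
```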

which prints:
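
```
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6
```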

Note that the above is only used for one input. For two input values you can use map2.
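
For instance, a small illustrative sketch of map2 (the vectors are made up for the example):

```r
library(purrr)

# map2 walks over two vectors in parallel
map2(c(1, 2, 3), c(10, 20, 30), function(a, b) a * b)
# returns a list: 10, 40, 90
```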

Now, for situations where you need to apply a function to multiple input values (say, multiple lists), you can use pmap.

An important point: the lengths of x and y should be the same.
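
A minimal sketch (the lists x and y here are made up for illustration):

```r
library(purrr)

x <- list(1, 2, 3)
y <- list(10, 20, 30)

# pmap takes a single list of arguments, then the function
pmap(list(x = x, y = y), function(x, y) x + y)
```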

which produces:
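
```
[[1]]
[1] 11

[[2]]
[1] 22

[[3]]
[1] 33
```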

 

What's the difference from map and map2?

Both map and map2 take vector arguments directly like x = and y = , while pmap takes a list of arguments like list(x = , y = ).

Exploring purrr further, I see new use cases which I will explain in the next posts.

 

purrr is a productivity ninja. Try to use it.

Think Functional!

R Markdown – How to run a markdown file (.Rmd) from an R script (.R)?

There are situations when you would want to generate an R Markdown report for each item in a list (e.g. a performance report for a number of salespeople).

In this scenario you can run your markdown file (.Rmd) from an R script (.R) in the following way:

R script
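
A sketch of the approach (the vector of salespeople and the output file names are made up for illustration; by default render() evaluates the document in the calling environment, so variables defined in the script are visible inside the Rmd):

```r
library(rmarkdown)

# Hypothetical list of people to report on
sales_people <- c("Alice", "Bob", "Carol")

for (person in sales_people) {
  # `person` is visible inside rmd_file.Rmd because render()
  # evaluates the document in the calling environment
  render(
    input       = "rmd_file.Rmd",
    output_file = paste0("report_", person, ".html")
  )
}
```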

Note that person is the variable you can use in the rmd_file.Rmd file.

You can see more options for the render function here.

 

Functional Programming in R – what is Lazy evaluation?

Under lazy evaluation, the evaluation of a function argument is delayed until it is needed. This differs from strict evaluation, where the arguments of a function are evaluated before the function is called.

Let's see an example:
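
A minimal sketch of the idea (the functions f and g and the printed message are my own illustration):

```r
f <- function(x, y) {
  x            # y is never used, so its argument is never evaluated
}

g <- function(x, y) {
  x + y        # y is used, so its argument is evaluated
}

f(1, print("I am evaluated"))         # returns 1; "I am evaluated" is never printed
g(1, { print("I am evaluated"); 2 })  # prints "I am evaluated", then returns 3
```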

In the first function call, "I am evaluated" is not printed, as the argument is never required to be evaluated. This is also called call-by-promise.

It can be useful when you have parameters that need to be evaluated only when certain conditions are met.

Another example:
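
Again a sketch (the function h and the messages are illustrative):

```r
h <- function(a, b) {
  message("before touching b")
  a + b        # b's argument is evaluated only at this point
}

h(1, { message("evaluating b"); 2 })
# before touching b
# evaluating b
# [1] 3
```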

Here b is not evaluated until execution reaches the expression a + b.

Moreover, the expressions you give to function calls are evaluated in the scope outside the function.

Hence the code below gives an error: when b = a is evaluated, it looks for the value of a in the global scope (outside the function f), where a is not defined.
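
A sketch of the failure (assuming no variable a exists in the global environment):

```r
f <- function(a, b) {
  a + b
}

# The expression `a` supplied for b is evaluated in the global
# environment, not inside f, so the parameter a is not visible
f(2, b = a)
# Error: object 'a' not found
```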


Testing in R – Introduction

Resources for testing in R are not readily available. In this series (Testing in R), I will be sharing my learnings in a comprehensible format.

The problem with data analysis is that a small mistake in your model can give you a big change in your results.

Maybe the columns that you are working on cannot have negative values, or should fall within a range, or should satisfy specific criteria that establish the validity of your data. Testing helps to automate these checks, both on your logic and on your data.

Another aspect of incorporating testing in your work is when you are developing an R package that is to be publicly available (perhaps on CRAN). Users will certainly be using your code in ways that you didn't imagine, and on datasets in different formats.

Testing would be needed in the following two scenarios:

  • Unit testing (development-time testing) – when you want to avoid errors during development; this is for the developer to put checks on the logic that she has written.
  • Runtime testing – helps to prevent user errors by checking for bad user inputs, e.g. only numbers are accepted for arithmetic operations.

In summary, the point of development-time testing is to catch the developer's errors, while runtime testing is mainly to check for incorrect inputs and inform the user.
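
As a rough sketch of both kinds (the testthat package and the safe_add function are my own choices for illustration):

```r
library(testthat)

# Runtime check: inform the user about bad input
safe_add <- function(x, y) {
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("safe_add() only accepts numbers")
  }
  x + y
}

# Development-time (unit) test: check the developer's own logic
test_that("safe_add adds numbers and rejects non-numbers", {
  expect_equal(safe_add(2, 3), 5)
  expect_error(safe_add("a", 3), "only accepts numbers")
})
```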

I will be sharing more detail in the next posts.

R: Column Summarisation using tidyverse and purrr – towards Functional Programming

I was working on column summarisation (mean, median, standard deviation, etc.) and found better ways to select and summarise the data.

Let’s start

Let's take the iris data.
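
A quick look at the first few rows:

```r
head(iris)
```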


A usual way to summarise (the one I used) is this:
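
Something along these lines (an illustrative reconstruction, using mean and sd on the two Sepal columns):

```r
library(dplyr)

iris %>%
  summarise(
    sepal_length_mean = mean(Sepal.Length),
    sepal_length_sd   = sd(Sepal.Length),
    sepal_width_mean  = mean(Sepal.Width),
    sepal_width_sd    = sd(Sepal.Width)
  )
```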

This code has the following issues:

  • The functions mean and sd are repeated.
  • I need to write out each column name explicitly.

To resolve this I started exploring and found a better way to select columns and perform column-wise summaries.

On selecting:

Let's say I want to select the first three columns. I would do it like this:
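
For example (one straightforward way to express it):

```r
library(dplyr)

iris %>% select(Sepal.Length:Petal.Length)
```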

This will select the columns starting from Sepal.Length to Petal.Length.

You can use regex patterns as well. Like in the example below, if I just need the summaries of the columns that start with "Sepal", I can select them like this:
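
Along these lines (both forms are illustrative: starts_with is a select helper, matches takes a regular expression):

```r
library(dplyr)

# Using a select helper
iris %>% select(starts_with("Sepal"))

# Or with a regular expression
iris %>% select(matches("^Sepal"))
```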

This will select only the Sepal.Length and Sepal.Width columns.

You can also use many helper functions with select, and hence select columns based on name, position, and so on.

 Summarising the data:

You can directly use a function called summarise_at:
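
Roughly like this (the mean/sd pairing mirrors the earlier example):

```r
library(dplyr)

iris %>%
  summarise_at(vars(starts_with("Sepal")), list(mean = mean, sd = sd))
```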

The result is the mean and sd of all the columns that start with "Sepal".

As you can see we have calculated the summaries without repeating the function names. In a way, we are applying functions to data, and not the opposite.

There are scenarios where the result of a function is multiple values, e.g. when we use quantile.

Let's see an example:
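
For instance, quantile on a single column returns five values at once:

```r
quantile(iris$Sepal.Length)
#    0%   25%   50%   75%  100%
#   4.3   5.1   5.8   6.4   7.9
```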

How can we perform the summarisation, similar to the previous examples?

map, a function from the functional programming toolkit purrr, comes in handy here. The only difference is that we have to use bind_rows.
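
One way to sketch it (the as_tibble(as.list(...)) step is my own way of turning each quantile vector into a one-row tibble before binding):

```r
library(dplyr)
library(purrr)
library(tibble)

iris %>%
  select(starts_with("Sepal")) %>%
  map(quantile) %>%                     # a named list of quantile vectors
  map(~ as_tibble(as.list(.x))) %>%     # each vector becomes a one-row tibble
  bind_rows(.id = "column")             # stack them, keeping the column names
```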

This results in quantiles for the two columns that start with “Sepal”.

Summarise columns with specific properties.

Suppose you want to summarise all the columns which are numeric. You can achieve this using summarise_if:
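
For example (again pairing mean and sd, as above):

```r
library(dplyr)

iris %>%
  summarise_if(is.numeric, list(mean = mean, sd = sd))
```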

This results in summaries for all numeric columns.

 

Grouping

For all of the above, you can also apply grouping as well:
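
A short sketch, grouping by Species:

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean)
```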

 

Next Steps

I will be sharing more on applying Functional Programming principles with R.

R is a Functional Programming Language

A write-up by Hadley Wickham shows the functional programming aspects of R. It's a great article for understanding both R and the use of functions as the core of operations in R. Three building blocks are explained:

 1. Anonymous functions

 2. Closures (functions written by functions)

 3. Lists of functions.
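
A tiny sketch of the three ideas (my own illustration, not taken from the article):

```r
# 1. An anonymous function
sapply(1:3, function(x) x^2)

# 2. A closure: a function written by a function
power <- function(exp) {
  function(x) x ^ exp
}
square <- power(2)
square(4)   # 16

# 3. A list of functions
summaries <- list(mean = mean, sd = sd)
lapply(summaries, function(f) f(iris$Sepal.Length))
```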