July updates from Google Cloud

  • Chronicle joined the Google Cloud family. This means more tools, such as malware intelligence services, will be available to use with your cloud infrastructure and your data, enabling you to detect and mitigate threats faster.
  • Investigating production code just got easier with Stackdriver Profiler, now generally available. Profiler lets you look at how your code runs and breaks down where time and resources are being spent, with no noticeable performance impact.
  • BigQuery is now integrated with Kaggle, making it easier and faster to analyze data and train models.
  • It can take a lot of time to set up a machine learning project, especially when managing the complexity and compatibility issues of an ever-evolving software stack. When you’d rather just spend time refining your model, deep learning containers are a great place to start. These containers are prepackaged, performance-optimized, and compatibility tested, so you have a consistent environment to work in.
  • Cloud Data Catalog is now available in public beta. Data Catalog is a fully managed data discovery and metadata management service, which means you can use Data Catalog to easily search for tables in Google BigQuery or topics in Cloud Pub/Sub. You can even filter for sensitive data thanks to Data Catalog’s integration with Cloud Data Loss Prevention. 
  • Connecting to external services on Kubernetes has historically required compromises, but no more thanks to Workload Identity. Workload Identity creates a relationship between Kubernetes service accounts and Cloud IAM so that you can define which workloads run as which identities, grant access to other Cloud services, and never worry about Kubernetes secrets or IAM service account keys again. 

Comparing cloud storage

Cloud Datastore is best for semi-structured application data, such as the data used in App Engine applications.

Bigtable is best for analytical data with heavy read/write events, such as AdTech, financial, or IoT data.

Cloud Storage is best for structured and unstructured, binary or object data like images, large media files, and backups.

Cloud SQL is best for web frameworks and existing applications, such as storing user credentials and customer orders.

Cloud Spanner is best for large-scale database applications larger than two terabytes, for example financial trading and e-commerce use cases.

Possible Virtual Machines for Machine Learning (AWS)

These are a few VM options for machine learning built on AWS.

Source: “Understanding ML virtual servers,” Amazon Web Services Machine Learning Essential Training.


The first is Software as a Service: Databricks. It’s a third-party service, so you pay them and then access the underlying Amazon resources. They are well known for their implementation of Spark as a service.

Their implementation includes their own version of a notebook. It is a Jupyter-like notebook, but it’s a Databricks notebook. It also includes an optimized version of a Spark cluster and the ability to install additional libraries.

The next level is Platform as a Service, and Amazon’s offering there is Elastic MapReduce (EMR), which is managed Hadoop and Spark. It lets you install common libraries such as Spark, Hive, and Pig just by clicking during setup (or by setting a flag if you’re doing it via script), and you can optionally install additional machine learning libraries such as TensorFlow and MXNet with bootstrap actions. Interestingly, Amazon has already installed Sparkmagic environments in SageMaker notebooks, so that a connection to an external Spark cluster, including EMR, can be made easily.

Now the third possibility is Infrastructure as a Service. Most people would say, “Well, you can use an EC2 machine learning or deep learning AMI, or you can just use EC2.” And yes, you can use a machine learning AMI: it’s optimized for deep learning, and all the libraries are already pre-installed. I would not recommend plain EC2, because you must manually install and configure all the language runtimes and machine learning libraries, and I have seen this task take people days or even weeks to set up across a cluster of machines.

purrr – mapping pmap functions to data

In the functional programming paradigm, map transforms a set of values into another set of values based on the function supplied.


In the general sense, a function maps one value to another. The map function extends this idea, applying a function across a list of inputs to produce a new list of outputs. It takes two inputs:

  • A function
  • A sequence of values

It produces a new sequence of values with the function applied to each element.
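Here is a minimal sketch, assuming purrr is installed; the input list x and the squaring function are illustrative:

```r
library(purrr)

# A list of input values
x <- list(1, 2, 3)

# Apply a squaring function to every element of x
map(x, function(v) v^2)
```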

which prints:
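```
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9
```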

Note that map handles only a single input. For two input values you can use map2, as sketched below.
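A map2 sketch under the same illustrative setup; the lists x and y and the addition function are made up for the example (output shown in the #> comments):

```r
x <- list(1, 2, 3)
y <- list(10, 20, 30)

# map2 applies a two-argument function element-wise over two inputs
map2(x, y, function(a, b) a + b)
#> [[1]]
#> [1] 11
#>
#> [[2]]
#> [1] 22
#>
#> [[3]]
#> [1] 33
```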

Now, for situations where you need to apply multiple input values (say, multiple lists) to a function, you can use pmap.

An important point: x and y must be the same length.
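A minimal pmap sketch; again, the lists x and y and the multiplication function are illustrative:

```r
x <- list(1, 2, 3)
y <- list(10, 20, 30)

# pmap takes a single list of inputs and applies the function element-wise
pmap(list(x, y), function(a, b) a * b)
```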

which produces:
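```
[[1]]
[1] 10

[[2]]
[1] 40

[[3]]
[1] 90
```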


What’s the difference from map and map2?

Both map and map2 take their vector arguments directly (x and y as separate arguments), while pmap takes a single list of arguments, like list(x, y).
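A side-by-side sketch of the two calling conventions, reusing the illustrative x and y from above:

```r
# map2: the two inputs are passed as separate arguments
map2(x, y, function(x, y) x + y)

# pmap: the inputs are wrapped in one (optionally named) list;
# named elements are matched to the function's argument names
pmap(list(x = x, y = y), function(x, y) x + y)
```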

Exploring purrr further, I keep finding new use cases, which I will explain in future posts.


purrr is a productivity ninja. Try it out.

Think Functional!