Tuesday, December 12

Cloud Dataproc: Google’s new managed service for Hadoop and Spark

Google has announced a new cloud service in beta that makes data analysis on Hadoop and Spark easier and faster. Offered as a managed service, Cloud Dataproc lets users take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

“Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them,” Google Cloud Platform product manager James Malone wrote in a blog post. “Cloud Dataproc minimizes the time you spend on administration and management.”

 

Cloud Dataproc is priced at 1 cent per virtual CPU in a cluster per hour, charged on top of the other Cloud Platform resources the cluster uses.

“In addition to this low price, Cloud Dataproc clusters can include preemptible instances that have lower compute prices, reducing your costs even further. Instead of rounding your usage up to the nearest hour, Cloud Dataproc charges you only for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.”
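To make the billing arithmetic concrete, here is a minimal Python sketch of the Cloud Dataproc surcharge alone, using the figures quoted above: 1 cent per virtual CPU per hour, per-minute billing, and a ten-minute minimum. The function name is hypothetical, and the rounding of partial minutes up to the next whole minute is an assumption, since the announcement does not say how fractions of a minute are treated. Compute Engine and other resource charges are not included.

```python
import math

# Figures taken from the announcement.
RATE_PER_VCPU_HOUR_USD = 0.01   # Dataproc surcharge per virtual CPU per hour
MIN_BILLED_MINUTES = 10         # ten-minute minimum billing period


def dataproc_fee_usd(total_vcpus: int, runtime_minutes: float) -> float:
    """Estimate the Dataproc surcharge for one cluster run.

    Assumption (not stated in the article): partial minutes are rounded
    up to the next whole minute before the minimum is applied.
    """
    if total_vcpus < 0 or runtime_minutes < 0:
        raise ValueError("vCPU count and runtime must be non-negative")
    billed_minutes = max(MIN_BILLED_MINUTES, math.ceil(runtime_minutes))
    return total_vcpus * RATE_PER_VCPU_HOUR_USD * billed_minutes / 60


def hourly_rounded_fee_usd(total_vcpus: int, runtime_minutes: float) -> float:
    """Same rate, but with usage rounded up to whole hours, for comparison."""
    billed_hours = max(1, math.ceil(runtime_minutes / 60))
    return total_vcpus * RATE_PER_VCPU_HOUR_USD * billed_hours


if __name__ == "__main__":
    # A hypothetical 24-vCPU cluster that runs for 45 minutes.
    print(f"per-minute billing: ${dataproc_fee_usd(24, 45):.2f}")
    print(f"hourly rounding:    ${hourly_rounded_fee_usd(24, 45):.2f}")
```

Under these assumptions, a 24-vCPU cluster that runs for 45 minutes incurs a Dataproc fee of about 18 cents, against 24 cents if usage were rounded up to the full hour. A job that finishes in 4 minutes is still billed for the ten-minute minimum.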

Without Dataproc, creating Spark and Hadoop clusters on-premises or through IaaS providers can take anywhere from 5 to 30 minutes. By comparison, Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less on average.

The service has built-in integration with other Google Cloud Platform services, such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring. For example, Cloud Dataproc can be used to extract, transform, and load (ETL) terabytes of raw log data directly into BigQuery for business reporting.

Google said companies can use Spark and Hadoop clusters without the assistance of an administrator or special software. Instead, they can interact with clusters and Spark or Hadoop jobs through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc REST API. When a cluster is no longer in use, it can be turned off to avoid spending money needlessly.
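As a rough illustration of the REST route, the Python sketch below only assembles the URL and JSON body for creating and later deleting a cluster. It sends nothing over the network and leaves out authentication. The endpoint layout and field names follow the general v1 Cloud Dataproc REST API as best understood here, which may differ from the beta release described in this article, so check them against the current API reference. The project, region, cluster name, and machine type are hypothetical placeholders.

```python
import json

# Assumed base URL of the v1 REST API; the beta API may have used another path.
BASE_URL = "https://dataproc.googleapis.com/v1"


def build_create_cluster_request(project_id, region, cluster_name,
                                 num_workers, machine_type="n1-standard-4"):
    """Return (url, body) for a POST request that would create a cluster."""
    url = f"{BASE_URL}/projects/{project_id}/regions/{region}/clusters"
    body = {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # One master node plus the requested number of workers.
            "masterConfig": {"numInstances": 1,
                             "machineTypeUri": machine_type},
            "workerConfig": {"numInstances": num_workers,
                             "machineTypeUri": machine_type},
        },
    }
    return url, body


def build_delete_cluster_url(project_id, region, cluster_name):
    """Return the URL for a DELETE request that would remove the cluster."""
    return (f"{BASE_URL}/projects/{project_id}/regions/{region}"
            f"/clusters/{cluster_name}")


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    url, body = build_create_cluster_request(
        "example-project", "global", "log-etl-cluster", num_workers=2)
    print("POST", url)
    print(json.dumps(body, indent=2))
    print("DELETE", build_delete_cluster_url(
        "example-project", "global", "log-etl-cluster"))
```

Pairing a create request with a delete request mirrors the advice above: start a cluster for a job, then remove it when the work is done so it stops accruing charges.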

The current implementation of Cloud Dataproc features clusters based on Spark 1.5 and Hadoop 2.7.1.