Our partnership with Google Cloud has given me the opportunity to explore Google Cloud Platform's (GCP) Big Data and Cloud Data Warehouse technologies. I recently helped a client run a Proof of Concept to explore the viability of using GCP's Cloud Dataflow and BigQuery services to uplift their reporting and analytics capability.
The Proof of Concept
The client currently has a real-time streaming ETL pipeline from its data sources, built on Kinesis, Apache Storm, S3 and Redshift. The entire pipeline is hosted on AWS.
A key differentiator for GCP's Cloud Dataflow and BigQuery offerings is that both services adopt a serverless approach. Going serverless allows these services to promise effectively unlimited scalability, and it shifts the burden of infrastructure and cluster management from the customer to the cloud provider.
With these benefits in mind, the client wanted to prove out BigQuery as quickly as possible, with minimal impact to the existing ETL pipeline. We therefore needed to find an appropriate place to "fork" the data flow from the existing pipeline into GCP, as well as provide a minimal ETL pipeline so that the data could be streamed into BigQuery in real time.
We decided to adopt the following approach.
We decided to fork the pipeline from the existing staging S3 bucket using a Lambda function. The Lambda function executes for every file that is dropped into the S3 bucket and copies the same file verbatim into a Google Cloud Storage (GCS) bucket.
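As a sketch of the fork, the Lambda handler only needs to read the S3 event and mirror each object path into GCS verbatim. The names below (the mirror bucket, the helper function) are hypothetical, and the actual byte copy (via boto3 and the google-cloud-storage client) is shown only in comments so the path mapping stays self-contained:

```python
# Hypothetical mirror bucket on the GCP side (not the client's real name).
GCS_BUCKET = "client-staging-mirror"

def replicate_record(record):
    """Map one S3 event record to the source/destination paths of the copy."""
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # In the real handler, the bytes would be streamed across here, e.g.:
    #   body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    #   storage.Client().bucket(GCS_BUCKET).blob(key).upload_from_file(body)
    return (f"s3://{bucket}/{key}", f"gs://{GCS_BUCKET}/{key}")

def handler(event, context):
    """Lambda entry point: invoked once per file dropped into the S3 bucket."""
    return [replicate_record(r) for r in event["Records"]]
```

Keeping the object key identical on both sides makes the GCS bucket a verbatim mirror of the S3 staging bucket, so the downstream GCP pipeline never needs to know where the files originated.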
Once the data is replicated into GCS, there are two main options:
- Trigger a Cloud Function on every file that is copied (google.storage.object.finalize), which in turn executes a Dataflow job (in batch mode) to import the contents of that single file into BigQuery, or
- Run a Cloud Dataflow job (in stream mode) that watches the GCS bucket and publishes the contents of new files to a Cloud Pub/Sub topic, then run another Cloud Dataflow job (in stream mode) to stream the data from the Cloud Pub/Sub topic into BigQuery
The benefit of both these options is that Google already provides the code out of the box (https://github.com/GoogleCloudPlatform/DataflowTemplates), so development effort is minimal.
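For illustration, option 2 can be wired up from those stock templates with two gcloud commands. The project, bucket, topic and table names below are placeholders, and the template paths are the public gs://dataflow-templates locations:

```shell
# Stream new text files from the GCS bucket into a Pub/Sub topic
# (template: "Text Files on Cloud Storage to Pub/Sub (Stream)")
gcloud dataflow jobs run gcs-to-pubsub \
  --region us-central1 \
  --gcs-location gs://dataflow-templates/latest/Stream_GCS_Text_to_Cloud_PubSub \
  --parameters inputFilePattern=gs://my-staging-mirror/*.csv,outputTopic=projects/my-project/topics/staging-files

# Stream messages from that topic into BigQuery
# (template: "Pub/Sub Topic to BigQuery")
gcloud dataflow jobs run pubsub-to-bigquery \
  --region us-central1 \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --parameters inputTopic=projects/my-project/topics/staging-files,outputTableSpec=my-project:analytics.staging_events
```

Note that both commands launch long-running streaming jobs, which keep running (and billing) until they are explicitly drained or cancelled.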
We actually started out with option 1. It turns out that option 1 is preferable if you have large files arriving in batches, as the Cloud Dataflow jobs only run when a file appears, which saves on execution cost. However, this option is not really suitable for a streaming dataset: the client's dataset consists of many small files in S3/GCS (some files have only 6 rows of data each) arriving at regular intervals (we're dealing with around 8,000 files every day, with files arriving every 10 minutes).

Because Apache Beam (the open source framework used by Cloud Dataflow) is built to run on distributed processing back-ends such as Apache Spark as well as Cloud Dataflow, every job runs the equivalent of a "map" task (as per MapReduce), where chunks of data are allocated to worker nodes for processing. This task runs whether the dataset contains six rows or six million. On top of that, worker nodes are spun up and spun down automatically for every job. As a result, we observed that even a simple Cloud Dataflow job would run for three to four minutes, regardless of whether the dataset was tiny or much larger.
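Some back-of-the-envelope arithmetic, using the volumes quoted above and our observed three-to-four-minute fixed job cost, shows why option 1 breaks down:

```python
FILES_PER_DAY = 8000             # observed arrival volume
WINDOWS_PER_DAY = 24 * 60 // 10  # files arrive every 10 minutes -> 144 windows/day
JOB_OVERHEAD_MIN = 3.5           # observed fixed spin-up/spin-down cost per batch job

# Option 1 launches one batch Dataflow job per file, paying the fixed
# worker start-up cost every time, regardless of file size.
per_file_overhead_hours = FILES_PER_DAY * JOB_OVERHEAD_MIN / 60

# Even batching files into one job per 10-minute window still pays the
# fixed cost 144 times a day, and every six-row file waits minutes
# before landing in BigQuery.
per_window_overhead_hours = WINDOWS_PER_DAY * JOB_OVERHEAD_MIN / 60

print(f"one job per file:   {per_file_overhead_hours:.0f} hours of job time/day")
print(f"one job per window: {per_window_overhead_hours:.1f} hours of overhead/day")
```

At roughly 467 hours of aggregate job time per day for per-file jobs (and still about 8.4 hours of pure overhead even when batched per window), the fixed cost dwarfs the actual work; a single long-running streaming job pays the start-up cost once.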
After a few helpful conversations with the GCP Customer Engineering team, we decided to adopt a streaming approach (option 2). A streaming approach lets us run long-lived Cloud Dataflow jobs to ingest the data, and it reduces the latency in the pipeline so that data arrives in BigQuery sooner. It also allows the client to extend the scope of the proof of concept: rather than just looking at BigQuery performance, they can also evaluate the streaming capabilities of Cloud Pub/Sub (as a potential Kinesis replacement) and Cloud Dataflow (as a potential Apache Storm replacement).
While Google Cloud is the "new kid on the block", in my recent experience there is a compelling reason for organisations to give it a shot, especially if what they want is a serverless big data offering: a real-time, streaming ETL pipeline into an infinitely scalable Cloud Data Warehouse.
Where to next?
This pipeline, being a proof of concept, was definitely sub-optimal, because the data integration between AWS and GCP was achieved through file drops between S3 and GCS. If the client decides to pursue this further, the "replace everything" option would be to completely replace Kinesis with Cloud Pub/Sub, and Apache Storm with Cloud Dataflow. However, this would probably be quite a costly endeavour, as the Apache Storm implementation contains business logic that may or may not be easy to migrate.
Another option would be to modify the Apache Storm data pipeline to push data into Google Cloud Pub/Sub directly using the Google Cloud APIs. This should make the pipeline more performant and closer to real time, without requiring a huge tooling migration.
Cloud is no longer just about provisioning on-demand servers and networking. Cloud providers now compete on their various platform-as-a-service products, rather than just on price and reliability for their infrastructure-as-a-service offerings. This is a big win for consumers like you and me: with architectures like the one I've shown above, we can choose best-of-breed solutions and are no longer locked in to a single cloud provider. This forces the cloud providers to constantly innovate to provide us with better products, and we, the consumers, are the ultimate winners.