Friday, March 28, 2014

Input → Transform → Output → Done!

We have published a sample App Engine application to help you move your data from one place in the cloud to another, transforming it along the way. The Data Pipeline application includes samples to get you started quickly and produce powerful pipelines right out of the gate. It also has a simple API for extending its functionality.



Data Pipeline is a Python application that uses the Google App Engine Pipeline API to control complex data processing pipelines. Pipelines are built from stages that can be wired together to process large amounts of data, with work going on in parallel. The application comes with several sample stages that use many of the Cloud Platform services. You can easily write new stages to perform custom data processing.
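To make the idea of stages wired together concrete, here is a minimal sketch in plain Python. It does not use the App Engine Pipeline API or the Data Pipeline app's own classes; the `Stage` base class, the two example stages, and `run_pipeline` are hypothetical names invented for illustration, and a thread pool stands in for the parallel execution the real service provides.

```python
from concurrent.futures import ThreadPoolExecutor


class Stage:
    """Hypothetical base class: a stage turns one input item into one output item."""

    def run(self, item):
        raise NotImplementedError


class Uppercase(Stage):
    """Example transform stage (illustrative only)."""

    def run(self, item):
        return item.upper()


class AddLength(Stage):
    """Example transform stage that pairs each item with its length."""

    def run(self, item):
        return (item, len(item))


def run_pipeline(stages, items, workers=4):
    """Push every item through all stages in order; items are handled in parallel."""

    def process(item):
        for stage in stages:
            item = stage.run(item)
        return item

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map() returns results in input order even though work runs concurrently.
        return list(pool.map(process, items))


results = run_pipeline([Uppercase(), AddLength()], ["alpha", "be", "gamma"])
```

Each stage only knows about its own input and output, which is what lets you reorder stages or swap one out without touching the rest.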



The Data Pipeline app comes with built-in functionality that lets you read data from:


  • URLs via HTTP

  • Google Cloud Datastore

  • Google Cloud Storage


transform it on:


and output it to:


  • BigQuery

  • Google Cloud Storage


For example, one of the pre-built dataflows takes a file from a Cloud Storage bucket, transforms it using a MapReduce job on Hadoop running on Compute Engine, and uploads the output file to BigQuery. To kick off the process, simply drop the file into Cloud Storage.



We hope that you will not only use the built-in transformations, but will also create custom stages to transform data in whatever way you need. You can customize the pipelines easily by extending the Python API, which is available on GitHub.



You can also customize the input and output; for example, you could customize the output to write to Google Cloud SQL.
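As a rough sketch of what a custom output stage writing to a SQL database might look like, the example below uses Python's built-in `sqlite3` module as a local stand-in for Google Cloud SQL. The `SqlOutputStage` class, its `run` method, and the table layout are assumptions made for illustration; they are not the Data Pipeline app's actual extension API.

```python
import sqlite3


class SqlOutputStage:
    """Hypothetical output stage that writes (name, value) rows to a SQL table."""

    def __init__(self, connection, table):
        self.connection = connection
        self.table = table

    def run(self, rows):
        # The table name is interpolated into the SQL text, so it must come
        # from trusted configuration only; row values use bound parameters.
        self.connection.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} (name TEXT, value INTEGER)"
        )
        self.connection.executemany(
            f"INSERT INTO {self.table} (name, value) VALUES (?, ?)", rows
        )
        self.connection.commit()
        return len(rows)


# An in-memory SQLite database stands in for a Cloud SQL connection here.
conn = sqlite3.connect(":memory:")
stage = SqlOutputStage(conn, "results")
written = stage.run([("a", 1), ("b", 2)])
```

Passing the connection in from outside keeps the stage itself independent of where the database lives, so a real database connection could replace the in-memory one.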



You create and edit pipelines in a JSON configuration file in the application’s UI. The app checks that the configuration is syntactically correct and that each stage’s preconditions are met. After you save the config file, click the Run button to start the pipeline execution. You’ll see the progress of the running pipeline in a new window.
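The two checks described above, syntax first and then per-stage preconditions, can be sketched roughly as follows. The config schema (a `stages` list whose entries have a `type`), the stage type names, and the required options are all hypothetical; the app's real configuration format may differ.

```python
import json

# Hypothetical stage registry: stage type -> option keys that stage requires.
REQUIRED_OPTIONS = {
    "GcsInput": {"bucket", "object"},
    "BigQueryOutput": {"dataset", "table"},
}


def validate_config(text):
    """Return a list of problems; an empty list means the config looks valid."""
    # Step 1: the configuration must be syntactically valid JSON.
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} at line {err.lineno}"]

    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    stages = config.get("stages")
    if not isinstance(stages, list) or not stages:
        return ["config must contain a non-empty 'stages' list"]

    # Step 2: each stage must meet its own preconditions.
    problems = []
    for index, stage in enumerate(stages):
        if not isinstance(stage, dict):
            problems.append(f"stage {index}: must be an object")
            continue
        stage_type = stage.get("type")
        if stage_type not in REQUIRED_OPTIONS:
            problems.append(f"stage {index}: unknown type {stage_type!r}")
            continue
        missing = REQUIRED_OPTIONS[stage_type] - set(stage)
        if missing:
            problems.append(f"stage {index}: missing {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a UI show every issue at once before the user clicks Run.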




Editing the config file

The source code is available on GitHub. We invite you to download it and set up your pipelines today.



- Posted by Alex K, Cloud Solutions Engineer
