Data Pipeline is a Python application that uses the Google App Engine Pipeline API to control complex data processing pipelines. Pipelines are built from stages that can be wired together to process large amounts of data in parallel. The application ships with several sample stages that use many of the Cloud Platform services, and you can easily write new stages to perform custom data processing.
The Data Pipeline app comes with built-in functionality that lets you read data from:
- URLs via HTTP
- Google Cloud Datastore
- Google Cloud Storage
transform it on:
- Google App Engine using the Google App Engine Pipeline API
- Google Compute Engine using Apache Hadoop
and output it to:
- BigQuery
- Google Cloud Storage
For example, one of the pre-built dataflows takes a file from a Cloud Storage bucket, transforms it with a MapReduce job on Hadoop running on Compute Engine, and loads the output into BigQuery. To kick off the process, simply drop the file into Cloud Storage.
We hope that you will not only use the built-in transformations, but also create custom stages to transform data in whatever way you need. You can customize the pipelines easily by extending the Python API, which is available on GitHub.
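As a rough sketch of what a custom stage might look like, the class below follows the App Engine Pipeline API convention of subclassing `pipeline.Pipeline` and implementing `run()`; the class name and the transform itself are invented for illustration, not taken from the app's codebase.

```python
# A minimal sketch of a custom transform stage, assuming stages follow
# the App Engine Pipeline API pattern of subclassing pipeline.Pipeline.
# The class name and the uppercasing transform are placeholders.
import pipeline


class UppercaseStage(pipeline.Pipeline):
  """Toy stage that uppercases every line of its input."""

  def run(self, lines):
    # A stage hands its result to downstream stages via its return
    # value; more complex stages can yield child pipelines to fan
    # work out in parallel.
    return [line.upper() for line in lines]
```

Because stages are ordinary Python classes, they can be unit tested and composed like any other step in a pipeline.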
You can also customize the inputs and outputs; for example, you could add an output stage that writes to Google Cloud SQL, as sketched below.
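To make the Cloud SQL idea concrete, here is a hedged sketch of such an output stage. The instance, database, and table names are placeholders, and the stage shape again assumes the App Engine Pipeline API; it uses the MySQLdb unix-socket connection pattern that App Engine supports for Cloud SQL.

```python
# A sketch of a custom output stage that writes rows to Google Cloud
# SQL. The instance, database, and table names are placeholders.
import MySQLdb
import pipeline


class CloudSqlOutput(pipeline.Pipeline):
  """Writes (name, value) rows to a Cloud SQL table."""

  def run(self, rows):
    # On App Engine, Cloud SQL is reached through a unix socket named
    # after the instance: /cloudsql/<project>:<instance>.
    db = MySQLdb.connect(unix_socket='/cloudsql/my-project:my-instance',
                         db='pipeline_output', user='root')
    try:
      cursor = db.cursor()
      cursor.executemany(
          'INSERT INTO results (name, value) VALUES (%s, %s)', rows)
      db.commit()
    finally:
      db.close()
```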
You create and edit pipelines in a JSON configuration file in the application's UI. The app checks that the configuration is syntactically correct and that each stage's preconditions are met. After you save the configuration, click the Run button to start the pipeline execution. You'll see the progress of the running pipeline in a new window.
[Screenshot: Editing the config file]
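For illustration only, a configuration for the Cloud Storage to Hadoop to BigQuery flow described above might be sketched like this; the stage names, fields, and options here are hypothetical placeholders, not the app's actual configuration vocabulary:

```json
{
  "stages": [
    {
      "name": "read-input",
      "type": "GcsInput",
      "bucket": "my-input-bucket",
      "object": "logs.csv"
    },
    {
      "name": "transform",
      "type": "HadoopMapReduce",
      "mapper": "wordcount_mapper.py",
      "reducer": "wordcount_reducer.py"
    },
    {
      "name": "load",
      "type": "BigQueryOutput",
      "dataset": "analytics",
      "table": "wordcounts"
    }
  ]
}
```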
The source code is checked into GitHub. We invite you to download it and set up your pipelines today.
- Posted by Alex K, Cloud Solutions Engineer