As data pipelines become more complex and involve multiple team members, it can be challenging to keep track of changes, collaborate effectively, and deploy pipelines to different environments in a controlled manner.
The new pipeline edit feature for Cloud Data Fusion (CDF) batch pipelines addresses these challenges.
Better pipelines with versioning
A typical pipeline development process is iterative. You make small changes to a pipeline, test them on small data, then on production data, and then add features step by step. Iterative design also matters to the Cloud Data Fusion user because it reduces the overhead of developing and testing pipelines. An ETL developer can add improvements incrementally while keeping a full history of changes.
You can edit pipelines starting in Cloud Data Fusion version 6.9. When you edit a pipeline you've already deployed, you don't have to duplicate the pipeline and implement a versioning strategy across multiple pipelines. Instead, you edit a single pipeline and the versions are tracked for you. This improves productivity and removes the need to maintain a mapping between the various clones of a pipeline.
Benefits of pipeline editing
The pipeline edit feature lets you do the following:
Incrementally make changes to any part of the deployed pipeline, such as the pipeline structure, configuration, metadata, preferences, and comments.
Export an edited JSON file for a deployed pipeline.
How is it different from the CDF duplicate pipeline feature?
Duplicating a pipeline creates a new pipeline with a different name, while editing a pipeline creates a new version of the same pipeline. Editing therefore prevents the proliferation of pipelines (as seen in the figure below) and allows for better organization.
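The difference can be illustrated with a toy in-memory model. This is only a sketch: `PipelineStore`, its methods, and the pipeline names are hypothetical and are not part of Cloud Data Fusion.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PipelineStore:
    """Toy model (hypothetical): pipeline name -> list of deployed versions."""

    pipelines: dict[str, list[dict]] = field(default_factory=dict)

    def deploy(self, name: str, config: dict) -> None:
        # First deployment of a brand-new pipeline.
        self.pipelines[name] = [config]

    def duplicate(self, name: str, new_name: str) -> None:
        # Duplicating copies the latest config under a different name.
        self.pipelines[new_name] = [dict(self.pipelines[name][-1])]

    def edit(self, name: str, changes: dict) -> None:
        # Editing appends a new version to the same pipeline.
        latest = self.pipelines[name][-1]
        self.pipelines[name].append({**latest, **changes})


store = PipelineStore()
store.deploy("orders_etl", {"sink": "bq_staging"})
store.edit("orders_etl", {"sink": "bq_prod"})      # same pipeline, new version
store.duplicate("orders_etl", "orders_etl_copy")   # a second pipeline to track
```

In this model, every edit grows the version list of one pipeline, whereas every duplicate adds another name that has to be tracked separately.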
Before you begin
You need a Cloud Data Fusion instance with version 6.9.1 or above.
Upgrading to 6.9.1 or above also unlocks Source Control Management with GitHub, which is covered in a separate blog post.
NOTE: The pipeline edit feature is supported only for CDF batch pipelines.
How to use this feature
When you edit a pipeline, CDF creates a new draft. Once deployed, the draft becomes the latest version of the pipeline. (On upgraded instances, existing pipelines are upgraded to become the latest version.) The latest version retains the triggers, pipeline configurations, runtime arguments, metadata, comments, and schedules from the previous version. The latest version is the active version of the pipeline, that is, the one that can be run or scheduled to run.
To edit a deployed pipeline, follow these steps:
Go to the pipeline that you want to edit and click Edit. You can do this in the UI from either the pipeline studio page or the pipeline list page:
Edit through the pipeline studio page
Edit through the pipeline list page
A new draft of the pipeline is created. Edit your pipeline and make the necessary changes. Optional: To finish editing the pipeline later, click Save. Draft statuses are displayed to mitigate concurrency issues (discussed further below).
Edit Draft opens for changes
“In-Progress” editing status for the edit draft that is yet to be deployed.
Note: You must make changes to your pipeline draft before you can deploy it; otherwise, an error message is displayed.
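The draft lifecycle described above can be sketched as follows. This is a simplified model under stated assumptions, not CDF's implementation: `open_draft`, `deploy_draft`, and the dictionary keys are hypothetical names chosen for illustration.

```python
import copy


def open_draft(versions: list) -> dict:
    # A draft starts as a copy of the latest version, so schedules, triggers,
    # runtime arguments, and similar settings carry over unless edited.
    return copy.deepcopy(versions[-1])


def deploy_draft(versions: list, draft: dict) -> int:
    # Reject drafts that are identical to the latest deployed version.
    if draft == versions[-1]:
        raise ValueError("draft has no changes to deploy")
    versions.append(draft)
    return len(versions)  # number of the newly active version


versions = [
    {"stages": ["read", "write"], "schedule": "daily", "runtime_args": {"env": "dev"}}
]
draft = open_draft(versions)
draft["stages"].insert(1, "dedupe")      # the only change made in this draft
active = deploy_draft(versions, draft)   # the draft becomes the latest version
```

Because the draft is a deep copy, the previous version stays untouched in the history while the new version inherits every setting that was not edited.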
View version history
The pipeline studio page now has a History button, which displays a list of edit versions and gives you access to previous versions of the pipeline. You can only view or restore an older version. Older versions are identified by their creation date and change summary.
You can go back to the latest version through the return to latest version link.
Export older edit version
To view or modify the JSON of an older pipeline version, you can export it locally. The edited JSON can be imported back into the pipeline edit draft.
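Once two versions have been exported, they can be compared locally before you import anything back. The sketch below uses Python's standard `json` and `difflib` modules; the dictionaries stand in for the parsed contents of two exported files, and their keys are illustrative rather than the exact export schema.

```python
import difflib
import json


def diff_pipeline_json(old: dict, new: dict) -> list:
    """Return a unified diff between two parsed pipeline JSON documents."""
    # Sorting keys and using a fixed indent keeps the diff stable and readable.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="older", tofile="latest", lineterm=""
        )
    )


# Illustrative stand-ins for two exported versions of the same pipeline.
older = {"name": "orders_etl", "config": {"stages": [{"name": "read"}]}}
latest = {
    "name": "orders_etl",
    "config": {"stages": [{"name": "read"}, {"name": "write"}]},
}

diff = diff_pipeline_json(older, latest)
print("\n".join(diff))
```

In practice you would load each exported file with `json.load` and pass the results to `diff_pipeline_json`.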
An orphaned edit draft
When a pipeline is deleted, all deployed versions of the pipeline are removed other than the ones that are open in draft status. The draft pipeline enters an orphaned status, since the associated pipeline is removed and the draft no longer belongs to an existing pipeline. Deploying the draft will deploy a brand new pipeline and resolve the orphaned status.
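A minimal sketch of how the orphaned status arises and is resolved, assuming a simple in-memory model. `Workspace` and its methods are hypothetical and only mirror the behavior described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy model (hypothetical) of deployed pipelines and open edit drafts."""

    deployed: dict[str, list[dict]] = field(default_factory=dict)  # name -> versions
    drafts: dict[str, dict] = field(default_factory=dict)          # name -> open draft

    def delete_pipeline(self, name: str) -> None:
        # Deleting removes every deployed version but leaves open drafts alone.
        self.deployed.pop(name, None)

    def is_orphaned(self, name: str) -> bool:
        # A draft is orphaned when its parent pipeline no longer exists.
        return name in self.drafts and name not in self.deployed

    def deploy_draft(self, name: str) -> None:
        # Deploying an orphaned draft creates a brand-new pipeline.
        draft = self.drafts.pop(name)
        self.deployed.setdefault(name, []).append(draft)


ws = Workspace(
    deployed={"orders_etl": [{"sink": "bq_staging"}]},
    drafts={"orders_etl": {"sink": "bq_prod"}},
)
ws.delete_pipeline("orders_etl")
orphaned_after_delete = ws.is_orphaned("orders_etl")
ws.deploy_draft("orders_etl")
```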
An obsolete edit draft
When a newer version of the pipeline that you are currently editing becomes available, your changes are out of date. This happens when another user deploys the pipeline before you finish editing. The draft then enters the out of date/obsolete status.
Deployment is blocked, and an error message prompts you to manually reconcile your changes.
To manually reconcile your pipeline, click Export and Rebase in the prompt. This exports your current JSON draft locally and rebases the studio to the latest version, which resolves the out of date/obsolete status. The recommended approach is then to resolve the conflicts manually and import your changes back into the draft.
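The obsolete check and the export-and-rebase flow resemble optimistic concurrency control. The sketch below models that idea under the assumption that each draft remembers the version it was created from. The class, method names, and config keys are hypothetical, and this is not a description of how CDF implements the check internally.

```python
from __future__ import annotations

import copy


class ObsoleteDraftError(Exception):
    """Raised when a draft is based on a version that is no longer the latest."""


class Pipeline:
    def __init__(self, config: dict) -> None:
        self.versions = [config]  # deployed versions, oldest first

    def open_draft(self) -> dict:
        # Remember which version the draft was created from.
        return {"base": len(self.versions), "config": copy.deepcopy(self.versions[-1])}

    def deploy(self, draft: dict) -> None:
        if draft["base"] != len(self.versions):
            raise ObsoleteDraftError("a newer version was deployed; reconcile first")
        self.versions.append(draft["config"])

    def export_and_rebase(self, draft: dict) -> tuple[dict, dict]:
        # Return the stale draft's config (the "export") and a fresh draft
        # based on the latest version (the "rebase").
        return draft["config"], self.open_draft()


pipeline = Pipeline({"source": "gcs", "sink": "bq_staging"})
alice, bob = pipeline.open_draft(), pipeline.open_draft()

alice["config"]["sink"] = "bq_prod"
pipeline.deploy(alice)  # Alice deploys first, so Bob's draft is now obsolete.

bob["config"]["source"] = "cloud_sql"
try:
    pipeline.deploy(bob)
except ObsoleteDraftError:
    exported, bob = pipeline.export_and_rebase(bob)
    # Manual reconciliation: re-apply only Bob's intended change.
    bob["config"]["source"] = exported["source"]
    pipeline.deploy(bob)
```

After reconciliation, the latest version contains both Alice's sink change and Bob's source change, which is the outcome the manual merge is meant to achieve.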
Learn more
Along with iterative development, use the source control management feature to allow for team-based collaboration.
More on CDF
More documentation on the feature