Dev Diary – Orchestrating Mapping Data Flows
Written by Roelant Vos on 9.7.2021
Earlier dev diary posts have focused on defining and deploying a Mapping Data Flow using Biml, for example by covering the Biml syntax and adding initial transformations and parameters.
Additional complexity will be added in upcoming posts to ultimately cover a fully functioning set of patterns.
These Mapping Data Flows can be generated from BimlFlex by selecting the ‘Azure Data Flow’ Integration Template for a Project. The Project defines which data sets are in scope for a specific deliverable or purpose, and the Integration Template directs the automation engine as to what output is required, which is Mapping Data Flows in this case.
The settings, overrides and configurations govern how the resulting output will be shaped and what features are included in the resulting Mapping Data Flow.
One of these features covers the orchestration.
In Azure, a Mapping Data Flow itself is not an object that can be executed directly. Instead, it needs to be called from a pipeline through an Execute Data Flow activity. This pipeline can be run, and in turn it will start the data flow.
This means that a corresponding pipeline will need to be created as part of any Mapping Data Flow, and this is exactly what BimlFlex does out of the box. In addition to this, BimlFlex can use this pipeline to integrate the data logistics into the control framework – the BimlCatalog database. This database captures the runtime details on process execution, so that there is a record of each execution result as well as statistics on volumes processed and more.
An example of the generated output is shown in the screenshot below:
The BimlCatalog will also issue a unique execution runtime point that is passed down from the pipeline to the Mapping Data Flow where it can be added to the data set. This means that a full audit trail can be established; it is always known which process was responsible for any data results throughout the solution.
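To make the wrapper idea concrete, here is a minimal sketch in Python of what such a pipeline definition could look like: a pipeline containing a single activity of type ExecuteDataFlow that forwards an execution identifier to the data flow as a parameter. This is not the output BimlFlex generates. The function name, the `PL_` and `Run_` prefixes and the `ExecutionID` parameter name are assumptions made for illustration, and the property layout should be checked against the Azure Data Factory documentation before relying on it.

```python
import json


def build_wrapper_pipeline(data_flow_name: str,
                           execution_id_param: str = "ExecutionID") -> dict:
    """Return an ADF-style pipeline definition that runs one Mapping Data Flow.

    The naming conventions and the parameter name are hypothetical; they are
    not taken from BimlFlex.
    """
    return {
        "name": f"PL_{data_flow_name}",
        "properties": {
            # The pipeline receives the execution identifier from its caller.
            "parameters": {execution_id_param: {"type": "int"}},
            "activities": [
                {
                    "name": f"Run_{data_flow_name}",
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataflow": {
                            "referenceName": data_flow_name,
                            "type": "DataFlowReference",
                            # Pass the identifier down so the data flow can
                            # add it to the data set for auditing.
                            "parameters": {
                                execution_id_param: {
                                    "value": f"@pipeline().parameters.{execution_id_param}",
                                    "type": "Expression",
                                }
                            },
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical data flow name, used only to show the resulting JSON.
    print(json.dumps(build_wrapper_pipeline("DF_Load_Customer"), indent=2))
```

The control framework calls themselves (registering the start and end of the run in the BimlCatalog) are left out of this sketch, since their exact shape is not described here.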
But BimlFlex will generate more than just the pipeline wrapper for the Mapping Data Flow and integration into the control framework. It will also generate a batch structure in case large amounts of related processes must be executed at the same time – either sequentially or in parallel.
To manage potentially large numbers of pipelines and Mapping Data Flows, BimlFlex automatically generates a ‘sub-batch’ middle layer to manage any limitations in Azure. For example, at the time of writing Azure does not allow more than 40 activities to be defined as part of a single pipeline.
BimlFlex will generate as many sub-batches as necessary to manage limits such as these, and each process will start the next one based on the degree of parallelism defined in metadata.
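The sub-batch idea can be sketched in a few lines of Python. This is a rough illustration only, not how BimlFlex implements it. The function names are hypothetical, and the dependency rule used here (each process waits for the one that sits `degree_of_parallelism` positions before it) is just one assumed way of letting each process start the next.

```python
from typing import Dict, List, Optional

# Limit quoted in the post; treat it as a configurable value.
MAX_ACTIVITIES_PER_PIPELINE = 40


def split_into_sub_batches(process_names: List[str],
                           max_size: int = MAX_ACTIVITIES_PER_PIPELINE) -> List[List[str]]:
    """Split the processes into consecutive sub-batches of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [process_names[i:i + max_size]
            for i in range(0, len(process_names), max_size)]


def chain_dependencies(process_names: List[str],
                       degree_of_parallelism: int) -> Dict[str, Optional[str]]:
    """Map each process to the process it waits for, or None if it starts at once.

    The first degree_of_parallelism processes start immediately; every later
    process waits for the one that many positions earlier, which forms that
    many sequential chains.
    """
    if degree_of_parallelism < 1:
        raise ValueError("degree_of_parallelism must be at least 1")
    return {
        name: (process_names[i - degree_of_parallelism]
               if i >= degree_of_parallelism else None)
        for i, name in enumerate(process_names)
    }


if __name__ == "__main__":
    processes = [f"PL_{n:03d}" for n in range(95)]  # hypothetical names
    sub_batches = split_into_sub_batches(processes)
    print([len(batch) for batch in sub_batches])
    print(chain_dependencies(sub_batches[0][:4], degree_of_parallelism=2))
```

With 95 processes and a limit of 40, this yields three sub-batches of 40, 40 and 15 processes. A degree of parallelism of 1 produces a fully sequential chain, while a value equal to the sub-batch size starts everything at once.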
The next post will cover how to start generating this using Biml.