2. ETL Pipeline Creation
In this section we create our ETL Pipeline. Once it’s created, we will discuss each part of the ETL Pipeline schema in detail.
Note
This schema is built using data from resources that were created during the earlier sections of this User Guide.
2.1 Create the ETL Pipeline
Create an ETL Pipeline (createETLPipeline)
Save the following ETL pipeline definition to a file named etl-pipeline.json in your current working directory.
{
  "id": "etl-userguide-pipeline-<user_id>",
  "before": null,
  "remote_outbox": {
    "data": {
      "system_id": "etl.userguide.systema.<user_id>",
      "path": "/ETL/REMOTE-OUTBOX/DATA"
    },
    "manifests": {
      "system_id": "etl.userguide.systemb.<user_id>",
      "generation_policy": "auto_one_per_file",
      "path": "/ETL/REMOTE-OUTBOX/MANIFESTS"
    }
  },
  "local_inbox": {
    "control": {
      "system_id": "etl.userguide.systemb.<user_id>",
      "path": "/ETL/LOCAL-INBOX/CONTROL"
    },
    "data": {
      "system_id": "etl.userguide.systema.<user_id>",
      "path": "/ETL/LOCAL-INBOX/DATA"
    },
    "manifests": {
      "system_id": "etl.userguide.systemb.<user_id>",
      "path": "/ETL/LOCAL-INBOX/MANIFESTS"
    }
  },
  "jobs": [
    {
      "name": "sentiment-analysis",
      "appId": "etl-sentiment-analysis-test",
      "appVersion": "dev",
      "nodeCount": 1,
      "coresPerNode": 1,
      "maxMinutes": 10,
      "execSystemId": "test.etl.ls6.local.inbox",
      "execSystemInputDir": "${JobWorkingDir}/jobs/${JobUUID}/input",
      "archiveSystemId": "test.etl.ls6.local.inbox",
      "archiveSystemDir": "ETL/LOCAL-OUTBOX/DATA",
      "parameterSet": {
        "schedulerOptions": [
          {
            "name": "allocation",
            "arg": "-A TACC-ACI"
          },
          {
            "name": "profile",
            "arg": "--tapis-profile tacc-apptainer"
          }
        ],
        "containerArgs": [
          {
            "name": "input-mount",
            "arg": "--bind $(pwd)/input:/src/input:ro,$(pwd)/output:/src/output:rw"
          }
        ],
        "archiveFilter": {
          "includes": [],
          "excludes": ["tapisjob.out"],
          "includeLaunchFiles": false
        }
      }
    }
  ],
  "local_outbox": {
    "data": {
      "system_id": "etl.userguide.systema.<user_id>",
      "path": "/ETL/LOCAL-OUTBOX/DATA"
    },
    "manifests": {
      "system_id": "etl.userguide.systemb.<user_id>",
      "path": "/ETL/LOCAL-OUTBOX/MANIFESTS"
    }
  },
  "remote_inbox": {
    "data": {
      "system_id": "etl.userguide.systema.<user_id>",
      "path": "/ETL/REMOTE-INBOX/DATA"
    },
    "manifests": {
      "system_id": "etl.userguide.systemb.<user_id>",
      "path": "/ETL/REMOTE-INBOX/MANIFESTS"
    }
  },
  "after": null
}
Then submit the definition.
import json

# `t` is assumed to be an authenticated tapipy Tapis client
with open('etl-pipeline.json', 'r') as file:
    pipeline = json.load(file)

t.workflows.createETLPipeline(group_id=<group_id>, **pipeline)
curl -X POST -H "content-type: application/json" -H "X-Tapis-Token: $JWT" https://tacc.tapis.io/v3/workflows/beta/groups/<group_id>/etl -d @etl-pipeline.json
Once created, you can fetch and run the pipeline.
t.workflows.getPipeline(group_id=<group_id>, pipeline_id=<pipeline_id>)
curl -H "content-type: application/json" -H "X-Tapis-Token: $JWT" https://tacc.tapis.io/v3/workflows/groups/<group_id>/pipeline/<pipeline_id>
2.2 Remote Outbox
The Remote Outbox is the ETL System Configuration that tells Tapis ETL where the data files to be processed are staged. Any data files placed in the data directory on the Data System (after applying the include and exclude pattern filters) will be processed during runs of an ETL Job.
How many are processed, and in what order, is determined by the manifests generated for those data files. Consider the schema below, specifically the Manifest Configuration.
{
  "data": {
    "system_id": "etl.userguide.systema.<user_id>",
    "path": "/ETL/REMOTE-OUTBOX/DATA"
  },
  "manifests": {
    "system_id": "etl.userguide.systemb.<user_id>",
    "generation_policy": "auto_one_per_file",
    "path": "/ETL/REMOTE-OUTBOX/MANIFESTS"
  }
}
This Manifest Configuration tells Tapis ETL that the user wants Tapis ETL to handle manifest generation. According to the manifest generation policy of auto_one_per_file, Tapis ETL will generate one manifest for each data file that is not currently being tracked in another manifest.
Warning
Using automatic manifest generation policies without specifying a data integrity profile can break a pipeline.
The preferred manifest generation policy for the Remote Outbox is manual. Instructions for generating manifests will come in the following sections. For more information on other possible configurations, see the Manifests Configuration section.
Tapis ETL will perform this operation for every untracked data file on every run of the pipeline. Whether this step runs or not has no effect on the actual data processing phase of the pipeline run (the phase where the ETL Jobs are run). If there are no data files to be tracked, Tapis ETL will simply move on to the next step: looking for an unprocessed manifest with a status of pending and submitting it for processing.
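For example, staging a file for processing is just a matter of placing it in the Remote Outbox's data directory. A minimal sketch using tapipy's upload helper, assuming `t` is the same authenticated Tapis client used above (the local file name is illustrative):

# Upload a local file into the Remote Outbox data directory. On the next
# pipeline run, Tapis ETL will generate a manifest for it (per the
# auto_one_per_file policy) and queue it for processing.
t.upload(
    system_id='etl.userguide.systema.<user_id>',
    source_file_path='data1.txt',  # local file to stage (illustrative)
    dest_file_path='/ETL/REMOTE-OUTBOX/DATA/data1.txt'
)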
2.3 Local Inbox
Tapis ETL will transfer all of the data files from the Remote Outbox to the data directory of the Local Inbox for processing.
Note
When configuring your Local Inbox, consider that your ETL Jobs should run on a system that shares a file system with the Local Inbox. The data path in the Local Inbox's Data Configuration should be accessible to the first ETL Job.
Notice that this ETL System Configuration has an additional property, control. This is simply a location to which Tapis ETL writes accounting files to ensure the pipeline runs as expected. It is recommended that you use the same system as the Local Inbox's Manifests System; however, any system to which Tapis ETL can write files would work.
{
  "control": {
    "system_id": "etl.userguide.systemb.<user_id>",
    "path": "/ETL/LOCAL-INBOX/CONTROL"
  },
  "data": {
    "system_id": "etl.userguide.systema.<user_id>",
    "path": "/ETL/LOCAL-INBOX/DATA"
  },
  "manifests": {
    "system_id": "etl.userguide.systemb.<user_id>",
    "path": "/ETL/LOCAL-INBOX/MANIFESTS"
  }
}
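If a pipeline run ever behaves unexpectedly, the control directory is a reasonable place to start looking. A sketch of listing its contents with tapipy, assuming the same authenticated client `t`:

# List the accounting files Tapis ETL has written to the control directory
for f in t.files.listFiles(
        systemId='etl.userguide.systemb.<user_id>',
        path='/ETL/LOCAL-INBOX/CONTROL'):
    print(f.name)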
2.4 ETL Jobs
ETL Jobs are an ordered list of Tapis Job definitions that are dispatched and run serially during a pipeline's transform phase. Tapis ETL dispatches these Tapis Jobs and uses the data files in a manifest as the inputs for each job. Once all of the ETL Jobs complete, the transform phase ends.
[
  {
    "name": "sentiment-analysis",
    "appId": "etl-sentiment-analysis-test",
    "appVersion": "dev",
    "nodeCount": 1,
    "coresPerNode": 1,
    "maxMinutes": 10,
    "execSystemId": "test.etl.ls6.local.inbox",
    "execSystemInputDir": "${JobWorkingDir}/jobs/${JobUUID}/input",
    "archiveSystemId": "test.etl.ls6.local.inbox",
    "archiveSystemDir": "ETL/LOCAL-OUTBOX/DATA",
    "parameterSet": {
      "schedulerOptions": [
        {
          "name": "allocation",
          "arg": "-A TACC-ACI"
        },
        {
          "name": "profile",
          "arg": "--tapis-profile tacc-apptainer"
        }
      ],
      "containerArgs": [
        {
          "name": "input-mount",
          "arg": "--bind $(pwd)/input:/src/input:ro,$(pwd)/output:/src/output:rw"
        }
      ],
      "archiveFilter": {
        "includes": [],
        "excludes": ["tapisjob.out"],
        "includeLaunchFiles": false
      }
    }
  }
]
Every ETL Job is furnished with the following environment variables, which can be accessed at runtime by your Tapis Job.
- TAPIS_WORKFLOWS_TASK_ID - The ID of the Tapis Workflows task in the pipeline that is currently being executed
- TAPIS_WORKFLOWS_PIPELINE_ID - The ID of the Tapis Workflows Pipeline that is currently running
- TAPIS_WORKFLOWS_PIPELINE_RUN_UUID - A UUID given to this specific run of the pipeline
- TAPIS_ETL_HOST_DATA_INPUT_DIR - The directory that contains the data files for the initial ETL Jobs
- TAPIS_ETL_HOST_DATA_OUTPUT_DIR - The directory to which output data files should be persisted
- TAPIS_ETL_MANIFEST_FILENAME - The name of the file that contains the manifest
- TAPIS_ETL_MANIFEST_PATH - The full path (including the filename) to the manifest file
- TAPIS_ETL_MANIFEST_MIME_TYPE - The MIME type of the manifest file (always application/json)
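For instance, a Python application could pick up this context at startup. A minimal sketch; only the variable names listed above are provided by Tapis ETL:

import os

# Runtime context injected by Tapis ETL into every ETL Job
run_uuid = os.environ['TAPIS_WORKFLOWS_PIPELINE_RUN_UUID']
input_dir = os.environ['TAPIS_ETL_HOST_DATA_INPUT_DIR']
output_dir = os.environ['TAPIS_ETL_HOST_DATA_OUTPUT_DIR']
print(f'Run {run_uuid}: inputs in {input_dir}, outputs to {output_dir}')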
In addition to the environment variables, a fileInput for the manifest file is added to the job definition to ensure that it is available in the ETL Job's runtime.
In your application code, you can use the environment variables above to locate the manifest file and inspect its contents. This is useful if your application code does not know where to find its input data. The local_files array property of the manifest contains the list of input data files. The files are represented as objects in the local_files array and take the following form.
{
  "mimeType": null,
  "type": "file",
  "owner": "876245",
  "group": "815499",
  "nativePermissions": "rwxr-xr-x",
  "url": "tapis://etl.userguide.systemb.<user_id>/ETL/LOCAL-INBOX",
  "lastModified": "2024-02-26T22:51:31Z",
  "name": "data1.txt",
  "path": "ETL/LOCAL-INBOX/DATA/data1.txt",
  "size": 66
}
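Putting this together, a job's application code can discover its inputs entirely from the manifest. A minimal sketch, assuming a Python application and the local_files entries shown above (the copy is a placeholder for real processing):

import json
import os
import shutil

# Locate and load the manifest via the injected environment variables
with open(os.environ['TAPIS_ETL_MANIFEST_PATH']) as f:
    manifest = json.load(f)

input_dir = os.environ['TAPIS_ETL_HOST_DATA_INPUT_DIR']
output_dir = os.environ['TAPIS_ETL_HOST_DATA_OUTPUT_DIR']

# Process each tracked input file and persist results to the output directory
for entry in manifest['local_files']:
    src = os.path.join(input_dir, entry['name'])
    dst = os.path.join(output_dir, entry['name'])
    shutil.copyfile(src, dst)  # placeholder transform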
2.5 Local Outbox
Once the Transform Phase ends, all of the output data files should be in the data directory of the Local Outbox (note that the ETL Job's archiveSystemDir above, ETL/LOCAL-OUTBOX/DATA, points there). These files are then tracked in manifests to be transferred during the Egress Phase of the pipeline.
{
  "data": {
    "system_id": "etl.userguide.systema.<user_id>",
    "path": "/ETL/LOCAL-OUTBOX/DATA"
  },
  "manifests": {
    "system_id": "etl.userguide.systemb.<user_id>",
    "path": "/ETL/LOCAL-OUTBOX/MANIFESTS"
  }
}
2.6 Remote Inbox
During the Egress Phase, all of the output data produced by the ETL Jobs is transferred from the Local Outbox to the Remote Inbox. Once all of the data files in an Egress Manifest have been successfully transferred, Tapis ETL will transfer that manifest to the manifests directory of the Remote Inbox system to indicate that the file transfers are complete.
{
  "data": {
    "system_id": "etl.userguide.systema.<user_id>",
    "path": "/ETL/REMOTE-INBOX/DATA"
  },
  "manifests": {
    "system_id": "etl.userguide.systemb.<user_id>",
    "path": "/ETL/REMOTE-INBOX/MANIFESTS"
  }
}
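Because a manifest arrives only after its data files do, consumers on the remote system can treat a manifest's presence as a completion signal. A sketch of checking for completed transfers with tapipy, under the same assumptions as the earlier examples:

# Each manifest listed here corresponds to a set of output files whose
# transfer to the Remote Inbox has fully completed
for f in t.files.listFiles(
        systemId='etl.userguide.systemb.<user_id>',
        path='/ETL/REMOTE-INBOX/MANIFESTS'):
    print(f.name)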
Note
Section 2 Punch List
- Created an ETL Pipeline
- Can fetch the pipeline's details from Tapis Workflows