Skip to main content

ETL and workflows

One workflow has usually its own ETL, because a workflow has its own set of data, account, traces, etc. An ETL is linked to a workflow through the workflow_id. For example, on a workflow “Renewal”, we have an ETL called agre-renewal, which has its own data with filters, its own database name in metabase, and so on.

ETL is a Job

Below is a representation of an ETL
const etlJob: Job = {
  id: '643e8c66-003c-4b10-ab93-75341c7ad3fb',
  name: 'agre-renewal',
  description: 'TEST',
  cron_expression: '10 * * * *',
  data: {
    rs_database: 'agre_v2',
    schema: 'renewal',
    wf_ids: [1148],
    filters: ['listBuid'],
    column_filter: []
  },
  created_by: 'ADMIN',
  starts_at: '2024-10-31 00:33:28.778',
  ends_at: null,
  type: 'ETL',
  status: 'IN_PROGRESS',
  image:
    'git.sia-partners.com:5656/stratumn/platform/backend/jobs/analytics-etl:no-null-stratumn-com',
  metadata: {
    store_execution_logs: true
  }
};
An ETL is a Job as represented in Launcher-API. This JOB has a type ETL and data typed EtlJobData
interface EtlJobData {
  rs_database: string;
  wf_ids: string[];
  filters: string[];
  column_filter: string[];
}
An ETL launches on a Kubernetes cluster as a job. When a job is launched, a pod is created. Pods are instances of job (contains Job data) and lives also in Launcher database.
const etlPod: Pod = {
  job_id: '643e8c66-003c-4b10-ab93-75341c7ad3fb',
  status: 'SUCCESS',
  logs: 'logs from the kubernetes job',
  metrics: {},
  triggered_at: '2024-08-12 14:30:00.037'
};

DTO

name
literal
Name of the etl
cron_expression
literal
The frequency of the etl job. Must match cron expression literal. See (see cron expression )
description
literal
Description of the etl
starts_at
Date
Starting date of the ETL, if scheduled, it will start at this time.
ends_at
Date
Ending date of the ETL, if scheduled, it will end at this time.
image
literal
Docker image name used for this ETL. One ETL can have its own docker image (for example, financial flow has its own set of calculation and has its own docker image)
sentry_monitor
string
Sentry monitor slug name, available in Sentry when you create a cron monitor. Slug enables Sentry to identify the monitor within your app. Whenever a job is running with this slug, Sentry can link it to the monitor and display its results (status, triggering date, etc).
sentry_dsn
string
Sentry DSN (Data Source Name), available in Sentry when you create a sentry project DSN. A DSN tells a Sentry where to send events so the events are associated with the correct project.
metadata
object
Metadata is an object that takes a property store_execution_logs, which is a boolean. By default, this property is set to false, which means Launcher-API won’t store the logs of the pods. This has been decided for storage purpose, as a log can take lot of memory. To enable logs written in database, you would need to put the store_execution_logs to true.
data
object
Data details of an ETL. Check EtlData DTO

EtlData DTO

schema
literal
Schema name from the metabase database. It’s in snake_case
Changing this field might result in data loss. The name of the schema should match the one in metabase, otherwise the data won’t be updated correctly.
rs_database
literal
Database name from metabase. It is also in snake_case.
Changing this field might result in data loss. The name of the database should match the one in metabase, otherwise the data won’t be updated correctly.
wf_ids
array
IDs of the workflow. Typically one ID per workflow. Array of number
filters
array
Filters to remove tables from the ETL. Array of strings
column_filters
array
Filters to remove columns from table. Array of string.