Build an ETL Pipeline | Dagster Docs (original) (raw)

Build your first ETL pipeline

In this tutorial, you'll build an ETL pipeline with Dagster that:

Imports sales data to DuckDB
Transforms data into reports
Runs scheduled reports automatically
Generates one-time reports on demand
Set up a Dagster project with the recommended project structure
Create and materialize assets
Create and materialize dependant assets
Ensure data quality with asset checks
Create and materialize partitioned assets
Automate the pipeline
Create and materialize a sensor asset
Refactor your project when it becomes more complex Prerequisites

To follow the steps in this guide, you'll need:

Basic Python knowledge
Python 3.9+ installed on your system. Refer to the Installation guide for information.
Familiarity with SQL and Python data manipulation libraries, such as Pandas.
Understanding of data pipelines and the extract, transform, and load process.

First, set up a new Dagster project.

Open your terminal and create a new directory for your project:

mkdir dagster-etl-tutorial  
cd dagster-etl-tutorial

Create and activate a virtual environment:
- MacOS
- Windows
  bash python -m venv dagster_tutorial source dagster_tutorial/bin/activate
Install Dagster and the required dependencies:

pip install dagster dagster-webserver pandas dagster-duckdb

Run the following command to create the project directories and files for this tutorial:

dagster project from-example --example getting_started_etl_tutorial

Your project should have this structure:

dagster-etl-tutorial/
├── data/
│   └── products.csv
│   └── sales_data.csv
│   └── sales_reps.csv
│   └── sample_request/
│       └── request.json
├── etl_tutorial/
│   └── definitions.py
├── pyproject.toml
├── setup.cfg
├── setup.py

info

Dagster has several example projects you can install depending on your use case. To see the full list, run dagster project list-examples. For more information on the dagster project command, see the API documentation.

Dagster project structure

dagster-etl-tutorial root directory

In the dagster-etl-tutorial root directory, there are three configuration files that are common in Python package management. These files manage dependencies and identify the Dagster modules in the project.

File	Purpose
pyproject.toml	This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem.
setup.cfg	This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options.
setup.py	This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details.

etl_tutorial directory

This is the main directory where you will define your assets, jobs, schedules, sensors, and resources.

File	Purpose
definitions.py	This file is typically used to define jobs, schedules, and sensors. It organizes the various components of your Dagster project. This allows Dagster to load the definitions in a module.

data directory

The data directory contains the raw data files for the project. We will reference these files in our software-defined assets in the next step of the tutorial.

To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:"

Continue this tutorial by creating and materializing assets

Build an ETL Pipeline | Dagster Docs (original) (raw)

Build your first ETL pipeline

Dagster project structure​

dagster-etl-tutorial root directory​

etl_tutorial directory​

data directory​

Dagster project structure

dagster-etl-tutorial root directory

etl_tutorial directory

data directory