Datumology · Data Tools · 4 min read

Getting Started with dbt Core

A beginner-friendly guide to setting up and using dbt Core for data transformation in your edge data stack.

What is dbt Core?

dbt (data build tool) Core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It brings software engineering best practices like version control, testing, and documentation to data transformation workflows.

Unlike traditional ETL tools that move data between systems, dbt focuses exclusively on the transformation layer (the "T" in ETL/ELT), working directly within your data warehouse or database. This makes it a perfect companion for lightweight databases like DuckDB in an edge data stack.

Why Use dbt Core?

dbt Core brings several key benefits to data transformation workflows:

  • SQL-First Approach: Write transformations using SQL, the language most data analysts already know, reducing the learning curve.

  • Modularity: Build reusable data models that can be referenced by other models, creating a clear dependency graph (see the sketch after this list).

  • Version Control: Track changes to your transformations in Git, enabling collaboration and providing an audit trail.

  • Testing and Documentation: Define tests for your data models and automatically generate documentation.

  • Dependency Management: dbt automatically handles the execution order of your transformations based on the dependencies between models.

  • Incremental Builds: Efficiently update only new or changed data rather than rebuilding entire tables.
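
To make the modularity and dependency-management points concrete, here is a minimal sketch of two models where the second references the first with dbt's ref() function; dbt reads that reference and builds stg_orders before orders_summary. The model and column names are hypothetical.

-- models/stg_orders.sql (hypothetical upstream model)
SELECT 1 AS customer_id, 120 AS order_total
UNION ALL
SELECT 2 AS customer_id, 80 AS order_total

-- models/orders_summary.sql (downstream model)
SELECT
    customer_id,
    SUM(order_total) AS total_spent
FROM {{ ref('stg_orders') }}  -- dbt resolves this reference and builds stg_orders first
GROUP BY customer_id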

Setting Up dbt Core with DuckDB

In our edge data stack, DuckDB serves as an excellent backend for dbt Core. Here's how to get started:

1. Installation

First, set up a Python environment and install the dbt-duckdb adapter:

# Create a virtual environment (optional but recommended)
python -m venv dbt-env
source dbt-env/bin/activate  # On Windows: dbt-env\Scripts\activate

# Install dbt with DuckDB adapter
pip install dbt-duckdb
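
To confirm the installation worked and the DuckDB adapter is registered, check the version output, which lists dbt Core and its installed plugins:

# Verify dbt and the duckdb plugin are installed
dbt --version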

2. Initialize a dbt Project

Create a new dbt project:

# Create a new dbt project
dbt init my_analytics_project

# When prompted, select DuckDB as your database

This creates a starter project with the following structure:

my_analytics_project/
├── dbt_project.yml         # Project configuration
├── models/                 # Where your SQL models live
│   └── example/            # Example models
├── analyses/               # Ad-hoc analyses
├── tests/                  # Custom tests
├── macros/                 # Reusable SQL snippets
├── seeds/                  # CSV files to load as tables
└── snapshots/              # Point-in-time snapshots
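
The generated dbt_project.yml holds project-wide defaults. For example, you can set a default materialization for every model in the project; this is a partial excerpt, not the full generated file:

# dbt_project.yml (excerpt)
name: 'my_analytics_project'
profile: 'my_analytics_project'

models:
  my_analytics_project:
    # Build all models as views unless a model overrides this in its own config block
    +materialized: view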

3. Configure DuckDB Connection

The initialization creates a profiles.yml file in your ~/.dbt/ directory with connection settings for DuckDB:

my_analytics_project:
  outputs:
    dev:
      type: duckdb
      path: my_analytics.duckdb  # Local DuckDB file
      threads: 4
  target: dev
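
Once the profile exists, you can verify that dbt finds it and can open the DuckDB file:

# Validate project config, profile, and database connection
dbt debug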

4. Create Your First Model

Create a simple model in models/my_first_model.sql:

-- models/my_first_model.sql
{{
  config(
    materialized='table'
  )
}}

-- A simple transformation using a common table expression (CTE)
WITH source_data AS (
    SELECT 1 AS id, 'Product A' AS product_name, 100 AS price
    UNION ALL
    SELECT 2 AS id, 'Product B' AS product_name, 200 AS price
    UNION ALL
    SELECT 3 AS id, 'Product C' AS product_name, 150 AS price
)

SELECT 
    id,
    product_name,
    price,
    price * 0.9 AS discounted_price
FROM source_data

5. Run Your Model

Now run the model to create the table in DuckDB:

dbt run

dbt executes your SQL and creates the table in DuckDB.
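
You can inspect the result directly, for example with the DuckDB command-line client (assuming it is installed and using the database path from profiles.yml):

# Query the table dbt just built
duckdb my_analytics.duckdb "SELECT * FROM my_first_model"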

Working with Real Data Sources

In real projects, you'll connect to actual data sources. With DuckDB, you can read files like CSV and Parquet directly inside a model:

-- models/customer_analysis.sql
{{
  config(
    materialized='table'
  )
}}

-- Read directly from a CSV file
WITH customers AS (
    SELECT * FROM read_csv_auto('data/customers.csv')
),
-- Join with data from a Parquet file
orders AS (
    SELECT * FROM read_parquet('data/orders.parquet')
)

SELECT 
    c.customer_id,
    c.name,
    c.segment,
    COUNT(o.order_id) AS order_count,
    SUM(o.order_total) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3
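
As the project grows, you rarely need to rebuild everything on each change; the --select flag limits a run to specific models:

# Rebuild only the customer_analysis model
dbt run --select customer_analysis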

Advanced dbt Features

Once you're comfortable with basic models, you can explore more advanced features:

Testing

Add tests to ensure data quality in models/schema.yml. The unique and not_null tests are built into dbt; positive_values is a custom generic test you define yourself (sketched after the YAML below):

version: 2

models:
  - name: customer_analysis
    description: "Customer order analysis"
    columns:
      - name: customer_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: total_spent
        description: "Total amount spent by customer"
        tests:
          - positive_values
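
A custom generic test is a macro that returns the failing rows; the test passes when the query returns nothing. A minimal sketch for positive_values, placed in the macros/ directory:

-- macros/positive_values.sql
-- Passes when no row has a non-positive value in the tested column
{% test positive_values(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}

Run all tests defined in the project with:

dbt test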

Documentation

Generate documentation for your project:

dbt docs generate
dbt docs serve

This creates an interactive website documenting your data models and their relationships.

Integrating with the Rest of Your Stack

dbt Core fits perfectly into an edge data stack:

  1. Source Data: Use DuckDB to directly query source files (CSV, Parquet) or connect to other systems.
  2. Transformation: Apply transformations with dbt Core.
  3. Visualization: Use the transformed data with tools like Evidence.dev or Marimo (see the sketch below).
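
As a minimal sketch of step 3, any tool that can read DuckDB can pick up the transformed tables. For example, with the duckdb Python package (installed alongside dbt-duckdb) and the database file from profiles.yml:

# query_results.py -- read a dbt-built table for downstream use (illustrative)
import duckdb

con = duckdb.connect("my_analytics.duckdb")
rows = con.execute("SELECT customer_id, total_spent FROM customer_analysis LIMIT 5").fetchall()
print(rows)
con.close()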

Conclusion

dbt Core brings software engineering best practices to data transformation workflows without the overhead of complex tools. When combined with DuckDB, it forms a powerful, lightweight solution for data transformation, perfectly aligned with the edge data philosophy.

By starting with a simple setup and gradually incorporating more features as needed, you can build a maintainable, version-controlled data transformation pipeline that grows with your needs.
