Getting Started with dbt Core
A beginner-friendly guide to setting up and using dbt Core for data transformation in your edge data stack.
What is dbt Core?
dbt (data build tool) Core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It brings software engineering best practices like version control, testing, and documentation to data transformation workflows.
Unlike traditional ETL tools that move data between systems, dbt focuses exclusively on the transformation layer (the "T" in ETL/ELT), working directly within your data warehouse or database. This makes it a perfect companion for lightweight databases like DuckDB in an edge data stack.
Why Use dbt Core?
dbt Core brings several key benefits to data transformation workflows:
SQL-First Approach: Write transformations using SQL, the language most data analysts already know, reducing the learning curve.
Modularity: Build reusable data models that can be referenced by other models, creating a clear dependency graph.
Version Control: Track changes to your transformations in Git, enabling collaboration and providing an audit trail.
Testing and Documentation: Define tests for your data models and automatically generate documentation.
Dependency Management: dbt automatically handles the execution order of your transformations based on the dependencies between models.
Incremental Builds: Efficiently update only new or changed data rather than rebuilding entire tables (the sketch after this list shows what ref() and an incremental model look like).
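To make the modularity and incremental-build points concrete, here is a minimal sketch of what such a model can look like. It is illustrative only: the stg_orders model and the order_id, order_total, and ordered_at columns are assumptions for this example, not part of the project built later in this guide.
-- models/orders_enriched.sql (illustrative sketch; model and column names are assumptions)
{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

SELECT
    order_id,
    customer_id,
    order_total,
    ordered_at
FROM {{ ref('stg_orders') }}  -- ref() wires this model into the dependency graph
{% if is_incremental() %}
-- on incremental runs, only process rows newer than what is already in the target table
WHERE ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
{% endif %}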
Setting Up dbt Core with DuckDB
In our edge data stack, DuckDB serves as an excellent backend for dbt Core. Here's how to get started:
1. Installation
First, set up a Python environment and install the dbt-duckdb adapter:
# Create a virtual environment (optional but recommended)
python -m venv dbt-env
source dbt-env/bin/activate # On Windows: dbt-env\Scripts\activate
# Install dbt with DuckDB adapter
pip install dbt-duckdb
2. Initialize a dbt Project
Create a new dbt project:
# Create a new dbt project
dbt init my_analytics_project
# When prompted, select DuckDB as your database
This creates a starter project with the following structure:
my_analytics_project/
├── dbt_project.yml    # Project configuration
├── models/            # Where your SQL models live
│   └── example/       # Example models
├── analyses/          # Ad-hoc analyses
├── tests/             # Custom tests
├── macros/            # Reusable SQL snippets
├── seeds/             # CSV files to load as tables
└── snapshots/         # Point-in-time snapshots
3. Configure DuckDB Connection
The initialization creates a profiles.yml file in your ~/.dbt/ directory with connection settings for DuckDB:
my_analytics_project:
  outputs:
    dev:
      type: duckdb
      path: my_analytics.duckdb  # Local DuckDB file
      threads: 4
  target: dev
4. Create Your First Model
Create a simple model in models/my_first_model.sql:
-- models/my_first_model.sql
{{
  config(
    materialized='table'
  )
}}

-- A simple transformation using a common table expression (CTE)
WITH source_data AS (
    SELECT 1 AS id, 'Product A' AS product_name, 100 AS price
    UNION ALL
    SELECT 2 AS id, 'Product B' AS product_name, 200 AS price
    UNION ALL
    SELECT 3 AS id, 'Product C' AS product_name, 150 AS price
)

SELECT
    id,
    product_name,
    price,
    price * 0.9 AS discounted_price
FROM source_data
5. Run Your Model
Now run the model to create the table in DuckDB:
dbt run
dbt executes your SQL and creates the table in DuckDB.
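If you want to verify the result, the model is now an ordinary table inside the DuckDB file, so any DuckDB client can query it. A quick check might look like this (a sketch assuming the my_analytics.duckdb file and default schema configured above):
-- Run against the DuckDB file dbt wrote to, e.g. after opening it with the DuckDB CLI
SELECT id, product_name, price, discounted_price
FROM my_first_model
ORDER BY id;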
Working with Real Data Sources
In real projects, you'll connect to actual data sources. With DuckDB, you can query files such as CSV and Parquet directly inside your models:
-- models/customer_analysis.sql
{{
  config(
    materialized='table'
  )
}}

-- Read directly from a CSV file
WITH customers AS (
    SELECT * FROM read_csv_auto('data/customers.csv')
),

-- Join with data from a Parquet file
orders AS (
    SELECT * FROM read_parquet('data/orders.parquet')
)

SELECT
    c.customer_id,
    c.name,
    c.segment,
    COUNT(o.order_id) AS order_count,
    SUM(o.order_total) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3
Advanced dbt Features
Once you're comfortable with basic models, you can explore more advanced features:
Testing
Add tests to ensure data quality in models/schema.yml:
version: 2

models:
  - name: customer_analysis
    description: "Customer order analysis"
    columns:
      - name: customer_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: total_spent
        description: "Total amount spent by customer"
        tests:
          - positive_values
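The unique and not_null tests ship with dbt, but positive_values does not; it stands in here for a custom generic test you would define yourself. A minimal sketch of such a test, assuming it lives in tests/generic/positive_values.sql, could look like this:
-- tests/generic/positive_values.sql (illustrative custom generic test)
{% test positive_values(model, column_name) %}

-- dbt considers the test passed when this query returns zero rows
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}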
Documentation
Generate documentation for your project:
dbt docs generate
dbt docs serve
This creates an interactive website documenting your data models and their relationships.
Integrating with the Rest of Your Stack
dbt Core fits perfectly into an edge data stack:
- Source Data: Use DuckDB to directly query source files (CSV, Parquet) or connect to other systems.
- Transformation: Apply transformations with dbt Core.
- Visualization: Use the transformed data with tools like Evidence.dev or Marimo.
Conclusion
dbt Core brings software engineering best practices to data transformation workflows without the overhead of complex tools. When combined with DuckDB, it forms a powerful, lightweight solution for data transformation, perfectly aligned with the edge data philosophy.
By starting with a simple setup and gradually incorporating more features as needed, you can build a maintainable, version-controlled data transformation pipeline that grows with your needs.