Quick Start with DuckDB and dbt

Why DuckDB and dbt?

DuckDB is a fast, open-source, in-process SQL OLAP database. Its SQL dialect is largely PostgreSQL-compatible, but unlike PostgreSQL it runs embedded in the host process, with no server to install or manage, and it is optimized for analytical workloads. DuckDB is written in C++ and is available for Linux, macOS, and Windows.

DuckDB is a powerful tool for data analysis and exploration. It is particularly well-suited for working with large datasets thanks to its columnar, vectorized execution engine, and it can be extended with user-defined functions and a growing set of extensions for advanced data processing.
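
For example, DuckDB can query files in place, with no separate import step (a quick sketch; the file names are illustrative):

    -- Query a Parquet file directly; DuckDB infers the schema from the file
    SELECT count(*) FROM 'events.parquet';

    -- CSV files work the same way
    SELECT * FROM read_csv_auto('events.csv') LIMIT 5;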

DuckDB vs. PostgreSQL?

PostgreSQL is a powerful, open-source relational database management system. It is widely used for a variety of applications, including web development, data warehousing, and analytics, and it is known for its reliability, scalability, and extensibility. The practical difference is architectural: PostgreSQL is a client-server system tuned for transactional (OLTP) workloads, while DuckDB runs in-process and is tuned for analytical (OLAP) queries, which makes DuckDB a lightweight choice for the kind of local dbt workflow described below.

[Figure: duckdb-dbt-run.png]

To use DuckDB with dbt (data build tool), you’ll need to follow these general steps:

  1. Install necessary packages: First, ensure you have both dbt and the DuckDB adapter for dbt installed:

    pip install dbt-core dbt-duckdb
    
  2. Set up your dbt project: Initialize a new dbt project or use an existing one:

    dbt init <project_name>
    
  3. Configure your profiles.yml: In your ~/.dbt/profiles.yml file, add a profile for DuckDB. Here’s an example:

    your_profile_name:
      target: dev
      outputs:
        dev:
          type: duckdb
          path: '/path/to/your/duckdb/database.db'
          extensions:
            - httpfs
            - parquet
    

    Adjust the path to where you want your DuckDB database file to be located.

  4. Update your dbt_project.yml: In your project’s dbt_project.yml, make sure the profile name matches:

    profile: 'your_profile_name'
    
  5. Create your models: In the models directory of your dbt project, create your SQL models as usual. DuckDB supports standard SQL syntax.
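
    For example, a trivial model is just a SELECT statement saved as a .sql file (the file name and columns here are illustrative):

    -- models/example_model.sql
    SELECT 1 AS id, 'espresso' AS product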

  6. Run dbt commands: You can now use standard dbt commands like:

    dbt run
    dbt test
    dbt docs generate
    
  7. Using DuckDB-specific features: DuckDB has some unique features you might want to leverage in your models:

    • Reading from Parquet files:
      SELECT * FROM read_parquet('path/to/file.parquet')
      
    • Using the httpfs extension to read remote files:
      SELECT * FROM read_csv_auto('https://example.com/data.csv')
      
  8. Materialization: DuckDB supports all standard dbt materializations (view, table, incremental, ephemeral).
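
    For example, an incremental model opts in through a config block at the top of its file (a minimal sketch; the model and column names are illustrative):

    -- models/orders_incremental.sql
    {{ config(materialized='incremental', unique_key='order_id') }}

    select * from {{ ref('stg_orders') }}
    {% if is_incremental() %}
    -- on incremental runs, only process orders newer than what is already in the table
    where order_date > (select max(order_date) from {{ this }})
    {% endif %}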

Remember that DuckDB is an in-process database, so it’s great for local development and testing. For production use, you might want to consider how you’ll manage the database file across different environments.

Also, keep in mind that DuckDB’s feature set might differ slightly from other databases, so always refer to the DuckDB documentation for specifics on SQL syntax and available functions.

Please create a project to investigate coffee shop user data using dbt and DuckDB.

Certainly! I’ll guide you through creating a dbt project to investigate coffee shop user data using DuckDB. This project will include sample data, models, and some basic analyses. Let’s get started:

  1. Set up the project:
# Create a new directory and navigate into it
mkdir coffee_shop_analysis && cd coffee_shop_analysis

# Initialize a new dbt project
dbt init coffee_shop_dbt

# Navigate into the project directory
cd coffee_shop_dbt
  2. Configure profiles.yml: Edit ~/.dbt/profiles.yml to include:
coffee_shop_dbt:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'coffee_shop.db'
      extensions:
        - httpfs
        - parquet
  3. Update dbt_project.yml: Ensure it has the correct profile name:
profile: 'coffee_shop_dbt'
  4. Create sample data: In the project root, create a seeds directory and add two CSV files (dbt loads seed files from seeds/ by default; if you prefer a data/ directory, point seed-paths at it in dbt_project.yml):

seeds/users.csv:

user_id,name,email,sign_up_date
1,Alice,alice@email.com,2023-01-01
2,Bob,bob@email.com,2023-01-15
3,Charlie,charlie@email.com,2023-02-01

seeds/orders.csv:

order_id,user_id,order_date,total_amount
101,1,2023-01-05,10.50
102,2,2023-01-20,15.75
103,1,2023-02-10,8.25
104,3,2023-02-15,12.00
  5. Create models: In the models directory, create the following SQL files. The bodies shown are minimal sketches that assume the seed columns above; adapt them to your data:

models/stg_users.sql:
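
A minimal staging model over the users seed (a sketch; columns follow the seed file above):

-- typed staging view over the raw users seed
select
    user_id,
    name,
    email,
    cast(sign_up_date as date) as sign_up_date
from {{ ref('users') }}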

models/stg_orders.sql:
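
And the equivalent for orders:

-- typed staging view over the raw orders seed
select
    order_id,
    user_id,
    cast(order_date as date) as order_date,
    total_amount
from {{ ref('orders') }}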

models/int_user_profile.sql:
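
An intermediate model that derives days since signup (date_diff is DuckDB's syntax):

-- enrich users with how long they have been signed up
select
    user_id,
    name,
    email,
    sign_up_date,
    date_diff('day', sign_up_date, current_date) as days_since_signup
from {{ ref('stg_users') }}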

models/int_order_detail.sql:
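
Order rows joined with the purchasing user:

-- one row per order, enriched with user attributes
select
    o.order_id,
    o.user_id,
    u.name,
    o.order_date,
    o.total_amount
from {{ ref('stg_orders') }} o
join {{ ref('stg_users') }} u on o.user_id = u.user_id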

models/mart_user_summary.sql:
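
Per-user order metrics:

-- aggregate orders up to the user level
select
    user_id,
    count(*) as total_orders,
    sum(total_amount) as total_spent,
    avg(total_amount) as avg_order_value
from {{ ref('stg_orders') }}
group by user_id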

models/mart_order_summary.sql:
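
Daily order metrics:

-- aggregate orders up to the day level
select
    order_date,
    count(*) as total_orders,
    sum(total_amount) as total_revenue
from {{ ref('stg_orders') }}
group by order_date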

models/final_user_order_summary.sql:
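
Finally, join the user profile with the per-user metrics (the left join keeps users who have no orders yet):

-- one row per user with signup and spending metrics
select
    p.user_id,
    p.name,
    p.email,
    p.sign_up_date,
    p.days_since_signup,
    coalesce(s.total_orders, 0) as total_orders,
    coalesce(s.total_spent, 0) as total_spent,
    s.avg_order_value
from {{ ref('int_user_profile') }} p
left join {{ ref('mart_user_summary') }} s
    on p.user_id = s.user_id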

  6. Create a simple test: In models/schema.yml:
version: 2

models:
  - name: final_user_order_summary
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
  7. Run dbt commands:
dbt seed
dbt run
dbt test
  8. Query the results:

Run this command from the project root to open the database file (the one named in profiles.yml) in the DuckDB CLI:

duckdb coffee_shop.db

You can now query the results using DuckDB CLI or any DuckDB-compatible tool. For example:

SELECT * FROM final_user_order_summary ORDER BY total_spent DESC;

Additionally, use dbt commands for generating documentation and running tests:

dbt docs generate
dbt docs serve
dbt test

This project sets up a basic structure for analyzing coffee shop user data:

  • It stages user and order data from CSV seed files.
  • It builds intermediate and mart models that combine user and order information.
  • It calculates metrics like total orders, total spent, days since signup, and average order value.
  • It includes a simple test to ensure user_id is unique and not null in the final summary.

You can expand this project by:

  • Adding more data sources (e.g., product information, store locations).
  • Creating more complex models (e.g., cohort analysis, retention rates).
  • Adding more tests and documentation.
  • Creating data visualizations using the output from these models.

Remember to adjust file paths if needed and ensure you have the necessary permissions to read/write files in the specified locations.