What is dbt - and how to set up your dbt developer environment?


Most of the world nowadays agrees that having some sort of data warehouse is key to a successful data operation. Every company - no matter how big or small - uses a large number of tools which all produce data. Combining the data from these tools is essential to make use of it. That's - in short - the sales pitch for a data warehouse. Compared to the very expensive "enterprise warehouses" of 20 years ago, there is a new breed of warehouses available: for marginal cost and with incredible scalability. The "modern data warehouse" - a warehouse which is feasible for any company, no matter its budget or size.

Now, let's assume that we succeeded in establishing the warehouse technology and integrating all our data sources into this warehouse. What do we have? A bunch of raw data, all organized in a central place. That is already good - but now comes the work of making use of it.

And here is where dbt comes into play. This post will cover two basic topics:

What is dbt?

It's rare these days that one can simply navigate to a tool's website and get a perfect description of what it does. dbt actually is one of these tools. So, I have the pleasure of simply quoting dbt's website:

dbt transforms data in your data warehouse and helps data engineers work like software engineers - to ship trusted data, incrementally and fast.

dbt does this by providing tools to:

  • use supercharged SQL (more on that later) to transform your raw data into usable, well-defined models
  • provide easy-to-maintain documentation tools
  • add a thin data testing layer to make sure your data is what you expect
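To illustrate the testing layer: in dbt, tests are just a few lines of YAML placed next to your models. The model and column names below (customers, customer_id) are hypothetical, purely for illustration:

```yaml
# models/schema.yml - a minimal sketch of dbt's built-in tests
version: 2

models:
  - name: customers        # hypothetical model name
    columns:
      - name: customer_id  # hypothetical column
        tests:
          - not_null       # fail if any customer_id is NULL
          - unique         # fail if customer_id contains duplicates
```

Running dbt test then checks these expectations directly against the data in your warehouse.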

This marketing image summarizes this quite well.

dbt Marketing Slide

Summarized: dbt provides an easy-to-use, easy-to-maintain way of modelling our raw data into multiple ready-to-use datasets which can be utilized by downstream operations like machine learning, BI and data analytics. dbt does this by providing a powerful SQL interface plus tools for documentation and testing.
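To give a first taste of what such a model looks like: a dbt model is a plain SELECT statement, supercharged with jinja. The source and column names below (shop, raw_orders, stg_orders) are made up for illustration:

```sql
-- models/stg_orders.sql - a hypothetical staging model
-- {{ source(...) }} (and its sibling {{ ref(...) }}) are jinja functions
-- which dbt resolves to fully-qualified table names at compile time
select
    order_id,
    customer_id,
    cast(order_placed_at as timestamp) as ordered_at
from {{ source('shop', 'raw_orders') }}
where order_id is not null
```

dbt compiles this file to plain SQL and materializes the result as a table or view in your warehouse.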

If you want to see dbt in action, skip to section How to model data with dbt?.

How to set up dbt?

In this section, we will install dbt and create a nice and tidy vscode development setup.

dbt core - the open source and, well... core offering of dbt - is actually just a python package one needs to install.

Besides the dbt main application, you can install one or several database adapters.

Run these commands to install dbt.

# Optional but highly recommended, create and activate a virtual environment
# Instead of venv, you might also use anaconda for environment management
python3 -m venv dbt-venv
source dbt-venv/bin/activate

# Install the core application
pip install dbt-core

# Install one or several of the following adapters

pip install dbt-bigquery # If you want to model data in your bigquery data warehouse
pip install dbt-athena-adapter # If you want to model data in your athena data warehouse
pip install dbt-postgres # If you want to model data in your postgres data warehouse
pip install dbt-synapse # If you want to model data in your synapse data warehouse
pip install dbt-databricks # If you want to model data in your databricks platform
pip install dbt-redshift # If you want to model data in your redshift data warehouse
pip install dbt-snowflake # If you want to model data in your snowflake data warehouse
pip install dbt-spark # If you want to model data in your spark platform

These are all the dependencies one needs to start developing with dbt!

For developing our data models, we are going to use VSCode, as it provides awesome extensions which make working with dbt fun and exciting.

  1. If not already done, install vscode
  2. Set up the folder structure, as follows
    1. Add a .gitignore file with the following content

      target/
      dbt_modules/
      dbt_packages/
      logs/
      venv/
      env/
      test.env
      __pycache__
      .vscode
      export.json
    2. Run dbt init and follow the prompts. The init application asks different questions depending on which database adapter you are going to use. In this example, I'm using BigQuery (make sure you have installed the appropriate adapter - see the pip install steps above)

      dbt init

      # This is the result
      17:06:52 Running with dbt=1.4.5
      Enter a name for your project (letters, digits, underscore): waffle_shop
      Which database would you like to use?
      [1] bigquery

      (Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

      Enter a number: 1
      [1] oauth
      [2] service_account
      Desired authentication method option (enter a number): 2
      keyfile (/path/to/bigquery/keyfile.json): /tmp/google_key.json
      project (GCP project id): devopsandmore
      dataset (the name of your dbt dataset): dbt_devops
      threads (1 or more): 8
      job_execution_timeout_seconds [300]: 300
      [1] US
      [2] EU
      Desired location option (enter a number): 2
      17:07:25 Profile waffle_shop written to /home/andreas/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
      17:07:25
      Your new dbt project "waffle_shop" was created!

      For more information on how to configure the profiles.yml file,
      please consult the dbt documentation here:

      https://docs.getdbt.com/docs/configure-your-profile

      One more thing:

      Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

      https://community.getdbt.com/

      Happy modeling!

      Finishing this installer will create the following folder structure

      .
      ├── logs
      │   └── dbt.log
      └── waffle_shop
          ├── README.md
          ├── analyses
          ├── dbt_project.yml
          ├── macros
          ├── models
          │   └── example
          │       ├── my_first_dbt_model.sql
          │       ├── my_second_dbt_model.sql
          │       └── schema.yml
          ├── seeds
          ├── snapshots
          └── tests
      • logs: This folder contains the logs of the dbt init operation. You can delete it.
      • waffle_shop: This is the main dbt project folder. All your model sql files, description files and settings are located here. The name of the folder is the name of the dbt project you entered during dbt init.
      • README.md: Readme file. Feel free to change.
      • analyses: Folder which might contain dbt analyses. Despite its name, analyses are not the main thing in dbt - "models" are
      • dbt_project.yml: Project settings file
      • macros: Folder to store dbt macros
      • models: This is the folder you will spend most of your time in. It contains all dbt models - as a reminder, dbt models are "just" sql files, enhanced with jinja template code.
      • example: Delete this folder
      • seeds: Folder which might contain dbt seeds
      • snapshots: Folder which might contain dbt snapshots
      • tests: Folder containing custom dbt tests
    3. Install vscode-dbt. This adds support for the dbt syntax (dbt-sql, jinja, yaml, markdown)

    4. Install sqlfluff

      pip install sqlfluff
    5. Install the sqlfluff vscode extension

    6. Add a file .sqlfluff to the toplevel folder, with the following content:

      [sqlfluff]
      # Choose the sql dialect to use
      dialect = bigquery
      # Use jinja templater instead of dbt, as the dbt templater is slow
      # and does not work with bigquery dialect
      templater = jinja
      # Optional: Exclude some rules
      # Find all rules here: https://docs.sqlfluff.com/en/stable/rules.html
      exclude_rules = L034, L032, L024, L015, L008, L001
      ignore_templated_areas = True
      large_file_skip_byte_limit = 200000

      [sqlfluff:templater:jinja]
      # Make dbt default macros available like `ref`
      apply_dbt_builtins = true
      # Load custom macros from the macro directory
      load_macros_from_path = ./macros/
      # Allow mocking dbt_utils and other packages through
      # python files placed in this dir
      library_path = ./sqlfluff_libs/

      # Mock variables that are used in dbt
      [sqlfluff:templater:jinja:context]
      target = prod

      [sqlfluff:rules]
      tab_space_size = 4
      max_line_length = 100
    7. Add a file .sqlfluffignore, with the following lines:

      target/
      dbt_modules/
      dbt_packages/
      macros/
      tests/
    8. Add a folder .vscode at the same folder level as logs and waffle_shop.

    9. Add a file settings.json inside the .vscode folder, having the following content:

      {
          // change this to where your python dbt virtual environment python binary is located
          "python.defaultInterpreterPath": "./dbt-venv/bin/python",

          "files.associations": {
              // associate all sql files in the folder with jinja sql
              "*.sql": "jinja-sql",

              // optional: don't format models in the `target/` dir.
              // This disables syntax highlighting in the target folder,
              // which helps to very quickly identify that you are not in a model file
              // but in a compiled target file.
              "**/target/**": ""
          },
          "editor.quickSuggestions": {
              "strings": true
          },

          // run "whereis sqlfluff" to find the executable path of your sqlfluff installation
          "sqlfluff.executablePath": "/path/to/dbt-venv/bin/sqlfluff",
          "sqlfluff.dialect": "bigquery",
          "sqlfluff.linter.run": "onSave",
          "sqlfluff.experimental.format.executeInTerminal": true,
          "editor.formatOnSave": false,
          "dbt.queryLimit": 1000,
          "editor.rulers": [100]
      }
    10. Install the dbt power user vscode extension

If you follow the steps above you'll get a basic dbt setup with vscode as a sort of dbt IDE. The developer experience is quite nice, and you have a development system at hand to start working with dbt.
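If you prefer to automate the boilerplate, the scaffolding parts of the steps above (.gitignore, .sqlfluff, .sqlfluffignore and the .vscode folder) can be sketched as a small shell script. The .sqlfluff content here is deliberately reduced to the two required keys - adjust everything to your own setup:

```shell
#!/usr/bin/env bash
# Sketch: create the scaffolding files from the steps above.
# Assumes it is run inside your (still empty) top-level project folder.
set -euo pipefail

# Ignore dbt build artifacts and local environments
cat > .gitignore <<'EOF'
target/
dbt_modules/
dbt_packages/
logs/
venv/
env/
test.env
__pycache__
.vscode
export.json
EOF

# Minimal sqlfluff configuration (see the full version above)
cat > .sqlfluff <<'EOF'
[sqlfluff]
dialect = bigquery
templater = jinja
EOF

# Keep sqlfluff away from generated and non-model code
cat > .sqlfluffignore <<'EOF'
target/
dbt_modules/
dbt_packages/
macros/
tests/
EOF

# Folder for the vscode settings.json
mkdir -p .vscode
```

After running it, only the dbt init step and the vscode extensions remain to be done by hand.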

In the next post, you'll see how to create your first dbt data model, based on some raw-data in your data warehouse.

------------------

Need assistance?

Do you have any questions about the topic presented here? Or do you need someone to assist in implementing these areas? Do not hesitate to contact me.