SDF & Dagster: The Post-Modern Data Stack
Seamlessly integrate your local code, orchestration, and cloud data warehouse.
In recent years, data and analytics engineers have benefited from a rise in tools and best practices targeting three major categories: local development (code), orchestration (assets in time), and the data warehouse (cloud). While these tools have had a considerable impact, they suffer from a fundamental problem: there are no guarantees.
Today, SQL developers can’t trust that the queries they author locally will execute successfully against the cloud. Downstream, the resulting pipelines might not materialize as expected, leaving your data warehouse in an inconsistent state. Traditionally, code and local authoring are offline processes, whereas cloud databases and orchestration are online processes. Within the context of data development, there is constant friction in keeping the offline and online worlds in sync.
SDF Labs and Dagster Labs are excited to announce a huge leap forward in seamless data development, combining the safety and speed of SDF’s powerful transformation layer with the expansive orchestration capabilities of Dagster.
The issue at hand
A typical analytics or data engineer might use dbt to author their data warehouse, Airflow for scheduling, and Snowflake to ultimately materialize their queries. Without running their queries remotely in a dev or staging environment, developers cannot verify that those queries are syntactically and semantically valid.
SDF is a multi-dialect SQL compiler, transformation framework, and analytical database engine. It solves this issue by natively compiling SQL dialects, like Snowflake, guaranteeing SQL correctness before materialization. Its fundamental understanding of your data lineage means the components of your data pipeline are not only independently valid SQL, but are guaranteed to execute as a unit or DAG.
When it comes time to orchestrate your data pipelines or DAGs, most orchestrators available today are not designed with data in mind; rather, they are task-centric, paying little attention to the broader picture of data lineage within a data warehouse. This makes orchestrating data pipelines on traditional schedulers manual, prone to logical errors, and disjointed.
Dagster is a powerful orchestration platform that, by design, brings structure and efficiency to data flows. A data-asset-centric orchestrator, Dagster provides unparalleled visibility into data pipelines composed of various platforms, tools, and environments. By stitching these together as data assets, Dagster is able to schedule and materialize complex DAGs with transparent dependency management and observability.
SDF and Dagster, Better Together
Together, SDF and Dagster provide the best of both worlds: a powerful transformation and authoring layer that exposes a rich set of metadata and compile-time guarantees to an extensive, asset-oriented orchestrator that natively understands the transformation graph.
Through a first-of-its-kind integration, users of SDF will be able to effectively schedule and materialize their workspaces with Dagster. True column-level lineage, rich classifiers, and data quality checks exposed in the Dagster UI make understanding your SDF workspace within the context of your broader Dagster pipelines clearer than ever.
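As a sketch of what this looks like in code, the dagster-sdf package maps each model in an SDF workspace to a Dagster asset. The workspace path and environment name below are placeholders; adapt them to your own project:

```python
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_sdf import SdfCliResource, SdfWorkspace, sdf_assets

# Illustrative paths; point these at your own SDF workspace.
workspace_dir = Path(__file__).joinpath("..", "my_sdf_workspace").resolve()
target_dir = workspace_dir.joinpath("sdf_dagster_out")
environment = "dbg"  # the SDF environment to compile and run against

workspace = SdfWorkspace(
    workspace_dir=workspace_dir,
    target_dir=target_dir,
    environment=environment,
)

# sdf_assets turns every model in the workspace into a Dagster asset,
# preserving the workspace's dependency graph.
@sdf_assets(workspace=workspace)
def my_sdf_assets(context: AssetExecutionContext, sdf: SdfCliResource):
    # Stream events from `sdf run` back to Dagster as materializations.
    yield from sdf.cli(
        ["run", "--save", "info-schema"],
        target_dir=target_dir,
        environment=environment,
        context=context,
    ).stream()

defs = Definitions(
    assets=[my_sdf_assets],
    resources={"sdf": SdfCliResource(workspace_dir=workspace_dir)},
)
```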
Robust Data Pipelines
When it comes time to schedule or materialize your models in Dagster, you can be confident that SDF has successfully compiled your workspace, making it safe to execute locally or against your cloud data warehouse. This is the magic of SDF’s compile and run commands.
Breaking changes at the database level are reflected in Dagster before you execute, and data integrity checks, in the form of SDF’s table and column tests, are automatically converted into Dagster Asset Checks and executed during orchestration.
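As a rough sketch, the same CLI resource can also invoke SDF’s test command explicitly. The workspace layout mirrors the example above, and exactly how test results surface in Dagster may depend on the integration version:

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_sdf import SdfCliResource, SdfWorkspace, sdf_assets

# Same illustrative workspace layout as the sketch above.
workspace_dir = Path(__file__).joinpath("..", "my_sdf_workspace").resolve()
target_dir = workspace_dir.joinpath("sdf_dagster_out")
workspace = SdfWorkspace(
    workspace_dir=workspace_dir, target_dir=target_dir, environment="dbg"
)

@sdf_assets(workspace=workspace)
def my_tested_sdf_assets(context: AssetExecutionContext, sdf: SdfCliResource):
    # `sdf test` compiles the workspace and runs its table and column tests;
    # the integration reports the results back to Dagster alongside the assets.
    yield from sdf.cli(
        ["test"], target_dir=target_dir, environment="dbg", context=context
    ).stream()
```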
Local Development
Using SDF with Dagster also supercharges developer productivity. With both tools exposing a rich local development experience, the data development feedback loop is drastically shortened, freeing up more time to focus on what matters most: building strong data pipelines. Time once wasted waiting for queries to validate remotely, debugging SQL semantics, or chasing down an arbitrary ref can instead be spent writing type-safe SQL and data quality checks that build trust in your data pipelines.
Local development is also enabled by the SDF DB, a fast, vectorized query engine integrated into SDF. As you develop, you can seamlessly execute queries, run jobs, tests, or other transformations locally, directly within Dagster.
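As a small sketch, switching between the embedded engine and your warehouse can be as simple as selecting a different SDF environment when constructing the workspace; the environment names here are illustrative:

```python
import os

# Select the SDF environment at deploy time. A local environment ("dbg" is
# common in SDF examples) executes queries in the embedded SDF DB, while a
# remote environment (named "prod" here for illustration) materializes
# against the cloud warehouse. Pass this value as `environment` when
# constructing the SdfWorkspace shown earlier.
environment = os.getenv("SDF_ENVIRONMENT", "dbg")
```

Keeping the environment a single switch means the same Dagster asset definitions can serve both fast local iteration and production materialization.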
Lower Cloud Compute Costs
Because SDF validates pipelines without executing them remotely, teams should notice a dramatic reduction in queries executed against their data warehouse, resulting in reduced compute costs. By catching potential errors early and preventing failing queries from ever running in the warehouse, SDF also decreases the costs associated with remediation, rollbacks, and developer time.
SDF Cache
On its own, the SDF Cache drastically improves the performance of local compilation and remote execution of data pipelines. Within the context of orchestration with Dagster, SDF ensures you only execute queries you’ve modified or that are impacted by upstream changes, not only improving execution speed but also further reducing the costs of materializing your data warehouse.
Today, we’re excited to share this new bridge between the offline world of local development and the online world of data orchestration. Not only are SDF and Dagster complementary, but each elevates the other to create a truly best-in-class toolchain for data.
Getting Started
Getting started from scratch or from an existing Dagster project is easy. Follow our official Getting Started with Dagster guide to learn more, or check out Dagster’s post for more information.