Introduction
Imagine a warehouse with hundreds, thousands, or millions of tables. How do you know what purpose each table has, or what each column represents? In most SQL-based DBs, the best you can do is use meaningful names, types, and comments. But that is not enough to control the complexity that comes with scale. Unlike modern programming languages with advanced type, object, and procedural abstractions, SQL’s types and procedural abstractions are fairly primitive. In fact in most warehouses more than 50% of columns are just of type VARCHAR and most of the rest are other basic types. So how can you augment existing queries and pipelines with additional meaning? With data classification!
Data classification in the warehouse or data lake refers to the process of assigning additional meaning to tables and columns that goes beyond data types such as integers, strings, and booleans. This can include different kinds of PII (e.g. Phone, Address, SSN), levels of sensitivity (e.g. Public, Secret, Top Secret), retention periods (e.g. 90 days, 1 year), units of measurement and many more. Such additional meaning is crucial for an in-depth understanding of the data and for having any hope of warehouse-wide enforcement of data governance policies such as GDPR, CCPA, and HIPAA.
Traditionally, data classification methods rely on manual tagging, and, as a result, are tedious and untenable at the scale of modern warehouses. AI-based data classification addresses some of the scale concerns at the cost of being inherently imprecise: due to hallucination, its guesses are often wrong.
SDF Labs has developed a novel solution to address the above challenges. We use the familiar techniques from the field of programming languages — static analysis and information flow analysis — in the context of data, queries, and pipelines. The resulting approach achieves precise column-level and table-level data classification automatically for the vast majority of tables, requiring manual tagging or AI-based classification only for root tables — a small percentage of the overall warehouse.
The following figure illustrates this: once a single root table has one of its columns annotated with a classifier, a large number of downstream tables automatically have the same classifier propagated to their relevant columns.
In this article, we explore the above ideas using several simple classifier examples. In later articles, we will draw upon the analogy with type systems and programming languages and show how SDF provides an extensible enforcement mechanism for spotting data policy violations at compile time. This is similar to how Typescript enhanced Javascript with a type system that helped spot coding bugs at compile time.
Basics
Let us start with a simple warehouse consisting of several tables:
The root (source) tables:
customer
orders
lineitem
nation
A derived table
customer_revenue
calculated from the root tablesA second-level derived table
public_revenue_info
calculated fromcustomer_revenue
(An astute observer will notice that this example is based on the common DB benchmark TPCH.)
In general, a warehouse is a collection of tables connected by dependency relationships into a directed graph. (For the most part, warehouse graphs tend to be acyclic with the exception of recursively defined cumulative/incremental tables.)
In SDF, this dependency relationship is inferred automatically from the set of SQL files representing the tables. Root tables are defined by files with DDL statements:
CREATE TABLE private.customer (
c_custkey BIGINT,
c_name VARCHAR(255),
c_address VARCHAR(255),
c_nationkey BIGINT,
c_phone VARCHAR(255),
c_acctbal DECIMAL(15,2),
c_mktsegment VARCHAR(255),
c_comment VARCHAR(255)
)
And derived tables — by files with SELECT, INSERT INTO, or CREATE TABLE AS statements:
CREATE TABLE private.customer_revenue AS
SELECT
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM
customer,
orders,
lineitem,
nation
WHERE
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '2023-10-01'
and o_orderdate < date '2024-01-01'
and l_returnflag = 'R'
and c_nationkey = n_nationkey
GROUP BY
1,2,4,5,6,7,8
ORDER BY
revenue DESC
Our example has a second-level derived table defined from the above table as follows:
CREATE TABLE external.public_revenue_info AS
SELECT
c_custkey,
c_name
FROM
private.customer_revenue;
With this information, SDF can compute the structure of every table. Here, for example, is external.public_revenue_info
:
+-------------+-----------+------------+
| column_name | data_type | classifier |
+-------------+-----------+------------+
| c_custkey | bigint | |
| c_name | varchar | |
+-------------+-----------+------------+
Classifiers
Now, let us move beyond the basic types shown in the second column above and into the world of rich classifiers. In SDF, classifier taxonomies are defined by simple Yaml files. Here, for example, is a tiny taxonomy consisting of two classifier sets: PII and CURRENCY:
classifier:
name: PII
labels:
- name: name
- name: phone
- name: address
- name: hashed
---
classifier:
name: CURRENCY
labels:
- name: usd
- name: cad
The PII classifier set includes labels name
, phone
, address
, and hashed
. We will use the latter to represent securely hashed PII information, which can be allowed in more contexts than unhashed, or clear-text, PII information. The CURRENCY classifier set includes labels usd
and cad
to represent US and Canadian currencies respectively.
Now we can apply this classifier taxonomy to the root tables in our warehouse. We will do so with the help of side Yaml files linking to the tables using fully-qualified names. Here is the classifier assignment for private.customer
:
table:
name: private.customer
columns:
- name: c_name
classifiers:
- PII.name
- name: c_address
classifiers:
- PII.address
- name: c_phone
classifiers:
- PII.phone
- name: c_acctbal
classifiers:
- CURRENCY.usd
We can attach any number of classifiers to a given column or to an entire table.
Now, SDF knows both data types and classifiers for private.customer
:
+--------------+----------------+--------------+
| column_name | data_type | classifier |
+--------------+----------------+--------------+
| c_custkey | bigint | |
| c_name | varchar | PII.name |
| c_address | varchar | PII.address |
| c_nationkey | bigint | |
| c_phone | varchar | PII.phone |
| c_acctbal | decimal(15, 2) | CURRENCY.usd |
| c_mktsegment | varchar | |
| c_comment | varchar | |
+--------------+----------------+--------------+
Inference
Much like modern programming languages can infer types of expressions from the types of their components, SDF can infer classifiers for downstream tables from those of the upstream tables. Once root tables are fully annotated, classifiers for all the other tables are automatically inferred by SDF using information flow analysis. We also say that classifiers are propagated to downstream tables.
So, without having to do anything else, we get the following classifier assignments for private.customer_revenue
:
+-------------+----------------+--------------+
| column_name | data_type | classifier |
+-------------+----------------+--------------+
| c_custkey | bigint | |
| c_name | varchar | PII.name |
| revenue | double | CURRENCY.usd |
| c_acctbal | decimal(15, 2) | CURRENCY.usd |
| n_name | varchar | PII.name |
| c_address | varchar | PII.address |
| c_phone | varchar | PII.phone |
| c_comment | varchar | |
+-------------+----------------+--------------+
And external.public_revenue_info
:
+-------------+-----------+------------+
| column_name | data_type | classifier |
+-------------+-----------+------------+
| c_custkey | bigint | |
| c_name | varchar | PII.name |
+-------------+-----------+------------+
The figure below illustrates the flow of classifiers in our example. Most propagate as direct copies of an upstream column (e.g. c_name
, c_acctbal
, and c_phone
are copied from customer
to customer_revenue
.) The revenue
column is an example of propagation through transformation: even though the data is transformed as it flows from the upstream columns l_discount
and l_extendedprice
, SDF preserves and propagates the classifiers.
SDF’s information flow analysis is conservative: by default, it propagates all of the input classifiers. For example, for the following expression concatenating three private.customer
columns: CONCAT(c_name, c_address, CAST(revenue as VARCHAR))
SDF will infer three classifiers: PII.name
, PII.address
, and CURRENCY.usd
.
How classifiers are propagated within a single query can be surprisingly tricky. Consider, for example, this query using a source table with an INFO.Sensitive
classifier on its c3
column.
SELECT (SELECT count(*) AS cnt FROM source_tbl WHERE c3 LIKE 'xyz') AS cnt;
Should the cnt
column in the output table inherit the sensitivity level? While in a typical query, SDF does not propagate classifiers attached to the WHERE
clause conditions, here, evaluation of the WHERE
clause in the scalar subquery directly impacts the value in the output column, so SDF takes the opposite approach and propagates INFO.Sensitive
to the output.
Another design choice arises when dealing with user data and PII. What inference rules should we use for propagating PII classifiers through aggregation? For example, the following query aggregates the length of phone numbers across all users:
SELECT AVG(LENGTH(phone)) as len FROM customer;
Intuitively, the resulting column len
does not have PII because aggregation mixed the data from all users. But what about the following queries?
SELECT MAX(phone) as phone FROM customer;
SELECT c_custkey, MAX(phone) as phone
FROM customer
GROUP BY c_custkey;
The first query above leaks the phone of one customer; the second query leaks all customers’ phones — in both cases despite aggregation. SDF takes the approach of being conservative with respect to aggregation: all of the above examples result in classifiers propagating to the output. Read on to learn how reclassification can be used to adjust the results of SDF’s inference algorithm.
Reclassification
Suppose we want to restrict clear text PII classifiers to only private schema tables. This invariant is currently violated: column c_name
in external.public_revenue_info
has PII.name
. Let us attempt to fix this by hashing the PII column using the MD5
function:
CREATE TABLE public_revenue_info AS
SELECT
c_custkey,
MD5(c_name) AS c_name
FROM
private.customer_revenue;
Without any further information about MD5
, SDF will conservatively propagate the PII.name
classifier to the c_name
column produced by the modified query.
To achieve the desired effect, we need to tell SDF that MD5
is a special function whose effect is to reclassify any PII passed in as input into hashed PII returned as the result. We do this with the help of the following Yaml declaration:
function:
name: MD5
reclassify:
- from: PII.*
to: PII.hashed
Now, SDF infers the right classifier for external.public_revenue_info
:
+-------------+-----------+------------+
| column_name | data_type | classifier |
+-------------+-----------+------------+
| c_custkey | bigint | |
| c_name | varchar | PII.hashed |
+-------------+-----------+------------+
Whenever MD5
is used in any other table definition, it will reclassify any PII classifier in its input into PII.hashed
in the output. What happens to other classifiers that pass through MD5
? Nothing: following the conservative principle mentioned above, they are passed through unchanged. We call this function-based reclassification.
SDF also supports table-based reclassification for directly reclassifying a particular table or column. For instance, we can instruct SDF to reclassify c_name
column in external.public_revenue_info
regardless of the defining query as follows:
table:
name: external.public_revenue_info
columns:
- name: c_name
reclassify:
- from: PII.name
to: PII.hashed
In subsequent blog posts, we will show how invariants like the one discussed here (that clear text PII cannot flow into external tables) can be encoded as code contracts and enforced at compilation time — much like modern programming language compilers detect certain classes of bugs at compile time. We will also talk about code reports that compute and display compile-time statistics about classifiers, types, and tables.
A Glimpse of More to Come
This article showed a basic example of using a PII classification taxonomy to track the flow of PII information through warehouse tables. We are just scratching the surface here; a lot more can be done with classifiers including the following use cases some of which we will describe in more detail in subsequent blog entries:
ID tracking: a typical warehouse will contain a multitude of different types of IDs: user id, device id, post id, etc. While all of them may be represented by the same underlying physical data type (e.g. BIGINT), it is important not to mix different kinds of ids in any way. One may not want to inadvertently coalesce a user id and a device id: the resulting value will not have any meaning.
Currency: a financial application may track different types of currencies — US dollars, Canadian dollars, etc. — and perform computations on them only after properly converting to a normalized currency representation.
Retention management: tracking table-level TTLs is crucial for implementing data governance policies such as GDPR. A classifier can identify a TTL for a particular table and ensure that it is obeyed.
Table visibility: table-level classifiers can also help in constraining how and where a particular table can be used. For example, using classifiers, we can set up a policy enforcing that tables marked private may only be referenced within the same catalog and schema. We can even define friend classifiers which help ensure that only a specified set of downstream tables may source data from a given table.
Tune in for more on this in subsequent articles.