At Conductor, we deal with pretty big data sets and provide users with a rich set of data views, filtering, and aggregations – all in real-time!
Snowflake is a modern data warehouse that we use to power some of the core features of the Conductor platform – Domain and Page Explorer which analyze millions of keywords per user request. Compared to other tools, Snowflake satisfies our requirements for heavy OLAP and provides a flexible pricing model. Another benefit of Snowflake is that the queries are written with SQL and some Snowflake-specific extensions which developers should be comfortable with.
Every technology has its limitations, pros, and cons. While implementing features for the Conductor platform, we came across a number of tricky Snowflake queries. In the article below I would like to share these insights with those who would like to use Snowflake in their projects. Let’s go!
The examples below will be based on the following table:
Snowflake supports most standard data types that you normally have in SQL. A complete list of data types can be found in Snowflake Documentation.
For real-world Snowflake tables you should also define a clustering key, but for simplicity I omitted these details.
Now let’s walk through the examples.
Task: find records where keywords contain a target word.
First solution that you might think about could look like this:
|SELECT url, keyword, tokens|
WHERE CONTAINS(keyword, ‘real’);
The results are obviously not what we need, we get the phrases where ‘real’ is only a part of the word:
Here is where the array data type can help! In order to insert a record into the array column, you can parse a comma-separated array as json:
|INSERT INTO serps (keyword, tokens)|
VALUES (‘a real fire’, PARSE_JSON(‘[“a”,”real”, “fire”]’));
It is worth mentioning that Snowflake can do basic data transformations while loading data into the table – for example parse a csv file, recognize comma-separated records as different columns, handle empty values, etc. In our case we already pre-processed keywords splitting them into tokens.
Now let’s use the ARRAY_CONTAINS function to filter tokens by a target word:
|SELECT url, keyword, tokens|
WHERE ARRAY_CONTAINS(‘real’::variant, tokens);
Task: find out which words are the most popular in keywords and count the number of keywords containing these tokens.
Snowflake has a FLATTEN function which can be used to convert semi-structured data to a relational representation. When applying to arrays, it converts a single record with an array of tokens into a set of records each containing one token from the array. Let’s see how it is applied to our example:
|SELECT tokens.value::string AS token, COUNT(keyword) AS count|
FROM serps t, LATERAL FLATTEN(input => t.tokens) tokens
GROUP BY tokens.value
ORDER BY count DESC, token;
The results will look like this:
Task: retrieve all records and calculate total modeled traffic for each keyword using only one query.
Window functions are supported by many traditional relational database engines. They are effective for the class of queries that require aggregates – min/max, count, sum, etc. on subgroups of data while running select on the data set. Likely, Snowflake also supports window functions and this elegant solution can be really helpful in practice. The following query solves our task – notice OVER and PARTITION BY operators:
|SELECT keyword, url, modeled_traffic,|
SUM(modeled_traffic) OVER (PARTITION BY keyword) AS total_modeled_traffic
ORDER BY keyword;
Optimizing queries can be a vital part of the implementation process. When working with large datasets it is easy to make a little mistake that will break the entire user experience. Snowflake has tools for query profiling where you can analyze your query execution parameters:
- query breakdown on execution steps
- how much time does each step takes
- how many records are returned on each step
- how many data was scanned from disk and from a cache
- how many partitions used – consider this when designing your clustering key
Typically query profiling UI looks like this:
From this screenshot, we can see that 64% of data was scanned from cache which is pretty good. Also, the heaviest operation is JOIN, it takes 50% of execution time – the first thing to consider when trying to optimize the query.
Things that profiling can teach you
- Use clustering keys in JOIN and WHERE clauses. Avoid general fields as much as possible, otherwise your queries will spend the majority of time on table scan operations.
- Forget about SELECT * queries. Snowflake is a columnar data store, explicitly write only the columns you need.
- Apply filters before joins. Use filters in the order of their cardinalities – first filter by columns having a smaller range of different values.
Bonus hint: forget about foreign keys and unique constraints, Snowflake validates only non-null constraints.
Snowflake is a good tool for its class of tasks, it requires minimal maintenance efforts, it supports SQL syntax, has a columnar data structure and clustering keys which should be easy to understand for developers familiar with other solutions like Amazon Redshift. I hope these tips and Snowflake technical documentation will help you pass the pitfalls and write effective elegant queries in your projects!