What the f*ck does Snowflake do?
Introduction
Have you heard about Snowflake and perhaps even used it but don’t understand why people think it's so amazing? You don’t need a technical background to understand the core concepts behind Snowflake. This article will explain in non-technical language what Snowflake is and why it was such a game-changer in the data warehousing space.
What this article isn’t is an exhaustive review of Snowflake explaining every feature and technical detail. If that sounds good to you and you’re looking for a quick rundown of what Snowflake is and how it became a leader in the data warehouse landscape, then this is the article for you.
Table of Contents
Introduction
What is a Database?
On-Premises and Cloud Servers
Understanding the Rise of Snowflake: The 4 Key Factors
4.1 Separation of Storage and Compute
4.2 Ability to Control Compute Resources
4.3 Micro-Partitions and Data Clustering
4.4 User Experience: It “Just Works”
Glossary
What is a Database?
At its core, Snowflake is a database where you can store your data.
A database is like a digital storage system that helps organize and manage large amounts of information. It acts as a structured container where data is stored in a systematic and organized manner. It's similar to a filing cabinet or a well-organized library catalog that allows you to store, retrieve, and manipulate data efficiently.
Databases provide a way to store different types of data, such as customer information, product details, or financial records. They also offer mechanisms to search, sort, and filter the data, making it easier to find specific information when needed. Databases store their data in tables, which look similar to an Excel sheet with columns and rows.
Example of a database table
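If you're curious what storing, filtering, and sorting looks like in practice, here is a minimal sketch using SQLite, a small database that ships with Python. It is not Snowflake, and the customers table, its columns, and the customers_in_region helper are all made up for illustration.

```python
import sqlite3

# An in-memory database, so this sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# A table has named columns, much like the header row of a spreadsheet.
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, region TEXT, total_spend REAL)"
)

# Each tuple below becomes one row in the table (made-up sample data).
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "West", 120.0),
        (2, "Grace", "East", 310.5),
        (3, "Linus", "West", 45.25),
    ],
)


def customers_in_region(region):
    """Filter rows by region and sort them by spend, highest first."""
    cursor = conn.execute(
        "SELECT name, total_spend FROM customers "
        "WHERE region = ? ORDER BY total_spend DESC",
        (region,),
    )
    return cursor.fetchall()


print(customers_in_region("West"))
```

The query asks for only the rows where the region is "West" and returns them sorted by spend, which is the "search, sort, and filter" idea described above.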
At a high level, Snowflake is considered a data warehouse, which is a database that contains data from a variety of sources to be analyzed.
On-Premises and Cloud Servers
To understand what makes Snowflake so special, it's important to understand the rise of the “cloud” in technology.
Databases are stored on powerful computers called servers that are specifically designed to store and manage data. In the past, organizations built their own physical servers in their facilities, known as “on-premises servers”. These servers required a lot of time and budget to build and a dedicated IT team to maintain and upgrade.
What a server room looks like
However, as technology advanced, third-party providers like Amazon and Microsoft began building their own servers and renting them out to organizations. As long as organizations had access to the internet, they could access these remote servers and their data. This model is referred to as the cloud, and it was a game-changer that offered a range of benefits over on-premises servers.
For instance, cloud servers eliminated the need for companies to maintain their own hardware, as all the infrastructure and resources were provided by the cloud provider. This not only reduced upfront costs but also alleviated the burden of ongoing server maintenance. Moreover, cloud servers provided easier access, as data and applications could be reached from anywhere with an internet connection. This flexibility enabled better collaboration, leading to increased productivity.
Overall, cloud servers have replaced traditional on-premises servers at many organizations by offering cost-efficiency, scalability, flexibility, and accessibility, empowering businesses to focus on their core operations while leaving the server management to the experts.
Returning to Snowflake: it is a data warehouse built on the cloud. This means that the data is stored and processed on remote servers run by companies like Amazon and Microsoft. While Snowflake wasn’t the first cloud data warehouse, it leveraged the cloud to create many amazing features that set it apart from its competitors.
Understanding the Rise of Snowflake: The 4 Key Factors
1. Separation of Storage and Compute
Building a server is very much like buying or building a computer or phone. There are generally two important factors to consider:
How much space is available to store data and applications?
How fast does the machine run?
In the servers and database world, these two concepts are known as storage and compute.
Traditionally, for on-prem servers, storage and compute were bundled together in a single package. This caused a lot of headaches for companies. If your storage needs grew significantly or you required more computing power, you would need to add more servers, which was time-consuming and costly. It also often led to underutilization of resources. Since storage and compute were bundled together, you might have had to buy more computing power than necessary just to meet storage requirements, or vice versa, resulting in inefficiency.
By contrast, when Snowflake was released, it separated storage and compute so that you could manage storage resources independently from computing resources. This meant you could easily scale your computing power up or down without affecting the storage capacity, or vice versa. This separation enabled greater efficiency through better resource utilization, and also allowed cost optimization since, in the cloud, you only pay for the storage and compute that you use.
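To make the bundled-versus-separated difference concrete, here is a rough back-of-the-envelope sketch. Every number in it is invented for illustration (these are not real server or Snowflake prices), and bundled_cost and separated_cost are hypothetical helper names. One bundled server is deliberately priced the same as its parts, so any gap between the two totals comes purely from capacity you were forced to buy but didn't need.

```python
import math

# All figures below are made up for illustration; they are not real prices.
SERVER_STORAGE_TB = 10        # storage that comes bundled in one server
SERVER_COMPUTE_UNITS = 4      # compute that comes bundled in one server
SERVER_COST = 1000            # cost of one bundled server

STORAGE_COST_PER_TB = 40      # separated model: pay per TB stored
COMPUTE_COST_PER_UNIT = 150   # separated model: pay per compute unit used


def bundled_cost(storage_tb, compute_units):
    """Buy whole servers until BOTH the storage and compute needs are met."""
    servers_for_storage = math.ceil(storage_tb / SERVER_STORAGE_TB)
    servers_for_compute = math.ceil(compute_units / SERVER_COMPUTE_UNITS)
    servers = max(servers_for_storage, servers_for_compute)
    return servers * SERVER_COST


def separated_cost(storage_tb, compute_units):
    """Pay for storage and compute independently."""
    return storage_tb * STORAGE_COST_PER_TB + compute_units * COMPUTE_COST_PER_UNIT


# A storage-heavy workload: lots of data, very little processing.
print(bundled_cost(100, 4))    # 10 servers are needed just to hold the data
print(separated_cost(100, 4))  # storage and compute are sized separately
```

In this made-up scenario the bundled setup pays for ten servers' worth of compute even though one would do, which is the underutilization problem described above.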
This was a significant achievement and was entirely made possible because of Snowflake’s unique architecture built on the cloud.
One Sentence Summary: Snowflake's claim to fame is that it gives you the flexibility to manage your storage and compute independently so that you can adjust and scale as your needs change.
2. Ability to Control Compute Resources
Clusters and Virtual Warehouses
In Snowflake, clusters and virtual warehouses are two important concepts for processing and managing data efficiently.
Imagine you have a large group of workers in a factory. Each worker specializes in a specific task, such as assembling products, packaging, or quality control. These workers form a cluster, which is a group of individuals with complementary skills working together towards a common goal.
Similarly, in Snowflake, a cluster is a group of computing resources that work together to process data.
Now, let's consider a virtual warehouse as a manager who oversees the workers in the factory. The virtual warehouse is a supporting resource that controls the cluster of compute resources. It determines how many computing nodes are active and the amount of computing power allocated to handle specific workloads.
Scaling Virtual Warehouses
The virtual warehouse can be scaled up or down based on the demand for processing resources. It's like adjusting the number of workers in the factory or providing them with additional tools and equipment when needed. Scaling up the virtual warehouse means adding more computing nodes to handle larger workloads, while scaling it down means reducing the number of nodes to save costs during periods of lower demand.
The virtual warehouse can also be scaled out by creating multi-cluster warehouses, which is like adding a whole other factory to your project. Instead of making an individual task go “faster”, this lets you complete multiple streams of work in parallel.
By managing the virtual warehouse, you can allocate resources effectively, ensuring that queries and data operations are executed efficiently and in a timely manner. It allows you to handle varying workloads, optimize performance, and control costs by scaling the computing resources as needed.
Snowflake even has a ton of nifty features to make your life easier, like auto-suspend, which shuts down a warehouse after a set period of inactivity, and auto-scale, which increases or decreases the number of clusters depending on the activity.
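To get a feel for why auto-suspend saves money, here is a simplified simulation. It is only a sketch under stated assumptions: queries are treated as instantaneous, the warehouse wakes up the moment a query arrives, and running_seconds is a hypothetical helper, not part of Snowflake. Real billing has more rules than this toy model covers.

```python
def running_seconds(query_times, auto_suspend_after):
    """Total seconds a warehouse stays running under a toy auto-suspend rule.

    query_times: sorted query arrival times in seconds (queries are assumed
        to finish instantly, which is a simplification).
    auto_suspend_after: idle seconds before the warehouse suspends itself.
    """
    total = 0
    running_since = None   # when the current running stretch began
    last_activity = None   # time of the most recent query

    for t in query_times:
        if running_since is None:
            # First query: the warehouse starts up.
            running_since = last_activity = t
        elif t - last_activity > auto_suspend_after:
            # The warehouse suspended during the gap; close out that stretch
            # and start a new one for this query.
            total += (last_activity + auto_suspend_after) - running_since
            running_since = last_activity = t
        else:
            # Still running: this query just resets the idle timer.
            last_activity = t

    if running_since is not None:
        # The final stretch ends once the idle timer runs out.
        total += (last_activity + auto_suspend_after) - running_since
    return total


# Two queries in the first minute, then one more an hour later.
queries = [0, 30, 3600]
print(running_seconds(queries, auto_suspend_after=60))
```

With a 60-second idle timer, the warehouse in this example runs for a couple of minutes in total instead of sitting idle for the whole hour between queries.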
How organizations use warehouses
One Sentence Summary: Snowflake enables you to manage workloads efficiently by quickly scaling your computing power up and down.
3. Micro-Partitions and Data Clustering
Snowflake's micro-partitions are an impressive feature that brings significant benefits. Think of micro-partitions as small containers that hold your data in an organized and efficient way within the larger table.
How micro-partitions work like small containers
Snowflake's micro-partitions group rows of data together and then organize the data within each group in an extremely smart way called columnar storage. Instead of storing data row by row, Snowflake arranges it column by column. This arrangement allows Snowflake to compress the data efficiently and process only the necessary columns when executing queries. It's like having all the relevant information right at your fingertips, making the queries much faster and more efficient.
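Here is a small sketch of the row-by-row versus column-by-column idea. The sample orders and the to_columns and run_length_encode helpers are invented for illustration, and run-length encoding is just a simple stand-in for compression; the article doesn't say which compression techniques Snowflake actually uses.

```python
# Row-by-row: each record keeps all of its fields together (made-up data).
rows = [
    {"order_id": 1, "region": "West", "amount": 20},
    {"order_id": 2, "region": "West", "amount": 35},
    {"order_id": 3, "region": "East", "amount": 15},
]


def to_columns(rows):
    """Turn a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Totalling the sales only has to read the "amount" column.
total_amount = sum(columns["amount"])

# Values in one column tend to repeat, which makes them easy to compress.
compressed_region = run_length_encode(columns["region"])

print(total_amount)
print(compressed_region)
```

Because similar values sit next to each other in a column, they squeeze down nicely, and a question about one column never has to touch the others.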
Another great feature of micro-partitions is their ability to automatically group similar data together. This grouping technique is called data clustering. Imagine you have a huge collection of data with different attributes, such as customer information or sales data. Data clustering organizes this information by putting similar pieces together. For example, it might group all the customers from a specific region or all the sales from a particular time period. By doing this, Snowflake can quickly find the specific data you need without searching through all the information. It's like having an organized library where you can easily find the book you're looking for.
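The "find it without searching everything" idea can be sketched like this. It assumes each partition remembers the smallest and largest value it holds, so a search can skip any partition whose range can't contain the answer. build_partitions and find are hypothetical helpers, and the partition size of 4 is tiny on purpose.

```python
def build_partitions(values, partition_size):
    """Split values, in the order given, into partitions with min/max notes."""
    partitions = []
    for start in range(0, len(values), partition_size):
        chunk = values[start:start + partition_size]
        partitions.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return partitions


def find(partitions, target):
    """Return (matches, partitions_scanned), skipping partitions by range."""
    matches = []
    scanned = 0
    for partition in partitions:
        # Only open a partition if the target could be inside its range.
        if partition["min"] <= target <= partition["max"]:
            scanned += 1
            matches.extend(v for v in partition["values"] if v == target)
    return matches, scanned


# Sales days 1 to 12, stored in order (well clustered) vs. jumbled up.
clustered = build_partitions(list(range(1, 13)), 4)
scattered = build_partitions([1, 12, 5, 2, 11, 6, 3, 10, 7, 4, 9, 8], 4)

print(find(clustered, 6))
print(find(scattered, 6))
```

Both searches find the same row, but the well-clustered table only opens one partition while the jumbled one has to open all three, because every jumbled partition's range happens to include day 6.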
The benefits of Snowflake's micro-partitions and data clustering go beyond just making queries faster. They also help save costs. Since Snowflake charges based on the amount of data stored and the processing required, micro-partitions and data clustering help reduce the amount of data that needs to be processed, resulting in lower costs. It's like having a streamlined process that saves time and resources.
One Sentence Summary: Snowflake's micro-partitions store your data in small sections, which makes your queries significantly faster and more efficient.
4. User Experience: It “Just Works”
When I talk to analysts, data engineers, and database administrators, what I hear about Snowflake all the time is that “it just works”. It’s an elegant and simple-to-use tool which lets small teams set up a database with very little effort. For instance, just connecting to an on-prem database often requires a VPN connection or other configuration. In contrast, since Snowflake is on the cloud, all you need to do is enter the URL in a browser.
Traditional on-premises databases also required significantly more overhead and could be a pain to manage. You had to select a cluster that had fixed processing and storage bundled together. If you were running out of space, you had to manually resize your cluster. Snowflake simplified the product experience and made it so that you could update the warehouse with a simple click. Enhancements such as these made life easier for data professionals and reduced the workload on database administrators.
Other useful features include:
Time Travel: Snowflake's time travel feature is like a data time machine, enabling you to revisit past versions of your data. It allows you to analyze data as it existed at different points in time, facilitating historical analysis, change investigation, error recovery, and compliance auditing.
Zero-Copy Cloning: Zero-copy cloning in Snowflake enables instant and efficient duplication of data without actually duplicating the underlying data. It's similar to creating virtual copies of a physical document that take up minimal space. This saves time, storage space, and resources, allowing you to perform tasks like development, testing, and analytics without affecting the original data.
Data Sharing: Snowflake excels at facilitating seamless and secure data sharing between organizations. It's like having a collaborative workspace where different teams can access and work with shared information while maintaining strict controls and permissions.
Data Marketplace: The data marketplace is like an online store where you can find and access a wide variety of curated datasets from different sources. You can discover datasets from various industries and domains, preview their details, and securely acquire them within the Snowflake platform.
Documentation: Not technically a feature, but Snowflake has really good documentation that’s clear and easy to understand with great examples. It makes your life a lot easier and saves a lot of time and headaches!
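Of the features above, zero-copy cloning is probably the hardest to picture, so here is a toy model of the idea. It assumes a table is just a list of pointers to chunks of data that never change, so a clone only needs to copy the pointers. The Table class and its methods are invented for illustration and say nothing about how Snowflake implements cloning internally.

```python
class Table:
    """A toy table: a list of references to immutable chunks of rows."""

    def __init__(self, partitions):
        # Each partition is a tuple of rows, so it cannot be changed in place.
        self.partitions = list(partitions)

    def clone(self):
        # Copy only the list of references; the partitions themselves are shared.
        return Table(self.partitions)

    def append_rows(self, rows):
        # New data goes into a new partition, so shared ones are never touched.
        self.partitions.append(tuple(rows))

    def rows(self):
        return [row for partition in self.partitions for row in partition]


# Made-up production data split across two partitions.
prod = Table([
    (("Ada", 120), ("Grace", 310)),
    (("Linus", 45),),
])

# The clone is created without copying any rows.
dev = prod.clone()
dev.append_rows([("Test User", 0)])

print(len(prod.rows()))  # the original is unchanged
print(len(dev.rows()))   # the clone has the extra test row
```

The clone starts out pointing at exactly the same chunks as the original, and only the changes made afterwards take up new space, which is why you can experiment on the copy without affecting the original data.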
One Sentence Summary: Snowflake "just works" with a simple and seamless user experience supported by a ton of useful features.
Glossary
Database: A digital storage system that helps organize and manage large amounts of data
Data Warehouse: A database that contains data from a variety of sources to be analyzed
Server: A powerful computer that stores and manages information for other computers and devices
On-Premises Server: A server that organizations built and maintained in their own facilities
Cloud Server: A server built by a third-party provider like Amazon or Microsoft and rented out to organizations
Storage: The amount of space available on servers to save data
Compute: The resources on a server used to process queries
Cluster: A group of computing resources that work together to process data
Virtual Warehouse: A supporting resource that controls the cluster of compute resources and can be scaled up and down
Micro-Partitions: Small containers that hold your data in an organized and efficient way within the larger table
Data Clustering: A feature that automatically groups similar data together in a table