On Nov 2, 2021, we announced that we had set the official world record for the fastest data warehouse with our Databricks SQL lakehouse platform. These results were audited and reported by the official Transaction Processing Performance Council (TPC) in a 37-page document available online at tpc.org. We also shared a third-party benchmark by the Barcelona Supercomputing Center (BSC) showing that Databricks SQL is significantly faster and more cost-effective than Snowflake.
A lot has happened since then: many congratulations, some questions, and some sour grapes. We take this opportunity to reiterate that we stand by our blog post and the results: Databricks SQL provides superior performance and price performance over Snowflake, even on data warehousing workloads (TPC-DS).
Snowflake’s response: “lacking integrity”?
Snowflake responded 10 days after our publication (last Friday), claiming that our results were “lacking integrity.” They then presented their own benchmarks, claiming that their offering has roughly the same performance and price at $267 as Databricks SQL at $242. At face value, this ignores the fact that they are comparing the price of their cheapest offering with that of our most expensive SQL offering. (Note that Snowflake’s “Business Critical” tier is 2x the cost of the cheapest tier.) They also gloss over the fact that Databricks can use spot instances, which most customers use, and bring the price down to $146. But none of this is the focus of this post.
The gist of Snowflake’s claim is that they ran the same benchmarks as BSC and found that they could run the whole benchmark in 3,760 seconds, versus the 8,397 seconds that BSC measured. They even urged readers to sign up for an account and try it out for themselves. After all, the TPC-DS dataset comes with Snowflake out of the box, and they even have a tutorial on how to run it. So it should be easy to verify the results. We did exactly that.
First, we want to commend Snowflake for following our lead and removing the DeWitt clause, which had prohibited competitors from benchmarking their platform. Thanks to this, we were able to get a trial account and verify the basis for the claims of “lacking integrity”.
Reproducing TPC-DS on Snowflake
We logged into Snowflake and ran Tutorial 4 for TPC-DS. The results did closely match what they claimed: 4,025 seconds, much faster than the 8,397 seconds in the BSC benchmark. But what unfolded next is much more interesting.
While performing the benchmarks, we noticed that the Snowflake pre-baked TPC-DS dataset had been recreated two days after our benchmark results were announced. An important part of the official benchmark is to verify the creation of the dataset. So, instead of using Snowflake’s pre-baked dataset, we uploaded an official TPC-DS dataset and used the same schema that Snowflake uses on its pre-baked dataset (including the same clustering column sets), on the same cluster size (4XL). We then ran and timed the POWER test three times. The first cold run took 10,085 seconds, and the fastest of the three runs took 7,276 seconds. To recap: we loaded the official TPC-DS dataset into Snowflake, timed how long it takes to run the POWER test, and it took 1.9x longer (best of three) than what Snowflake reported in their blog.
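The ratios behind that recap can be checked with a few lines of arithmetic. The sketch below uses only the timings quoted in this post; the variable and function names are ours, chosen for illustration.

```python
# Timings in seconds, all taken from the figures quoted in this post.
snowflake_reported = 3_760    # Snowflake's blog, pre-baked dataset
our_prebaked_run = 4_025      # our run of Tutorial 4 on the pre-baked dataset
official_best_of_3 = 7_276    # fastest of our three POWER runs, official dataset
official_cold_run = 10_085    # first (cold) POWER run, official dataset


def slowdown(measured: float, baseline: float) -> float:
    """Return how many times longer `measured` took than `baseline`."""
    return measured / baseline


best_vs_reported = slowdown(official_best_of_3, snowflake_reported)
cold_vs_reported = slowdown(official_cold_run, snowflake_reported)
prebaked_vs_reported = slowdown(our_prebaked_run, snowflake_reported)

print(f"official dataset, best of 3 vs reported: {best_vs_reported:.2f}x")
print(f"official dataset, cold run vs reported:  {cold_vs_reported:.2f}x")
print(f"pre-baked dataset, our run vs reported:  {prebaked_vs_reported:.2f}x")
```

The best-of-three ratio rounds to the 1.9x stated above, while our own run on the pre-baked dataset lands within about 7% of Snowflake's reported time.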
These results can easily be verified by anyone. Get a Snowflake account and use the official TPC-DS scripts to generate a 100 TB data warehouse. Ingest those files into Snowflake. Then run a few POWER runs and measure the time for yourself. We bet the results will be closer to 7,000 seconds, or even higher if you don’t use their clustering columns (see next section). You can also just run the POWER test on the dataset they ship with Snowflake. Those results will likely be closer to the time they reported in their blog.
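For readers who want to time the runs themselves, here is a minimal timing sketch. It is not the official TPC-DS harness and does not reproduce the specification's rules. It only shows the cold-run and best-of-three bookkeeping described above. `run_query` is a hypothetical callable standing in for whatever client you use to submit SQL, and the query strings in the demo are placeholders.

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_power_run(queries: Sequence[str], run_query: Callable[[str], None]) -> float:
    """Run each query once, sequentially, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for sql in queries:
        run_query(sql)
    return time.perf_counter() - start


def best_of(n: int, queries: Sequence[str],
            run_query: Callable[[str], None]) -> Tuple[float, List[float]]:
    """Repeat the sequential run n times; return (fastest, all timings in order).

    The first entry of the timings list is the cold run.
    """
    timings = [time_power_run(queries, run_query) for _ in range(n)]
    return min(timings), timings


# Demo with a stub in place of a real warehouse connection (assumption:
# a real `run_query` would block until the query has finished).
executed: List[str] = []


def stub_run_query(sql: str) -> None:
    executed.append(sql)


placeholder_queries = ["SELECT 1", "SELECT 2"]  # stand-ins, not TPC-DS queries
fastest, all_timings = best_of(3, placeholder_queries, stub_run_query)
print(f"cold run: {all_timings[0]:.6f}s, best of 3: {fastest:.6f}s")
```

Reporting both the first (cold) timing and the minimum mirrors how the numbers in this post are presented.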
Why official TPC-DS
Why is there such a big discrepancy between running TPC-DS on the pre-baked dataset in Snowflake and loading the official dataset into Snowflake? We don’t know exactly. But how you lay out your data significantly impacts TPC-DS and, in general, all workloads. In most systems, clustering or partitioning the data for a specific workload (e.g., sorting by the combination of fields used in a query) can improve performance for that workload, but such optimizations come with additional cost. That time and cost need to be included in the benchmark results.
It is for this reason that the official benchmark requires you to report the time it takes to load the data into the data warehouse, so that it can correctly account for any time and cost the system spends optimizing the layout. For some storage schemes, this time can be significantly longer than the POWER test queries themselves. The official benchmark also includes data updates and maintenance, just like real-world datasets and workloads (how often do you query a dataset that never changes?). This is all done to prevent the following scenario: a system spends massive resources optimizing a static dataset offline for an exact set of immutable workloads, and can then run those workloads super quickly.
In addition, the official benchmark requires reproducibility. That’s why you can find all the code to reproduce our record in the submission.
This brings us to our final point. We agree with Snowflake that benchmarks can quickly devolve into industry players “adding configuration knobs, special settings, and very specific optimizations that would improve a benchmark”. Everyone looks really good in their own benchmarks. So instead of taking any one vendor’s word on how good they are, we challenge Snowflake to participate in the official TPC benchmark.
When we decided to participate in this benchmark, we set a constraint for our engineering team: unlike past entries, they should use only commonly applied optimizations done by virtually all our customers. They were not allowed to apply any optimizations that would require a deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns). This matches real-world workloads and what most customers would like to see: a system that achieves great performance without tuning.
If you read our submission in detail, you can find reproducible steps that match how a typical customer would like to manage their data. Minimizing the effort to get productive with a new dataset was one of our top design goals for Databricks SQL.
A final word from us at Databricks. As co-founders, we care deeply about delivering the best value to our customers, and about the software we build to solve their business needs. Benchmark results that don’t match our understanding of the world can lead to an emotional or visceral reaction. We try not to let that get the best of us. We will seek the truth and publish end-to-end results that are verifiable. We therefore won’t accuse Snowflake of lacking integrity in the results they published in their blog. We only ask them to verify their results with the official TPC council.
Our primary motivation for participating in the official TPC data warehousing benchmark was not to prove which data warehouse is faster or cheaper. Rather, we believe that every enterprise should be able to become data driven the way the FAANG companies are. Those companies don’t build on data warehouses. They instead have a much simpler data strategy: store all data (structured, text, video, audio) in open formats and use a single copy for all kinds of analytics, be it data science, machine learning, real-time analytics, or classic business intelligence and data warehousing. They don’t do everything in just SQL. Rather, SQL is one of the key tools in their arsenal, together with Python, R, and a slew of other tools in the open-source ecosystem that leverage their data. We call this paradigm the Data Lakehouse. The Data Lakehouse, unlike data warehouses, has native support for data science, machine learning, and real-time streaming. But it also has native support for SQL and BI. Our goal was to dispel the myth that the Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark. We are therefore very happy that the Data Lakehouse paradigm provides superior performance and price over data warehouses, even on classic data warehousing workloads (TPC-DS). This will benefit enterprises, who no longer need to maintain multiple data lakes, data warehouses, and streaming systems to manage all of their data. This simple architecture enables them to redeploy their resources toward solving the business needs and problems that they face every day.