Insights Gleaned from Running Databricks SQL Serverless alongside dbt
=========================================================================================
A recent experiment compared the performance, cost, and management of Databricks' SQL Classic, SQL Pro, and SQL Serverless warehouse offerings. The study simulated a real company's approach to running a SQL warehouse and varied two parameters: warehouse size and the number of dbt threads.
The key differences between these options lie in their performance features, cost structure, and management style.
Performance Features
The Photon engine, a vectorized query engine, is supported by all three options and speeds up SQL and DataFrame API workloads. Predictive IO, which optimizes selective scans, is available in Pro and Serverless but not in Classic. Intelligent Workload Management (IWM), which uses AI to manage resources dynamically so that large query volumes are handled efficiently, is exclusive to Serverless.
Cost and Management
Classic relies on manual or automated cluster management, which is prone to idle compute costs. Pro provides enhanced performance but still lacks the full serverless benefits. Serverless warehouses, on the other hand, fully manage resource allocation and auto-scale on demand. They launch clusters quickly when a query starts and shut them down soon afterwards, which largely avoids paying for idle compute and gives better cost control and operational simplicity.
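The effect of idle time on the bill can be made concrete with a little arithmetic. The following is a minimal sketch, not Databricks' actual billing formula: it assumes cost is simply billed hours times DBUs per hour times price per DBU, and the function name, rates, and durations are hypothetical placeholders.

```python
def warehouse_cost(busy_minutes: float, idle_minutes: float,
                   dbu_per_hour: float, usd_per_dbu: float) -> float:
    """Cost of a warehouse that is billed for every minute it is up, busy or idle.

    Assumption: cost = billed hours * DBUs per hour * price per DBU.
    """
    billed_hours = (busy_minutes + idle_minutes) / 60
    return billed_hours * dbu_per_hour * usd_per_dbu


# Hypothetical numbers: 30 minutes of real work at 24 DBU/hour and $0.50 per DBU.
with_idle = warehouse_cost(30, idle_minutes=30, dbu_per_hour=24, usd_per_dbu=0.5)
without_idle = warehouse_cost(30, idle_minutes=0, dbu_per_hour=24, usd_per_dbu=0.5)

print(with_idle)     # 12.0
print(without_idle)  # 6.0
```

In this toy case, half an hour of idle time doubles the cost of the same work. Real per-DBU prices differ between warehouse types, so a fair comparison would use a different `usd_per_dbu` for each.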
Use Case Suitability
Serverless is generally recommended for most workloads because of its ease of use, better elasticity, cost efficiency, and out-of-the-box performance optimizations. Classic is largely legacy and suits cases where manual control is preferred or legacy pipelines exist. Pro serves as a middle ground, with added performance features but less automation and cost efficiency than Serverless.
The experiment used Databricks Workflows with a Task Type of dbt as the runner for the dbt CLI, and all jobs were executed concurrently. The results showed that black-box "serverless" systems can produce some performance anomalies.
The workload selected for the experiment was the TPC-DI benchmark with a scale factor of 1000. For smaller warehouses, changing the number of threads changed the duration by roughly ±10% but had little effect on cost. For larger warehouses, the number of threads significantly affected both cost and runtime.
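A sweep like this one can be described as a grid over warehouse type, warehouse size, and dbt thread count. The sketch below shows one way to enumerate such a grid. The specific sizes, thread counts, and schema naming scheme are assumptions for illustration; the source does not list the exact values that were tested.

```python
from itertools import product

# Hypothetical sweep values, not the experiment's actual settings.
WAREHOUSE_TYPES = ["classic", "pro", "serverless"]
WAREHOUSE_SIZES = ["Small", "Medium", "Large"]
DBT_THREADS = [4, 8, 16]


def build_runs():
    """Return one run configuration per (type, size, threads) combination."""
    return [
        {
            "warehouse_type": wh_type,
            "warehouse_size": size,
            "threads": threads,
            # Hypothetical naming scheme that gives each run its own schema.
            "schema": f"tpcdi_{wh_type}_{size.lower()}_{threads}",
        }
        for wh_type, size, threads in product(WAREHOUSE_TYPES, WAREHOUSE_SIZES, DBT_THREADS)
    ]


runs = build_runs()
print(len(runs))  # 27
print(runs[0])
```

Each dictionary could then be turned into one job definition. Giving every run a distinct schema keeps concurrent runs from overwriting each other's tables.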
Serverless was found to be the best option in terms of both cost and runtime for the TPC-DI benchmark. Medium turned out to be the optimal warehouse size: larger warehouses did not improve runtime and may even have been slower. The cost graph showed that costs rose sharply beyond the medium size, because more expensive machines were used while the runtime stayed constant.
In the experiment, each job spun up a new SQL warehouse and tore it down afterwards, and each ran in its own schema within the same Unity Catalog. The plot of cost versus total job duration showed that larger warehouses typically led to shorter durations but higher costs.
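One way to read a cost-versus-duration plot is to keep only the runs that are not beaten on both axes at once, that is, the Pareto front. The sketch below is an illustration of that idea, not the author's analysis. The `Run` type and the numbers are made up and are not the experiment's results.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str
    duration_min: float
    cost_usd: float


def pareto_front(runs):
    """Keep runs that no other run beats or ties on both duration and cost
    while being strictly better on at least one of them."""
    front = []
    for r in runs:
        dominated = any(
            o.duration_min <= r.duration_min
            and o.cost_usd <= r.cost_usd
            and (o.duration_min < r.duration_min or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.duration_min)


# Made-up numbers for illustration only.
runs = [
    Run("small", 60, 10),
    Run("medium", 40, 12),
    Run("large", 40, 20),
    Run("xlarge", 35, 45),
]

front_labels = [r.label for r in pareto_front(runs)]
print(front_labels)  # ['xlarge', 'medium', 'small']
```

In this toy data, "large" drops out because "medium" finishes in the same time for less money. The remaining runs are genuine trade-offs between speed and cost.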
In conclusion, Databricks SQL Serverless warehouses offer the best combination of performance, automation, cost efficiency, and ease of management, especially for modern, scalable SQL workloads. Classic remains basic with manual handling and potential idle costs, while Pro introduces performance enhancements but lacks full dynamic scaling and AI-driven workload management.