I attended a webinar on Azure Databricks yesterday. The only link I have is a Zoom registration, so you’ll have to search; I’m sure they’ll run it again. The hands-on lab made it very worthwhile. The rest was marketing fluff. They did a good job on the lab, and I was able to follow along during the session. They left it up afterward, so I could complete what I thought was some unfinished work.
The lab got us running quickly, with an already-created Databricks cluster and individualized credentials. That was great. The notebook experience is a lot like others such as Jupyter or Azure Data Studio, but it allows mixed languages. In this case everything was in a dialect of SQL. The first 35 cells brought in tables from KKBox’s Churn Prediction Challenge on Kaggle (https://www.kaggle.com/c/kkbox-churn-prediction-challenge/) and stored them in an Azure Data Lake database.
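For readers who haven’t seen a Databricks notebook, the mixed-language support works through per-cell magic commands: the notebook has a default language, and a cell can opt into another one. A minimal sketch — the cell contents here are illustrative, not taken from the lab:

```
%sql
-- A SQL cell: query a table registered in the metastore
SELECT COUNT(*) FROM transactions;

%python
# A Python cell in the same notebook, reading the same table via Spark
df = spark.table("transactions")
print(df.count())
```

Everything in the lab stayed in the SQL cells, but it’s handy to know the other languages are one magic command away.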
The retrieval time was kind of slow. Of course, that’s partly a function of the number of cores, and the maximum was set to two. The scatter plot that hit all 14 million transaction rows took 50 seconds. It was a setup: it wasn’t supposed to be fast, because they were going to show Delta Lake next.
They created Delta Lake tables but never got around to running the queries against them. They left the lab up and running for another 24 hours, so I gave it a try. The translation to Delta Lake was a bit less than obvious, because the Delta Lake table definitions changed column names and types, but that was easy enough to overcome. Their changes were reasonable; they just had to be translated. I tried the scatter plot example because it reports on an aggregation that reads 14 million transaction rows. That dropped the time from 50 seconds down to 13 seconds: roughly 3.8x, which is a good improvement. Not the 100x hint that they dropped, but hey, this is marketing. And they claim near-linear scaling.
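I don’t have the lab’s exact DDL, but the shape of the conversion was something like the following. Table and renamed column names here are my own illustration; the base columns (msno, transaction_date, actual_amount_paid) come from the Kaggle dataset:

```sql
-- Create a Delta Lake copy of the raw transactions table. Delta stores
-- the data as Parquet files plus a transaction log, which enables the
-- faster scans the presenters were setting up to demonstrate.
CREATE TABLE transactions_delta
USING DELTA
AS SELECT
  msno,
  CAST(transaction_date AS DATE) AS trans_date,  -- the lab also retyped columns
  actual_amount_paid
FROM transactions;

-- Compact small files for faster subsequent scans
OPTIMIZE transactions_delta;
```

Queries then point at the Delta table instead of the raw one, renaming columns as needed.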
I couldn’t resist the challenge. How would SQL Server do?
The KKBox files are public, so I downloaded them, imported them into SQL Server, and, with a few minor syntax adjustments, ran the query. SQL Server was done in 2 seconds. That wasn’t fair: I was using my 8-core desktop running at 4.2 GHz. So I limited the cores with OPTION (MAXDOP 2), and the time was 8 seconds. That still beats the 13 seconds for Delta Lake, but the cores in the cloud are probably running at half the speed, so it’s getting close. A crude comparison, for sure, but I can see that Databricks achieves reasonable performance.
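I didn’t keep the exact query, but the T-SQL version looked roughly like this; the aggregation columns are illustrative, while the MAXDOP hint is exactly what I used to cap the query at two cores:

```sql
-- Aggregate the ~14 million transaction rows, limited to two cores
-- to roughly match the lab cluster's core limit.
SELECT transaction_date,
       COUNT(*)                AS txn_count,
       SUM(actual_amount_paid) AS total_paid
FROM dbo.transactions
GROUP BY transaction_date
OPTION (MAXDOP 2);
```

MAXDOP only caps the degree of parallelism for this one statement, which makes it a convenient knob for this kind of back-of-the-envelope comparison.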
Overall, my impression of Databricks is favorable. The development experience in a notebook isn’t quite what I’m used to in SQL Server, with SSMS’s IntelliSense and Redgate’s SQL Prompt helping out. But I could find everything that I expected, including impressively detailed information on how each query was processed. The performance was comparable. It’s in-memory and scales out, so comparisons with much larger datasets are going to be interesting.