Streaming Data Using Spark with Dustin Vannoy

Data engineering has historically involved extracting data from disparate sources, transforming it to a standard layout, and then loading it into a new database for analytics. These data engineering pipeline jobs would usually run on a schedule, such as nightly or weekly. In today’s fast-paced, high-tech world, however, the need for data closer to real time, meaning close to when it was first generated, is higher than ever. In today’s episode we hear from Dustin Vannoy, a consultant and blogger in the streaming data space, about how to use Apache Spark, one of the most popular streaming analytics platforms.

How to connect with Dustin:

Learn data skills at our academy and elevate your career. Start for free at

0:00:00 Intro
0:01:01 Dustin’s Background
0:09:51 Transitioning from legacy databases to Big Data and Streaming
0:13:29 Microbatching vs Streaming
0:18:17 What is Spark and why use it?
0:22:33 Apache Spark vs Databricks
0:26:24 Pay for a hosted Spark version or roll your own?
0:28:27 Databricks setup
0:30:25 How Databricks executes queries
0:32:41 Scaling approaches to Spark
0:35:14 Connecting to external databases in Databricks
0:37:51 Visualizing data in Databricks
0:39:40 Using Spark for ETL work
0:42:50 What is real-time processing?
0:44:25 How to build a streaming job in Spark using Kafka
0:46:18 Streaming architecture overview
0:49:15 Pulling data from Kafka into Spark streaming
0:51:09 Why apps use Kafka
0:54:33 Why use Spark versus alternatives
0:57:37 What is Confluent?
0:59:38 Ways to learn Spark
1:02:04 How hard is Spark to learn?
1:04:16 Troubleshooting errors in Spark
1:07:03 How hard is it to transition to Spark from traditional databases?
1:11:51 Interviewing for a Spark job
1:15:46 Outro