Building and Scaling Robust Zero-code IoT Streaming Data Pipelines

With the rapid onset of the global COVID-19 pandemic in 2020, the US Centers for Disease Control and Prevention (CDC) quickly implemented a new COVID-19 pipeline to collect testing data from all of the US states and territories and produce multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka.

Inspired by this story, we built two demonstration streaming pipelines for ingesting, storing, and visualizing public IoT data (tidal data from NOAA, the National Oceanic and Atmospheric Administration) using multiple open source technologies. The common ingestion technologies were Apache Kafka, Apache Kafka Connect, and the Apache Camel Kafka Connector, supplemented with Prometheus and Grafana for monitoring. The first experiment used Open Distro for Elasticsearch and Kibana as the target storage and visualization technologies, while the second used PostgreSQL and Apache Superset.
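As a taste of the zero-code approach, a Kafka Connect source connector for such a pipeline can be configured entirely with properties, no application code required. The sketch below is illustrative only: the connector class, the `camel.source.url` property, and the station/query parameters are assumptions to be checked against the Apache Camel Kafka Connector documentation for your version, not the exact configuration used in the talk (the NOAA Tides & Currents REST endpoint itself is real).

```properties
# Hypothetical Kafka Connect source connector config (standalone mode).
# Connector class and camel.* property names are assumptions -- verify
# against the Apache Camel Kafka Connector docs for your version.
name=noaa-tidal-source
connector.class=org.apache.camel.kafkaconnector.CamelSourceConnector
tasks.max=1

# Kafka topic the tidal observations are written to
topics=tidal-data

# Camel endpoint URI polling the NOAA Tides & Currents REST API
# (station ID shown is just an example)
camel.source.url=https://api.tidesandcurrents.noaa.gov/api/prod/datagetter?station=8724580&product=water_level&datum=mlw&units=metric&time_zone=gmt&format=json&date=latest

# Treat records as plain strings (the API returns JSON text)
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
```

A matching sink connector (Elasticsearch or JDBC/PostgreSQL) on the other side of the topic completes the pipeline, again configured purely through properties.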

In this talk we introduce each technology and the pipeline architecture, then walk through the steps we followed, the challenges we encountered, and the solutions we used to build reliable and scalable pipelines and visualize the results (including tidal periods, ranges, and locations). We compare and contrast the two approaches, focusing on exception handling, scalability, performance, and monitoring, and on the pros and cons of the two visualization technologies (Kibana and Superset).

Paul Brebner

Technology Evangelist at Instaclustr. I focus on developing and blogging about example applications for the Instaclustr managed open source services, including Apache Cassandra, Spark, Kafka, Elasticsearch, and Redis.
  • Date: Oct 26, 10:00 (US Pacific Time)
  • Fee: Free
  • Available Seats: 342 (max 500)