https://flic.kr/p/S3jt5j

Building and Testing a new Apache Airflow Plugin

Recently, I had the opportunity to add a new EMR on EKS plugin to Apache Airflow. While I’ve been a consumer of Airflow over the years, I had never contributed directly to the project. And weighing in at over half a million lines of code, Airflow is a pretty complex project to wade into. So here’s a guide on how I made a new operator in the AWS provider package. Overview Before you get started, it’s good to have an understanding of the different components of an Airflow task....

 · 8 min
Example output of Air Quality Data

Build your own Air Quality Monitor with OpenAQ and EMR on EKS

Fire season is fast approaching, and as somebody who spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis. With Amazon EMR on EKS, you can now customize and package your own Apache Spark dependencies, and I use that functionality for this post. Overview OpenAQ maintains a publicly accessible dataset of various air quality metrics that’s updated every half hour....

 · 12 min
stacked rocks on a beach, credit: https://flic.kr/p/LhbFfr

Big Data Stack with CDK

I wanted to write a post about how I built my own Apache Spark environment on AWS using Amazon EMR, Amazon EKS, and the AWS Cloud Development Kit (CDK). This stack also creates an EMR Studio environment that can be used to build and deploy data notebooks. Disclaimer: I work for AWS on the EMR team and built this stack for my various demos; it is not intended for production use cases....

 · 9 min