Recently, I had the opportunity to add a new EMR on EKS plugin to Apache Airflow. While I’ve been a consumer of Airflow over the years, I’d never contributed directly to the project. And weighing in at over half a million lines of code, Airflow is a pretty complex project to wade into. So here’s a guide to how I made a new operator in the AWS provider package.
Overview
Before you get started, it’s good to have an understanding of the different components of an Airflow task....
Build your own Air Quality Monitor with OpenAQ and EMR on EKS
Fire season is fast approaching, and as somebody who spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis.
With Amazon EMR on EKS, you can now customize and package your own Apache Spark dependencies, and I use that functionality in this post.
Overview
OpenAQ maintains a publicly accessible dataset of various air quality metrics that’s updated every half hour....
Big Data Stack with CDK
I wanted to write a post about how I built my own Apache Spark environment on AWS using Amazon EMR, Amazon EKS, and the AWS Cloud Development Kit (CDK). This stack also creates an EMR Studio environment that can be used to build and deploy data notebooks.
Disclaimer: I work for AWS on the EMR team and built this stack for my various demos; it is not intended for production use cases....