THE CLOUD

"Serverless" Analytics of Twitter Data with MSK Connect and Athena

Like many, I was recently drawn into a simple word game by the name of “Wordle”. Also, like many, I wanted to dive into the analytics of all the yellow, green, and white-or-black-depending-on-your-dark-mode blocks. While you can easily query tweet volume using the Twitter API, I wanted to dig deeper. And the tweets were growing… Given the recent announcement of MSK Connect, I wanted to see if I could easily consume the Twitter Stream into S3 and query the data with Athena....

 · 6 min
A reflective lake

An Introduction to Modern Data Lake Storage Layers

In recent years we’ve seen a rise in new storage layers for data lakes. In 2017, Uber announced Hudi - an incremental processing framework for data pipelines. In 2018, Netflix introduced Iceberg - a new table format for managing extremely large cloud datasets. And in 2019, Databricks open-sourced Delta Lake - originally intended to bring ACID transactions to data lakes. 📹 If you’d like to watch a video that discusses the content of this post, I’ve also recorded an overview here....

 · 13 min

SSH to EC2 Instances with Session Manager

I’m kind of an old-school sys admin (aka, managed NT4 in the ’90s), so I’m really used to SSH’ing into hosts. More often than not, however, I’m working with AWS EC2 instances in a private subnet. If you’re not familiar with it, AWS Systems Manager Session Manager is a pretty sweet feature that allows you to connect remotely to EC2 instances with the AWS CLI, without needing to open up ports for SSH or utilize a bastion host....
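As a minimal sketch of what that looks like in practice (the instance ID is a placeholder, and the Session Manager plugin for the AWS CLI must be installed), you can open a session directly, or route regular SSH through Session Manager with an SSH config entry:

```shell
# Open an interactive shell on a private-subnet instance, no open ports needed
# (instance ID is a placeholder)
aws ssm start-session --target i-0123456789abcdef0

# Or tunnel plain SSH/SCP through Session Manager via ~/.ssh/config:
# Host i-* mi-*
#     ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
```

With the `ProxyCommand` entry in place, `ssh ec2-user@i-0123456789abcdef0` works as if the instance were directly reachable.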

 · 3 min
Skipping stones on the data lake...

Updating Partition Values With Apache Hudi

If you’re not familiar with Apache Hudi, it’s a pretty awesome piece of software that brings transactions and record-level updates/deletes to data lakes. More specifically, if you’re doing analytics with S3, Hudi provides a way for you to consistently update records in your data lake, which historically has been pretty challenging. It can also optimize file sizes, allow for rollbacks, and make streaming CDC data impressively easy.

Updating Partition Values

I’m learning more about Hudi and was following this EMR guide to working with a Hudi dataset, but the “Upsert” operation didn’t quite work as I expected....
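For reference, the behavior hinges on Hudi's index configuration: with a non-global index, an upsert whose partition value changed is treated as a brand-new insert in the new partition, while a global index can instead move the record. A sketch of the relevant write options (the table and field names here are hypothetical; the option keys are real Hudi configs):

```python
# Hudi write options for upserting a record whose partition value has changed.
# Table/field names below are placeholders for illustration.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "category",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # A global index matches record keys across all partitions...
    "hoodie.index.type": "GLOBAL_BLOOM",
    # ...and this tells Hudi to delete the record from its old partition
    # and insert it into the new one, rather than updating in place.
    "hoodie.bloom.index.update.partition.path": "true",
}

# These would be passed to a Spark writer along the lines of:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```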

 · 3 min
Jupyter Notebook Continuous Deployment Architecture

Continuous Deployment of Jupyter Notebooks

This is a guide on how to use AWS CodePipeline to continuously deploy Jupyter notebooks to an S3-backed static website.

Overview

Since I started using EMR Studio, I’ve been making more use of Jupyter notebooks as scratch pads and often want to be able to easily share the results of my research. I hunted around for a few different solutions and while there are some good ones like nbconvert and jupytext, I wanted something a bit simpler and off-the-shelf....

 · 4 min
Example output of Air Quality Data

Build your own Air Quality Monitor with OpenAQ and EMR on EKS

Fire season is fast approaching, and as somebody who spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis. With Amazon EMR on EKS, you can now customize and package your own Apache Spark dependencies, and I use that functionality for this post.

Overview

OpenAQ maintains a publicly accessible dataset of various air quality metrics that’s updated every half hour....

 · 12 min
stacked rocks on a beach, credit: https://flic.kr/p/LhbFfr

Big Data Stack with CDK

I wanted to write a post about how I built my own Apache Spark environment on AWS using Amazon EMR, Amazon EKS, and the AWS Cloud Development Kit (CDK). This stack also creates an EMR Studio environment that can be used to build and deploy data notebooks. Disclaimer: I work for AWS on the EMR team and built this stack for my various demos; it is not intended for production use cases....

 · 9 min