AWS
I’ve had two roles at AWS: Big Data Architect and Developer Advocate. In both roles, I’ve created a few different projects on top of EMR, Glue, and Athena....
I’ve written or collaborated on many different blog posts while at AWS. This is a list of a few of them.
Easily query AWS service logs using Amazon Athena - One of the first tools I built at AWS was a set of Glue scripts to parse, process, and convert AWS service logs for VPC Flow Logs, CloudTrail, AWS Load Balancers, CloudFront, and S3 Access Logs.
Announcing Amazon EMR Serverless (Preview) - Launch post for a new serverless service for EMR....
As part of my role as a developer advocate, I’ve created several different open source tools or integrations to make working with Amazon EMR easier for data engineers and other data wranglers.
Amazon EMR CLI
The EMR CLI is an open-source command-line interface that makes packaging, deploying, and running jobs across all EMR deployment models as simple as an emr run. The tool supports PySpark projects and automatically bundles the required dependencies in a consistent manner whether you’re using EMR on EC2, EMR on EKS (coming soon), or EMR Serverless....
AWS service logs come in all different formats. Ideally they could all be queried in place by Athena and, while some can, for cost and performance reasons it can be better to convert the logs into partitioned Parquet files. The general approach is that for any given type of service log, we have Glue Jobs that can do the following:
Create source tables in the Data Catalog
Create destination tables in the Data Catalog
Convert the source data to partitioned Parquet files
Maintain new partitions for both tables
This library was created as part of my role as a Big Data Architect and is available at awslabs/athena-glue-service-logs....
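To make "partitioned" concrete, here is a minimal sketch of how a converted record could be routed to a Hive-style partition prefix that Athena can prune on. It is not taken from athena-glue-service-logs: the function name `partition_path`, the `year=/month=/day=` layout, and the example bucket are all assumptions for illustration.

```python
from datetime import datetime, timezone


def partition_path(prefix: str, event_time: datetime) -> str:
    """Return a Hive-style partition prefix for one log record.

    Hypothetical helper: the year/month/day layout is an assumed
    convention, not necessarily what the Glue jobs above produce.
    """
    # Treat naive timestamps as UTC so the result never depends on local time.
    if event_time.tzinfo is None:
        event_time = event_time.replace(tzinfo=timezone.utc)
    ts = event_time.astimezone(timezone.utc)
    return f"{prefix.rstrip('/')}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


if __name__ == "__main__":
    # Example bucket and prefix are placeholders.
    when = datetime(2023, 1, 5, 23, 59, tzinfo=timezone.utc)
    print(partition_path("s3://example-bucket/converted/vpc_flow", when))
```

With a layout like this, a query that filters on year, month, and day only has to read the Parquet files under the matching prefixes.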
Athena SQLite is a project that allows you to query SQLite databases in S3 using Athena’s Federated Query functionality. Install it from the Serverless Application Repository: AthenaSQLiteConnector.
Wait, what?! SQLite in S3? Yea! SQLite is quite possibly the most prevalent database in the world, so it’s not unusual for me to have various SQLite files lying around. This Athena data connector allows you to query those databases directly from Athena. Cool! Right?! One of the fun things about this project is that SQLite is not intended to be a network database....
I’m a fan of making the code I run in presentations publicly available. Most of the demo code I use is available in my demo-code repository on GitHub. In there, you’ll find such interesting things as:
EMR CloudFormation Templates
EMR on EKS notes
EMR Studio notes
And even a big data CDK stack for easy deployment of my demos ...
I love building. I often whip up quick little utilities to make my life easier or sometimes just to have fun with a project. I believe ideas should be freely available - it’s the execution of an idea that turns it into something larger. As such, I try to document all the random ideas I have in my personal GitHub. Feel free to go check them out and let me know if you find anything interesting....
If you weren’t aware, Bing.com has an awesome image of the day. Even better, they have a daily quiz(!) for every image. I like both a/ beautiful wallpapers and b/ mini-quizzes, so I wrote a little macOS menubar app that updates the wallpaper daily with the Bing image of the day and also gives you a little link where you can take the quiz.
Installation
Either download from my GitHub releases or use Homebrew....
What is it?
A simple calculator geared towards converting data sizes. For example, I often need to do some conversions from bytes to something more human-readable:
585828 👇 572.10 KB
You can even do math! Say we transferred 233033728 bytes in 7 days, we can divide by 7 to get the per-day number.
233033728/7 👇 31.75 MB
Where can I find it?
I’ve got a version anybody can use at dacort....
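For the curious, the conversion itself can be sketched in a few lines of Python. This is not the calculator's actual source: the name `human_size` and the unit list are assumptions, and it assumes 1024-based units, which is what the 585828 example above implies.

```python
# Unit labels are an assumption; 1024-based steps match the 585828 example.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_size(num_bytes: float) -> str:
    """Format a byte count using the largest unit that keeps the value under 1024."""
    value = float(num_bytes)
    unit = UNITS[0]
    for unit in UNITS:
        # Stop once the value fits in this unit, or when we run out of units.
        if abs(value) < 1024 or unit == UNITS[-1]:
            break
        value /= 1024
    return f"{value:.2f} {unit}"


if __name__ == "__main__":
    print(human_size(585828))
```

Because the function accepts a float, the result of an expression such as a total divided by a number of days can be passed in directly.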