Why do a Big Data Project over the Holiday?

In 2021 the machine learning market was a little over $15B. That is projected to increase 10x between now and 2028. It’s the fastest growing area of technology (think mobile 10 years ago) and therefore it is top of mind for my clients. In addition, the sophisticated (read as expensive) hardware, software, and staff required to do on-premises, original machine learning are cost prohibitive for many companies. I believe that, increasingly, “access to the hardware and off-the-shelf software that are provided by the hyperscalers” will become one of the primary reasons clients begin or accelerate their cloud journey, right alongside “closing a datacenter” or “decreasing time-to-market” or “increasing availability.”

I’m certainly not new to creating cloud environments to support machine learning. I have created several Kubernetes clusters and cloud environments across multiple clients with the explicit goal of supporting their AI/ML or Big Data efforts. In spite of that, I had little knowledge of what actually happened in those environments. With that in mind, I decided to embark on building an AI-based “player” for the fantasy/gambling app that I already use to keep my hands-on skills sharp.

Introducing Book-E the robot gambler.

As many of you know, I currently run an “app” that lets my friends and me keep score on our football predictions. It’s described reasonably well on the homepage (https://lthoi.com/). The TL;DR version is that it allows players to choose wagers that should have even odds (they are coin flips) and then forces each of the other players in the game to take a portion of the other side of the wager. So, our AI/ML “player” in the game will have to pick which over/under and spread bets it wants to make each week. In order to have some fun with this, I will call our AI/ML player “Book-E”.
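To make the “other players take the other side” idea concrete, here is a minimal sketch in Python. It is not the app’s actual settlement code: the function name settle_wager is hypothetical, and it assumes the opposing side is split equally among the other players, which the description above does not confirm.

```python
from fractions import Fraction


def settle_wager(bettor, players, stake, bettor_won):
    """Return each player's net change for one even-odds wager.

    ASSUMPTION (not confirmed by the app): the other side of the wager
    is divided equally among all of the other players.
    """
    others = [p for p in players if p != bettor]
    if not others:
        raise ValueError("a wager needs at least one other player")

    # Each opponent covers an equal share of the stake.
    share = Fraction(stake) / len(others)
    sign = 1 if bettor_won else -1

    # Opponents move in the opposite direction of the bettor.
    result = {p: -sign * share for p in others}
    result[bettor] = sign * Fraction(stake)
    return result


changes = settle_wager("book-e", ["book-e", "ann", "bob"], 10, bettor_won=True)
```

Under that equal-split assumption the game is zero-sum: whatever the bettor wins or loses is exactly offset by the other players.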

Book-E (assuming I can finish the project) will do a few things:

  1. Keep an up-to-date data set of all of the relevant football games and the data about them.
  2. Use machine learning to create a “model” of what kinds of bets will win.
  3. Evaluate each game just before betting closes (to have the best data) and pick which bets (if any) to make.
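The third step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the model produces a win probability for each candidate bet, and the names pick_bets and min_edge (and the 0.05 default) are hypothetical, not part of any finished design. Because the wagers pay even odds, the expected profit per unit staked is p - (1 - p) = 2p - 1.

```python
def expected_value(p_win):
    """Expected profit per unit staked on an even-odds bet."""
    return 2 * p_win - 1


def pick_bets(predictions, min_edge=0.05):
    """Choose bets whose expected value clears a minimum edge.

    predictions: mapping of bet id -> estimated win probability
                 (ASSUMPTION: the model outputs probabilities).
    min_edge:    hypothetical threshold; bets below it are skipped.
    Returns (bet id, expected value) pairs, best first.
    """
    picks = [
        (bet, expected_value(p))
        for bet, p in predictions.items()
        if expected_value(p) >= min_edge
    ]
    return sorted(picks, key=lambda pick: pick[1], reverse=True)


# Hypothetical bet ids and probabilities, for illustration only.
picks = pick_bets({"game1-over": 0.60, "game2-spread": 0.51, "game3-under": 0.70})
```

The threshold is what lets Book-E make no bet at all on a game when the model sees no real edge.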

What tools/training am I going to use?

I’m going to have a lot to learn to complete this project! I will need to gather the data, process it into data set(s) that can be used for machine learning, create and then serve a machine learning model, and (finally) integrate that model with my current game so that we have a new “player”.

Given my focus in 2021/2022 on AWS, I’m planning to focus on AWS technologies. I plan to leverage all of the AI technology in SageMaker for capturing the data and creating/serving the machine learning model. Also, since my application is AWS based (a set of Lambda functions, DynamoDB tables, SQS queues, and an API Gateway), I will be adding a few Lambda functions and CloudWatch triggers to make the AI player actually place “bets” and update models without the need for human intervention.

For aggregating the data, I am going to be using Python and Jupyter notebooks as my workspace. Since I’m planning to be very AWS dependent, I’m going to use Amazon SageMaker Studio as my IDE. The data will come from existing tables in my application (which I will access using boto3, the AWS SDK for Python) and from the company I use to provide my scores/data for the game (which I will access through the Python wrapper they provide).
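As a rough sketch of the boto3 side, the helper below reads every item from a DynamoDB table by following the pagination keys that scan returns. The helper names scan_all and to_plain and the table name “wagers” are hypothetical. The function accepts any object with a boto3-style scan method, so the real table is only shown in the commented-out usage.

```python
from decimal import Decimal


def scan_all(table):
    """Return every item in a table, following scan pagination.

    `table` is expected to behave like a boto3 DynamoDB Table resource:
    scan() returns a dict with "Items" and, when more pages remain,
    a "LastEvaluatedKey" to pass back as ExclusiveStartKey.
    """
    items = []
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        kwargs["ExclusiveStartKey"] = last_key
    return items


def to_plain(item):
    """Convert Decimal values (how the resource API returns numbers) to float."""
    return {k: float(v) if isinstance(v, Decimal) else v for k, v in item.items()}


# Usage against a real table (the table name is an assumption):
#   import boto3
#   table = boto3.resource("dynamodb").Table("wagers")
#   rows = [to_plain(item) for item in scan_all(table)]
```

A list of plain dictionaries like rows can be passed straight to pandas.DataFrame in a notebook. A full scan reads the whole table, so it only makes sense for small tables like the ones in a hobby app.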

For creating and serving the actual machine learning model, I plan to use Amazon SageMaker. Specifically, I’m really excited about the SageMaker Autopilot functionality, which will select the best machine learning model for me without me having to be a data scientist.

This is going to require some training! At the outset of this project, I do not know much about SageMaker, SageMaker Studio, Python, the AWS SDK for Python, Jupyter notebooks, or machine learning! I identified the following Udemy courses that I plan to go through:

  • AWS SageMaker Practical for Beginners | Build 6 Projects – This is my primary course. It does a great job introducing the concepts of machine learning, the different types of models, and the ways to evaluate models. Even better, it does this using SageMaker and SageMaker Studio as the tools.
  • AWS – Mastering Boto3 & Lambda Functions Using Python – This course was a great way to get started with both Python in general and with Boto3 (which is the AWS SDK for Python). If you’re a bit of an idiot (like me) and jumping into this project without a background in Python, let me HIGHLY recommend chapter 5, which covers a lot of what you need to know about Python generally in 58 minutes. This would probably only be a sufficient overview if you have a decent amount of programming experience.
  • Data Manipulation in Python: A Pandas Crash Course – This course was great for an introduction to Pandas (a library in Python that’s useful for data manipulation/review) and Jupyter notebooks. While these are both touched on in the first course I mentioned above, if you’re going to actually do some of your own coding, you’ll need a more in-depth review.