In this first of a series of posts, we will be describing how to build a machine learning-based fake news detector from scratch. That means we will literally construct a system that learns how to discern reality from lies, using nothing but raw data. And our project will take us all the way from initial ideation to deployed solution.
We should note that building machine learning products is hard. Starting from the very beginning, the process for a functional and useful system contains at least all of the following steps:
1) ideation and defining of your problem statement
2) acquiring (or labelling) of a dataset
3) exploration of your data to understand its characteristics
4) building a training pipeline for an initial version of your model
5) testing and performing error analysis on your model's failure modes
6) iterating from this error analysis to build improved models
7) repeating steps 4-6 until you get the model performance you need
8) building the infrastructure to deploy your model with the runtime characteristics your users want
9) monitoring your model consistently and using that to repeat any of steps 2-8
Sounds like a lot? It is. And here's what the current machine learning tooling/infrastructure landscape looks like:
When you couple the inherent complexity of building a machine learning product with the myriad tooling decision points, it's no surprise that many companies report that 87% of their data science projects never make it into production! Each of steps 1-9 above is a place where a project can hit an insurmountable roadblock, and the breadth of technologies and skills required to deliver a successful project makes assembling a capable team equally challenging.
It's hard, but that's the current reality of bringing data-driven value to the world.
To make things even more difficult, when we look at the available resources for learning how to do this, the bulk of machine learning tutorials available don't go beyond taking some sample code off the TensorFlow website and running it through an overused benchmark dataset.
In this series of posts, we will describe a viable sequence for carrying a machine learning product through the entire process above. That is to say, we will build a system starting with the initial ideation through to the deployed solution.
We will go into the nitty-gritty of our technology decisions, down to how we would organize our code repository structure for fast engineering iteration. As we progress through our posts, we will incrementally add code to our repository until at the end we have a fully functional and deployable system.
Our posts will cover all of the following:
1) Ideation, organizing your codebase, and setting up tooling (this post!)
Note that while we will describe our solution to the problem in full detail, this is only a solution, not the solution. Building machine learning projects, like all engineering work, entails a series of tradeoffs and design decisions, but we think there is value in showing a valid sequence through the machine learning lifecycle.
With that, let's get started!
Defining the Problem
Arguably the most important step in creating a machine learning application is the problem definition. This means both determining what problem you want your application to solve (the value add) as well as how you will measure success (your metrics).
Given that this is an election year where many people across the US will be looking to form their opinion about candidates through various internet outlets, for the purposes of this project, we will tackle the problem of fake news detection. In the modern day, fake statements are more prevalent and are able to spread more virally than ever before.
To help alleviate this problem, we will want to build an application that can assess the truthfulness of statements made by political speakers or found in social media posts. Once completed, we would ideally like to deploy our project as a web browser extension that can be run in real time on statements that users are reading on their pages.
As far as metrics are concerned, if our tool can automatically detect whether a statement is true or false on a random web page with at least 50% accuracy, then we will consider our project a success. Typically success is measured by some sort of business metric, but since we don't have a formal metric like that, this (somewhat arbitrary) threshold for accuracy will suffice.
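To make the success criterion concrete, here is a minimal sketch of how the accuracy check could be expressed in Python. The names `accuracy`, `meets_success_criterion`, and `SUCCESS_THRESHOLD` are our own illustrative choices, not part of the project code, and we assume each statement carries a simple true/false label.

```python
from typing import Sequence

# Assumption: the (somewhat arbitrary) 50% accuracy bar described above.
SUCCESS_THRESHOLD = 0.5


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def meets_success_criterion(
    predictions: Sequence[bool],
    labels: Sequence[bool],
    threshold: float = SUCCESS_THRESHOLD,
) -> bool:
    """Check the 'at least 50% accuracy' bar (the threshold is inclusive)."""
    return accuracy(predictions, labels) >= threshold
```

For example, `accuracy([True, False, True, True], [True, True, True, False])` counts two matching predictions out of four, which sits exactly on the threshold and therefore passes.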
Organizing Your Repo and Tooling
Now that we have our product goal and metrics defined, let's describe how we will organize our application code repository. We're choosing to spend some time talking about this because there is not a clear consensus in the community on best practices for organizing machine learning projects. This makes it very difficult to quickly look at new projects and understand how the separation of concerns is defined, making collaboration and iteration slower.
As a counterpoint, consider the convention around how Java projects are structured. Because the community has agreed on a common layout, any Java programmer can jump into a new codebase and immediately know where to find what.
In lieu of a common convention, we will describe our practices for organizing machine learning projects, born out of our experience with various projects over the years. The full source code can be found at this repository.
At the top-level, our repository will be structured as follows:
Here is what each item is responsible for:
assets: This will store any images, plots, and other non-source files generated throughout the project.
data: This will store our fake news data in both its raw (untouched from the original source) and processed (featurized or updated for our use case) form.
deploy: This will store any files needed for our deployment including Dockerfiles, etc.
fake-news: This will store all the source for building, training, and evaluating our models.
model_checkpoints: This will store the model binaries that we train and eventually want to deploy.
notebooks: This will store any Jupyter notebooks used for any data analysis.
scripts: This will store any one-off scripts for generating model artifacts, setting up application environments, or processing the data.
LICENSE: Our software license.
requirements.txt: This will store the code dependencies for our project. This is a standard practice for Python-based projects, which our application will be.
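As a rough illustration, the layout described in the list above could be scaffolded with a few lines of Python. This is only a sketch: the function name `scaffold` is hypothetical, and splitting `data` into `raw` and `processed` subdirectories is our assumption based on the description of the data folder, not something the repository is confirmed to do.

```python
from pathlib import Path
from typing import List, Union

# Directories named in the list above. The data/raw and data/processed
# split is an assumption made for this sketch.
DIRECTORIES = [
    "assets",
    "data/raw",
    "data/processed",
    "deploy",
    "fake-news",
    "model_checkpoints",
    "notebooks",
    "scripts",
]

# Top-level files named in the list above, created empty here.
FILES = ["LICENSE", "requirements.txt"]


def scaffold(root: Union[str, Path]) -> List[str]:
    """Create the project skeleton under root and return its top-level entries."""
    root = Path(root)
    for directory in DIRECTORIES:
        # parents=True also creates root itself if it does not exist yet.
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in FILES:
        (root / filename).touch(exist_ok=True)
    return sorted(entry.name for entry in root.iterdir())
```

Calling `scaffold("my-project")` is safe to repeat, since existing directories and files are left untouched.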
For now our requirements.txt is pretty simple, but we will highlight two dependencies that we will certainly need: pytest (a library we will use for our code testing) and dvc (a tool we will use for data and run pipeline versioning).
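To give a flavor of what pytest-style testing looks like, here is a minimal sketch. The helper `normalize_statement` is purely hypothetical and stands in for whatever preprocessing code the project ends up containing. By default, pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, and a failing plain `assert` is reported as a test failure.

```python
def normalize_statement(text: str) -> str:
    """Hypothetical helper: collapse runs of whitespace and lowercase a statement."""
    return " ".join(text.split()).lower()


def test_normalize_statement_collapses_whitespace():
    # Leading, trailing, and repeated whitespace (including tabs) is collapsed.
    assert normalize_statement("  The  Sky is\tBlue ") == "the sky is blue"


def test_normalize_statement_handles_empty_string():
    # An empty statement stays empty rather than raising an error.
    assert normalize_statement("") == ""
```

Saved as, say, `test_normalize.py` (an illustrative file name), these tests would run by invoking `pytest` from the repository root.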
We will also be using an Anaconda environment to isolate our project dependencies, so they do not interfere with our local system-level dependencies. Dependency isolation is generally good practice when building new applications, as it not only ensures a clean working environment but also makes it easier to quickly move and recreate applications on different host servers.
And with that organization completed, we are ready to move on to the next step of getting our dataset and doing some exploratory data analysis.
Reproduced with permission from this post.