Fake News Detection From Ideation to Deployment: Error Analysis and Model V2
In this post, we will pick up where our previous post left off and perform error analysis on the model we built. From there we will work toward a v2 version of our model.
As a reminder, our goal is to apply a data-driven solution to the problem of fake news detection, taking it from initial setup through to deployment. The phases we will conduct include the following:
4. Performing error analysis and iterating toward a v2 model (this post!)
This article will focus on performing error analysis with an emphasis on understanding our model's behavior and finding places to improve. Full source code is here.
A Quick Recap
Recall that in our last post we built a feature-based random forest classifier as our first model.
Our goal was to get some model up and running that could be trained via a configurable pipeline and provide immediate value to users.
As a reminder, we were able to build a model that achieved roughly 74% accuracy on a held-out test set for identifying whether a headline was fake or real.
But what does that even mean? Is this performance good enough?
In a typical research setting, our work might stop here if our performance was state-of-the-art. We would write a paper, submit it in the hour before the midnight conference deadline, and be done with it.
In the world of delivering real-world ML value, our work is really just beginning.
At this point we would ideally deploy our model to users, get people interacting with it, and use those results to inform an improved model. This reflects the continuous, feedback-loop-based nature of ML product development (PC: Josh Tobin):
While we won't go that full route immediately, we will do something equally important: perform manual error and feature analysis on our model.
This is a crucial step in working on any machine learning project as it helps us understand what our model has learned about our data, where we haven't provided enough of a signal to solve our problem, and what we need to do about it.
Understanding Feature Importances
Let's start by looking at the most important features in our v1 model. We can compute the Gini importance of each feature in our random forest: the total reduction in Gini impurity contributed by splits on that feature, accumulated across the trees during training.
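To make "reduction in Gini impurity" concrete, here is a minimal sketch of the quantity for a single split. The labels and the split are made up for illustration, and this is not the code from our pipeline.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node (made-up labels): 4 TRUE and 4 FALSE statements.
parent = ["TRUE"] * 4 + ["FALSE"] * 4
# Hypothetical split on one binary feature.
left = ["TRUE", "TRUE", "TRUE", "FALSE"]
right = ["TRUE", "FALSE", "FALSE", "FALSE"]

decrease = impurity_decrease(parent, left, right)
print(f"Impurity decrease for this split: {decrease:.3f}")
```

A feature's Gini importance adds up these decreases over every split that uses the feature, weighted by how many training samples reach each split, across all trees. If the forest is a scikit-learn RandomForestClassifier (an assumption here), the normalized result is available as the fitted model's feature_importances_ attribute.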
Here are the top features this produces:
As we can see, many of the top features are the bins we created for the "credit history" counts. This is an interesting observation, as it suggests that a speaker's past record of truthfulness is a strong signal for how truthful their future statements will be.
We also see some features that come from the ngram-based tokenization such as in, of, and, etc. On their own these don't seem that interesting because many of them are stopwords, words we wouldn't expect to contribute very much to the overall semantic meaning or truthfulness of a statement.
This provides us some room for improvement: we can filter stopwords in a later model to bias it toward learning some more interesting lexical properties of the data.
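As a rough sketch of what stopword filtering could look like before ngram extraction: the stopword set, the tokenizer, and the example statement below are all illustrative assumptions, not what our pipeline currently does.

```python
# A small, illustrative stopword set; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(statement):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in statement.lower().split())
    return [tok for tok in tokens if tok]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


statement = "The cost of the bill is in the billions."  # made-up example
filtered = remove_stopwords(tokenize(statement))
print(filtered)
```

The unigram features would then be built from the filtered tokens, so words like in, of, and and can no longer show up among the top features.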
While looking at Gini importance is one way of understanding features, we can use more sophisticated techniques for model interpretability. One powerful technique for interpretability is computing Shapley values.
This technique is provided in a nice library called SHAP. The SHAP values indicate how much each feature contributes to pushing a model's output from some base prediction value.
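To build some intuition for what a Shapley value is, here is a brute-force sketch on a toy scoring function. It averages each feature's marginal contribution over every possible feature ordering. The toy function, its numbers, and the second feature name are hypothetical, and this is not how the SHAP library computes its values internally.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present = set()
        for f in ordering:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}


BASE = 0.5  # base prediction when no feature is known (made-up number)


def toy_prob_false(present):
    """Hypothetical P(FALSE) as a function of which binary features are on."""
    prob = BASE
    if "pants_fire_count=0" in present:
        prob -= 0.2  # a clean record pushes away from FALSE
    if "hypothetical_feature" in present:
        prob += 0.1
    if {"pants_fire_count=0", "hypothetical_feature"} <= present:
        prob += 0.04  # small interaction between the two features
    return prob


features = ["pants_fire_count=0", "hypothetical_feature"]
phi = shapley_values(toy_prob_false, features)
for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
```

The values sum to the difference between the full prediction and the base value, which is the sense in which each feature "pushes" the output. Enumerating orderings is only feasible for a handful of features; the SHAP library computes the same kind of attribution without enumerating every ordering.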
When we apply this framework to analyzing our model's outputs on the validation set, we get this plot:
Understanding the above plot is pretty tricky so let's take an example.
Consider the first feature listed: pants_fire_count=0. Recall that this means that the pants_fire credit history count (indicating blatant liars) for the speaker of the statement is in the first bin out of 10. That means they don't have many past instances of being blatant liars.
Now if that feature has a high value (since these are binary features, that means it is 1 and the speaker doesn't have a lot of instances of being a blatant liar), then that will tend to push the value of the FALSE probability to be smaller.
Or in other words, this will push the datapoint to have a higher chance of being labelled TRUE. Makes sense!
As you can see, a lot of the most important features are the various credit history bin features, which is consistent with our Gini importance analysis.
Another important set of features we are seeing is the party affiliation of the speaker.
For this particular dataset, having a Democratic party affiliation tends to push the label toward TRUE. While this notion seems to be represented in our analysis, be careful with drawing broader conclusions as this may just indicate a dataset bias.
Now that we understand our feature importances, let's spend some time analyzing what kinds of errors our model makes.
To do this, we will run our model over the validation set and find the datapoints for which it is least confident and incorrect, namely where the absolute difference between the probability of TRUE and FALSE is smallest.
These are examples for which our model got confused just barely in the wrong direction.
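The selection described above can be sketched as follows. The field names prob_true and label, and the validation examples themselves, are made up for illustration.

```python
def least_confident_errors(examples, k=5):
    """Return the k misclassified examples with the smallest |P(TRUE) - P(FALSE)|."""
    errors = []
    for ex in examples:
        prob_true = ex["prob_true"]
        prob_false = 1.0 - prob_true
        predicted = "TRUE" if prob_true >= prob_false else "FALSE"
        if predicted != ex["label"]:
            errors.append((abs(prob_true - prob_false), ex))
    # Smallest margin first: the model was wrong, but only barely.
    errors.sort(key=lambda pair: pair[0])
    return [ex for _, ex in errors[:k]]


# Made-up validation predictions.
examples = [
    {"id": 0, "prob_true": 0.52, "label": "FALSE"},  # wrong, margin 0.04
    {"id": 1, "prob_true": 0.90, "label": "TRUE"},   # correct
    {"id": 2, "prob_true": 0.45, "label": "TRUE"},   # wrong, margin 0.10
    {"id": 3, "prob_true": 0.95, "label": "FALSE"},  # wrong, margin 0.90
    {"id": 4, "prob_true": 0.30, "label": "FALSE"},  # correct
]

trickiest = least_confident_errors(examples, k=2)
print([ex["id"] for ex in trickiest])
```

Sorting the same margins in descending order would instead surface the errors the model was most confident about.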
We could also do an analysis of the incorrect examples for which our model is most confident, but for now we will look at the trickiest examples.
Here are a handful of them:
In inspecting the second example, we see a political statement that on the surface seems plausible.
Given that our model only uses ngram features (in this case unigrams are the default) coupled with some coarse past history measures and a few details about the context of the statement, the prediction of truthfulness certainly seems like a toss-up.
We have no history of Mitt Romney's statements on the bill, and those are really the crucial evidence for determining whether the statement is accurate. Without such evidence, it will always be quite difficult to know whether something is true or not.
An improved model would need to find a way to scrape relevant information from other news sources where the speaker in question (here Mitt Romney) made statements about the topic in question.
What Does the Data Tell Us?
The previous analysis introduces an interesting question around data ethics and model accountability.
Our initial model performed reasonably well on the given data using features such as "credit history" as well as the speaker party affiliation.
While this did help us do well on this dataset, are those really the features we think should be most meaningful for the problem of fake news detection?
You can see how this could very easily lead us down a rabbit hole of features that codify the biases of our data.
We have to be very careful in this respect to make sure that we are clear about what our model has accomplished: it has performed well on this **specific** fake news dataset, but we are still far from solving the more general problem.
Our dataset is small (roughly 10K datapoints) and we ourselves have not done any quality control on whether it accurately captures all the phenomena we need to be able to model in a fake news detection system.
It's the best we have right now, so we'll keep using it, but we should acknowledge how partial our features may have become to this particular dataset.
To combat this problem, we would need to take steps like increasing our dataset coverage, so that a model can no longer score this well in offline evaluation by leaning on potentially spurious features like party affiliation.
We will now work on building another model on our dataset. When doing this we could leverage the new feature insights we gained from our feature and error analysis to iterate on our random forest model.
Instead, we will see what we can learn through raw lexical and linguistic features alone using some of the new Transformer-based models that have become commonplace in natural language processing.
Here we will apply a RoBERTa model to this task using the HuggingFace Transformers library combined with PyTorch Lightning. The RoBERTa model will encode only the speaker's statement in each datapoint, with no additional fields from the data.
The RobertaModel will look something like this:
Here we are leveraging the RobertaModule we defined:
Thanks to our model interface and general training loop, we can run this to get the following test set performance:
So pure textual features from the statement aren't going to cut it alone.
Further work (left as an exercise to the reader) could explore encoding some other fields (credit history, etc.) from the data as inputs to RoBERTa.
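One simple starting point for that exercise, sketched here under assumptions, is to serialize the extra fields into the text the model sees. The field names, the separator, and the example datapoint are all hypothetical, and whether this actually helps would have to be measured.

```python
def build_model_input(datapoint, extra_fields=("speaker", "party_affiliation")):
    """Append selected metadata fields to the statement as plain text."""
    parts = [datapoint["statement"]]
    for field in extra_fields:
        value = datapoint.get(field)
        if value:  # skip missing or empty fields
            parts.append(f"{field}: {value}")
    return " | ".join(parts)


# Made-up datapoint; the field names are assumptions about the data schema.
datapoint = {
    "statement": "The bill cuts taxes.",
    "speaker": "jane-doe",
    "party_affiliation": "independent",
}

text = build_model_input(datapoint)
print(text)
```

The resulting string would be tokenized in place of the bare statement. Keep in mind the earlier caveat: feeding in fields like party affiliation also gives the model another route to picking up dataset bias.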
In the next post, we'll look at deploying our model and setting up a continuous integration solution for our project!