26 Top Machine Learning Interview Questions and Answers: Theory Edition

Mihail Eric
Dec 7, 2020

Updated: Dec 12, 2020

If you're in the market for a machine learning job, you'll definitely have to go through technical screens. That means your interviewers will often spend some time evaluating your knowledge of core machine learning theory concepts.

In this post, we will describe 26 essential machine learning interview questions and provide their answers. Here we will focus on the machine learning theory you should definitely know for acing your machine learning interviews. In a later post, we will provide additional machine learning questions focusing on systems and engineering concepts.

We can practically guarantee that some flavor of at least one of these questions will be asked at your next machine learning interview.

As a heads-up, we have an additional expert-curated collection of over 300 data science and machine learning interview questions covering topics like deep learning, SQL, MLOps, Pandas, and more for full-stack data science and machine learning engineering roles which can be found here.

With that, let's get started!

1. What is the difference between supervised learning and unsupervised learning?

The fundamental difference between supervised and unsupervised learning is that supervised learning algorithms use training data with labelled outputs while unsupervised learning algorithms use data without labels. In other words, supervised learning typically takes a set of labelled (X, Y) pairs and seeks to learn a function mapping from X to Y. Meanwhile, unsupervised learning typically involves learning structure from unlabelled data through techniques like clustering.

2. Give a few examples of commonly used supervised learning algorithms.

Commonly used algorithms in supervised learning include Naive Bayes, k-nearest neighbors classification, decision trees, random forests, and support vector machines. Neural networks are also very often used for supervised learning.

3. Explain the difference between classification and regression.

Classification refers to supervised learning problems with discrete output labels whereas regression problems have continuous output labels. Hence when learning a model from (X, Y) pairs, in classification Y takes on values from a finite set of classes like [0, 1, 2, 3, ...] whereas in regression Y may take on any real value like 1.012, -3.4221, etc.

4. What is a commonly used linear regression cost function?

One of the go-to cost functions for evaluating linear regression models is least-squares. It is commonly written as the residual sum of squares between the gold labels and the labels a model predicts on a dataset.
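As a minimal sketch of the least-squares cost described above (the function names here are illustrative, not from any particular library), the residual sum of squares and its per-point average can be computed like this:

```python
def residual_sum_of_squares(y_true, y_pred):
    """Sum of squared differences between gold labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    """Residual sum of squares divided by the number of points."""
    return residual_sum_of_squares(y_true, y_pred) / len(y_true)


gold = [3.0, 5.0, 7.0]
pred = [2.0, 5.0, 9.0]
rss = residual_sum_of_squares(gold, pred)  # 1^2 + 0^2 + (-2)^2
mse = mean_squared_error(gold, pred)
```

Dividing by the number of points (mean squared error) does not change which weights minimize the cost, but it makes values comparable across datasets of different sizes.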

5. Describe the k-nearest neighbors algorithm.

In k-nearest neighbors, the label for a point is determined by taking the average (in the case of regression) or majority (in the case of classification) label of the k nearest points to our point-of-interest. For example, in the image below if we were performing k-nearest neighbors with k=3, we would classify P as green because its nearest 3 neighbors are all green.
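The procedure above can be sketched in a few lines. This is a brute-force illustration that assumes Euclidean distance and small in-memory data; the function name and the `task` parameter are hypothetical choices for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3, task="classification"):
    """Predict a label for `query` from its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest = [train_labels[i] for i in order[:k]]
    if task == "regression":
        # Regression: average the neighbors' numeric labels.
        return sum(nearest) / len(nearest)
    # Classification: majority vote among the neighbors.
    return Counter(nearest).most_common(1)[0][0]


points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
colors = ["green", "green", "green", "red", "red"]
label = knn_predict(points, colors, (0.5, 0.5), k=3)
```

Here the three closest points to the query are all green, so the majority vote returns "green".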

6. What is a key assumption in the Naive Bayes algorithm?

Naive Bayes is a frequently used algorithm for classification that assumes that the input features are conditionally independent given the output label. In other words, if we have input features (x1, x2, ..., xn) and output Y, then Naive Bayes allows us to say that the posterior p(Y | x1, x2, ..., xn) is proportional to p(Y)*p(x1|Y)*p(x2|Y)*...*p(xn|Y). In theory this is a pretty strong independence assumption, but in practice it makes certain classification problems tractable and still produces fairly performant models.
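To make the factorization concrete, here is a small sketch that multiplies a prior by per-feature likelihoods and normalizes. The probability tables are made-up numbers for a hypothetical spam example, and in a real classifier they would be estimated from training data.

```python
def naive_bayes_posterior(prior, likelihoods, features):
    """Return p(label | features) under the conditional independence assumption.

    prior:       {label: p(label)}
    likelihoods: {label: [ {value: p(x_i = value | label)}, ... ]}
    features:    observed values (x_1, ..., x_n)
    """
    scores = {}
    for label, p_label in prior.items():
        score = p_label
        for i, value in enumerate(features):
            # Each feature contributes an independent factor p(x_i | label).
            score *= likelihoods[label][i][value]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant
    return {label: s / total for label, s in scores.items()}


# Hypothetical features: (contains "free", contains "meeting")
prior = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{True: 0.8, False: 0.2}, {True: 0.1, False: 0.9}],
    "ham": [{True: 0.1, False: 0.9}, {True: 0.5, False: 0.5}],
}
posterior = naive_bayes_posterior(prior, likelihoods, (True, False))
```

Practical implementations usually sum log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow when there are many features.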

7. Describe how a decision tree is learned.

Decision trees are built in a top-down fashion by splitting a set of observations according to a certain feature and feature value. This recursive partitioning is done greedily: at each node, the feature and split value are chosen to give the largest reduction in some impurity or error metric, such as Gini impurity, over the resulting subsets.
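The greedy step can be sketched as an exhaustive search over features and thresholds that minimizes the size-weighted Gini impurity of the two children. This is only the split search for a single node, with hypothetical helper names; a full tree learner would apply it recursively to each child until a stopping rule is met.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the best split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send points to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


rows = [(1.0, 7.0), (2.0, 3.0), (3.0, 8.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

In this toy data, splitting feature 0 at 2.0 separates the two classes perfectly, so the weighted impurity of the children is zero.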

8. What are the differences between k-nearest neighbors and k-means clustering?

K-nearest neighbors is a supervised learning algorithm whereas k-means clustering is an unsupervised algorithm. They work very differently: k-nearest neighbors uses the k closest neighbors to a given point with an unknown label to calculate its label. This is often done via majority vote for classification and averaging for regression. K-means clustering works by splitting a dataset into k clusters by minimizing a measure of "spread" in the data known as distortion (the sum of squared distances from each point to the centroid of its assigned cluster).
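For contrast with k-nearest neighbors, here is a minimal sketch of the usual alternating procedure for k-means (often called Lloyd's algorithm). It assumes the caller supplies the initial centroids and runs a fixed number of iterations; real implementations typically pick initial centroids randomly or with a scheme such as k-means++ and stop on convergence.

```python
import math


def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # New centroid is the coordinate-wise mean; keep the old one if empty.
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)


def distortion(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, [(0, 0), (10, 0)])
```

Note that no labels appear anywhere in this code, which is the key difference from the k-nearest neighbors setting.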

9. In a K-Nearest Neighbors classifier, what effect does picking a smaller number of neighbors have for classification?

Reducing the number of neighbors in k-nearest neighbors classification tends to make the model's label very sensitive to the few closest neighbors, leading to a very "jagged" decision boundary. This tends to make the model overfit on its training data. See the image below from here.

10. Describe how support vector machines work.

At their core, support vector machines are max-margin classifiers that seek to maximize the margin, i.e. the minimum distance from the training points to some linear separator. Ultimately they strive to find linear separators of data even if that means projecting data (typically via a kernel function) into some higher-dimensional space where that data may be linearly separable (even if it isn't in a lower-dimensional space).

11. What are the differences between L1 and L2 regularization?

L1 and L2 are both forms of regularization but L1 includes an absolute value of the weights term while L2 uses a squared magnitude term. In practice, L1 tends to induce sparsity in the model weights which L2 does not really do.
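One way to see the sparsity difference is to look at a single shrinkage (proximal) step for each penalty. The sketch below is an illustration under that framing, with hypothetical function names: the L1 step is soft-thresholding, which sets small weights exactly to zero, while the L2 step only scales weights toward zero.

```python
import math


def l1_penalty(weights, lam):
    """lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_shrink(weights, step):
    """Soft-thresholding: move each weight toward 0 by `step`, clipping at 0."""
    return [math.copysign(max(abs(w) - step, 0.0), w) for w in weights]


def l2_shrink(weights, step):
    """Shrinkage step for step * w^2: rescales each weight, never exactly 0."""
    return [w / (1.0 + 2.0 * step) for w in weights]


weights = [0.05, -0.3, 2.0]
after_l1 = l1_shrink(weights, 0.1)  # the smallest weight becomes exactly 0
after_l2 = l2_shrink(weights, 0.1)  # every weight shrinks but stays non-zero
```

The L1 step removes the weight 0.05 entirely, whereas the L2 step keeps all three weights and simply makes them smaller.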

12. Explain the bias-variance tradeoff.

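The bias-variance tradeoff describes the tension between two sources of error in supervised learning: bias (error from overly simple modeling assumptions) and variance (error from sensitivity to the particular training set). Together they determine the extent to which a model can generalize from a training set to an unseen test set. Reducing one typically increases the other, so in practice we look for a level of model complexity that balances the two.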

13. You have a model suffering from low bias and high variance. What does this mean?

Low bias and high variance usually means the model is overfitting. If we have more features than datapoints, this could easily lead to overfitting. Typically this behavior may require us to add some form of regularization to the model objective that we are optimizing.

14. Why is it not recommended to assess a model's quality using only its train error?

When building models we are always more interested in generalization error, or how the model performs on data it hasn't seen during training (i.e. the unseen test set). Therefore if we only focus on training error and we see for example a very low training error, we could draw incorrect conclusions about the actual model quality. That being said, training error is a reasonable proxy during model development for a model's performance since it is generally assumed that train data is sampled from the same distribution as test data. Training error typically can be computed relatively easily during model training.

15. What are commonly-used forms of cross-validation?

K-fold cross-validation and leave-one-out cross-validation are very commonly used forms of cross-validation, with the latter being especially useful for very small datasets. In fact, leave-one-out can be considered a special case of k-fold cross-validation with k equal to the number of points in the dataset. In k-fold cross-validation, the data is split into k different folds and then for each fold we execute a separate training run where the model is trained on all folds but the fold of interest and then evaluated on the held-out fold. The model's performance is then aggregated as an average across all k runs.
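The fold bookkeeping can be sketched as a generator of train/test index lists. This version assumes the data is already in a suitable order (it does not shuffle), and `train_and_score` is a hypothetical callback standing in for whatever model training and evaluation you use.

```python
def k_fold_indices(n_points, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_points))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_points // k + (1 if i < n_points % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_points, k, train_and_score):
    """Average the score returned by train_and_score(train, test) over all folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_indices(n_points, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(5, 2))
leave_one_out = list(k_fold_indices(5, 5))  # k equal to the number of points
```

Setting k to the number of points gives leave-one-out: every fold holds out exactly one point.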

16. Define principal components analysis.

Principal components analysis (PCA) is a dimensionality reduction technique for data whereby we search for the axes along which data variance is maximized. It is commonly used as a preprocessing technique or for visualizing high-dimensional data in 2 or 3 dimensions.

17. What are examples of dimensionality reduction?

A few examples of dimensionality reduction include principal components analysis, non-negative matrix factorization, and autoencoders. All of these take high-dimensional data and reduce it to a lower-dimensional representation for reasons including computational considerations or removing redundancy/repetition in the data.

18. What is model bagging?

Bagging (or bootstrap aggregating) is an ensembling technique in which we create a number of bootstrapped datasets by sampling with replacement from the original dataset. We then train a separate model on each of these bootstrapped datasets and combine their outputs (for example by averaging or majority vote) to form the ensemble's output.
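A minimal sketch of bagging for regression follows. It assumes each bootstrap sample has the same size as the original dataset and that predictions are combined by averaging; `fit` is a hypothetical callback that takes a dataset and returns a prediction function.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_predict(data, fit, x, n_models=25, seed=0):
    """Train n_models on bootstrap samples and average their predictions at x."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(model(x) for model in models)


def fit_mean_model(sample):
    """Toy learner: always predicts the mean target of its training sample."""
    target_mean = mean(y for _, y in sample)
    return lambda x: target_mean


data = [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]
prediction = bagged_predict(data, fit_mean_model, x=1.5)
```

For classification the same structure applies, with a majority vote replacing the average.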

19. What are gradient boosted trees?

Gradient boosted trees are a machine learning technique that ensembles weak-performing decision trees (weak learners) via a learned and weighted function of the trees. Gradient boosted trees are built in an iterative fashion by progressively adding new learners whose weights are chosen by optimizing a differentiable objective function.
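The iterative construction can be sketched for the simplest case: squared-error loss on one-dimensional inputs, where the negative gradient is just the residual. This sketch makes several simplifying assumptions that are not part of the description above: the weak learners are depth-1 trees (stumps), and every learner gets the same fixed weight (`learning_rate`) instead of an optimized one.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
```

Each round corrects a fraction of the remaining error, so the ensemble's predictions move steadily from the global mean toward the training targets.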

20. What is an F1 score?

The F1 metric computes the harmonic mean of the precision and recall of a model. This metric is often more useful when dealing with very imbalanced datasets in which a measure like accuracy may give misleading impressions about overall model performance.
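A short sketch of the computation for binary labels is shown below. The function name is illustrative, and returning 0.0 when a denominator is zero is a convention chosen here for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Imbalanced toy data: 5 positives, 95 negatives, model always predicts negative.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

The always-negative model reaches 95% accuracy on this data while its F1 score is 0, which is the kind of misleading impression the F1 metric helps expose.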

21. You are doing a classification task where you achieve 95% accuracy. Why should you be wary of these results?

First off it is very difficult to assess how good a certain accuracy is without the context of the problem. If we are predicting whether someone will click an ad with 95% accuracy that could be really good, but if we have a 95% accuracy for whether or not someone will survive a certain daredevil stunt, that number may not be adequate. Metrics without relevant context really do not mean that much, so we should be careful about dealing with absolute number judgments. In general a fairly high accuracy could imply a label-imbalanced dataset, which may suggest an alternate metric such as F1 may be more appropriate and worth computing.

22. Imagine that you have a dataset with 10 features but some of the datapoints have several features missing. How could you handle this?

There are several techniques for handling missing values. You can drop the affected points altogether, though this can be problematic if too many points have to be dropped. You can also impute the missing values using the features of well-formed points, for example by taking the majority value or the average of the existing values for that feature. For certain modeling problems you can also use the lack of a feature value to help your model learn. In that case it is common to fill in a placeholder no-feature value like -1 or NaN.
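Two of these options, dropping incomplete points and filling with the column average, can be sketched as follows. The example assumes missing values are represented as `None` in plain Python lists, which is a choice made for illustration.

```python
def drop_incomplete(rows):
    """Keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # A column with no observed values stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
dropped = drop_incomplete(rows)
imputed = impute_mean(rows)
```

Dropping leaves only one of the three rows here, which illustrates why imputation is often preferred when many points have gaps.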

23. What is backpropagation?

Backpropagation is the algorithm we use to compute gradients of a neural network loss function with respect to the model weights via the chain rule, so that we can properly update those weights during training. Nowadays, most high-level machine learning libraries like TensorFlow and PyTorch perform backpropagation automatically so that the end-user does not need to explicitly calculate and program gradient updates.
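To show what those libraries automate, here is a hand-written chain-rule computation for a deliberately tiny, hypothetical network: one input, one tanh hidden unit, one linear output, and a squared-error loss.

```python
import math


def forward(w1, w2, x):
    """Compute hidden activation h = tanh(w1 * x) and output y_hat = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h


def loss(w1, w2, x, y):
    """Squared error between the network output and the target."""
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2


def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    h, y_hat = forward(w1, w2, x)
    dloss_dyhat = 2.0 * (y_hat - y)           # d(loss)/d(y_hat)
    dloss_dw2 = dloss_dyhat * h               # y_hat = w2 * h
    dloss_dh = dloss_dyhat * w2               # propagate back into the hidden unit
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x  # tanh'(z) = 1 - tanh(z)^2
    return dloss_dw1, dloss_dw2


grad_w1, grad_w2 = backward(0.5, -1.5, 2.0, 1.0)
```

A common sanity check is to compare these analytic gradients against finite-difference estimates of the loss.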

24. What is dropout?

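Dropout is a commonly used regularization technique that involves randomly setting some of the activations (unit outputs) in the layers of a network to 0 during the forward pass of training. This has the effect of temporarily dropping those units, which discourages the network from relying too heavily on any single unit and helps prevent overfitting. At inference time dropout is turned off and all units are used.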
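A minimal sketch of the widely used "inverted dropout" variant is shown below. It zeroes unit outputs (activations) and rescales the survivors by 1 / keep probability so that the expected activation is unchanged; the function name and signature are illustrative.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; pass them through otherwise."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Keep each unit with probability keep_prob and rescale it; drop the rest.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 1000, 0.5, rng)
eval_out = dropout([1.0, 2.0, 3.0], 0.5, rng, training=False)
```

With a drop probability of 0.5, roughly half of the outputs are zeroed and the rest are doubled, while the evaluation-mode call returns the inputs unchanged.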

25. What is the difference between batch gradient descent and stochastic gradient descent?

Batch gradient descent computes the gradient over the entire training set (the batch) and then applies a single update to the weights of a model; when the gradient is instead computed over a smaller subset of points, the method is usually called mini-batch gradient descent. Stochastic gradient descent computes the gradient for a single point at a time and then immediately applies it (note that some sources may use the term "stochastic gradient descent" even if the gradients are over multiple points). Because each update uses a single point, a stochastic gradient descent step is much cheaper to compute in practice. However, because it uses only a single point, its estimate of the "true" gradient of the objective function is much noisier.
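The difference in update schedule can be sketched on a toy problem: fitting a single weight w in y ≈ w * x under mean squared error. The learning rate, epoch count, and data below are arbitrary choices for illustration.

```python
import random


def gradient(w, batch):
    """Gradient of mean squared error for y ≈ w * x over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def batch_gd(data, w=0.0, lr=0.05, epochs=100):
    """One update per epoch, using the gradient over the whole dataset."""
    for _ in range(epochs):
        w -= lr * gradient(w, data)
    return w


def sgd(data, w=0.0, lr=0.05, epochs=100, seed=0):
    """One update per point, visiting points in a freshly shuffled order."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for point in rng.sample(data, len(data)):
            w -= lr * gradient(w, [point])
    return w


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by y = 3x, no noise
w_batch = batch_gd(data)
w_sgd = sgd(data)
```

Both variants recover w close to 3 on this noise-free data. The stochastic version performs three cheap updates per epoch where the batch version performs one, and on noisy data its individual steps would scatter around the full-batch direction.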

26. What are benefits of using transfer learning in deep learning models?

Transfer learning is an effective and frequently used technique in modern day deep learning systems, both in computer vision and natural language processing. It can reduce computational cost, financial cost, and also help models leverage previously learned features and information from one task to another one. Many powerful deep learning architectures in natural language processing (such as BERT) as well as computer vision (such as ResNets) learn powerful generalizable feature representations that can be quickly adapted to new tasks and datasets using transfer learning.

Hopefully these machine learning interview questions provide you with good practice material as you prepare for your interviews. At Confetti AI, our goal is to help aspiring practitioners jumpstart their machine learning and data science careers. If you're interested in getting more preparation materials for your data science and machine learning interviews, sign up for our updates!