Linear Regression
Suggested reading on difference between linear and non-linear regression
What is linear regression? Why is it called linear?
What are the constraints you need to keep in mind when using linear regression?
How does the variance of the error term change with the number of predictors in OLS?
In linear regression, under what conditions does R^2 always equal a perfect 1?
Do you consider the model Y~X1+X2+X1X2 to be linear? Why?
Suggested reading
Do we always need the intercept term? When do we need it and when do we not?
Suggested reading
What is collinearity, and what can you do about it?
Suggested reading
How do you remove multicollinearity?
Suggested reading
What is overfitting a regression model? What are ways to avoid it?
Suggested reading
What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
What is Lasso regression? How is it different from OLS and Ridge?
What are the assumptions that standard linear regression models with standard estimation techniques make?
How can some of these assumptions be relaxed?
You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. How will you explain it?
Your model considers the feature X significant, and Z is not, but you expected the opposite result. How will you explain it?
How do you check if the regression model fits the data well?
When should you use k-Nearest Neighbors for regression?
Could you explain some of the extensions of linear models, such as Splines or LOESS/LOWESS?
Classification
Basic Questions
State some real-life problems where classification algorithms can be used.
Text categorization (e.g., spam filtering) • fraud detection • optical character recognition • machine vision (e.g., face detection) • natural-language processing (e.g., spoken language understanding) • market segmentation (e.g., predicting whether a customer will respond to a promotion) • bioinformatics (e.g., classifying proteins according to their function), etc.
What is the simplest classification algorithm?
Many consider Logistic Regression a simple approach to begin with: it sets a baseline, which you only make more complicated if need be.
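A minimal sketch of such a baseline, assuming scikit-learn; the synthetic data from make_classification stands in for whatever features and labels you actually have:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for your real feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the simple baseline first; only move to more complex models if it falls short
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```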
What is your favourite ML algorithm? Why is it your favourite? How will you describe it to a non-technical person?
Decision Trees
To answer questions on decision trees, here are some useful links:
YouTube video tutorial
This article covers decision trees in depth
Other suggested reading
Ensemble models:
To answer questions on ensemble models, here is a useful link:
Logistic regression:
Link to understand basics of Logistic regression
Here’s a nice tutorial from Khan Academy
Support Vector Machines
A tutorial on SVM can be found here and here
Neural Networks
Here’s a link to the Neural Networks course from Hinton on Coursera
Other models:
What other models do you know?
How can we use Naive Bayes classifier for categorical features? What if some features are numerical?
Tradeoffs between different types of classification models. How to choose the best one?
Compare logistic regression with decision trees and neural networks.
Regularization
Suggested Reading: Wikipedia and Quora answers
What is Regularization?
Which problem does Regularization try to solve?
Ans. Regularization is used to address the overfitting problem: it penalizes your loss function by adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of your weight vector w (the vector of learned parameters in your linear regression).
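A minimal sketch of the effect of these penalties, assuming scikit-learn and synthetic data; the alpha values are illustrative choices for the regularization strength (the "multiple" of the norm mentioned above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)    # minimizes ||y - Xw||^2
ridge = Ridge(alpha=10.0).fit(X, y)   # adds alpha * ||w||_2^2 to the loss
lasso = Lasso(alpha=1.0).fit(X, y)    # adds alpha * ||w||_1 to the loss

# The L1 penalty drives some coefficients exactly to zero; the L2 penalty only shrinks them.
print("non-zero coefficients -> OLS:", np.sum(ols.coef_ != 0),
      "Ridge:", np.sum(ridge.coef_ != 0),
      "Lasso:", np.sum(lasso.coef_ != 0))
```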
What does it mean (practically) for a design matrix to be “ill-conditioned”?
When might you want to use ridge regression instead of traditional linear regression?
What is the difference between the L1 and L2 regularization?
Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
Dimensionality Reduction
Suggested Reading: Scikit and Kdnuggets
What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
What ways of reducing dimensionality do you know?
Is feature selection a dimensionality reduction technique?
What is the difference between feature selection and feature extraction?
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
Principal Component Analysis
What is Principal Component Analysis (PCA)? Under what conditions is PCA effective? How is it related to eigenvalue decomposition (EVD)?
What are the differences between Factor Analysis and Principal Component Analysis?
How will you use SVD to perform PCA? When is SVD better than EVD for PCA?
Why do we need to center data for PCA and what can happen if we don’t do it? Do we need to normalize data for PCA? Why?
Is PCA a linear model or not? Why?
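A minimal sketch of PCA via SVD, using only numpy and toy data; it also shows why centering matters (without it, the leading "component" largely points at the data mean rather than a direction of variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # rows = samples, columns = features

X_centered = X - X.mean(axis=0)    # center each feature before the decomposition
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                            # principal directions, one per row
explained_variance = S**2 / (len(X) - 1)   # variance captured by each direction
scores = X_centered @ Vt.T                 # data projected onto the principal axes
print(explained_variance)
```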
Other Dimensionality Reduction techniques:
Do you know other Dimensionality Reduction techniques?
What is Independent Component Analysis (ICA)? What’s the difference between ICA and PCA?
Suppose you have a very sparse matrix where the rows are high-dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?
Cluster Analysis
Suggested reading: tutorialspoint and Lecture notes
Why do you need to use cluster analysis?
Give examples of some cluster analysis methods?
Differentiate between partitioning methods and hierarchical methods.
Explain K-Means and its objective? How do you select K for K-Means?
How would you assess the quality of clustering?
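A minimal sketch of K-Means and its objective (the within-cluster sum of squared distances), assuming scikit-learn and toy 2-D data; scanning several values of K and watching where the objective stops dropping sharply (the "elbow") is one common way to choose K:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data with 3 true clusters

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia_ is the K-Means objective for this value of K
```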
Optimization
Here is a good video to learn about optimization.
Some basic questions about optimization
Give examples of some convex and non-convex algorithms.
Examples of convex optimisation problems in machine learning
Linear regression / Ridge regression with Tikhonov regularisation, etc.; sparse linear regression with L1 regularisation, such as lasso; support vector machines; parameter estimation in linear-Gaussian time series (Kalman filter and friends)
Typical examples of non-convex optimization in ML are
Neural networks; maximum likelihood mixtures of Gaussians
What is the Gradient Descent Method?
Tell us the difference between Batch Gradient Descent and Stochastic Gradient Descent.
Give examples of some convex optimization problems in machine learning.
Give examples of algorithms that use gradient-based methods with second-order information.
Do Gradient Descent methods always converge to the same point?
Is it necessary that the Gradient Descent Method will always find the global minimum?
What is a local optimum and why is it important in a specific context, such as k-means clustering? What are specific ways of determining whether you have a local optimum problem? What can be done to avoid local optima? Read possible answer
Suggested Reading
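For the batch vs. stochastic gradient descent question above, a minimal sketch on least-squares linear regression (numpy only; the data, learning rate, and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

lr, n_epochs = 0.05, 100
w_batch, w_sgd = np.zeros(3), np.zeros(3)

for _ in range(n_epochs):
    # Batch GD: one update per pass, using the gradient averaged over the full dataset
    grad = 2 * X.T @ (X @ w_batch - y) / len(X)
    w_batch -= lr * grad

    # SGD: one noisy update per sample, using only that sample's gradient
    for i in rng.permutation(len(X)):
        grad_i = 2 * (X[i] @ w_sgd - y[i]) * X[i]
        w_sgd -= lr * grad_i

print("batch:", w_batch, "sgd:", w_sgd)
```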
Explain Newton’s method.
Suggested Reading
What kind of problems are well suited for Newton’s method? BFGS? SGD?
What are “slack variables”?
Describe a constrained optimization problem and how you would tackle it.
Recommendation
Some good examples of recommender models can be found here
What is a recommendation engine? How does it work?
How would you do customer recommendations?
What is Collaborative Filtering?
How would you generate related searches for a search engine?
How would you suggest followers on Twitter?
Do you know about the Netflix Prize problem? How would you approach it?
Here is a nice post on the Netflix challenge
Feature Engineering
Here is a good article on feature engineering
What is Feature Engineering?
How predictors are encoded in a model can have a significant impact on model performance, and we achieve such encoding through feature engineering. Sometimes using combinations of predictors can be more effective than using the individual values: the product of two predictors may be more effective than using two independent predictors. Often the most effective encoding of the data is captured by the modeler’s understanding of the problem and thus is not derived from any mathematical technique.
These features can be extracted in two ways: 1. By a human expert (known as hand-crafted) or 2. By using automated feature extraction methods such as PCA, or Deep Learning tools such as DBN. Both 1 and 2 can be used on top of each other.
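A minimal sketch of such a hand-crafted combination (the product of two predictors mentioned above), assuming pandas and made-up column names x1 and x2:

```python
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                   "x2": [0.5, 0.0, 1.5, 2.0]})

# New combined predictor: the interaction of the two original features
df["x1_times_x2"] = df["x1"] * df["x2"]
print(df)
```

(scikit-learn's PolynomialFeatures with interaction_only=True automates the same idea across many columns.)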
Feature Selection
Here is a nice post on feature selection, also known as variable selection, attribute selection or variable subset selection.
Natural Language Processing (NLP)
For a basic introduction, visit the wiki page.
Here is the link to the Coursera course for NLP
Pick the software from The Stanford NLP (Natural Language Processing) Group and input some text to view its parse tree, named entities, part-of-speech tags, etc.
If the company deals with text data, you can expect some questions on NLP and Information Retrieval:
Some interesting use cases are in areas like sentiment analysis, spam detection, part-of-speech (POS) tagging, text summarization, language translation, etc.
How can unstructured text data be converted into structured data for ML models?
Explain the Vector Space Model and its use?
Explain the distances and similarity measures that can be used to compare documents?
Explain cosine similarity in a simple way?
Suggested Reading
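For the cosine similarity question above, a minimal worked example on term-count vectors (numpy only; the documents and vocabulary are made up):

```python
import numpy as np

# Word counts over a shared toy vocabulary, e.g. ["data", "science", "pizza"]
doc_a = np.array([3.0, 2.0, 0.0])
doc_b = np.array([1.0, 1.0, 0.0])
doc_c = np.array([0.0, 0.0, 5.0])

def cosine_similarity(u, v):
    # cos of the angle between u and v = (u . v) / (||u|| * ||v||); 1 = same direction, 0 = orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(doc_a, doc_b))   # high: similar word usage
print(cosine_similarity(doc_a, doc_c))   # 0.0: no words in common
```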
Why and when are stop words removed? In which situations do we not remove them?
Image processing and Text mining
What tool would you prefer for image processing?
Some popular tools are MATLAB, OpenCV or Octave.
What parameters would you consider while selecting a tool for image processing?
Ease of use, speed and resources needed are some of the common parameters.
How do you apply Machine Learning to images?
What are the text mining tools you are familiar with?
Some examples are:
Commercial: Autonomy, Lexalytics, SAS/SPSS, SQL Server 2008+
Open source: RapidMiner, NClassifier, OpenTextSummarizer, WordNet, OpenNLP/SharpNLP, Lucene/Lucene.NET, LingPipe, Weka
Meta Learning
Wiki link on meta learning
How will you differentiate between boosting and inductive transfer?
Model selection
What criteria would you use while selecting the best model from many different models?
You have one model and want to find the best set of parameters for this model. How would you do that?
How would you use model tuning to arrive at the best parameters?
Suggested Reading
Explain grid search and how you would use it?
What is Cross-Validation?
What is 10-Fold CV?
What is the difference between holding out a validation set and doing 10-Fold CV?
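A minimal sketch of grid search combined with 10-fold cross-validation, assuming scikit-learn; the SVC model, parameter grid, and toy data are illustrative choices only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in the grid is scored by 10-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```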
Evaluating Machine Learning
How do you know if your model overfits?
How do you assess the results of a logistic regression?
Which evaluation metrics do you know? Something apart from accuracy?
Which is better: too many false positives or too many false negatives?
What are precision and recall?
What is a ROC curve? Write pseudo-code to generate the data for such a curve.
What is AU ROC (AUC)? Do you know about Concordance or Lift?
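For the ROC curve question above, a minimal sketch of generating the (FPR, TPR) points from predicted scores; the labels and scores are made-up example values:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])

points = []
for threshold in sorted(set(y_score), reverse=True):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR) at this threshold

print(points)   # plotting these points traces the ROC curve
```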
Discussion Questions
You have a marketing campaign and you want to send emails to users. You developed a model for predicting if a user will reply or not. How can you evaluate this model? Is there a chart you can use?
Miscellaneous
Curse of Dimensionality
What is the Curse of Dimensionality? What is the difference between density-sparse data and dimensionally-sparse data?
Suggested Reading
How do you deal with correlated features in your data set? How can you reduce the dimensionality of the data?
What are the problems of a large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
What dimensionality reductions can be used for preprocessing the data?
Others
You are training an image classifier with limited data. What are some ways you can augment your dataset?
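A minimal sketch of common augmentations, assuming torchvision is available; the specific transforms and parameter values are illustrative, not a prescribed recipe:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                      # mirror images
    transforms.RandomRotation(degrees=15),                       # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # lighting changes
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),    # random crops / zooms
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # apply per sample, e.g. inside a Dataset / DataLoader
```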