Exercises
In this chapter we have covered some of the most important concepts in Machine
Learning. In the next chapters we will dive deeper and write more code, but before we
do, make sure you know how to answer the following questions:
1. How would you define Machine Learning?
Machine Learning is the science (and art) of programming computers so they can
learn from data.
Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
2. Can you name four types of problems where it shines?
Image detection, autonomous driving, spam filtering, NLP (image to words).
Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
3. What is a labeled training set?
The training set you feed to the algorithm includes the desired solutions.
A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
4. What are the two most common supervised tasks?
Classification: predicting a class.
Regression: predicting a target numeric value (e.g., the price of a car).
The two most common supervised tasks are classification and regression.
5. Can you name four common unsupervised tasks?
Dimensionality reduction, simplify the data without losing too much information
Anomaly detection, detecting unusual credit card transactions to prevent fraud
Novelty detection, detecting new instances that look different from all instances in the training set
Association rule learning, dig into large amounts of data and discover interesting relations between attributes
Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
Reinforcement Learning: observe the environment, select and perform actions, and get rewards in return (e.g., AlphaGo).
Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.
7. What type of algorithm would you use to segment your customers into multiple groups?
A clustering algorithm, to detect groups of similar visitors.
If you don't know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
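As a rough illustration of the clustering case, here is a minimal sketch of k-means on a single feature. The `kmeans_1d` helper, its deterministic initialization, and the spend figures are all assumptions made up for this example; they are not from the chapter.

```python
def kmeans_1d(values, k, n_iters=20):
    """Tiny 1-D k-means sketch: returns (centroids, assignments)."""
    # Deterministic init (an assumption): spread centroids across the sorted values.
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    assignments = [0] * len(values)
    for _ in range(n_iters):
        # Assignment step: attach each value to its nearest centroid.
        assignments = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical yearly spend per customer; no group labels are provided.
spend = [12, 15, 14, 210, 220, 205, 1010, 990]
centroids, groups = kmeans_1d(spend, k=3)
print(groups)  # customers with the same number fall in the same segment
```

No labels are needed here: the algorithm only sees the spend values and groups similar customers on its own.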
8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
A supervised learning problem. The system is trained with many example emails along with their class (spam or ham), and it must learn how to classify emails.
9. What is an online learning system?
You train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.
An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
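To make "learning incrementally" concrete, here is a minimal sketch of an online linear model updated one instance at a time with stochastic gradient descent. The class name `OnlineLinearModel`, the method `learn_one`, the learning rate, and the two synthetic data phases are assumptions chosen for illustration.

```python
class OnlineLinearModel:
    """y ~ w * x + b, updated one instance at a time (SGD on squared error)."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # One gradient step using only the instance that just arrived.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


model = OnlineLinearModel(learning_rate=0.1)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Phase 1: the (synthetic) stream follows y = 2x + 1.
for step in range(500):
    x = xs[step % len(xs)]
    model.learn_one(x, 2.0 * x + 1.0)
w_before, b_before = model.w, model.b

# Phase 2: the relationship drifts to y = -x + 3; the model keeps adapting.
for step in range(500):
    x = xs[step % len(xs)]
    model.learn_one(x, -1.0 * x + 3.0)
w_after, b_after = model.w, model.b
```

The model never stores past instances, which is why the same idea scales to data streams and adapts when the data changes.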
10. What is out-of-core learning?
Out-of-core learning is the use of online learning algorithms to train systems on huge datasets that cannot fit in one machine's main memory. The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
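The load-a-part, train, repeat loop can be sketched as follows. This is only an illustration under stated assumptions: an in-memory `io.StringIO` stands in for a file too large for memory, the helper names `iter_mini_batches` and `train_out_of_core` are made up, and making several passes (epochs) over the stream is a choice of this sketch.

```python
import io
from itertools import islice


def iter_mini_batches(lines, batch_size):
    """Yield lists of (x, y) pairs, reading only batch_size rows at a time."""
    while True:
        batch = [tuple(map(float, line.split(","))) for line in islice(lines, batch_size)]
        if not batch:
            return
        yield batch


def train_out_of_core(open_stream, batch_size=4, n_epochs=200, learning_rate=0.1):
    """Fit y ~ w * x + b with mini-batch gradient descent over a streamed dataset."""
    w, b = 0.0, 0.0
    for _ in range(n_epochs):
        stream = open_stream()  # re-open the source and stream it again
        for batch in iter_mini_batches(stream, batch_size):
            # One training step on the part of the data currently in memory.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical data following y = 3x - 1, serialized as CSV rows "x,y".
csv_text = "\n".join(f"{x / 10},{3 * x / 10 - 1}" for x in range(-10, 11))
w, b = train_out_of_core(lambda: io.StringIO(csv_text))
```

Only `batch_size` rows are held in memory at any moment, regardless of how long the stream is.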
11. What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them).
An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
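A k-nearest-neighbors classifier is one common instance-based learner. The sketch below assumes Euclidean distance as the similarity measure and uses made-up two-feature emails; the function name `knn_predict` and the data are illustrative only.

```python
import math
from collections import Counter


def knn_predict(train, new_point, k=3):
    """train: list of (features, label) pairs; features are equal-length tuples."""
    # Similarity measure (an assumption): Euclidean distance, smaller = more similar.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]
    # Majority vote among the k most similar stored instances.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical features per email, e.g. (number of links, number of "free" mentions).
train = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

There is no training step beyond storing the examples; all the work happens at prediction time.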
12. What is the difference between a model parameter and a learning algorithm’s hyperparameter?
A hyperparameter is a parameter of a learning algorithm (not of the model). It is set before training and stays constant during training, whereas model parameters are not set by hand: they are learned from the training data.
A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances.
A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
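The distinction shows up clearly in a one-feature ridge regression without an intercept, sketched below as an assumed example. The slope `w` is a model parameter computed from the data; `alpha`, the amount of regularization, is a hyperparameter chosen by us before fitting. Minimizing `sum((w*x - y)**2) + alpha * w**2` gives `w = sum(x*y) / (sum(x*x) + alpha)`.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Closed-form slope for y ~ w * x with an L2 penalty alpha * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # made-up data lying exactly on y = 2x

# Same data, same algorithm, different hyperparameter values:
w_plain = fit_ridge_slope(xs, ys, alpha=0.0)         # 28 / 14 = 2.0
w_regularized = fit_ridge_slope(xs, ys, alpha=14.0)  # 28 / 28 = 1.0
```

Changing `alpha` changes which value of `w` the algorithm finds, but `alpha` itself is never learned from the training data.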
13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search for a way to generalize from a set of examples, which is to build a model of these examples and then use that model to make predictions.
The common strategy is to define a utility function (or fitness function) that measures how good your model is, or to define a cost function that measures how bad the model is.
To make predictions, they apply the trained model (for example, a linear model) to the new instance.
Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances.
We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized.
To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.
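The three parts of this answer (cost function, parameter search, prediction) can be sketched with a one-feature linear model. This is an assumed minimal example: the cost is the mean squared error, the search uses the ordinary least squares closed form, and the data is made up.

```python
def mse(w, b, xs, ys):
    """Cost function: mean squared error of y ~ w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_linear(xs, ys):
    """Parameter values that minimize the MSE (ordinary least squares, one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


def predict(w, b, x_new):
    """Prediction function: feed the new instance's feature into the model."""
    return w * x_new + b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up training data on the line y = 2x + 1
w, b = fit_linear(xs, ys)
print(w, b, mse(w, b, xs, ys), predict(w, b, 4.0))
```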
14. Can you name four of the main challenges in Machine Learning?
Insufficient quantity of training data
Nonrepresentative training data
Poor-quality data
Irrelevant features
Overfitting the training data
Underfitting the training data
15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
Overfitting is happening (or we got extremely lucky on the training data).
1) Simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model.
2) Gather more training data.
3) Reduce the noise in the training data.
16. What is a test set and why would you want to use it?
You split your data into two sets: the training set and the test set. You train your model using the training set, and you test it using the test set.
A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
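Holding out a test set can be as simple as shuffling and slicing. The helper below is a hand-rolled sketch; its name `split_train_test`, the 20% ratio, and the fixed seed are assumptions for this example.

```python
import random


def split_train_test(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(instances)
    # A fixed seed keeps the same instances in the test set across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 instances
train_set, test_set = split_train_test(data)
print(len(train_set), len(test_set))
```

The model is fit on `train_set` only; the error measured on `test_set` then serves as the estimate of the generalization error.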
17. What is the purpose of a validation set?
To avoid selecting a model that is unlikely to perform as well on new data, you hold out part of the training set to evaluate several candidate models and select the best one. This held-out set is called the validation set.
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
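Here is a minimal sketch of hold-out validation used to pick a hyperparameter. It assumes a one-feature ridge model without an intercept, three candidate values for the regularization strength `alpha`, and small made-up training and validation sets; none of these come from the chapter.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Slope of y ~ w * x minimizing squared error plus alpha * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_alpha(train, valid, candidates):
    """Train one model per candidate, then compare them on the validation set."""
    train_xs, train_ys = zip(*train)
    valid_xs, valid_ys = zip(*valid)
    errors = {}
    for alpha in candidates:
        w = fit_ridge_slope(train_xs, train_ys, alpha)
        errors[alpha] = mse(w, valid_xs, valid_ys)
    best_alpha = min(errors, key=errors.get)
    return best_alpha, errors


# Made-up data: a tiny noisy training set and a held-out validation set.
train = [(1.0, 2.9), (2.0, 4.2)]
valid = [(1.5, 3.0), (3.0, 6.0)]
best_alpha, errors = select_alpha(train, valid, candidates=[0.0, 0.5, 5.0])
print(best_alpha, errors)
```

The test set plays no part in this selection, so it remains available for a final, unbiased error estimate.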
18. What is the train-dev set, when do you need it, and how do you use it?
The train-dev set is made by holding out some of the training pictures (from the web) in yet another set. If the model performs well on the train-dev set, it is not overfitting the training set.
The train-dev set is used when there is a risk of mismatch between the training data and the data used in the validation and test datasets (which should always be as close as possible to the data used once the model is in production). The train-dev set is a part of the training set that's held out (the model is not trained on it). The model is trained on the rest of the training set, and evaluated on both the train-dev set and the validation set.
If the model performs well on the training set but not on the train-dev set, then the model is likely overfitting the training set. If it performs well on both the training set and the train-dev set, but not on the validation set, then there is probably a significant data mismatch between the training data and the validation + test data, and you should try to improve the training data to make it look more like the validation + test data.
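The decision rule in this answer can be written down as a small function. This is only a sketch: the function name `diagnose`, the error rates in the usage lines, and the `gap` threshold of 0.05 are arbitrary assumptions, and in practice what counts as a large gap depends on the task.

```python
def diagnose(train_error, train_dev_error, validation_error, gap=0.05):
    """Rough diagnosis from three error rates; `gap` is an arbitrary threshold."""
    if train_dev_error - train_error > gap:
        # Same data distribution, yet much worse on held-out training data.
        return "overfitting the training set"
    if validation_error - train_dev_error > gap:
        # Fine on held-out training data, much worse on validation data.
        return "data mismatch between training and validation/test data"
    return "no large gap detected"


print(diagnose(0.02, 0.15, 0.16))
print(diagnose(0.02, 0.03, 0.20))
print(diagnose(0.02, 0.03, 0.04))
```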
19. What can go wrong if you tune hyperparameters using the test set?
The model will overfit the test set: it will show a low error on the test set but a higher error in production or on real data.
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
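A small simulation can illustrate why the measured error becomes optimistic. Everything here is an assumption made for the demonstration: each "hyperparameter setting" is a pure coin-flip classifier with no skill, the test set has 50 instances, and 200 settings are tried.

```python
import random

rng = random.Random(0)


def random_labels(n):
    return [rng.randint(0, 1) for _ in range(n)]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


test_labels = random_labels(50)     # small test set
fresh_labels = random_labels(1000)  # stand-in for data seen in production

# Each candidate guesses at random, so its true accuracy is about 50%.
candidates = []
for _ in range(200):
    test_acc = accuracy(random_labels(50), test_labels)
    fresh_acc = accuracy(random_labels(1000), fresh_labels)
    candidates.append((test_acc, fresh_acc))

# "Tuning on the test set": keep the setting with the best test accuracy.
best_test_acc, best_fresh_acc = max(candidates, key=lambda c: c[0])
print(best_test_acc, best_fresh_acc)
```

The selected setting looks clearly better than chance on the test set only because it was picked for that, while its accuracy on fresh data stays near chance.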