Common Pitfalls In Machine Learning Projects
In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects he and his colleagues have observed during competitions on Kaggle.
The talk was titled “Machine Learning Gremlins” and was presented in February 2014 at Strata.
In this post we take a look at the pitfalls from Ben’s talk, what they look like and how to avoid them.
Early in the talk, Ben presented a snapshot of the process for working a machine learning problem end-to-end.
This snapshot included 9 steps.
He commented that the process is iterative rather than linear.
He also commented that each step in this process can go wrong, derailing the whole project.
Ben presented a case study of building an automatic cat door that can let the cat in and keep the dog out. It was an instructive example, as it touched on a number of key problems that come up when working a data problem.
The first great takeaway from this example was that he plotted the accuracy of the model against the number of training samples and showed that more samples correlated with greater accuracy.
He then added more data until accuracy leveled off. This was a great example of how easy it can be to get an idea of the sensitivity of your system to sample size and adjust accordingly.
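As a rough illustration of this idea (not code from Ben's talk), the sketch below uses scikit-learn's learning_curve to measure cross-validated accuracy at increasing training-set sizes; the dataset and classifier are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Any labeled dataset and classifier will do; these are stand-ins.
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Measure cross-validated accuracy at increasing training-set sizes.
train_sizes, _, test_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="accuracy",
)

# If accuracy is still climbing at the largest size, more data may help;
# once the curve levels off, extra samples buy little.
for size, scores in zip(train_sizes, test_scores):
    print(f"{size:4d} samples: mean accuracy = {scores.mean():.3f}")
```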
The second great takeaway from this example was that the system failed: it let in every cat in the neighborhood.
It was a clever example highlighting the importance of understanding the constraints of the problem that needs to be solved, rather than the problem that you want to solve.
Ben went on to discuss four common pitfalls when working on machine learning problems.
Although these problems are common, he pointed out that they can be identified and addressed relatively easily.
Ben’s talk “Machine Learning Gremlins” is a quick and practical talk.
You will get a useful crash course in the common pitfalls we are all susceptible to when working on a data problem.