Is machine learning Kaggle competitions?

Today I found two interesting posts. One is titled "Machine learning isn't Kaggle competitions", while the other is titled "Machine Learning is Kaggle Competitions". Let's see what they say.


Machine learning isn't Kaggle competitions


I write about strace and kernel programming on this blog, but at work I actually mostly work on machine learning, and it’s about time I started writing about it! Disclaimer: I work on a data analysis / engineering team at a tech company, so that’s where I’m coming from.

When I started trying to get better at machine learning, I went to Kaggle (a site where you compete to solve machine learning problems) and tried out one of the classification problems. I used an out-of-the-box algorithm, messed around a bit, and definitely did not make the leaderboard. I felt sad and demoralized – what if I was really bad at this and never got to do math at work?! I still don’t think I could win a Kaggle competition. But I have a job where I do (among other things) machine learning! What gives?

To back up from Kaggle for a second, let’s imagine that you have an awesome startup idea. You’re going to predict flight arrival times for people! There are a ton of decisions you’ll need to make before you even start thinking about support vector machines:

Understand the business problem

If you want to predict flight arrival times, what are you really trying to do? Some possible options:

  • Help the airline understand which flights are likely to be delayed, so they can fix it.
  • Help people buy flights that are less likely to be delayed.
  • Warn people if their flight tomorrow is going to be delayed.

I’ve spent time on projects where I didn’t understand at all how the model was going to fit into business plans. If this is you, it doesn’t matter how good your model is. At all.

Understanding the business problem will also help you decide:

  • How accurate does my model really need to be? What kind of false positive rate is acceptable? (There’s a quick sketch of this tradeoff right after this list.)
  • What data can I use? If you’re predicting flight delays for tomorrow, you can look at weather data, but if someone is buying a flight a month from now then you’ll have no clue.
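To make the false positive question concrete, here’s a minimal sketch (with entirely made-up labels and scores) of how you might check what false positive rate different decision thresholds imply for a “will this flight be delayed?” classifier:

```python
# Minimal sketch: what false positive rate does a given threshold imply?
# The labels and predicted probabilities below are made up for illustration.
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])   # 1 = flight was delayed
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.6, 0.9, 0.2, 0.5, 0.7, 0.05])

for threshold in (0.3, 0.5, 0.7):
    y_pred = y_prob >= threshold
    false_positives = np.sum(y_pred & (y_true == 0))
    fpr = false_positives / np.sum(y_true == 0)
    print(f"threshold={threshold}: false positive rate={fpr:.2f}")
```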

Choose a metric to optimize

Let’s take our flight delays example. We first have to decide whether to do classification (“will this flight be delayed for at least an hour”) or regression (“how long will this flight be delayed for?”). Let’s say we pick regression.

People often optimize the sum of squares because it has nice statistical properties. But mispredicting a flight arrival time by 10 hours and by 20 hours are pretty much equally bad. Is the sum of squares really appropriate here?
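A toy illustration of the difference: under the sum of squares, the 20-hour miss counts four times as much as the 10-hour one, while an absolute or capped loss treats them closer to how a passenger would. The 5-hour cap below is a hypothetical business rule, not anything from the original post.

```python
# Toy comparison of losses for two mispredictions the business considers
# roughly equally bad. The 5-hour cap is a hypothetical business rule.
for error_hours in (10, 20):
    squared = error_hours ** 2
    absolute = abs(error_hours)
    capped = min(abs(error_hours), 5)
    print(f"error={error_hours}h squared={squared} absolute={absolute} capped={capped}")
```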

Decide what data to use

Let’s say I already have the airline, the flight number, departure airport, plane model, and the departure and arrival times.

Should I try to buy more specific information about the different plane models (age, what parts are in them…)? Really accurate weather data? The amount of information available to you isn’t fixed! You can get more!

Clean up your data

Once you have data, your data will be a mess. In this flight search example, there will likely be

  • airports that are inconsistently named
  • missing delay information all over the place
  • weird date formats
  • trouble reconciling weather data and airport location

Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.
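For a flavor of what that work looks like, here’s a minimal pandas sketch; the column names and the airport alias table are hypothetical:

```python
# Minimal cleanup sketch for the flight example (hypothetical columns).
import pandas as pd

df = pd.DataFrame({
    "airport":   ["JFK", "jfk ", "New York JFK", "SFO", None],
    "departure": ["2014-06-01 10:30", "06/01/2014 11:00", "2014-06-02",
                  None, "2014-06-03 09:15"],
    "delay_min": [12.0, None, 45.0, 7.0, None],
})

# Normalize inconsistently named airports to canonical codes.
aliases = {"jfk": "JFK", "new york jfk": "JFK", "sfo": "SFO"}
df["airport"] = df["airport"].str.strip().str.lower().map(aliases)

# Parse mixed date formats; unparseable values become NaT instead of crashing
# (format="mixed" needs pandas >= 2.0).
df["departure"] = pd.to_datetime(df["departure"], errors="coerce", format="mixed")

# Be explicit about missing delay information rather than silently guessing.
df["delay_known"] = df["delay_min"].notna()
print(df)
```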

Build a model!

This is the fun Kaggle part. Training! Cross-validation! Yay!
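If you want the flavor of that part, here’s a minimal cross-validation sketch with scikit-learn; the features and targets are random stand-ins for the cleaned flight data:

```python
# Minimal train/cross-validate sketch; X and y are random stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                        # stand-in flight features
y = X @ rng.normal(size=6) + rng.normal(size=500)    # stand-in delay (minutes)

model = GradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```

Note the scoring choice here echoes the earlier metric discussion: mean absolute error rather than the sum of squares.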

Now that we’ve built what we think is a great model, we actually have to use it:

Put your model into production

Netflix didn’t actually implement the model that won the Netflix Prize because it was too complicated.

If you trained your model in Python, can you run it in production in Python? How fast does it need to be able to return results? Are you running a model that bids on advertising spots / does high frequency trading?

If we’re predicting flight delays, it’s probably okay for our model to run somewhat slowly.
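Two of the questions above are cheap to sanity-check early: can the trained Python model be serialized for a production service, and how long does one prediction take? A rough sketch, using joblib (the usual way to persist scikit-learn models) and stand-in data:

```python
# Rough sketch: persist a trained model and time a single prediction.
import time
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))          # stand-in features
y = X @ rng.normal(size=6)             # stand-in delays

joblib.dump(GradientBoostingRegressor().fit(X, y), "delay_model.joblib")

served = joblib.load("delay_model.joblib")   # what a production service would load
start = time.perf_counter()
served.predict(X[:1])                        # one live prediction
print(f"one prediction: {(time.perf_counter() - start) * 1000:.2f} ms")
```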

Another surprisingly difficult thing is gathering the data to evaluate your model – getting historical weather data is one thing, but getting that same data in real time to predict flight delays right now is totally different.

Measure your model’s performance

Now that we’re running the model on live data, how do I measure its real-life performance? Where do I log the scores it’s producing? If there’s a huge change in the inputs my model is getting after 6 months, how will I find out?
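One cheap place to start is logging the inputs the live model sees and flagging when they drift away from the training data. A deliberately naive sketch (it compares means in units of training standard deviations; real monitoring would be more careful):

```python
# Naive drift check: is the live feature mean far from the training mean?
import numpy as np

def drift_alert(train_col, live_col, threshold=3.0):
    shift = abs(live_col.mean() - train_col.mean()) / (train_col.std() + 1e-9)
    return shift > threshold

rng = np.random.default_rng(1)
train_delays = rng.normal(loc=15, scale=10, size=10_000)  # stand-in training data
live_delays = rng.normal(loc=55, scale=10, size=1_000)    # e.g. a freak weather month

print("input drift detected:", drift_alert(train_delays, live_delays))
```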

Kaggle solves all of this for you.

With Kaggle, almost all of these problems are already solved for you: you don’t need to worry about the engineering aspects of running a model on live data, the underlying business problem, choosing a metric, or collecting and cleaning up data.

You won’t go through all these steps just once – maybe you’ll build a model and it won’t perform well so you’ll try to add some additional features and see if you can build a better model. Or maybe how useful the model is to your business depends on how good the results are.

Doing Kaggle problems is fun! It means you can focus on machine learning algorithm nerdery and get better at that. But it’s pretty far removed from my job, where I work on a team (hiring!) that thinks about all of these problems. Right now I’m looking at measuring models’ performance once they’re in production, for instance!

So if you look at Kaggle leaderboards and think that you’re bad at machine learning because you’re not doing well, don’t. It’s a fun but artificial problem that doesn’t reflect real machine learning work.

(to be clear: I don’t think that Kaggle misrepresents itself, or does a bad job – it specializes in a particular thing and that’s fine. But when I was starting out, I thought that machine learning work would be like Kaggle competitions, and it’s not.)

(thanks to the fantastic Alyssa Frazee for helping with drafts of this!)


Machine Learning is Kaggle Competitions


Julia Evans wrote a post recently titled “Machine learning isn’t Kaggle competitions”.

It was an interesting post because it pointed out an important truth: if you want to solve business problems using machine learning, doing well at Kaggle competitions is not a good indicator of that skill. The rationale is that the work required to do well in a Kaggle competition is only a piece of what is required to deliver a business benefit.

This is an important point to consider, especially if you are just starting out and find yourself struggling to do well on the leaderboards. In this post we ruminate on how competitive machine learning relates to applied machine learning.


Competitions vs the “Real World”

Julia made an attempt at a Kaggle competition and did not do well. The twist is that she does machine learning as part of her role at Stripe. It was the disconnect between what makes her good at her job and what it takes to do well in a machine learning competition that sparked the post.

Scope must be limited to be able to assess skill. You know this if you have ever taken a test at school.

Think of a job interview. You can get the candidate to hack on the production codebase, or you can get them to work through an abstract standalone problem. Both approaches have their merits: the benefit of the latter is that it is simple enough to parse and work through in an interview setting, while the former may require hours, days, or even weeks of context.

You can hire a candidate based purely on their test scores, hire a programmer based on their Top Coder ranking, and hire a machine learning engineer based on their Kaggle score, but you must have confidence that the skills demonstrated in those assessments translate to the tasks required of them on the job.

That last part is hard. That’s why you throw live questions at candidates to see how they think on the fly.

You can be awesome at nailing competitions and poor at machine learning on the fly, or in the context of the broader set of expectations of an engineer in the workplace. You can also be great at machine learning in practice and do poorly in competitions, as is reasonably claimed in Julia’s case.

Broader Problem Solving Process

The key to Julia’s argument is that the machine learning required in a competition is but a piece of the broader process required to deliver a result in practice.

Julia uses predicting flight arrival times as the problem context for driving this point home. She highlights facets of the broader problem as follows:

  1. Understand the business problem
  2. Choose a metric to optimize
  3. Decide what data to use
  4. Clean up your data
  5. Build a model
  6. Put the model into production
  7. Measure model performance

Julia points out that a Kaggle competition covers only step 5 (build a model) in the above list.

It’s a great point and I totally agree. I would point out that I do think that what we do in a Kaggle competition is machine learning (hence the title of this post) and that the broader process is called something else. Maybe that was data mining, maybe it is applied machine learning, and maybe this is what people mean when they throw around data science. Whatever.

Machine Learning is Hard

The broader process is critical and I stress this all of the time.

Now think about the steps in the process in terms of the technical skills and experience required. Data selection, cleaning, and model building are hard technical tasks that require great skill to do well. To some degree, a data analyst or even a business analyst could perform many of these duties, except for the model-building step.

I may be out on a limb here, but perhaps that is why machine learning is put on such a high pedestal.

It is hard to build great models. Very hard. But the great models as defined by a machine learning competition (score against a loss function) are almost never the same as the great models required by the business. Such finely tuned models are fragile: they are hard to put into production, hard to reproduce, and hard to understand.

In most business cases you want a model that is “good enough” at picking out the structure in the domain rather than the very best model possible.
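One way to make that tradeoff visible: score a simple, easy-to-deploy model and a heavier tuned one on the same cross-validation split, and ask whether the gap (if any) pays for the extra fragility. A sketch with random stand-in data:

```python
# Sketch: is the complex model enough better to justify its fragility?
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                        # stand-in features
y = X @ rng.normal(size=6) + rng.normal(size=500)    # stand-in target

for name, model in [("simple ridge", Ridge()),
                    ("tuned boosting", GradientBoostingRegressor(n_estimators=500,
                                                                 max_depth=5))]:
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f}")
```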

Julia makes this point by referencing Netflix’s decision not to deploy the winning models from the Netflix Prize.

Competitions Are Great

Kaggle competitions, like conference competitions before them, can be great fun for participants.

Traditionally, they have been used by academics (mostly grad students) to test out algorithms and to discover and explore the limits of specific methods. Algorithm bake-offs are common in research papers, but of little benefit in practice. This is known.

The key point and the point I believe Julia set out to make is to not despair if you find yourself struggling to do well in Kaggle competitions.

It is very likely because the competition environment is hard and the evaluation of your skill is disproportionately biased towards one facet of what is required to do well in practice: model building.


