Data Scientist Interview

Chris Moody

Data Scientist at Stitch Fix

Astrophysics to Data Science

Chris Moody started off his journey towards data science by peering off into distant galaxies, studying computational astrophysics at UC Santa Cruz as a graduate student.

As the data revolution hit the fields of science, however, Chris found himself having to learn how to use more sophisticated tools that could process more data. He dove into programming and began contributing to open-source astrophysics projects.

All this culminated in a data science fellowship at Insight Data Science. After completing his Fellowship, Chris joined Square’s Data Science team. After leaving Square, Chris is now a data scientist at Stitch Fix, a fashion startup.

Thank you very much for being with us, Chris. Can you tell us a little bit about your background?

I went to Caltech as an undergrad to study Physics. There, I had projects that were largely computational.

For example, a project I was involved in was looking at dark matter simulations. Basically, we don’t know that much about dark matter, but we can guess at things that it could possibly do. One of those things is that it could decay. If it decays, the dark matter particle gets a kick, and it goes off in a random direction at a random speed. Galaxies are sitting at the bottom of a gravity well; they’re like bread crumbs in a big bowl of dark matter. If the dark matter were spontaneously decaying and getting lots of extra energy, it could popcorn out, and totally change the profiles of galaxies in an essential way. This was a strongly computational project that taught me many skills.

After Caltech, I came to Santa Cruz for graduate studies, still working in computational astrophysics. While I was there, I was doing all sorts of things pertaining to galaxies. We would look through the Hubble Space Telescope at the youngest galaxies in the universe and notice that they were not at all like the galaxies today. Galaxies today are beautiful spiral structures. But when you look back at the youngest galaxies, they are lumpy and clumpy… they look like soup.

So, one of the questions was: Does that mesh well with our ideas of how our universe formed? We started to look at the simulations and realized that what we observed through the telescope is what we were seeing in our simulations. We were super surprised at these theoretical predictions coming true!

The next part followed the standard trajectory of a lot of businesses. We got one or two really positive examples of galaxies matching our predictions, and were very excited about the progress. But it was only one or two examples; we wanted to know if this was statistically significant, and so we started to scale up our data. We exploded from 100 gigabytes to hundreds of terabytes of data. This all started at the NASA Ames supercomputer.

It turns out that it’s really hard to answer simple questions when those questions don’t fit onto one computer. So we had to scale up a lot of our algorithms, and build our own infrastructure and framework. It was at that point that we started to get really interesting results. We started to see that this is generally true, and this attracted a lot of people to our project, scaling up our people power. So we’d get other new graduate student astronomers and explain, ‘This is how we work; this is how you can be efficient.’

I think the romantic, public idea of a scientist is that you jump into a cave and then five months later, you have a “Eureka” moment and you come out. Then it’s glorious. But that’s not really how it works. The reality is: you have lots of bugs, you make lots of errors, and you have to work as a team, which means you have to be able to work efficiently. You have to know how a pull request works. You have to know how commits work. You have to know how to document. You have to file bugs and report to issue trackers. You have to do all of these things.

At the end of all that, I realized that I most liked working with data. I liked working with algorithms. Actually, I absolutely loved working with algorithms.

I spent more time reading about how the algorithms worked and how they found all this truth, despite all the noise and red herrings in the data. I loved doing that and working with people on a project together. It was great. I thought galaxies were cool, don’t get me wrong, but I liked algorithms more.

It sounds like you spotted a project, saw that it was interesting and used your experience of working on it to explore your interests. How did your background in science inform your work as a data scientist?

Science is getting harder to do. It’s harder to do it individually and it has to happen as part of a team; a collaborative effort, so we can measure different things. Looking at papers from 50 years ago: having a paper with 50 authors on it was ridiculous, that just never happened. Half the papers out there were published with only one or two authors on them.

Now, that’s ridiculously absurd. I can’t remember the last time I read a paper with only one author on it.

It’s just because the instruments you have to use are larger. We end up having to use supercomputer resources or we have to use the Hubble Space Telescope to get somewhere. This means that the data and ideas are starting to grow much larger than one person can manage. In turn, it means that you have to learn how to work with other people. So that’s a paradigm shift of science, and also something that I think industry has been familiar with for a much longer period of time.

At the same time, a lot of my exposure to things like software engineering best practices, or even computer science, was completely self-taught. I didn’t take any formal classes in these fields.

That’s really interesting that it worked out so well, and also that that didn’t hinder you.

I think that’s actually pretty normal. Look at some start-ups. They’re really interested in finding someone who can actually do the work; someone who is trying to find and build a whole community and foster that growth. Take that person from the 90th percentile and just teach them the remaining 10% of the small skills needed. These startups are basically instilling habits; thinking about what you’re going to do and how it’s going to reflect on everyone else in the network, instead of being an isolated person.

Sometimes, that has to happen as a feedback reflex. You have to think of how you’re going to fit in with everything else. You have to think about how your code is going to be used by others. I was lucky in that I had a community leader in my project who was really interested in teaching everyone else how to work together, and I learned a lot from him.

Of your friends and peers from Caltech, many of whom have also gone on to do heavy computational physics research, have you found that a substantial portion of them are heading towards industry?

Yes, especially in astrophysics. I can’t tell you how many plots I have seen in the last year with the number of faculty jobs remaining constant, or maybe even slightly decreasing with time, compared with the sky-high number of post doctorates. That means that the likelihood of a post doctorate job opening is going down at a ridiculous rate. Even when I was in graduate school, the expected number of postdoctoral candidates per position went from two to three. If it kept going at that rate, by the time I’d finished my first post doctorate, the expected rate would be four postdocs to every one position.

Clearly, there’s a huge supply of post doctorates and not that many positions within academia.

How much did those academic job statistics influence your decision on what you wanted to do after graduate school? Did you feel you could get the same intellectual stimulation from problems in industry as you received in academia?

Yes, it was a hard decision, but you look at it and think, ‘How many times do I really want to roll the dice? How much do I really like this?’ That fear of not finding a job really destroys a lot of the romance of science. I feel like a lot of people start doing science because they have this romantic notion of becoming the best scientist, or contributing in a noble way. But the truth is that science is a shitty ride.

You can do a lot of the same things that science will let you do, but you don’t have to do these things in the world of academia. You can work on science in industry. When I made that realization, and understood I could do a lot of the science, and be involved in a lot of the cool stuff I’d tried to do in the first place, it made me realize that I could switch to a new job outside of academia. At the same time, I didn’t feel that I was giving up on what drove me initially. There are a lot of startups that are changing the world, so instead of trying to define clumps and galaxies, I could try to actually work with somebody, and try to change the world. I thought this was really cool and super exciting.

So then you joined Insight – a six-week-long Fellowship for PhDs looking to enter the field of data science. How much of what the Fellowship taught you would you say was new to you?

All of it. There’s a paradigm shift from science to industry. Everything in science is about a fully detailed presentation of an idea; exhaustively explicating all of the caveats. All of the communication is bordered on fully defined facts, or at least as much as possible. You look at the borders of your project, the borders of the results, and you know the downsides and you know the upsides, and that’s because you’re terrified that someone will find a deficit in your project, and then nail you for it.

But then the opposite is true in business. The biggest problem is that people have very limited bandwidth. It takes a lot of effort, and there are a lot of people demanding it. So the crux of everything in business is actually being able to move all of your results in as terse and precise a fashion as possible.

You don’t need to delineate all of the possibilities, you just need to say what is the major point, and you can go on from there. So, a lot of what Insight taught me was that you need to condense all of your results down as quickly as possible. You get someone’s attention and you go; that’s the hardest part. As scientists, we were taught to give an hour-long lecture on our project. We didn’t have to consider whether our audience was being entertained or not. If they’re not interested, you don’t care. They’re not your audience if they weren’t interested in the first place.

It’s the opposite idea during the Insight Data Science Fellowship. You have to go out and you have to make every single connection for yourself. You have to boil it down and make it completely convincing that everything you’re saying is relevant to them, and you have to do it in 5 seconds. Everything is an elevator pitch. Every YC Company has to give demos in 180 seconds. So Insight is all about building a demo in those 6 weeks, and then pitching it in 180 seconds. You’re basically pitching yourself as a candidate to those companies. You’re saying, ‘Don’t look at me like a graduate student. I’m actually super goal-orientated, or systems-orientated. I can take all this data, apply these algorithms, and give you some amazing results.’ That’s what those three minutes are for, and that’s the whole paradigm shift. Now, the focus is not so much on the new idea or how much you’ve added to the body of knowledge. The focus is what can you tell me in 100 seconds. That’s all the CEO has time for.

In scientific lectures, you’re not trying to reach a super-broad audience. In the case of science, you’re trying to deliver an idea, and then you try to back it up in 15,000 words.

You need to do that in business as well. You need to be able to take your idea and defend it. The thing is that, here, you’re no longer trying to defend it to the CEO, you’re no longer trying to defend it to anyone else. You just need to defend it to yourself, and then you need to give them the ideas; there’s an implicit trust there.

No one else is going to check your work and no one else should check your work. You need to be an independent party and you need to break it down as to what is important.

You have to build up small kernels of truth, and that’s all you can deliver. A lot of the time, people find it distressing, but I thought it was great. I thought it was an awesome challenge to be able to compress my message down and figure out what all the tidbits are. It’s like a whole design philosophy. I liked the idea of throwing out everything except for what you need to function. I like it from a designer standpoint and also from an algorithms and data analysis standpoint. I think that embodying that philosophy was the single most successful part of the Insight Fellowship.

“Data science” has now become a very common phrase in many business sectors. Yet, it’s still nebulous and no one is really sure what it means. So, what does data science mean to you? How would you break it down?

It means a lot. It always means measuring data, being able to make sense of that data, creating models of that data, and most importantly, being able to communicate what that data means.

I think data science splits into two fields, and I believe a lot of hiring companies are starting to reflect this. Data science is starting to break off into descriptive analytics and predictive analytics.

Descriptive analytics is, ‘we saw this trend.’ Or, for example, ‘We saw this spike or dip… is that because our service crashed? We saw this huge spike… is it a multiplicity of things?’ It’s always asking questions of dynamics, and then asking what is going on. So the raw data comes back, and then you make something useful – actionable business intelligence – from that data. That’s descriptive analytics: taking data that has been produced and trying to make heads or tails out of it, to drive some decisions out of it. So that might mean, ‘We saw some really exciting events in Bulgaria, but why is our site exploding in Bulgaria and nowhere else?’ You may find out that it’s not really from Bulgaria, or that it’s raining everywhere else, or a volcano just went off and everybody’s Tweeting about it, or something ridiculous like that.
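
As a rough illustration of the "we saw this spike or dip" question, the sketch below flags days whose traffic deviates sharply from the preceding week. It is only one simple way to surface such events, not a method described in the interview: the function name `find_spikes`, the seven-day window, the threshold of three standard deviations, and the sample numbers are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_spikes(counts, window=7, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` values by more than `threshold` sample standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: flag any change at all.
            if counts[i] != mu:
                spikes.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily visit counts with one obvious jump on day 8.
daily_visits = [100, 102, 98, 101, 99, 103, 100, 97, 350, 101]
print(find_spikes(daily_visits))  # prints [8]
```

Flagging the day is only the start of the descriptive work; deciding whether the jump is a crash, a bug, or a real event still means looking at the data.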

The other side of data science is predictive analytics; being ahead of the game. This is where you’re shifting towards machine learning algorithms. You’re looking at things such as fraud, where you’re trying to predict whether a transaction is fraudulent or not. Or, you’re trying to figure out security applications: is this malevolent activity? But that’s what it is, fundamentally. It’s pattern finding within all the data, in real-time, which adds additional constraints on computational complexity.

Data science is rapidly becoming something concrete, especially as it becomes a more well-defined field. But it’s definitely splitting off into those two directions of data, analyzing it and figuring out underlying trends. If there are multiple trends, maybe it’s multiple elements stacking up to produce the signal you’re looking at. Maybe it’s not really a signal at all, and it’s a bug somewhere, so you have to look at the data.

The other side is not just trying to make heads or tails of the data, but also making predictions. Which city are we going to open up in next? What are the relevant quantities? A lot of business is driven by intuition and gut feelings, and this scares a lot of people. CEOs are trying to pitch entire companies on feelings, essentially. They’re trying to drive home their points on a colloquial basis. The whole field of data science is trying to turn that feeling into something a little more rigorous; trying to deliver on something that’s not intuition, and finding something that you can ground yourself on. That gives your business a lot of stability, especially when there’s a lot of startups and they’re all thinking of great ideas, but only some of them are really as great as they believe, and most of them won’t pan out.

You engage a data scientist at the point when you’re looking to add an incremental value. That’s not going to make your business take off, it’s not guaranteed. But at least it will give you something that’s not solely based on a feeling.

Of the two different types of data science you articulated, do they also require different skills?

For the most part, they require a lot of the same core skills. Predictive data science requires a little more machine learning type skills, and descriptive probably requires a lot more statistical skills. But then, in predictive data analysis, you might be using a lot more random forests or neural networks – all these really cool algorithms.

Which side of data science, from your physics background, seems more intuitive to you?

I started learning programming in high school, because I wanted to play around with genetic algorithms. So that’s been a long-running interest. Even though I went off and did experimental physics and computational astrophysics, I’ve always had this background of really wanting to do machine learning. That appeals more to the predictive side than the descriptive side. Both of them have a lot of overlap. There’s not a wall between the two, but you can start to see the continuum of data science. So, I think I’m far more attracted to the predictive side. Neural networks I just think are really cool because you’re essentially training artificial intelligence. You’re taking these tiny artificial brains and making a decision with them. You’re actually turning a whole company based on that.

What do you feel are the defining qualities of a top-notch data scientist, compared with someone who is merely good?

I think it deals with communication. I think that’s the difference between the good scientists and the great. Both are going to know a lot about statistics, the techniques they can use, and how to design, implement, and execute an experiment. Those things are all important. The biggest thing, though, is that you need to be able to communicate those results. That’s a lot harder than it looks.

I think the easiest thing for a graduate student to do, coming into this field, is to gloss over it, but that’s the single most important thing. Most people complain that graduate students don’t have a great programming background. All of their other intuitions, well designed experiments, caveated results, are sound. But I think that a lot of people believe that a programming background is not necessary.

So, maybe it is programming for a lot of people, but if you’re already pretty good, then you’re probably already a good programmer. The last step is just communication. People need to sense the passion inside of you. This defines the most successful people. It’s the realization that you are working with other people, and for a lot of scientists, I think that’s quite a shock. It really goes against this notion of romantic science.

Isaac Newton spent three years in a shack during the plague. He didn’t want to get the plague and he hated talking to everyone. Granted, he was possibly autistic in some ways, but I think a lot of people follow that archetype of going back and living by themselves, and then they emerge with all of their findings. But in reality, it needs to be a much more continuous process. It needs to be a much smoother process than just coming back and reeling off a list of accomplishments. So it’s always communication, but that’s the easiest part to skip over.

What do you see as the promise with data science, and also the interplay between mathematics and computer science, that really speaks to you? Where does your passion lie?

We’re living in a really exciting time because I think what were formerly highly theoretical principles are finally having an impact on the world. Before, I was looking at clumps and galaxies. To do that I needed to run clustering algorithms. I needed to be able to run distributed frameworks on thousands of nodes to answer basic questions.

Now, I can do almost the same stuff, and I can tweak a learning algorithm that teaches students in the best way they can learn. There’s a whole feedback system that says, “you should answer these questions, and then five minutes from now, we’ll come back and repeat it, and then we’ll come back a week later and repeat it again.”
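
The review pattern in that quote (ask again after five minutes, then again a week later) can be sketched as a tiny schedule. This is a minimal illustration only: the names `REVIEW_DELAYS` and `review_times` are hypothetical, and measuring both delays from the first answer is an assumption, since the interview does not say how the intervals are anchored or tuned.

```python
from datetime import datetime, timedelta

# Delays taken from the quote: five minutes, then one week.
# Assumption: both are measured from the time of the first answer.
REVIEW_DELAYS = [timedelta(minutes=5), timedelta(weeks=1)]


def review_times(first_answered, delays=REVIEW_DELAYS):
    """Return the times at which a question should be shown again."""
    return [first_answered + delay for delay in delays]


first = datetime(2024, 1, 1, 9, 0)
for when in review_times(first):
    print(when.isoformat())
```

A real learning system would presumably adjust these delays per student from observed answers, which is where the "tweak a learning algorithm" part comes in.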

The wonderful thing is that those algorithms, that whole pattern, is being replicated from galaxies to psychology and cognition. All of these high topics of knowledge are beginning to trickle down, and they’re actually making a real impact on day-to-day interactions. There is not a single company on NASDAQ that doesn’t use some aspect of this. Your Facebook Newsfeed is highly tweaked to give you everything that you think is relevant, and new content to test your preferences.

LinkedIn is using all kinds of graph networks. Square is using all these fraud detection techniques. HealthTap is fielding all of these questions, and training a computer to understand what these questions are. And there really are doctors who will be answering a lot of those medical questions.

The cool thing here is that they can take a doctor and clone him virtually. He can answer a question, and that might reduce patient time in a hospital somewhere. And when you take that power, and you multiply it by the number of patients in the whole world – it’s a huge number. These are real things. We’re not limited to theoretical worlds. You really can go out and have awesome effects immediately, and they’re tangible. We’re collecting more and more data, to the point that there are not that many aspects of life that aren’t becoming data driven. So it’s super exciting.

Imagine if you were able to go back to the beginning of your graduate school career, and you meet yourself coming in the corridors and you have a five minute window to speak to yourself. Would you tell yourself to do anything differently?

A lot of it would have centered on working more with people. I joined an open source project, and that was the single best decision in all of graduate school. I learned how to code in a collaborative way.

The second most important thing probably would have been communication. Every week, I would deliver a presentation on my results during the past week, and usually, it would boil down to giving a two or three minute feedback session at the end of that. So I was already doing a lot of communication and I wouldn’t have changed that.

My programming context was great; maybe I should have started that earlier and taken more formal programming classes. If you were to design the curriculum, I’d say you have to have a lot of programming. A lot of classes are like, ‘go and do this assignment.’ The real world is, ‘go do this assignment but you only have to do this module, and someone else will do the next module. You guys need to be working collaboratively.’

They should also be doing lots of statistics, and they should be able to do it as quickly as possible. People love to talk about this Pareto Principle, where 80% of the outcomes result from 20% of the effort. The hard part is trying to figure out where that 80% line actually is, and once you realize you’re at it, stop.
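
One hedged way to look for "where that 80% line actually is": rank the pieces of work by how much they contribute and count how many of the largest ones are needed to reach 80% of the total. The function name `pareto_cutoff` and the contribution numbers below are made up for illustration; in practice the hard part is estimating the contributions in the first place.

```python
def pareto_cutoff(contributions, target=0.8):
    """Return how many of the largest contributions are needed to reach
    `target` (a fraction between 0 and 1) of the total."""
    total = sum(contributions)
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    running = 0
    ranked = sorted(contributions, reverse=True)
    for count, value in enumerate(ranked, start=1):
        running += value
        if running >= target * total:
            return count
    return len(ranked)


# Hypothetical value delivered by eight analysis tasks.
task_value = [45, 25, 15, 5, 4, 3, 2, 1]
needed = pareto_cutoff(task_value)
print(f"{needed} of {len(task_value)} tasks reach 80% of the value")
```

With these invented numbers, the three largest tasks already pass the 80% mark, which is the point at which the advice above says to stop.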

How can people find open source projects to participate in?

A lot of the time, they already exist. You probably already know what they are because you hear about them. The biggest thing is not to be shy about it, and not to be scared off. It took me a long time to work up the courage to actually push code back out and be able to take the criticism. No matter where you’re working, there are other people working with similar problems. Just go out and search for them. If they haven’t solved your specific niche problem, join the effort. It’s a worthwhile process. It’s really hard to convince graduate students about this, who are already overwhelmed with a lot of other things, but it is definitely the best part of those five years.

Your advisor is going to be pushing you for results, and my advisor said it had been years since he’d written any code. So you might not realize how important this is. But in a world that is becoming way more team-based, both in industry and science, it’s super important to push everything into a team-based context.

Also, if you’re in science, you’re all about trying to communicate your results. One of the best ways to do that is through your open source network. They have an audience there, waiting for you, and they might be really interested. A lot of it is, ‘I built this feature onto this project.’ They’ll go try it out and maybe they’ll write a paper about it, and then you get an extra citation.

There are a lot of extra indirect effects. The direct effect is that you’ll be better. The indirect effects are that there are a lot of other people who will benefit, and that will reflect very well on you.

It’s a little unfortunate that the primary currency of science is citations and not source code, even though that’s a big infrastructure push. I think that will have to change going forward because everything is being done in a team-based context. To do science more efficiently, it has to be that way. There’s no other alternative.
