Original:
Machine Learning: What skills are needed for machine learning jobs?
I am a learner sitting at home and studying linear algebra. I am very interested in working in machine learning someday, but I am not sure about:
a) What technical skills are needed for an interview/job.
b) Whether any relevant work experience is mandatory.
I have taken the initiative to at least start rather than just think about doing it, so any suggestion/guidance would be very helpful and appreciated.
Joseph Misiti, I ♥ machine-learning
In my opinion, these are some of the necessary skills:
1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries [2] are awesome because they offer functionality similar to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).
2. Probability and Statistics: A good portion of learning algorithms are based on this theory: Naive Bayes [6], Gaussian Mixture Models [7], and Hidden Markov Models [8], to name a few. You need a firm understanding of probability and statistics to understand these models. Go nuts and study measure theory [9]. Use statistics to evaluate your models: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.
3. Applied Math + Algorithms: For discriminative models like SVMs [10], you need a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], partial differential equations [15], etc. Get used to looking at summations [16].
4. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science [17]). You cannot process this data on a single machine; you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core distributed computing problems, you still need a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].
5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably to master all of the awesome Unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely happen on Linux-based machines (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].
6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32], Hive [33], Mahout, etc. These projects can help you store/access your data, and they scale.
7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like wavelets [42], shearlets [43], curvelets [44], contourlets [45], and bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier analysis [48] and convolution [49], you will need to learn about this stuff too. The latter two are signal processing 101 stuff though.
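To make the map-reduce model from item 4 a little more concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the function names (mapper, reducer, word_count) are made up for illustration, and the shuffle phase that a real framework performs across machines is simulated with a plain sorted() call.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one (word, 1) pair per word, lowercased."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts of each run of equal keys.

    Assumes the pairs arrive sorted by key, which is what the shuffle
    phase of a map-reduce framework hands to each reducer.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, shuffle (simulated with sorted), and reduce locally."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["the quick brown fox", "The lazy dog", "the fox"]
    print(word_count(sample))
```

The point of the split is that mapper only ever looks at one line and reducer only ever looks at one key, which is what lets a cluster run many copies of each in parallel.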
Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online and you should read those also [38][39][40]. Here is an awesome course I found and re-posted on github [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all.
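As one possible starting point for the "code an SVM from scratch" exercise, here is a rough sketch of a linear soft-margin SVM in plain Python. Several simplifications are assumptions of this sketch rather than anything from the answer above: it has no kernel (so there is no gamma), it minimizes 0.5*||w||^2 + cost * sum of hinge losses with full-batch sub-gradient descent instead of solving the quadratic program, and the function names, learning rate, and epoch count are arbitrary illustrative choices.

```python
def train_linear_svm(points, labels, cost=1.0, learning_rate=0.01, epochs=200):
    """Sub-gradient descent on 0.5*||w||^2 + cost * sum(hinge losses).

    points: list of equal-length numeric tuples; labels: +1 or -1.
    """
    dim = len(points[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = list(weights)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1:  # inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= cost * y * x[i]
                grad_b -= cost * y
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy, linearly separable data made up for this example.
    points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
    labels = [1, 1, 1, -1, -1, -1]
    weights, bias = train_linear_svm(points, labels)
    print(weights, bias)
    print([predict(weights, bias, p) for p in points])
```

Comparing the learned hyperplane against what an established package gives on the same data is a good way to check an implementation like this, and the points whose margin hovers around 1 during training are the ones playing the role of support vectors.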
Good luck.
[1] http://radar.oreilly.com/2011/04...
[2] http://numpy.scipy.org/
[3] http://www.r-project.org/
[4] http://hadoop.apache.org/
[5] http://hadoop.apache.org/common/...
[6] http://en.wikipedia.org/wiki/Nai...
[7] http://en.wikipedia.org/wiki/Mix...
[8] http://en.wikipedia.org/wiki/Hid...
[9] http://en.wikipedia.org/wiki/Mea...
[10] http://en.wikipedia.org/wiki/Sup...
[11] http://en.wikipedia.org/wiki/Con...
[12] http://en.wikipedia.org/wiki/Gra...
[13] http://en.wikipedia.org/wiki/Qua...
[14] http://en.wikipedia.org/wiki/Lag...
[15] http://en.wikipedia.org/wiki/Par...
[16] http://en.wikipedia.org/wiki/Sum...
[17] http://radar.oreilly.com/2010/06...
[18] http://aws.amazon.com/ec2/
[19] http://en.wikipedia.org/wiki/Goo...
[20] http://mahout.apache.org/
[21] http://incubator.apache.org/whirr/
[22] http://en.wikipedia.org/wiki/Map...
[23] http://hbase.apache.org/
[24] http://en.wikipedia.org/wiki/Cat...
[25] http://en.wikipedia.org/wiki/Grep
[26] http://en.wikipedia.org/wiki/Find
[27] http://en.wikipedia.org/wiki/AWK
[28] http://en.wikipedia.org/wiki/Sed
[29] http://en.wikipedia.org/wiki/Sor...
[30] http://en.wikipedia.org/wiki/Cut...
[31] http://en.wikipedia.org/wiki/Tr_...
[32] http://zookeeper.apache.org/
[33] http://hive.apache.org/
[34] http://static.googleusercontent....
[35] http://static.googleusercontent....
[36] http://static.googleusercontent....
[37] http://static.googleusercontent....
[38] http://www.ics.uci.edu/~welling/...
[39] http://www.stanford.edu/~hastie/...
[40] http://infolab.stanford.edu/~ull...
[41] https://github.com/josephmisiti/...
[42] http://en.wikipedia.org/wiki/Wav...
[43] http://www.shearlet.uni-osnabrue...
[44] http://math.mit.edu/icg/papers/F...
[45] http://www.ifp.illinois.edu/~min...
[46] http://www.cmap.polytechnique.fr...
[47] http://en.wikipedia.org/wiki/Tim...
[48] http://en.wikipedia.org/wiki/Fou...
[49] http://en.wikipedia.org/wiki/Con...
Translation:
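Q: What skills are needed for machine learning jobs?
I am a beginner teaching myself machine learning at home, and I am currently studying linear algebra.
I would very much like to work in machine learning someday, but I am unsure about the following:
a) What technical skills are needed to interview for, or to do, such a job.
b) Whether relevant work experience is a hard requirement.
I am not just daydreaming; I have already taken the first step, so any suggestion or guidance would be very helpful, and I would sincerely appreciate it.
A:
In my opinion, these are the necessary skills: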
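1. Python/C++/R/Java: If you want a job in machine learning, you will probably want to learn all of these languages. Python's NumPy and SciPy scientific computing libraries [2] are powerful because they offer functionality similar to MATLAB, but unlike MATLAB, Python can easily be integrated into a web service and used with Hadoop. C++ is also needed to speed code up. R [3] is well suited to statistics and plotting. Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you can use a scripting language through Hadoop streaming, a programming utility that ships with Hadoop [5]).
2. Probability and statistics: A large part of machine learning is built on this theory, for example Naive Bayes [6], Gaussian mixture models [7], and hidden Markov models [8]. You need a solid understanding of probability and statistics to make sense of these models. Go for it and study measure theory [9]. Use statistics to evaluate your models: confusion matrices, receiver operating characteristic (ROC) curves, p-values, and so on.
3. Applied math and algorithms: For discriminative models such as support vector machines [10], you need a solid understanding of the underlying algorithm theory. Even though you will probably never need to implement an SVM from scratch, the theory helps you understand how the algorithm works. You will need to be familiar with concepts such as convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], and partial differential equations [15]. Get used to reading summation notation [16].
4. Distributed computing: Most machine learning work today involves large data sets (see Data Science [17]). You cannot process such data on a single machine; you have to distribute it across a whole cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this easy and inexpensive. Although Hadoop hides many details of the core distributed computing problems, you still need a good understanding of MapReduce [22], distributed file systems [19], and so on. You will most likely want to look into Apache Mahout [20] and Apache Whirr [21].
5. Expertise in Unix tools: Unless you are very lucky, you will need to change the format of your data sets so they can be loaded into R, Hadoop, HBase [23], and similar platforms. You can do this with a scripting language such as Python, but the best approach is probably to master the excellent Unix tools designed for exactly this kind of data processing: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since this processing will most likely run on Linux-based machines (as far as I know, Hadoop does not run on Windows), you will have access to these tools. Learn to love them and use them as much as possible. They have certainly made my life easier. A good example of how powerful these tools are can be found here [1].
6. Get familiar with the Hadoop sub-projects: HBase, Zookeeper [32], Hive [33], Mahout, and so on. These projects help you store and access your data, and they scale.
7. Learn advanced signal processing techniques: Feature extraction is one of the most important parts of machine learning. If your features are poor, you will see bad performance no matter which algorithm you choose. Depending on the type of problem you are trying to solve, you may be able to use some very cool advanced signal processing algorithms such as wavelets [42], shearlets [43], curvelets [44], contourlets [45], and bandlets [46]. Learn time-frequency analysis [47] and try to apply it to your problems. If you have not yet read about Fourier analysis [48] and convolution [49], you will need to learn those too; the latter two are signal processing 101 material.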
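Finally, practice and read as much as you can. In your free time, read papers such as Google Map-Reduce [34], Google File System [35], Google Big Table [36], and The Unreasonable Effectiveness of Data [37]. There are many free machine learning books online, and you should read those too [38][39][40]. I found a great course and re-posted it on github [41]. Instead of using open source packages, write your own implementation and compare its results with theirs. If you can implement a support vector machine from scratch, you will understand support vectors, gamma, cost, hyperplanes, and so on. Loading data and starting to train is easy; the hard part is making sense of it all.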
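Translator: 林羽飞扬
Source: http://www.cnblogs.com/zhengyuhong/
Copyright of the original text belongs to its author. If this translation infringes any rights, please contact the translator. Reposting is welcome, but unless the author agrees otherwise, the original author and translator credits must be kept and a link to the original must be placed prominently on the page.
This article is licensed under the Creative Commons Attribution-NonCommercial 3.0 license. Reposting and adaptation are welcome, provided the byline 林羽飞扬 is kept. For inquiries, please send me a message.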