R, Octave, and Python: Which Suits Your Analysis Needs?



Analysts and engineers on a budget are turning to R, Octave and Python instead of data analysis packages from proprietary vendors. But which of those is right for your needs?

Some businesses want all the benefits of a top-shelf data analysis package, but lack the budget to purchase one from SAS Institute, MathWorks, or another established, proprietary vendor.

However, analysts can still rely on open-source software and online-learning resources to bring data-mining capabilities into their organization. In fact, many are turning to R, Octave and Python with exactly this goal in mind.

Why Those Three?

When it comes to machine learning (the creation of algorithms that allow machines to recognize and react to patterns), matrix decomposition algorithms are critical. R, Octave and Python are flexible and easy to use for vectorization and matrix operations; they’re not just data-analysis packages, but also programming languages for creating one’s own functions or packages.
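
As a rough illustration of what vectorization and matrix decomposition look like in practice, here is a minimal Python sketch. It assumes the NumPy library is installed; the toy matrix and the variable names are made up for this example and do not come from any particular package.

```python
import numpy as np

# Hypothetical toy data: 4 observations of 3 variables.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [1.0, 1.0, 1.0]])

# Vectorized operation: center every column without writing a loop.
centered = A - A.mean(axis=0)

# Singular value decomposition: centered = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Rebuild the matrix from its factors to check the decomposition.
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, centered))
```

R and Octave express the same steps with their own built-in matrix syntax, so the idea carries over even if the function names differ.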

For analysts who lack the time to engage in extensive coding, these open-source packages also offer some very handy built-in functions and toolboxes. For example, Octave has a simple zscore function for computing z-scores, and R's built-in scale function does the same job; for Python, the function can be defined in a very straightforward manner:

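import numpy as np

def zscore(X):
    X = np.asarray(X, dtype=float)
    mu = X.mean()      # mean over all elements
    sigma = X.std()    # population standard deviation (divides by N)
    return (X - mu) / sigma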
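
One detail worth checking when standardizing data is which standard deviation is used: the population version divides by N, the sample version by N - 1. The sketch below is only an illustration under assumptions: it requires NumPy, and the function name standardize, its ddof parameter, and the data are hypothetical choices for this example. For these eight values the mean is 5 and the population standard deviation is 2.

```python
import numpy as np

def standardize(values, ddof=0):
    """Return z-scores; ddof=0 uses the population std, ddof=1 the sample std."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std(ddof=ddof)

# Made-up data for illustration.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_pop = standardize(scores)           # divides by N
z_samp = standardize(scores, ddof=1)  # divides by N - 1

print(z_pop)
print(z_samp)
```

The two results differ only by a constant factor, but the difference can be noticeable on small data sets, so it helps to know which convention a given package follows.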

If you want to use MCMC Bayesian estimation, R boasts MCMCpack, Octave can use the pmtk3 toolkit, and Python has PyMC.
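
For readers unfamiliar with MCMC, the following is a minimal sketch of the underlying idea, written with the Python standard library only. It is not how MCMCpack, pmtk3 or PyMC are implemented; it is a bare-bones random-walk Metropolis sampler, and every name in it (log_posterior, metropolis, the data, the step size) is a hypothetical choice for illustration. It assumes normally distributed data with a known standard deviation of 1 and a flat prior on the mean.

```python
import math
import random

def log_posterior(mu, data, sigma=1.0):
    # Flat prior on mu, so the log-posterior equals the Gaussian
    # log-likelihood up to an additive constant.
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)

def metropolis(data, n_samples=20000, step=0.5, seed=0):
    rng = random.Random(seed)
    mu = 0.0                          # arbitrary starting point
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)   # symmetric random-walk proposal
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples

data = [2.1, 1.9, 2.4, 2.0, 1.6]      # made-up observations
chain = metropolis(data)
burned = chain[2000:]                  # discard the warm-up portion
posterior_mean = sum(burned) / len(burned)
print(round(posterior_mean, 2))
```

With a flat prior the posterior mean should land close to the sample mean of the data. The dedicated packages add better proposals, diagnostics and model-building tools on top of this basic loop.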

All three options feature large and growing user communities (e.g., the R mailing list) that serve as vital hubs for sharing information and exchanging experiences.

Which Software Package to Choose?

Can any one of these packages do more than the other two? The answer is probably no; the three have a lot of functionality in common. That being said, R is popular among statisticians thanks to its emphasis on statistical computing. Octave has a number of industry and academic applications, and engineers and analysts often use Python for building software platforms. Someone who has worked with Matlab will find Octave the easiest of the three to pick up, as Octave is often described as the open-source “clone” of Matlab.

My suggestion is to try all three, and see which offering’s toolbox solves your specific problems. As previously mentioned, R’s strength is in statistical analysis. Octave is good for developing machine learning algorithms for numeric problems. Python is a general-purpose programming language that is strong in algorithm building for both numeric and text mining.

Based on my own user experience and research, here is a high-level summary for the three:

[Summary table comparing R, Octave, and Python; not reproduced here]

If you don’t have the time or the need to learn an entire programming language, an online universe of open-source software can provide you with multiple solutions for your specific needs. Take a little time to experiment and find the one that fits best. When searching for open-source solutions, it’s a good idea to search both for broad terms such as machine learning, data mining or artificial intelligence, and for specific implementations such as neural networks.


No matter what your skill level, open source software may have a solution for you. Open source software can range from all-in-one solutions to code libraries for sophisticated users who want a more customized solution. So whether you’re looking to learn simple regression or robotic vision, open source may have an ideal solution for you.


==============================================================================================


The following would be more useful.


==============================================================================================


Some of the comments:


gongyiliao 


I use both R and Python daily, and I have to say that the table above is a little inaccurate.

For data analysis and processing, R's syntax is much easier for data analysts and statisticians (that's why people developed the pandas package for Python).

For statistical analysis and computing, R is of course much easier than Python; just take a look at CRAN, BioConductor and OmegaHat.

In parallel computing, R and Python both have their problems (R's loops are extremely slow and not thread-safe, Python has the GIL bottleneck); for HPC, you still need C/C++/Fortran functions (but you can use Rcpp for R and Cython for Python).

There are some goodies that can help combine the strengths of both R and Python, like Rpy and Rpy2, but these packages assume the users have advanced knowledge of both languages.

I believe that being a polyglot will be a must for data analysts and statisticians in the next few years.


anonymous 


Very weak article.  R and Octave are designed for significantly different tasks than Python.  Calling Python "good" for "big data" and the others "bad" requires some imaginative and creative definitions of "good", "bad" and "big data". 


Octave is a Matlab work-alike, and many of our customers use it. Whether or not it's appropriate for their tasks is a completely different question, one I won't answer. R is an S work-alike, has an active and growing user community, and is in use for many very large BI projects.

Both Octave and R have specific places in the pantheon of analytics, usually adjacent to their respective work-alikes. Unfortunately, there is currently no operational optimizing compiler for either Octave or R, so in both cases you have something interpreted. This isn't a terrible thing ... it's great for interactive debugging ... but performance on non-natively compiled code is horrible. Just try a dense LU decomposition on a large matrix (say 4k x 4k) to see how painful it is compared to well-optimized Fortran/C.

I'd argue that Python has no real place in this group. It's the odd one out. It is a programming language, in use by a subset of scientific and engineering programmers (not the majority, or even a significant minority as indirectly implied by the author ... I've noticed over the years that Pythonians have a tendency to exaggerate their number, as well as the power of their tools).

Python is roughly akin to Perl, Java, Ruby, and other scripting and rapid application development languages.  It has many modules and kits available for it.  Not nearly as many as Perl, nor Java.  It has a strong and vocal following, akin to Ruby.

It is a programming language first and foremost, and it's not trying to masquerade as a data analysis or modeling platform. For that you need to add in modules or develop your own.

All of the programming languages mentioned above have pretty good analysis tools.  If you choose the right ones, you can get native C/Fortran level performance where you need it, and rapid application development where you need it.  In some sense, it is a good mixture. 

All this noted, we've seen many new developers go to Lua and other languages for JIT-based performance. One can get nearly native C speed from a JIT-compiled "script". This is quite impressive. We also see domain-specific languages being developed (Julia, et al.) that look to challenge the more general Octave and Matlab. Julia is very interesting at several levels.

A reader of this article might make the mistake of assuming that these are the major languages in use for the described problems.  There are many in use.  Octave less so than Matlab.  R more so than S.  Python where people who know Python use it.  Everything else, everywhere else.
