数据科学r语言
Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real-world programming ability. There are many potential programming languages that the aspiring data scientist might consider specializing in.
数据科学是一个令人兴奋的领域,它将先进的统计和定量技能与实际编程能力相结合。 有抱负的数据科学家可能会考虑采用许多潜在的编程语言。
While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:
尽管没有正确的答案,但有几件事要考虑。 您作为数据科学家的成功取决于很多方面,包括:
Specificity
特异性
When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!
当涉及到高级数据科学时,每次您都只能重新发明轮子。 学习掌握以您选择的语言提供的各种软件包和模块。 可能的程度首先取决于您可以使用哪些特定于域的软件包!
Generality
概论
A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages are going to help.
一位顶尖的数据科学家将具有良好的全面编程技能以及处理数字的能力。 数据科学中的许多日常工作都围绕着采购和处理原始数据或“数据清理”。 为此,没有任何花哨的机器学习包会有所帮助。
Productivity
生产率
In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.
在通常快节奏的商业数据科学世界中,要快速完成工作有很多话要说。 但是,这正是技术债务蔓延的原因,只有明智的实践才能使这种债务减至最少。
Performance
性能
In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise statically typed languages are considerably more fail-proof than dynamically typed. The obvious trade-off is against productivity.
在某些情况下,优化代码的性能至关重要,尤其是在处理大量关键任务数据时。 编译语言通常比解释语言要快得多。 同样,静态类型的语言比动态类型的语言具有更好的防故障能力。 明显的权衡是不利于生产力。
To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below fall somewhere on these spectra.
在某种程度上,这些可以看作是一对轴(通用性,性能和生产率)。 下面的每种语言都属于这些频谱。
With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and personal experience of myself, friends and colleagues — but it is by no means definitive! In approximately order of popularity, here goes:
牢记这些核心原则,让我们看一下数据科学中使用的一些更流行的语言。 接下来是我自己,朋友和同事的研究和个人经验的结合,但这绝不是确定的! 按流行程度大致如下:
Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Computing.
R作为较早的S编程语言的直接后代于1995年发布,此后R变得越来越强大。 该项目由C,Fortran及其本身编写,目前得到R统计计算基金会的支持。
Free!
自由!
Excellent range of high-quality, domain specific and open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
高质量,特定领域和开源软件包的优秀产品。 R提供了几乎所有可以想象的定量和统计应用程序的软件包。 这包括神经网络,非线性回归,系统发育,高级绘图以及许多其他功能。
Data visualization is a key strength with the use of libraries such as ggplot2.
借助ggplot2之类的库,数据可视化是关键优势。
Performance. There’s no two ways about it, R is not a quick language.
性能。 关于它,没有两种方法, R不是一种快速的语言 。
R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.
R是一种功能强大的语言,擅长于各种统计和数据可视化应用程序,并且开源是一个非常活跃的贡献者社区。 它最近受欢迎程度的提高证明了它在工作中的有效性。
Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently 3.6 and 2.7.
Guido van Rossum于1991年引入Python。此后,Python成为一种非常流行的通用语言,并在数据科学界广泛使用。 当前的主要版本是3.6和2.7 。
Free!
自由!
Python is a very popular, mainstream general purpose programming language. It has an extensive range of purpose-built modules and community support. Many online services provide a Python API.
Python是一种非常流行的主流通用编程语言。 它具有广泛的专用模块和社区支持。 许多在线服务都提供Python API。
Packages such as pandas, scikit-learn and Tensorflow make Python a solid option for advanced machine learning applications.
诸如pandas , scikit-learn和Tensorflow之类的软件包使Python成为高级机器学习应用程序的可靠选择。
Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s Tensorflow make Python a very exciting language to work in for machine learning.
Python是数据科学语言的很好选择,而不仅仅是入门级的语言。 许多数据科学过程都围绕ETL过程 (提取-转换-加载)进行。 这使得Python的通用性非常适合。 诸如Google的Tensorflow之类的库使Python成为一种非常激动人心的语言,可用于机器学习。
SQL (‘Structured Query Language’) defines, manages and queries relational databases. The language appeared by 1974 and has since undergone many implementations, but the core principles remain the same.
SQL (“结构化查询语言”)定义,管理和查询关系数据库 。 该语言于1974年问世,此后经历了许多实现,但是核心原理保持不变。
Varies — some implementations are free, others proprietary
不同-有些实现是免费的,而另一些则是专有的
Declarative syntax makes SQL an often very readable language . There’s no ambiguity about what SELECT name FROM users WHERE age >
18 is supposed to do!
声明式语法使SQL成为一种非常易读的语言。 对于SELECT name FROM users WHERE age >
18 SELECT name FROM users WHERE age >
, SELECT name FROM users WHERE age >
应该做什么没有任何歧义!
SQL is very used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.
SQL在各种应用程序中都非常常用,这使其成为一种非常有用的语言。 诸如SQLAlchemy之类的模块使SQL与其他语言的集成变得简单。
There are many different implementations of SQL such as PostgreSQL, SQLite, MariaDB . They are all different enough to make inter-operability something of a headache.
SQL有许多不同的实现,例如PostgreSQL , SQLite , MariaDB 。 它们之间的差异足以使互操作性令人头疼。
SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.
SQL作为数据处理语言比作为高级分析工具更有用。 然而,这么多的数据科学过程都取决于ETL,而SQL的寿命和效率证明了它是现代数据科学家了解的非常有用的语言。
Java is an extremely popular, general purpose language which runs on the (JVM) Java Virtual Machine. It’s an abstract computing system that enables seamless portability between platforms. Currently supported by Oracle Corporation.
Java是一种非常流行的通用语言,可在(JVM)Java虚拟机上运行。 这是一个抽象的计算系统,可实现平台之间的无缝移植。 目前由Oracle Corporation支持。
Version 8 — Free! Legacy versions, proprietary.
版本8 —免费! 旧版,专有。
There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and and type safety are real advantages.
学习Java作为首选的数据科学语言有很多话要说。 许多公司将欣赏将数据科学生产代码直接无缝集成到其现有代码库中的能力,并且您会发现Java的性能和类型安全是真正的优势。
However, you’ll be without the range of stats-specific packages available to other languages. That said, definitely one to consider — especially if you already know one of R and/or Python.
但是,您将没有其他语言可用的特定于统计信息的软件包范围。 就是说,绝对要考虑的一个-特别是如果您已经了解R和/或Python之一。
Developed by Martin Odersky and released in 2004, Scala is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework Apache Spark is written in Scala.
Scala由Martin Odersky开发并于2004年发布,是一种在JVM上运行的语言。 它是一种多范式语言,支持面向对象的方法和功能方法。 集群计算框架Apache Spark用Scala编写。
Free!
自由!
Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
如果您刚入门,Scala不是一种简单易用的语言。 最好的选择是下载sbt并使用特定的Scala插件设置IDE(例如Eclipse或IntelliJ)。
When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too.
在使用群集计算与大数据一起使用时,Scala + Spark是绝佳的解决方案。 如果您有Java和其他静态类型语言的使用经验,那么您也会喜欢Scala的这些功能。
Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.
但是,如果您的应用程序不处理足以证明Scala增加了复杂性的数据量,那么使用R或Python等其他语言可能会发现您的生产力要高得多。
Released just over 5 years ago, Julia has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by several major organizations including many in the finance industry.
Julia(Julia)发布于5年前,在数值计算领域给人留下了深刻的印象。 由于包括金融业在内的数个主要组织的早期采用,提高了它的形象。
Free!
自由!
The main issue with Julia is one that cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives Python and R.
Julia的主要问题是不能责怪的。 作为一种新近开发的语言,它不像其主要替代品Python和R那样成熟或可以投入生产。
But, if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.
但是,如果您愿意耐心等待,那么随着语言在未来几年的发展,我们有充分的理由要密切注意。
MATLAB is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.
MATLAB是一种在学术界和行业中广泛使用的已建立的数值计算语言。 它由MathWorks开发并获得许可,该公司成立于1984年,旨在将该软件商业化。
Proprietary — pricing varies depending on your use case
专有-定价因您的用例而异
Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as Octave. This is something you should give real consideration to.
专有许可证。 根据您的用例(学术,个人或企业),您可能需要为获得昂贵的许可证付出代价。 有免费的替代方法,例如Octave 。 这是您应该真正考虑的事情。
MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science.
MATLAB在整个行业和学术界在定量和数值领域的广泛使用使其成为数据科学的重要选择。
The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality. Indeed, MATLAB was specifically designed for this.
明确的用例是您的应用程序或日常角色需要密集的高级数学功能时。 实际上,MATLAB是为此专门设计的。
There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!
还有其他主流语言可能会或可能不会对数据科学家感兴趣。 本节提供快速概述…当然还有足够的讨论空间!
C++ is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.
尽管C ++具有闪电般的快速性能和广泛的主流流行度,但它并不是数据科学的常见选择。 简单的原因可能是生产率与性能的问题。
As one Quora user puts it:
正如Quora的一位用户所说 :
“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”
“如果您正在编写代码以进行可能仅运行一次的临时分析,您宁愿花30分钟编写将在10秒内运行的程序,还是花10分钟编写将在1秒内运行的程序?分钟?”
The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low-level.
花花公子的观点。 但是对于严重的生产级性能,C ++将是实现在低级优化的机器学习算法的绝佳选择。
Verdict — “not for day-to-day work, but if performance is critical…”
裁决-“不是日常工作,但如果绩效至关重要……”
With the rise of Node.js in recent years, JavaScript has become more and more a serious server-side language. However, its use in data science and machine learning domains has been limited to date (although checkout brain.js and synaptic.js!). It suffers from the following disadvantages:
近年来,随着Node.js的兴起, JavaScript越来越成为一种严肃的服务器端语言。 但是,它在数据科学和机器学习领域中的使用迄今受到限制(尽管结帐brain.js和synaptic.js !)。 它具有以下缺点:
Performance-wise, Node.js is quick. But JavaScript as a language is not without its critics.
在性能方面,Node.js很快。 但是JavaScript作为一种语言并不是没有批评者的 。
Node’s strengths are in asynchronous I/O, its widespread use and the existence of languages which compile to JavaScript. So it’s conceivable that a useful framework for data science and realtime ETL processing could come together.
Node的优势在于异步I / O,其广泛使用以及可编译为JavaScript的语言的存在。 因此可以想象,一个有用的数据科学框架和实时ETL处理可以融合在一起。
The key question is whether this would offer anything different to what already exists.
关键问题是,这是否会提供与已经存在的不同的东西。
Verdict — “there is much to do before JavaScript can be taken as a serious data science language”
结论—“在将JavaScript视为一种严肃的数据科学语言之前,还有很多事情要做”
Perl is known as a ‘Swiss-army knife of programming languages’, due to its versatility as a general-purpose scripting language. It shares a lot in common with Python, being a dynamically typed scripting language. But, it has not seen anything like the popularity Python has in the field of data science.
Perl因其作为通用脚本语言的多功能性而被称为“编程语言的瑞士军刀”。 它是动态类型化的脚本语言,与Python有很多共同点。 但是,它没有像Python在数据科学领域那样受欢迎。
This is a little surprising, given its use in quantitative fields such as bioinformatics. Perl has several key disadvantages when it comes to data science. It isn’t stand-out fast, and its syntax is famously unfriendly. There hasn’t been the same drive towards developing data science specific libraries. And in any field, momentum is key.
考虑到它在生物信息学等定量领域的使用,这有点令人惊讶。 在数据科学方面,Perl有几个关键的缺点。 它不是很快就脱颖而出,并且其语法众所周知是不友好的 。 开发特定于数据科学的库的驱动力不同。 在任何领域,动力都是关键。
Verdict — “a useful general purpose scripting language, yet it offers no real advantages for your data science CV”
判决-“一种有用的通用脚本语言,但对于您的数据科学CV并没有任何真正的优势”
Ruby is another general purpose, dynamically typed interpreted language. Yet it also hasn’t seen the same adoption for data science as has Python.
Ruby是另一种通用的动态类型的解释语言。 但是,它还没有像Python那样被数据科学采用。
This might seem surprising, but is likely a result of Python’s dominance in academia, and a positive feedback effect . The more people use Python, the more modules and frameworks are developed, and the more people will turn to Python.
这似乎令人惊讶,但很可能是Python在学术界的统治地位以及积极的反馈效应的结果。 使用Python的人越多,开发的模块和框架就越多,并且使用Python的人也就越多。
The SciRuby project exists to bring scientific computing functionality, such as matrix algebra, to Ruby. But for the time being, Python still leads the way.
存在SciRuby项目是为了将科学计算功能(例如矩阵代数)引入Ruby。 但是就目前而言,Python仍然处于领先地位。
Verdict — “not an obvious choice yet for data science, but won’t harm the CV”
判决-“对于数据科学而言,尚不是一个显而易见的选择,但不会损害简历”
Well, there you have it — a quickfire guide to which languages to consider for data science. The key here is to understand your usage requirements in terms of generality vs specificity, as well as your personal preferred development style of performance vs productivity.
好了,您已掌握了它-速成指南,可为数据科学考虑使用哪些语言。 这里的关键是要从通用性和特异性方面了解您的使用要求,以及您个人偏爱的性能与生产力开发风格。
I use R, Python and SQL on a regular basis, as my current role largely focuses on developing existing data pipeline and ETL processes. These languages give the right balance of generality and productivity to do the job, with the option of using R’s more advanced statistics packages when needed.
我经常使用R,Python和SQL,因为我目前的职责主要集中在开发现有的数据管道和ETL流程上。 这些语言可以在通用性和生产率之间实现适当的平衡,并在需要时可以选择使用R的高级统计软件包。
However — you may already have some experience with Java. Or you may want to use Scala for big data. Or, perhaps you’re keen to get involved with the Julia project.
但是,您可能已经对Java有一定的经验。 或者您可能想将Scala用于大数据。 或者,也许您渴望参与Julia项目。
Maybe you learned MATLAB at university, or want to give SciRuby a chance? Perhaps you have an altogether different suggestion. If so, please leave a reply below — I look forward to hearing from you!
也许您是在大学学习过MATLAB的,还是想给SciRuby一个机会? 也许您有完全不同的建议。 如果是这样,请在下面留下答复-我期待您的来信!
Thanks for reading!
谢谢阅读!
翻译自: https://www.freecodecamp.org/news/which-languages-should-you-learn-for-data-science-e806ba55a81f/
数据科学r语言