The Rise of the Machine Learning Engineer

…and the journey so far: the evolution of data and analytics

In the beginning: The Data Scientist

At the start of the 2010s, the hype around Big Data really took off. As expectations around advanced analytics and the analysis of unstructured data grew, the role of “Data Scientist” appeared on the upward slope of the Gartner hype cycle (see figure below).

At the same time, the challenges of implementing several important new data platforms referenced in the Gartner graphic were becoming apparent (e.g. MapReduce and other distributed systems, and Database Platform as a Service), and these started to appear on the downward slope of the same hype cycle. These platforms didn’t magically provide Data Scientists with the data they required, and it became clear that a lot of design and engineering was needed to align the data platforms with what the Data Scientists needed. A huge amount of hype and expectation was also developing around NoSQL databases, but this focused mainly on the needs of web-scale applications and agile development rather than the needs of data analysis.

Where most people have got to: halfway there

This disconnect between the Data and the Analytics led to disillusionment and frustration, and many data science and analytics projects could not be delivered because the data part was missing.

By 2015 Big Data had dropped off the hype cycle (see https://www.datasciencecentral.com/profiles/blogs/big-data-falls-off-the-hype-cycle), and the world of analytics and data science was pinning its hopes on new data platform technologies such as Apache Spark and Data Lakes.

Enter the Data Engineer

As a result, the role of the Data Engineer was born and demand for the position soared. By 2020, according to https://www.itjobswatch.co.uk/, 1.5% of all IT jobs in the UK were related to Data Engineering. To put this in perspective, 1.2% of all IT jobs were advertised for Web Development.

Interestingly, at the same time, back in 2015, hopes and expectations for Machine Learning were judged to be at a peak. Machine Learning offered a way to put all this data to work.

A new problem to solve: how do we get this into production?

Heath Robinson’s pancake-making machine

As the role of the Data Engineer matured and these experts got to work fixing the data sourcing and processing problems for the Data Scientists, a new issue arose: now that the Data Scientists had access to the data they needed, how could all those machine learning models be deployed into production (i.e. running real parts of the business)?

By 2019 the focus on Machine Learning had moved up a level and onto a separate Gartner Hype Cycle of its own, which considered multiple types of machine learning and AI techniques and use cases: https://twitter.com/kdnuggets/status/1234871536391245824

Enter the Machine Learning Engineer

For a machine learning solution to be valuable to the business, research institute or NGO where it is deployed, it needs to:

Integrate with live data sources
Be reliable, robust and accurate
Actually be usable by other people, and probably by many other people and applications

Ultimately, a machine learning or “AI” solution is just a software product that applies algorithms or maths to some data.

To achieve an integrated, robust and scalable software product, you need source control and automated test frameworks so that changes and updates can be merged safely into releases. This allows a team of people to collaborate on a complete product that goes beyond a concept demonstrated by a data scientist in a Jupyter (IPython) notebook.
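As a rough illustration of what moving from a notebook to tested code can look like, here is a minimal sketch. The function `standardize` and both test functions are hypothetical names chosen for this example, not part of the original article; the tests are written in the plain-assert style that a test runner such as pytest can collect automatically.

```python
import math


def standardize(values):
    """Rescale numbers to zero mean and unit (population) variance."""
    if not values:
        return []
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        # A constant column carries no signal; map it to zeros rather
        # than dividing by zero.
        return [0.0] * len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def test_standardize_centres_and_scales():
    result = standardize([2.0, 4.0, 6.0])
    # The rescaled values should sum to zero and have unit variance.
    assert math.isclose(sum(result), 0.0, abs_tol=1e-9)
    assert math.isclose(sum(r * r for r in result) / len(result), 1.0)


def test_standardize_handles_edge_cases():
    assert standardize([]) == []
    assert standardize([5.0, 5.0]) == [0.0, 0.0]
```

Once logic like this lives in a version-controlled module with tests, a change that breaks the edge cases can be caught automatically before it is merged into a release.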

This also needs to be underpinned by an architecture that can cope with hardware and network failures and scale to meet demand.
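One small building block of such an architecture is retrying an operation that fails transiently. The sketch below is an assumption-laden illustration, not a method from the original article: `call_with_retries` and its parameters are hypothetical, and it treats only `ConnectionError` as a retryable failure.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff.

    The `sleep` parameter is injectable so the waiting can be replaced
    in tests; by default it is time.sleep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

A real deployment would usually combine this with timeouts, a cap on the delay and monitoring, and would lean on the platform (load balancers, orchestration, managed queues) for scaling rather than on application code alone.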

Finally, the nature of the domain means that the application will be heavily data-centric, with more complex data requirements than a typical web application or transactional system. There are likely to be many bulk data aggregation requirements and complex feature engineering steps, perhaps coupled with high-volume data streaming.
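To make "bulk aggregation" and "feature engineering" concrete, here is a minimal sketch that collapses raw events into one feature row per entity. The event shape `(customer_id, amount)` and the function name `aggregate_customer_features` are assumptions made purely for illustration; in production this kind of logic would typically run on a distributed engine rather than in plain Python.

```python
from collections import defaultdict


def aggregate_customer_features(events):
    """Collapse (customer_id, amount) events into one feature row per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, amount in events:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return {
        customer_id: {
            "n_events": counts[customer_id],
            "total_amount": totals[customer_id],
            "mean_amount": totals[customer_id] / counts[customer_id],
        }
        for customer_id in totals
    }


# Hypothetical sample data: three purchase events from two customers.
events = [("a", 10.0), ("b", 5.0), ("a", 30.0)]
features = aggregate_customer_features(events)
```

Because the function consumes any iterable in a single pass, the same shape of logic can be applied to a batch extract or to a window of a stream.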

To achieve this, someone must straddle the disciplines of Data Architecture and Engineering, Data Science and Statistics, and DevOps or Software Engineering. Thus the role of Machine Learning Engineer is born.

A Machine Learning Engineer doesn’t just Build, they also Design

With hybrid knowledge across multiple domains, the machine learning engineer is a critical part of the design process, not just the implementation.

Often, small compromises at the data science and modelling stage can lead to huge efficiencies at the data layer. Understanding the context of the coding approach taken by the data scientist also helps when translating it into production code. Conversely, being able to feed back the constraints of the code deployment platform and the data processing platform, in language that makes sense to the data scientist, allows those constraints to be factored into the analysis and modelling approach.

In this way, the introduction of the machine learning engineer breaks down the interdisciplinary silos and finally leads to the promised land: a machine learning application that delivers value from the data provided by the data engineer and analysed by the data scientist.

Translated from: https://medium.com/swlh/the-rise-of-the-machine-learning-engineer-c04bab0c29e