One question I often hear is: “What skills should I learn to be an effective data scientist?” This comes up in mentoring sessions, 1:1s with team members, Q&A sessions with students, and more. Whether you’re looking to get into the field, or are a data scientist already, it’s a relevant topic, as we all need to continue advancing our skills as we grow in our careers. But what areas should you focus on? While the field of data science has been continuously changing, we’ve put together a framework that has withstood the test of time. In this post, we’ll walk through three key areas to continue advancing your skills. Here are some quick links to each section (note, these links may not work if you’re reading this article on a mobile device):
The data science Venn diagram
Technical skills
Business context
Soft skills
Bringing it all together (the intangibles, the unicorn, the pep talk, and a plan)
Frequently asked questions
The data science Venn diagram
A plethora of Venn diagrams have been used to describe the field of data science. Of course, there’s the original Conway Venn diagram that started it all. Then there’s the “Battle of the data science Venn diagrams,” with a number of variations that others have created over time. In addition to being entertaining, each of these variations offers good points and perspectives. And the variety of perspectives makes sense, since the field of data science is evolving, and it’s also diverse.
We like to summarize the many skills a data scientist needs in three main categories:
Credit: Matt Storey, who developed this durable framework for our team in 2015.
I haven’t drawn this diagram to any scale, so you could debate the relative sizes of these three categories. However, “technical” is likely one of the largest categories, given the many facets it entails, so we’ll start there.
Technical skills
“Technical” covers a broad set of capabilities. In this section, we’ll walk through what each of these technical skill areas means, and how to apply them to your work.
Analytical problem solving: Before you get your hands on any data, you first must understand the problem you’re trying to solve. That perspective is what lets you chart a path forward. Sound judgment in evaluating approaches, choosing the correct formula, and applying the appropriate data points is equally key to a successful result. Without it, you might inadvertently show a ratio in the reverse direction, plug in the wrong data points, or otherwise manipulate the data in a way that isn’t sound. Data science roles often specify quantitative fields of study as preferred academic degrees to encourage this kind of critical thinking.
Statistical concepts and techniques: Statistics is another important skill for ensuring that data is handled correctly. This includes concepts such as probability distributions, confidence intervals, regression analysis, and hypothesis testing. Statistics is one of the many educational backgrounds that align well with data science. It’s important for all data scientists to have enough of a foundation in statistics to develop an intuition for sound approaches, as well as an awareness of the key concepts. The danger is that if you don’t know which additional considerations need to be checked (like sample sizes, distributions, and statistical significance, among others), you might share unverified analysis with a stakeholder and mislead them with the results.
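To make the point about sample sizes and confidence intervals concrete, here is a minimal sketch that reports a mean together with an approximate 95% confidence interval and flags small samples. The function name, the `min_n=30` rule of thumb, and the sample values are illustrative assumptions, and the interval uses a normal approximation rather than a t-distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95, min_n=30):
    """Return the sample mean with a normal-approximation confidence interval.

    `min_n` is a rule-of-thumb threshold (an assumption, not a standard):
    below it, the result is flagged so the reader treats it with caution.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    m = mean(values)
    standard_error = stdev(values) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "mean": m,
        "low": m - z * standard_error,
        "high": m + z * standard_error,
        "small_sample": n < min_n,
    }


# Hypothetical daily sign-up counts, for illustration only.
result = mean_confidence_interval([10, 12, 9, 11, 13, 10, 12, 11])
print(result)
```

Sharing the interval and the small-sample flag alongside the point estimate gives a stakeholder a sense of how much weight the number can bear.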
Languages (SQL, Python, R, and Kusto): Fluency in some programming languages, and the ability to learn others, is key to ensuring that you’re efficient and productive in your role. These languages are used to manipulate data, implement machine learning models, and develop programmatic solutions. In any data science interview, at some point you will be asked questions to confirm that you’re familiar with common querying concepts such as joins, as well as programming syntax in the language of your choice. It’s also helpful to have experience with tools like Jupyter notebooks and RStudio, which are effective environments for creating and sharing documents with live code.
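As an illustration of the join concepts mentioned above, the following sketch contrasts an inner join with a left join using Python’s built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
    """
)

# Inner join: only customers with at least one order appear.
inner_rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
    """
).fetchall()

# Left join: every customer appears; customers without orders get a total of 0.
left_rows = conn.execute(
    """
    SELECT c.name, COALESCE(SUM(o.amount), 0)
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The difference matters in practice: an inner join silently drops customers who never ordered, which can change a denominator without any error being raised.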
Machine learning modeling: Machine learning is one of the areas you hear about most in data science, and it is certainly exciting and powerful. There are a variety of modeling techniques, such as supervised and unsupervised learning, classification and regression, clustering, deep learning, reinforcement learning, and more. These are used in a variety of applications, such as recommender models, natural language processing, segmentation models, forecasting models, and propensity models. These models help us understand current dynamics, predict future outcomes, and recommend user actions. It’s good to gain experience with the various enterprise scenarios that arise, such as handling noisy and/or sparse data, providing users with model explainability, running models in production with MLOps, retraining the models over time, incorporating user feedback into the model, and tracking model performance.
Data preparation and pipeline management: In order to get started, there is a nontrivial first step to prepare the data. Often we’re working with “big data” in real-world enterprise scenarios. We need to identify the relevant datasets, gain permissions, extract the relevant records, and join with the appropriate identifiers to create meaningful connections. Then we pursue our initial data exploration. This includes checking the data quality and completeness, as well as handling outliers and other data cleaning needs. If it’s your first time working with the dataset, you may need to read internal documentation or test out the scenario to verify the data it generates. There may also be cases where the telemetry isn’t available and you must design new instrumentation. If you’re setting up a production model to run on this dataset, you’ll want to architect the pipelines in a way that allows for an automated refresh schedule and live site management. To ensure a stable service, this involves DevOps concepts including anomaly detection, reliability, uptime, and service level agreements. Finally, you must be familiar with policies around data privacy, GDPR, ethics, and security to ensure that the data is handled appropriately.
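The data quality checks described above (completeness and outlier handling) can be sketched in a few lines. This is a simplified illustration, not a production pipeline: the field name `latency_ms`, the sample records, and the conventional 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
from statistics import quantiles


def completeness(records, field):
    """Fraction of records in which `field` is present and not None."""
    if not records:
        return 0.0
    present = sum(1 for record in records if record.get(field) is not None)
    return present / len(records)


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical telemetry records, for illustration only.
records = [
    {"latency_ms": 120}, {"latency_ms": 130}, {"latency_ms": None},
    {"latency_ms": 125}, {"latency_ms": 5000}, {"latency_ms": 118},
    {"latency_ms": 122}, {"latency_ms": 127},
]
observed = [r["latency_ms"] for r in records if r["latency_ms"] is not None]

print("completeness:", completeness(records, "latency_ms"))
print("outliers:", iqr_outliers(observed))
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on knowing how the data was generated, which is why the documentation and scenario testing mentioned above matter.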
Experimentation: Running experiments is key for driving innovation in a data-driven culture. Successive randomized controlled trials can help the team discover drivers and learn which ones are having a material impact on business goals. In order to effectively lead these activities, you must be able to design proper experiments and analyze the results. This includes forming hypotheses, constructing proper control groups, accounting for biases, running statistical tests, and drawing conclusions. Finally, experience running experimentation review councils, applying ethical practices, developing and using experimentation frameworks, managing intersecting experiments with multi-attribution, and reporting experiment results at scale are all relevant on-the-job skills. Experience with causal inference techniques can be useful for impact analysis, as well.
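As one example of the “running statistical tests” step, here is a minimal sketch of a two-sided, two-proportion z-test comparing conversion rates between a control group and a treatment group. The function name and the counts are hypothetical, and the test assumes independent samples that are large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("pooled rate of 0 or 1 leaves no variance to test")
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 1,000 users per group.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value alone does not settle the question: the hypothesis, the control group construction, and the bias checks described above still determine whether the comparison is meaningful.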
Data visualization: When it’s time to present our results, visuals help tell the story. It’s important to communicate to stakeholders in a concise way so that points are consumed and remembered, and data visualization is an effective tool toward this goal. Several best practices help land the message with visuals. These include selecting the optimal chart type, simplifying (by removing unnecessary lines or datapoints), reducing colors, focusing attention on the key points, making text large and readable, aligning to a grid, and more. Here are a few books that we like, which provide practical tips: Storytelling with Data, The Wall Street Journal Guide to Information Graphics, Information Dashboard Design, and The Visual Display of Quantitative Information.
Business context
One of the most exciting aspects of data science is the opportunity to apply data to business scenarios. This includes using data to inform business decisions and developing AI services to enhance the customer experience. But to be successful at these enterprise data science projects, you must understand the business and customer scenarios. In fact, this understanding is useful for any industry job. That is why Starbucks employees do taste tests, Airbnb employees host guests and stay at properties, and Amazon employees visit fulfillment centers. Similarly, in Azure, we sit in with our support team on customer calls, dogfood our product, and visit data centers. Putting yourself closer to the customer creates a new level of awareness and empathy. It also provides context for how different departments need to work together to enable the end-to-end experience. Taking a customer-centric mindset is a good compass to help guide you on any decision, based on what is best for the customer. Finally, this background can help spark ideas for how to make an impact from your role.
While business context is critical for any business role, in this section we’ll cover why it’s particularly important for data science. For a data scientist, business context includes understanding user scenarios, having a close connection with your business stakeholder, and being a subject matter expert in the dataset. Data can be misleading if misused, and one of the many ways to misuse data is to misinterpret the meaning of a field in the database. Therefore, it’s very important to maintain good documentation and understand what the data represents. This context also helps identify data quality issues (by giving a sense of expected bounds), uncover findings (by recognizing interesting trends for the business), and inspire new ideas for what to explore or model next. Understanding the goals for the analysis is often the key differentiator for turning data points into insights, by framing the results in an actionable way.
While your particular application domain will vary, in this section we’ll provide examples from our experience with Microsoft products, to help spark ideas for how to grow the business understanding in your company as well. In Azure, we need to have a technical understanding of the Azure services that customers use (shown below) and the solutions that they build:
https://docs.microsoft.com/en-us/learn/modules/welcome-to-azure/3-tour-of-azure-services
We also need to understand the experience that different audiences have when engaging with Microsoft sites, programs, and services:
So, how do you build that context? Here are a few approaches:
Leverage training materials: Your company probably has plenty of resources for users to learn how to use its products. Those can be great resources for you to study, too. For Azure, this includes Azure.com, Docs, MS Learn, Channel9, Webinars, Quickstart templates, Knowledge center, and more. If you’re learning as a team, schedule a recurring brownbag and nominate team members to research and present topics at each session. You can also watch the executive keynotes and session demos from external tradeshows to stay well-versed with customer scenarios. Internal all-hands meetings, earnings reports, and “ask anything” sessions are additional opportunities to hear from executives on the company and product direction.
Develop a project: Find a project you’re motivated to complete and that involves using the product. This can be a personal project or an initiative that you volunteer to help with at work. Having a specific end goal in mind will force you to work through scenarios and learn more in the end, compared to simply browsing through the learning materials above. To keep yourself accountable, you can also commit to a deadline, such as an event presentation, as a forcing function to prioritize this learning activity.
Listen to customers: Join support calls, events, or message boards to hear what’s top of mind for customers. If you don’t have access to these in your data science organization, ask your business stakeholders about opportunities you can join. The competition and market direction are good aspects to be aware of, too. Think of new ideas and approaches that the team can take to accomplish the strategic goals.
Document your understanding: As you learn, document your understanding so others can also benefit. Creating documentation for user flows and experiences can be a powerful step to align different parts of the organization. It also helps ensure that you’re interpreting the data correctly by clarifying the business process that it represents. Each time you share the draft with another person, you’ll learn a bit more about the way things are actually occurring, and you’ll end up with an artifact that is an accurate representation of the truth. (This improved cross-group understanding is beneficial for your stakeholders as well.) As your company grows and the number of teams contributing to the customer experience increases, shared understanding and written artifacts become even more important. Below is an example from the Azure Marketplace user flow.
Soft skills
Traditionally, the training materials for data science have focused on technical skills. However, at any given point in time, I find that the areas my team members are prioritizing for their career development are pretty evenly split among the three categories introduced earlier (technical, business, and soft skills). More and more, I also see these topics coming up in industry conference sessions on “tips to be a successful data scientist in the enterprise.” It’s great to see the growing acknowledgment for this. I do find that soft skills are a key aspect of an individual’s ability to have a strong impact and growing career path in the organization. So, what are the top soft skills for data scientists?
Communication: As data scientists, if we develop the most innovative solution but no one knows about it, how much impact can it truly have? Scientists must speak, and data scientists are no exception. When we speak, we also need to make sure our message comes across. To land takeaways with a busy executive, the communication needs to be clear and concise. It’s good to have additional details in your “back pocket,” but many can be saved for the Q&A session. We also need to share the facts in a way that accurately conveys the information, why it matters, and what action to take. Data storytelling is a key skill to land this story arc. To learn more, see LinkedIn trainings on Presentation Skills and Public Speaking, enroll in the Coursera course on scientific writing, join Toastmasters, hire a speaking coach (we’ve had a good experience with Richard Klees), and most of all: Practice, practice, practice. For fast results, identify a learning buddy (or your manager) to give feedback after each presentation, including what went well and what you can improve.
Influence: Related to communication is the ability to influence. The data scientist must be able to stand by the numbers, whether they represent “good” or “bad” news. At its best, data science is a close partnership with stakeholder teams. Rather than merely serving data points, the data scientist should bring ideas (based on data insights) regarding the strategic initiative to take on next. Finally, the data scientist should be able to say “no” to lower priority asks and curiosity questions, in order to focus on projects with maximum business impact.
Collaboration: At the same time, data scientists need to be incredibly collaborative, both with business stakeholders and with fellow data science team members. While there are opportunities to deliver results independently, there are also many team projects to partner on. Given our diverse backgrounds, we tend to create environments for team members to gather ideas and perspectives from the broader group. This helps keep our work consistent while ensuring that we’re applying common best practice approaches.
Organization: Good organizational skills are important for everyone’s effectiveness, but in particular for data scientists. There is no shortage of questions we want to answer with data, so it’s important to prioritize one’s backlog. Data scientists need to “cost” and plan their projects, so that others can depend on them to deliver. Documenting project requirements, leveraging work item tracking, and publishing results are all great best practices.
Bringing it all together
While the primary focus of this article is on data scientist “tools,” or skill sets, it’s worth taking a moment to discuss what data scientists do with all these tools. For example, if we defined a ceramic pottery artist by their tools (potter’s wheel, wire cutter, angled knives, shaping tools, sponge, brushes, calipers, kiln, and so on), we would be missing the essence of what they do (create art!), and therefore lack the context of what they use these tools for.
In Ron Sielinski’s earlier post “The role of data scientist,” he quotes the following definition from Jeannette M. Wing: “Data science is the study of extracting value from data.” This “value” (i.e., what data scientists “do”) comes in the form of analytical insights, machine learning models, experimentation results, and more. For more details, see the article for specific data science deliverables by role.
Like the potter who needs to combine learned skills and tools with their own innate artistic style, the data scientist also brings together both art and science.
The intangibles
This brings me to the “intangibles” — the more innate characteristics of effective data scientists that are not listed in any data science master’s curriculum and yet are core traits for those successful in the field.
Curiosity: One of the most fun parts of being a data scientist is uncovering surprises. While we design a product or program with a particular use case in mind, users might find novel ways to take advantage of it, and telemetry data is a path toward discovering the truth. In looking at the data for one thing, we might also notice another trend that turns out to be a powerful insight. But without curiosity, this insight might go unnoticed. Another key trend that curious data scientists notice pertains to data quality, which is key to being able to deliver high quality analysis.
Creativity: In the previous article, Ron noted the importance of creativity for data science as “one of the most valuable skills of a data scientist, but often the least emphasized, and certainly the most difficult to cultivate.” A creative data scientist will come up with ideas for new AI services to improve the customer experience, by “connecting the dots” from the data. Creativity also helps with navigating the inevitable blockers that come up along the way and inspires ways to work through them.
Grit: Having a strong determination and drive for results helps the data scientist to work through challenges. (Some also call this “stick-to-it-iveness.”) For a data scientist, this may include working through data access permissions, finding ways to join disparate datasets, handling noisy data with outliers, driving model performance, navigating experimentation limitations, and handling resource constraints.
Growth mindset: A growth mindset is the belief that you can learn anything you set your mind to. It’s about facing challenges with excitement for the learning opportunity they provide, rather than being discouraged by the risk of failure. For a data scientist, a growth mindset (or “learn-it-all” approach) is key to learning the many skills discussed in this post. It also means you’ll be more open to feedback, which will make your impact that much greater.
Passion: Of course, the more passionate someone is about their work, the better job they’ll do at it. Passionate data scientists are excited about the application of science to business, and want to operate in a data-driven culture (versus relying on opinion). I often see candidates applying to our team who have experienced other methods of decision-making and want to be part of a more structured approach.
The unicorn
After reading this long list, you may feel a bit overwhelmed. If so, you’re not alone. In fact, the term “data scientist unicorn” was coined because it’s so rare to find someone who satisfies all the criteria. Among professions, the data scientist skill set is one of the more diverse. Some joke that the data scientist job description is truly a “wish list.” While each skill is useful, it’s possible to start with a subset while continuing to develop the rest.
In fact, another way to interpret the data science Venn diagram is that it represents a data science team rather than a data science individual. That is, even if each individual doesn’t fulfill all the areas, we can address them by hiring a team with complementary strengths. In this way, putting together a team is like assembling an orchestra. Not every individual needs to be an expert in everything — they just need to work together well.
I recommend doing your own skill assessment. Reflect on what your true superpowers are and find roles that leverage them. At the same time, be self-aware and understand your development opportunities. Then prioritize learning plans based on what will help you be self-sufficient and have the biggest impact with your work.
The pep talk
As a result of having such a long list of requirements, data scientists often experience “imposter syndrome.” This is a feeling that even if you have a data science job or data science achievements, you fear you’re a fraud — that you “got lucky” with those accomplishments — and you don’t truly belong. Imposter syndrome occurs frequently in the tech industry.
Personally, I like to turn this “upside down” and maintain a healthy outlook by remembering that no one knows everything in the world of technology. In fact, the more you know, the more you actually realize that you don’t know. And that’s the beauty of tech. It’s constantly evolving, so we get to continuously research and contribute new ideas. If you’re excited about the opportunity to grow and experience lifelong learning, you will find this motivating, and never be bored doing the same thing over and over again.
Of course, our inner critic is useful at times. It pushes us to achieve more and do better. However, when it reaches too high a level, it can be debilitating. “Inner Critic Inner Success” offers a “devil’s advocate” technique for this. If you think you don’t know about a particular topic, you can use reverse psychology to eke out all the things you do actually know, and then build on that. In the end, the goal is to reach a healthy level of self-awareness.
Plan your work and work your plan
Now that you know where you stand, take some time to reflect on where you want to go next. One tool that you can use to figure this out is a career plan template. There are many versions available online, so feel free to pick one that works for you. The key is to make time for this reflection, and to consider your values, skills, and passions. Think back on an awesome day, and then figure out what excited you most.
Assuming this still leads you down a data science career path, the next step is to put together a learning plan. Pick a few areas to pursue and activities that will get you there. Below are some examples. Pick two to three activities to focus on every three to four months, and then schedule a check-in with your manager to keep yourself accountable. You may find it helpful to designate a specific time in your schedule to tackle these trainings. In our team, we carve out Thursday afternoons for learning and development.
Frequently asked questions
This framework is a useful guide both for individuals interested in entering data science and for those interested in growing within their data science career, at any level. Here are a few questions I often receive:
What are the different roles available in the field of data science?
This article builds upon our earlier articles, which cover this topic:
Designing a data science organization (which describes the types of organizational structures you can join), as well as
The role of a data scientist (which describes the types of data science roles within the organization)
What are some recommended training resources for technical skills?
There are many ways to acquire data science technical skills, ranging from a boot camp to a formal degree (bachelor’s, master’s, or Ph.D.). Of course, the more thorough the program, the more comprehensive the training you will receive. A growing number of data science programs have become available over the past decade. If you’re learning these skills while working (which also gives you the opportunity to practice them on the job), you can take advantage of evening and weekend programs, as well as books and MOOCs (“massive open online courses”).
从训练营到正式学位(学士,硕士或博士),有许多方法可以获取数据科学技术技能。 当然,课程越全面,您将获得的培训越全面。 在过去的十年中,出现了越来越多的数据科学课程。 如果您在工作时正在学习这些技能(这也使您有机会在工作中进行练习),则可以利用晚上和周末的课程以及书籍和MOOC(“大规模在线公开课程”)。
What is the ideal background for a data science role?
数据科学角色的理想背景是什么?
Data science attracts people with a wide variety of backgrounds. While the majority of software engineers have a computer science background, the educational background for data scientists is more evenly divided among math, statistics, physics, economics, engineering, and other applied sciences. Data scientists may have work experience in finance, consulting, database administration, business planning, software engineering, and more. One benefit of bringing together diverse backgrounds is that, because we often receive open-ended questions, we can consider multiple perspectives as we determine our approach. Given that most data science university programs came into being in the past five to ten years, it’s more common to see data science-specific degrees among recent graduates. If you’re interested in learning about the career paths of our current team members, please see our “Faces of data science” series.
数据科学吸引了具有各种背景的人们。 尽管大多数软件工程师都具有计算机科学背景,但数据科学家的教育背景在数学,统计,物理学,经济学,工程学和其他应用科学之间的分配更为平均。 数据科学家可能在财务,咨询,数据库管理,业务计划,软件工程等方面具有工作经验。 集合不同背景的好处之一是,我们经常收到开放性问题,这使我们有机会在确定方法时考虑多种观点。 鉴于大多数数据科学大学课程都是在过去的五到十年内形成的,因此在刚毕业的毕业生中看到特定于数据科学的学位更为普遍。 如果您想了解我们现有团队成员的职业道路,请参阅我们的“数据科学面孔”系列。
I’m interested in switching careers into a data science role. (Or, I’m interested in shifting focus within data science career paths, from analytics to machine learning.) How should I proceed?
我对将职业转变为数据科学职位感兴趣。 (或者,我有兴趣将重点放在从分析到机器学习的数据科学职业道路上。)我应该如何进行?
My top three tips are to learn the skills in this article, get a mentor in the field, and gain experience with hands-on projects. Developing a project has the benefit that you’ll learn more about the capabilities and limitations of the tools and techniques you’re studying by pursuing a specific end goal (as opposed to more theoretical learning). In the process, you’ll gain a better sense of what you know and what you need to learn. Just as important, you’ll also confirm whether you actually enjoy this kind of work! Finally, you’ll have experience to reference and draw from as you interview for future data science roles.
我最重要的三个技巧是学习本文中的技能,获得该领域的指导并获得“动手”项目的经验。 开发项目的好处是,您将通过追求特定的最终目标来了解正在研究的工具和技术的功能和局限性(相对于更多的理论学习而言)。 在此过程中,您将更好地了解自己知道的知识和需要学习的知识。 非常重要的是,您还将确认您是否真的喜欢这种工作! 最后,在面试未来数据科学角色时,您将具有参考和借鉴的经验。
There are a few ways to kick off a project. One option is to find one at work. You can volunteer to help the data science team (and get mentorship from them in the process). Another option is to start a new project, within your current role, that will benefit the business. This provides a way to continue contributing to your team while learning a new skill and starting to position yourself in a new way. Finally, you can also develop a personal project, and share the code on GitHub.
有几种启动项目的方法。 一种选择是在工作中找到一个。 您可以自愿帮助数据科学团队(并在此过程中获得他们的指导)。 另一种选择是在您当前的职位范围内启动一个新项目,这将使业务受益。 这提供了一种在学习新技能并开始以新方式定位自己的同时继续为团队做出贡献的方法。 最后,您还可以开发一个个人项目,并在GitHub上共享代码。
What is the interview process like?
面试过程如何?
Our interview process has a few stages, including a resume review, an initial phone call with a hiring manager or HR recruiter, a technical screen, and finally the “in-person” interview. (Note: The in-person interview is currently held remotely, as of March 2020.) In an article in our “Faces of data science” interview series, three members of our team who joined in early 2020 speak about their interview experiences at Microsoft and offer some tips and perspectives.
我们的面试过程分为几个阶段,包括简历审查,与招聘经理或人力资源招聘人员的初始电话通话,技术筛选,最后是“面对面”面试。 (注:自2020年3月起,面对面面试目前以远程方式进行。)在我们的“数据科学面孔”访谈系列的一篇文章中,2020年初加入我们团队的三名成员讲述了他们在Microsoft的面试经历,并提供了一些技巧和观点。
Where should I apply if I’m interested in a data science role at Microsoft?
如果我对Microsoft的数据科学职位感兴趣,应该在哪里申请?
Please visit our careers site at https://careers.microsoft.com/.
请访问我们的职业网站,网址为https://careers.microsoft.com/ 。
Lisa Cohen is on LinkedIn.
丽莎·科恩(Lisa Cohen)在LinkedIn上。
翻译自: https://medium.com/data-science-at-microsoft/the-data-scientist-toolbelt-985c86e54fd3