Statistics and Probability in Data Analysis
Data Science is a hot topic nowadays. Organizations consider data scientists to be the crème de la crème. Everyone in the industry is talking about the potential of data science and what data scientists can bring to their BigTech and FinTech organizations. It's attractive to be a data scientist.
This article will outline everything we need to know to become an expert in the Data Science field.
During tough times, data scientists are needed even more because it's crucial to be able to work on projects that cut costs and generate revenue. Data science projects have become possible mainly because of advanced computing power, data availability, and cheap storage costs.
Anyone from a quantitative analyst, software engineer, business analyst, tester, machine learning engineer, support engineer, or DevOps engineer to a project manager or sales executive can work within the data science field.
Article Aim
These are the five sections of this article:
- Data Scientists — The Magicians
- Essential Skills Of Expert Data Scientists
- Data Science Project Stages
- Data Science Common Pitfalls
- Data Science Best Practices
This article will provide an understanding of what data science is and how it works. I will also outline the skills that we need to become a successful data scientist. Furthermore, the article will outline the 10 stages of a successful data science project. The article will then mention the biggest problems we face in data science model-building projects, along with the best practices that I recommend everyone be familiar with.
I will also highlight the skills that are required to become an expert data scientist. At times, it's difficult to write everything down, but I will attempt to provide all of the information that I recommend to aspiring data scientists.
1. Data Scientists — The Magicians
Expert data scientists can use their advanced programming skills, their superior statistical modeling know-how, domain knowledge, and intuition to come up with project ideas that are inventive in nature, can cut costs, and generate substantial revenue for the business.
All we need to do is search for the term Data Science in Google Trends to see how popular the field is. The interest has grown over time and it is still growing.
Data Science projects can solve problems that were once considered impossible to solve.
Data science-focused solutions have pushed technology to a higher level. We are seeing solutions that attempt to model volatility, prices, and entities' behavior. Many companies have implemented fantastic data science applications that are not limited to FinTech or BigTech.
In my view, the umbrella of data science requires individuals with a diverse set of skills, from development and analysis to DevOps.
2. Essential Skills Of Expert Data Scientists
It’s necessary to have the right foundation to be a successful expert data scientist.
First things first, the subject of data science is a branch of computer science and statistics. Hence it requires acquiring computer science and statistical knowledge. There are no two ways about it. Although more knowledge is always better, we can't learn everything all at once. Some areas are more important than others. This article will only focus on the must-know skills.
In this section, I will aim to outline the required skills. In my view, there are only 4 main skills.
1. Domain Knowledge
Before we can invent a solution, we need to understand the problem. Understanding the problem requires knowing the domain really well.
The must-know skills of this section are:
- Understanding the current problems and solutions within the domain to highlight the inefficiencies
- A solid understanding of how users work
- How revenue is generated and where costs can be saved
- An understanding of who the decision-makers are and what the pain-points are
The users of the domain are the best teachers of that domain. It takes years to become an expert in a domain. And in my view, the best way to understand the domain is to code a data science application for the users that solves the most painful problem within the domain.
2. Storytelling
I can’t emphasise enough that the most important skill for a data scientist is the ability to tell and then sell a story. The must-know skills within story-telling are:
- Being imaginative
- Being open to experimentation
- Confidently articulating ideas clearly and concisely, to the point
- Having the power of persuasion
- Being confident enough to seek guidance from the domain experts
- Presenting the data and methodology
Most importantly, the key ingredient is genuinely believing in your idea.
Therefore, the most important skill is essentially a soft skill. And believe me, given the right training, all of us can master it.
The story should concisely and clearly explain the problem that the proposed data science solution intends to solve. The decision-makers should be able to determine that the problem does exist and that the proposed solution can help them cut costs and generate further revenue.
Without storytelling, it’s nearly impossible to get projects funded by the business, and without projects, data scientists’ skills can end up rotting.
The usual questions that stakeholders ask are around the timelines, which are heavily dependent on the required data. Therefore, it's crucial to help them understand what those required data sets are and where to fetch them from.
During the project lifecycle, the solution is usually presented to several users. Hence one should be able to present their ideas confidently.
Therefore, the key skill is the ability to tell the story of your data science project and pitch it with the right level of detail to the target audience.
3. Technical Skills
We need to be able to prepare a prototype at a minimum and later productionise it into production-quality software.
The must-know technical skills are the ability to:
- Create classes and functions, call external APIs, and have a thorough understanding of data structures
- Expertise in loading and saving data from various sources via programming (SQL + Python)
- Ability to write functions to transform data into the appropriate format (a short sketch follows this list)
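As a minimal illustration of the first and third points, here is a short Python sketch. The Trade class, its fields, and the helper function are hypothetical examples, not part of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trade:
    """A simple record-style data structure."""
    symbol: str
    quantity: int
    price: float


def total_value(trades: List[Trade]) -> float:
    """Transform a collection of records into a single aggregate value."""
    return sum(t.quantity * t.price for t in trades)


trades = [Trade("ABC", 10, 101.5), Trade("XYZ", 5, 55.0)]
print(total_value(trades))  # 1290.0
```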
An understanding of how to design scalable technological solutions using a micro-services architecture, along with knowledge of continuous integration and deployment, is obviously important, but I am intentionally leaving those skills aside for now and concentrating only on the must-know skills. Again, the more knowledge the better, but here let's concentrate on the must-knows only.
Although we can use the R programming language, and I have written articles on R, I recommend and prefer the Python programming language. Python has become the de facto choice for data scientists. Plus, there is huge community support; I have yet to find a question on Python that has not already been answered.
Google Trends shows that the popularity of Python is ever increasing.
Therefore the first key skill within the technology section is to have a solid grip on the Python programming language.
It's a known fact that data scientists spend a large amount of their time designing solutions to gather data. Most of the data is either text-based or numerical in nature. Plus, the data sets are usually unstructured. There is an almost unlimited number of Python packages available and it's near impossible to learn all of them. There are three packages that I always recommend to everyone: pandas, NumPy, and scikit-learn.
NumPy is a widely used package within the data science ecosystem. It is extremely efficient and fast at number crunching. NumPy allows us to deal with arrays and matrices.
It’s important to understand how to:
- Create collections such as arrays and matrices
- Transpose matrices
- Perform statistical calculations and save the results (see the sketch after this list)
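A minimal NumPy sketch touching each of these points; the numbers are made up purely for illustration.

```python
import numpy as np

# Create collections: a one-dimensional array and a 2x3 matrix
prices = np.array([10.0, 10.5, 9.8, 11.2])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Transpose the matrix (2x3 -> 3x2)
transposed = matrix.T

# Perform statistical calculations and save the results
summary = {"mean": prices.mean(), "std": prices.std(), "max": prices.max()}
np.save("prices.npy", prices)  # persist the array to disk for later use

print(transposed.shape, summary)
```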
Finally, as most of our time is spent playing with data, we rely heavily on databases and Excel spreadsheets to reveal the hidden secrets and trends within the data sets.
To find those patterns, I recommend that readers learn the pandas library. Within pandas, it's essential to understand:
- What a DataFrame is
- How to load data into a DataFrame
- How to query a DataFrame
- How to join and filter DataFrames
- How to manipulate dates, fill missing values, and perform statistical calculations (see the sketch after this list)
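A short pandas sketch covering these features; the column names and values are hypothetical.

```python
import pandas as pd

# Load data into DataFrames (in practice this could be pd.read_csv or pd.read_sql)
trades = pd.DataFrame({
    "trade_date": ["2020-01-02", "2020-01-03", "2020-01-03"],
    "symbol": ["ABC", "ABC", "XYZ"],
    "price": [100.0, None, 55.0],
})
symbols = pd.DataFrame({"symbol": ["ABC", "XYZ"], "sector": ["Tech", "Energy"]})

# Manipulate dates and fill the missing values
trades["trade_date"] = pd.to_datetime(trades["trade_date"])
trades["price"] = trades["price"].fillna(trades["price"].mean())

# Join, query, and filter
merged = trades.merge(symbols, on="symbol", how="left")
tech_only = merged[merged["sector"] == "Tech"]

# Perform statistical calculations
print(tech_only["price"].describe())
```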
As long as we can manipulate and transform data into the appropriate structures, we can use Python to call any machine learning library. I recommend everyone read up on the interfaces of the scikit-learn library.
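The scikit-learn estimators share a consistent fit/predict interface; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data stands in for the transformed data sets described above
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=42)

model = LinearRegression()
model.fit(X, y)                      # train the model
predictions = model.predict(X[:5])   # score new observations
print(predictions)
```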
4. Statistics And Probability Skills
Calling the machine learning models is technical in nature, but explaining and enhancing the models requires knowledge of probability and statistics.
Let's not treat machine learning models as black boxes, as we'll have to explain how they work to stakeholders and team members, and improve their accuracy in the future.
The must-know skills within the statistics and probability section are:
- Probability distributions
- Sampling techniques
- Simulation techniques
- Calculation of the moments
- Thorough understanding of accuracy measures and loss functions
- Understanding of regression and Bayesian models
- Time series knowledge
Once we have the data, it's important to understand its underlying characteristics. Therefore, I recommend everyone understand statistics and probability. In particular, the key is to get a good grip on the various probability distributions: what they are, how sampling works, how we can generate or simulate data sets, and how to perform hypothesis analysis.
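A small sketch of these ideas with NumPy and SciPy: sampling from a distribution, computing the moments, and running a simple hypothesis test. The distribution parameters are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Simulate a sample from a normal distribution
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)

# The first four moments: mean, variance, skewness, kurtosis
print(sample.mean(), sample.var(), stats.skew(sample), stats.kurtosis(sample))

# Hypothesis test: is the population mean equal to zero?
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(t_stat, p_value)
```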
Unless we understand the area of statistics, the machine learning models can appear to be a black box. If we understand how statistics and probability work, then we can understand the models and explain them confidently.
To get a solid grip on statistics and probability, two areas deserve focused reading: the key concepts of probability, and statistical inference techniques.
Although knowledge of neural networks, activation functions, copulas, Monte Carlo simulations, and Itô calculus is important, I want to concentrate on the must-know statistical and probability skills.
As we start working on more projects, we can look into advanced/expert-level architecture and programming skills, understand how deep learning models work, and learn how to deploy and monitor data science applications, but it's necessary to get the foundations right first.
3. Data Science Project Stages
Let’s start by understanding what a successful data science project entails.
We can slice and dice a data science project in a million ways but in a nutshell, there are 10 key stages.
I will explain each of the stages below.
1. Understanding The Problem
This is the first stage of a data science project. It requires acquiring an understanding of the target business domain.
It requires data scientists to communicate with business owners, analysts, and team members. The current processes in the domain are understood first, to discover whether there are any inefficiencies and whether solutions already exist within the domain. The key is to present the problem and the solution to the users early on.
2. Articulating The Solutions
The chosen solution(s) are then presented clearly to the decision-makers. The inputs, outputs, interactions, and cost of the project are decided.
This skill at times takes a long time to master because it requires sales and analysis skills. One needs to understand the domain extremely well and sell the idea to the decision-makers.
The key is to evaluate the conceptual soundness, and to write down and agree on all of the assumptions and benchmark results. Always choose a benchmark process; it could be the current process, and the aim is to produce a solution that is superior to it. Benchmark processes are usually known and understood by the users. Always keep a record of how the benchmark solution works, as it will help in comparing the new solutions.
It's important to mention the assumptions, note them down, and get user validation, as the output of the application will depend on these assumptions.
3. Data Exploration
Now that the problem statement is determined, this step requires exploring the data that is required for the project. It involves finding the source of the data along with the required quantity, such as the number of records or the timeline for the historical data, and where we can source it from.
Consequently, it requires business and quantitative analytical skills.
4. Data Gathering
This stage is all about interfacing with the data teams and sourcing the required data. It requires building tools, using advanced programming techniques, to gather the data and save it into a repository such as files or databases. The data can be extracted via web service calls, databases, files, etc.
This requires the technical skills to call APIs and save the results in the appropriate data structures. I also recommend everyone get a solid understanding of SQL SELECT statements.
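A minimal sketch of this stage, assuming a hypothetical JSON web service and using a local SQLite database as the repository. The URL, response shape, and field names are made up.

```python
import sqlite3

import requests

# Extract data via a (hypothetical) web service call
response = requests.get("https://example.com/api/prices", timeout=10)
records = response.json()  # assumed to return a list of {"symbol": ..., "price": ...}

# Save the data into a repository, here a local SQLite database
conn = sqlite3.connect("prices.db")
conn.execute("CREATE TABLE IF NOT EXISTS prices (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [(r["symbol"], r["price"]) for r in records])
conn.commit()

# A SELECT statement to read the gathered data back
rows = conn.execute("SELECT symbol, AVG(price) FROM prices GROUP BY symbol").fetchall()
print(rows)
```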
5. Data Transformation
This stage requires going over each feature and understanding how we need to fill the missing values, which cleaning steps we need to perform, whether we need to manipulate the dates, whether we need to standardise or normalise the data, and/or whether to create new features from the existing features.
This stage requires an understanding of the statistical probability distributions of the data.
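A short sketch of typical transformation steps on a made-up DataFrame: filling missing values, manipulating dates, deriving a new feature, and standardising a column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "trade_date": ["2020-01-02", "2020-02-03", None],
    "amount": [100.0, None, 250.0],
})

# Fill missing values and manipulate dates
df["amount"] = df["amount"].fillna(df["amount"].median())
df["trade_date"] = pd.to_datetime(df["trade_date"])
df["month"] = df["trade_date"].dt.month  # a new feature created from an existing one

# Standardise a numerical feature
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
print(df)
```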
6. Model Selection and Evaluation
This stage requires feeding the data to the statistical machine learning models so that they can interpret the data, train themselves, and output the results. Then we need to tune the parameters of the selected models to obtain the highest accuracy. This stage requires an understanding of statistical models and accuracy measures. There are a large number of models available, each with its benefits and drawbacks.
Most of the machine learning models have been implemented in Python and are publicly available.
The key is to be able to know which model to call, with what parameters, and how to measure its accuracy.
It’s crucial to split the input data into three parts:
- Train — used to train the model
- Test — used to test the accuracy of the trained model
- Validation — used to check the model's accuracy once the hyper-parameters have been fine-tuned (see the sketch after this list)
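A common way to carve out the three sets with scikit-learn; the 60/20/20 proportions below are just an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# First hold out 20% as the validation set,
# then split the remainder into train and test sets.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% of the total

print(len(X_train), len(X_test), len(X_val))  # 600 200 200
```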
It is also advisable to first choose a simple model that can give you your expected results. This model is known as the benchmark model. Regression models are traditionally good benchmark models.
Regression models are simple and can reveal errors in your data sets at earlier stages. Remember that overfitting is a problem, and the more complex the models, the harder it is to explain the outcomes to the users. Therefore, always look into simpler models first and then apply regularisation to prevent over-fitting. We should also utilise boosting and bagging techniques, as they can re-weight observations and improve the model's predictive ability. Only once we have exhausted the simple models should we look into advanced models, such as deep learning models.
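As an illustration of starting simple, here is a plain linear benchmark next to a regularised (Ridge) model on the same synthetic data; the alpha value is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

benchmark = LinearRegression().fit(X_train, y_train)
regularised = Ridge(alpha=1.0).fit(X_train, y_train)  # L2 penalty to curb over-fitting

for name, model in [("benchmark", benchmark), ("ridge", regularised)]:
    print(name, mean_squared_error(y_test, model.predict(X_test)))
```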
7. Testing The Application
This stage requires testing the current code to ensure it works as expected, in order to eliminate any model risk. We need a good understanding of testing and DevOps skills to be able to implement continuous-integration build plans that run with every check-in.
The build plans should perform a check-out of the code, run all of the tests, prepare a code coverage report, and produce the required artifacts. Our tests should involve feeding the application unexpected data too. We should stress-test and performance-test the application and ensure all of the integration points work. The key is to build unit, integration, and smoke tests into your solution. I also recommend building a behaviour-driven testing framework, which ensures that the application is built as per the user requirements. I have used the behave Python package and I recommend it to everyone.
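A tiny behaviour-driven sketch with behave; the feature text and step wording are illustrative, and behave expects the usual features/ and features/steps/ layout.

```python
# features/steps/total_value_steps.py
# Matching feature (features/total_value.feature):
#   Scenario: Summing trade amounts
#     Given a list of trade amounts 10,20,30
#     When the total value is calculated
#     Then the result should be 60
from behave import given, then, when


@given("a list of trade amounts {amounts}")
def step_given_amounts(context, amounts):
    context.amounts = [float(a) for a in amounts.split(",")]


@when("the total value is calculated")
def step_when_total(context):
    context.total = sum(context.amounts)


@then("the result should be {expected}")
def step_then_result(context, expected):
    assert abs(context.total - float(expected)) < 1e-9
```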
8. Deploying And Monitoring The Application
This stage involves deploying the code to an environment where the users can test and use the application. The process needs to run without any human intervention. This involves DevOps skills to implement continuous deployment. It should take the artifacts and deploy the solution. The process should run a smoke test and notify the users that the application is deployed successfully so that everything is automated.
We are often required to use a microservices architecture within containers to deploy the solution horizontally across servers.
Once the solution is up, we need to monitor it to ensure it’s up and running without any issues. We should log all of the errors and implement heart-beat endpoints that we can call to ensure that the application is running successfully.
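One way to expose a heart-beat endpoint is with a small Flask app; Flask is just one option here, and the port is arbitrary.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")
def health():
    """A heart-beat endpoint that monitoring tools can poll."""
    return jsonify(status="up"), 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```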
Data science projects are not one-off projects. They require continuous monitoring of the acquired data sets, problems, and solutions to ensure that the current system works as desired.
9. Application Results Presentation
We are now ready to present the results. We might want to build a web user interface, or use reporting tools such as Tableau, among other tools, to present the charts and data sets to the users.
This comes back to domain knowledge and storytelling skills.
10. Backtesting The Application
Lastly, it’s crucial to back-test the application. Once the model is live, we want to ensure that it is always working as we expect it to work. One way of validating the model is to feed it historical data and validate the results quantitatively and qualitatively.
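A minimal back-testing sketch: replay historical inputs through the trained model and count how many predictions breach an agreed error threshold. The threshold and variable names are hypothetical.

```python
import numpy as np


def backtest(model, historical_X, historical_y, threshold=10.0):
    """Count exceptions where the model's absolute error breaches the threshold."""
    predictions = model.predict(historical_X)
    errors = np.abs(predictions - historical_y)
    exceptions = int((errors > threshold).sum())
    return exceptions, float(errors.mean())


# Usage (assuming `model` was trained earlier and historical data is available):
# exceptions, mean_error = backtest(model, X_hist, y_hist)
```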
The current project becomes the benchmark for the next phases. The next phases might involve changing the data sets or models, or just fine-tuning the hyper-parameters.
It's rare to find one person who can work on all of the stages independently, but these super-stars do exist in the industry, and they are the expert data scientists. It takes many years of hard work to gain all of the required skills, and yes, it is possible with the right projects and enough time. It is absolutely fine to be more confident in one stage than another, but the key is to have a good enough understanding of each of the stages.
4. Data Science Common Pitfalls
Whilst working on data science projects, it’s important to remember common pitfalls.
This section will provide an overview of the pitfalls.
Bad input data
Ensure your input data is of good quality; otherwise, you will end up spending a lot of time producing a solution that won't benefit the users.
Bad parameters
A model is essentially a set of functions. The parameters of the functions are often calibrated and decided based on intuition and knowledge. It's important to ensure that the parameters are right and that the data passed to the parameters is good.
Wrong assumptions
It is important to get the assumptions about the data, the model, and its parameters verified. Wrong assumptions can end up wasting a lot of time and resources.
The wrong choice of models
Sometimes we choose the wrong model yet feed it the right data set to solve the right problem. Unsurprisingly, the application produces invalid results. Some models are only good at solving particular problems. Therefore, do ensure that the chosen model is appropriate for the problem you are attempting to solve.
Programming errors
Data science projects require a lot of coding at times. Errors can be introduced into the projects through incorrect mappings, data structures, functions, and general coding bugs. The key is to build unit, integration, and smoke tests into your solution. I also recommend building a behaviour-driven testing framework.
A software engineer, analyst, manager, or anyone else can become an expert data scientist as long as they continuously work on the data science stages.
5. Data Science Best Practices
Lastly, I wanted to outline the best practices of data science projects.
- Be aware of the pros and cons of the models. Not all models can solve all of the problems. If you have chosen smoothing techniques, then ensure you understand the decay factors and whether the methodology is valid. Always choose a simple benchmark model and present the results to the end-users early on. Plus, split the data into train, test, and validation sets.
- Always build a benchmark model and show the results to the domain experts. Test the project on expected inputs to ensure the results are as expected. This will clarify most of the issues with the assumptions and input data sets.
- Test the project on unexpected inputs to ensure the results are not unexpected. Stress-testing the projects is extremely important.
- Performance-test the application. It is essential to ensure it can handle large data sets and requests. Build CI/CD plans at the start.
- Always backtest your model on historical data. It's important to feed in the historical data to count the number of exceptions that are encountered. As history does not repeat itself, the key is to remember that backtesting can validate the assumptions which were set when the project was implemented. Hence, always backtest the model.
- Document and evaluate the soundness of the proposed methodology and monitor the application. Ensure the business users of the domain are involved in the earlier stages and throughout the various stages of the project. Constantly monitor the application to ensure that it's responsive and working as expected.
Summary
Data science is an extremely popular subject nowadays. This article outlined the skills we need to acquire to become a successful expert data scientist. It then provided an overview of the biggest problems we face in data science model-building projects along with the best practices.
With time, we can go further, such as running the code in parallel, building containers to launch the applications, or simulating data for forecasting models. However, the key point here was to outline the must-know skills.
Let me know what your thoughts are.
Translated from: https://towardsdatascience.com/understanding-statistics-and-probability-becoming-an-expert-data-scientist-b178e4175642