The Data Science Interview Blueprint

1. Organisation is Key

I’ve interviewed at Google (and DeepMind), Uber, Facebook and Amazon for roles that fall under the “Data Scientist” umbrella, and this is the typical interview structure I’ve observed:

  1. Software Engineering
  2. Applied Statistics
  3. Machine Learning
  4. Data Wrangling, Manipulation and Visualisation

Now nobody is expecting some super graduate level competency in all of these topics, but you need to know enough to convince your interviewer that you’re capable of delivering if they offered you the job. How much you need to know depends on the job spec, but in this increasingly competitive market, no knowledge is lost.

I recommend using Notion to organise your job prep. It’s extremely versatile, and enables you to utilise the Spaced Repetition and Active Recall principles to nail down learning and deploying key topics that come up time and time again in a Data Scientist interview. Ali Abdaal has a great tutorial on note taking with Notion to maximise your learning potential during the interview process.

I used to run through my Notion notes over and over, but in particular, right before my interview. This ensured that key topics and definitions were loaded into my working memory and I didn’t waste precious time “ummmmmm”ing when hit with some question.

2. Software Engineering

Not all Data Scientist roles will grill you on the time complexity of an algorithm, but all of these roles will expect you to write code. Data Science isn’t one job, but a collection of jobs that attracts talent from a variety of industries, including the software engineering world. As such you’re competing with guys that know the ins and outs of writing efficient code and I would recommend spending at least 1–2 hours a day in the lead-up to your interview practicing the following concepts:

  1. Arrays
  2. Hash Tables
  3. Linked Lists
  4. Two-Pointer based algorithms
  5. String algorithms (interviewers LOVE these)
  6. Binary Search
  7. Divide and Conquer Algorithms
  8. Sorting Algorithms
  9. Dynamic Programming
  10. Recursion

DO NOT LEARN THE ALGORITHMS OFF BY HEART. This approach is useless, because the interviewer can question you on any variation of the algorithm and you will be lost. Instead, learn the strategy behind how each algorithm works. Learn what time and space complexity are, and learn why they are so fundamental to building efficient code.

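To make the “learn the strategy, not the code” point concrete, here is a minimal sketch of the two-pointer pattern from the list above, applied to the classic two-sum-on-a-sorted-array problem (the function name and inputs are my own illustration, not from any particular interview):

```python
def two_sum_sorted(nums, target):
    """Return indices (i, j) with nums[i] + nums[j] == target, or None.

    Assumes `nums` is sorted ascending. One pointer starts at each end,
    and each step moves exactly one pointer, so this is O(n) time and
    O(1) space versus the O(n^2) nested-loop approach.
    """
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        s = nums[lo] + nums[hi]
        if s == target:
            return lo, hi
        if s < target:
            lo += 1   # need a bigger sum: advance the left pointer
        else:
            hi -= 1   # need a smaller sum: retreat the right pointer
    return None
```

Being able to explain why each pointer moves the way it does is exactly the kind of reasoning interviewers probe for.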
LeetCode was my best friend during interview preparation and is well worth the $35 per month in my opinion. Your interviewers only have so many algorithm questions to sample from, and this website covers a host of algorithm concepts including companies that are likely or are known to have asked these questions in the past. There’s also a great community who discuss each problem in detail, and helped me during the myriad of “stuck” moments I encountered. LeetCode has a “lite” version with a smaller question bank if the $35 price tag is too steep, as do HackerRank and geeksforgeeks which are other great resources.

What you should do is attempt each question, even if it’s a brute-force approach that takes ages to run. Then look at the model solution and try to figure out what the optimal strategy is. Then read up on that strategy and try to understand why it is optimal. Ask yourself questions like “why is Quicksort O(n²) in the worst case?” and “why do two pointers and one for loop make more sense than three for loops?”

3. Applied Statistics

Data science has an implicit dependence on applied statistics, and how deep that dependence runs depends on the role you’ve applied for. Where do we use applied statistics? It pops up just about anywhere we need to organise, interpret and derive insights from data.

I studied the following topics intensely during my interviews, and you bet your bottom dollar that I was grilled about each topic:

  1. Descriptive statistics (What distribution does my data follow, what are the modes of the distribution, the expectation, the variance)
  2. Probability theory (Given my data follows a Binomial distribution, what is the probability of observing 5 paying customers in 10 click-through events)
  3. Hypothesis testing (forming the basis of any question on A/B testing, t-tests, ANOVA, chi-squared tests, etc.)
  4. Regression (Is the relationship between my variables linear, what are potential sources of bias, what are the assumptions behind the ordinary least squares solution)
  5. Bayesian inference (What are some advantages/disadvantages vs frequentist methods)

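The probability question in item 2 above reduces to the Binomial pmf. Assuming a hypothetical 30% conversion rate per click-through (the rate is my own made-up number for illustration), it can be computed directly:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 paying customers in 10 click-through events,
# assuming a hypothetical 30% conversion rate per click.
prob = binom_pmf(5, 10, 0.3)  # roughly 0.103
```

In an interview you should also be able to derive the pmf itself, not just plug numbers into it.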
If you think this is a lot of material, you are not alone: I was massively overwhelmed by the volume of knowledge expected in these kinds of interviews and the plethora of information on the internet that could help me. Two invaluable resources stood out when I was revising for interviews.

  1. Introduction to Probability and Statistics, an open course on everything listed above, including questions and an exam to help you test your knowledge.

  2. Machine Learning: A Bayesian and Optimization Perspective by Sergios Theodoridis. This is more a machine learning text than a specific primer on applied statistics, but the linear algebra approaches outlined here really help drive home the key statistical concepts on regression.

The way you’re going to remember this stuff isn’t through memorisation, you need to solve as many problems as you can get your hands on. Glassdoor is a great repo for the sorts of applied stats questions typically asked in interviews. The most challenging interview I had by far was with G-Research, but I really enjoyed studying for the exam, and their sample exam papers were fantastic resources when it came to testing how far I was getting in my applied statistics revision.

4. Machine Learning

Now we come to the beast, the buzzword of our millennial era, and a topic so broad that it can be easy to get so lost in revision that you want to give up.

The applied statistics part of this study guide will give you a very very strong foundation to get started with machine learning (which is basically just applied applied statistics written in fancy linear algebra), but there are certain key concepts that came up over and over again during my interviews. Here is a (by no means exhaustive) set of concepts organised by topic:

Metrics — Classification

  1. Confusion Matrices, Accuracy, Precision, Recall, Sensitivity
  2. F1 Score
  3. TPR, TNR, FPR, FNR
  4. Type I and Type II errors
  5. AUC-ROC Curves

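All of these classification metrics are derived from the four confusion-matrix counts, so it’s worth being able to write them down from scratch (the counts below are invented for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Headline classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity / TPR
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```
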
Metrics — Regression

  1. Total sum of squares, explained sum of squares, residual sum of squares
  2. Coefficient of determination and its adjusted form
  3. AIC and BIC
  4. Advantages and disadvantages of RMSE, MSE, MAE, MAPE

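The first two items reduce to a handful of sums. A quick sketch of R² and its adjusted form in pure Python (the toy numbers in the comments are my own):

```python
def r_squared(y, y_hat, n_features=None):
    """Coefficient of determination; adjusted form if n_features given."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total SS
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual SS
    r2 = 1 - ss_res / ss_tot
    if n_features is None:
        return r2
    n = len(y)
    # The adjusted form penalises features that don't earn their keep.
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
```
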
Bias-Variance Tradeoff, Over/Under-Fitting

  1. K Nearest Neighbours algorithm and the choice of k in the bias-variance trade-off
  2. Random Forests
  3. The asymptotic property
  4. Curse of dimensionality

Model Selection

  1. K-Fold Cross Validation
  2. L1 and L2 Regularisation
  3. Bayesian Optimization

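K-fold cross validation is easy to demystify by writing the index bookkeeping yourself rather than reaching straight for a library. A simplified, non-shuffling sketch:

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous (train, test) index pairs.

    A simplified, non-shuffling version of what libraries do for you:
    every observation appears in exactly one test fold.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```

In practice you would shuffle (or stratify) before splitting; the point here is seeing why each observation gets scored exactly once.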
Sampling

  1. Dealing with class imbalance when training classification models
  2. SMOTE for generating pseudo-observations for an underrepresented class
  3. Class imbalance in the independent variables
  4. Sampling methods
  5. Sources of sampling bias
  6. Measuring sampling error

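A naive baseline for item 1 is random oversampling of the minority class; SMOTE (item 2) improves on this by interpolating synthetic points between nearest neighbours rather than copying rows verbatim. A sketch of the naive version (the data shapes are my own illustration):

```python
import random

def oversample_minority(X, y, seed=0):
    """Resample minority classes with replacement until all class
    counts match the majority class. A crude but common baseline."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        rows = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows)
        y_out.extend([label] * target)
    return X_out, y_out
```

Duplicating rows inflates the apparent sample size without adding information, which is exactly the shortcoming SMOTE tries to address.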
Hypothesis Testing

This really comes under applied statistics, but I cannot stress enough the importance of learning about statistical power. It’s enormously important in A/B testing.

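As a sketch of what statistical power means in an A/B test: assuming a two-sample proportion z-test with equal arm sizes and a two-sided 5% significance level (hence the 1.96), an approximate power calculation looks like this (this is the normal-approximation formula, not an exact test):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def ab_test_power(p1, p2, n, z_alpha=1.96):
    """Approximate power of a two-sample proportion z-test with n users
    per arm; z_alpha=1.96 corresponds to a two-sided 5% level."""
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    effect = abs(p2 - p1) / se
    return normal_cdf(effect - z_alpha)
```

Playing with `n` here makes the usual interview talking point tangible: small effects need large samples before the test has any realistic chance of detecting them.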
Regression Models

Ordinary Linear Regression, its assumptions, estimator derivation and limitations are covered in significant detail in the sources cited in the applied statistics section. Other regression models you should be familiar with are:

  1. Deep Neural Networks for Regression
  2. Random Forest Regression
  3. XGBoost Regression
  4. Time Series Regression (ARIMA/SARIMA)
  5. Bayesian Linear Regression
  6. Gaussian Process Regression

Clustering Algorithms

  1. K-Means
  2. Hierarchical Clustering
  3. Dirichlet Process Mixture Models

Classification Models

  1. Logistic Regression (the most important one, revise it well)
  2. Multiple Regression
  3. XGBoost Classification
  4. Support Vector Machines

It’s a lot, but much of the content will be trivial if your applied statistics foundation is strong enough. I would recommend knowing the ins and outs of at least three different classification/regression/clustering methods, because the interviewer can always ask (and previously has asked) “what other methods could we have used, and what are their advantages/disadvantages?” This is a small subset of the machine learning knowledge out there, but if you know these important examples, the interviews will flow a lot more smoothly.

5. Data Manipulation and Visualisation

“What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms”?

Given a new dataset, the first thing you’ll need to prove is that you can perform an exploratory data analysis (EDA). Before you learn anything else, realise that there is one path to success in data wrangling: Pandas. The Pandas library, when used correctly, is the most powerful tool in a data scientist’s toolbox. The best way to learn how to use Pandas for data manipulation is to download many, many datasets and learn how to do the following set of tasks as confidently as you make your morning cup of coffee.

One of my interviews involved downloading a dataset, cleaning it, visualising it, performing feature selection, building and evaluating a model all in one hour. It was a crazy hard task, and I felt overwhelmed at times, but I made sure I had practiced building model pipelines for weeks before actually attempting the interview, so I knew I could find my way if I got lost.

Advice: The only way to get good at all this is to practice, and the Kaggle community has an incredible wealth of knowledge on mastering EDAs and model pipeline building. I would check out some of the top ranking notebooks on some of the projects out there. Download some example datasets and build your own notebooks, get familiar with the Pandas syntax.

Data Organisation

There are three sure things in life: death, taxes, and getting asked to merge datasets and then perform groupby and apply tasks on said merged datasets. Pandas is INCREDIBLY versatile at this, so please practice, practice, practice.

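A minimal example of the merge-then-groupby drill (the table names and values here are invented for illustration):

```python
import pandas as pd

# Two toy tables of the kind an interviewer might hand you:
# users and their orders.
users = pd.DataFrame({"user_id": [1, 2, 3],
                      "country": ["UK", "US", "UK"]})
orders = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3],
                       "amount": [10.0, 20.0, 5.0, 7.0, 8.0, 9.0]})

# Left-join orders onto users, then aggregate spend per country.
merged = orders.merge(users, on="user_id", how="left")
spend = (merged.groupby("country")["amount"]
               .agg(["sum", "mean", "count"])
               .reset_index())
```

Knowing when to reach for `merge` vs `join` vs `concat`, and what the `how` parameter does to unmatched rows, is exactly the sort of thing that gets probed.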
Data Profiling

This involves getting a feel for the “meta” characteristics of the dataset, such as the shape and description of numerical, categorical and date-time features in the data. You should always be seeking to address a set of questions like “how many observations do I have”, “what does the distribution of each feature look like”, “what do the features mean”. This kind of profiling early on can help you reject non-relevant features from the outset, such as categorical features with thousands of levels (names, unique identifiers) and mean less work for you and your machine later on (work smart, not hard, or something woke like that).

Data Visualisation

Here you are asking yourself “what does the distribution of my features even look like?”. A word of advice: if you didn’t learn about boxplots in the applied statistics part of the study guide, this is where I stress that you learn about them, because you need to be able to identify outliers visually; we can discuss how to deal with them later on. Histograms and kernel density estimation plots are extremely useful tools when looking at the properties of each feature’s distribution.

We can then ask “what does the relationship between my features look like”, in which case Python has a package called seaborn containing very nifty tools like pairplot and a visually satisfying heatmap for correlation plots.

Handling Null Values, Syntax Errors and Duplicate Rows/Columns

Missing values are a sure thing in any dataset, and arise due to a multitude of different factors, each contributing to bias in their own unique way. There is a whole field of study on how best to deal with missing values (and I once had an interview where I was expected to know individual methods for missing value imputation in much detail). Check out this primer on ways of handling null values.

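As a small illustration of the simplest imputation strategies (median for numeric, mode for categorical; the dataframe is invented, and real work often warrants model-based or multiple imputation instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, np.nan],
                   "city": ["London", "Paris", None, "Paris"]})

# Count missing values per column before deciding on a strategy.
null_counts = df.isna().sum()

# Crude but common imputation choices: median for numeric columns,
# mode for categorical columns.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

Be ready to explain what bias each choice introduces, and when dropping rows is the safer call.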
Syntax errors typically arise when our dataset contains information that has been manually input, such as through a form. This can lead us to erroneously conclude that a categorical feature has many more levels than are actually present, because “Hot”, “hOt” and “hot\n” are all considered unique levels. Check out this primer on handling dirty text data.

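A first-pass fix for this kind of dirty categorical data is a cheap normalisation before counting levels:

```python
def normalise_level(value):
    # Collapse casing and surrounding whitespace (including stray
    # newlines) so "Hot", "hOt" and "hot\n" all map to one level.
    return value.strip().lower()

raw = ["Hot", "hOt", "hot\n", " Cold "]
levels = {normalise_level(v) for v in raw}  # {"hot", "cold"}
```
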
Finally, duplicate columns are of no use to anyone, and having duplicate rows could lead to overrepresentation bias, so it’s worth dealing with them early on.

Standardisation or Normalisation

Depending on the dataset you’re working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that different scales of different variables don’t negatively impact the performance of your model.

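The two transforms are a few lines each, which makes them easy whiteboard material (population standard deviation is used here; the sample version is an equally defensible choice):

```python
def standardise(xs):
    # Z-score: zero mean, unit (population) standard deviation.
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

def min_max_normalise(xs):
    # Rescale to the [0, 1] interval.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```

The interview follow-up is usually why: distance-based methods like KNN and K-Means, and gradient-based optimisers, are sensitive to feature scale, while tree-based models largely are not.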
There’s a lot to go through here, but honestly it wasn’t so much the “memorise everything” mentality that helped me as the confidence that came from learning as much as I could. I must have failed many interviews before the formula “clicked”, when I realised that none of these things are esoteric concepts that only the elite can master; they’re just tools that you use to build incredible models and derive insights from data.

Best of luck on your job quest, guys. If you need any help at all, please let me know and I will answer emails/questions when I can.

Translated from: https://towardsdatascience.com/the-data-science-interview-blueprint-75d69c92516c
