Common Patterns for Analyzing Data

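Preface

This is the first English-language article I am sharing as part of the ARTS check-in. The article is long, so I plan to cover the translation of the original and my related notes in two posts. Thanks to the friends abroad in the ARTS check-in group for providing the English original.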

Article title: Common Patterns for Analyzing Data

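Related terminology

data science: a general term for work in data analysis and data mining; it usually draws on two disciplines, statistics and computer science
feature engineering: taking existing data and transforming it so that it carries additional meaning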

Field: data analysis

Dataset source: Kaggle, a site for doing data science projects

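Vocabulary

impart: to give, to inform, to pass on (knowledge)
come in handy for: to be useful for
rated: judged, regarded
slice: a slice; to cut into slices
potential: latent ability or capacity; possible but not yet realized
interactive: acting on or influencing one another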

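Data is always messy. When I was teaching myself machine learning a few months ago, I did not know how to understand the data better. A key step in building an accurate model is a thorough understanding of the data you will be working with.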

Data Scientists spend [the] vast majority of their time by [doing] data preparation, not model optimization. — lorinc

In other words, data scientists spend most of their time on data preparation, not on model optimization.

Describing the dataset with code

Handling null and missing values is an essential step in data preprocessing.
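
As a rough illustration of describing a dataset with code, the sketch below counts missing values per column using only the Python standard library. The rows and the column names (age, fare, cabin) are made up for this example and do not come from any of the Kaggle datasets discussed; None stands in for a missing value.

```python
from collections import Counter

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 22, "fare": 7.25, "cabin": None},
    {"age": None, "fare": 71.28, "cabin": "C85"},
    {"age": 26, "fare": None, "cabin": None},
    {"age": 35, "fare": 53.10, "cabin": "C123"},
]


def missing_counts(records):
    """Count missing (None) values per column."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            counts[column] += int(value is None)
    return dict(counts)


summary = missing_counts(rows)
print(summary)  # {'age': 1, 'fare': 1, 'cabin': 2}
```

A column with many gaps, like cabin in this toy data, is the kind of thing such a summary is meant to surface before any modeling starts.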

In this article, I chose a number of Exploratory Data Analyses (or EDAs) that were made publicly available on Kaggle, a website for data science. These analyses mix interactive code snippets with prose, and can help offer a bird's-eye view of the data or tease out patterns in the data.

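The data in this article comes from Kaggle, a website dedicated to data science, and the analyses can be regarded as exploratory data analysis (EDA). Analyzing the data together with code snippets gives a bird's-eye view of its overall shape.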

I simultaneously looked at feature engineering, a technique for taking existing data and transforming it in such a way as to impart additional meaning (for example, taking a timestamp and pulling out a DAY_OF_WEEK column, which might come in handy for predicting sales in a store).

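I also looked at the article at this address (https://www.quora.com/Does-deep-learning-reduce-the-importance-of-feature-engineering) on feature engineering, which takes existing data and adds further meaning to it. For example, extracting a DAY_OF_WEEK column from a timestamp may come in handy when predicting a store's sales.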
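
The DAY_OF_WEEK idea above can be sketched in a few lines of standard-library Python. The sales records, their field names (timestamp, amount), and the timestamp format are assumptions made for this example only.

```python
from datetime import datetime

# Hypothetical sales records; timestamps and amounts are made up.
sales = [
    {"timestamp": "2019-03-01 09:15:00", "amount": 120.0},
    {"timestamp": "2019-03-02 14:40:00", "amount": 310.5},
    {"timestamp": "2019-03-04 11:05:00", "amount": 95.0},
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def add_day_of_week(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Return copies of the records with a DAY_OF_WEEK column added."""
    enriched_records = []
    for record in records:
        parsed = datetime.strptime(record["timestamp"], fmt)
        day = DAY_NAMES[parsed.weekday()]
        enriched_records.append({**record, "DAY_OF_WEEK": day})
    return enriched_records


enriched = add_day_of_week(sales)
print([row["DAY_OF_WEEK"] for row in enriched])
# ['Friday', 'Saturday', 'Monday']
```

The new column exposes a weekly pattern that a model could not easily read off the raw timestamp string.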

I wanted to look at a variety of different kinds of datasets, so I chose:
Structured Data
NLP (Natural Language)
Image

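I wanted to look at different kinds of datasets, so I chose from the following categories:
Structured data
Natural language processing
Image data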

Feel free to jump ahead to the conclusions below, or read on to dive into the datasets.

Criteria
For each category I chose two competitions where the submission date had passed, and sorted (roughly) by how many teams had submitted.

For each category, I chose two competitions whose submission deadline had passed, sorted roughly by how many teams had submitted.

For each competition I searched for EDA tags, and chose three kernels that were highly rated or well commented. Final scores did not factor in (some EDAs didn’t even submit a score).

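For each competition, I searched by the EDA tag and chose three kernels that were highly rated or well commented. Final scores were not a factor.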

Structured Data

A structured dataset is characterized by spreadsheets containing training and test data. The spreadsheets may contain categorical variables (colors, like green, red, and blue), continuous variables (ages, like 4, 15, and 67) and ordinal variables (educational level, like elementary, high school, college).

Imputation — Filling in missing values in the data
Binning — Combining continuous data into buckets, a form of feature engineering

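Structured data takes the form of spreadsheets split into training data and test data. The data may contain categorical variables (such as colors), continuous variables (such as ages), and ordinal variables (such as education level: elementary, high school, college).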

Binning: grouping continuous data into buckets.
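
Here is a minimal sketch of imputation and binning, assuming a hypothetical list of ages with None marking missing entries. Median imputation is only one of several possible fill strategies, and the cut points 13, 20, and 65 and the bucket labels are arbitrary choices made for illustration.

```python
from bisect import bisect_right
from statistics import median

ages = [4, 15, None, 67, 30, None, 45]  # hypothetical; None = missing


def impute_median(values):
    """Replace missing values with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Assumed cut points: <13 child, 13-19 teen, 20-64 adult, 65+ senior.
EDGES = [13, 20, 65]
LABELS = ["child", "teen", "adult", "senior"]


def bin_value(value, edges=EDGES, labels=LABELS):
    """Map a number to the label of the bucket it falls into."""
    return labels[bisect_right(edges, value)]


filled = impute_median(ages)
binned = [bin_value(age) for age in filled]
print(filled)  # [4, 15, 30, 67, 30, 30, 45]
print(binned)
```

Binning trades precision for robustness: the model sees four coarse groups instead of exact ages, which can help when the exact value is noisy.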

The training spreadsheet has a target column that you’re trying to solve for, which will be missing in the test data. The majority of the EDAs I examined focused on teasing out potential correlations between the target variable and the other columns.

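The training data contains a target column, the column to be predicted, which is absent from the test data. Most of the EDAs focus on potential correlations between the target variable and the other columns.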
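
To make this concrete, here is a small sketch that ranks feature columns by their Pearson correlation with the target. The table, its column names (rooms, age_years, price), and its values are invented for illustration. Pearson correlation is one common choice, not necessarily the measure used in the kernels reviewed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical training table stored column by column.
train = {
    "rooms": [2, 3, 3, 4, 5],
    "age_years": [30, 12, 25, 8, 3],
    "price": [150, 200, 210, 260, 320],  # target column
}
TARGET = "price"

correlations = {
    name: pearson(values, train[TARGET])
    for name, values in train.items()
    if name != TARGET
}

# Strongest relationships first, regardless of sign.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

A strong correlation only flags a column as worth a closer look; it does not show that the column causes the target to change.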

Because you’re mostly looking for correlations between different variables, there are only so many ways you can slice and dice the data. For visualizations, there are more options, but even so, some techniques seem better suited to the task at hand than others, resulting in a lot of similar-looking notebooks.

Where you can really let your imagination run wild is with feature engineering. Each of the authors I looked at had different approaches to feature engineering, whether it was choosing how to bin a feature or combining categorical features into new ones.

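Feature engineering is where you can let your imagination run free. The authors I looked at each took a different approach, whether in choosing how to bin a feature or in combining categorical features into new ones.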
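
One simple way to combine categorical features is to cross two columns into a single new one. The sketch below assumes hypothetical columns sex and ticket_class with made-up values; the helper name cross_features is likewise only illustrative.

```python
from collections import Counter

# Hypothetical rows with two categorical columns.
passengers = [
    {"sex": "female", "ticket_class": "first"},
    {"sex": "male", "ticket_class": "third"},
    {"sex": "female", "ticket_class": "third"},
    {"sex": "male", "ticket_class": "third"},
]


def cross_features(row, columns, sep="_"):
    """Join the values of several categorical columns into one label."""
    return sep.join(str(row[column]) for column in columns)


crossed = [
    {**row, "sex_class": cross_features(row, ("sex", "ticket_class"))}
    for row in passengers
]

# How often each combined category occurs.
counts = Counter(row["sex_class"] for row in crossed)
print(counts.most_common())
```

The crossed column lets a model treat "female_first" and "female_third" as distinct groups, at the cost of more categories with fewer rows each.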
