Common Patterns for Analyzing Data



Source article: Common Patterns for Analyzing Data


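data science: a general term for work related to data analysis and data mining, usually drawing on both statistics and computer science
feature engineering: transforming existing data to give it additional meaning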


Dataset source: Kaggle, a website for data science projects


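impart: to give, tell, or pass on
come in handy for: to be useful for
rated: judged or regarded
slice: to cut into portions; a portion cut from a whole
potential: possible or latent; latent ability
interactive: acting on or influencing each other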


Data Scientists spend [the] vast majority of their time by [doing] data preparation, not model optimization. — lorinc




In this article, I chose a number of Exploratory Data Analyses (or EDAs) that were made publicly available on Kaggle, a website for data science. These analyses mix interactive code snippets with prose, and can offer a bird's-eye view of the data or tease out patterns in it.


I simultaneously looked at feature engineering, a technique for taking existing data and transforming it in such a way as to impart additional meaning (for example, taking a timestamp and pulling out a DAY_OF_WEEK column, which might come in handy for predicting sales in a store).
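
As a minimal sketch of the timestamp example above, the snippet below derives a DAY_OF_WEEK column from a timestamp using only the Python standard library. The `add_day_of_week` helper, the `timestamp` and `amount` column names, and the sample rows are assumptions made for illustration; they do not come from any of the Kaggle kernels.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_day_of_week(rows):
    """Return copies of the rows with a DAY_OF_WEEK column added.

    Each row is a dict with an ISO-formatted 'timestamp' string
    (a hypothetical column name chosen for this sketch).
    """
    enriched = []
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        # weekday() returns 0 for Monday through 6 for Sunday.
        enriched.append({**row, "DAY_OF_WEEK": DAY_NAMES[ts.weekday()]})
    return enriched


# Made-up sales records, for illustration only.
sales = [
    {"timestamp": "2018-03-02 14:30:00", "amount": 120.0},
    {"timestamp": "2018-03-04 09:15:00", "amount": 310.5},
]
enriched = add_day_of_week(sales)
```

The original timestamp is kept alongside the new column, so a model can use either one.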


I wanted to look at a variety of different kinds of datasets, so I chose:
Structured Data
NLP (Natural Language)


Feel free to jump ahead to the conclusions below, or read on to dive into the datasets.

For each category I chose two competitions where the submission date had passed, sorting (roughly) by the number of teams that had submitted.


For each competition I searched for EDA tags, and chose three kernels that were highly rated or well commented. Final scores did not factor in (some EDAs didn’t even submit a score).


Structured Data

A structured dataset is characterized by spreadsheets containing training and test data. The spreadsheets may contain categorical variables (colors, like green, red, and blue), continuous variables (ages, like 4, 15, and 67) and ordinal variables (educational level, like elementary, high school, college).
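
The difference between these variable types matters once they have to be turned into numbers. The sketch below shows one common way to treat them, assuming an ordinal column is mapped to its rank and a categorical column is one-hot encoded. The helper names `encode_ordinal` and `one_hot` and the `is_` prefix are hypothetical, chosen only for this example.

```python
# Ordinal values have a meaningful order, so they can be mapped to ranks.
EDUCATION_ORDER = ["elementary", "high school", "college"]


def encode_ordinal(value, order=EDUCATION_ORDER):
    """Map an ordinal value to its position in the given ordering."""
    return order.index(value)


def one_hot(value, categories):
    """Expand an unordered categorical value into 0/1 indicator columns."""
    return {f"is_{c}": int(value == c) for c in categories}


education_rank = encode_ordinal("college")
color_flags = one_hot("red", ["green", "red", "blue"])
```

Continuous variables such as age can usually be used as they are, or binned as described below.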

Imputation — Filling in missing values in the data
Binning — Combining continuous data into buckets, a form of feature engineering
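
Both terms can be illustrated with a short standard-library sketch. It assumes missing values are represented as `None`, fills them with the median of the known values (one of several possible imputation strategies), and then bins ages into buckets. The bucket boundaries and labels are arbitrary choices for this example, not taken from any kernel.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def bin_age(age):
    """Assign an age to a coarse bucket (boundaries are illustrative)."""
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 65:
        return "adult"
    return "senior"


ages = [4, None, 15, 67, None, 30]
filled = impute_median(ages)           # missing ages become 22.5
binned = [bin_age(a) for a in filled]
```

Where the bucket edges go is itself a feature engineering decision, which is one reason different authors end up with different bins for the same column.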



The training spreadsheet has a target column that you’re trying to solve for, which will be missing in the test data. The majority of the EDAs I examined focused on teasing out potential correlations between the target variable and the other columns.
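
A rough sketch of that kind of analysis: compute the Pearson correlation between each feature column and the target, then rank the features by absolute correlation. The column names (`size`, `age`, `price`) and the values are invented for illustration, and Pearson correlation is only one of several measures an EDA might use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical training data; "price" plays the role of the target column.
train = {
    "size": [50, 80, 120, 200],
    "age": [30, 20, 10, 5],
    "price": [100, 160, 240, 400],
}
target = "price"

correlations = {
    col: pearson(values, train[target])
    for col, values in train.items()
    if col != target
}
# Features most strongly related to the target (in either direction) first.
ranked = sorted(correlations, key=lambda c: abs(correlations[c]), reverse=True)
```

Pearson correlation only captures linear relationships, which is part of why EDAs tend to pair numbers like these with plots.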


Because you’re mostly looking for correlations between different variables, there are only so many ways you can slice and dice the data. For visualizations, there are more options, but even so, some techniques seem better suited to the task at hand than others, resulting in a lot of similar-looking notebooks.

Where you can really let your imagination run wild is with feature engineering. Each of the authors I looked at had different approaches to feature engineering, whether it was choosing how to bin a feature or combining categorical features into new ones.
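
As one hedged illustration of combining categorical features, the sketch below joins two categorical columns into a single new feature, so that a model can treat each pairing as its own category. The `combine_categories` helper and the `color` and `size` columns are hypothetical. This is not a reconstruction of any particular author's approach.

```python
def combine_categories(rows, cols, new_col, sep="_"):
    """Return copies of the rows with the given columns joined into new_col."""
    return [
        {**row, new_col: sep.join(str(row[c]) for c in cols)}
        for row in rows
    ]


# Made-up rows, for illustration only.
items = [
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "small"},
]
combined = combine_categories(items, ["color", "size"], "color_size")
```

A combined feature like this can expose an interaction (say, large red items behaving differently from other items) that neither column shows on its own.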

