Common Patterns for Analyzing Data: Case Breakdown

This post is part two of Common Patterns for Analyzing Data in the ATRS check-in series.

Vocabulary

boast: to have a feature worth being proud of
Each home boasts an unprecedented level of quality throughout.
That is, every home offers first-rate quality throughout.

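survive: to stay alive
survivor: a person who stayed alive
acknowledgment: admission; thanks
comply: to obey; to act as agreed
complicated: complex in structure; confusing, troublesome
discrete: separate, distinct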

English Reading

Let’s take a deeper look at two competitions, the Titanic competition, followed by the House Prices competition.

In short, the article examines two competitions in depth: first the Titanic competition, then the House Prices competition.

Titanic Survival Prediction Competition

The Titanic competition is a popular beginners’ competition, and lots of folks on Kaggle cycle through it. As a result the EDAs tend to be well written and thoroughly documented, and were amongst the clearest I saw. The dataset includes a training spreadsheet with a column Survived indicating whether a passenger survived or not, along with other supplementary data like their age, gender, ticket fare price, and more.
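To make the layout of the training spreadsheet concrete, here is a minimal sketch that reads a few rows and computes the survival rate per gender. The rows are made up for illustration and are not taken from the real dataset; the column names (PassengerId, Survived, Sex, Age, Fare) are assumed to match the competition's training file, and survival_rate_by is a hypothetical helper, not part of any library.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT taken from the real dataset.
SAMPLE_CSV = """PassengerId,Survived,Sex,Age,Fare
1,0,male,22,7.25
2,1,female,38,71.28
3,1,female,26,7.92
4,0,male,35,8.05
5,1,male,54,51.86
6,0,female,14,11.24
"""


def survival_rate_by(rows, column):
    """Return {value: share of rows with Survived == 1} for the given column."""
    totals, survived = {}, {}
    for row in rows:
        key = row[column]
        totals[key] = totals.get(key, 0) + 1
        survived[key] = survived.get(key, 0) + int(row["Survived"])
    return {key: survived[key] / totals[key] for key in totals}


# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = survival_rate_by(rows, "Sex")
print(rates)
```

Grouping the label by one column at a time like this is about the simplest form of exploratory analysis; the same helper works for any other categorical column in the file.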

[Figure: Titanic survival prediction]

The competition's home page describes its goal as follows:

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.

Binary classification

The task is to use binary classification to work out which groups of people were more likely to survive.

Training Data and Test Data

[Figure: training data and test data]

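The figure above describes the training data and the test data. The training data comes with known outcomes. The test data does not, so its outcomes have to be produced by a predictive model.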
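The split between the two files can be sketched as follows. This is a toy illustration under stated assumptions, not how any particular Kaggle entry works: the rows are invented, and fit_majority_rule and predict are hypothetical helpers that learn the most common outcome per gender from the training rows and then apply it to test rows that carry no Survived value.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration; not real competition data.
train = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
]
# Test rows carry the same features but no Survived column.
test = [{"Sex": "male"}, {"Sex": "female"}]


def fit_majority_rule(rows, feature, label):
    """Learn the most common label for each value of one feature."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(rule, rows, feature, default=0):
    """Predict a label for each row; unseen feature values get the default."""
    return [rule.get(row[feature], default) for row in rows]


rule = fit_majority_rule(train, "Sex", "Survived")
predictions = predict(rule, test, "Sex")
print(predictions)
```

A real model would use many more features, but the shape is the same: fit on rows whose outcome is known, then predict for rows whose outcome is not.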

House Prices is another structured data competition. This one boasts many more variables than the Titanic competition, and includes categorical, ordinal and continuous features.

House Prices Prediction Competition


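In other words, House Prices is also a structured-data competition, but it has far more variables than the Titanic competition, and they fall into three kinds: categorical, ordinal and continuous.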
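The three kinds of feature usually need different handling before a model can use them. The sketch below shows one common convention, assuming one-hot indicators for a categorical column, integer ranks for an ordinal column, and no change for a continuous column. The records are invented; the column names (Neighborhood, ExterQual, GrLivArea) and the quality codes are assumed to follow the competition's data description, and encode is a hypothetical helper.

```python
# Hypothetical house records; the values are invented for illustration.
houses = [
    {"Neighborhood": "NAmes", "ExterQual": "TA", "GrLivArea": 1200.0},
    {"Neighborhood": "OldTown", "ExterQual": "Gd", "GrLivArea": 1710.0},
    {"Neighborhood": "NAmes", "ExterQual": "Ex", "GrLivArea": 2400.0},
]

# Ordinal: the categories have a natural order, so map them to ranks.
QUALITY_RANK = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}


def encode(rows):
    """Turn mixed-type records into purely numeric feature vectors."""
    # Categorical: no order, so use one 0/1 indicator per distinct value.
    neighborhoods = sorted({r["Neighborhood"] for r in rows})
    encoded = []
    for r in rows:
        one_hot = [1.0 if r["Neighborhood"] == n else 0.0 for n in neighborhoods]
        rank = float(QUALITY_RANK[r["ExterQual"]])
        # Continuous: already numeric, so keep it as-is.
        encoded.append(one_hot + [rank, r["GrLivArea"]])
    return neighborhoods, encoded


columns, matrix = encode(houses)
print(columns)
print(matrix[0])
```

Mapping an ordinal column to ranks keeps the ordering information that one-hot encoding would throw away, which is why the two kinds are treated differently here.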

A Python walkthrough is available here:
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

Understand how variables are distributed and how they interact
Apply different transformations before training machine learning models

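In other words, the goals are to understand how this many variables are distributed and how they affect one another,
and to learn which transformations to apply to the data before training a machine learning model.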
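As one example of checking a distribution and then transforming it, the sketch below measures the skewness of a right-tailed variable before and after a log transform. The prices are invented for illustration, skewness is a hypothetical helper that computes the population skewness (the mean of cubed z-scores), and the log transform is just one common choice, not necessarily the one any given notebook uses.

```python
import math


def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Invented sale prices with a long right tail, for illustration only.
prices = [95_000, 120_000, 135_000, 150_000, 160_000, 180_000, 210_000, 600_000]

# log1p(x) = log(1 + x); it compresses large values more than small ones.
log_prices = [math.log1p(p) for p in prices]

print(round(skewness(prices), 2))
print(round(skewness(log_prices), 2))
```

The second number comes out smaller than the first, meaning the transformed values are closer to symmetric, which many models handle better than a long right tail.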

The EDAs I chose for analysis were Comprehensive Data Exploration with Python by Pedro Marcelino, Detailed Data Exploration in Python by Angela, and Fun Python EDA Step by Step by Sang-eon Park.

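The English article is long, so I split it into material for two reading-practice sessions. Readers with no background in data analysis or machine learning may find it bewildering, so here is a brief summary of its content.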

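First of all, this is an article about data analysis and machine learning. It centers on data and uses two real competitions from the data-competition site www.kaggle.com to show readers the common patterns of data analysis. It involves many technical terms, including dataset, test data, training data, data preprocessing, model, feature engineering and data science.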

Takeaways and Reflections

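The English material in this reading is not enough to give a full overview of data analysis and machine learning, but it is still a good introduction to data analysis in practice. Following the article's lead, I registered at https://www.kaggle.com/. In my later studies I can use the site to find competitions of a given type, browse their datasets, and see how other people describe, understand and analyze the data, especially how they do it in Python.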

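I am also attaching a post I wrote a year ago with my thoughts on learning Python as a language: "Should You Learn a Little Python to Protect Yourself?"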