Data Science [1]: Basic DataFrame Operations (1)

Data preparation

We use the tried-and-true Titanic dataset (https://www.kaggle.com/c/titanic/data).

Reading the file

Use the read_csv function to read the file.

import pandas as pd

df = pd.read_csv("train.csv")  # read_csv already returns a DataFrame
df.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
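To make the count row easier to read, here is a minimal sketch on an in-memory CSV. The three-row text below is made up for illustration and is not part of train.csv. It shows that read_csv returns a DataFrame directly and that describe() counts only non-null values, which is why Age reports 714 rather than 891 above.

```python
import io

import pandas as pd

# Made-up CSV text; the second row has no Age value.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,\n3,1,26\n"

toy = pd.read_csv(io.StringIO(csv_text))

# read_csv already gives us a DataFrame, no extra wrapping needed.
is_frame = isinstance(toy, pd.DataFrame)

# count() and the "count" row of describe() both ignore missing values.
age_count = toy["Age"].count()
summary = toy.describe()

print(is_frame, age_count)
print(summary)
```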

Missing values

Count the rows that contain at least one missing value

def num_nans(df):
    # any(axis=1) flags each row that has at least one null; sum() counts the flagged rows.
    return df.isnull().any(axis=1).sum()

print("there are " +  str(num_nans(df)) + " rows with at least one empty value")
there are 708 rows with at least one empty value
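A row count alone does not say which columns are responsible. As a rough sketch on a made-up toy frame (the values are invented, only the column names echo the dataset), the same isnull() mask can be summed per column instead of per row:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; values are made up.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Rows with at least one missing value (same idea as num_nans).
rows_with_nan = int(toy.isnull().any(axis=1).sum())

# Missing values per column: sum the boolean mask down each column.
nan_per_column = toy.isnull().sum()

print(rows_with_nan)
print(nan_per_column)
```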

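Removing sparse columns

Note that thresh is the minimum number of non-null values a column must have in order to be kept, not a cap on the number of missing values. dropna(thresh=200, axis=1) therefore removes only the columns with fewer than 200 non-null values. Every column in this dataset has at least 200 non-null values, so nothing is removed, as the column list below shows.

def drop_na(df):
    # Keep the columns that have at least 200 non-null values.
    return df.dropna(thresh=200, axis=1)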

df = drop_na(df)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
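If the goal is instead to drop every column with more than some number of missing values, one possible approach is to derive thresh from the row count, since thresh counts non-null values. The sketch below uses a made-up toy frame and a hypothetical limit called max_missing; on the Titanic data the limit would be 200.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are made up for illustration.
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
})

max_missing = 2  # hypothetical limit on missing values per column

# A column with at most max_missing nulls has at least
# len(toy) - max_missing non-null values.
kept_by_thresh = toy.dropna(thresh=len(toy) - max_missing, axis=1)

# Equivalent boolean-mask form, which states the condition directly.
kept_by_mask = toy.loc[:, toy.isnull().sum() <= max_missing]

print(list(kept_by_thresh.columns))
print(list(kept_by_mask.columns))
```

Both forms keep only mostly_full here; the mask version is a little longer but avoids having to remember which way thresh counts.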

Filling missing values with the mean

def replace_with_mean(df):
    # mean() skips NaN, so this is the average of the known ages.
    mvalue = df['Age'].mean()
    return df['Age'].fillna(mvalue)

df['Age'] = replace_with_mean(df)
df.head()

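Note: I reordered the code for this write-up, so the output below was produced after the later steps. The Sex column is already numeric, and the columns created by the column split (see below) are already present.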

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked First Name Middle Name Last Name Title
0 1 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S Owen Harris Braund Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C John Bradley (Florence Briggs Thayer) Cumings Mrs
2 3 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Laina None Heikkinen Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S Jacques Heath (Lily May Peel) Futrelle Mrs
4 5 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S William Henry Allen Mr
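The fill step can be checked on a tiny Series. This is only a sketch with made-up ages; it shows that mean() ignores the missing entry and that fillna returns a new Series rather than changing the original, which is why the result is assigned back to df['Age'].

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry.
ages = pd.Series([20.0, np.nan, 40.0], name="Age")

mean_age = ages.mean()          # NaN is skipped: (20 + 40) / 2
filled = ages.fillna(mean_age)  # new Series; ages itself is unchanged

print(mean_age)
print(filled.tolist())
print(int(ages.isnull().sum()))
```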

Replacing values

Use the replace function.
Example: replace male with 0 and female with 1:

def to_numerical(df):
    # Replace only within the Sex column; a DataFrame-wide replace would also touch other columns.
    return df['Sex'].replace({"male": 0, "female": 1})

df['Sex'] = to_numerical(df)
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S
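As an alternative to replace, Series.map with a dictionary is a common way to encode a categorical column. The sketch below uses made-up values; note the difference in behavior, which may or may not be what you want: map turns any value missing from the dictionary into NaN, whereas replace leaves it untouched.

```python
import pandas as pd

# Made-up values standing in for the Sex column.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")
mapping = {"male": 0, "female": 1}

# Every value is in the mapping, so the result is purely numeric.
encoded = sex.map(mapping)

# A value that is not in the mapping becomes NaN under map ...
with_unknown = pd.Series(["male", "unknown"]).map(mapping)
# ... but is passed through unchanged by replace.
passed_through = pd.Series(["female", "unknown"]).replace({"female": 1})

print(encoded.tolist())
print(with_unknown.isnull().tolist())
print(passed_through.tolist())
```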

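Splitting a column

Selecting a single column from a DataFrame returns a Series object. String methods cannot be called on a Series directly; they have to go through the .str accessor.
For example, in this dataset we want to split the Name column of the original DataFrame into four columns: First Name, Middle Name, Last Name, and Title. Braund, Mr. Owen Harris should become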

First Name  Middle Name  Last Name  Title
Owen        Harris       Braund     Mr
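A short sketch of the .str accessor on two names taken from the dataset: the Series itself has no split method, .str.split applies the string method element by element, and expand=True spreads the pieces into DataFrame columns.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# A Series does not expose string methods such as split directly.
has_plain_split = hasattr(names, "split")

# Through .str, each element is split on its own; the result holds lists.
parts = names.str.split(",")

# n=1 splits at the first comma only; expand=True returns a DataFrame.
table = names.str.split(",", n=1, expand=True)

print(has_plain_split)
print(parts[0])
print(table)
```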

Approach: .str.split()
Code:

def extract_names(df):
    # "Braund, Mr. Owen Harris" -> "Braund, Mr" and " Owen Harris" (split at the first '.').
    # Pass n by keyword; recent pandas versions reject it as a positional argument.
    tmp_df = df['Name'].str.split('.', n=1, expand=True)
    tmp_df.columns = ['0', '1']
    # Left part: last name and title, separated by the first comma.
    tmp_df[['Last Name', 'Title']] = tmp_df['0'].str.strip().str.split(',', n=1, expand=True)
    # Right part: first name, then everything else as the middle name (None if absent).
    tmp_df[['First Name', 'Middle Name']] = tmp_df['1'].str.strip().str.split(' ', n=1, expand=True)
    # Trim leftover whitespace around each piece.
    for col in ['First Name', 'Middle Name', 'Last Name', 'Title']:
        tmp_df[col] = tmp_df[col].str.strip()

    return tmp_df[['First Name', 'Middle Name', 'Last Name', 'Title']]

df[['First Name', 'Middle Name', 'Last Name', 'Title']] = extract_names(df)
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked First Name Middle Name Last Name Title
0 1 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S Owen Harris Braund Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C John Bradley (Florence Briggs Thayer) Cumings Mrs
2 3 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Laina None Heikkinen Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S Jacques Heath (Lily May Peel) Futrelle Mrs
4 5 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S William Henry Allen Mr
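The same split could also be sketched with a single regular expression and .str.extract. This is an alternative to the .str.split() approach, not the method used above, and the pattern is my own assumption about the "Last, Title. First Middle" layout; names that do not follow that layout would need a different pattern.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# Assumed layout: "<Last>, <Title>. <First> [<Middle ...>]".
# Group names must be valid identifiers, so the columns are renamed afterwards.
pattern = (
    r"^(?P<Last>[^,]+),\s*"      # everything before the first comma
    r"(?P<Title>[^.]+)\.\s*"     # everything up to the first period
    r"(?P<First>\S+)"            # first whitespace-delimited word
    r"(?:\s+(?P<Middle>.*))?$"   # optional remainder
)

parts = names.str.extract(pattern)
parts.columns = ["Last Name", "Title", "First Name", "Middle Name"]

print(parts)
```

When the optional remainder is absent, as for Heikkinen, the Middle Name cell is left missing.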
