We use the classic Titanic dataset (https://www.kaggle.com/c/titanic/data) and read the file with the `read_csv` function.
import pandas as pd
# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed.
df = pd.read_csv("train.csv")
df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
def num_nans(df):
    # Count the rows that contain at least one missing value.
    return df.isnull().any(axis=1).sum()
print(f"there are {num_nans(df)} rows with at least one empty value")
there are 708 rows with at least one empty value
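The row count alone does not say which columns are responsible for the missing values. A minimal sketch of a per-column breakdown follows, assuming a small made-up frame (`toy` and its values are illustrative, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows with at least one missing value (same idea as num_nans).
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

Running the same two expressions on `df` would show which columns to drop or fill.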
def drop_na(df):
    # Keep only the columns that have at least 200 non-missing values.
    return df.dropna(thresh=200, axis=1)
df = drop_na(df)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
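All twelve columns are still present in the output above, which means every column has at least 200 non-missing values. To see `thresh` actually drop something, here is a small sketch on made-up data (the column names `mostly_full` and `mostly_empty` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],         # 3 non-missing values
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 1 non-missing value
})

# axis=1 works on columns; thresh=2 keeps a column only if it has
# at least 2 non-missing values.
kept = toy.dropna(thresh=2, axis=1)

print(list(kept.columns))
```

Note that `thresh` counts non-missing values, not missing ones, which is easy to get backwards.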
def replace_with_mean(df):
    # Fill missing ages with the mean of the known ages.
    mvalue = df['Age'].mean()
    return df['Age'].fillna(mvalue)
df['Age'] = replace_with_mean(df)
df.head()
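Note: I rearranged the order of my code while writing this up, so the DataFrame printed here is actually the result after the column split (see below), which is why it already contains the extra name columns.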
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | First Name | Middle Name | Last Name | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Owen | Harris | Braund | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | John | Bradley (Florence Briggs Thayer) | Cumings | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Laina | None | Heikkinen | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Jacques | Heath (Lily May Peel) | Futrelle | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | William | Henry | Allen | Mr |
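Mean imputation works because `Series.mean()` skips missing values by default, so the mean is computed from the known entries only. A minimal sketch on a made-up series (`ages` is illustrative):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

# NaN is skipped by default, so this is (20 + 40) / 2.
mean_age = ages.mean()

# fillna returns a new Series; the original is left unchanged.
filled = ages.fillna(mean_age)

print(filled.tolist())
```

Because `fillna` returns a new object, the result has to be assigned back, as `df['Age'] = replace_with_mean(df)` does above.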
Use the `replace` function.
Example: replace male with 0 and female with 1:
def to_numerical(df):
    # Encode only the Sex column (male -> 0, female -> 1) instead of scanning the whole DataFrame.
    return df['Sex'].replace({"male": 0, "female": 1})
df['Sex'] = to_numerical(df)
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
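As an aside, `Series.map` is a common alternative to `replace` for this kind of encoding. It is not what the code above uses; the sketch below only contrasts the two on a made-up series that includes an unexpected value (`"unknown"` is hypothetical and does not occur in the Titanic data as far as this post shows):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "unknown"])
codes = {"male": 0, "female": 1}

# replace: values missing from the mapping are left as they are.
replaced = sex.replace(codes)

# map: values missing from the mapping become NaN.
mapped = sex.map(codes)

print(replaced.tolist())
print(mapped.isnull().sum())
```

Which behavior is preferable depends on whether unexpected categories should pass through silently or surface as missing values.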
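First, selecting a single column of a DataFrame returns a Series object. String methods cannot be called on a Series directly; they have to go through the `.str` accessor, which applies the string method to each element.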
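A small sketch of the difference, using a made-up series of names (`names` is illustrative):

```python
import pandas as pd

names = pd.Series(["  Braund ", "Cumings", None])

# names.upper() would raise AttributeError: a Series has no string methods.
# The .str accessor applies the method element by element instead.
upper = names.str.strip().str.upper()

print(upper.tolist())
```

Missing entries stay missing rather than raising an error, which is convenient for columns that contain gaps.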
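For example, in this dataset we want to split the `Name` column of the original DataFrame into four columns: `First Name`, `Middle Name`, `Last Name`, and `Title`. For instance, `Braund, Mr. Owen Harris` should become: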
First Name | Middle Name | Last Name | Title |
---|---|---|---|
Owen | Harris | Braund | Mr |
Approach: combine `.str` with `.split()`.
Code:
def extract_names(df):
    # Names look like "Last, Title. First Middle"; split at the first "." only.
    # n must be passed by keyword: recent pandas versions no longer accept it positionally.
    tmp_df = df['Name'].str.split('.', n=1, expand=True)
    tmp_df.columns = ['0', '1']
    # Left part is "Last, Title"; right part is "First Middle".
    tmp_df[['Last Name', 'Title']] = tmp_df['0'].str.strip().str.split(',', n=1, expand=True)
    tmp_df[['First Name', 'Middle Name']] = tmp_df['1'].str.strip().str.split(' ', n=1, expand=True)
    # Trim leftover whitespace from each new column.
    tmp_df['First Name'] = tmp_df['First Name'].str.strip()
    tmp_df['Middle Name'] = tmp_df['Middle Name'].str.strip()
    tmp_df['Last Name'] = tmp_df['Last Name'].str.strip()
    tmp_df['Title'] = tmp_df['Title'].str.strip()
    return tmp_df[['First Name', 'Middle Name', 'Last Name', 'Title']]
df[['First Name', 'Middle Name', 'Last Name', 'Title']] = extract_names(df)
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | First Name | Middle Name | Last Name | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Owen | Harris | Braund | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | John | Bradley (Florence Briggs Thayer) | Cumings | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Laina | None | Heikkinen | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Jacques | Heath (Lily May Peel) | Futrelle | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | William | Henry | Allen | Mr |
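To make the roles of `n` and `expand` in the split concrete, here is a short walkthrough on two names taken from the table above. The variable names (`parts`, `as_lists`, `given`) are only for illustration:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# n=1 splits at the first "." only; expand=True returns one column per piece.
parts = names.str.split(".", n=1, expand=True)

# Without expand=True the result is a Series of lists, not separate columns.
as_lists = names.str.split(".", n=1)

# Splitting the right-hand part on the first space separates the first name
# from the rest; a name with no second word leaves the second column missing.
given = parts[1].str.strip().str.split(" ", n=1, expand=True)

print(parts)
print(given)
```

This is why Laina Heikkinen ends up with a missing `Middle Name` in the output: her given-name part has nothing after the first word.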