Pandas is a library you must master to get started with data analysis in Python. This article collects ten sets of exercises to help readers get hands-on with Python code and explore real datasets.
The content was translated and compiled from GitHub by Kesci (科赛网). Readers are advised to finish Kesci's "Essential Python Code from Scratch" and "Pandas Basic Commands Cheat Sheet" tutorials before working through and debugging the code here.
[Tip: the datasets used in this article can be downloaded at: DATA | TRAIN practice datasets]
Ten Exercises to Teach You Data Analysis with Pandas
| Exercise | Topic | Dataset |
|---|---|---|
| Exercise 1 - Getting to know your data | Exploring Chipotle fast-food data | chipotle.tsv |
| Exercise 2 - Filtering and sorting | Exploring Euro 2012 data | Euro2012_stats.csv |
| Exercise 3 - Grouping | Exploring alcohol consumption data | drinks.csv |
| Exercise 4 - The Apply function | Exploring US crime rates 1960-2014 | US_Crime_Rates_1960_2014.csv |
| Exercise 5 - Merging | Exploring fictitious name data | data built by hand in the exercise |
| Exercise 6 - Statistics | Exploring wind speed data | wind.data |
| Exercise 7 - Visualization | Exploring Titanic disaster data | train.csv |
| Exercise 8 - Creating DataFrames | Exploring Pokémon data | data built by hand in the exercise |
| Exercise 9 - Time series | Exploring Apple stock price data | Apple_stock.csv |
| Exercise 10 - Deleting data | Exploring Iris flower data | iris.csv |
Dataset: chipotle.tsv
# Run the code below
import pandas as pd
# Run the code below
path1 = "../input/pandas_exercise/exercise_data/chipotle.tsv" # chipotle.tsv
# Run the code below
chipo = pd.read_csv(path1, sep='\t')
# Run the code below
chipo.head(10)
# Run the code below
chipo.shape[1]
# Run the code below
chipo.columns
# Run the code below
chipo.index
# Run the code below (corrected)
c = chipo[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
c.sort_values(['quantity'], ascending=False, inplace=True)
c.head()
# Run the code below
chipo['item_name'].nunique()
# Run the code below (note: value_counts skips the NaN entries in choice_description by default)
chipo['choice_description'].value_counts().head()
# Run the code below
total_items_orders = chipo['quantity'].sum()
total_items_orders
# Run the code below
# strip the leading '$' (and the trailing character) and convert to float
dollarizer = lambda x: float(x[1:-1])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
# Run the code below (corrected)
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'], 2)
chipo['sub_total'].sum()
# Run the code below
chipo['order_id'].nunique()
# Run the code below (corrected)
chipo[['order_id', 'sub_total']].groupby(by=['order_id']).agg({'sub_total': 'sum'})['sub_total'].mean()
# Run the code below
chipo['item_name'].nunique()
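The per-order mean computed above (group, sum, then average) is a common pattern; a related trick, sketched below on toy data rather than the chipotle file, uses `transform` to broadcast each order's total back onto every row:

```python
import pandas as pd

# Toy order data shaped like chipo: one row per line item
df = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "sub_total": [2.5, 3.5, 10.0, 4.0],
})

# Mean revenue per order: sum within each order, then average across orders
order_totals = df.groupby("order_id")["sub_total"].sum()
print(order_totals.mean())

# transform broadcasts the per-order sum back onto every row,
# which is handy for e.g. filtering line items by order size
df["order_total"] = df.groupby("order_id")["sub_total"].transform("sum")
print(df)
```

Unlike the `agg` version, `transform` keeps the original row count, so the result can be assigned straight back as a column.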
Dataset: Euro2012_stats.csv
# Run the code below
import pandas as pd
# Run the code below
path2 = "../input/pandas_exercise/exercise_data/Euro2012_stats.csv" # Euro2012_stats.csv
# Run the code below
euro12 = pd.read_csv(path2)
euro12
# Run the code below
euro12.Goals
# Run the code below
euro12.shape[0]
# Run the code below
euro12.info()
# Run the code below
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
# Run the code below
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
# Run the code below
round(discipline['Yellow Cards'].mean())
# Run the code below
euro12[euro12.Goals > 6]
# Run the code below
euro12[euro12.Team.str.startswith('G')]
# Run the code below
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Dataset: drinks.csv
# Run the code below
import pandas as pd
# Run the code below
path3 = '../input/pandas_exercise/exercise_data/drinks.csv' # 'drinks.csv'
# Run the code below
drinks = pd.read_csv(path3)
drinks.head()
# Run the code below
drinks.groupby('continent').beer_servings.mean()
# Run the code below
drinks.groupby('continent').wine_servings.describe()
# Run the code below
# numeric_only=True skips the non-numeric country column (required in pandas >= 2.0)
drinks.groupby('continent').mean(numeric_only=True)
# Run the code below
drinks.groupby('continent').median(numeric_only=True)
# Run the code below
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
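The list-of-functions form of `agg` used above also has a named-aggregation variant, where each keyword names an output column. A minimal sketch on toy data standing in for drinks.csv:

```python
import pandas as pd

# Toy stand-in for the drinks data: two continents, a few servings values
df = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "spirit_servings": [100, 200, 50, 150],
})

# Named aggregation: each keyword becomes an output column
stats = df.groupby("continent").agg(
    mean_servings=("spirit_servings", "mean"),
    min_servings=("spirit_servings", "min"),
    max_servings=("spirit_servings", "max"),
)
print(stats)
```

This avoids the generic `mean`/`min`/`max` column labels the list form produces.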
Dataset: US_Crime_Rates_1960_2014.csv
# Run the code below
import numpy as np
import pandas as pd
# Run the code below
path4 = '../input/pandas_exercise/exercise_data/US_Crime_Rates_1960_2014.csv' # "US_Crime_Rates_1960_2014.csv"
# Run the code below
crime = pd.read_csv(path4)
crime.head()
# Run the code below
crime.info()
Did you notice? Year has dtype int64, but pandas has a dedicated dtype for handling time series. Let's take a look.
# Run the code below
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
# Run the code below
crime = crime.set_index('Year', drop=True)
crime.head()
# Run the code below
del crime['Total']
crime.head()
# '10AS' = decades anchored at year start (spelled '10YS' in pandas >= 2.2)
crime.resample('10AS').sum()
Note the Population column: summing it directly over each decade is not correct, since it is a point-in-time level rather than a yearly count.
More on .resample
More on Offset Aliases
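One way to honor the caveat above in a single call is to hand `resample` a per-column aggregation via `.agg`, summing the flow-like columns while taking the max of level-like ones. A minimal sketch on toy data (using '10YS', the pandas >= 2.2 spelling of the '10AS' alias):

```python
import pandas as pd

# Toy yearly data: a count-like column (should be summed)
# and a level-like column (should take the max per decade)
idx = pd.date_range("1960", periods=20, freq="YS")
df = pd.DataFrame({"Violent": range(20), "Population": range(100, 120)}, index=idx)

# One resample, different aggregation per column
decades = df.resample("10YS").agg({"Violent": "sum", "Population": "max"})
print(decades)
```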
# Run the code below
crimes = crime.resample('10AS').sum() # resample the time series by decade
# use resample to take the max of the Population column instead
population = crime['Population'].resample('10AS').max()
# overwrite Population with the correct values
crimes['Population'] = population
crimes
# Run the code below
crime.idxmax(0)
Dataset: built by hand in the exercise
# Run the code below
import numpy as np
import pandas as pd
# Run the code below
raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
# Run the code below
data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])
# Run the code below
all_data = pd.concat([data1, data2])
all_data
# Run the code below
all_data_col = pd.concat([data1, data2], axis=1)
all_data_col
# Run the code below
data3
# Run the code below
pd.merge(all_data, data3, on='subject_id')
# Run the code below
pd.merge(data1, data2, on='subject_id', how='inner')
# Run the code below
pd.merge(data1, data2, on='subject_id', how='outer')
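To see exactly which rows `how='inner'` keeps and `how='outer'` adds, merge's `indicator` parameter tags each row with its origin. A sketch on two small frames analogous to data1 and data2:

```python
import pandas as pd

left = pd.DataFrame({"subject_id": ["4", "5", "6"],
                     "first_name": ["Alice", "Ayoung", "Bran"]})
right = pd.DataFrame({"subject_id": ["5", "6", "7"],
                      "last_name": ["Black", "Balwner", "Brice"]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on="subject_id", how="outer", indicator=True)
print(merged)
```

An inner join would keep only the rows marked 'both'; the outer join keeps all of them.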
Dataset: wind.data
# Run the code below
import pandas as pd
import datetime
### Step 2: Import the data from the path below
# Run the code below
path6 = "../input/pandas_exercise/exercise_data/wind.data" # wind.data
# Run the code below
# a raw string avoids the invalid-escape warning for \s
data = pd.read_table(path6, sep=r"\s+", parse_dates=[[0, 1, 2]])
data.head()
# Run the code below
# two-digit years were parsed into the wrong century; shift years after 1989 back by 100
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
# apply fix_century to the column and replace the values with the corrected ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
# Run the code below
# convert Yr_Mo_Dy to datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# Run the code below
data.isnull().sum()
# Run the code below
data.shape[0] - data.isnull().sum()
# Run the code below
data.mean().mean()
# Run the code below
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviation
loc_stats
# Run the code below
# create the dataframe
day_stats = pd.DataFrame()
# axis=1 computes each statistic across columns, i.e. one value per row
day_stats['min'] = data.min(axis=1) # min
day_stats['max'] = data.max(axis=1) # max
day_stats['mean'] = data.mean(axis=1) # mean
day_stats['std'] = data.std(axis=1) # standard deviation
day_stats.head()
(Note: January 1961 and January 1962 should be treated as separate months.)
# Run the code below
# create a new column 'date' from the index values
data['date'] = data.index
# split the date into separate columns
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# select all rows from month 1 and assign them to january_winds
january_winds = data.query('month == 1')
# take the mean of january_winds, using .loc so the month, year and day columns are excluded
january_winds.loc[:, 'RPT':'MAL'].mean()
# Run the code below
data.query('month == 1 and day == 1')
# Run the code below
data.query('day == 1')
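The parenthetical note above (January 1961 vs. January 1962) corresponds to downsampling by calendar month rather than filtering on `month == 1`. A minimal sketch with `resample` on toy data, using the 'MS' month-start alias:

```python
import numpy as np
import pandas as pd

# Toy daily series spanning two Januaries
idx = pd.date_range("1961-01-01", "1962-01-31", freq="D")
df = pd.DataFrame({"RPT": np.arange(len(idx), dtype=float)}, index=idx)

# One row per calendar month: Jan 1961 and Jan 1962 stay distinct
monthly = df.resample("MS").mean()
print(monthly.head())
```

By contrast, `groupby(df.index.month)` would pool both Januaries into a single group.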
Dataset: train.csv
# Run the code below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
# Run the code below
path7 = '../input/pandas_exercise/exercise_data/train.csv' # train.csv
# Run the code below
titanic = pd.read_csv(path7)
titanic.head()
# Run the code below
titanic.set_index('PassengerId').head()
# Run the code below
# count the male and female passengers
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put the counts into a list called proportions
proportions = [males, females]
# create a pie chart
plt.pie(
    # using proportions
    proportions,
    # label each wedge
    labels=['Males', 'Females'],
    # with no shadow
    shadow=False,
    # with colors
    colors=['blue', 'red'],
    # with one slice exploded out
    explode=(0.15, 0),
    # with the start angle at 90 degrees
    startangle=90,
    # with the percentage shown on each wedge
    autopct='%1.1f%%'
)
# keep the pie circular
plt.axis('equal')
# set the title
plt.title("Sex Proportion")
# show the plot
plt.tight_layout()
plt.show()
# Run the code below
# create the scatter plot
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
# set the title
lm.set(title='Fare x Age')
# get the axes object and tweak the limits
axes = lm.axes
axes[0, 0].set_ylim(-5,)
axes[0, 0].set_xlim(-5, 85)
# Run the code below
titanic.Survived.sum()
# Run the code below
# sort the fares in descending order
df = titanic.Fare.sort_values(ascending=False)
df
# create the bin edges with numpy
binsVal = np.arange(0, 600, 10)
binsVal
# create the plot
plt.hist(df, bins=binsVal)
# set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
# show the plot
plt.show()
Dataset: built by hand in the exercise
# Run the code below
import pandas as pd
# Run the code below
raw_data = {"name": ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
            "evolution": ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no', 'yes', 'no']
            }
# Run the code below
pokemon = pd.DataFrame(raw_data)
pokemon.head()
# Run the code below
# reorder the columns
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
pokemon
# Run the code below
pokemon['place'] = ['park', 'street', 'lake', 'forest']
pokemon
# Run the code below
pokemon.dtypes
Dataset: Apple_stock.csv
# Run the code below
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
# Run the code below
path9 = '../input/pandas_exercise/exercise_data/Apple_stock.csv' # Apple_stock.csv
# Run the code below
apple = pd.read_csv(path9)
apple.head()
# Run the code below
apple.dtypes
# Run the code below
apple.Date = pd.to_datetime(apple.Date)
apple['Date'].head()
# Run the code below
apple = apple.set_index('Date')
apple.head()
# Run the code below
apple.index.is_unique
# Run the code below
apple.sort_index(ascending=True).head()
# Run the code below (corrected: resample alone returns a Resampler object, so aggregate with .mean())
apple_month = apple.resample('BM').mean()
apple_month.head()
# Run the code below
(apple.index.max() - apple.index.min()).days
# Run the code below
apple_months = apple.resample('BM').mean()
len(apple_months.index)
# Run the code below
# make the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title="Apple Stock")
# change the size of the figure
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)
Dataset: iris.csv
# Run the code below
import numpy as np
import pandas as pd
# Run the code below
path10 = '../input/pandas_exercise/exercise_data/iris.csv' # iris.csv
# Run the code below
iris = pd.read_csv(path10)
iris.head()
# the file has no header row, so re-read it with explicit column names
iris = pd.read_csv(path10, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
iris.head()
# Run the code below
pd.isnull(iris).sum()
# Run the code below
# np.nan requires the numpy import added above
iris.iloc[10:20, 2:3] = np.nan
iris.head(20)
# Run the code below
iris.petal_length.fillna(1, inplace=True)
iris
# Run the code below
del iris['class']
iris.head()
# Run the code below
iris.iloc[0:3, :] = np.nan
iris.head()
# Run the code below
iris = iris.dropna(how='any')
iris.head()
# Run the code below
iris = iris.reset_index(drop=True)
iris.head()