Notes on Data Manipulation with pandas, from DataCamp
.head(): returns the first few rows of a DataFrame.
.info(): shows information on each column, such as the data type and the number of non-null values (useful for spotting missing data).
.shape: returns the number of rows and columns.
.describe(): computes summary statistics for each numeric column.
homelessness is the example dataset (a DataFrame) used below.
# Print the head of the homelessness data
print(homelessness.head())
# Print information about homelessness
print(homelessness.info())
# Print the shape of homelessness
print(homelessness.shape)
# Print a description of homelessness
print(homelessness.describe())
.values: the data as a two-dimensional NumPy array.
.columns: the column names (the column index).
.index: the row labels (the row index).
# Import pandas using the alias pd
import pandas as pd
# Print the values of homelessness
print(homelessness.values)
# Print the column index of homelessness
print(homelessness.columns)
# Print the row index of homelessness
print(homelessness.index)
You can sort by one column, or by several columns at once, and set each of them to ascending or descending order individually.
| Sort on … | Syntax |
|---|---|
| one column | df.sort_values("breed") |
| multiple columns | df.sort_values(["breed", "weight_kg"]) |
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(by = ['region', 'family_members'],
ascending=[True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
Select a single column: df["col_a"]
Select multiple columns: df[["col_a", "col_b"]]
# Select the individuals column
individuals = homelessness['individuals']
# Print the head of the result
print(individuals.head())
# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]
# Print the head of the result
print(state_fam.head())
Selecting (filtering) rows, similar to filter() in R:
dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]
You can also filter on multiple conditions:
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness.individuals > 10000]
# or
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]
# See the result
print(ind_gt_10k)
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == "Mountain"]
# See the result
print(mountain_reg)
# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) &
(homelessness['region'] == 'Pacific')]
# See the result
print(fam_lt_1k_pac)
Subsetting data based on a categorical variable often means using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want, say, all states in one of three different regions. Instead, use the .isin() method, which lets you tackle this with a single condition rather than three separate ones.
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") |
(homelessness["region"] == "Mid-Atlantic")]
# See the result
print(south_mid_atlantic)
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]
# See the result
print(mojave_homelessness)
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals']+homelessness['family_members']
# Add p_individuals col as proportion of individuals
homelessness['p_individuals'] = homelessness['individuals']/homelessness['total']
# See the result
print(homelessness)
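As a side note (not from the course): .assign() builds the same columns but returns a new DataFrame instead of modifying homelessness in place. A minimal sketch:
# Sketch: .assign() returns a new DataFrame rather than mutating in place;
# later columns can reference earlier ones via callables
homelessness_new = homelessness.assign(
    total=lambda df: df["individuals"] + df["family_members"],
    p_individuals=lambda df: df["individuals"] / df["total"],
)
print(homelessness_new.head())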
# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())
# Print the median of weekly_sales
print(sales['weekly_sales'].median())
# Print the maximum of the date column
print(sales['date'].max())
# Print the minimum of the date column
print(sales['date'].min())
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column at a time, making your aggregations super-efficient.
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))
# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c","fuel_price_usd_per_l","unemployment"]].agg(iqr))
# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
# (np.median needs NumPy, imported as np)
import numpy as np
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))
Cumulative statistics can also help track summary values over time.
# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date')
# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()
# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()
# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
Removing duplicates is an essential skill for getting accurate counts, because often you don't want to count the same thing multiple times.
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())
# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())
# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
# Print date col of holiday_dates
print(holiday_dates["date"])
Counting is a great way to get an overview of your data and to spot curiosities that you might not otherwise notice.
# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)
# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)
# Count the number of stores in each department and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)
# Get the proportion of stores in each department and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
While .groupby() is useful, you can also compute grouped summary statistics without it (a .groupby() version follows the code below for comparison).
# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()
# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()
# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)
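For comparison, a minimal sketch of the same proportions computed with .groupby():
# Group by type, sum weekly_sales, then divide by the overall total
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
sales_propn_by_type = sales_by_type / sales_by_type.sum()
print(sales_propn_by_type)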
Earlier in this chapter, you saw that the .agg() method is useful for computing multiple statistics on multiple variables. It also works on grouped data. NumPy, imported as np, has many different summary statistics functions, including:
np.min()
np.max()
np.mean()
np.median()
# Import NumPy with the alias np
import numpy as np
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([np.min, np.max, np.mean, np.median])
# Print sales_stats
print(sales_stats)
# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.min, np.max, np.mean, np.median])
# Print unemp_fuel_stats
print(unemp_fuel_stats)
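A side note, not from the course: newer pandas versions warn when NumPy functions are passed to .agg() directly; string names are an equivalent sketch:
# Equivalent aggregation using string names instead of NumPy functions
sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])
print(sales_stats)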
Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations; that is, the .pivot_table() method is an alternative to .groupby().
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values="weekly_sales", index="type")
# Print mean_sales_by_type
print(mean_sales_by_type)
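For comparison, a minimal sketch of the equivalent .groupby() call (.pivot_table() aggregates with the mean by default):
# Same grouped calculation: mean weekly_sales per store type
print(sales.groupby("type")["weekly_sales"].mean())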
The .pivot_table() method has several useful arguments, including fill_value and margins.
# Print the mean weekly_sales by department and type; fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values="weekly_sales", index="department",
                        columns="type", fill_value=0, margins=True))
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (and in some cases provides more efficient lookups).
# Look at temperatures
print(temperatures)
# Index temperatures by city
temperatures_ind = temperatures.set_index('city')
# Look at temperatures_ind
print(temperatures_ind)
# Reset the index, keeping its contents
print(temperatures_ind.reset_index())
# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))
The killer feature of indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it takes a subset of rows.
Code for subsetting with .loc[] can be easier to read than standard square-bracket subsetting, which can make your code less burdensome to maintain.
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]
# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There are trade-offs to using these.
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country','city'])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep =[('Brazil','Rio De Janeiro'),('Pakistan','Lahore')]
# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need .sort_index().
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending = [True, False]))
You can select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number (see the .iloc[] sketch after the code below).
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia (slicing the outer index level works)
print(temperatures_srt.loc['Pakistan':'Russia'])
# Trying to slice on the inner level alone returns nonsense (an empty result)
print(temperatures_srt.loc['Lahore':'Moscow'])
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])
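Slicing by row/column number uses .iloc[]; a minimal sketch (the positions here are arbitrary examples):
# Positional slicing with .iloc[]: rows at positions 22-27, first two columns
print(temperatures_srt.iloc[22:28, 0:2])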
You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad')])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:,'date':'avg_temp_c'])
# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad'), 'date':'avg_temp_c'])
Slicing is particularly useful for time series, since filtering for data within a date range is a common task. Add the date column to the index, then use .loc[] to perform the slicing. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
print(temperatures[(temperatures["date"] >= "2010") & (temperatures["date"] < "2012")])
# Set date as an index
temperatures_ind = temperatures.set_index("date")
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])
It's interesting to see how temperatures for each city change over time.
You can access the components of a date (year, month, day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index = ["country", "city"], columns = "year")
# See the result
print(temp_by_country_city_vs_year)
Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common task is to find the rows or columns where the highest or lowest values occur.
Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside square brackets. For example: series[series > value].
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()
# Find the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")
# Find the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
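A more direct alternative (a sketch, not from the course notes above): .idxmax() and .idxmin() return the index label of the highest or lowest value without the boolean-mask step.
# The year with the highest mean temp
print(mean_temp_by_year.idxmax())
# The city with the lowest mean temp
print(mean_temp_by_city.idxmin())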
Creating multiple plots for different subsets of data allows you to compare groups.
# matplotlib.pyplot is needed for the plots below
import matplotlib.pyplot as plt
# Histogram of conventional avocado prices, with 20 bins
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)
# Histogram of organic avocado prices on the same axes, also with 20 bins
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)
# Add a legend
plt.legend(["conventional", "organic"])
# Show the plot
plt.show()
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Check individual values for missing values
print(avocados_2016.isna())
# Check each column for missing values
print(avocados_2016.isna().any())
# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind = 'bar')
# Show plot
plt.show()
# Remove rows with missing values
avocados_complete = avocados_2016.dropna()
# Check if any columns contain missing values
print(avocados_complete.isna().any())
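Dropping rows is not the only option. A sketch of replacing missing values instead (the right fill value depends on the data):
# Replace missing values with 0 rather than dropping the rows
avocados_filled = avocados_2016.fillna(0)
print(avocados_filled.isna().any())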
# List the columns with missing values
cols_with_missing = ['small_sold','large_sold','xl_sold']
# Create histograms showing the distributions of cols_with_missing
avocados_2016[cols_with_missing].hist()
# Show the plot
plt.show()