DataCamp: Data Manipulation with pandas

Notes taken while working through the Data Manipulation with pandas course on DataCamp.

Inspecting a DataFrame

.head(): returns the first few rows of a DataFrame.

.info(): shows information on each column, such as the data type and the number of missing (NA) values.

.shape: returns the number of rows and columns (note that it is an attribute, not a method).

.describe(): computes summary statistics for each numeric column.

homelessness is the dataset used in the examples below.
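
The exercises assume homelessness is already loaded. To run the snippets locally, a minimal stand-in with the column names used below might look like this (the values are made up for illustration):

import pandas as pd

# Hypothetical stand-in for the course's pre-loaded homelessness dataset
homelessness = pd.DataFrame({
    "region": ["Pacific", "Mountain", "South Atlantic"],
    "state": ["California", "Arizona", "Florida"],
    "individuals": [109008.0, 7259.0, 21443.0],
    "family_members": [20964.0, 2606.0, 9587.0],
})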

# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())

Parts of a DataFrame

.values: the data values as a two-dimensional NumPy array.

.columns: an index of column names.

.index: an index of row labels (row numbers or row names).

# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)

# Print the row index of homelessness
print(homelessness.index)

Sorting

Sort the rows by one column, or by multiple columns at once; for each column you can choose ascending or descending order.

Sort on            Syntax
one column         df.sort_values("breed")
multiple columns   df.sort_values(["breed", "weight_kg"])

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(by=['region', 'family_members'],
                                                ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

Subsetting columns

To select a single column: df["col_a"]

To select multiple columns: df[["col_a", "col_b"]]

# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
print(individuals.head())

# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# Print the head of the result
print(state_fam.head())

Subsetting rows

Select (filter) rows, much like filter() in R's dplyr:
dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] == "tan"]

You can also filter on multiple conditions at once:
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness.individuals > 10000]
# or 
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result
print(ind_gt_10k)

# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == "Mountain"]

# See the result
print(mountain_reg)

# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) &
                             (homelessness['region'] == 'Pacific')]
# See the result
print(fam_lt_1k_pac)

Subsetting rows by categorical variables

Subsetting rows by a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want, say, all states in one of three different regions. Instead, use the .isin() method, which lets you tackle this with a single condition instead of three separate ones.

colors = ["brown", "black", "tan"]

condition = dogs["color"].isin(colors)

dogs[condition]

# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") | 
                                  (homelessness["region"] == "Mid-Atlantic")]

# See the result
print(south_mid_atlantic)

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
print(mojave_homelessness)

Adding new columns

# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result
print(homelessness)

Mean and median

# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

# Print the median of weekly_sales
print(sales['weekly_sales'].median())

Summarizing dates

# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())

Efficient summaries

The .agg() method lets you apply your own custom functions to a DataFrame, and apply functions to more than one column at a time, making your aggregations super-efficient.

# Import NumPy with the alias np (np.median is used below)
import numpy as np

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c","fuel_price_usd_per_l","unemployment"]].agg(iqr))

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

Cumulative statistics

Cumulative statistics can also be helpful in tracking summary values over time.

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date')

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

Dropping duplicates

Removing duplicates is an essential skill for getting accurate counts, because you often don't want to count the same thing multiple times.

# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

# Print date col of holiday_dates
print(holiday_dates["date"])

Counting categorical variables

Counting is a great way to get an overview of your data and to spot curiosities that you might not otherwise notice.

# Count the number of stores of each type
store_counts = stores["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = stores["type"].value_counts(normalize=True)
print(store_props)

# Count the number of departments of each type and sort
dept_counts_sorted = departments["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each type and sort
dept_props_sorted = departments["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

What percent of sales occurred at each store type?

While .groupby() is useful, you can calculate grouped summary statistics without it.

# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)
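
The same proportions can be computed in a single step with .groupby(), avoiding one line per store type (a sketch of the standard pandas pattern, not the course's solution code):

# Group by type, sum weekly_sales, then divide by the overall total
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
print(sales_by_type / sales_by_type.sum())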

Multiple grouped summaries

Earlier in this chapter, you saw that the .agg() method is useful for computing multiple statistics on multiple variables. It also works with grouped data. NumPy, imported as np, has many different summary statistics functions, including:

np.min()
np.max()
np.mean()
np.median()

# Import NumPy with the alias np
import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([np.min, np.max, np.mean, np.median])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.min, np.max, np.mean, np.median])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

Pivoting on one variable

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing a grouped calculation; that is, the .pivot_table() method is an alternative to .groupby().

# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values="weekly_sales", index="type")

# Print mean_sales_by_type
print(mean_sales_by_type)
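
For comparison, the equivalent .groupby() call produces the same numbers, since .pivot_table() aggregates with the mean by default:

# Equivalent grouped calculation: mean weekly_sales per store type
print(sales.groupby("type")["weekly_sales"].mean())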

Fill in missing values and sum values with pivot tables

.pivot_table() has several useful arguments, including fill_value and margins.

# Print the mean weekly_sales by department and type; fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values="weekly_sales", index="department",
                        columns="type", fill_value=0, margins=True))

Setting & removing indexes

pandas allows you to designate columns as an index. This makes subsetting code cleaner (and, in some circumstances, provides more efficient lookups).

# Look at temperatures
print(temperatures)

# Index temperatures by city
temperatures_ind = temperatures.set_index('city')

# Look at temperatures_ind
print(temperatures_ind)

# Reset the index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

Subsetting with .loc[]

The killer feature of indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it takes a subset of rows.

Subsetting code written with .loc[] can be easier to read than standard square-bracket subsetting, which keeps your code from becoming too unwieldy.

# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

Setting multi-level indexes

An index can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There are trade-offs to using these; one is sketched after the example below.

# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])


# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
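
One trade-off worth knowing (a quick sketch, not from the course code): with a multi-level index, a plain list of labels only matches the outer level, so inner-level values must be written as full (country, city) tuples, as in rows_to_keep above.

# A list of plain labels subsets the outer level (countries) only
print(temperatures_ind.loc[["Brazil", "Pakistan"]])

# Inner-level labels on their own do not match; this would raise a KeyError:
# print(temperatures_ind.loc[["Rio De Janeiro", "Lahore"]])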

Sorting by index values

Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index; for that, you need .sort_index().

# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending = [True, False]))

Slicing index values

You can select consecutive elements of an object with first:last slicing syntax. DataFrames can be sliced by index values or by row/column number.

# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia (slicing the outer index level works)
print(temperatures_srt.loc['Pakistan':'Russia'])

# Trying to slice at the inner (city) level this way returns nonsense,
# because the labels are interpreted at the outer (country) level
print(temperatures_srt.loc['Lahore':'Moscow'])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])

Slicing in both directions

You've seen how to slice DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it's often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset rows and columns in one go.

# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad')])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:,'date':'avg_temp_c'])

# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad'), 'date':'avg_temp_c'])

Slicing time series

Slicing is particularly useful for time series, since it's common to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.

# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
print(temperatures[(temperatures["date"] >= "2010") & (temperatures["date"] < "2012")])

# Set date as an index
temperatures_ind = temperatures.set_index("date")

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])

Pivot temperature by city and year

It's interesting to see how temperatures for each city change over time.

You can access the components of a date (year, month, and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.

# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")

# See the result
print(temp_by_country_city_vs_year)

Calculating on a pivot table

Pivot tables are full of summary statistics, but they're only a first step toward finding something insightful. Often you'll need to perform further calculations on them; a common task is finding the rows or columns where the highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside square brackets, for example: series[series > value].

# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Find the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Find the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
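
If you only need the label where the extreme value occurs, a shorter alternative (a standard pandas idiom, not the course's solution) is .idxmax()/.idxmin():

# Year of the highest worldwide mean temp
print(mean_temp_by_year.idxmax())

# (country, city) of the lowest mean temp
print(mean_temp_by_city.idxmin())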

Data visualization

Price of conventional vs. organic avocados

Creating multiple plots for different subsets of data allows you to compare groups.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Histogram of avg_price for conventional avocados, with 20 bins
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

# Histogram of avg_price for organic avocados, with 20 bins
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

Finding missing values

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Check individual values for missing values
print(avocados_2016.isna())

# Check each column for missing values
print(avocados_2016.isna().any())

# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind='bar')

# Show plot
plt.show()

Removing missing values

# Remove rows with missing values
avocados_complete = avocados_2016.dropna()

# Check if any columns contain missing values
print(avocados_complete.isna().any())

Replacing missing values

# List the columns with missing values
cols_with_missing = ['small_sold', 'large_sold', 'xl_sold']

# Create histograms showing the distributions of cols_with_missing
avocados_2016[cols_with_missing].hist()

# Show the plot
plt.show()
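
The snippet above only inspects the affected columns; the replacement step itself is missing from these notes. A sketch of the usual approach, filling the missing values with 0 and re-plotting to see how the distributions change:

# Fill missing values with 0
avocados_filled = avocados_2016.fillna(0)

# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

# Show the plot
plt.show()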
