class Manipulating DataFrames with pandas

More DataFrame operations


  • Slice the row labels 'Perry' to 'Potter' and assign the output to p_counties.
  • Print the p_counties DataFrame. This has been done for you.
  • Slice the row labels 'Potter' to 'Perry' in reverse order. To do this for hypothetical row labels 'a' and 'b', you could use a stepsize of -1 like so: df.loc['b':'a':-1].
  • Print the p_counties_rev DataFrame. This has also been done for you, so hit 'Submit Answer' to see the result of your slicing!

# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']


# Print the p_counties DataFrame
print(p_counties)


# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]


# Print the p_counties_rev DataFrame
print(p_counties_rev)
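
The label slicing above can be tried on a small stand-in table. This is only a sketch: the county rows and percentages below are made up for illustration and are not the real election data.

```python
import pandas as pd

# Toy stand-in for the election DataFrame; the numbers are invented.
election = pd.DataFrame(
    {"Obama": [51.6, 35.5, 85.2, 43.9, 26.3],
     "Romney": [47.1, 62.9, 14.1, 54.9, 72.2]},
    index=["Northampton", "Perry", "Philadelphia", "Pike", "Potter"],
)

# Label slices with .loc include BOTH endpoints.
p_counties = election.loc["Perry":"Potter"]

# A step of -1 walks the same labels in reverse order.
p_counties_rev = election.loc["Potter":"Perry":-1]

print(list(p_counties.index))
print(list(p_counties_rev.index))
```

Note that the start and stop labels have to be swapped when the step is negative.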


Column indexing and data indexing in a pandas DataFrame

  • Slice the columns from the starting column to 'Obama' and assign the result to left_columns
  • Slice the columns from 'Obama' to 'winner' and assign the result to middle_columns
  • Slice the columns from 'Romney' to the end and assign the result to right_columns
  • The code to print the first 5 rows of left_columns, middle_columns, and right_columns has been written, so hit 'Submit Answer' to see the results!

# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:,:'Obama']


# Print the output of left_columns.head()
print(left_columns.head())


# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:,'Obama':'winner']


# Print the output of middle_columns.head()
print(middle_columns.head())


# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:,'Romney':]


# Print the output of right_columns.head()
print(right_columns.head())
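
A minimal sketch of the three column slices, assuming a toy table whose column order (state, total, Obama, Romney, winner, voters) is chosen here for illustration:

```python
import pandas as pd

# Hypothetical two-row table; values are invented.
election = pd.DataFrame(
    {"state": ["PA", "PA"], "total": [100, 200],
     "Obama": [60.0, 40.0], "Romney": [40.0, 60.0],
     "winner": ["Obama", "Romney"], "voters": [150, 300]},
    index=["Adams", "Allegheny"],
)

# The ':' before the comma keeps every row; the second slice picks columns.
left_columns = election.loc[:, :"Obama"]            # start .. 'Obama'
middle_columns = election.loc[:, "Obama":"winner"]  # 'Obama' .. 'winner'
right_columns = election.loc[:, "Romney":]          # 'Romney' .. end

print(list(left_columns.columns))
print(list(middle_columns.columns))
print(list(right_columns.columns))
```

Column slices follow the order of the columns in the DataFrame, not alphabetical order, and include both endpoints.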

  • Create the list of row labels ['Philadelphia', 'Centre', 'Fulton'] and assign it to rows.
  • Create the list of column labels ['winner', 'Obama', 'Romney'] and assign it to cols.
  • Create a new DataFrame by selecting with rows and cols in .loc[] and assign it to three_counties.
  • Print the three_counties DataFrame. This has been done for you, so hit 'Submit Answer' to see your new DataFrame.

# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']


# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']


# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]


# Print the three_counties DataFrame
print(three_counties)

  • Import numpy as np.
  • Create a boolean array for the condition where the 'margin' column is less than 1 and assign it to too_close.
  • Convert the entries in the 'winner' column where the result was too close to call to np.nan.
  • Print the output of election.info(). This has been done for you, so hit 'Submit Answer' to see the results.

# Import numpy
import numpy as np


# Create the boolean array: too_close
too_close = election.margin < 1


# Assign np.nan to the 'winner' column where the results were too close to call
# (a single .loc call avoids chained assignment, which may write to a copy)
election.loc[too_close, 'winner'] = np.nan


# Print the output of election.info()
print(election.info())
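
As a small illustration of the boolean-mask assignment, here is a sketch on three invented rows (the margins are not real results):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'margin' is the winning margin in percentage points.
election = pd.DataFrame(
    {"winner": ["Obama", "Romney", "Romney"], "margin": [12.5, 0.4, 30.1]},
    index=["Philadelphia", "Chester", "Potter"],
)

# Boolean Series: True where the race was decided by less than 1 point.
too_close = election["margin"] < 1

# Row mask and column label go in the same .loc call.
election.loc[too_close, "winner"] = np.nan

print(election)
```

After the assignment, .info() or .count() reports one fewer non-null entry in 'winner'.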

  • Select the 'age' and 'cabin' columns of titanic and create a new DataFrame df.
  • Print the shape of df. This has been done for you.
  • Drop rows in df with how='any' and print the shape.
  • Drop rows in df with how='all' and print the shape.
  • Drop columns from the titanic DataFrame that have fewer than 1000 non-missing values by specifying the thresh and axis keyword arguments. Print the output of .info() from this.

# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]


# Print the shape of df
print(df.shape)


# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)


# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)


# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=1000, axis='columns').info())
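
The three dropna variants behave differently, which is easier to see on a tiny frame. The sketch below uses four invented passengers and thresh=3 instead of 1000 so that the effect is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for titanic.
titanic = pd.DataFrame({
    "age": [29.0, np.nan, np.nan, 40.0],
    "cabin": ["B5", "C22", np.nan, np.nan],
    "fare": [10.0, 20.0, 30.0, 40.0],
})
df = titanic[["age", "cabin"]]

any_shape = df.dropna(how="any").shape   # drop rows with at least one NaN
all_shape = df.dropna(how="all").shape   # drop rows where every value is NaN

# thresh counts NON-missing values: keep columns with at least 3 of them.
kept = titanic.dropna(thresh=3, axis="columns")

print(any_shape, all_shape, list(kept.columns))
```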

  • Apply the to_celsius function over the ['Mean TemperatureF','Mean Dew PointF'] columns of the weather DataFrame.
  • Reassign the columns of df_celsius to ['Mean TemperatureC','Mean Dew PointC'].
  • Hit 'Submit Answer' to see the new DataFrame with the converted units.

# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)


# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)


# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']


# Print the output of df_celsius.head()
print(df_celsius.head())
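
A quick check of the conversion on two invented readings, where the expected Celsius values are easy to verify by hand (32 F is 0 C, 212 F is 100 C):

```python
import pandas as pd

def to_celsius(F):
    # Works element-wise on a whole column because the arithmetic is vectorized.
    return 5 / 9 * (F - 32)

# Hypothetical readings.
weather = pd.DataFrame({
    "Mean TemperatureF": [32.0, 212.0],
    "Mean Dew PointF": [50.0, 14.0],
})

df_celsius = weather[["Mean TemperatureF", "Mean Dew PointF"]].apply(to_celsius)
df_celsius.columns = ["Mean TemperatureC", "Mean Dew PointC"]

print(df_celsius)
```

Because to_celsius only uses arithmetic, calling to_celsius(weather[cols]) directly would give the same values; .apply is mainly needed for functions that are not vectorized.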


  • Create a dictionary with the key:value pairs 'Obama':'blue' and 'Romney':'red'.
  • Use the .map() method on the 'winner' column using the red_vs_blue dictionary you created.
  • Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to see the new column!

# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}


# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)


# Print the output of election.head()
print(election.head())
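
One detail of .map() with a dictionary that is worth seeing once: values without a matching key become NaN. A sketch with an invented 'Tie' entry:

```python
import numpy as np
import pandas as pd

red_vs_blue = {"Obama": "blue", "Romney": "red"}

# 'Tie' is a hypothetical value that is not a key of the dictionary.
winner = pd.Series(["Obama", "Romney", "Tie", np.nan])
color = winner.map(red_vs_blue)

print(color)
```

So if 'winner' already contains NaN (for example after marking close races), the new 'color' column has NaN in the same rows.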


  • Import zscore from scipy.stats.
  • Call zscore with election['turnout'] as input.
  • Print the output of type(turnout_zscore). This has been done for you.
  • Assign turnout_zscore to a new column in election as 'turnout_zscore'.
  • Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to view the result.

# Import zscore from scipy.stats
from scipy.stats import zscore 


# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])


# Print the type of turnout_zscore
print(type(turnout_zscore))


# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore


# Print the output of election.head()
print(election.head())
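
For reference, a z-score is (x - mean) / standard deviation. The sketch below computes it by hand with pandas on invented turnout values. scipy.stats.zscore uses the population standard deviation (ddof=0) by default, whereas Series.std() defaults to ddof=1, so ddof=0 is passed explicitly to match.

```python
import pandas as pd

# Hypothetical turnout percentages.
turnout = pd.Series([60.0, 70.0, 80.0])

# Manual z-score with the population standard deviation (ddof=0).
turnout_zscore = (turnout - turnout.mean()) / turnout.std(ddof=0)

print(turnout_zscore)
```

By construction, the standardized values have mean 0 and population standard deviation 1.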


Some index operations:

  • Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
  • Assign new_idx to sales.index.
  • Print the sales dataframe. This has been done for you, so hit 'Submit Answer' to see how the index changed.

# Create the list of new indexes: new_idx
new_idx = [label.upper() for label in sales.index]


# Assign new_idx to sales.index
sales.index = new_idx


# Print the sales DataFrame
print(sales)

  • Assign the string 'MONTHS' to sales.index.name to create a name for the index.
  • Print the sales dataframe to see the index name you just created.
  • Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns.
  • Print the sales dataframe again to see the columns name you just created.

# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'


# Print the sales DataFrame
print(sales)


# Assign the string 'PRODUCTS' to sales.columns.name 
sales.columns.name = 'PRODUCTS'


# Print the sales dataframe again
print(sales)

  • Create a MultiIndex by setting the index to be the columns ['state', 'month'].
  • Sort the MultiIndex using the .sort_index() method.
  • Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month!

# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])


# Sort the MultiIndex: sales
sales = sales.sort_index()


# Print the sales DataFrame
print(sales)

  • Set the index of sales to be the column 'state'.
  • Print the sales DataFrame to verify that indeed you have an index with state values.
  • Access the data from 'NY' and print it to verify that you obtain two rows.

# Set the index to the column 'state': sales
sales = sales.set_index(['state'])


# Print the sales DataFrame
print(sales)


# Access the data from 'NY'
print(sales.loc['NY'])

A MultiIndex can be sliced level by level with a tuple of slice objects. For example, this keeps every value of the first index level and a range of dates in the second:

stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]

  • Look up data for New York ('NY') in month 1.
  • Look up data for California and Texas ('CA', 'TX') in month 2.
  • Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2.

# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[("NY", 1), :]


# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA','TX'], 2), :]


# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
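
The three lookups can be reproduced on a small table. This is a sketch with invented egg counts; pd.IndexSlice is shown as an alternative spelling of slice(None) and is not used in the exercise itself.

```python
import pandas as pd

# Hypothetical sales table with a sorted (state, month) MultiIndex.
sales = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX", "TX"],
    "month": [1, 2, 1, 2, 1, 2],
    "eggs": [47, 110, 221, 77, 132, 205],
}).set_index(["state", "month"]).sort_index()

# One tuple per row key: (outer level, inner level).
NY_month1 = sales.loc[("NY", 1), :]             # a single row, returned as a Series
CA_TX_month2 = sales.loc[(["CA", "TX"], 2), :]  # list on the outer level
all_month2 = sales.loc[(slice(None), 2), :]     # slice(None) means "everything"

# Equivalent, arguably more readable, form of the last lookup.
idx = pd.IndexSlice
all_month2_alt = sales.loc[idx[:, 2], :]

print(all_month2)
```

Sorting the MultiIndex first matters: slicing on an unsorted MultiIndex can raise an error.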

Pivot tables with pivot

  • Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.
  • Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to view the result.

# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')

# Print the pivoted DataFrame
print(visitors_pivot)

  • Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.
  • Print the new DataFrame. This has been done for you.
  • Pivot the users DataFrame with both 'signups' and 'visitors' pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
  • Print the pivoted DataFrame. This has been done for you, so hit 'Submit Answer' to see the result.

# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')

# Print signups_pivot
print(signups_pivot)

# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday', columns='city')

# Print the pivoted DataFrame
print(pivot)

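
A small sketch of both pivots, assuming a four-row users table in long format (the counts are illustrative):

```python
import pandas as pd

# Hypothetical long-format table: one row per (weekday, city) pair.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# One value column: plain city columns.
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")

# No 'values': every remaining column is pivoted, giving two column levels.
pivot = users.pivot(index="weekday", columns="city")

print(visitors_pivot)
print(pivot)
```

.pivot() requires each (index, columns) pair to be unique; with duplicates it raises an error, which is where .pivot_table() and an aggregation function come in.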
  • Define a DataFrame byweekday with the 'weekday' level of users unstacked.
  • Print the byweekday DataFrame to see the new data layout. This has been done for you.
  • Stack byweekday by 'weekday' and print it to check if you get the same layout as the original users DataFrame.

# Unstack users by 'weekday': byweekday
byweekday = users.unstack('weekday')

# Print the byweekday DataFrame
print(byweekday)

# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level='weekday'))

  • Define a DataFrame newusers with the 'city' level stacked back into the index of bycity.
  • Swap the levels of the index of newusers.
  • Print newusers and verify that the index is not sorted. This has been done for you.
  • Sort the index of newusers.
  • Print newusers and verify that the index is now sorted. This has been done for you.
  • Assert that newusers equals users. This has been done for you, so hit 'Submit Answer' to see the result.

# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level='city')

# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)

# Print newusers and verify that the index is not sorted
print(newusers)

# Sort the index of newusers: newusers
newusers = newusers.sort_index()

# Print newusers and verify that the index is now sorted
print(newusers)

# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))

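
The unstack / stack round trip can be sketched end to end, assuming users is indexed by (city, weekday) as in the exercise. The numbers are illustrative.

```python
import pandas as pd

# Hypothetical users table with a sorted (city, weekday) MultiIndex.
users = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Dallas"],
    "weekday": ["Mon", "Sun", "Mon", "Sun"],
    "signups": [3, 7, 5, 12],
    "visitors": [326, 139, 456, 237],
}).set_index(["city", "weekday"]).sort_index()

# unstack moves the 'city' index level into the columns ...
bycity = users.unstack(level="city")

# ... and stack moves it back, but as the INNERMOST index level.
newusers = bycity.stack(level="city")

# Restore the original level order and row order.
newusers = newusers.swaplevel(0, 1).sort_index()

round_trip_ok = newusers.equals(users)
print(round_trip_ok)
```

The swaplevel and sort_index steps are needed because stack always appends the level at the end of the index.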
  • Reset the index of visitors_by_city_weekday with .reset_index().
  • Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
  • Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called city.
  • Print visitors to check that the city values are in a single column now and that the dataframe is longer and skinnier.

import pandas as pd

# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()

# Print visitors_by_city_weekday
print(visitors_by_city_weekday)

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')

# Print visitors
print(visitors)

  • Define a DataFrame skinny where you melt the 'visitors' and 'signups' columns of users into a single column.
  • Print skinny to verify the results. Note the value column that had the cell values in users.

# Melt users: skinny
skinny = pd.melt(users, id_vars=['weekday', 'city'])

# Print skinny
print(skinny)

  • Set the index of users to ['city', 'weekday'].
  • Print the DataFrame users_idx to see the new index.
  • Obtain the key-value pairs corresponding to visitors and signups by melting users_idx with the keyword argument col_level=0.

# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])

# Print the users_idx DataFrame
print(users_idx)

# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level=0)

# Print the key-value pairs
print(kv_pairs)

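
Both melt calls above can be checked on a four-row table. A sketch with illustrative numbers:

```python
import pandas as pd

# Hypothetical wide table with two measurement columns.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# id_vars stay as columns; every other column is folded into
# a 'variable' (old column name) / 'value' (cell content) pair.
skinny = pd.melt(users, id_vars=["weekday", "city"])

# With the identifiers in the index, melt keeps only the key-value pairs.
users_idx = users.set_index(["city", "weekday"])
kv_pairs = pd.melt(users_idx, col_level=0)

print(skinny)
print(kv_pairs)
```

The melted frame has one row per original cell: 4 rows times 2 measurement columns gives 8 rows. Note that melt discards the index, so kv_pairs no longer says which city and weekday each value came from.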
Group-by-like operations with pivot tables
  • Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc='count'. The index here is 'weekday'.
  • Print count_by_weekday1. This has been done for you.
  • Replace aggfunc='count' with aggfunc=len and verify you obtain the same result.

# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday', aggfunc='count')

# Print count_by_weekday1
print(count_by_weekday1)

# Replace aggfunc='count' with aggfunc=len: count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday', aggfunc=len)

# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))

  • Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day, as well as the totals.
    • You will need to use aggfunc=sum to do this.
  • Print signups_and_visitors. This has been done for you.
  • Now pass the additional argument margins=True to the .pivot_table() method to obtain the totals.
  • Print signups_and_visitors_total. This has been done for you, so hit 'Submit Answer' to see the result.

# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday', aggfunc=sum)

# Print signups_and_visitors
print(signups_and_visitors)

# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc=sum, margins=True)

# Print signups_and_visitors_total
print(signups_and_visitors_total)

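
A sketch of what margins=True adds, on illustrative numbers. Two choices here differ from the exercise and are assumptions of this example: the numeric columns are listed explicitly in values, so the text column 'city' is not aggregated, and the aggregation is named by the string 'sum' rather than the builtin.

```python
import pandas as pd

# Hypothetical long-format users table.
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# Per-weekday totals plus an extra 'All' row holding the grand totals.
signups_and_visitors_total = users.pivot_table(
    index="weekday",
    values=["signups", "visitors"],
    aggfunc="sum",
    margins=True,
)

print(signups_and_visitors_total)
```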
  • Group by the 'pclass' column and save the result as by_class.
  • Aggregate the 'survived' column of by_class using .count(). Save the result as count_by_class.
  • Print count_by_class. This has been done for you.
  • Group titanic by the 'embarked' and 'pclass' columns. Save the result as by_mult.
  • Aggregate the 'survived' column of by_mult using .count(). Save the result as count_mult.
  • Print count_mult. This has been done for you, so hit 'Submit Answer' to view the result.

# Group titanic by 'pclass'
by_class = titanic.groupby(['pclass'])

# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()

# Print count_by_class
print(count_by_class)

# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked', 'pclass'])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()

# Print count_mult
print(count_mult)

  • Read life_fname into a DataFrame called life and set the index to 'Country'.
  • Read regions_fname into a DataFrame called regions and set the index to 'Country'.
  • Group life by the region column of regions and store the result in life_by_region.
  • Print the mean over the 2010 column of life_by_region.

# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')

# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')

# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
print(life_by_region['2010'].mean())

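
Grouping by a column of a different DataFrame works because pandas aligns the grouping Series with the rows by index label. A sketch with invented life expectancies, where the two tables deliberately list the countries in different orders:

```python
import pandas as pd

# Hypothetical data; both tables are indexed by country name.
life = pd.DataFrame(
    {"2010": [81.0, 80.0, 83.0, 50.0]},
    index=pd.Index(["France", "Germany", "Japan", "Chad"], name="Country"),
)
regions = pd.DataFrame(
    {"region": ["Asia", "Africa", "Europe", "Europe"]},
    index=pd.Index(["Japan", "Chad", "France", "Germany"], name="Country"),
)

# The grouping key is matched to the rows of 'life' by index, not by position.
life_by_region = life.groupby(regions["region"])
mean_2010 = life_by_region["2010"].mean()

print(mean_2010)
```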
  • Group titanic by 'pclass' and save the result as by_class.
  • Select the 'age' and 'fare' columns from by_class and save the result as by_class_sub.
  • Aggregate by_class_sub using 'max' and 'median'. You'll have to pass 'max' and 'median' in the form of a list to .agg().
  • Use .loc[] to print all of the rows and the column specification ('age','max'). This has been done for you.
  • Use .loc[] to print all of the rows and the column specification ('fare','median').

# Group titanic by 'pclass': by_class
by_class = titanic.groupby(['pclass'])

# Select 'age' and 'fare'
by_class_sub = by_class[['age', 'fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])

# Print the maximum age in each class
print(aggregated.loc[:, ('age', 'max')])

# Print the median fare in each class
print(aggregated.loc[:, ('fare', 'median')])

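
Passing a list to .agg() produces two-level columns, which is why the tuples ('age','max') and ('fare','median') are needed. A sketch on four invented passengers:

```python
import pandas as pd

# Hypothetical passengers.
titanic = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "age": [30.0, 50.0, 20.0, 40.0],
    "fare": [100.0, 200.0, 10.0, 30.0],
})

aggregated = titanic.groupby("pclass")[["age", "fare"]].agg(["max", "median"])

# Columns are (original column, aggregation) pairs.
max_age = aggregated.loc[:, ("age", "max")]
median_fare = aggregated.loc[:, ("fare", "median")]

print(aggregated)
```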
  • Read 'gapminder.csv' into a DataFrame with index_col=['Year','region','Country']. Sort the index.
  • Group gapminder with a level of ['Year','region'] using its level parameter. Save the result as by_year_region.
  • Define the function spread, which returns the difference between the maximum and minimum of an input series. This has been done for you.
  • Create a dictionary with 'population':'sum', 'child_mortality':'mean' and 'gdp':spread as aggregator. This has been done for you.
  • Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated.
  • Print the last 6 entries of aggregated. This has been done for you, so hit 'Submit Answer' to view the result.

# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv('gapminder.csv', index_col=['Year', 'region', 'Country']).sort_index()

# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year', 'region'])

# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

# Create the dictionary: aggregator
aggregator = {'population': 'sum', 'child_mortality': 'mean', 'gdp': spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)

# Print the last 6 entries of aggregated
print(aggregated.tail(6))

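
A dictionary passed to .agg() applies a different aggregation to each column, and custom functions can be mixed with named ones. A sketch on an invented four-row table with regions 'A' and 'B':

```python
import pandas as pd

# Hypothetical data.
gap = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "population": [1, 2, 3, 4],
    "child_mortality": [10.0, 20.0, 30.0, 50.0],
    "gdp": [100.0, 300.0, 50.0, 60.0],
})

def spread(series):
    # Range of the group: largest value minus smallest value.
    return series.max() - series.min()

aggregator = {"population": "sum", "child_mortality": "mean", "gdp": spread}
aggregated = gap.groupby("region").agg(aggregator)

print(aggregated)
```

Unlike the list form, the dictionary form keeps a single level of column labels.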
  • Read 'sales.csv' into a DataFrame with index_col='Date' and parse_dates=True.
  • Create a groupby object with sales.index.strftime('%a') as input and assign it to by_day.
  • Aggregate the 'Units' column of by_day with the .sum() method. Save the result as units_sum.
  • Print units_sum. This has been done for you, so hit 'Submit Answer' to see the result.

# Read file: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
units_sum = by_day['Units'].sum()

# Print units_sum
print(units_sum)

The transform function and finding outliers
  • Import zscore from scipy.stats.
  • Group gapminder_2010 by 'region' and transform the ['life','fertility'] columns by zscore.
  • Construct a boolean Series of the bitwise or between standardized['life'] < -3 and standardized['fertility'] > 3.
  • Filter gapminder_2010 using .loc[] and the outliers Boolean Series. Save the result as gm_outliers.
  • Print gm_outliers. This has been done for you, so hit 'Submit Answer' to see the results.

# Import zscore
from scipy.stats import zscore

# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby(['region'])[['life', 'fertility']].transform(zscore)

# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
print(gm_outliers)

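
The key property of .transform() is that the result has the same index as the input, so it can be used directly as a row filter. The sketch below makes several assumptions: the data are invented, the z-score is written by hand instead of imported from scipy, and the cut-off is 1.5 rather than 3, because groups of five rows cannot produce z-scores beyond 2.

```python
import pandas as pd

# Hypothetical data: two regions, one unusual country in each.
gap = pd.DataFrame(
    {"region": ["A"] * 5 + ["B"] * 5,
     "life": [70, 71, 72, 73, 40, 60, 61, 62, 63, 64],
     "fertility": [2.0, 2.1, 2.2, 2.3, 2.4, 3.0, 3.1, 3.2, 3.3, 7.0]},
    index=[f"country_{i}" for i in range(10)],
)

def zscore(x):
    # Population z-score (ddof=0), computed within each group.
    return (x - x.mean()) / x.std(ddof=0)

# Standardize within each region; the row index is preserved.
standardized = gap.groupby("region")[["life", "fertility"]].transform(zscore)

# Unusually low life expectancy OR unusually high fertility for the region.
outliers = (standardized["life"] < -1.5) | (standardized["fertility"] > 1.5)
gm_outliers = gap.loc[outliers]

print(gm_outliers)
```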
  • Group titanic by 'sex' and 'pclass'. Save the result as by_sex_class.
  • Write a function called impute_median() that fills missing values with the median of a series. This has been done for you.
  • Call .transform() with impute_median on the 'age' column of by_sex_class.
  • Print the output of titanic.tail(10). This has been done for you - hit 'Submit Answer' to see how the missing values have now been imputed.

# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex', 'pclass'])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)

# Print the output of titanic.tail(10)
print(titanic.tail(10))

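
Group-wise imputation in miniature: each missing age is replaced by the median of its own group rather than the overall median. A sketch on six invented passengers, grouped by 'sex' only to keep it short:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with one missing age per group.
titanic = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

def impute_median(series):
    # The median ignores NaN, so it is computed from the known ages only.
    return series.fillna(series.median())

titanic["age"] = titanic.groupby("sex")["age"].transform(impute_median)

print(titanic)
```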
  • Group gapminder_2010 by 'region'. Save the result as regional.
  • Apply the provided disparity function on regional, and save the result as reg_disp.
  • Use .loc[] to select ['United States','United Kingdom','China'] from reg_disp and print the results.

# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby(['region'])

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
print(reg_disp.loc[['United States', 'United Kingdom', 'China']])

  • Group sales by 'Company'. Save the result as by_company.
  • Compute and print the sum of the 'Units' column of by_company.
  • Call .filter() on by_company with lambda g:g['Units'].sum() > 35 as input and print the result.

# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby(['Company'])

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)

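
.filter() keeps or drops whole groups and returns the original rows, not one row per group. A sketch with hypothetical company names and unit counts:

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "Company": ["Acme", "Acme", "Initech", "Hooli"],
    "Units": [20, 20, 10, 40],
})

by_company = sales.groupby("Company")

# One number per company.
by_com_sum = by_company["Units"].sum()

# Every row of each company whose total exceeds 35.
by_com_filt = by_company.filter(lambda g: g["Units"].sum() > 35)

print(by_com_sum)
print(by_com_filt)
```

Acme passes the test only as a group: neither of its rows exceeds 35 on its own.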
  • Create a Boolean Series of titanic['age'] < 10 and call .map with {True:'under 10', False:'over 10'}.
  • Group titanic by the under10 Series and then compute and print the mean of the 'survived' column.
  • Group titanic by the under10 Series as well as the 'pclass' column and then compute and print the mean of the 'survived' column.

# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True: 'under 10', False: 'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
print(survived_mean_2)

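
One subtlety of this grouping is that a missing age compares as False, so passengers with unknown age land in the 'over 10' group. A sketch on four invented passengers:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; the third has no recorded age.
titanic = pd.DataFrame({
    "age": [5.0, 30.0, np.nan, 8.0],
    "survived": [1, 0, 0, 0],
})

# NaN < 10 evaluates to False, so the unknown age is labelled 'over 10'.
under10 = (titanic["age"] < 10).map({True: "under 10", False: "over 10"})

survival_rate = titanic.groupby(under10)["survived"].mean()

print(under10.tolist())
print(survival_rate)
```

If that is not the intended treatment, rows with missing ages could be dropped before grouping.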
A worked example of exploring and manipulating a dataset
  • Extract the 'NOC' column from the DataFrame medals and assign the result to country_names. Notice that this Series has repeated entries for every medal (of any type) a country has won in any Edition of the Olympics.
  • Create a Series medal_counts by applying .value_counts() to the Series country_names.
  • Print the top 15 countries ranked by total number of medals won. This has been done for you, so hit 'Submit Answer' to see the result.

# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))

  • Construct a pivot table counted from the DataFrame medals aggregating by count. Use 'NOC' as the index, 'Athlete' for the values, and 'Medal' for the columns.
  • Modify the DataFrame counted by adding a column counted['totals']. The new column 'totals' should contain the result of taking the sum along the columns (i.e., use .sum(axis='columns')).
  • Overwrite the DataFrame counted by sorting it with the .sort_values() method. Specify the keyword argument ascending=False.
  • Print the first 15 rows of counted using .head(15). This has been done for you, so hit 'Submit Answer' to see the result.

# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', values='Athlete', columns='Medal', aggfunc='count')

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')

# Sort counted by the 'totals' column
counted = counted.sort_values(['totals'], ascending=False)

# Print the top 15 rows of counted
print(counted.head(15))

  • Group medals by 'NOC'.
  • Compute the number of distinct sports in which each country won medals. To do this, select the 'Sport' column from country_grouped and apply .nunique().
  • Sort Nsports in descending order with .sort_values() and ascending=False.
  • Print the first 15 rows of Nsports. This has been done for you, so hit 'Submit Answer' to see the result.

# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)

# Print the top 15 rows of Nsports
print(Nsports.head(15))

  • Create a Boolean Series called during_cold_war by extracting all rows from medals for which the 'Edition' is >= 1952 and <= 1988.
  • Create a Boolean Series called is_usa_urs by extracting rows from medals for which 'NOC' is either 'USA' or 'URS'.
  • Filter the medals DataFrame using during_cold_war and is_usa_urs to create a new DataFrame called cold_war_medals.
  • Group cold_war_medals by 'NOC'.
  • Create a Series Nsports from country_grouped using indexing & chained methods:
    • Extract the column 'Sport'.
    • Use .nunique() to get the number of unique elements in each group;
    • Apply .sort_values(ascending=False) to rearrange the Series.
  • Print the final Series Nsports. This has been done for you, so hit 'Submit Answer' to see the result!

# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals.Edition >= 1952) & (medals.Edition <= 1988)

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby(['NOC'])

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)

# Print Nsports
print(Nsports)

  • Construct medals_won_by_country using medals.pivot_table().
    • The index should be the years ('Edition') & the columns should be country ('NOC')
    • the values should be 'Athlete' (which captures every medal regardless of kind) & the aggregation method should be 'count' (which captures the total number of medals won).
  • Create cold_war_usa_usr_medals by slicing the pivot table medals_won_by_country. Your slice should contain the editions from years 1952:1988 and only the columns 'USA' & 'URS' from the pivot table.
  • Create the Series most_medals by applying the .idxmax() method to cold_war_usa_usr_medals. Be sure to use axis='columns'.
  • Print the result of applying .value_counts() to most_medals. The result reported gives the number of times each of the USA or the USSR won more Olympic medals in total than the other between 1952 and 1988.

# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')

# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA', 'URS']]

# Create most_medals
most_medals = cold_war_usa_usr_medals.idxmax(axis='columns')

# Print most_medals.value_counts()
print(most_medals.value_counts())

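
The count-then-idxmax pattern on a handful of invented medal records (athlete names are placeholders, and only two editions are included):

```python
import pandas as pd

# Hypothetical medal records: one row per medal.
medals = pd.DataFrame({
    "Edition": [1952, 1952, 1952, 1956, 1956, 1956, 1956],
    "NOC": ["USA", "USA", "URS", "URS", "URS", "USA", "GBR"],
    "Athlete": ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
})

# Rows: edition, columns: country, cells: number of medals.
medals_won_by_country = medals.pivot_table(
    index="Edition", columns="NOC", values="Athlete", aggfunc="count"
)

# Label slice on the rows, list of labels on the columns.
cold_war = medals_won_by_country.loc[1952:1956, ["USA", "URS"]]

# For each edition, the column label holding the larger count.
most_medals = cold_war.idxmax(axis="columns")

print(most_medals)
print(most_medals.value_counts())
```

idxmax returns labels rather than values, which is why value_counts() on the result counts how many editions each country led.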
  • Create a DataFrame usa with data only for the USA.
  • Group usa such that ['Edition', 'Medal'] is the index. Aggregate the count over 'Athlete'.
  • Use .unstack() with level='Medal' to reshape the DataFrame usa_medals_by_year.
  • Construct a line plot from the final DataFrame usa_medals_by_year. This has been done for you, so hit 'Submit Answer' to see the plot!

import matplotlib.pyplot as plt

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()

  • Redefine the 'Medal' column of the DataFrame medals as an ordered categorical. To do this, use pd.Categorical() with three keyword arguments:
    • values = medals.Medal.
    • categories=['Bronze', 'Silver', 'Gold'].
    • ordered=True.
    • After this, you can verify that the type has changed using medals.info().
  • Plot the final DataFrame usa_medals_by_year as an area plot. This has been done for you, so hit 'Submit Answer' to see how the plot has changed!

# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()

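
The reason for the ordered categorical is column order: with plain strings the medal columns come out alphabetically (Bronze, Gold, Silver), while the categorical keeps the declared order. A sketch on four invented rows; observed=False is passed explicitly here, as an assumption of this example, so that unobserved medal types are kept.

```python
import pandas as pd

# Hypothetical medal records.
usa = pd.DataFrame({
    "Edition": [1896, 1896, 1896, 1900],
    "Medal": ["Gold", "Bronze", "Gold", "Silver"],
    "Athlete": ["a1", "a2", "a3", "a4"],
})

# Plain strings: groups, and hence columns, are sorted alphabetically.
plain = usa.groupby(["Edition", "Medal"])["Athlete"].count().unstack(level="Medal")

# Ordered categorical: the declared category order is used instead.
usa["Medal"] = pd.Categorical(
    values=usa["Medal"], categories=["Bronze", "Silver", "Gold"], ordered=True
)
ordered = (
    usa.groupby(["Edition", "Medal"], observed=False)["Athlete"]
    .count()
    .unstack(level="Medal")
)

print(list(plain.columns))
print(list(ordered.columns))
```

In a stacked area plot this puts Bronze at the bottom and Gold on top.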
