More operations on DataFrames
- Slice the row labels 'Perry' to 'Potter' and assign the output to p_counties.
- Print the p_counties DataFrame. This has been done for you.
- Slice the row labels 'Potter' to 'Perry' in reverse order. To do this for hypothetical row labels 'a' and 'b', you could use a stepsize of -1 like so: df.loc['b':'a':-1].
- Print the p_counties_rev DataFrame. This has also been done for you, so hit 'Submit Answer' to see the result of your slicing!
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
# Print the p_counties DataFrame
print(p_counties)
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]
# Print the p_counties_rev DataFrame
print(p_counties_rev)
Column indexing and data indexing in pandas DataFrames
- Slice the columns from the starting column to 'Obama' and assign the result to left_columns.
- Slice the columns from 'Obama' to 'winner' and assign the result to middle_columns.
- Slice the columns from 'Romney' to the end and assign the result to right_columns.
- The code to print the first 5 rows of left_columns, middle_columns, and right_columns has been written, so hit 'Submit Answer' to see the results!
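A minimal sketch of these column slices, assuming the course's election DataFrame is loaded (the column order is taken from the exercise text):
# Slice from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']
# Slice from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']
# Slice from 'Romney' to the end: right_columns
right_columns = election.loc[:, 'Romney':]
print(left_columns.head())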
- Create the list of row labels ['Philadelphia', 'Centre', 'Fulton'] and assign it to rows.
- Create the list of column labels ['winner', 'Obama', 'Romney'] and assign it to cols.
- Create a new DataFrame by selecting with rows and cols in .loc[] and assign it to three_counties.
- Print the three_counties DataFrame. This has been done for you, so hit 'Submit Answer' to see your new DataFrame.
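A sketch of the list-based selection, under the same assumption that election is loaded:
# Create the lists of row and column labels
rows = ['Philadelphia', 'Centre', 'Fulton']
cols = ['winner', 'Obama', 'Romney']
# Select with both lists in .loc[]: three_counties
three_counties = election.loc[rows, cols]
print(three_counties)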
- Import numpy as np.
- Create a boolean array for the condition where the 'margin' column is less than 1 and assign it to too_close.
- Set the relevant entries of the 'winner' column, where the result was too close to call, to np.nan.
- Print the output of election.info(). This has been done for you, so hit 'Submit Answer' to see the results.
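One way this masking could look; a sketch assuming election has a numeric 'margin' column:
# Import numpy
import numpy as np
# Create the boolean array: too_close
too_close = election['margin'] < 1
# Where the result was too close to call, set 'winner' to np.nan
election.loc[too_close, 'winner'] = np.nan
print(election.info())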
- Select the 'age' and 'cabin' columns of titanic and create a new DataFrame df.
- Print the shape of df. This has been done for you.
- Drop rows in df with how='any' and print the shape.
- Drop rows in df with how='all' and print the shape.
- Drop columns from the titanic DataFrame that have more than 1000 missing values by specifying the thresh and axis keyword arguments. Print the output of .info() from this.
# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]
# Print the shape of df
print(df.shape)
# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)
# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)
# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=1000, axis='columns').info())
- Apply the to_celsius function over the ['Mean TemperatureF','Mean Dew PointF'] columns of the weather DataFrame.
- Reassign the column labels of df_celsius to ['Mean TemperatureC','Mean Dew PointC'].
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)
# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
# Print the output of df_celsius.head()
print(df_celsius.head())
- Create a dictionary with the key:value pairs 'Obama':'blue' and 'Romney':'red'.
- Use the .map() method on the 'winner' column using the red_vs_blue dictionary you created.
- Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to see the new column!
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)
# Print the output of election.head()
print(election.head())
- Import zscore from scipy.stats.
- Call zscore with election['turnout'] as input.
- Print the output of type(turnout_zscore). This has been done for you.
- Assign turnout_zscore to a new column in election as 'turnout_zscore'.
- Print the output of election.head(). This has been done for you, so hit 'Submit Answer' to view the result.
# Import zscore from scipy.stats
from scipy.stats import zscore
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])
# Print the type of turnout_zscore
print(type(turnout_zscore))
# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore
# Print the output of election.head()
print(election.head())
Some operations on the index:
- Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
- Assign new_idx to sales.index.
- Print the sales dataframe. This has been done for you, so hit 'Submit Answer' to see how the index changed.
- Assign the string 'MONTHS' to sales.index.name to create a name for the index.
- Print the sales dataframe to see the index name you just created.
- Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns.
- Print the sales dataframe again to see the columns name you just created.
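A sketch of both index edits, assuming a small sales DataFrame indexed by lowercase month abbreviations:
# Build new_idx by upper-casing every label in sales.index
new_idx = [idx.upper() for idx in sales.index]
# Assign it back to the index
sales.index = new_idx
# Name the index and the set of columns
sales.index.name = 'MONTHS'
sales.columns.name = 'PRODUCTS'
print(sales)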
- Create a MultiIndex by setting the index of sales to be the columns ['state', 'month'].
- Sort the MultiIndex using the .sort_index() method.
- Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month!
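A sketch of building and sorting the MultiIndex, assuming sales has 'state' and 'month' columns:
# Set the MultiIndex and sort it
sales = sales.set_index(['state', 'month'])
sales = sales.sort_index()
print(sales)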
- Set the index of sales to be the column 'state'.
- Print the sales DataFrame to verify that indeed you have an index with state values.
- Access the data from 'NY' and print it to verify that you obtain two rows.
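A sketch of the nonunique-index lookup, assuming the 'state' column holds repeated state abbreviations:
# Set the index to the 'state' column
sales = sales.set_index('state')
print(sales)
# Access the (two) rows labeled 'NY'
print(sales.loc['NY'])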
For a MultiIndex, slice(None) selects everything in a level; for example: stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
- Look up data for the New York state ('NY') in month 1.
- Look up data for the California and Texas states ('CA', 'TX') in month 2.
- Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2.
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA','TX'], 2), :]
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
Pivot tables (pivot)
- Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.
- Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.
- Pivot the users DataFrame with both 'signups' and 'visitors' pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
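A sketch of the three pivots, assuming users has the columns 'weekday', 'city', 'visitors', and 'signups':
# Pivot 'visitors': weekday rows, city columns
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')
# Pivot 'signups' the same way
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')
# Pivot all remaining variables by omitting values
pivot = users.pivot(index='weekday', columns='city')
print(pivot)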
- Define a DataFrame byweekday with the 'weekday' level of users unstacked.
- Print the byweekday DataFrame to see the new data layout. This has been done for you.
- Stack byweekday by 'weekday' and print it to check if you get the same layout as the original users DataFrame.
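A sketch, assuming users now carries a MultiIndex with levels 'city' and 'weekday':
# Unstack the 'weekday' level into columns: byweekday
byweekday = users.unstack(level='weekday')
print(byweekday)
# Stack it back and compare with the original layout
print(byweekday.stack(level='weekday'))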
- Define a DataFrame newusers with the 'city' level stacked back into the index of bycity.
- Swap the levels of the index of newusers.
- Print newusers and verify that the index is not sorted. This has been done for you.
- Sort the index of newusers.
- Print newusers and verify that the index is now sorted. This has been done for you.
- Verify that newusers equals users. This has been done for you, so hit 'Submit Answer' to see the result.
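A sketch of restoring the index order, assuming bycity = users.unstack(level='city') from the previous step:
# Stack 'city' back into the index: newusers
newusers = bycity.stack(level='city')
# Swap the index levels, then sort to restore the original order
newusers = newusers.swaplevel(0, 1)
print(newusers)
newusers = newusers.sort_index()
print(newusers)
# Verify the round trip
print(newusers.equals(users))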
- Reset the index of visitors_by_city_weekday with .reset_index().
- Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
- Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called city.
- Print visitors to check that the city values are in a single column now and that the dataframe is longer and skinnier.
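A sketch of the reset/melt round trip, assuming visitors_by_city_weekday is the 'visitors' pivot from above:
import pandas as pd
# Reset the index to get a range index
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
print(visitors_by_city_weekday)
# Melt the city columns back into a single 'visitors' column
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')
print(visitors)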
- Define a DataFrame skinny where you melt the 'visitors' and 'signups' columns of users into a single column.
- Print skinny to verify the results. Note the value column that had the cell values in users.
- Set the index of users to ['city', 'weekday'].
- Print the DataFrame users_idx to see the new index.
- Melt users_idx with the keyword argument col_level=0.
- Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc='count'. The index here is 'weekday'.
- Print count_by_weekday1. This has been done for you.
- Replace aggfunc='count' with aggfunc=len and verify you obtain the same result.
- Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day, as well as the totals. Use aggfunc=sum to do this.
- Print signups_and_visitors. This has been done for you.
- Add margins=True to the .pivot_table() method to obtain the totals.
- Print signups_and_visitors_total. This has been done for you, so hit 'Submit Answer' to see the result.
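A sketch of the pivot tables, again assuming the users DataFrame from the earlier exercises:
# Count every column per weekday (aggfunc=len gives the same result)
count_by_weekday1 = users.pivot_table(index='weekday', aggfunc='count')
# Sum signups and visitors per weekday
signups_and_visitors = users.pivot_table(index='weekday', aggfunc=sum)
# Add the 'All' totals row with margins=True
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc=sum, margins=True)
print(signups_and_visitors_total)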
- Group titanic by the 'pclass' column and save the result as by_class.
- Aggregate the 'survived' column of by_class using .count(). Save the result as count_by_class.
- Print count_by_class. This has been done for you.
- Group titanic by the 'embarked' and 'pclass' columns. Save the result as by_mult.
- Aggregate the 'survived' column of by_mult using .count(). Save the result as count_mult.
- Print count_mult. This has been done for you, so hit 'Submit Answer' to view the result.
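A sketch of both counts, assuming the course's titanic DataFrame is loaded:
# Group by 'pclass' and count survivors per class
by_class = titanic.groupby('pclass')
count_by_class = by_class['survived'].count()
print(count_by_class)
# Group by 'embarked' and 'pclass' and count again
by_mult = titanic.groupby(['embarked', 'pclass'])
count_mult = by_mult['survived'].count()
print(count_mult)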
- Read life_fname into a DataFrame called life and set the index to 'Country'.
- Read regions_fname into a DataFrame called regions and set the index to 'Country'.
- Group life by the region column of regions and store the result in life_by_region.
- Print the mean over the 2010 column of life_by_region.
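A sketch of grouping one DataFrame by a column of another; life_fname and regions_fname are the exercise's file paths, and the '2010' column label being a string is an assumption:
import pandas as pd
life = pd.read_csv(life_fname, index_col='Country')
regions = pd.read_csv(regions_fname, index_col='Country')
# Group life by the region column of regions (aligned on the 'Country' index)
life_by_region = life.groupby(regions['region'])
print(life_by_region['2010'].mean())  # assumes year columns are string labels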
- Group titanic by 'pclass' and save the result as by_class.
- Select the 'age' and 'fare' columns from by_class and save the result as by_class_sub.
- Aggregate by_class_sub using 'max' and 'median'. You'll have to pass 'max' and 'median' in the form of a list to .agg().
- Use .loc[] to print all of the rows and the column specification ('age','max'). This has been done for you.
- Use .loc[] to print all of the rows and the column specification ('fare','median').
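A sketch of the multiple aggregation, assuming titanic is loaded:
by_class = titanic.groupby('pclass')
by_class_sub = by_class[['age', 'fare']]
# Aggregate with a list; the result gets a column MultiIndex
aggregated = by_class_sub.agg(['max', 'median'])
print(aggregated.loc[:, ('age', 'max')])
print(aggregated.loc[:, ('fare', 'median')])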
- Read 'gapminder.csv' into a DataFrame with index_col=['Year','region','Country']. Sort the index.
- Group gapminder with a level of ['Year','region'] using its level parameter. Save the result as by_year_region.
- Define the function spread which returns the maximum and minimum of an input series. This has been done for you.
- Create a dictionary with 'population':'sum', 'child_mortality':'mean' and 'gdp':spread as aggregator. This has been done for you.
- Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated.
- Print aggregated. This has been done for you, so hit 'Submit Answer' to view the result.
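A sketch of the dictionary-based aggregation; the spread definition (max minus min) is an assumption, since these notes don't reproduce the course's version:
import pandas as pd
gapminder = pd.read_csv('gapminder.csv', index_col=['Year', 'region', 'Country']).sort_index()
by_year_region = gapminder.groupby(level=['Year', 'region'])
# Assumed definition of spread
def spread(series):
    return series.max() - series.min()
aggregator = {'population': 'sum', 'child_mortality': 'mean', 'gdp': spread}
aggregated = by_year_region.agg(aggregator)
print(aggregated.tail())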
- Read 'sales.csv' into a DataFrame with index_col='Date' and parse_dates=True.
- Create a groupby object with sales.index.strftime('%a') as input and assign it to by_day.
- Aggregate the 'Units' column of by_day with the .sum() method. Save the result as units_sum.
- Print units_sum. This has been done for you, so hit 'Submit Answer' to see the result.
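A sketch of grouping by a transformed DatetimeIndex:
import pandas as pd
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Group by abbreviated weekday name ('Mon', 'Tue', ...)
by_day = sales.groupby(sales.index.strftime('%a'))
units_sum = by_day['Units'].sum()
print(units_sum)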
- Import zscore from scipy.stats.
- Group gapminder_2010 by 'region' and transform the ['life','fertility'] columns by zscore.
- Construct a boolean Series of the bitwise or between standardized['life'] < -3 and standardized['fertility'] > 3.
- Filter gapminder_2010 using .loc[] and the outliers Boolean Series. Save the result as gm_outliers.
- Print gm_outliers. This has been done for you, so hit 'Submit Answer' to see the results.
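A sketch of detecting outliers with a grouped transform, assuming gapminder_2010 is loaded:
from scipy.stats import zscore
# Standardize 'life' and 'fertility' within each region
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)
# Flag extreme values in either column
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)
gm_outliers = gapminder_2010.loc[outliers]
print(gm_outliers)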
- Group titanic by 'sex' and 'pclass'. Save the result as by_sex_class.
- Write a function called impute_median() that fills missing values with the median of a series. This has been done for you.
- Call .transform() with impute_median on the 'age' column of by_sex_class.
- Print the output of titanic.tail(10). This has been done for you - hit 'Submit Answer' to see how the missing values have now been imputed.
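A sketch of the group-wise imputation, assuming titanic is loaded:
# Fill missing values with the group median
def impute_median(series):
    return series.fillna(series.median())
by_sex_class = titanic.groupby(['sex', 'pclass'])
titanic['age'] = by_sex_class['age'].transform(impute_median)
print(titanic.tail(10))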
- Group gapminder_2010 by 'region'. Save the result as regional.
- Apply the disparity function on regional, and save the result as reg_disp.
- Use .loc[] to select ['United States','United Kingdom','China'] from reg_disp and print the results.
- Group sales by 'Company'. Save the result as by_company.
- Compute and print the sum of the 'Units' column of by_company.
- Call .filter() on by_company with lambda g:g['Units'].sum() > 35 as input and print the result.
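A sketch of .filter(), assuming the sales DataFrame from the earlier exercise:
by_company = sales.groupby('Company')
print(by_company['Units'].sum())
# Keep only the groups whose total units exceed 35
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)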
- Create a Boolean Series of titanic['age'] < 10 and call .map with {True:'under 10', False:'over 10'}.
- Group titanic by the under10 Series and then compute and print the mean of the 'survived' column.
- Group titanic by the under10 Series as well as the 'pclass' column and then compute and print the mean of the 'survived' column.
- Extract the 'NOC' column from the DataFrame medals and assign the result to country_names. Notice that this Series has repeated entries for every medal (of any type) a country has won in any Edition of the Olympics.
- Create the Series medal_counts by applying .value_counts() to the Series country_names.
- Construct the pivot table counted from the DataFrame medals, aggregating by count. Use 'NOC' as the index, 'Athlete' for the values, and 'Medal' for the columns.
- Modify the DataFrame counted by adding a column counted['totals']. The new column 'totals' should contain the result of taking the sum along the columns (i.e., use .sum(axis='columns')).
- Overwrite the DataFrame counted by sorting it with the .sort_values() method. Specify the keyword argument ascending=False.
- Print the first 15 rows of counted using .head(15). This has been done for you, so hit 'Submit Answer' to see the result.
- Group medals by 'NOC'.
- Select the 'Sport' column from country_grouped and apply .nunique().
- Sort Nsports in descending order with .sort_values() and ascending=False.
- Print Nsports. This has been done for you, so hit 'Submit Answer' to see the result.
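A sketch of the distinct-sport count, assuming the course's medals DataFrame is loaded:
country_grouped = medals.groupby('NOC')
# Number of distinct sports per country, sorted descending
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)
print(Nsports.head(15))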
- Create a Boolean Series called during_cold_war by extracting all rows from medals for which the 'Edition' is >= 1952 and <= 1988.
- Create a Boolean Series called is_usa_urs by extracting rows from medals for which 'NOC' is either 'USA' or 'URS'.
- Filter the medals DataFrame using during_cold_war and is_usa_urs to create a new DataFrame called cold_war_medals.
- Group cold_war_medals by 'NOC'.
- Create a Series Nsports from country_grouped using indexing & chained methods: extract the column 'Sport'; use .nunique() to get the number of unique elements in each group; apply .sort_values(ascending=False) to rearrange the Series.
- Print Nsports. This has been done for you, so hit 'Submit Answer' to see the result!
- Construct medals_won_by_country using medals.pivot_table(). The index should be the years ('Edition') & the columns should be country ('NOC'). The values should be 'Athlete' (which captures every medal regardless of kind) & the aggregation method should be 'count' (which captures the total number of medals won).
- Create cold_war_usa_usr_medals by slicing the pivot table medals_won_by_country. Your slice should contain the editions from years 1952:1988 and only the columns 'USA' & 'URS' from the pivot table.
- Create most_medals by applying the .idxmax() method to cold_war_usa_usr_medals. Be sure to use axis='columns'.
- Apply .value_counts() to most_medals and print the result. The result reported gives the number of times each of the USA or the USSR won more Olympic medals in total than the other between 1952 and 1988.
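A sketch of the whole pipeline, assuming medals is loaded:
# Medal counts per year per country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC',
                                           values='Athlete', aggfunc='count')
# Slice the Cold War years and the two countries
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA', 'URS']]
# Which country won more medals in each edition?
most_medals = cold_war_usa_usr_medals.idxmax(axis='columns')
print(most_medals.value_counts())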
- Create a DataFrame usa with data only for the USA.
- Group usa such that ['Edition', 'Medal'] is the index. Aggregate the count over 'Athlete'.
- Use .unstack() with level='Medal' to reshape the DataFrame usa_medals_by_year.
- Plot usa_medals_by_year. This has been done for you, so hit 'Submit Answer' to see the plot!
- Redefine the 'Medal' column of the DataFrame medals as an ordered categorical. To do this, use pd.Categorical() with three keyword arguments: values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], and ordered=True.
- Print medals.info().
- Plot usa_medals_by_year as an area plot. This has been done for you, so hit 'Submit Answer' to see how the plot has changed!
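A sketch of the categorical conversion, assuming medals is loaded:
import pandas as pd
# Ordered categorical: Bronze < Silver < Gold
medals['Medal'] = pd.Categorical(values=medals['Medal'],
                                 categories=['Bronze', 'Silver', 'Gold'],
                                 ordered=True)
print(medals.info())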