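I recently worked through the pandas introductory course on Kaggle and found I was still shaky on a few things, so I have written the material up as notes for future reference.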
pandas cheatsheet
Index, Reference
Select the 'description' column
reviews['description']
reviews.description
Select the first element of the 'description' column
reviews.loc[0, 'description']
Select first row of dataframe
reviews.loc[0, :]
reviews.iloc[0, :]
Select the first 10 elements of the 'description' column (loc includes the end label, so 0:9 is ten rows)
reviews.description.iloc[0:10]
reviews.loc[0:9, 'description']
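The difference between slicing by position and slicing by label is easy to miss, so here is a minimal sketch on a made-up five-row frame (the `toy` data is an assumption, not the Kaggle dataset). It only illustrates that `iloc` excludes the end of the slice while `loc` includes it.

```python
import pandas as pd

# Toy stand-in for the wine reviews table (hypothetical data).
toy = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy", "Spain"],
    "description": ["ripe", "crisp", "oaky", "tart", "dry"],
})

by_position = toy.iloc[0:3]["description"]  # positions 0, 1, 2
by_label = toy.loc[0:3, "description"]      # labels 0, 1, 2 and 3

print(len(by_position), len(by_label))
```

With the default integer index the two look interchangeable, which is why the off-by-one tends to go unnoticed.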
Select rows 1, 2, 3, 5, and 8
reviews.iloc[[1, 2, 3, 5, 8], :]
Select 'country' and 'variety' of the first 100 records (labels 0 through 99, since loc includes the end label)
reviews.loc[0:99, ['country', 'variety']]
Select wines made in Italy
reviews[reviews['country'] == 'Italy']
Select entries whose 'region_2' is not empty
reviews[reviews.region_2.notnull()]
Select the last 1000 entries of the 'points' column
reviews.iloc[-1000:, 3]
Select points for wines made in Italy
reviews[reviews.country == 'Italy'].points
Who produces more above-average wines, France or Italy? Select the country column, but only when the country is one of those two options and the points column is greater than or equal to 90.
reviews[reviews.country.isin(['France', 'Italy']) & (reviews.points >= 90)].country
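The parentheses around the comparison matter because `&` binds more tightly than `>=` in Python. A small sketch on made-up rows (the `toy` frame and the name `in_two` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical mini version of the reviews table.
toy = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "points": [92, 88, 95, 85],
})

in_two = toy.country.isin(["France", "Italy"])

# Parenthesized: a row is kept only if both conditions hold.
good = toy[in_two & (toy.points >= 90)].country

# Without the parentheses, `in_two & toy.points >= 90` is parsed as
# `(in_two & toy.points) >= 90`, which is a different expression.
print(good.tolist())
```

Calling `good.value_counts()` on the real data would then answer the France versus Italy question.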
Summary and maps
What is the median of the points column?
reviews.points.median()
What countries are represented in the dataset?
reviews.country.unique()
What countries appear in the dataset most often?
reviews.country.value_counts()
Remap the price column by subtracting the median price. Use the Series.map method.
m_val = reviews.price.median()
reviews.price.map(lambda x: x - m_val)
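Remap the price column by subtracting the median price, this time with apply. (The exercise asks for DataFrame.apply; the version here calls apply on the price Series, which gives the same values.)
def f(x):
    return x - m_val
reviews.price.apply(f)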
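To see how the variants relate, here is a rough sketch on three made-up prices (the `toy` frame and variable names are assumptions). It compares `Series.map`, a row-wise `DataFrame.apply` with `axis="columns"`, and plain vectorized subtraction, which is usually the simplest choice for arithmetic like this.

```python
import pandas as pd

# Hypothetical prices, not taken from the Kaggle dataset.
toy = pd.DataFrame({"title": ["A", "B", "C"], "price": [10.0, 20.0, 60.0]})
median_price = toy.price.median()

# Element-wise on the Series.
by_map = toy.price.map(lambda p: p - median_price)

# Row-wise on the DataFrame: each `row` is one row as a Series.
by_rows = toy.apply(lambda row: row.price - median_price, axis="columns")

# Vectorized arithmetic, no Python-level function call per element.
by_vector = toy.price - median_price

print(by_map.tolist(), by_rows.tolist(), by_vector.tolist())
```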
I'm an economical wine buyer. Which wine is the "best bargain", i.e., which wine has the highest points-to-price ratio in the dataset?
reviews.loc[(reviews.points / reviews.price).idxmax()].title
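A small sketch of the same lookup on made-up wines (the `toy` frame is an assumption), including a missing price to show that `idxmax` skips NaN by default and returns an index label that can be passed to `loc`:

```python
import pandas as pd

# Hypothetical wines; the third one has no price.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 85, 88],
    "price": [30.0, 10.0, None],
})

ratio = toy.points / toy.price   # NaN where the price is missing
best_label = ratio.idxmax()      # index label of the largest ratio
best_title = toy.loc[best_label, "title"]

print(best_title)
```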
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series counting how many times each of these two words appears in the description column in the dataset.
c_tropical = reviews.description.map(lambda r:'tropical' in r).value_counts()
c_fruity = reviews.description.map(lambda r:'fruity' in r).value_counts()
pd.Series([c_tropical[True], c_fruity[True]], index = ['tropical', 'fruity'])
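One variation, sketched on made-up descriptions (the `descriptions` Series is an assumption): summing the boolean results counts the matches directly, which sidesteps the KeyError that `value_counts()[True]` would raise if a word never appeared at all.

```python
import pandas as pd

# Hypothetical descriptions standing in for reviews.description.
descriptions = pd.Series([
    "tropical fruit and citrus",
    "fruity and fresh",
    "dark, fruity, jammy",
    "mineral and dry",
])

# True counts as 1 when summed, so .sum() is the number of matches.
n_tropical = descriptions.map(lambda d: "tropical" in d).sum()
n_fruity = descriptions.map(lambda d: "fruity" in d).sum()

counts = pd.Series([n_tropical, n_fruity], index=["tropical", "fruity"])
print(counts)
```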
What combinations of countries and varieties are most common? Create a Series whose index consists of strings of the form "<Country> - <Variety>". For example, a pinot noir produced in the US should map to "US - Pinot Noir". The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete country or variety data.
df1 = reviews[(reviews.country.notna()&reviews.variety.notna())]
df = df1.apply(lambda s: s.country+ " - "+s.variety, axis = 'columns')
df.value_counts()
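The same result can also be reached without a row-wise apply. The sketch below uses made-up rows (the `toy` frame is an assumption), drops incomplete rows with `dropna(subset=...)`, and builds the labels with vectorized string concatenation. It is an alternative to the approach above, not the course's reference answer.

```python
import pandas as pd

# Hypothetical rows, two of them incomplete.
toy = pd.DataFrame({
    "country": ["US", "US", "Italy", None, "France"],
    "variety": ["Pinot Noir", "Pinot Noir", "Sangiovese", "Merlot", None],
})

# Keep only rows where both fields are present.
complete = toy.dropna(subset=["country", "variety"])

# Concatenate the two string columns element-wise.
labels = complete.country + " - " + complete.variety
counts = labels.value_counts()

print(counts)
```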