Kaggle Study Notes: pandas [Unfinished]

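  • Part1 Basic operations
    • Reading a CSV file
    • Building a DataFrame and a Series
    • Writing a DataFrame to a CSV file
    • Accessing the values of a column
    • pandas accessor operators: loc[:,:] and iloc[:,:]
    • set_index()
    • Filtering on multiple conditions
    • isin()
    • notnull()
    • Adding/modifying values
  • Part2
    • describe()
    • mean() & unique() & value_counts()
    • Application 1. Re-centering scores with map()
    • Application 2. Re-centering scores with apply()
    • [*] Operating on the whole column directly
    • Simple string concatenation
    • Application 3. Multi-step lookup with idxmax()
    • Application 4. Counting string occurrences with map()
    • Application 5. Building a rating system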

Original link: https://www.kaggle.com/residentmario/summary-functions-and-maps

Part1 Basic operations

Reading a CSV file

import pandas as pd
reviews = pd.read_csv('.../winemag-data-130k-v2.csv',index_col=0)
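
To see what index_col=0 does without the real dataset, here is a minimal sketch that reads a tiny CSV from an in-memory string. The two sample rows are made up for illustration; only the layout (an unnamed first column holding the row index) mimics the wine reviews file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed row index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# index_col=0 uses the first column as the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Without it, the first column is kept as an ordinary data column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())     # ['country', 'points']
print(without_index.columns.tolist())  # ['Unnamed: 0', 'country', 'points']
```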

Building a DataFrame and a Series

# Build a DataFrame
a=pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
print("the dataframe is\n",a)
# Build a Series
b=pd.Series([30, 35, 40],
          index=['2015 Sales', '2016 Sales', '2017 Sales'],
          name='Product A')
print("the series is\n",b)

the dataframe is
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.

the series is
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64

Writing a DataFrame to a CSV file

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals.to_csv("C:/Users/Administrator/Desktop/wine-reviews/cows_and_goats.csv")
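
As a quick check that writing and reading are symmetric, the sketch below writes the same DataFrame to an in-memory buffer instead of a file path and reads it back. The use of io.StringIO is an assumption made so the example runs anywhere; to_csv accepts a path in exactly the same position.

```python
import io
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

buffer = io.StringIO()   # stands in for a file on disk
animals.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)           # rewind before reading

# index_col=0 restores the row labels that to_csv wrote out.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```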

Accessing the values of a column

# Two ways to access the country column of a DataFrame:
print("(1)the countries in reviews:\n",reviews.country)
print("(2)the countries in reviews:\n",reviews['country'])

(1)the countries in reviews:
0 Italy
1 Portugal
2 US
3 US
4 US

129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object

(2)the countries in reviews:
0 Italy
1 Portugal
2 US
3 US
4 US

129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object

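pandas accessor operators: loc[:,:] and iloc[:,:]

# pandas has its own accessor operators, loc and iloc
# Get the first row of the DataFrame; iloc selects by position as .iloc[row, column], with the row selector first
print("the first row:\n",reviews.iloc[0])
# Use iloc to get a column
print("the first column:\n",reviews.iloc[:, 0])
# Print the first three rows: positions 0, 1, 2
print("the first 3 row:\n",reviews.iloc[:3, 0])
# Positions [1:3, 0], i.e. rows 1 and 2
print("the 1-3 row:\n",reviews.iloc[1:3, 0])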

the first row:
country Italy
description Aromas include tropical fruit, broom, brimston…
designation Vulkà Bianco
points 87
price NaN
province Sicily & Sardinia
region_1 Etna
region_2 NaN
taster_name Kerin O’Keefe
taster_twitter_handle @kerinokeefe
title Nicosia 2013 Vulkà Bianco (Etna)
variety White Blend
winery Nicosia
Name: 0, dtype: object

the first column:
0 Italy
1 Portugal
2 US
3 US
4 US

129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object

the first 3 row:
0 Italy
1 Portugal
2 US
Name: country, dtype: object

the 1-3 row:
1 Portugal
2 US
Name: country, dtype: object

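[Note] iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one is excluded, so iloc[0:10] selects entries 0,…,9. loc, by contrast, indexes inclusively, so loc[0:10] selects entries 0,…,10.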
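
The difference is easy to verify on a small Series. This is a minimal sketch with made-up data; the names s and stock are illustrative only.

```python
import pandas as pd

s = pd.Series(range(20))    # default integer index 0..19

by_position = s.iloc[0:10]  # end excluded: positions 0..9
by_label = s.loc[0:10]      # end included: labels 0..10

print(len(by_position), len(by_label))  # 10 11

# With string labels, loc slicing also includes both endpoints.
stock = pd.Series([3, 5, 7], index=['Apples', 'Bananas', 'Potatoes'])
print(stock.loc['Apples':'Bananas'].tolist())  # [3, 5]
```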

# loc uses the information in the index labels to do its work; the column selector comes second
print("find data of column 'taster_name', 'taster_twitter_handle', 'points':\n",reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']])
# Select the records with index labels 1, 2, 3, 5 and 8
print("find data of row [1,2,3,5,8]:\n",reviews.loc[[1,2,3,5,8]])
# loc can also select a range of labels, e.g. from 'Apples' to 'Potatoes': df.loc['Apples':'Potatoes']

find data of column 'taster_name', 'taster_twitter_handle', 'points':
taster_name taster_twitter_handle points
0 Kerin O’Keefe @kerinokeefe 87
1 Roger Voss @vossroger 87
2 Paul Gregutt @paulgwine 87
3 Alexander Peartree NaN 87
4 Paul Gregutt @paulgwine 87
… … … …
129966 Anna Lee C. Iijima NaN 90
129967 Paul Gregutt @paulgwine 90
129968 Roger Voss @vossroger 90
129969 Roger Voss @vossroger 90
129970 Roger Voss @vossroger 90
[129971 rows x 3 columns]

find data of row [1,2,3,5,8]:
country … winery
1 Portugal … Quinta dos Avidagos
2 US … Rainstorm
3 US … St. Julian
5 Spain … Tandem
8 Germany … Heinz Eifel
[5 rows x 13 columns]

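set_index()

# Use the title column as the index (this returns a new DataFrame; reviews itself is unchanged)
reviews.set_index("title")

# Check whether each row's country is 'Italy' (this produces a boolean Series)
reviews.country == 'Italy'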
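
A small sketch of both behaviours, using a made-up two-row frame named df: set_index() hands back a new DataFrame, and the comparison yields one boolean per row.

```python
import pandas as pd

df = pd.DataFrame({'title': ['Wine A', 'Wine B'],
                   'country': ['Italy', 'US']})

indexed = df.set_index('title')   # returns a new DataFrame
print(df.index.tolist())          # [0, 1]  (original unchanged)
print(indexed.index.tolist())     # ['Wine A', 'Wine B']

mask = df.country == 'Italy'      # boolean Series, one value per row
print(mask.tolist())              # [True, False]
```

To keep the new index, assign the result back, for example df = df.set_index('title').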

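Filtering on multiple conditions

# Conditional selection; combine conditions with & (and) and | (or), wrapping each one in parentheses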
loc_italy_find=reviews.loc[reviews.country == 'Italy']
loc_italy_find_and_90=reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
loc_italy_find_or_90=reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

isin()

# isin() selects rows whose value is "in" a list of values, so one column can be filtered on two or more values at once.
isin_italy_and_france=reviews.loc[reviews.country.isin(['Italy', 'France'])]

notnull()

# isnull() and notnull() select rows whose value is (or is not) missing
notnull_price=reviews.loc[reviews.price.notnull()]
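
The four kinds of filter above can be checked side by side on a tiny frame. The data below is invented for illustration. The parentheses around each comparison matter because & and | bind more tightly than == and >=.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['Italy', 'France', 'US', 'Italy'],
    'points': [92, 88, 95, 85],
    'price': [20.0, np.nan, 50.0, 15.0],
})

italy_and_90 = df.loc[(df.country == 'Italy') & (df.points >= 90)]
italy_or_90 = df.loc[(df.country == 'Italy') | (df.points >= 90)]
italy_or_france = df.loc[df.country.isin(['Italy', 'France'])]
priced = df.loc[df.price.notnull()]

print(italy_and_90.index.tolist())     # [0]
print(italy_or_90.index.tolist())      # [0, 2, 3]
print(italy_or_france.index.tolist())  # [0, 1, 3]
print(priced.index.tolist())           # [0, 2, 3]
```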

Adding/modifying values

# Assign a constant value to a column
reviews['critic'] = 'everyone'
print("the create:\n",reviews.critic)
# Or assign an iterable of values:
reviews['index_backwards'] = range(len(reviews), 0, -1)
print("the change:\n",reviews.index_backwards)

the create:
0 everyone
1 everyone
2 everyone
3 everyone
4 everyone

129966 everyone
129967 everyone
129968 everyone
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
the change:
0 129971
1 129970
2 129969
3 129968
4 129967

129966 5
129967 4
129968 3
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int32

Part2

describe()

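# pandas provides many simple "summary functions" (not an official name) that restructure the data in some useful way.
# For example, describe() gives a high-level summary of a column's attributes.
points_describe=reviews.points.describe()
'''
count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64
'''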

mean() & unique() & value_counts()

# Get the mean
points_mean=reviews.points.mean()
# Get the unique values
taster_name_unique=reviews.taster_name.unique()
# Get the unique values together with how often each occurs
taster_name_value_counts=reviews.taster_name.value_counts()

Roger Voss 25514
Michael Schachner 15134

Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64
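
One detail worth knowing when a column has missing values, as taster_name does: unique() reports the missing value, while value_counts() drops it unless told otherwise. A minimal sketch with made-up names:

```python
import pandas as pd

tasters = pd.Series(['Roger', 'Paul', 'Roger', None, 'Roger'])

print(tasters.unique())          # includes the missing value
counts = tasters.value_counts()  # missing values are dropped by default
print(counts['Roger'], counts['Paul'])           # 3 1
print(tasters.value_counts(dropna=False).sum())  # 5
```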

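Application 1. Re-centering scores with map()

# "Map" is a term borrowed from mathematics for a function that takes one set of values and "maps" them to another set of values.
# For example, suppose we want to re-center the wine scores so that their mean is 0. We can do this as follows:
review_points_mean = reviews.points.mean()
points_map=reviews.points.map(lambda p: p - review_points_mean)
print("the change map:\n",points_map)
# The function passed to map() should expect a single value from the Series (a points value in the example above) and return a transformed version of that value. map() returns a new Series in which every value has been transformed by your function.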

the change map:
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138

129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64

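Application 2. Re-centering scores with apply()

# apply() is the equivalent method when we want to transform a whole DataFrame by calling a custom function on each row.
def remean_points(row):
    row.points = row.points - review_points_mean
    return row
# axis selects what is passed to the function: axis='index' (0, the default) passes each column; axis='columns' (1) passes each row. Had we called reviews.apply() with axis='index', we would need to pass a function that transforms each column rather than each row.
# The result is not assigned to anything here, so reviews.points printed below is unchanged.
reviews.apply(remean_points, axis='columns')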
print("the change points:\n",reviews.points)

the change points:
0 87
1 87
2 87
3 87
4 87

129966 90
129967 90
129968 90
129969 90
129970 90
Name: points, Length: 129971, dtype: int64

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They do not modify the original data they are called on.
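
This can be confirmed on a small made-up frame. The sketch below follows the same remean_points idea, with one added assumption: the function copies the row first, so that the input is never mutated regardless of how pandas hands rows to the function.

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'US', 'France'],
                   'points': [87, 90, 93]})
mean = df.points.mean()  # 90.0

def remean_points(row):
    row = row.copy()                  # work on a copy of the row
    row.points = row.points - mean
    return row

centered_map = df.points.map(lambda p: p - mean)
centered_apply = df.apply(remean_points, axis='columns')

print(df.points.tolist())              # [87, 90, 93]  (unchanged)
print(centered_map.tolist())           # [-3.0, 0.0, 3.0]
print(list(centered_apply.points))     # [-3.0, 0.0, 3.0]
```

To keep the transformed values, assign the result back, for example df['points'] = centered_map.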

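[*] Operating on the whole column directly

# Operate on the whole column at once; like map(), this returns a new Series and leaves reviews.points unchanged
# pandas provides many common mapping operations as built-ins. For example, here is a faster way of re-centering our points column:
print("change directly:\n",reviews.points - review_points_mean)

change directly: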
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138

129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64

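In this code we perform an operation between many values on the left (everything in the Series) and a single value on the right (the mean). pandas looks at this expression and works out that we must mean to subtract that mean value from every value in the dataset.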
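
A minimal sketch, on made-up scores, showing that the operator form and the map() form produce the same Series:

```python
import pandas as pd

points = pd.Series([87, 90, 93])
mean = points.mean()

vectorized = points - mean               # the scalar is applied to every element
mapped = points.map(lambda p: p - mean)  # same result, one Python call per value

print(vectorized.equals(mapped))  # True
```

The operator form is usually preferable because the loop over the values runs inside pandas rather than in Python.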

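Simple string concatenation

A simple way to combine the country and region information in the dataset is the following (rows where either value is missing come out as NaN):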

print("reviews.country - reviews.region_1:\n",(reviews.country + " - " + reviews.region_1))

reviews.country - reviews.region_1:
0 Italy - Etna
1 NaN
2 US - Willamette Valley
3 US - Lake Michigan Shore
4 US - Willamette Valley

129966 NaN
129967 US - Oregon
129968 France - Alsace
129969 France - Alsace
129970 France - Alsace
Length: 129971, dtype: object
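
The NaN rows in the output come from missing region_1 values: concatenating with a missing value gives a missing result. The sketch below reproduces this on made-up data. The fillna('Unknown') workaround is an assumption added for illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'Portugal'],
                   'region_1': ['Etna', np.nan]})

combined = df.country + " - " + df.region_1
print(combined.tolist())   # ['Italy - Etna', nan]

# One possible workaround: substitute a placeholder before concatenating.
filled = df.country + " - " + df.region_1.fillna('Unknown')
print(filled.tolist())     # ['Italy - Etna', 'Portugal - Unknown']
```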

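Application 3. Multi-step lookup with idxmax()

'''Create a variable "bargain_wine" holding the title of the wine with the highest points-to-price ratio in the dataset.'''
# Note the use of idxmax() here: it returns the index label of the maximum value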
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
print("the bargain_wine:",bargain_wine)

the bargain_wine: Bandit NV Merlot (California)
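
The two-step pattern (find the label with idxmax(), then look it up with loc) can be traced on a small made-up frame. The non-default index is deliberate, to show that idxmax() returns a label rather than a position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'title': ['Wine A', 'Wine B', 'Wine C'],
    'points': [90, 86, 88],
    'price': [45.0, 4.0, np.nan],
}, index=[10, 20, 30])

ratio = df.points / df.price   # NaN where the price is missing
best_label = ratio.idxmax()    # label of the largest ratio; NaN is skipped
print(best_label)                     # 20
print(df.loc[best_label, 'title'])    # Wine B
```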

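Application 4. Counting string occurrences with map()

Count how many descriptions in the dataset's "description" column contain each of the two words "tropical" and "fruity".

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)
print(points_describe)  # summary from the describe() section; its output is shown below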

count 129971.000000
mean 88.447138
std 3.039730
min 80.000000
25% 86.000000
50% 88.000000
75% 91.000000
max 100.000000
Name: points, dtype: float64
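
Note what the map()-and-sum approach to counting "tropical" and "fruity" actually measures: the number of descriptions that contain the word at least once, matched case-sensitively. A minimal sketch on three made-up descriptions:

```python
import pandas as pd

descriptions = pd.Series([
    "Tropical notes with a fruity finish",   # capital T: "tropical" not matched
    "tropical fruit and citrus",
    "dry and fruity",
])

# Each map() yields a boolean per description; sum() counts the True values.
n_trop = descriptions.map(lambda d: "tropical" in d).sum()
n_fruity = descriptions.map(lambda d: "fruity" in d).sum()
counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(counts.tolist())  # [1, 2]
```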

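Application 5. Building a rating system

Build a star rating system for the scores (which range from 80 to 100): a score of 95 or higher gets 3 stars, a score of at least 85 but below 95 gets 2 stars, and any other score gets 1 star. In addition, any wine from Canada automatically gets 3 stars, regardless of its score.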

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')
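
To check each branch of the rule, the sketch below runs an equivalent, slightly condensed stars function on four made-up rows, one per case.

```python
import pandas as pd

def stars(row):
    # Canada always gets 3 stars; otherwise the score decides.
    if row.country == 'Canada' or row.points >= 95:
        return 3
    if row.points >= 85:
        return 2
    return 1

sample = pd.DataFrame({'country': ['Canada', 'US', 'Italy', 'France'],
                       'points': [82, 96, 90, 84]})
sample_ratings = sample.apply(stars, axis='columns')
print(sample_ratings.tolist())  # [3, 3, 2, 1]
```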

Reference article: https://blog.csdn.net/liuhehe123/article/details/85786200
