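Building on the previous two chapters, this chapter looks at some of the functions and attributes related to Categorical Data.
The categories of Categorical Data can be replaced, either by direct assignment or with the rename_categories function.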
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print df,"\n"
print df.fruit.values.categories
print df.fruit.values.codes
df.fruit.values.categories = ["Pearl", "Orange", "Apple"]
print df.fruit.values.categories
print df.fruit.values.codes
print df
Program output:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
6 orange 7.5
7 orange 7.3
9 apple 5.2
4 pearl 3.7
8 orange 7.3
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1 1 0 2 1]
Index([u'Pearl', u'Orange', u'Apple'], dtype='object')
[0 2 1 0 1 1 0 2 1]
fruit price
1 Pearl 5.2
2 Apple 3.5
3 Orange 7.3
5 Pearl 5.0
6 Orange 7.5
7 Orange 7.3
9 Pearl 5.2
4 Apple 3.7
8 Orange 7.3
Compare the data before and after the categories of the categorical data were changed: the codes of the categorical are unchanged, yet the values printed for df.fruit are different.
Supplier | Fruit | Price | Before <==> After | Fruit | Price |
---|---|---|---|---|---|
1 | apple | 5.20 | <==> | Pearl | 5.20 |
2 | pearl | 3.50 | <==> | Apple | 3.50 |
3 | orange | 7.30 | <==> | Orange | 7.30 |
5 | apple | 5.00 | <==> | Pearl | 5.00 |
6 | orange | 7.50 | <==> | Orange | 7.50 |
7 | orange | 7.30 | <==> | Orange | 7.30 |
9 | apple | 5.20 | <==> | Pearl | 5.20 |
4 | pearl | 3.70 | <==> | Apple | 3.70 |
8 | orange | 7.30 | <==> | Orange | 7.30 |
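The reason is that the categories of the categorical variable df.fruit were changed from `["apple", "orange", "pearl"]` to `["Pearl", "Orange", "Apple"]` while the codes stayed the same, so each code now maps to a different label: code 0 used to mean "apple" and now means "Pearl", and code 2 used to mean "pearl" and now means "Apple".
The same replacement can be made with the rename_categories function. Note that this function has an inplace parameter that defaults to False, which leaves the original data untouched; to modify the original categorical data, set inplace to True.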
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print df[:4],"\n"
print df.fruit.values.categories
print df.fruit.values.codes
df.fruit.values.rename_categories(["Pearl", "Orange", "Apple"],inplace = True)
print df.fruit.values.categories
print df.fruit.values.codes
print df[:4]
Program output:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1 1 0 2 1]
Index([u'Pearl', u'Orange', u'Apple'], dtype='object')
[0 2 1 0 1 1 0 2 1]
fruit price
1 Pearl 5.2
2 Apple 3.5
3 Orange 7.3
5 Pearl 5.0
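The listings above target Python 2 and an older pandas. In pandas 2.x the `inplace` argument is no longer available on these methods, so the usual pattern is to assign the returned result back. The following is a minimal sketch under that assumption (Python 3, recent pandas); the shortened `fruit` Series and the variable names are illustrative only. It also shows that each displayed value is simply the category looked up by its code.

```python
import pandas as pd

# Shortened version of the fruit column used above (illustrative data).
fruit = pd.Series(["apple", "pearl", "orange", "apple"], dtype="category")
codes_before = fruit.cat.codes.tolist()   # positions into the categories

# rename_categories returns a new Series; the original is left untouched.
renamed = fruit.cat.rename_categories(["Pearl", "Orange", "Apple"])
codes_after = renamed.cat.codes.tolist()

# Every value is categories[code], so new labels change the displayed values.
rebuilt = [renamed.cat.categories[c] for c in codes_after]

print(codes_before, codes_after)
print(renamed.tolist())
```

Because the codes are identical before and after, only the label attached to each code differs, which matches the before/after table earlier in this chapter.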
To add categories, that is, to increase the number of classes, use the add_categories function. The example below adds a new kind of fruit, watermelon, to the supply.
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
df_new = pd.DataFrame({"fruit":["watermelon"] * 3,
"price":[2.75, 2.60, 2.55]},
index = [11, 12, 13])
df.fruit.values.add_categories("watermelon", inplace = True)
print "insert datas->\n",df_new
df = df.append(df_new)
print "after insert->\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
Program output:
insert datas->
fruit price
11 watermelon 2.75
12 watermelon 2.60
13 watermelon 2.55
after insert->
fruit price
1 apple 5.20
2 pearl 3.50
3 orange 7.30
5 apple 5.00
6 orange 7.50
7 orange 7.30
9 apple 5.20
4 pearl 3.70
8 orange 7.30
11 watermelon 2.75
12 watermelon 2.60
13 watermelon 2.55
categories->
Index([u'apple', u'orange', u'pearl', u'watermelon'], dtype='object')
codes->
[0 2 1 0 1 1 0 2 1 3 3 3]
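Note that add_categories must be called before the data is inserted; otherwise the rows are added but the new values are not recognized: their codes are all -1 and they are shown as NaN. If `df.fruit.values.add_categories("watermelon", inplace = True)` is placed after the statement `df = df.append(df_new)` that adds the data: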
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
df_new = pd.DataFrame({"fruit":["watermelon"] * 3,
"price":[2.75, 2.60, 2.55]},
index = [11, 12, 13])
print "insert datas->\n",df_new
df = df.append(df_new)
df.fruit.values.add_categories("watermelon", inplace = True)
print "after insert->\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
the result is:
insert datas->
fruit price
11 watermelon 2.75
12 watermelon 2.60
13 watermelon 2.55
after insert->
fruit price
1 apple 5.20
2 pearl 3.50
3 orange 7.30
5 apple 5.00
6 orange 7.50
7 orange 7.30
9 apple 5.20
4 pearl 3.70
8 orange 7.30
11 NaN 2.75
12 NaN 2.60
13 NaN 2.55
categories->
Index([u'apple', u'orange', u'pearl', u'watermelon'], dtype='object')
codes->
[ 0 2 1 0 1 1 0 2 1 -1 -1 -1]
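For readers on a recent pandas, where `DataFrame.append` and the `inplace` argument have been removed, the same "add the category first, then add the rows" idea could be sketched as follows. This is only an illustration under those assumptions; the shortened data and the name `combined` are not part of the original example. Casting the new rows to the same categorical dtype is one way to keep the column categorical after `pd.concat`.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pearl", "orange"],
                   "price": [5.20, 3.50, 7.30]}, index=[1, 2, 3])
df["fruit"] = df["fruit"].astype("category")

# Register the new category first, assigning the result back.
df["fruit"] = df["fruit"].cat.add_categories("watermelon")

df_new = pd.DataFrame({"fruit": ["watermelon"] * 2,
                       "price": [2.75, 2.60]}, index=[11, 12])
# Give the new rows the same categorical dtype so that concat keeps it.
df_new["fruit"] = df_new["fruit"].astype(df["fruit"].dtype)

combined = pd.concat([df, df_new])
print(combined["fruit"].cat.categories.tolist())
print(combined["fruit"].cat.codes.tolist())
```

With both frames sharing one dtype, the watermelon rows receive a regular code instead of -1.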
If the fruit shop stops selling apples, every apple record has to be deleted from fruit, and apple also has to be removed from the categories.
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "before del 'apple'\n", df
df = df[df.fruit != "apple"]
df.fruit.values.remove_categories("apple", inplace = True)
print "after del 'apple'\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
print df.fruit
Program output:
before del 'apple'
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
6 orange 7.5
7 orange 7.3
9 apple 5.2
4 pearl 3.7
8 orange 7.3
after del 'apple'
fruit price
2 pearl 3.5
3 orange 7.3
6 orange 7.5
7 orange 7.3
4 pearl 3.7
8 orange 7.3
categories->
Index([u'orange', u'pearl'], dtype='object')
codes->
[1 0 0 0 1 0]
2 pearl
3 orange
6 orange
7 orange
4 pearl
8 orange
Name: fruit, dtype: category
Categories (2, object): [orange, pearl]
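In the code, `df = df[df.fruit != "apple"]` uses boolean selection to delete all "apple" records, while `df.fruit.values.remove_categories("apple", inplace = True)` removes the category "apple" from the categories of df's fruit categorical data. If that statement is commented out, the codes are still encoded against the original categories.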
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "before del 'apple'\n", df
df = df[df.fruit != "apple"]
#df.fruit.values.remove_categories("apple", inplace = True)
print "after del 'apple'\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
print df.fruit
The result is:
before del 'apple'
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
6 orange 7.5
7 orange 7.3
9 apple 5.2
4 pearl 3.7
8 orange 7.3
after del 'apple'
fruit price
2 pearl 3.5
3 orange 7.3
6 orange 7.5
7 orange 7.3
4 pearl 3.7
8 orange 7.3
categories->
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes->
[2 1 1 1 2 1]
2 pearl
3 orange
6 orange
7 orange
4 pearl
8 orange
Name: fruit, dtype: category
Categories (3, object): [apple, orange, pearl]
With the category removed, the codes are `[1 0 0 0 1 0]`; without removing it, the codes are `[2 1 1 1 2 1]`.
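The two steps can also be done in the opposite order. In the pandas versions I am familiar with, removing a category that is still in use turns the affected values into NaN (code -1), after which the missing rows can be dropped. The sketch below assumes Python 3 and a recent pandas, and uses a shortened, illustrative `fruit` Series rather than the full DataFrame.

```python
import pandas as pd

fruit = pd.Series(["apple", "pearl", "orange", "apple"],
                  index=[1, 2, 3, 5], dtype="category")

# Removing a category that is still in use sets those values to NaN (code -1).
removed = fruit.cat.remove_categories("apple")
print(removed.cat.codes.tolist())

# Dropping the missing rows leaves only the remaining fruit.
cleaned = removed.dropna()
print(cleaned.tolist(), cleaned.cat.categories.tolist())
```

Either order ends with the same rows and the same two categories; filtering first, as in the listing above, simply avoids the intermediate NaN values.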
Removing unused categories means that when the data contains no values of a given category, that category can be deleted from the categories.
import pandas as pd
cat = ["watermelon","pearl","orange", "apple"]
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "1_initial-->"
print "dataframe:\n",df[:3]
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
df.fruit.cat.set_categories(cat, inplace = True)
print "\n2_after set_catgories()-->"
print "dataframe:\n",df[:3]
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
df.fruit.cat.remove_unused_categories(inplace = True)
print "\n3_after remove used categories-->"
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
print "dataframe:\n",df[:3]
Program output:
1_initial-->
dataframe:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
categories:
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes:
[0 2 1 0 1 1 0 2 1]
2_after set_catgories()-->
dataframe:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
categories:
Index([u'watermelon', u'pearl', u'orange', u'apple'], dtype='object')
codes:
[3 1 2 3 2 2 3 1 2]
3_after remove used categories-->
categories:
Index([u'pearl', u'orange', u'apple'], dtype='object')
codes:
[2 0 1 2 1 1 2 0 1]
dataframe:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
The output shows that the data itself did not change; only its categories did, with the codes re-encoded each time to match. `df['fruit'] = df['fruit'].astype('category')` converts the fruit column to categorical data and creates the categories `['apple', 'orange', 'pearl']`; the statement `df.fruit.cat.set_categories(cat, inplace = True)` changes the categories to `['watermelon', 'pearl', 'orange', 'apple']`; and the statement `df.fruit.cat.remove_unused_categories(inplace = True)` deletes the unused watermelon category, leaving `['pearl', 'orange', 'apple']`.
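The same sequence can be sketched for a recent pandas, where these methods return a new object instead of accepting `inplace`. The shortened data and the names `widened` and `trimmed` are illustrative assumptions, not part of the original listing.

```python
import pandas as pd

fruit = pd.Series(["apple", "pearl", "orange"], dtype="category")

# Replace the category list; values keep their labels, codes are re-encoded.
widened = fruit.cat.set_categories(["watermelon", "pearl", "orange", "apple"])
print(widened.cat.codes.tolist())

# Drop categories that no value uses; the remaining order is preserved.
trimmed = widened.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
print(trimmed.cat.codes.tolist())
```

Note that the trimmed categories keep the order given to `set_categories` rather than returning to the original alphabetical order.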
The value_counts function counts how many times each category of the categorical data occurs, which is a typical use of categorical data.
import pandas as pd
cat = ["watermelon","pearl","orange", "apple"]
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "dataframe:\n",df
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
print "value_counts()\n", df.fruit.value_counts()
Program output:
dataframe:
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
6 orange 7.5
7 orange 7.3
9 apple 5.2
4 pearl 3.7
8 orange 7.3
categories:
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes:
[0 2 1 0 1 1 0 2 1]
value_counts()
orange 4
apple 3
pearl 2
dtype: int64
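One detail worth knowing is that, for categorical data, value_counts reports every category, including those that never occur. The sketch below assumes Python 3 and a recent pandas; the extra watermelon category is added purely for illustration and is not part of the example above.

```python
import pandas as pd

fruit = pd.Series(["apple", "pearl", "orange", "apple", "orange", "orange"],
                  dtype="category")
# Add a category that no value uses (illustrative only).
fruit = fruit.cat.add_categories("watermelon")

counts = fruit.value_counts()
print(counts.to_dict())
```

The unused category appears with a count of 0, which an object-dtype column would not show.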