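16. Creating Categorical Data in Pandas
The previous chapter introduced the basic meaning of Categorical Data; this chapter goes into more detail on how to create and use this data type.
One more note on the difference between Categorical Data and categories: Categorical Data consists of two parts, categories and codes. categories is the finite set of unique labels; codes holds, for each value, the integer position of its label in categories, and it is this encoding that is actually stored.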
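The relationship between the two parts can be sketched as follows. This is a minimal illustration; the variable names values, categories, codes and rebuilt are chosen here for the example only.

```python
import pandas as pd

values = ["apple", "pearl", "orange", "apple", "orange"]
cat = pd.Categorical(values)

# categories: the unique labels (sorted); codes: one integer per value
categories = list(cat.categories)
codes = cat.codes.tolist()

# Looking up each code in categories rebuilds the original values
rebuilt = [categories[c] for c in codes]

print(categories)  # ['apple', 'orange', 'pearl']
print(codes)       # [0, 2, 1, 0, 1]
print(rebuilt == values)  # True
```

Only the small integer codes are repeated per row; each label is stored once in categories.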
16.1 Creating Categorical Data
Pandas offers several ways to create Categorical Data: you can convert a column of an existing DataFrame to Categorical Data, create Categorical Data directly, or receive it as the return value of certain functions.
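As one example of a function whose return value is Categorical Data, pd.cut can be used. The sketch below is only an illustration: the bin edges and the labels low, mid and high are assumptions made up for this example, not part of the original data.

```python
import pandas as pd

prices = [5.2, 3.5, 7.3, 5.0, 7.5]

# pd.cut bins numeric values; for a plain list it returns a Categorical.
# The bins (0, 4], (4, 6], (6, 8] and their labels are illustrative only.
bands = pd.cut(prices, bins=[0, 4, 6, 8], labels=["low", "mid", "high"])

print(type(bands).__name__)  # Categorical
print(list(bands))           # ['mid', 'low', 'high', 'mid', 'high']
print(bands.codes.tolist())  # [1, 0, 2, 1, 2]
```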
1). Creating with astype('category'): a column of a DataFrame can be converted directly to Categorical Data.
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name, "price" : price}, index = idx)
# Convert the fruit column to the category dtype
df['fruit'] = df['fruit'].astype('category')
print(df, "\n")
print("df.price.values\n", df.price.values, "\n")
print("df.fruit.values\n", df.fruit.values, "\n")
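This is the example used in the previous chapter: the fruit column of the DataFrame df is converted to the category dtype. The column is still a Series; only its values are now stored as Categorical Data. The program prints: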
fruit price
1 apple 5.2
2 pearl 3.5
3 orange 7.3
5 apple 5.0
6 orange 7.5
7 orange 7.3
9 apple 5.2
4 pearl 3.7
8 orange 7.3
df.price.values
[5.2 3.5 7.3 5. 7.5 7.3 5.2 3.7 7.3]
df.fruit.values
[apple, pearl, orange, apple, orange, orange, apple, pearl, orange]
Categories (3, object): [apple, orange, pearl]
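To check what astype('category') actually changed, the dtype and the .cat accessor of the column can be inspected. This is a minimal sketch on a shortened copy of the data above; the names after, is_series, fruit_codes and fruit_categories are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pearl", "orange", "apple"],
                   "price": [5.2, 3.5, 7.3, 5.0]})
df["fruit"] = df["fruit"].astype("category")
after = df["fruit"].dtype

# The column is still a Series; only its dtype is now categorical
is_series = isinstance(df["fruit"], pd.Series)
is_categorical = isinstance(after, pd.CategoricalDtype)

# For a Series, codes and categories are reached through the .cat accessor
fruit_codes = df["fruit"].cat.codes.tolist()
fruit_categories = list(df["fruit"].cat.categories)

print(is_series, is_categorical)  # True True
print(fruit_codes)                # [0, 2, 1, 0]
print(fruit_categories)           # ['apple', 'orange', 'pearl']
```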
2). Creating a Categorical directly with pandas.Categorical
import pandas as pd
val = ["apple","pearl","orange", "apple", "orange"]
cat = pd.Categorical(val)
print("type is", type(cat))
print("*" * 20)
print("categorical data:\n", cat)
print("*" * 20)
# categories: the unique labels; codes: the integer code of each value
print(cat.categories)
print(cat.codes)
Program output (recorded under Python 2, hence the u prefix on the strings in the Index):
type is
********************
categorical data:
[apple, pearl, orange, apple, orange]
Categories (3, object): [apple, orange, pearl]
********************
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1]
val is a Python list, while cat is Categorical Data. It has the attributes categories and codes, which hold the category set and the stored integer encoding respectively.
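pd.Categorical also accepts an explicit categories argument instead of inferring the labels from the data. The sketch below assumes a deliberately incomplete category list (chosen for this example only) to show how the codes follow the given order and how a value outside the list is handled.

```python
import pandas as pd

val = ["apple", "pearl", "orange", "apple", "orange"]

# Explicit categories keep the given order; "orange" is left out on purpose
cat = pd.Categorical(val, categories=["pearl", "apple"])

print(cat.categories.tolist())  # ['pearl', 'apple']
print(cat.codes.tolist())       # [1, 0, -1, 1, -1]
print(cat.isna().tolist())      # [False, False, True, False, True]
```

A value that is not in categories gets the code -1 and is treated as missing.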
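3). Generating Categorical Data from categories and codes. The categories must be unique and finite; each code must be an integer index into categories (or -1 for a missing value).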
import pandas as pd
val = ["apple","pearl","orange", "apple", "orange"]
cat = pd.Categorical(val)
print("type is", type(cat))
print("*" * 20)
print("categorical data:\n", cat)
print("*" * 20)
print(cat.categories)
print(cat.codes)
print("*" * 20)
codes = pd.Series([0,1, 0,2,1,0,2,0])
print("create categorical data:")
# take picks the elements of cat at the given positions
print(cat.take(codes))
print(pd.Categorical.take(cat, codes))
# from_codes maps each code to cat.categories[code]
print(cat.from_codes(codes, cat.categories))
Program output:
type is
********************
categorical data:
[apple, pearl, orange, apple, orange]
Categories (3, object): [apple, orange, pearl]
********************
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1]
********************
create categorical data:
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
[apple, orange, apple, pearl, orange, apple, pearl, apple]
Categories (3, object): [apple, orange, pearl]
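In the program, cat is Categorical Data built from the list val, and it has the attributes categories and codes. Next, the categories of cat are used as the category set to generate another Categorical.
Calling take on a Categorical instance: take receives a list of integer positions and returns the elements of cat at those positions. Note that these integers are positions in cat's data, not category codes.
print(cat.take(codes))
The selected data: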
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
Calling take through the pd.Categorical class: here there are two arguments, the Categorical instance cat and the list of positions.
print(pd.Categorical.take(cat, codes))
Result:
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
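Calling from_codes on a Categorical instance: from_codes is a class method (so pd.Categorical.from_codes works as well) that takes the codes and the categories and maps each code to the category at that index.
print(cat.from_codes(codes, cat.categories))
Result:
[apple, orange, apple, pearl, orange, apple, pearl, apple]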
Categories (3, object): [apple, orange, pearl]
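The difference between take and from_codes is easy to miss, so here is a minimal side-by-side sketch with the same integers passed to both. The names positions_or_codes, by_position and by_code are introduced for this illustration only.

```python
import pandas as pd

val = ["apple", "pearl", "orange", "apple", "orange"]
cat = pd.Categorical(val)
positions_or_codes = [0, 1, 0, 2, 1, 0, 2, 0]

# take: each integer is a position in the data of cat (i.e. in val)
by_position = cat.take(positions_or_codes)

# from_codes: each integer is an index into the categories
by_code = pd.Categorical.from_codes(positions_or_codes,
                                    categories=cat.categories)

print(list(by_position))
# ['apple', 'pearl', 'apple', 'orange', 'pearl', 'apple', 'orange', 'apple']
print(list(by_code))
# ['apple', 'orange', 'apple', 'pearl', 'orange', 'apple', 'pearl', 'apple']
```

Position 1 of the data is pearl, while code 1 stands for orange, which is why the two results differ.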
16.2 Inserting Categorical Data into a DataFrame
Categorical Data created with pandas.Categorical can be inserted into a DataFrame as a column.
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
fruit = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({"price" : price}, index = idx)
print(df)
# Build the Categorical first, then add it as a new column
cat = pd.Categorical(fruit)
df["fruit"] = cat
print(df)
print(cat.codes)
print(cat.categories)
Program output:
price
1 5.2
2 3.5
3 7.3
5 5.0
6 7.5
7 7.3
9 5.2
4 3.7
8 7.3
price fruit
1 5.2 apple
2 3.5 pearl
3 7.3 orange
5 5.0 apple
6 7.5 orange
7 7.3 orange
9 5.2 apple
4 3.7 pearl
8 7.3 orange
[0 2 1 0 1 1 0 2 1]
Index([u'apple', u'orange', u'pearl'], dtype='object')
Of course, you can also create the DataFrame first and then convert a column with astype('category').
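The two routes can be compared directly. The sketch below reuses the data from this section; df1, df2 and the two comparison flags are names introduced here for illustration.

```python
import pandas as pd

idx = [1, 2, 3, 5, 6, 7, 9, 4, 8]
fruit = ["apple", "pearl", "orange", "apple", "orange",
         "orange", "apple", "pearl", "orange"]
price = [5.20, 3.50, 7.30, 5.00, 7.50, 7.30, 5.20, 3.70, 7.30]

# Route 1: insert a ready-made Categorical as a new column
df1 = pd.DataFrame({"price": price}, index=idx)
df1["fruit"] = pd.Categorical(fruit)

# Route 2: build the frame with plain strings, then convert the column
df2 = pd.DataFrame({"price": price, "fruit": fruit}, index=idx)
df2["fruit"] = df2["fruit"].astype("category")

# Both routes end up with the same codes and the same categories
same_codes = (df1["fruit"].cat.codes.tolist()
              == df2["fruit"].cat.codes.tolist())
same_categories = (list(df1["fruit"].cat.categories)
                   == list(df2["fruit"].cat.categories))

print(same_codes, same_categories)       # True True
print(df2["fruit"].cat.codes.tolist())   # [0, 2, 1, 0, 1, 1, 0, 2, 1]
```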