Contents
File download URL
Extracting the data
Analysis
The US Department of Agriculture (USDA) publishes a database of food nutrition information. Ashley Williams produced a JSON version of it.
https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/usda_food/database.json
*The file is fairly large, so download it first and load it from disk rather than pasting its contents into the editor.
>>> import json
>>> db = json.load(open(r'D:\python\DataAnalysis\data\database.json'))  # raw string so the backslashes are not read as escapes
>>> len(db)
6636
>>> db[0].keys()
[u'portions', u'description', u'tags', u'nutrients', u'group', u'id', u'manufacturer']
>>> db[0]['nutrients'][0]
{u'units': u'g', u'group': u'Composition', u'description': u'Protein', u'value': 25.18}
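As a minimal sketch of the loading step, the snippet below reads the JSON inside a `with` block so the file handle is closed afterwards. It parses a tiny hand-written sample instead of the real file: the `sample` string and every value in it are made up for illustration and only mimic the record layout shown above.

```python
import io
import json

# Hypothetical two-record sample mimicking the layout of database.json.
sample = """
[
  {"id": 1, "description": "Food A", "group": "Group X", "manufacturer": "",
   "tags": [], "portions": [],
   "nutrients": [
     {"description": "Protein", "group": "Composition", "units": "g", "value": 10.0},
     {"description": "Water", "group": "Composition", "units": "g", "value": 50.0}
   ]},
  {"id": 2, "description": "Food B", "group": "Group Y", "manufacturer": "Acme",
   "tags": [], "portions": [],
   "nutrients": [
     {"description": "Protein", "group": "Composition", "units": "g", "value": 3.0}
   ]}
]
"""

# With the real file this would be: with open(path, encoding="utf-8") as f:
with io.StringIO(sample) as f:
    db = json.load(f)

print(len(db))                # number of food records
print(sorted(db[0].keys()))   # fields available on each record
print(db[0]["nutrients"][0])  # first nutrient of the first food
```

Each record is a plain dict, and its `nutrients` field is a list of dicts, which is why the later steps flatten it separately.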
When converting to a DataFrame, you can extract only a subset of the fields. Here we take each food's name, group, ID, and manufacturer.
>>> from pandas import DataFrame,Series
>>> info_keys = ['description','group','id','manufacturer']
>>> info = DataFrame(db,columns=info_keys)
>>> info[:5]
description ... manufacturer
0 Cheese, caraway ...
1 Cheese, cheddar ...
2 Cheese, edam ...
3 Cheese, feta ...
4 Cheese, mozzarella, part skim milk ...
[5 rows x 4 columns]
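The `...` in the output above is pandas truncating the display to fit the console width. A small sketch of the same field selection on toy records (the `records` list is invented for illustration), printed with all columns visible via `pd.option_context`:

```python
import pandas as pd

# Hypothetical records; only the keys named in info_keys become columns.
records = [
    {"id": 1, "description": "Food A", "group": "Group X",
     "manufacturer": "", "tags": []},
    {"id": 2, "description": "Food B", "group": "Group Y",
     "manufacturer": "Acme", "tags": ["t"]},
]

info_keys = ["description", "group", "id", "manufacturer"]
info = pd.DataFrame(records, columns=info_keys)

# Temporarily lift the column limit so nothing is replaced by "...".
with pd.option_context("display.max_columns", None, "display.width", 200):
    print(info)
```

Keys that are not listed in `columns` (here `tags`) are simply ignored.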
Use value_counts to look at the distribution of food groups:
>>> import pandas as pd
>>> info['group'].value_counts()
Vegetables and Vegetable Products 812
Beef Products 618
Baked Products 496
Breakfast Cereals 403
Legumes and Legume Products 365
Fast Foods 365
Lamb, Veal, and Game Products 345
Sweets 341
Fruits and Fruit Juices 328
Pork Products 328
Beverages 278
Soups, Sauces, and Gravies 275
Finfish and Shellfish Products 255
Baby Foods 209
Cereal Grains and Pasta 183
Ethnic Foods 165
Snacks 162
Nut and Seed Products 128
Poultry Products 116
Sausages and Luncheon Meats 111
Dairy and Egg Products 107
Fats and Oils 97
Meals, Entrees, and Sidedishes 57
Restaurant Foods 51
Spices and Herbs 41
Name: group, dtype: int64
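For reference, here is a small sketch of `value_counts` on a toy column (the group names are placeholders). It also shows the `normalize=True` option, which returns shares instead of raw counts:

```python
import pandas as pd

# Hypothetical group labels.
info = pd.DataFrame(
    {"group": ["Group X", "Group Y", "Group X", "Group Z", "Group X", "Group Y"]}
)

counts = info["group"].value_counts()                # sorted, most frequent first
shares = info["group"].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```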
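Now, to run some analysis on all of the nutrient data, the simplest approach is to assemble the nutrients of every food into one large table. We do this in several steps.
First, convert each food's nutrient list into a DataFrame, add a column holding the food's ID, and append that DataFrame to a list. Finally, join everything together with concat.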
>>> nutrients = []
>>> for rec in db:
...     fnuts = DataFrame(rec['nutrients'])
...     fnuts['id'] = rec['id']
...     nutrients.append(fnuts)
...
>>> nutrients = pd.concat(nutrients,ignore_index=True)
The concatenated nutrients:
>>> nutrients
description group ... value id
0 Protein Composition ... 25.180 1008
1 Total lipid (fat) Composition ... 29.200 1008
2 Carbohydrate, by difference Composition ... 3.060 1008
3 Ash Other ... 3.280 1008
4 Energy Energy ... 376.000 1008
5 Water Composition ... 39.280 1008
6 Energy Energy ... 1573.000 1008
7 Fiber, total dietary Composition ... 0.000 1008
8 Calcium, Ca Elements ... 673.000 1008
9 Iron, Fe Elements ... 0.640 1008
10 Magnesium, Mg Elements ... 22.000 1008
11 Phosphorus, P Elements ... 490.000 1008
12 Potassium, K Elements ... 93.000 1008
13 Sodium, Na Elements ... 690.000 1008
14 Zinc, Zn Elements ... 2.940 1008
15 Copper, Cu Elements ... 0.024 1008
16 Manganese, Mn Elements ... 0.021 1008
17 Selenium, Se Elements ... 14.500 1008
18 Vitamin A, IU Vitamins ... 1054.000 1008
19 Retinol Vitamins ... 262.000 1008
20 Vitamin A, RAE Vitamins ... 271.000 1008
21 Vitamin C, total ascorbic acid Vitamins ... 0.000 1008
22 Thiamin Vitamins ... 0.031 1008
23 Riboflavin Vitamins ... 0.450 1008
24 Niacin Vitamins ... 0.180 1008
25 Pantothenic acid Vitamins ... 0.190 1008
26 Vitamin B-6 Vitamins ... 0.074 1008
27 Folate, total Vitamins ... 18.000 1008
28 Vitamin B-12 Vitamins ... 0.270 1008
29 Folic acid Vitamins ... 0.000 1008
... ... ... ... ...
1168085 Selenium, Se Elements ... 1.100 43546
1168086 Vitamin A, IU Vitamins ... 5.000 43546
1168087 Retinol Vitamins ... 0.000 43546
1168088 Vitamin A, RAE Vitamins ... 0.000 43546
1168089 Carotene, beta Vitamins ... 2.000 43546
1168090 Carotene, alpha Vitamins ... 2.000 43546
1168091 Vitamin E (alpha-tocopherol) Vitamins ... 0.250 43546
1168092 Vitamin D Vitamins ... 0.000 43546
1168093 Vitamin D (D2 + D3) Vitamins ... 0.000 43546
1168094 Cryptoxanthin, beta Vitamins ... 0.000 43546
1168095 Lycopene Vitamins ... 0.000 43546
1168096 Lutein + zeaxanthin Vitamins ... 20.000 43546
1168097 Vitamin C, total ascorbic acid Vitamins ... 21.900 43546
1168098 Thiamin Vitamins ... 0.020 43546
1168099 Riboflavin Vitamins ... 0.060 43546
1168100 Niacin Vitamins ... 0.540 43546
1168101 Vitamin B-6 Vitamins ... 0.260 43546
1168102 Folate, total Vitamins ... 17.000 43546
1168103 Vitamin B-12 Vitamins ... 0.000 43546
1168104 Choline, total Vitamins ... 4.100 43546
1168105 Vitamin K (phylloquinone) Vitamins ... 0.500 43546
1168106 Folic acid Vitamins ... 0.000 43546
1168107 Folate, food Vitamins ... 17.000 43546
1168108 Folate, DFE Vitamins ... 17.000 43546
1168109 Vitamin E, added Vitamins ... 0.000 43546
1168110 Vitamin B-12, added Vitamins ... 0.000 43546
1168111 Cholesterol Other ... 0.000 43546
1168112 Fatty acids, total saturated Other ... 0.072 43546
1168113 Fatty acids, total monounsaturated Other ... 0.028 43546
1168114 Fatty acids, total polyunsaturated Other ... 0.041 43546
[1168115 rows x 5 columns]
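As an alternative sketch, not the approach used above, `pd.json_normalize` can flatten the nested `nutrients` lists in one call. This assumes pandas 1.0 or later, and the toy `db` below is invented for illustration. The loop-and-concat version is included for comparison:

```python
import pandas as pd

# Hypothetical records with the same nesting as the real data.
db = [
    {"id": 1, "nutrients": [
        {"description": "Protein", "group": "Composition", "units": "g", "value": 10.0},
        {"description": "Water", "group": "Composition", "units": "g", "value": 50.0},
    ]},
    {"id": 2, "nutrients": [
        {"description": "Protein", "group": "Composition", "units": "g", "value": 3.0},
    ]},
]

# Loop-and-concat, as in the text.
frames = []
for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]
    frames.append(fnuts)
by_loop = pd.concat(frames, ignore_index=True)

# One call: one row per nutrient, with the parent record's id carried along.
by_normalize = pd.json_normalize(db, record_path="nutrients", meta=["id"])
# Metadata columns come back as object dtype, so cast id to an integer.
by_normalize["id"] = by_normalize["id"].astype("int64")

print(by_loop)
print(by_normalize)
```

Both produce one row per (food, nutrient) pair; which one reads better is a matter of taste.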
Handling duplicates. First count how many rows are exact duplicates:
>>> nutrients.duplicated().sum()
792939
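The call above only counts the duplicated rows. A minimal sketch of actually removing them with `drop_duplicates`, on a toy frame whose values are made up:

```python
import pandas as pd

# Hypothetical nutrient rows; the first two are identical.
nutrients = pd.DataFrame({
    "description": ["Protein", "Protein", "Water"],
    "value": [10.0, 10.0, 50.0],
    "id": [1, 1, 1],
})

n_dupes = nutrients.duplicated().sum()  # rows that repeat an earlier row
deduped = nutrients.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```

Note that `drop_duplicates` returns a new DataFrame, so the result has to be assigned back if later steps should see it.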
Rename the columns, since both DataFrames have description and group columns:
>>> col_mapping = {'description':'food','group':'fgroup'}
>>> info = info.rename(columns = col_mapping,copy=False)
>>> col_mapping = {'description':'nutrient','group':'nutgroup'}
>>> nutrients = nutrients.rename(columns = col_mapping,copy = False)
Merge info with nutrients:
>>> ndata = pd.merge(nutrients,info,on='id',how='outer')
>>> ndata.loc[30000]
nutrient Folate, food
nutgroup Vitamins
units mcg
value 11
id 1180
food Sour cream, fat free
fgroup Dairy and Egg Products
manufacturer None
Name: 30000, dtype: object
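To see what `how='outer'` does in this merge, here is a small sketch on toy frames (all names and values are placeholders). The `indicator=True` argument adds a `_merge` column showing where each row came from:

```python
import pandas as pd

# Hypothetical food table; Food C has no nutrient rows.
info = pd.DataFrame({"id": [1, 2, 3], "food": ["Food A", "Food B", "Food C"]})
nutrients = pd.DataFrame({
    "id": [1, 1, 2],
    "nutrient": ["Protein", "Water", "Protein"],
    "value": [10.0, 50.0, 3.0],
})

# An outer merge keeps ids that appear in either table.
ndata = pd.merge(nutrients, info, on="id", how="outer", indicator=True)
print(ndata)

# Foods without nutrient rows survive, with NaN in the nutrient columns.
unmatched = ndata[ndata["_merge"] == "right_only"]
print(unmatched)
```

With `how='inner'`, Food C would have been dropped instead.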
Next, let's practice with what we've covered so far, for example by plotting the median value by food group and nutrient type.
>>> result = ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
>>> result['Zinc, Zn'].sort_values().plot(kind='barh')
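A small sketch of the same computation on toy data (values invented), showing that `quantile(0.5)` is the median and that indexing the result by nutrient name leaves a Series keyed by food group:

```python
import pandas as pd

# Hypothetical merged data.
ndata = pd.DataFrame({
    "nutrient": ["Zinc, Zn"] * 5 + ["Protein"] * 2,
    "fgroup": ["Group X", "Group X", "Group X", "Group Y", "Group Y",
               "Group X", "Group Y"],
    "value": [1.0, 2.0, 9.0, 4.0, 6.0, 10.0, 20.0],
})

grouped = ndata.groupby(["nutrient", "fgroup"])["value"]
result = grouped.quantile(0.5)  # the 0.5 quantile ...
median = grouped.median()       # ... is the median

# Selecting on the first index level leaves one value per food group.
zinc = result["Zinc, Zn"].sort_values()
print(zinc)
# zinc.plot(kind="barh") would draw the bar chart (requires matplotlib).
```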
The foods richest in each amino acid:
>>> by_nutrient = ndata.groupby(['nutgroup','nutrient'])
>>> get_maximum = lambda x: x.xs(x.value.idxmax())
>>> get_minimum = lambda x: x.xs(x.value.idxmin())
>>> max_foods = by_nutrient.apply(get_maximum)[['value','food']]
>>> max_foods.food = max_foods.food.str[:50]
>>> max_foods.loc['Amino Acids']['food']
nutrient
Alanine Gelatins, dry powder, unsweetened
Arginine Seeds, sesame flour, low-fat
Aspartic acid Soy protein isolate
Cystine Seeds, cottonseed flour, low fat (glandless)
Glutamic acid Soy protein isolate
Glycine Gelatins, dry powder, unsweetened
Histidine Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine Fish, cod, Atlantic, dried and salted
Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline Gelatins, dry powder, unsweetened
Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, dtype: object
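The same lookup can also be sketched without `apply`, by asking each group for the row label of its maximum with `idxmax` and then selecting those rows. This is an alternative formulation, shown on made-up data:

```python
import pandas as pd

# Hypothetical merged data.
ndata = pd.DataFrame({
    "nutgroup": ["Amino Acids"] * 4 + ["Vitamins"] * 2,
    "nutrient": ["Alanine", "Alanine", "Glycine", "Glycine",
                 "Vitamin C", "Vitamin C"],
    "food": ["Food A", "Food B", "Food A", "Food B", "Food A", "Food B"],
    "value": [1.0, 3.0, 5.0, 2.0, 0.0, 7.0],
})

# Row label of the largest value within each (nutgroup, nutrient) pair.
idx = ndata.groupby(["nutgroup", "nutrient"])["value"].idxmax()

# Pull those rows and index them by the grouping keys.
max_foods = ndata.loc[idx.to_numpy()].set_index(["nutgroup", "nutrient"])
max_foods = max_foods[["value", "food"]]

amino = max_foods.loc["Amino Acids"]["food"]
print(amino)
```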