import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))
output:
<class 'pandas.core.frame.DataFrame'>
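# Keep food_info intact so that later lookups such as food_info.loc[99] still work
first_rows = food_info.head(3)
dimensions = first_rows.shape
print(dimensions)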
output:
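(3, 36)
# head(3) returns a new DataFrame object
Pandas treats the first line of the file as the column labels and the row numbers as the row labels. Indexing a row returns the column labels together with the values. Indexing a column returns the row labels together with the values.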
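The difference between row and column indexing can be checked on a small in-memory table. This is a minimal sketch: the DataFrame `foods` and its values are made up for illustration and are not taken from food_info.csv.

```python
import pandas

# Hypothetical miniature table; names and values are invented for this sketch.
foods = pandas.DataFrame({
    "NDB_No": [1, 2, 3],
    "Shrt_Desc": ["FOOD A", "FOOD B", "FOOD C"],
    "Water_(g)": [10.0, 20.0, 30.0],
})

row = foods.loc[0]            # row indexing: a Series indexed by column labels
column = foods["Water_(g)"]   # column indexing: a Series indexed by row labels

print(list(row.index))        # the column labels
print(list(column.index))     # the row labels
```

Both results are Series objects; only the index they carry differs.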
print(food_info.dtypes)
output:
NDB_No int64
Shrt_Desc object
Water_(g) float64
...
FA_Mono_(g) float64
FA_Poly_(g) float64
Cholestrl_(mg) float64
dtype: object
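Pandas columns carry data types such as int64 (integers), float64 (floating-point numbers), and object (used here for strings), as the output above shows.
DataFrame.dtypes returns the data type of each column.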
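As a rough illustration of how column types are inferred, the sketch below builds a tiny DataFrame by hand. The frame `sample` and its column names are hypothetical.

```python
import pandas

# Hypothetical three-column frame, one column per basic kind of value.
sample = pandas.DataFrame({
    "id": [1, 2, 3],              # integers
    "name": ["a", "b", "c"],      # strings (object in the pandas versions these
                                  # notes describe; newer releases may report a
                                  # dedicated string dtype instead)
    "weight": [1.5, 2.0, 3.25],   # floating-point numbers
})

print(sample.dtypes)              # one dtype per column, indexed by column label
```

Note that the result of `dtypes` is itself a Series, which is why its own dtype is printed as `object` on the last line of the output.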
# Select the 100th row (row labels start at 0, so its label is 99)
hundredth_row = food_info.loc[99]
print(hundredth_row)
output:
NDB_No 1111
Shrt_Desc MILK SHAKES THICK VANILLA
Water_(g) 74.45
...
FA_Mono_(g) 0.875
FA_Poly_(g) 0.113
Cholestrl_(mg) 12
Name: 99, dtype: object
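loc[list] returns the rows whose labels are in the list, for example loc[[1, 4, 6]]. loc[5:8] returns rows 5, 6, 7 and 8: a label slice with loc includes both endpoints.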
Output when selecting multiple rows:
   NDB_No     Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
3    1004   CHEESE BLUE      42.41         353        21.40
4    1005  CHEESE BRICK      41.11         371        23.24
5    1006   CHEESE BRIE      48.42         334        20.75
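Note: loc[n] selects a row and returns the column labels together with the values.
To get a column, index the DataFrame directly: food_info["column label"]
Similarly, select multiple columns with food_info[list] or food_info[["label1", "label2"]]
# Store the column labels as a list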
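Selecting several rows with loc can be tried on a toy table. This is a sketch under the assumption of a default integer index; the DataFrame `table` is hypothetical.

```python
import pandas

# Hypothetical table with row labels 0..9.
table = pandas.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]})

picked = table.loc[[1, 4, 6]]   # rows labeled 1, 4 and 6
sliced = table.loc[5:8]         # label slice: both 5 and 8 are included

print(list(picked.index))
print(list(sliced.index))
```

Unlike ordinary Python slicing, a label slice passed to loc keeps the end label.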
col_names = food_info.columns.tolist()
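One possible use of the column-name list is to pick columns by a pattern in their labels. The sketch below assumes a made-up frame `foods` and filters the columns measured in grams; the filter itself is an illustration, not something the notes above prescribe.

```python
import pandas

# Hypothetical frame; labels mimic the style of food_info, values are invented.
foods = pandas.DataFrame({
    "Shrt_Desc": ["FOOD A", "FOOD B"],
    "Water_(g)": [10.0, 20.0],
    "Energ_Kcal": [100, 200],
    "Protein_(g)": [1.0, 2.0],
})

one_column = foods["Energ_Kcal"]                   # a single label gives a Series
two_columns = foods[["Shrt_Desc", "Energ_Kcal"]]   # a list of labels gives a DataFrame

col_names = foods.columns.tolist()
gram_columns = [name for name in col_names if name.endswith("(g)")]
gram_df = foods[gram_columns]                      # only the columns measured in grams
print(gram_df.columns.tolist())
```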
From the DataFrame.describe() documentation:
Returns
summary: NDFrame of summary statistics
Notes
The output DataFrame index depends on the requested dtypes:
For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.
For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN.
# This data set holds information about the passengers on the Titanic
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.describe())
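In the result, the count for the Age column is smaller than for the other columns, which means Age has many missing values. describe() therefore helps us inspect the data in detail and decide how to handle it.
The fillna() function can fill in the missing data. It is used as follows:
# Fill the missing values with the median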
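The effect on count can be reproduced without the CSV file. The following is a minimal sketch with a hypothetical frame `passengers` whose values are invented; it also uses isnull().sum() as one way to count the missing entries per column directly.

```python
import pandas

# Hypothetical passenger data; None becomes NaN in the Age column.
passengers = pandas.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

summary = passengers.describe()
print(summary.loc["count"])      # Age has a lower count than Fare

missing_per_column = passengers.isnull().sum()
print(missing_per_column)        # number of missing values in each column
```

count only includes non-missing entries, so a column whose count falls below the number of rows contains missing values.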
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Check which values occur in the Sex column, then encode them as numbers
print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
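The two cleaning steps can be sketched on a small made-up frame. The frame `passengers` is hypothetical, and Series.map is used here as an alternative to the two loc assignments above, not as the method the notes use.

```python
import pandas

# Hypothetical data with one missing age.
passengers = pandas.DataFrame({
    "Age": [20.0, None, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
})

# median() skips missing values, so it is computed from 20, 30 and 40.
median_age = passengers["Age"].median()
passengers["Age"] = passengers["Age"].fillna(median_age)

# Swapped-in technique: map each label to a number in a single step.
passengers["Sex"] = passengers["Sex"].map({"male": 0, "female": 1})

print(passengers)
```

With map, any label missing from the dictionary becomes NaN, which makes unexpected categories easy to spot.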
apply() applies a function you pass in along an axis of a DataFrame (to each column or each row) or to each element of a Series. We can pass in a lambda function, which lets us define the function inline.
Examples
--------
df.apply(numpy.sqrt) # returns DataFrame
df.apply(numpy.sum, axis=0) # equiv to df.sum(0)
df.apply(numpy.sum, axis=1) # equiv to df.sum(1)
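The documentation examples above can be run on a concrete frame, together with lambda functions. This is a sketch: `df` and its values are made up.

```python
import numpy
import pandas

# Hypothetical numeric frame.
df = pandas.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

roots = df.apply(numpy.sqrt)                 # DataFrame of square roots
column_sums = df.apply(numpy.sum, axis=0)    # one value per column
row_sums = df.apply(numpy.sum, axis=1)       # one value per row

# A lambda receives a whole column (a Series) when used with DataFrame.apply.
column_ranges = df.apply(lambda col: col.max() - col.min())

# With Series.apply the lambda receives one element at a time.
doubled = df["a"].apply(lambda x: x * 2)

print(column_ranges)
print(doubled)
```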
From the value_counts() documentation:
Parameters
----------
values : ndarray (1-d)
sort : boolean, default True
Sort by values
ascending : boolean, default False
Sort in ascending order
normalize: boolean, default False
If True then compute a relative histogram
bins : integer, optional
Rather than count values, group them into half-open bins,
convenience for pd.cut, only works with numeric data
dropna : boolean, default True
Don’t include counts of NaN
Returns
-------
value_counts : Series
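A few of these parameters can be tried through the Series.value_counts method, which accepts the same options except values. The sketch below uses two hypothetical Series with invented values.

```python
import pandas

# Hypothetical categorical-like data with one missing value.
classes = pandas.Series([3, 1, 3, 2, 3, 1, None])

counts = classes.value_counts()                  # NaN excluded, most frequent first
shares = classes.value_counts(normalize=True)    # relative frequencies
with_nan = classes.value_counts(dropna=False)    # NaN gets its own count

# Hypothetical numeric data grouped into two equal-width bins.
ages = pandas.Series([5, 15, 25, 35])
binned = ages.value_counts(bins=2)

print(counts)
print(binned)
```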
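for index, row in data.iterrows():
    print(index, row)
index: the row label (row number) of the current row
row: a Series holding the column labels and the values of that row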
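A small runnable version of the loop is sketched below. The frame `data`, its columns, and its custom row labels are made up to show that index is the label and not a positional counter.

```python
import pandas

# Hypothetical frame with non-default row labels 10 and 20.
data = pandas.DataFrame({"name": ["a", "b"], "score": [1, 2]}, index=[10, 20])

seen = []
for index, row in data.iterrows():
    # index is the row label; row is a Series keyed by column label
    seen.append((index, row["name"], row["score"]))

print(seen)
```

iterrows builds a new Series for every row, so for large tables column-wise operations or apply are usually faster.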