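Table of contents
- Introduction to pandas
- Data structures
- Creating data
- Accessing data
- Adding and deleting
- Inspecting data
- Processing data
- Lambda expressions
- Renaming rows, columns, and index labels
- Changing data types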
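Introduction to pandas
pandas is a library built on top of NumPy and created for data analysis tasks.
It brings together a number of libraries and standard data models, and provides the tools needed to work efficiently with large datasets. It also offers a large set of functions and methods for processing data quickly and conveniently.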
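Data structures
Series: a one-dimensional labelled array, similar to a one-dimensional NumPy array and close to Python's built-in list. A Series can hold values of different types, such as strings, booleans, and numbers.
Time series: a Series indexed by timestamps.
DataFrame: a two-dimensional tabular data structure, with much of the functionality of R's data.frame. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. It exists only in older pandas versions and has been removed from current releases.
Panel4D: a four-dimensional data container along the lines of Panel, likewise limited to older pandas versions.
PanelND: a module with a set of factory functions for creating N-dimensional named containers like Panel4D, likewise limited to older pandas versions.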
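As a rough illustration of the Series and DataFrame descriptions above, the sketch below builds two Series that share an index and combines them into a DataFrame. The variable names (`ages`, `cities`, `mixed`, `frame`) are made up for this example.

```python
import pandas as pd

# A Series is a labelled one-dimensional array.
ages = pd.Series([18, 30, 25], index=["Tom", "Bob", "Mary"], name="age")

# A Series can hold mixed Python objects; its dtype is then object.
mixed = pd.Series(["text", True, 3.14])

# A DataFrame behaves like a container of Series that share one index.
cities = pd.Series(["BeiJing", "ShangHai", "GuangZhou"], index=ages.index, name="city")
frame = pd.DataFrame({"age": ages, "city": cities})

# Selecting a single column gives back a Series.
age_column = frame["age"]
print(type(age_column).__name__)  # Series
print(mixed.dtype)                # object
```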
------------- The basic pandas commands are demonstrated below with examples -------------
Original data
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
Mary 25 GuangZhou
James 40 ShenZhen
Creating data
pd.Index() defines the index.
The data can be given as a dictionary.
pd.DataFrame(data=, index=)  # builds the DataFrame
import numpy as np
import pandas as pd
index = pd.Index(data=["Tom","Bob","Mary","James"],name="name")
data = {"age": [18, 30, 25, 40], "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]}
user_info = pd.DataFrame(data=data,index=index)
print(user_info)
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
Mary 25 GuangZhou
James 40 ShenZhen
The data can also be defined row by row, with the column names passed separately:
index = pd.Index(data=["Tom","Bob","Mary","James"],name='name')
data = [[18,"BeiJing"],
[30,"ShangHai"],
[25,"Guangzhou"],
[40,"ShenZhen"]]
columns = ["age","city"]
user_info = pd.DataFrame(data=data,index=index,columns=columns)
print(user_info)
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
Mary 25 Guangzhou
James 40 ShenZhen
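A third option, shown here only as a sketch, is to start from a list of dictionaries (one per row) and promote a column to the index afterwards with `set_index`. The `records` list is a hypothetical cut-down version of the sample data.

```python
import pandas as pd

# Hypothetical records: one dict per row, keys become column names.
records = [
    {"name": "Tom", "age": 18, "city": "BeiJing"},
    {"name": "Bob", "age": 30, "city": "ShangHai"},
]

# set_index promotes the "name" column to the row index.
user_info = pd.DataFrame(records).set_index("name")
print(user_info.loc["Bob", "age"])  # 30
```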
Accessing data
print(user_info.loc["Tom"])
age 18
city BeiJing
Name: Tom, dtype: object
print(user_info.iloc[1:3])
age city
name
Bob 30 ShangHai
Mary 25 Guangzhou
print(user_info.age)
name
Tom 18
Bob 30
Mary 25
James 40
Name: age, dtype: int64
print(user_info[["city","age"]])
city age
name
Tom BeiJing 18
Bob ShangHai 30
Mary Guangzhou 25
James ShenZhen 40
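The calls above mix label-based and position-based access. The following sketch, which rebuilds the sample DataFrame so that it runs on its own, puts the two side by side and adds a boolean filter. The variable names are illustrative.

```python
import pandas as pd

index = pd.Index(["Tom", "Bob", "Mary", "James"], name="name")
user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40],
     "city": ["BeiJing", "ShangHai", "Guangzhou", "ShenZhen"]},
    index=index,
)

# loc is label-based: row label first, then column label.
bob_city = user_info.loc["Bob", "city"]

# iloc is position-based: row 1, column 1 is the same cell.
same_cell = user_info.iloc[1, 1]

# Label slices with loc include both end points.
bob_to_mary = user_info.loc["Bob":"Mary"]

# A boolean mask keeps only the rows where the condition holds.
over_25 = user_info[user_info["age"] > 25]

print(bob_city, same_cell)   # ShangHai ShangHai
print(list(over_25.index))   # ['Bob', 'James']
```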
Adding and deleting
user_info["sex"]="male"
print(user_info)
age city sex
name
Tom 18 BeiJing male
Bob 30 ShangHai male
Mary 25 Guangzhou male
James 40 ShenZhen male
del user_info["sex"]
user_info
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
Mary 25 Guangzhou
James 40 ShenZhen
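user_info.drop("Tom")  # returns a new DataFrame; user_info itself is not modified
print(user_info)  # still contains Tom
print(user_info.drop("Tom"))  # the returned copy, without Tom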
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
Mary 25 Guangzhou
James 40 ShenZhen
age city
name
Bob 30 ShangHai
Mary 25 Guangzhou
James 40 ShenZhen
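The two printouts above differ because `drop` does not change the DataFrame it is called on. A minimal sketch of how to make the removal permanent, using a cut-down copy of the sample data:

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30, 25, 40]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# drop returns a new DataFrame; the original keeps all four rows.
without_tom = user_info.drop("Tom")
print(len(user_info), len(without_tom))  # 4 3

# Reassign the result to make the removal stick.
user_info = user_info.drop("Tom")

# Columns are dropped with the columns= keyword.
no_age = user_info.drop(columns=["age"])
print(list(no_age.columns))  # []
```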
Inspecting data
user_info.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Tom to James
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 4 non-null int64
1 city 4 non-null object
dtypes: int64(1), object(1)
memory usage: 256.0+ bytes
user_info.head(2)
age city
name
Tom 18 BeiJing
Bob 30 ShangHai
user_info.tail(2)
age city
name
Mary 25 Guangzhou
James 40 ShenZhen
user_info.age.max()
40
user_info.age.cumsum()
name
Tom 18
Bob 48
Mary 73
James 113
Name: age, dtype: int64
user_info.describe()
age
count 4.000000
mean 28.250000
std 9.251126
min 18.000000
25% 23.250000
50% 27.500000
75% 32.500000
max 40.000000
user_info.describe(include=["object"])
city
count 4
unique 4
top ShenZhen
freq 1
user_info.city.value_counts()
ShenZhen 1
Guangzhou 1
ShangHai 1
BeiJing 1
Name: city, dtype: int64
user_info.age.idxmax()
'James'
Processing data
pd.cut(user_info.age,3)
name
Tom (17.978, 25.333]
Bob (25.333, 32.667]
Mary (17.978, 25.333]
James (32.667, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]
pd.cut(user_info.age,[1,18,30,50])
name
Tom (1, 18]
Bob (18, 30]
Mary (18, 30]
James (30, 50]
Name: age, dtype: category
Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]
pd.cut(user_info.age,[1,18,30,50],labels=["childhood","youth","middle"])
name
Tom childhood
Bob youth
Mary youth
James middle
Name: age, dtype: category
Categories (3, object): ['childhood' < 'youth' < 'middle']
pd.qcut(user_info.age,3)
name
Tom (17.999, 25.0]
Bob (25.0, 30.0]
Mary (17.999, 25.0]
James (30.0, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
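`pd.cut` and `pd.qcut` look alike but split the data differently: `cut` makes bins of equal width, while `qcut` picks the edges so that each bin receives roughly the same number of values. The sketch below uses a made-up `ages` Series with one outlier so that the difference is visible.

```python
import pandas as pd

# Hypothetical ages; 60 stretches the range so the two methods diverge.
ages = pd.Series([18, 30, 25, 40, 22, 60])

# cut: three bins of equal width over the range 18..60.
equal_width = pd.cut(ages, 3)

# qcut: three bins chosen so each receives about the same number of values.
equal_count = pd.qcut(ages, 3)

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()
print(width_counts)  # [4, 1, 1]
print(count_counts)  # [2, 2, 2]
```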
user_info["sex"]=["male","male","female","male"]
print(user_info)
age city sex
name
Tom 18 BeiJing male
Bob 30 ShangHai male
Mary 25 Guangzhou female
James 40 ShenZhen male
user_info.sort_index()
age city sex
name
Bob 30 ShangHai male
James 40 ShenZhen male
Mary 25 Guangzhou female
Tom 18 BeiJing male
user_info.sort_index(axis=1,ascending=False)
sex city age
name
Tom male BeiJing 18
Bob male ShangHai 30
Mary female Guangzhou 25
James male ShenZhen 40
user_info.sort_values(by="age")
age city sex
name
Tom 18 BeiJing male
Mary 25 Guangzhou female
Bob 30 ShangHai male
James 40 ShenZhen male
user_info.sort_values(by=["age","city"])
age city sex
name
Tom 18 BeiJing male
Mary 25 Guangzhou female
Bob 30 ShangHai male
James 40 ShenZhen male
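Because every age in the sample data is unique, sorting by `["age", "city"]` gives the same result as sorting by `"age"` alone. The sketch below uses a hypothetical `people` DataFrame with a tie on age to show where the second key comes into play.

```python
import pandas as pd

# Hypothetical data with two people aged 30.
people = pd.DataFrame(
    {"age": [30, 18, 30], "city": ["ShangHai", "BeiJing", "GuangZhou"]},
    index=["Bob", "Tom", "Ann"],
)

# Sort by age first; rows with equal age are then ordered by city.
by_age_city = people.sort_values(by=["age", "city"])
print(list(by_age_city.index))  # ['Tom', 'Ann', 'Bob']

# A list for ascending sets the direction of each key separately.
mixed_order = people.sort_values(by=["age", "city"], ascending=[False, True])
print(list(mixed_order.index))  # ['Ann', 'Bob', 'Tom']
```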
user_info.age.nlargest(2)
name
James 40
Bob 30
Name: age, dtype: int64
Lambda expressions
user_info.age.map(lambda x:"yes" if x>=30 else "no")
name
Tom no
Bob yes
Mary no
James yes
Name: age, dtype: object
city_map={
"BeiJing":"north",
"ShangHai":"south",
"Guangzhou":"south",
"ShenZhen":"south"
}
user_info.city.map(city_map)
name
Tom north
Bob south
Mary south
James south
Name: city, dtype: object
user_info.apply(lambda x:x.max(),axis=0)
age 40
city ShenZhen
sex male
dtype: object
user_info.apply(lambda x:x.min(),axis=0)
age 18
city BeiJing
sex female
dtype: object
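The examples above use `Series.map` for element-by-element work and `DataFrame.apply` with `axis=0` for column-by-column work. As a sketch of the remaining cases, the block below applies a function row by row with `axis=1` and shows what `map` does with a dictionary that lacks a key. It uses a two-row copy of the sample data.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30], "city": ["BeiJing", "ShangHai"]},
    index=pd.Index(["Tom", "Bob"], name="name"),
)

# axis=0 (the default): the function receives each column as a Series.
column_max = user_info.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row as a Series.
summary = user_info.apply(lambda row: f"{row['city']}:{row['age']}", axis=1)

# Mapping with a dict leaves NaN wherever a value has no matching key.
region = user_info["city"].map({"BeiJing": "north"})

print(column_max["age"])         # 30
print(summary["Tom"])            # BeiJing:18
print(region.isna().tolist())    # [False, True]
```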
user_info.applymap(lambda x:str(x).lower())
age city sex
name
Tom 18 beijing male
Bob 30 shanghai male
Mary 25 guangzhou female
James 40 shenzhen male
user_info.applymap(lambda x:str(x).upper())
age city sex
name
Tom 18 BEIJING MALE
Bob 30 SHANGHAI MALE
Mary 25 GUANGZHOU FEMALE
James 40 SHENZHEN MALE
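Recent pandas releases (2.1 and later) deprecate `DataFrame.applymap` in favour of `DataFrame.map`, which does the same element-wise job. The sketch below picks whichever method is available, so it should run on either side of that change. The `frame` and `elementwise` names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"city": ["BeiJing"], "sex": ["male"]})

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
upper = elementwise(lambda x: str(x).upper())
print(upper.loc[0, "city"])  # BEIJING

# For a single string column, the .str accessor avoids the lambda altogether.
lower_city = frame["city"].str.lower()
print(lower_city[0])  # beijing
```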
Renaming rows, columns, and index labels
user_info.rename(columns={"age": "Age", "city": "City", "sex": "Sex"})
Age City Sex
name
Tom 18 BeiJing male
Bob 30 ShangHai male
Mary 25 Guangzhou female
James 40 ShenZhen male
user_info.rename(index={"Tom": "tom", "Bob": "bob"})
age city sex
name
tom 18 BeiJing male
bob 30 ShangHai male
Mary 25 Guangzhou female
James 40 ShenZhen male
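Like `drop`, `rename` returns a new object and leaves the original untouched, so the result has to be assigned to be kept. The sketch below also uses `rename_axis`, which changes the name of the index itself rather than its labels. The new name `"user"` is only an example.

```python
import pandas as pd

user_info = pd.DataFrame(
    {"age": [18, 30]},
    index=pd.Index(["Tom", "Bob"], name="name"),
)

# rename returns a new object, so assign it to keep the change.
renamed = user_info.rename(columns={"age": "Age"}, index={"Tom": "tom"})

# rename_axis changes the name of the index, not the row labels.
renamed = renamed.rename_axis("user")

print(list(renamed.columns), list(renamed.index), renamed.index.name)
print(list(user_info.columns))  # original is unchanged: ['age']
```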
Changing data types
user_info["age"].astype(float)
name
Tom 18.0
Bob 30.0
Mary 25.0
James 40.0
Name: age, dtype: float64
user_info["height"]=["178","168","178","180cm"]
pd.to_numeric(user_info.height,errors="coerce")
name
Tom 178.0
Bob 168.0
Mary 178.0
James NaN
Name: height, dtype: float64
pd.to_numeric(user_info.height,errors="ignore")
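With `errors="coerce"` the unparseable entry `"180cm"` becomes NaN, and `errors="ignore"` (deprecated in recent pandas releases) simply hands back the input unchanged. If the number itself should be kept, one option is to strip the unit before converting. This is a sketch that assumes the only stray text is a trailing `cm`.

```python
import pandas as pd

height = pd.Series(["178", "168", "178", "180cm"])

# errors="coerce" replaces anything unparseable with NaN.
coerced = pd.to_numeric(height, errors="coerce")
print(coerced.isna().sum())  # 1

# Strip the trailing unit first so that every entry parses.
cleaned = pd.to_numeric(height.str.replace("cm", "", regex=False))
print(cleaned.tolist())  # [178, 168, 178, 180]
print(cleaned.dtype)     # int64
```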