Pandas has two core concepts: Series and DataFrame. A Series is similar to a Python list and can be thought of as a single column in a table. A DataFrame is made up of Series and represents a whole table. Pandas' core value is filtering, transforming, aggregating, and analyzing tabular data, and producing charts from the results.
All the examples that follow are based on the following test data:
PV conversion data, variable name: pv_conv
首页PV | 搜索页PV | 注册数 | 下单用户数 | 订单数 |
---|---|---|---|---|
985 | 290 | 98 | 46 | 40 |
211 | 200 | 21 | 43 | 50 |
688 | 201 | 19 | 68 | 70 |
766 | 228 | 71 | 72 | 80 |
All code examples assume you have already imported the following Python packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import sys
import MySQLdb
pv_conv = {
'首页PV': [985, 211, 688, 766],
'搜索页PV': [290, 200, 201, 228],
'注册数': [98, 21, 19, 71],
'下单用户数': [46, 43, 68, 72],
'订单数': [40, 50, 70, 80],
}
1. Creating a DataFrame
1.1 From a dict
Creating a DataFrame directly in code is a common scenario, since the data can then come from anywhere and in any format. Many DataFrame operations revolve around the index column.
The DataFrame constructor itself is quite flexible: it natively accepts data shaped like the orient='split', orient='records', orient='index', and orient='columns' formats described in section 1.3 (creating from JSON).
1. split-like format
Pass index, columns, and data as three separate arguments:
index = ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]
columns = ["首页PV", "搜索页PV", "注册数", "下单用户数", "订单数"]
data = [
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]
]
record_json = pd.DataFrame(index=index, columns=columns, data=data)
print(record_json)
2. records-like format
Data shaped like orient='records' is also supported:
data = [
{"日期": "2020-09-01", "首页PV":"1", "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
{"日期": "2020-09-02", "首页PV":"2", "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
{"日期": "2020-09-03", "首页PV":"3", "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
{"日期": "2020-09-04", "首页PV":"4", "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
]
record_json = pd.DataFrame(data)
print(record_json)
3. columns-like format
pv_conv = {
'首页PV': [985, 211, 688, 766],
'搜索页PV': [290, 200, 201, 228],
'注册数': [98, 21, 19, 71],
'下单用户数': [46, 43, 68, 72],
'订单数': [40, 50, 70, 80],
}
record_json = pd.DataFrame(pv_conv)
print(record_json)
4. index-like format is not supported
This format uses the row position as the key and a row of data as the value, forming a dict. Passed to DataFrame directly, it cannot be distinguished from the orient='columns' shape.
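If you do have an index-keyed dict, `pd.DataFrame.from_dict` with `orient='index'` interprets it row-wise explicitly. A minimal sketch (the two rows of values here are made-up sample numbers):

```python
import pandas as pd

# A dict keyed by row label, one row of data per key -- ambiguous for the
# plain DataFrame() constructor, but from_dict(orient='index') reads it row-wise.
d = {
    "2020-09-01": [1, 2, 3, 4, 5],
    "2020-09-02": [6, 7, 8, 9, 10],
}
df = pd.DataFrame.from_dict(
    d, orient='index',
    columns=["首页PV", "搜索页PV", "注册数", "下单用户数", "订单数"],
)
print(df)
```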
5. values-like format is not supported
This format also conflicts with the split-like shape: the 2-D array you pass in is treated as the data argument, so your index and columns rows would be interpreted as data.
6. Setting the index
- Auto-generated index column
Suitable when the index column has no real meaning; it defaults to a 0-based positional index, like array subscripts.
df = pd.DataFrame(pv_conv)
Output
- Switching the index column manually
# pv_conv itself has no 日期 column, so add one before switching to it
df = pd.DataFrame(dict(pv_conv, 日期=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]))
df = df.set_index('日期')  # set_index returns a new DataFrame; assign it back
- Providing the index column separately
When preparing the data, pull the index column out yourself:
df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])
Output
1.2 From a CSV file
CSV file contents:
日期,首页PV,搜索页PV,注册数,下单用户数,订单数
2020-09-01,985,290,98,46,40
2020-09-02,211,200,21,43,50
2020-09-03,688,201,19,68,70
2020-09-04,766,228,71,72,80
1. The index column is auto-generated by default
csv = pd.read_csv('d:/pv_conv.csv')
print(csv)
Output
2. Specifying the index by column position
csv = pd.read_csv('d:/pv_conv.csv', index_col=0)
print(csv)
Output
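If you want to try this without a file on disk, `io.StringIO` lets `read_csv` parse a string directly. A self-contained sketch using the sample data above:

```python
import io
import pandas as pd

csv_text = """日期,首页PV,搜索页PV,注册数,下单用户数,订单数
2020-09-01,985,290,98,46,40
2020-09-02,211,200,21,43,50
2020-09-03,688,201,19,68,70
2020-09-04,766,228,71,72,80
"""
# index_col=0 makes the 日期 column the index, as in the file-based example
csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(csv)
```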
3. A pandas issue under Python 2.7
With the default index_col behavior, reading the CSV can leave the header row shifted one column relative to the data; specifying index_col=False avoids the problem.
csv = pd.read_csv("e:/wht_mobile.csv", index_col=False)
1.3 From JSON
1. String, split format
The JSON shape is specified with orient. In the orient='split' format, index and columns are separate fields, and data is a 2-D array in which each element corresponds to one row of data:
tj = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]
]
}
'''
split_json = pd.read_json(tj, orient='split')
print(split_json)
Every DataFrame created from JSON gets an auto-generated, 0-based, array-subscript-style index by default; if the data contains its own index column, you have to set it yourself.
Output
2. String, records format
In the orient='records' format, the data is an array of JSON objects, each object corresponding to one row:
rj = '''
[
{"日期": "2020-09-01", "首页PV":"1", "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
{"日期": "2020-09-02", "首页PV":"2", "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
{"日期": "2020-09-03", "首页PV":"3", "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
{"日期": "2020-09-04", "首页PV":"4", "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
]
'''
==A very nasty gotcha: a trailing "," after the last element is not allowed. JSON itself forbids it, but if you leave one in, the error message is cryptic; this cost me several hours.==
record_json = pd.read_json(rj, orient='records', encoding="utf-8")
print(record_json)
The index column is auto-generated by default; to set your own, change it with set_index. That method works for every creation path.
Output
3. String, index format
In the orient='index' format, the JSON becomes an object keyed by row index:
ridx = '''
{
"0":{"日期": "2020-09-01", "首页PV":"1", "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
"1":{"日期": "2020-09-02", "首页PV":"2", "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
"2":{"日期": "2020-09-03", "首页PV":"3", "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
"3":{"日期": "2020-09-04", "首页PV":"4", "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
}
'''
record_json = pd.read_json(ridx, orient='index', encoding="utf-8")
print(record_json)
4. String, columns format
columns is the default import format: the input JSON is an object in which each key holds one column of data:
cj = '''
{
"首页PV": [985, 211, 688, 766],
"搜索页PV": [290, 200, 201, 228],
"注册数": [98, 21, 19, 71],
"下单用户数": [46, 43, 68, 72],
"订单数": [40, 50, 70, 80],
"日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
}
'''
record_json = pd.read_json(cj, orient='columns', encoding="utf-8")
print(record_json)
5. String, values format
In the values format, the data is a 2-D JSON array, each element representing one row:
vj = '''
[
["首页PV","搜索页PV","注册数","下单用户数","订单数","日期"],
["985","290","98","46","40","2020-09-01"],
["211","200","21","43","50","2020-09-02"],
["688","201","19","68","70","2020-09-03"],
["766","228","71","72","80","2020-09-04"]
]
'''
record_json = pd.read_json(vj, orient='values', encoding="utf-8")
print(record_json)
6. From a file
A file differs from a string only in where the data comes from; the same orient values apply, and the default is orient='columns'.
JSON file contents:
{
"首页PV": [985, 211, 688, 766],
"搜索页PV": [290, 200, 201, 228],
"注册数": [98, 21, 19, 71],
"下单用户数": [46, 43, 68, 72],
"订单数": [40, 50, 70, 80],
"日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
}
Code
json = pd.read_json("d:/pv_conv.json")
json = json.set_index("日期")
print(json)
Set the index column to suit your needs; if you don't, the default is a 0-based numeric index, like array subscripts.
1.4 From a database table
# Import from a database
mysql_conn = MySQLdb.connect(host='192.168.36.92', user='hadoop', passwd='aa.123', port=3306, db='dp', charset="utf8")
db = pd.read_sql_query("select * from client", mysql_conn)
# print(db)
# Import with an explicit index column; using a positional index raises an error here
db = pd.read_sql_query("select * from client", mysql_conn, index_col='id')
# print(db)
# Switch the index column
db = db.set_index("name")
print(db)
2. Data operations
2.1 Filtering by the index column
1. Filtering with an auto-generated index
df = pd.DataFrame(pv_conv)
row = df.loc[0]
Output
2. Filtering with an explicitly provided index
df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])
row = df.loc['2020-09-01']
Output
3. Slicing by index range
Specify a start and end index; the returned slice includes both boundary values.
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.loc['2020-09-02':'2020-09-03'])
Output
2.2 Positional (row-number) indexing
1. Selecting by row position
Works exactly like array subscripting.
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.iloc[0])
Output
2. Slicing by a range of row positions
One quirky point here: a loc slice includes both endpoints, but iloc excludes the end value, just like ordinary Python slicing.
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.iloc[1:3])
Output
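The inclusive/exclusive difference is easy to verify side by side; a sketch reusing the same sample data:

```python
import pandas as pd

js = '''
{
 "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
 "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
 "data":[[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20]]
}
'''
pv_conv = pd.read_json(js, orient='split')
# loc slices by label and includes BOTH endpoints: 3 rows
print(len(pv_conv.loc['2020-09-02':'2020-09-04']))
# iloc slices by position and excludes the end, like a list: 2 rows
print(len(pv_conv.iloc[1:3]))
```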
2.3 First N rows, last N rows
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
print(pv_conv.head(2))
print(pv_conv.tail(1))
2.4 Deduplication
append concatenates two DataFrames. Here we append pv_conv to itself, so every row appears twice, and then remove the duplicates with drop_duplicates:
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)
pv_conv = pv_conv.append(pv_conv)
print(pv_conv.shape)
print(pv_conv)
# drop the duplicated rows
pv_conv = pv_conv.drop_duplicates()
print(pv_conv)
Output
The keep parameter controls which duplicate is retained:
Value | Meaning |
---|---|
keep='first' | keep the first occurrence (default) |
keep='last' | keep the last occurrence |
keep=False | drop every row that has a duplicate |
pv_conv = pv_conv.drop_duplicates(keep=False)
print(pv_conv)
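Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; the same union-then-dedup can be written with `pd.concat`. A sketch on a smaller made-up frame:

```python
import pandas as pd

js = '''
{
 "index":["2020-09-01","2020-09-02"],
 "columns":["首页PV","订单数"],
 "data":[[1,5],[6,10]]
}
'''
pv_conv = pd.read_json(js, orient='split')
# pd.concat takes a list of DataFrames; this doubles every row
doubled = pd.concat([pv_conv, pv_conv])
print(doubled.shape)
# drop_duplicates brings us back to the original rows
deduped = doubled.drop_duplicates()
print(deduped.shape)
```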
2.5 Handling missing values
1. Testing for null
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.isnull())
Output
The bare boolean matrix is rarely useful on its own.
2. Counting nulls
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.isnull().sum())
Output
3. Filling nulls
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
pv_conv = pv_conv.fillna('HI')
print(pv_conv)
Output
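Filling with a constant string turns numeric columns into objects; a common alternative is to fill each column with its own mean. A sketch on a small made-up frame:

```python
import pandas as pd

js = '''
{
 "index":["2020-09-01","2020-09-02","2020-09-03"],
 "columns":["首页PV","搜索页PV"],
 "data":[[1, 2],[3, null],[5, 8]]
}
'''
pv_conv = pd.read_json(js, orient='split')
# fillna accepts a Series of per-column fill values;
# df.mean() skips NaN by default, so 搜索页PV's mean is (2+8)/2 = 5
filled = pv_conv.fillna(pv_conv.mean())
print(filled)
```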
2.6 Data cleaning: dropping missing data
1. Dropping rows that contain nulls
A row is discarded if it contains even a single null value.
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
print(pv_conv.dropna())
Output
2. Dropping columns that contain nulls
The axis parameter of dropna chooses between rows and columns: the default axis=0 drops rows, while axis=1 drops columns.
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
print(pv_conv.dropna(axis=1))
Output
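dropna also takes subset (only consider the listed columns) and thresh (keep rows with at least N non-null values); a quick sketch on the same data:

```python
import pandas as pd

js = '''
{
 "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
 "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
 "data":[[1,2,3,4,5],[6,7,8,null,10],[11,null,13,14,15],[16,null,18,19,20]]
}
'''
pv_conv = pd.read_json(js, orient='split')
# only nulls in 下单用户数 matter -> drops just the one row with null there
print(pv_conv.dropna(subset=['下单用户数']).shape)
# keep rows with at least 5 non-null values -> only the complete first row survives
print(pv_conv.dropna(thresh=5).shape)
```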
2.7 Condition-based filtering
1. Filter conditions
DataFrame['column'] returns a Series; combining a Series with a comparison operator produces a filter condition, whose output looks much like that of isnull().
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
condition = (pv_conv['首页PV'] > 6)
print(condition)
Output
DataFrame[condition] returns the rows for which the condition is True.
2. Logical operators
Filter conditions support logical operators: & means AND, | means OR. Wrap each condition in parentheses, with & and | outside the parentheses.
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
condition = (pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)
print(condition)
print(pv_conv[condition])
print(pv_conv[(pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)])  # or inline the condition directly
3. Membership in a list, like SQL's IN
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv[pv_conv['首页PV'].isin([1, 11])])
Output
4. Selecting rows with a filter condition
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
home_page_pv = pv_conv[pv_conv['首页PV'] > 10]
print(home_page_pv)
Output
2.8 Transformation functions
Getting a column as a Series by name and calling its apply method with a transformation function converts every value in that column. Assign the result back to DataFrame['column'] for the change to take effect.
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[3, 7, 8, null, 10],
[5, null, 13, 14, 15],
[7, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
def multiply_pv(pv):
return pv * 100 + random.randint(0, 100)
temp = pv_conv['首页PV'].apply(multiply_pv)
print(temp)
pv_conv['首页PV'] = temp
print(pv_conv)
Output
3. Introspection, DML
3.1 info: column names, dtypes, and non-null counts
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.info())
Output
3.2 shape: row and column counts
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)
Output
This says the current DataFrame has 4 rows and 5 columns.
3.3 Getting the column names
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.columns)
Output
3.4 Renaming columns
1. Renaming a specific column
Rename the DataFrame's 订单数 column to order_count:
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.columns)
pv_conv.rename(columns={
'订单数': 'order_count'
}, inplace=True)
print(pv_conv.columns)
2. Batch renaming
Change several column names at once:
pv_conv.columns = ['首页PV', 'search_pv', 'register_count', '下单用户数', '订单数']
print(pv_conv.columns)
3. Rule-based renaming with a list comprehension
Useful when the new names follow a rule, e.g. upper-casing every column name:
pv_conv.columns = [c.upper() for c in pv_conv.columns]
print(pv_conv)
Output
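Instead of a list comprehension, rename also accepts a function that is applied to every column name. A sketch on a small frame with made-up English column names:

```python
import pandas as pd

df = pd.DataFrame({"home_pv": [1, 2], "search_pv": [3, 4]})
# rename(columns=callable) applies the function to each column name
df = df.rename(columns=str.upper)
print(df.columns.tolist())
```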
4. Table operations
4.1 union
Concatenate the contents of two DataFrames into one:
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)
pv_conv = pv_conv.append(pv_conv)
print(pv_conv.shape)
4.2 join
With the DataFrame.join method we can join one DataFrame against one or more other Series or DataFrames; pass a list for multiple.
Signature
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Parameters:
Parameter | Meaning |
---|---|
other | the Series or DataFrame(s) to join; pass a list for multiple |
on | column in the calling DataFrame to join on; by default the join uses the index |
how | join type, analogous to SQL's left join / right join; one of left, right, outer, inner |
lsuffix | suffix appended to overlapping column names from the left side, to avoid clashes |
rsuffix | suffix appended to overlapping column names from the right side, to avoid clashes |
sort | whether to sort the result by the join key |
Example:
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
js2 = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["APP下载","APP活跃"],
"data":[
[111,112],
[121,122],
[131,132],
[141,142]
]
}
'''
print(pv_conv)
pv_conv2 = pd.read_json(js2, orient='split')
print(pv_conv2)
pv_conv3 = pv_conv.join(pv_conv2)
print(pv_conv3)
Output
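When the two frames share a column name, join raises an error unless you supply lsuffix/rsuffix. A sketch of the collision case with two made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"pv": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"pv": [10, 20]}, index=["a", "b"])
# Both frames have a 'pv' column: the suffixes keep the two copies apart
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(joined)
```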
4.3 Set difference
DataFrame itself has no direct set-difference method, so we build one with drop_duplicates: append pv_conv2 to pv_conv twice, which guarantees that every pv_conv2 row is duplicated, then pass keep=False so that all of pv_conv2's rows are removed.
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
js2 = '''
{
"index":["2020-09-01","2020-09-02"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10]
]
}
'''
pv_conv2 = pd.read_json(js2, orient='split')
pv_diff = pv_conv.append(pv_conv2)
pv_diff = pv_diff.append(pv_conv2)
pv_diff = pv_diff.drop_duplicates(keep=False)
print(pv_diff)
Output
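If the rows to subtract are identified by their index labels, Index.difference gives the same result more directly. A sketch with made-up single-column frames:

```python
import pandas as pd

a = pd.DataFrame({"v": [1, 2, 3, 4]},
                 index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])
b = pd.DataFrame({"v": [1, 2]}, index=["2020-09-01", "2020-09-02"])
# keep the rows of a whose index labels do not appear in b
diff = a.loc[a.index.difference(b.index)]
print(diff)
```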
4.4 Statistics
1. describe: count, mean, min/max, percentiles
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.describe())
Output
For non-numeric columns, describe instead returns the row count, the number of distinct values, the most frequent value, and its frequency.
2. value_counts: frequency statistics
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.value_counts())
print(pv_conv['首页PV'].value_counts())
The practical value of DataFrame.value_counts is debatable; Series.value_counts returns how often each distinct value occurs.
Output
3. corr: correlation between columns
Computes the pairwise correlation of the columns. A positive value means positive correlation (both rise together), a negative value means negative correlation (one rises as the other falls); the closer to 1 in magnitude, the stronger the relationship.
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.corr())
Output
4. Percentiles
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[3, 7, 8, null, 10],
[5, null, 13, 14, 15],
[7, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(pv_conv['首页PV'].quantile(0.25))
Output
5. Mean
Example (csv here is assumed to be a DataFrame that has a 用户ID column):
print(csv[csv['用户ID'] < 18000].head(3)['用户ID'].mean())
4.5 Column selection
For row slicing, see sections 2.1 and 2.2 on slicing by index and by positional subscript.
1. Getting specific columns
pv_conv['订单数'] returns a Series; to get a DataFrame back, pass a list instead: pv_conv[['订单数', '首页PV']].
Example
js = '''
{
"index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
"columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
"data":[
[1, 2, 3, 4, 5],
[6, 7, 8, null, 10],
[11, null, 13, 14, 15],
[16, null, 18, 19, 20]
]
}
'''
pv_conv = pd.read_json(js, orient='split')
print(type(pv_conv['订单数']))
print(pv_conv['订单数'])
print(type(pv_conv[['订单数', '首页PV']]))
print(pv_conv[['订单数', '首页PV']])
Output
5. Export
5.1 to_csv
If the output contains Chinese text, you must specify the encoding:
db.to_csv('e:/test_csv.csv', encoding='utf-8')
5.2 to_json
The output format can be chosen with orient; the default is orient='columns'.
db.to_json("e:/test_json.json")
5.3 to_sql
db.to_sql('table_name',mysql_conn)
References
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/