Pandas中DataFrame的操作

Pandas有两个核型概念: Series、DataFrame。 Series类似于Python中的list,可以理解为表格中的一列数据。 DataFrame包含Series,理解为一个表格。核心价值就在于基于表格数据的过滤、转换、统计、分析并输出可视化图表。

后面所有的示例都是基于以下几份测试数据:

PV转换数据,变量名: pv_conv

首页PV 搜索页PV 注册数 下单用户数 订单数
985 290 98 46 40
211 200 21 43 50
688 201 19 68 70
766 228 71 72 80

所有的代码示例都假设你已经导入如下python包:

import pandas as pd
import numpy as np
import matplotlib as plt
import sys
import MySQLdb

pv_conv = {
    '首页PV': [985, 211, 688, 766],
    '搜索页PV': [290, 200, 201, 228],
    '注册数': [98, 21, 19, 71],
    '下单用户数': [46, 43, 68, 72],
    '订单数': [40, 50, 70, 80],
}

1. 创建DataFrame

1.1 基于dict创建

基于代码来创建DataFrame的场景还是很常见的,这样数据就可以是来自任意位置、采用任意格式。DataFrame很多操作都是基于index列完成的。

DataFrame的处理本省兼容性很强,天然支持类似2.1.3 基于JSON创建里的orient='split'orient='records'orient='index'orient='columns'

1. 类orient='split'格式

我们需要将indexcolumnsdata分成3个参数传递:

index = ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]
columns = ["首页PV", "搜索页PV", "注册数", "下单用户数", "订单数"]
data = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20]
]

record_json = pd.DataFrame(index=index, columns=columns, data=data)
print(record_json)
2. 类orient='records'格式

也能支持使用orient='records'格式的数据:

data = [
    {"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
    {"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
    {"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
    {"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
]

record_json = pd.DataFrame(data)
print(record_json)
3. 类orient='columns'格式
pv_conv = {
    '首页PV': [985, 211, 688, 766],
    '搜索页PV': [290, 200, 201, 228],
    '注册数': [98, 21, 19, 71],
    '下单用户数': [46, 43, 68, 72],
    '订单数': [40, 50, 70, 80],
}

record_json = pd.DataFrame(pv_conv)
print(record_json)
4. 不支持类orient='index'格式

这种格式以数组下标做为key,以行数据做为value,组成一个dict传给DataFrameDataFrame无法区分orient='columns'

5. 不支持orient='values'格式

这种格式和类orient='split'格式也是冲突的, 传入的二维数组会被当成data字段,导致你的indexcolumns都被当做数据

6. index的设定
  • 自动生成index

适用于index列无实际意义的场景,index默认从0开始,相当于数组的位置索引。

df = pd.DataFrame(pv_conv)

输出

image.png
  • 手动切换index
df = pd.DataFrame(pv_conv)
df.set_index('日期')
  • 额外提供index

数据处理的时候,自己将index列单独拿出。

df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-03", "2020-09-03", "2020-09-04"])

输出

image.png
1.2 基于csv文件创建

csv文件内容:

日期,首页PV,搜索页PV,注册数,下单用户数,订单数
2020-09-01,985,290,98,46,40
2020-09-03,211,200,21,43,50
2020-09-03,688,201,19,68,70
2020-09-04,766,228,71,72,80
1. 默认index列自动生成
csv = pd.read_csv('d:/pv_conv.csv')
print(csv)

输出

image.png
2. 通过列索引指定index
csv = pd.read_csv('d:/pv_conv.csv', index_col=0)
print(csv)

输出

image.png
3. Python 2.7下pandas遇到的问题

csv文件如果使用index_col(默认使用),会导致列头和实际数据错位一列,需要指定index_col=False才能避免改问题。

csv = pd.read_csv("e:/wht_mobile.csv", index_col=False)
1.3 基于json创建
1. 字符串,格式split

可以通过orient指定json格式,如果orient='split'格式, indexcolumns指定单独字段,data是二维数组,数组里每一个元素对应数据的一行

{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]
     ]
}

代码

split_json = pd.read_json(tj, orient='split')
print(split_json)

所有JSON格式创建的DataFrame都需要默认生成0开始的类似数组下标的index,如果数据里有index列,需要你自己指定。

输出

image.png
2. 字符串,格式reacords

如果orient='records'格式, 数据是json对象的数组,数组里的每一个对象对应一行记录:

[
    {"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
    {"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
    {"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
    {"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
]

==这里有个非常坑的点是,数据元素后面的",", 在最后一个元素后面不能有,json格式本身也不允许,但是如果你放了提示很诡异,这个问题困扰了我好几个小时==

代码

record_json = pd.read_json(rj, orient='records', encoding="utf-8", dtype={"a": str, "b": str})
print(record_json)

默认是自动生成index列,如果要自己指定,通过set_index修改,这个方法是通用的

输出

image.png
3. 字符串,格式index

如果orient='index'格式,数据的json格式会变成object,以位置索引为key

{
    "0":{"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
    "1":{"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
    "2":{"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
    "3":{"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
}

代码

record_json = pd.read_json(ridx, orient='index', encoding="utf-8")
print(record_json)
4. 字符串,格式columns

columns是导入的默认格式,输入json是一个object,每个key对应一列数据

{
    "首页PV": [985, 211, 688, 766],
    "搜索页PV": [290, 200, 201, 228],
    "注册数": [98, 21, 19, 71],
    "下单用户数": [46, 43, 68, 72],
    "订单数": [40, 50, 70, 80],
    "日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
}

代码

record_json = pd.read_json(ridx, orient='columns', encoding="utf-8")
print(record_json)
5. 字符串,格式values

如果指定格式values,数据是一个json的二维数据,每一个元素代表一行

[
    ["首页PV","搜索页PV","注册数","下单用户数","订单数","日期"],
    ["985","290","98","46","40","2020-09-01"],
    ["211","200","21","43","50","2020-09-02"],
    ["688","201","19","68","70","2020-09-03"],
    ["766","228","71","72","80","2020-09-04"]
]

代码

record_json = pd.read_json(ridx, orient='values', encoding="utf-8")
print(record_json)
6. 从文件创建

文件和字符串的区别仅仅是数据来源问题,同样适用上面提到的orient,默认orient=columns

json文件内容:

{
    "首页PV": [985, 211, 688, 766],
    "搜索页PV": [290, 200, 201, 228],
    "注册数": [98, 21, 19, 71],
    "下单用户数": [46, 43, 68, 72],
    "订单数": [40, 50, 70, 80],
    "日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
}

代码

json = pd.read_json("d:/pv_conv.json")
json = json.set_index("日期")
print(json)

按自己的需求指定index列,如果不指定默认是类似数组下标索引,从0开始的数字。

1.4 从数据库表创建
# 通过数据库导入
mysql_conn = MySQLdb.connect(host='192.168.36.92', user='hadoop', passwd='aa.123', port=3306, db='dp', charset="utf8")
db = pd.read_sql_query("select * from client", mysql_conn)
# print(db)

# 通过数据库导入, 指定索引列, 这里用位置索引报错
db = pd.read_sql_query("select * from client", mysql_conn, index_col='id')
# print(db)

# 切换索引列
db = db.set_index("name")
print(db)

2. 数据操作

2.1 基于index列的筛选
1. 自动生成index列的筛选
df = pd.DataFrame(pv_conv)
row = df.loc[0]

输出

image.png
2. 额外提供index列的筛选
df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-03", "2020-09-03", "2020-09-04"])
row = df.loc['2020-09-01']

输出

image.png
3. 基于index范围切片

指定index的起始范围,返回包含两端的边界值。

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.loc['2020-09-02':'2020-09-03'])

输出

image.png
2.2 基于行位置索引
1. 根据行位置筛选

操作和数组下标一致

示例


js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.iloc[0])

输出

image.png
2. 基于行位置范围筛选

这里比较诡异的一个点是,loc是包含起始值的,但iloc不包含结束值。

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.iloc[1:3])

输出

image.png
2.3 前N行、后N行
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv)

print(pv_conv.head(2))
print(pv_conv.tail(1))
2.4 数据去重,基于index

append是将两个DataFrame拼接,这里我们用pv_conv拼接自己,导致数据重复出现,再使用drop_duplicates去重

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)

pv_conv = pv_conv.append(pv_conv)
print(pv_conv.shape)

print(pv_conv)

# 删除重复的
pv_conv = pv_conv.drop_duplicates()
print(pv_conv)

输出

image.png

通过指定keep参数,控制保留那条数据:

取值 说明
keep=first 保留第一条数据
keep=last 保留最后一条
keep=False 删除所有重复
pv_conv = pv_conv.drop_duplicates(keep=False)
print(pv_conv)
2.5 空值处理
1. 判断是否为null
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.isnull())

输出

image.png

单纯的判断感觉意义不大

2. 空值统计
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.isnull().sum())

输出

image.png
3. 填充空值
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv)

pv_conv = pv_conv.fillna('HI')
print(pv_conv)

输出

image.png
2.6 数据清洗,删除空数据
1. 删除有空值的行

删除有空值的行,只要有一个空值这一行就会被丢弃

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
print(pv_conv.dropna())

输出

image.png
2. 删除有空值的列

我们能通过dropna的参数axis指定删除行还是列,默认axis=0表示删除行,指定axis=1表示删除列

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv)
print(pv_conv.dropna(axis=1))

输出

image.png
2.7 基于条件的筛选
1. 过滤条件

通过DataFrame['列名']能拿到Series类型,Series配合条件运算符会生成我们的过滤条件,过滤条件的输出类似于isnull()方法的输出。

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

condition = (pv_conv['首页PV'] > 6)
print(condition)

输出

image.png

DataFrame[过滤条件]返回的是结果为True的那些行。

2. 逻辑运算符

过滤条件支持逻辑运算符,&表示与,|表示或。每个条件都用()包含,&|在括号外。

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

condition = (pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)
print(condition)

print(pv_conv[condition])
print(pv_conv[(pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)]) # 也可以直接使用
3. 在列表中,类似数据库的in

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

print(pv_conv[pv_conv['首页PV'].isin([1, 11])])

输出

image.png
4. 基于过滤条件的筛选

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv)

home_page_pv = pv_conv[pv_conv['首页PV'] > 10]
print(home_page_pv)

输出

image.png
2.8 转换函数

通过列名获取DataFrameSeries,结合apply方法,提供转换函数,能对特定列数据进行转换。转换结果需要赋值会DataFrame['列名']后才能使修改生效。

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [3, 7, 8, null, 10],
            [5, null, 13, 14, 15],
            [7, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

def multiply_pv(pv):
    return pv * 100 + random.randint(0, 100)

temp = pv_conv['首页PV'].apply(multiply_pv)
print(temp)

pv_conv['首页PV'] = temp
print(pv_conv)

输出

image.png

3. 反射、DML

3.1 info 获取列名和类型,非空行数
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.info())

输出

image.png
3.2 shape 获取行数和列数
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)

输出

说明当前DataFrame有4行5列

image.png
3.3 获取列名
pv_conv = pd.read_json(js, orient='split')
print(pv_conv.columns)

输出

image.png
3.4 修改列名
1. 修改指定列列名

DataFrame中的订单数列名改成order_count

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.columns)

pv_conv.rename(columns={
    '订单数': 'oder_count'
}, inplace=True)

print(pv_conv.columns)
2. 批量改名

一次修改多个列名:

pv_conv.columns = ['首页PV', 'search_pv', 'register_count', '下单用户数', '订单数']
print(pv_conv.columns)
3. 基于列表解析,用函数自动改名

适用场景是基于规则修改,比如将列名都改成大写的

pv_conv.columns = [c.upper() for c in temp]
print(pv_conv)

输出

image.png

4. 表操作

4.1 union

将两个DataFrame内容拼接成一个DataFrame

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.shape)

pv_conv = pv_conv.append(pv_conv)
print(pv_conv.shape)
4.2 join

通过DataFrame.join方法我们可以见一个DataFrame和另外的一个到多个SeriesDataFrame做关联操作,如果是多个用list传入。

格式

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

参数明说:

参数名 说明
other 要关联的SeriesDataFrame,多个用数组传参
on 用来和DataFrame关联的other上的字段名称,默认是用index字段关联
how 关联方式,类型与数据库关联的left joinright join的概念,可选值是:left,right, outer, inner
lsuffix 左侧字段名加前缀,用于避免重名
rsuffix 右侧字段名加前缀,用于避免重名
sort 是否对结果按关联字段排序

示例:

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

js2 = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["APP下载","APP活跃"],
    "data":[
            [111,112],
            [121,122],
            [131,132],
            [141,142]
     ]
}
'''

print(pv_conv)

pv_conv2 = pd.read_json(js2, orient='split')
print(pv_conv2)

pv_conv3 = pv_conv.join(pv_conv2)
print(pv_conv3)

输出

image.png
4.3 差集

DataFrame本身不直接提供方法支持差集,我们是利用drop_duplicate实现的差集,pv_conv拼接两次pv_conv2,这样会导致pv_conv2上的index必然重复,再指定keep=False这样就保证pv_conv2上的说有index只都会删除。

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

js2 = '''
{
    "index":["2020-09-01","2020-09-02"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10]
     ]
}
'''

pv_conv2 = pd.read_json(js2, orient='split')

pv_diff = pv_conv.append(pv_conv2)
pv_diff = pv_diff.append(pv_conv2)
pv_diff = pv_diff.drop_duplicates(keep=False)

print(pv_diff)

输出

image.png
4.4 统计
1. describe统计: 个数、均值、最大最小值、百分位
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.describe())

输出

image.png

对于非数值类型,返回的就是 行数、distinct行数、出现频率最高的关键字、出现的次数

2. value_count: 出现频率统计
js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.value_counts())
print(pv_conv['首页PV'].value_counts())

DataFrame.value_count的实际作用待考究。Serries返回每个特定值出现的频率

输出

image.png
3. corr: 计算列之间的相关性

计算各个列之间的相关性,正数表明正相关(一个涨另一个也上涨), 负数表明负相关(一个涨一个跌),越接近1表示相关性越大

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(pv_conv.corr())

输出

image.png
4. 百分位

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [3, 7, 8, null, 10],
            [5, null, 13, 14, 15],
            [7, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')

print(pv_conv['首页PV'].quantile(0.25))

输出

image.png
5. 平均值

示例

print(csv[csv['用户ID'] < 18000].head(3)['用户ID'].mean())
4.5 列裁剪

关于行裁切可以看一下2.2.12.2.2基于index和数组下标的切片。

1. 获取特定列

默认pv_conv['订单数']返回的是Series,如果想要返回DataFrame需要传递一个数组,pv_conv[['订单数','首页PV']

示例

js = '''
{
    "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
    "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
    "data":[
            [1, 2, 3, 4, 5],
            [6, 7, 8, null, 10],
            [11, null, 13, 14, 15],
            [16, null, 18, 19, 20]
     ]
}
'''

pv_conv = pd.read_json(js, orient='split')
print(type(pv_conv['订单数']))
print(pv_conv['订单数'])

print(type(pv_conv[['订单数', '首页PV']]))
print(pv_conv[['订单数', '首页PV']])

输出

image.png

5. 导出

5.1 to_csv

如果写出有中文,必须指定encoding

db.to_csv('e:/test_csv.csv', encoding='utf-8')
5.2 to_json

可以通过orient指定输出格式,默认是orient='columns'

db.to_json("e:/test_json.json")
5.3 to_sql
db.to_sql('table_name',mysql_conn)

参考资料

  1. https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

 
 
 

你可能感兴趣的:(Pandas中DataFrame的操作)