Day 2 - Python Pandas Basics (Datawhale)

1. Pandas Basics

import numpy as np
import pandas as pd
print("pandas的版本是:",pd.__version__)
pandas的版本是: 1.1.3

1.1 Reading and Saving Files

1.1.1 Reading Files

1.1.1.1 Reading CSV files

pd.read_csv(). Keeping the data file under the working directory (here a data/ folder) avoids unnecessary path trouble.

df_csv = pd.read_csv('data/my_csv.csv',header=None,usecols=[3,4])
df_csv
3 4
0 col4 col5
1 apple 2020/1/1
2 banana 2020/1/2
3 orange 2020/1/5
4 lemon 2020/1/7

1.1.1.2 Reading TXT files

pd.read_table()

When a txt file uses a separator other than the default tab, pass sep to specify the delimiter.

df_txt = pd.read_table('data/my_table.txt',index_col=['col1'])
df_txt
col2 col3 col4
col1
2 a 1.4 apple 2020/1/1
3 b 3.4 banana 2020/1/2
6 c 2.5 orange 2020/1/5
5 d 3.2 lemon 2020/1/7
df_txt1 = pd.read_table('data/my_table_special_sep.txt')
df_txt1
col1 |||| col2
0 TS |||| This is an apple.
1 GQ |||| My name is Bob.
2 WT |||| Well done!
3 PT |||| May I help you?
df_t = pd.read_table('data/my_table_special_sep.txt',sep='\|\|\|\|',engine='python')
df_t
col1 col2
0 TS This is an apple.
1 GQ My name is Bob.
2 WT Well done!
3 PT May I help you?

Thoughts: the default parser is the C engine; with engine='python' a richer set of separators can be handled, including regular expressions.

sep is interpreted as a regular expression, so | has to be escaped as \|. (To be expanded after covering regular expressions.)

1.1.1.3 Reading Excel files

pd.read_excel

df_excel = pd.read_excel('data/my_excel.xlsx',nrows=2, parse_dates=['col5'])
df_excel
col1 col2 col3 col4 col5
0 2 a 1.4 apple 2020-01-01
1 3 b 3.4 banana 2020-01-02

Common parameters:

header=None: do not treat the first row as column names

index_col: use one or more columns as the index

usecols: the subset of columns to read; all columns by default

parse_dates: columns to parse as datetimes

nrows: the number of rows to read

Note: when header=None is used, the values passed to usecols refer to the new (integer) column labels, as in the read_csv example above. A combined example follows this list.
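
A minimal sketch combining several of these parameters (assuming the same data/my_csv.csv file used above):

df_combined = pd.read_csv('data/my_csv.csv', index_col='col1', parse_dates=['col5'], nrows=2) # col1 as index, col5 parsed as datetime, only the first 2 rows
df_combined
col2 col3 col4 col5
col1
2 a 1.4 apple 2020-01-01
3 b 3.4 banana 2020-01-02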

1.1.2 Saving Data

1.1.2.1 Saving CSV

table_name.to_csv

df_csv1 = pd.read_csv('data/my_csv.csv')  
df_csv1
col1 col2 col3 col4 col5
0 2 a 1.4 apple 2020/1/1
1 3 b 3.4 banana 2020/1/2
2 6 c 2.5 orange 2020/1/5
3 5 d 3.2 lemon 2020/1/7
df_csv1.to_csv('data/my_csv_saved_mine.csv',index=False)

Notes:

(1) If a file with the same name already exists, it is overwritten.

(2) The file name must include the extension; to_csv does not add .csv automatically.

(3) index is usually set to False so the row index is not written out as an extra column.

1.1.2.2 Saving TXT

table_name.to_csv

df_txt1 = pd.read_table('data/my_table.txt')
df_txt1
col1 col2 col3 col4
0 2 a 1.4 apple 2020/1/1
1 3 b 3.4 banana 2020/1/2
2 6 c 2.5 orange 2020/1/5
3 5 d 3.2 lemon 2020/1/7
df_txt1.to_csv('data/my_txt_saved_mine.txt',sep='\t',index=False)

Note:

to_csv can also save txt files with a custom separator; a tab ('\t') is the most common choice.

1.1.2.3 Saving Excel files

table_name.to_excel

df_excel1 = pd.read_excel('data/my_excel.xlsx')
df_excel1
col1 col2 col3 col4 col5
0 2 a 1.4 apple 2020/1/1
1 3 b 3.4 banana 2020/1/2
2 6 c 2.5 orange 2020/1/5
3 5 d 3.2 lemon 2020/1/7
df_excel1.to_excel('data/my_excel_saved_mine.xlsx')

1.1.2.4 Converting a table to Markdown

to_markdown

First install the tabulate package (in the shell): pip install tabulate

import tabulate # only needs to be installed; to_markdown uses it under the hood
print(df_csv1.to_markdown())
|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

1.1.2.5 Converting a table to LaTeX

to_latex

print(df_csv1.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

1.2 Basic Data Structures

pandas has two basic data structures: the one-dimensional Series and the two-dimensional DataFrame. The focus here is on their attributes and methods.

1.2.1 Series

Components:

data: the values

index: the index

dtype: the storage type

name: the name of the series

The index can be given a name; it is empty by default.

s = pd.Series(data=[100, 'a', {'dict1': 5}],                        # values
              index=pd.Index(['id1', 20, 'third'], name='my_idx'),  # index values and name
              dtype='object',                                       # storage type
              name='my_name')                                       # name of the whole series
s
my_idx
id1               100
20                  a
third    {'dict1': 5}
Name: my_name, dtype: object

Note: object is a mixed type that can hold different kinds of data; a series of pure strings is also stored as object by default, although it can be stored with the dedicated string dtype instead.
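
A small sketch of the difference (pandas 1.0+ provides the dedicated string dtype):

s_obj = pd.Series(['apple', 'banana']) # pure strings default to the object dtype
s_str = pd.Series(['apple', 'banana'], dtype='string') # explicit string dtype
print(s_obj.dtype, s_str.dtype)
object string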

1.2.1.1 Accessing attributes

Series_name.values: get the values

Series_name.index: get the index

Series_name.dtype: get the storage type

Series_name.name: get the name of the series

Series_name.shape: get the shape (length) of the series

Series_name[index_name]: get the value for a single index label

s.values
array([100, 'a', {'dict1': 5}], dtype=object)
s.index
Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
dtype('O')
s.name
'my_name'
s.shape
(3,)

Note on shape: shape[0] is the size of the outermost axis, shape[1] the size of the next axis inward, and so on from outer to inner. A Series is one-dimensional, so shape directly gives the number of elements, hence (3,).

x  = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[0,1,2]],[[3,4,5],[6,7,8]]])
x
array([[[1, 2, 3],
        [4, 5, 6]],

       [[7, 8, 9],
        [0, 1, 2]],

       [[3, 4, 5],
        [6, 7, 8]]])
print(x.shape)
print(x.shape[0]) # the outermost axis holds 3 two-dimensional 2×3 arrays
print(x.shape[1]) # the next axis holds 2 one-dimensional arrays
print(x.shape[2]) # each innermost one-dimensional array holds 3 elements
(3, 2, 3)
3
2
3
s['third']
{'dict1': 5}

1.2.2 DataFrame

A DataFrame adds column labels on top of the Series: a data frame consists of two-dimensional data together with row and column indexes.

DataFrame_name.values: get the values

DataFrame_name.index: get the row index

DataFrame_name.dtypes: get the storage types, returned as a Series of each column's dtype

DataFrame_name.columns: get the column labels

DataFrame_name.shape: get the (rows, columns) shape of the data frame

DataFrame_name[column_name]: select columns; either a single column name or a list of several columns can be passed

DataFrame.T: transpose

data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.3]]
data
[[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.3]]
df = pd.DataFrame(data=data,
                 index=['row_%d'%i for i in range(3)],
                 columns=['col_0', 'col_1', 'col_2'])
df
col_0 col_1 col_2
row_0 1 a 1.2
row_1 2 b 2.2
row_2 3 c 3.3

Note on the format specifiers used above: %d integer; %f float; %s string.

df.values
array([[1, 'a', 1.2],
       [2, 'b', 2.2],
       [3, 'c', 3.3]], dtype=object)
df.index
Index(['row_0', 'row_1', 'row_2'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes
col_0      int64
col_1     object
col_2    float64
dtype: object
df.shape
(3, 3)
df[['col_0','col_1']] # selecting several columns requires passing the names as a list, hence the double brackets
col_0 col_1
row_0 1 a
row_1 2 b
row_2 3 c
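
By contrast, selecting a single column with a plain string returns a Series rather than a DataFrame (a quick sketch on the df defined above):

df['col_0']
row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64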
df.T
row_0 row_1 row_2
col_0 1 2 3
col_1 a b c
col_2 1.2 2.2 3.3

1.3 Common Basic Functions

df = pd.read_csv('data/learn_pandas.csv')
df.head()
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
1 Peking University Freshman Changqiang You Male 166.5 70.0 N 1 2019/9/4 0:04:20
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N 2 2019/9/12 0:05:22
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N 2 2020/1/3 0:04:08
4 Fudan University Sophomore Gaojuan You Male 174.0 74.0 N 2 2019/11/6 0:05:22
# keep the first 7 columns
df = df[df.columns[:7]]
df.head()
School Grade Name Gender Height Weight Transfer
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N
1 Peking University Freshman Changqiang You Male 166.5 70.0 N
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N
4 Fudan University Sophomore Gaojuan You Male 174.0 74.0 N

1.3.1 Summary Functions

head: return the first n rows (default 5)

tail: return the last n rows (default 5)

info: return an overview of the table

describe: return the main statistics of the numeric columns (for a fuller report the pandas-profiling package can be used; see the sketch right after this list)
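
A hedged sketch of the pandas-profiling usage mentioned above (assuming the package is installed with pip install pandas-profiling; the exact API may differ between versions):

from pandas_profiling import ProfileReport
profile = ProfileReport(df) # build an exploratory report for the whole DataFrame
profile.to_file('report.html') # save it as an HTML file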

df.head()
School Grade Name Gender Height Weight Transfer
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N
1 Peking University Freshman Changqiang You Male 166.5 70.0 N
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N
4 Fudan University Sophomore Gaojuan You Male 174.0 74.0 N
df.tail()
School Grade Name Gender Height Weight Transfer
195 Fudan University Junior Xiaojuan Sun Female 153.9 46.0 N
196 Tsinghua University Senior Li Zhao Female 160.9 50.0 N
197 Shanghai Jiao Tong University Senior Chengqiang Chu Female 153.9 45.0 N
198 Shanghai Jiao Tong University Senior Chengmei Shen Male 175.3 71.0 N
199 Tsinghua University Sophomore Chunpeng Lv Male 155.7 51.0 N
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
df.describe()
Height Weight
count 183.000000 189.000000
mean 163.218033 55.015873
std 8.608879 12.824294
min 145.400000 34.000000
25% 157.150000 46.000000
50% 161.900000 51.000000
75% 167.500000 65.000000
max 193.900000 89.000000

1.3.2 Feature Statistics Functions

sum: sum

mean: mean

median: median

var: variance

std: standard deviation

max: maximum

min: minimum

quantile: quantile

count: number of non-missing values

idxmax: index label of the maximum value

idxmin: index label of the minimum value

Common parameter axis: the default 0 aggregates column by column, while 1 aggregates row by row.

df_demo = df[['Height','Weight']]
df_demo.mean()
Height    163.218033
Weight     55.015873
dtype: float64
df_demo.max()
Height    193.9
Weight     89.0
dtype: float64
df_demo.quantile(0.75)
Height    167.5
Weight     65.0
Name: 0.75, dtype: float64
df_demo.idxmin()
Height    143
Weight     49
dtype: int64
df_demo.mean(axis=1).head()
0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64

1.3.3 Unique Value Functions

unique: the distinct values of a column, as an array

nunique: the number of distinct values

value_counts: the distinct values together with their frequencies

drop_duplicates: view the unique combinations of several columns

df['School'].unique()
array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique()
4
df['School'].value_counts()
Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

drop_duplicates parameters

keep: 'first' (default) keeps the row of the first occurrence of each combination, 'last' keeps the row of the last occurrence, and False drops every row whose combination appears more than once (only combinations that occur exactly once are kept).

df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender','Transfer'])
Gender Transfer Name
0 Female N Gaopeng Yang
1 Male N Changqiang You
12 Female NaN Peng You
21 Male NaN Xiaopeng Shen
36 Male Y Xiaojuan Qin
43 Female Y Gaoli Feng
df_demo.drop_duplicates(['Gender','Transfer'],keep='last')
Gender Transfer Name
147 Male NaN Juan You
150 Male Y Chengpeng You
169 Female Y Chengquan Qin
194 Female NaN Yanmei Qian
197 Female N Chengqiang Chu
199 Male N Chunpeng Lv
df_demo.drop_duplicates(['Name','Transfer'],keep=False)
Gender Transfer Name
0 Female N Gaopeng Yang
1 Male N Changqiang You
4 Male N Gaojuan You
5 Female N Xiaoli Qian
7 Female N Gaoqiang Qian
... ... ... ...
192 Male N Gaojuan Wang
194 Female NaN Yanmei Qian
196 Female N Li Zhao
197 Female N Chengqiang Chu
198 Male N Chengmei Shen

155 rows × 3 columns

df['School'].drop_duplicates()
0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

duplicated returns a boolean Series marking duplicates: duplicated elements are set to True, the rest to False. Its keep parameter behaves the same way as in drop_duplicates.
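
A minimal sketch of duplicated on the same columns used above:

df_demo.duplicated(['Gender', 'Transfer']).head()
0    False
1    False
2     True
3     True
4     True
dtype: bool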

1.3.4 Replacement Functions

Mapping replacement: replace (pass a dict, or two lists of old and new values).

With method='ffill', a value to be replaced is filled with the nearest preceding value that is not itself being replaced;

with method='bfill', it is filled with the nearest following value that is not being replaced.

df['Gender'].replace({'Female': 0, 'Male': 1}).head()
0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64
df['Gender'].replace(['Female','Male'],[0,1]).head()
0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64
s = pd.Series(['a',1,'b',2,'c',1 ,'d',2])
s.replace([1,2],method='ffill')
0    a
1    a
2    b
3    b
4    c
5    c
6    d
7    d
dtype: object
s.replace([1,2],method='bfill')
0    a
1    b
2    b
3    c
4    c
5    d
6    d
7    2
dtype: object

Logical replacement:

where: replaces the entries where the condition is False

mask: replaces the entries where the condition is True

When no replacement value is given, the entries are replaced with missing values (NaN).

The argument only needs to be a boolean sequence aligned with the index of the calling Series, so the condition (and the replacement value) can be constructed freely.

s = pd.Series([-1, 1.2345, 100, -50])
s.where(s<10)
0    -1.0000
1     1.2345
2        NaN
3   -50.0000
dtype: float64
s.where(s<10, 100)
0     -1.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64
s.mask(s<10)
0      NaN
1      NaN
2    100.0
3      NaN
dtype: float64
s.mask(s<10,-50)
0    -50.0
1    -50.0
2    100.0
3    -50.0
dtype: float64
s_condition=pd.Series([True,False,True,False],index=s.index)
s.mask(s_condition,-30)
0   -30.0000
1     1.2345
2   -30.0000
3   -50.0000
dtype: float64

Numerical replacement:

round: rounding

abs: absolute value

clip: clipping to lower and upper bounds

s = pd.Series([-1, 1.2345, 100, -50])
s.round(2) # keep two decimal places
0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64
s.abs()
0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64
s.clip(0,2)
0    0.0000
1    1.2345
2    2.0000
3    0.0000
dtype: float64

1.3.5 Sorting Functions

Sort by values: sort_values

Sort by index: sort_index; level specifies the name or number of the index level

ascending=True: ascending order

df_demo = df[['Grade','Name','Height','Weight']].set_index(['Grade','Name'])
df_demo.sort_values('Height').head()
Height Weight
Grade Name
Junior Xiaoli Chu 145.4 34.0
Senior Gaomei Lv 147.3 34.0
Sophomore Peng Han 147.8 34.0
Senior Changli Lv 148.7 41.0
Sophomore Changjuan You 150.5 40.0
df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head()
Height Weight
Grade Name
Sophomore Peng Han 147.8 34.0
Senior Gaomei Lv 147.3 34.0
Junior Xiaoli Chu 145.4 34.0
Sophomore Qiang Zhou 150.5 36.0
Freshman Yanqiang Xu 152.4 38.0
df_demo.sort_index(level=['Grade','Name'],ascending=[True,False]).head()
Height Weight
Grade Name
Freshman Yanquan Wang 163.5 55.0
Yanqiang Xu 152.4 38.0
Yanqiang Feng 162.3 51.0
Yanpeng Lv NaN 65.0
Yanli Zhang 165.1 52.0

1.3.6 The apply Method

apply iterates over the rows or columns of a DataFrame; axis=0 (the default) applies the function to each column, axis=1 to each row. The function passed to apply generally takes a Series as input.

Prefer the built-in (vectorized) methods when they exist; apply is noticeably slower.

df_demo = df[['Height','Weight']]

def my_mean(x):
    res = x.mean()
    return res
df_demo
Height Weight
0 158.9 46.0
1 166.5 70.0
2 188.9 89.0
3 NaN 41.0
4 174.0 74.0
... ... ...
195 153.9 46.0
196 160.9 50.0
197 153.9 45.0
198 175.3 71.0
199 155.7 51.0

200 rows × 2 columns

df_demo.apply(my_mean)
Height    163.218033
Weight     55.015873
dtype: float64
df_demo.apply(lambda x:x.mean())
Height    163.218033
Weight     55.015873
dtype: float64
df_demo.apply(lambda x:x.mean(),axis=1).head()
0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64
df_demo.mad() # mean absolute deviation of each column from its mean
Height     6.707229
Weight    10.391870
dtype: float64
df_demo.apply(lambda x:(x-x.mean()).abs().mean())
Height     6.707229
Weight    10.391870
dtype: float64

1.4 Window Objects

1.4.1 Rolling Window Objects

Series_name.rolling

Window size: window (how many values are captured at a time)

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window=3)
roller
Rolling [window=3,center=False,axis=0]
roller.mean()
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64
s2 = pd.Series([1,2,3,16,30])
roller.cov(s2)
0     NaN
1     NaN
2     1.0
3     7.0
4    13.5
dtype: float64
roller.apply(lambda x:x.mean())
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

shift: periods=n (default 1), take the value n positions earlier

diff: periods=n (default 1), take the difference with the value n positions earlier

pct_change: periods=n (default 1), compute the rate of change relative to the value n positions earlier

A negative n performs the operation in the opposite direction.

Each of these can be reproduced with a rolling window of size n+1.

s = pd.Series([1,3,6,10,15])
s.shift(2)
0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64
s.diff(3)
0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64
s.pct_change()
0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64
s.shift(-1)
0     3.0
1     6.0
2    10.0
3    15.0
4     NaN
dtype: float64
s.rolling(3).apply(lambda x:list(x)[0]) # equivalent to s.shift(2)
0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64
s.rolling(4).apply(lambda x:list(x)[-1]-list(x)[0]) # equivalent to s.diff(3)
0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64
def my_pct(x):
    L=list(x)
    return L[-1]/L[0]-1
s.rolling(2).apply(my_pct)
0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64

1.4.2 Expanding Windows

Also called a cumulative window: a window of dynamic length that runs from the start of the series up to the current position; the aggregation function is applied to this gradually expanding window.

For the series [a1, a2, a3, a4], the windows are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4].

s = pd.Series([1,3,6,10])
s.expanding().mean()
0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
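
As a small follow-up sketch, expanding aggregations reproduce the familiar cumulative functions; for instance expanding().sum() gives the same values as cumsum(), only as floats:

s.expanding().sum()
0     1.0
1     4.0
2    10.0
3    20.0
dtype: float64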

2. Exercises

  1. The Pokémon dataset


df = pd.read_csv('data/Pokemon.csv')
df.head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80
4 4 Charmander Fire NaN 309 39 52 43 60 50 65
  1. Verify the Total column: sum the six stat columns row-wise, subtract the Total column, and check that the unique values of the difference are all 0.
df1 = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
s = df1.sum(axis=1)-df['Total']
s.unique()
array([0], dtype=int64)

2. For Pokémon with a duplicated # value, keep only the first record. First inspect the table: the # column has 800 values; after drop_duplicates keeping the first occurrence, 721 rows remain.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #        800 non-null    int64 
 1   Name     800 non-null    object
 2   Type 1   800 non-null    object
 3   Type 2   414 non-null    object
 4   Total    800 non-null    int64 
 5   HP       800 non-null    int64 
 6   Attack   800 non-null    int64 
 7   Defense  800 non-null    int64 
 8   Sp. Atk  800 non-null    int64 
 9   Sp. Def  800 non-null    int64 
 10  Speed    800 non-null    int64 
dtypes: int64(8), object(3)
memory usage: 68.9+ KB
df2 = df.drop_duplicates(['#'],keep='first')
df2
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4 4 Charmander Fire NaN 309 39 52 43 60 50 65
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80
... ... ... ... ... ... ... ... ... ... ... ...
793 717 Yveltal Dark Flying 680 126 131 95 131 98 99
794 718 Zygarde50% Forme Dragon Ground 600 108 100 121 81 95 95
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70
799 721 Volcanion Fire Water 600 80 110 120 130 90 70

721 rows × 11 columns

2.1 Count the number of Type 1 categories: first compute value_counts for each category, then count how many groups there are.

df_test = df2['Type 1'].value_counts()
df_test.count()
18
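
Alternatively (not the approach used in these notes), nunique from section 1.3.3 gives the same count directly:

df2['Type 1'].nunique()
18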

2.1 (continued) Find the three Type 1 categories with the most Pokémon: value_counts already returns the counts in descending order, so simply take the first three with head(3).

index1 = df2['Type 1'].value_counts()
index1.head(3)
Water     105
Normal     93
Grass      66
Name: Type 1, dtype: int64

2.2 Count the combinations of Type 1 and Type 2: use drop_duplicates on those two columns, then take the number of rows. Because both Type 1 and Type 2 contain NaN, count() would undercount, so shape[0] is used instead.

df3 = df.drop_duplicates(['Type 1','Type 2'],keep='first')
count_type = pd.DataFrame(df3[['Type 1','Type 2']])
count_type.shape[0]
154

2.3 Find the number of type combinations that have not yet appeared: compute the theoretical number of Type 1 × Type 2 combinations, then subtract the number of combinations observed above.

count1 = df3['Type 1'].nunique()
count1
18
count2 = df3['Type 2'].nunique()
count2
18
res = count1*count2
res
324
diff_res = res-count_type.shape[0]
diff_res
170

3.1 Take the Attack column and replace values above 120 with 'high', values below 50 with 'low', and everything in between with 'mid', chaining mask calls; an alternative pd.cut sketch follows the output below.

df['Attack'].mask(df['Attack']>120,'high').mask(df['Attack']<50,'low').mask((df['Attack']>=50)&(df['Attack']<=120),'mid')
0       low
1       mid
2       mid
3       mid
4       mid
       ... 
795     mid
796    high
797     mid
798    high
799     mid
Name: Attack, Length: 800, dtype: object
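
An alternative sketch (not used in these notes) is pd.cut; since Attack is integer-valued here, left-closed bins of [0, 50), [50, 121) and [121, inf) reproduce the low/mid/high split:

pd.cut(df['Attack'], bins=[0, 50, 121, np.inf], labels=['low', 'mid', 'high'], right=False).head() # returns a categorical Series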

3.2 Take Type 1 and convert it to upper case. One option is replace with a dict comprehension that maps each value to its upper-case form; since the mapping is one-to-one, building the dict from the deduplicated unique values avoids repeated work. Another option is apply with a lambda that calls upper on each element.

df4 = df['Type 1']
df4.head()
0    Grass
1    Grass
2    Grass
3    Grass
4     Fire
Name: Type 1, dtype: object
df4.replace({i: i.upper() for i in df4.unique()})
0        GRASS
1        GRASS
2        GRASS
3        GRASS
4         FIRE
        ...   
795       ROCK
796       ROCK
797    PSYCHIC
798    PSYCHIC
799       FIRE
Name: Type 1, Length: 800, dtype: object
df4.apply(lambda x:x.upper())
0        GRASS
1        GRASS
2        GRASS
3        GRASS
4         FIRE
        ...   
795       ROCK
796       ROCK
797    PSYCHIC
798    PSYCHIC
799       FIRE
Name: Type 1, Length: 800, dtype: object

3.3 Compute each Pokémon's deviation: for each row, take the median of the six stats and find the largest absolute deviation from it, using apply row-wise (note axis=1); then sort the result with sort_values in descending order.

df5 = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
df['diff_max'] = df5.apply(lambda x: max((x-x.median()).abs()),axis=1)
df.sort_values('diff_max',ascending=False).head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed diff_max
230 213 Shuckle Bug Rock 505 20 10 230 10 230 5 215.0
121 113 Chansey Normal NaN 450 250 5 5 35 105 50 207.5
261 242 Blissey Normal NaN 540 255 10 10 75 135 55 190.0
333 306 AggronMega Aggron Steel NaN 630 70 140 230 60 80 50 155.0
224 208 SteelixMega Steelix Steel Ground 610 75 125 230 55 95 30 145.0
