import numpy as np
import pandas as pd
print("pandas version:", pd.__version__)
pandas version: 1.1.3
pd.read_csv(): keeping the data files under the working directory avoids unnecessary path trouble
df_csv = pd.read_csv('data/my_csv.csv',header=None,usecols=[3,4])
df_csv
|   | 3      | 4        |
|---|--------|----------|
| 0 | col4   | col5     |
| 1 | apple  | 2020/1/1 |
| 2 | banana | 2020/1/2 |
| 3 | orange | 2020/1/5 |
| 4 | lemon  | 2020/1/7 |
pd.read_table()
When a txt file uses a delimiter other than whitespace, pass a custom separator via sep
df_txt = pd.read_table('data/my_table.txt',index_col=['col1'])
df_txt
| col1 | col2 | col3 | col4            |
|------|------|------|-----------------|
| 2    | a    | 1.4  | apple 2020/1/1  |
| 3    | b    | 3.4  | banana 2020/1/2 |
| 6    | c    | 2.5  | orange 2020/1/5 |
| 5    | d    | 3.2  | lemon 2020/1/7  |
df_txt1 = pd.read_table('data/my_table_special_sep.txt')
df_txt1
|   | col1 \|\|\|\| col2            |
|---|-------------------------------|
| 0 | TS \|\|\|\| This is an apple. |
| 1 | GQ \|\|\|\| My name is Bob.   |
| 2 | WT \|\|\|\| Well done!        |
| 3 | PT \|\|\|\| May I help you?   |
df_t = pd.read_table('data/my_table_special_sep.txt', sep=r'\|\|\|\|', engine='python')
df_t
|   | col1 | col2              |
|---|------|-------------------|
| 0 | TS   | This is an apple. |
| 1 | GQ   | My name is Bob.   |
| 2 | WT   | Well done!        |
| 3 | PT   | May I help you?   |
Note on engine='python': the default parser is the C engine; switching to the Python engine allows richer parsing (such as regex separators).
sep is interpreted as a regular expression, so | must be escaped. (To be expanded after covering regular expressions.)
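Since the sample file is not reproduced here, a minimal self-contained sketch of the regex separator (the inline raw string is a stand-in for my_table_special_sep.txt; the \s* around the escaped pipes additionally strips the surrounding spaces):

```python
import io
import pandas as pd

# Inline stand-in for the '||||'-separated file.
raw = "col1 |||| col2\nTS |||| This is an apple.\nGQ |||| My name is Bob.\n"

# sep is a regular expression, so each '|' must be escaped;
# multi-character / regex separators require engine='python'.
df = pd.read_csv(io.StringIO(raw), sep=r'\s*\|\|\|\|\s*', engine='python')
print(df.columns.tolist())  # ['col1', 'col2']
```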
pd.read_excel
df_excel = pd.read_excel('data/my_excel.xlsx',nrows=2, parse_dates=['col5'])
df_excel
|   | col1 | col2 | col3 | col4 | col5 |
|---|------|------|------|------|------|
| 0 | 2 | a | 1.4 | apple | 2020-01-01 |
| 1 | 3 | b | 3.4 | banana | 2020-01-02 |
Common parameters:
header=None: do not treat the first row as column names
index_col: use one or several columns as the index
usecols: the set of columns to read; by default all columns are read
parse_dates: columns to parse as datetimes
nrows: number of data rows to read
Note: if header=None is used, the values passed to usecols refer to the new column names (the integer positions)
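A quick self-contained check of the header=None behaviour (the inline data stands in for a headerless CSV file):

```python
import io
import pandas as pd

# Inline stand-in for a headerless CSV file (hypothetical data).
raw = "2,a,1.4,apple,2020/1/1\n3,b,3.4,banana,2020/1/2\n"

# With header=None the first row is data and the columns are numbered
# 0..4, so usecols must refer to those integer positions.
df = pd.read_csv(io.StringIO(raw), header=None, usecols=[3, 4])
print(df.columns.tolist())  # [3, 4]
```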
table_name.to_csv
df_csv1 = pd.read_csv('data/my_csv.csv')
df_csv1
|   | col1 | col2 | col3 | col4 | col5 |
|---|------|------|------|------|------|
| 0 | 2 | a | 1.4 | apple | 2020/1/1 |
| 1 | 3 | b | 3.4 | banana | 2020/1/2 |
| 2 | 6 | c | 2.5 | orange | 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon | 2020/1/7 |
df_csv1.to_csv('data/my_csv_saved_mine.csv',index=False)
Notes:
(1) If the file name already exists, the original file is overwritten.
(2) The file name must include an extension; to_csv does not append .csv automatically.
(3) index is usually set to False.
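A minimal round-trip sketch of point (3), using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

# index=False drops the row index on write; without it, reading the
# file back produces an extra 'Unnamed: 0' column.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
print(df_back.columns.tolist())  # ['col1', 'col2']
```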
table_name.to_csv
df_txt1 = pd.read_table('data/my_table.txt')
df_txt1
|   | col1 | col2 | col3 | col4 |
|---|------|------|------|------|
| 0 | 2 | a | 1.4 | apple 2020/1/1 |
| 1 | 3 | b | 3.4 | banana 2020/1/2 |
| 2 | 6 | c | 2.5 | orange 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon 2020/1/7 |
df_txt1.to_csv('data/my_txt_saved_mine.txt',sep='\t',index=False)
Note:
to_csv can also save txt files with a custom separator; the tab character \t is the most common choice
table_name.to_excel
df_excel1 = pd.read_excel('data/my_excel.xlsx')
df_excel1
|   | col1 | col2 | col3 | col4 | col5 |
|---|------|------|------|------|------|
| 0 | 2 | a | 1.4 | apple | 2020/1/1 |
| 1 | 3 | b | 3.4 | banana | 2020/1/2 |
| 2 | 6 | c | 2.5 | orange | 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon | 2020/1/7 |
df_excel1.to_excel('data/my_excel_saved_mine.xlsx')
to_markdown
First install the tabulate package (run in a shell): pip install tabulate
import tabulate
print(df_csv1.to_markdown())
| | col1 | col2 | col3 | col4 | col5 |
|---:|-------:|:-------|-------:|:-------|:---------|
| 0 | 2 | a | 1.4 | apple | 2020/1/1 |
| 1 | 3 | b | 3.4 | banana | 2020/1/2 |
| 2 | 6 | c | 2.5 | orange | 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon | 2020/1/7 |
to_latex
print(df_csv1.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} & col1 & col2 & col3 & col4 & col5 \\
\midrule
0 & 2 & a & 1.4 & apple & 2020/1/1 \\
1 & 3 & b & 3.4 & banana & 2020/1/2 \\
2 & 6 & c & 2.5 & orange & 2020/1/5 \\
3 & 5 & d & 3.2 & lemon & 2020/1/7 \\
\bottomrule
\end{tabular}
pandas has two basic data structures: the one-dimensional Series and the two-dimensional DataFrame. Focus on their attributes and methods.
Components of a Series:
data: the values
index: the index
dtype: the storage type
name: the name of the Series
The index can be given a name; it is empty by default
s = pd.Series(data=[100, 'a', {'dict1': 5}],                        # values
              index=pd.Index(['id1', 20, 'third'], name='my_idx'),  # index values and index name
              dtype='object',                                       # storage type
              name='my_name')                                       # name of the whole Series
s
my_idx
id1 100
20 a
third {'dict1': 5}
Name: my_name, dtype: object
Note: object is a mixed type that can hold different data structures; a pure string sequence is also stored as object by default, but it can use the dedicated string dtype instead.
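A short illustration of the object vs. string distinction (the string dtype exists since pandas 1.0):

```python
import pandas as pd

s_obj = pd.Series(['apple', 'banana'])                  # default: object dtype
s_str = pd.Series(['apple', 'banana'], dtype='string')  # dedicated string dtype

print(s_obj.dtype)  # object
print(s_str.dtype)  # string
```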
Series_name.values: get the values
Series_name.index: get the index
Series_name.dtype: get the storage type
Series_name.name: get the name of the Series
Series_name.shape: get the shape (length) of the Series
Series_name[index_name]: get the value for a single index label
s.values
array([100, 'a', {'dict1': 5}], dtype=object)
s.index
Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
dtype('O')
s.name
'my_name'
s.shape
(3,)
Note on shape: shape[0] is the size of the outermost axis, shape[1] the next one in, and so on from outside to inside as the position increases. A Series is one-dimensional, so shape directly gives the number of elements, hence (3,).
x = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[0,1,2]],[[3,4,5],[6,7,8]]])
x
array([[[1, 2, 3],
[4, 5, 6]],
[[7, 8, 9],
[0, 1, 2]],
[[3, 4, 5],
[6, 7, 8]]])
print(x.shape)
print(x.shape[0])  # outermost level: 3 blocks, each a 2×3 two-dimensional array
print(x.shape[1])  # next level: each block contains 2 one-dimensional arrays
print(x.shape[2])  # innermost level: each one-dimensional array holds 3 elements
(3, 2, 3)
3
2
3
s['third']
{'dict1': 5}
A DataFrame adds a column index on top of the Series: it consists of two-dimensional data plus row and column indexes.
DataFrame_name.values: get the values
DataFrame_name.index: get the row index
DataFrame_name.dtypes: get the storage types, returned as a Series of per-column dtypes
DataFrame_name.columns: get the column names
DataFrame_name.shape: get the (rows, columns) dimensions
DataFrame_name[column_name]: select a single column or a list of several columns
DataFrame_name.T: transpose
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.3]]
data
[[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.3]]
df = pd.DataFrame(data=data,
                  index=['row_%d' % i for i in range(3)],
                  columns=['col_0', 'col_1', 'col_2'])
df
|       | col_0 | col_1 | col_2 |
|-------|-------|-------|-------|
| row_0 | 1 | a | 1.2 |
| row_1 | 2 | b | 2.2 |
| row_2 | 3 | c | 3.3 |
Aside: %d formats an integer; %f a float; %s a string
df.values
array([[1, 'a', 1.2],
[2, 'b', 2.2],
[3, 'c', 3.3]], dtype=object)
df.index
Index(['row_0', 'row_1', 'row_2'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes
col_0 int64
col_1 object
col_2 float64
dtype: object
df.shape
(3, 3)
df[['col_0','col_1']]  # a list of several column names must itself be wrapped in []
|       | col_0 | col_1 |
|-------|-------|-------|
| row_0 | 1 | a |
| row_1 | 2 | b |
| row_2 | 3 | c |
df.T
|       | row_0 | row_1 | row_2 |
|-------|-------|-------|-------|
| col_0 | 1 | 2 | 3 |
| col_1 | a | b | c |
| col_2 | 1.2 | 2.2 | 3.3 |
df = pd.read_csv('data/learn_pandas.csv')
df.head()
|   | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|--------|-------|------|--------|--------|--------|----------|-------------|-----------|-------------|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N | 2 | 2019/9/12 | 0:05:22 |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N | 2 | 2020/1/3 | 0:04:08 |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N | 2 | 2019/11/6 | 0:05:22 |
# keep only the first 7 columns
df = df[df.columns[:7]]
df.head()
|   | School | Grade | Name | Gender | Height | Weight | Transfer |
|---|--------|-------|------|--------|--------|--------|----------|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
head: returns the first n rows (default 5)
tail: returns the last n rows
info: returns a summary of the table
describe: returns the main statistics of the numeric columns (the pandas-profiling package can produce a fuller report)
df.head()
|   | School | Grade | Name | Gender | Height | Weight | Transfer |
|---|--------|-------|------|--------|--------|--------|----------|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
df.tail()
|     | School | Grade | Name | Gender | Height | Weight | Transfer |
|-----|--------|-------|------|--------|--------|--------|----------|
| 195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N |
| 196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N |
| 197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N |
| 198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
| 199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
df.info()
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 200 non-null object
1 Grade 200 non-null object
2 Name 200 non-null object
3 Gender 200 non-null object
4 Height 183 non-null float64
5 Weight 189 non-null float64
6 Transfer 188 non-null object
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
df.describe()
|       | Height | Weight |
|-------|--------|--------|
| count | 183.000000 | 189.000000 |
| mean | 163.218033 | 55.015873 |
| std | 8.608879 | 12.824294 |
| min | 145.400000 | 34.000000 |
| 25% | 157.150000 | 46.000000 |
| 50% | 161.900000 | 51.000000 |
| 75% | 167.500000 | 65.000000 |
| max | 193.900000 | 89.000000 |
sum: sum
mean: mean
median: median
var: variance
std: standard deviation
max: maximum
min: minimum
quantile: quantile
count: number of non-missing values
idxmax: index of the maximum value
idxmin: index of the minimum value
Common parameter axis: the default 0 aggregates over each column, 1 aggregates over each row
df_demo = df[['Height','Weight']]
df_demo.mean()
Height 163.218033
Weight 55.015873
dtype: float64
df_demo.max()
Height 193.9
Weight 89.0
dtype: float64
df_demo.quantile(0.75)
Height 167.5
Weight 65.0
Name: 0.75, dtype: float64
df_demo.idxmin()
Height 143
Weight 49
dtype: int64
df_demo.mean(axis=1).head()
0 102.45
1 118.25
2 138.95
3 41.00
4 124.00
dtype: float64
unique: the list of distinct values in a column
nunique: the number of distinct values
value_counts: the distinct values and their frequencies
drop_duplicates: inspect the unique combinations of several columns
df['School'].unique()
array(['Shanghai Jiao Tong University', 'Peking University',
'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique()
4
df['School'].value_counts()
Tsinghua University 69
Shanghai Jiao Tong University 57
Fudan University 40
Peking University 34
Name: School, dtype: int64
drop_duplicates parameters
keep: 'first' (default) keeps the row of the first occurrence of each combination; 'last' keeps the row of the last occurrence; False drops every row whose combination is duplicated (keeping only combinations that occur exactly once)
df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender','Transfer'])
|    | Gender | Transfer | Name |
|----|--------|----------|------|
| 0  | Female | N | Gaopeng Yang |
| 1  | Male | N | Changqiang You |
| 12 | Female | NaN | Peng You |
| 21 | Male | NaN | Xiaopeng Shen |
| 36 | Male | Y | Xiaojuan Qin |
| 43 | Female | Y | Gaoli Feng |
df_demo.drop_duplicates(['Gender','Transfer'],keep='last')
|     | Gender | Transfer | Name |
|-----|--------|----------|------|
| 147 | Male | NaN | Juan You |
| 150 | Male | Y | Chengpeng You |
| 169 | Female | Y | Chengquan Qin |
| 194 | Female | NaN | Yanmei Qian |
| 197 | Female | N | Chengqiang Chu |
| 199 | Male | N | Chunpeng Lv |
df_demo.drop_duplicates(['Name','Transfer'],keep=False)
|     | Gender | Transfer | Name |
|-----|--------|----------|------|
| 0   | Female | N | Gaopeng Yang |
| 1   | Male | N | Changqiang You |
| 4   | Male | N | Gaojuan You |
| 5   | Female | N | Xiaoli Qian |
| 7   | Female | N | Gaoqiang Qian |
| ... | ... | ... | ... |
| 192 | Male | N | Gaojuan Wang |
| 194 | Female | NaN | Yanmei Qian |
| 196 | Female | N | Li Zhao |
| 197 | Female | N | Chengqiang Chu |
| 198 | Male | N | Chengmei Shen |

155 rows × 3 columns
df['School'].drop_duplicates()
0 Shanghai Jiao Tong University
1 Peking University
3 Fudan University
5 Tsinghua University
Name: School, dtype: object
duplicated returns a boolean Series marking duplicates; its keep parameter behaves the same as in drop_duplicates. Duplicated elements are True, the rest False.
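Since duplicated is not demonstrated above, a minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'b'])

# keep='first' (default): first occurrence is False, repeats are True
print(s.duplicated().tolist())            # [False, False, True, False, True]
# keep=False: every member of a duplicated group is True
print(s.duplicated(keep=False).tolist())  # [True, True, True, False, True]
```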
Value mapping: replace (pass a dict, or two parallel lists)
With method='ffill', each value to replace is filled with the nearest preceding value that is not itself being replaced;
method='bfill' fills from the nearest following value instead
df['Gender'].replace({'Female': 0, 'Male': 1}).head()
0 0
1 1
2 1
3 0
4 1
Name: Gender, dtype: int64
df['Gender'].replace(['Female','Male'],[0,1]).head()
0 0
1 1
2 1
3 0
4 1
Name: Gender, dtype: int64
s = pd.Series(['a',1,'b',2,'c',1 ,'d',2])
s.replace([1,2],method='ffill')
0 a
1 a
2 b
3 b
4 c
5 c
6 d
7 d
dtype: object
s.replace([1,2],method='bfill')
0 a
1 b
2 b
3 c
4 c
5 d
6 d
7 2
dtype: object
Conditional replacement:
where: replaces the entries where the condition is False
mask: replaces the entries where the condition is True
When no replacement value is given, entries are replaced with NaN.
The condition only needs to be a boolean sequence whose index matches the Series being called, so a custom condition (and replacement value) can be supplied.
s = pd.Series([-1, 1.2345, 100, -50])
s.where(s<10)
0 -1.0000
1 1.2345
2 NaN
3 -50.0000
dtype: float64
s.where(s<10, 100)
0 -1.0000
1 1.2345
2 100.0000
3 -50.0000
dtype: float64
s.mask(s<10)
0 NaN
1 NaN
2 100.0
3 NaN
dtype: float64
s.mask(s<10,-50)
0 -50.0
1 -50.0
2 100.0
3 -50.0
dtype: float64
s_condition=pd.Series([True,False,True,False],index=s.index)
s.mask(s_condition,-30)
0 -30.0000
1 1.2345
2 -30.0000
3 -50.0000
dtype: float64
Numeric replacement
round: rounding
abs: absolute value
clip: truncation to bounds
s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)  # keep two decimal places
0 -1.00
1 1.23
2 100.00
3 -50.00
dtype: float64
s.abs()
0 1.0000
1 1.2345
2 100.0000
3 50.0000
dtype: float64
s.clip(0,2)
0 0.0000
1 1.2345
2 2.0000
3 0.0000
dtype: float64
Sort by values: sort_values
Sort by index: sort_index; level specifies the index level name or number
ascending=True: ascending order
df_demo = df[['Grade','Name','Height','Weight']].set_index(['Grade','Name'])
df_demo.sort_values('Height').head()
| Grade | Name | Height | Weight |
|-------|------|--------|--------|
| Junior | Xiaoli Chu | 145.4 | 34.0 |
| Senior | Gaomei Lv | 147.3 | 34.0 |
| Sophomore | Peng Han | 147.8 | 34.0 |
| Senior | Changli Lv | 148.7 | 41.0 |
| Sophomore | Changjuan You | 150.5 | 40.0 |
df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head()
| Grade | Name | Height | Weight |
|-------|------|--------|--------|
| Sophomore | Peng Han | 147.8 | 34.0 |
| Senior | Gaomei Lv | 147.3 | 34.0 |
| Junior | Xiaoli Chu | 145.4 | 34.0 |
| Sophomore | Qiang Zhou | 150.5 | 36.0 |
| Freshman | Yanqiang Xu | 152.4 | 38.0 |
df_demo.sort_index(level=['Grade','Name'],ascending=[True,False]).head()
| Grade | Name | Height | Weight |
|----------|------|--------|--------|
| Freshman | Yanquan Wang | 163.5 | 55.0 |
|          | Yanqiang Xu | 152.4 | 38.0 |
|          | Yanqiang Feng | 162.3 | 51.0 |
|          | Yanpeng Lv | NaN | 65.0 |
|          | Yanli Zhang | 165.1 | 52.0 |
apply iterates over the rows or columns of a DataFrame: axis=0 (the default) applies the function to each column, axis=1 to each row. The function passed to apply generally takes a Series as input.
When a built-in method exists, avoid apply; it hurts performance.
df_demo = df[['Height','Weight']]
def my_mean(x):
    res = x.mean()
    return res
df_demo
|     | Height | Weight |
|-----|--------|--------|
| 0   | 158.9 | 46.0 |
| 1   | 166.5 | 70.0 |
| 2   | 188.9 | 89.0 |
| 3   | NaN | 41.0 |
| 4   | 174.0 | 74.0 |
| ... | ... | ... |
| 195 | 153.9 | 46.0 |
| 196 | 160.9 | 50.0 |
| 197 | 153.9 | 45.0 |
| 198 | 175.3 | 71.0 |
| 199 | 155.7 | 51.0 |

200 rows × 2 columns
df_demo.apply(my_mean)
Height 163.218033
Weight 55.015873
dtype: float64
df_demo.apply(lambda x:x.mean())
Height 163.218033
Weight 55.015873
dtype: float64
df_demo.apply(lambda x:x.mean(),axis=1).head()
0 102.45
1 118.25
2 138.95
3 41.00
4 124.00
dtype: float64
df_demo.mad()  # mean of the absolute deviations from the column mean
Height 6.707229
Weight 10.391870
dtype: float64
df_demo.apply(lambda x:(x-x.mean()).abs().mean())
Height 6.707229
Weight 10.391870
dtype: float64
Series_name.rolling
Window size: window (how many values each window captures)
s = pd.Series([1,2,3,4,5])
roller = s.rolling(window=3)
roller
Rolling [window=3,center=False,axis=0]
roller.mean()
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
dtype: float64
s2 = pd.Series([1,2,3,16,30])
roller.cov(s2)
0 NaN
1 NaN
2 1.0
3 7.0
4 13.5
dtype: float64
roller.apply(lambda x:x.mean())
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
dtype: float64
shift: periods=n (default 1), take the value n positions earlier
diff: periods=n (default 1), the difference with the value n positions earlier
pct_change: periods=n (default 1), the growth rate relative to the value n positions earlier
A negative n reverses the direction.
Each of these can be reproduced with a rolling window of size n+1
s = pd.Series([1,3,6,10,15])
s.shift(2)
0 NaN
1 NaN
2 1.0
3 3.0
4 6.0
dtype: float64
s.diff(3)
0 NaN
1 NaN
2 NaN
3 9.0
4 12.0
dtype: float64
s.pct_change()
0 NaN
1 2.000000
2 1.000000
3 0.666667
4 0.500000
dtype: float64
s.shift(-1)
0 3.0
1 6.0
2 10.0
3 15.0
4 NaN
dtype: float64
s.rolling(3).apply(lambda x: list(x)[0])  # equivalent to s.shift(2)
0 NaN
1 NaN
2 1.0
3 3.0
4 6.0
dtype: float64
s.rolling(4).apply(lambda x: list(x)[-1] - list(x)[0])  # equivalent to s.diff(3)
0 NaN
1 NaN
2 NaN
3 9.0
4 12.0
dtype: float64
def my_pct(x):
    L = list(x)
    return L[-1] / L[0] - 1
s.rolling(2).apply(my_pct)
0 NaN
1 2.000000
2 1.000000
3 0.666667
4 0.500000
dtype: float64
expanding, also called a cumulative window: a window of dynamic length running from the start of the sequence to the current position; the aggregation function is applied to this gradually expanding window.
For the sequence [a1, a2, a3, a4], the windows are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4]
s = pd.Series([1,3,6,10])
s.expanding().mean()
0 1.000000
1 2.000000
2 3.333333
3 5.000000
dtype: float64
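For common aggregations, the expanding results coincide with the cumulative methods; a quick check (note that expanding() returns float64):

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10])

# expanding().sum() aggregates over [a1], [a1,a2], ... — same values as cumsum()
assert s.expanding().sum().tolist() == s.cumsum().astype(float).tolist()
# likewise expanding().max() matches cummax()
assert s.expanding().max().tolist() == s.cummax().astype(float).tolist()
print(s.expanding().sum().tolist())  # [1.0, 4.0, 10.0, 20.0]
```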
df = pd.read_csv('data/Pokemon.csv')
df.head()
|   | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
|---|---|------|--------|--------|-------|----|--------|---------|---------|---------|-------|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
df1 = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
s = df1.sum(axis=1)-df['Total']
s.unique()
array([0], dtype=int64)
2. Keep only the first record for each Pokémon with a duplicated #. First inspect the table before the change: the # column has 800 values; after drop_duplicates keeping the first record, 721 rows remain.
df.info()
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 # 800 non-null int64
1 Name 800 non-null object
2 Type 1 800 non-null object
3 Type 2 414 non-null object
4 Total 800 non-null int64
5 HP 800 non-null int64
6 Attack 800 non-null int64
7 Defense 800 non-null int64
8 Sp. Atk 800 non-null int64
9 Sp. Def 800 non-null int64
10 Speed 800 non-null int64
dtypes: int64(8), object(3)
memory usage: 68.9+ KB
df2 = df.drop_duplicates(['#'],keep='first')
df2
|     | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
|-----|---|------|--------|--------|-------|----|--------|---------|---------|---------|-------|
| 0   | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
| 1   | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
| 2   | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
| 4   | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
| 5   | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 793 | 717 | Yveltal | Dark | Flying | 680 | 126 | 131 | 95 | 131 | 98 | 99 |
| 794 | 718 | Zygarde50% Forme | Dragon | Ground | 600 | 108 | 100 | 121 | 81 | 95 | 95 |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 |

721 rows × 11 columns
2.1 Count the number of Type 1 categories: get the counts of each group, then count how many groups there are.
df_test = df2['Type 1'].value_counts()
df_test.count()
18
2.1 Find the three most frequent Type 1 categories: value_counts already returns the counts in descending order, so just take the first three with head.
index1 = df2['Type 1'].value_counts()
index1.head(3)
Water 105
Normal 93
Grass 66
Name: Type 1, dtype: int64
2.2 Count the combinations of Type 1 and Type 2: drop_duplicates returns all columns, so select Type 1 and Type 2 afterwards and count the combinations. Because both columns contain NaN, count() would undercount, so use shape[0] instead.
df3 = df.drop_duplicates(['Type 1','Type 2'],keep='first')
count_type = pd.DataFrame(df3[['Type 1','Type 2']])
count_type.shape[0]
154
2.3 Count the attribute combinations that have not appeared: compute the theoretical number of Type 1 × Type 2 combinations, then subtract the observed number computed above.
count1 = df3['Type 1'].nunique()
count1
18
count2 = df3['Type 2'].nunique()
count2
18
res = count1*count2
res
324
diff_res = res-count_type.shape[0]
diff_res
170
3.1 Take the Attack column: replace values above 120 with 'high', values below 50 with 'low', and everything in between with 'mid', using chained mask calls.
df['Attack'].mask(df['Attack']>120,'high').mask(df['Attack']<50,'low').mask((df['Attack']>=50)&(df['Attack']<=120),'mid')
0 low
1 mid
2 mid
3 mid
4 mid
...
795 mid
796 high
797 mid
798 high
799 mid
Name: Attack, Length: 800, dtype: object
3.2 Take Type 1 and upper-case it. One way: build a dict via a dict comprehension over the unique values (the mapping is one-to-one, and deduplicating avoids recording repeated values) and pass it to replace. Another way: apply with a lambda calling upper.
df4 = df['Type 1']
df4.head()
0 Grass
1 Grass
2 Grass
3 Grass
4 Fire
Name: Type 1, dtype: object
df4.replace({i: i.upper() for i in df4.unique()})
0 GRASS
1 GRASS
2 GRASS
3 GRASS
4 FIRE
...
795 ROCK
796 ROCK
797 PSYCHIC
798 PSYCHIC
799 FIRE
Name: Type 1, Length: 800, dtype: object
df4.apply(lambda x:x.upper())
0 GRASS
1 GRASS
2 GRASS
3 GRASS
4 FIRE
...
795 ROCK
796 ROCK
797 PSYCHIC
798 PSYCHIC
799 FIRE
Name: Type 1, Length: 800, dtype: object
3.3 Compute each Pokémon's deviation: take the median of the six ability stats per row, compute the absolute deviations from that median, and keep the maximum. Use apply with axis=1 (the computation is row-wise), then sort with sort_values.
df5 = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
df['diff_max'] = df5.apply(lambda x: max((x-x.median()).abs()),axis=1)
df.sort_values('diff_max',ascending=False).head()
|     | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | diff_max |
|-----|---|------|--------|--------|-------|----|--------|---------|---------|---------|-------|----------|
| 230 | 213 | Shuckle | Bug | Rock | 505 | 20 | 10 | 230 | 10 | 230 | 5 | 215.0 |
| 121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 207.5 |
| 261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 190.0 |
| 333 | 306 | AggronMega Aggron | Steel | NaN | 630 | 70 | 140 | 230 | 60 | 80 | 50 | 155.0 |
| 224 | 208 | SteelixMega Steelix | Steel | Ground | 610 | 75 | 125 | 230 | 55 | 95 | 30 | 145.0 |