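I wonder if anyone else has had the same experience. I had studied Python on and off before, but never really got past the beginner stage. At first I would run into a problem, have no idea how to fix it, and go straight from "getting started" to "giving up". Later I watched the videos carefully and carefully "copied" the code, yet I still did not understand it. Eventually I realized that learning happens bit by bit: start from the very simplest things and from what you already know instead of staring at complicated code, understand a single line first, and work up gradually to reading complex code.
The notes below are a short summary based on the book 《对比excel,轻松学习python数据分析》 and the exercise set 《这十套练习,教你如何用Pandas做数据分析》. Along the way I found that data processing in Python follows a fairly regular workflow. I will keep adding to these notes as I come across more usages.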
year | month | day | date | Id | Client_Id | Driver_Id | City_Id | Status |
---|---|---|---|---|---|---|---|---|
2013 | 10 | 1 | 2013/10/1 | 1 | 1 | 10 | 1 | completed |
2013 | 10 | 1 | 2013/10/1 | 2 | 2 | 11 | 1 | cancelled_by_driver |
2013 | 10 | 1 | 2013/10/1 | 3 | 3 | 12 | 6 | completed |
2013 | 10 | 1 | 2013/10/1 | 4 | 4 | 13 | 6 | cancelled_by_client |
2013 | 10 | 2 | 2013/10/2 | 5 | 1 | 10 | 1 | completed |
2013 | 10 | 2 | 2013/10/2 | 6 | 2 | 11 | 6 | completed |
2013 | 10 | 2 | 2013/10/2 | 7 | 3 | 12 | 6 | completed |
2013 | 10 | 3 | 2013/10/3 | 8 | 2 | 12 | 12 | completed |
2013 | 10 | 3 | 2013/10/3 | 9 | 3 | 10 | 12 | completed |
2013 | 10 | 3 | 2013/10/3 | 10 | 4 | 13 | 12 | cancelled_by_driver |
Users_Id | Banned | Role |
---|---|---|
1 | No | client |
2 | Yes | client |
3 | No | client |
4 | No | client |
10 | No | driver |
11 | No | driver |
12 | No | driver |
13 | No | driver |
import pandas as pd
# Option 1: raw string, so the backslashes in the Windows path are kept literally
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx")
df_excel
df_excel = pd.read_excel("D://PythonFlie//python//pandas//pandas_exercise//exercise_data//练习数据.xlsx")
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
# Option 3: escape every backslash
df_excel = pd.read_excel("D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data\\练习数据.xlsx")
df_excel
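The three path spellings above differ only in how the backslash is written. Here is a minimal sketch of why they point at the same file. The path `D:\data\exercise.xlsx` is a made-up example, not the file used in these notes, and `PureWindowsPath` only compares path text without touching the disk.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only for illustration.
raw = r"D:\data\exercise.xlsx"        # raw string: backslashes are kept literally
escaped = "D:\\data\\exercise.xlsx"   # every backslash escaped by hand
forward = "D:/data/exercise.xlsx"     # forward slashes, also accepted on Windows

# The raw and escaped spellings produce exactly the same string.
same_text = raw == escaped
# Windows path semantics treat both slash styles as the same path.
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)
print(same_text, same_path)
```

A plain string such as `"D:\new\table.xlsx"` would be a problem, because `\n` and `\t` are real escape sequences, so the raw-string form is the safer habit.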
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx",sheet_name = 1)
df_excel
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx",sheet_name = "Users")
df_excel
Users_Id Banned Role
0 1 No client
1 2 Yes client
2 3 No client
3 4 No client
4 10 No driver
5 11 No driver
6 12 No driver
7 13 No driver
# Use the first column of the Users sheet as the row index
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx",sheet_name = 1
,index_col = 0)
df_excel
Banned Role
Users_Id
1 No client
2 Yes client
3 No client
4 No client
10 No driver
11 No driver
12 No driver
13 No driver
# Use the fourth row as the column labels (header = 3, counting from 0)
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx",sheet_name = "Sheet1"
,header = 3 )
df_excel
client
0 driver
1 driver
2 driver
3 driver
# Read only the date, Client_Id and Driver_Id columns of the first sheet (selected by position)
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx"
,usecols = [3,5,6])
df_excel
# The same three columns, selected by name
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx"
,usecols = ["date","Client_Id","Driver_Id"])
df_excel
date Client_Id Driver_Id
0 2013-10-01 1 10
1 2013-10-01 2 11
2 2013-10-01 3 12
3 2013-10-01 4 13
4 2013-10-02 1 10
5 2013-10-02 2 11
6 2013-10-02 3 12
7 2013-10-03 2 12
8 2013-10-03 3 10
9 2013-10-03 4 13
# Read only the first two data rows
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx"
,nrows = 2)
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
# Combine the year, month and day columns (positions 0, 1, 2) into a single date column
df_excel = pd.read_excel(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.xlsx"
,parse_dates = [[0,1,2]])
df_excel
year_month_day date Id Client_Id Driver_Id City_Id Status
0 2013-10-01 2013-10-01 1 1 10 1 completed
1 2013-10-01 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013-10-01 2013-10-01 3 3 12 6 completed
3 2013-10-01 2013-10-01 4 4 13 6 cancelled_by_client
4 2013-10-02 2013-10-02 5 1 10 1 completed
5 2013-10-02 2013-10-02 6 2 11 6 completed
6 2013-10-02 2013-10-02 7 3 12 6 completed
7 2013-10-03 2013-10-03 8 2 12 12 completed
8 2013-10-03 2013-10-03 9 3 10 12 completed
9 2013-10-03 2013-10-03 10 4 13 12 cancelled_by_driver
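As far as I know, recent pandas releases warn that the nested-list form of `parse_dates` is deprecated, so it may be worth knowing another route. A minimal sketch, assuming the columns are named `year`, `month` and `day` as in the table above: read the data normally, then assemble the date with `pd.to_datetime`. The small DataFrame is a hand-typed stand-in for three rows of the sheet, not the real file.

```python
import pandas as pd

# Hand-typed stand-in for three rows of the trips sheet.
df = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [10, 10, 10],
    "day": [1, 2, 3],
    "Id": [1, 5, 8],
})

# pd.to_datetime assembles a date from columns named year, month and day.
df["year_month_day"] = pd.to_datetime(df[["year", "month", "day"]])
dates_as_text = df["year_month_day"].dt.strftime("%Y-%m-%d").tolist()
print(dates_as_text)
```

Unlike `parse_dates = [[0,1,2]]`, this keeps the original three columns. They can be dropped afterwards with `df.drop(columns=["year", "month", "day"])` if they are no longer needed.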
df_csv = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.csv")
df_csv
df_csv = pd.read_csv("D://PythonFlie//python//pandas//pandas_exercise//exercise_data//练习数据.csv")
df_csv
# Option 3: escape every backslash
df_csv = pd.read_csv("D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data\\练习数据.csv")
df_csv
df_csv = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\练习数据.csv"
,sep = ",")
df_csv.head()
df_csv = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\second_cars_info.csv")
df_csv.head()
Brand Name Boarding_time Km Discharge Sec_price New_price
0 �µ� �µ�A6L 2006�� 2.4 CVT ������ 2006��8�� 9.00���� ��3 6.90 50.89��
1 �µ� �µ�A6L 2007�� 2.4 CVT ������ 2007��1�� 8.00���� ��4 8.88 50.89��
2 �µ� �µ�A6L 2004�� 2.4L ���������� 2005��5�� 15.00���� ��2 3.82 54.24��
3 �µ� �µ�A8L 2013�� 45 TFSI quattro������ 2013��10�� 4.80���� ŷ4 44.80 101.06��
4 �µ� �µ�A6L 2014�� 30 FSI ������ 2014��9�� 0.81���� ��4,��5 33.19 54.99��
df_csv = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\second_cars_info.csv"
,encoding = "gbk")
df_csv.head()
Brand Name Boarding_time Km Discharge Sec_price New_price
0 奥迪 奥迪A6L 2006款 2.4 CVT 舒适型 2006年8月 9.00万公里 国3 6.90 50.89万
1 奥迪 奥迪A6L 2007款 2.4 CVT 舒适型 2007年1月 8.00万公里 国4 8.88 50.89万
2 奥迪 奥迪A6L 2004款 2.4L 技术领先型 2005年5月 15.00万公里 国2 3.82 54.24万
3 奥迪 奥迪A8L 2013款 45 TFSI quattro舒适型 2013年10月 4.80万公里 欧4 44.80 101.06万
4 奥迪 奥迪A6L 2014款 30 FSI 豪华型 2014年9月 0.81万公里 国4,国5 33.19 54.99万
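The garbled output earlier appears because the file's bytes are GBK but are being read with a different encoding. A minimal sketch of the same effect without the real file: the two-line CSV below is invented for illustration and is encoded to GBK in memory.

```python
import io

import pandas as pd

# Invented two-line CSV, encoded as GBK the way the used-car file appears to be.
raw = "Brand,Sec_price\n奥迪,6.90\n".encode("gbk")

# The same bytes are not valid UTF-8, which is why the default decoding fails or garbles.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the real encoding recovers the Chinese text.
df = pd.read_csv(io.BytesIO(raw), encoding="gbk")
print(utf8_ok, df.loc[0, "Brand"])
```

If the encoding of a file is unknown, trying `"gbk"` or `"gb18030"` is a common first guess for files produced on Chinese Windows systems, but that is only a guess.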
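# The columns are separated by runs of whitespace, so the separator is the regular expression \s+ (written as a raw string)
data = pd.read_table(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\wind.data",sep = r"\s+",parse_dates = [[0,1,2]])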
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
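To see what `sep = r"\s+"` does without the wind file, here is a small sketch that reads whitespace-separated text from memory. The two data rows reuse the first values of the output above; the `Yr`, `Mo`, `Dy` header names and the two-digit year `61` are my assumptions about how the raw file looks.

```python
import io

import pandas as pd

# Assumed layout of the raw file: columns separated by one or more spaces.
text = """Yr Mo Dy RPT VAL
61  1  1 15.04 14.96
61  1  2 14.71   NaN
"""

# r"\s+" is a regular expression meaning "one or more whitespace characters".
data = pd.read_csv(io.StringIO(text), sep=r"\s+")
missing_val = int(data["VAL"].isna().sum())
print(data.shape, missing_val)
```

Writing the pattern as a raw string avoids Python treating `\s` as a string escape.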
# View the whole DataFrame
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.head()
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
# n can be negative: head(-3) returns every row except the last 3
df_excel.head(-3)
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
# Asking for more rows than the data contains does not raise an error; all existing rows are returned
df_excel.head(11)
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.tail()
year month day date Id Client_Id Driver_Id City_Id Status
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.tail(11)
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 completed
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.tail(-3)
year month day date Id Client_Id Driver_Id City_Id Status
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
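The negative arguments are easier to remember as slices. A minimal sketch on a stand-in DataFrame with ten rows (only an `Id` column, invented for illustration): `head(-3)` matches `iloc[:-3]` and `tail(-3)` matches `iloc[3:]`.

```python
import pandas as pd

# Stand-in for the ten-row trips table; only the Id column is kept.
df = pd.DataFrame({"Id": range(1, 11)})

head_neg = df.head(-3)   # everything except the last 3 rows
tail_neg = df.tail(-3)   # everything except the first 3 rows

print(head_neg["Id"].tolist())
print(tail_neg["Id"].tolist())
```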
df_excel.shape
(10, 9)
# Number of rows
df_excel.shape[0]
10
# Number of columns
df_excel.shape[1]
9
df_excel.columns
Index(['year', 'month', 'day', 'date', 'Id', 'Client_Id', 'Driver_Id',
'City_Id', 'Status'],
dtype='object')
df_excel.index
RangeIndex(start=0, stop=10, step=1)
df_excel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 10 non-null int64
1 month 10 non-null int64
2 day 10 non-null int64
3 date 10 non-null datetime64[ns]
4 Id 10 non-null int64
5 Client_Id 10 non-null int64
6 Driver_Id 10 non-null int64
7 City_Id 10 non-null int64
8 Status 10 non-null object
dtypes: datetime64[ns](1), int64(7), object(1)
memory usage: 848.0+ bytes
# The table has no missing values: isnull() returns False for every cell
df_excel.isnull()
year month day date Id Client_Id Driver_Id City_Id Status
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
5 False False False False False False False False False
6 False False False False False False False False False
7 False False False False False False False False False
8 False False False False False False False False False
9 False False False False False False False False False
# Count the missing values in each column
df_excel.isnull().sum()
year 0
month 0
day 0
date 0
Id 0
Client_Id 0
Driver_Id 0
City_Id 0
Status 0
dtype: int64
# Count the non-missing values in each column
df_excel.shape[0] - df_excel.isnull().sum()
year 10
month 10
day 10
date 10
Id 10
Client_Id 10
Driver_Id 10
City_Id 10
Status 10
dtype: int64
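The subtraction above works, and `DataFrame.count()` gives the same numbers directly. A minimal sketch on an invented three-row frame with one missing `Status`:

```python
import numpy as np
import pandas as pd

# Invented three-row frame with one missing Status value.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Status": ["completed", np.nan, "completed"],
})

by_subtraction = df.shape[0] - df.isnull().sum()  # rows minus missing values
by_count = df.count()                             # non-missing values per column

print(by_subtraction.to_dict())
print(by_count.to_dict())
```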
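# To demonstrate dropping missing values, first set one value to missing; this needs the NumPy library, where a missing value is np.nan
import numpy as np
# .loc assigns directly into df_excel and avoids the chained-assignment warning
df_excel.loc[0, "Status"] = np.nan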
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 NaN
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
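# Drop missing values: every row that contains a missing value is removed entirely
# inplace=True modifies df_excel itself, so afterwards it holds only the rows without missing values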
df_excel.dropna(inplace=True)
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
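`dropna` can also return a new DataFrame and leave the original alone, and `subset` limits which columns are checked. A minimal sketch on an invented frame; the column names follow the trips table but the values are made up.

```python
import numpy as np
import pandas as pd

# Invented frame: one missing Status and one missing City_Id.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "City_Id": [1, np.nan, 6],
    "Status": [np.nan, "completed", "completed"],
})

# Only rows with a missing Status are dropped; the missing City_Id is ignored.
cleaned = df.dropna(subset=["Status"])

print(cleaned["Id"].tolist())
print(len(df))  # the original still has all three rows
```

Assigning the result to a new name keeps the untouched data around, which is handy while still exploring.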
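# Fill every missing value with 1 instead of dropping the row (run on the data that still contains the NaN)
df_excel.fillna(1,inplace=True)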
df_excel
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 1
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
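Filling a text column such as `Status` with the number 1 is rarely what you want in real data. `fillna` also accepts a dict that maps each column to its own fill value. A minimal sketch on an invented frame; `"unknown"` and `0` are placeholder choices of mine, not values from the exercise.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing value in each column.
df = pd.DataFrame({
    "City_Id": [1.0, np.nan],
    "Status": [np.nan, "completed"],
})

# Each column gets its own replacement value.
filled = df.fillna({"City_Id": 0, "Status": "unknown"})

print(filled["City_Id"].tolist())
print(filled["Status"].tolist())
```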
df_excel.describe()
year month day Id Client_Id Driver_Id City_Id
count 10.0 10.0 10.000000 10.00000 10.000000 10.000000 10.000000
mean 2013.0 10.0 1.900000 5.50000 2.500000 11.400000 6.300000
std 0.0 0.0 0.875595 3.02765 1.080123 1.173788 4.498148
min 2013.0 10.0 1.000000 1.00000 1.000000 10.000000 1.000000
25% 2013.0 10.0 1.000000 3.25000 2.000000 10.250000 2.250000
50% 2013.0 10.0 2.000000 5.50000 2.500000 11.500000 6.000000
75% 2013.0 10.0 2.750000 7.75000 3.000000 12.000000 10.500000
max 2013.0 10.0 3.000000 10.00000 4.000000 13.000000 12.000000
df_excel.nunique()
year 1
month 1
day 3
date 3
Id 10
Client_Id 4
Driver_Id 4
City_Id 3
Status 3
dtype: int64
df_excel.drop_duplicates()
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 1
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.drop_duplicates(subset="Client_Id")
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 1
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
df_excel.drop_duplicates(subset=["Client_Id","Driver_Id"])
year month day date Id Client_Id Driver_Id City_Id Status
0 2013 10 1 2013-10-01 1 1 10 1 1
1 2013 10 1 2013-10-01 2 2 11 1 cancelled_by_driver
2 2013 10 1 2013-10-01 3 3 12 6 completed
3 2013 10 1 2013-10-01 4 4 13 6 cancelled_by_client
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
df_excel.drop_duplicates(subset=["Client_Id","Driver_Id"],keep = "last")
year month day date Id Client_Id Driver_Id City_Id Status
4 2013 10 2 2013-10-02 5 1 10 1 completed
5 2013 10 2 2013-10-02 6 2 11 6 completed
6 2013 10 2 2013-10-02 7 3 12 6 completed
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
9 2013 10 3 2013-10-03 10 4 13 12 cancelled_by_driver
df_excel.drop_duplicates(subset=["Client_Id","Driver_Id"],keep = False)
year month day date Id Client_Id Driver_Id City_Id Status
7 2013 10 3 2013-10-03 8 2 12 12 completed
8 2013 10 3 2013-10-03 9 3 10 12 completed
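Before dropping anything it can help to look at which rows pandas considers duplicates. `duplicated()` takes the same `subset` and `keep` arguments and returns a boolean mask. A minimal sketch on an invented four-row frame:

```python
import pandas as pd

# Invented frame: rows 0 and 2 share the same Client_Id / Driver_Id pair.
df = pd.DataFrame({
    "Client_Id": [1, 2, 1, 2],
    "Driver_Id": [10, 11, 10, 12],
})

# keep=False marks every member of a duplicated group, not only the later ones.
mask = df.duplicated(subset=["Client_Id", "Driver_Id"], keep=False)

# Rows that are not marked are exactly what drop_duplicates(keep=False) keeps.
unique_rows = df[~mask]

print(mask.tolist())
print(unique_rows.index.tolist())
```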