Creating a Series
import numpy as np
import pandas as pd
From a list
s1 = pd.Series([1, 2, 3, 4])
----------------------------
0 1
1 2
2 3
3 4
dtype: int64
From a NumPy array
s2 = pd.Series(np.arange(10))
-----------------------------
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
From a dict (the keys become the index)
# create a Series from a dict
s3 = pd.Series({'a':1, 'b':2, 'c':3})
-------------------------------------
a 1
b 2
c 3
dtype: int64
# Series with an explicit index
# pass the labels as a list: a set has no defined order
s4 = pd.Series([1, 2, 3, 4], index=['B', 'A', 'C', 'D'])
--------------------------------------------------------
B 1
A 2
C 3
D 4
dtype: int64
Converting a Series to a dict
to_dict()
s4.to_dict()
------------
{'B': 1, 'A': 2, 'C': 3, 'D': 4}
Changing the index
# rebuild the Series on a new index; a label that is missing from s4 ('E') becomes NaN
index_1 = ['C', 'D', 'B', 'E', 'A']
s6 = pd.Series(s4, index=index_1)
-----------------------------------
C 3.0
D 4.0
B 1.0
E NaN
A 2.0
dtype: float64
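A minimal sketch of the label alignment at work here, using hypothetical names (`base`, `wider`) rather than the Series above. It suggests that building a Series from an existing Series plus a new index behaves like `reindex`: unmatched labels become NaN and the dtype is upcast to float.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same labels as in the notes.
base = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Passing a Series plus a new index aligns on labels;
# "E" has no match in `base`, so it becomes NaN.
wider = pd.Series(base, index=["A", "B", "C", "D", "E"])

# The same result through reindex.
same = base.reindex(["A", "B", "C", "D", "E"])

print(wider)
print(wider.equals(same))
```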
Series element operations
Checking for missing values
pd.isnull(s6)  # pd.notnull(s6) returns the inverse
---------------------------
C False
D False
B False
E True
A False
dtype: bool
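As a small illustration (the Series `s` below is made up to resemble the output above), the Boolean result can be summed to count missing values or used as a mask to keep only the present ones.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([3.0, 4.0, 1.0, np.nan, 2.0], index=["C", "D", "B", "E", "A"])

missing = pd.isnull(s)    # True where the value is NaN
present = pd.notnull(s)   # element-wise inverse of isnull

n_missing = int(missing.sum())  # booleans sum as 0 / 1
only_present = s[present]       # a Boolean mask keeps the non-missing entries

print(n_missing)
print(only_present)
```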
Naming the Series and its index
s6.name = 'demo'
----------------
C 3.0
D 4.0
B 1.0
E NaN
A 2.0
Name: demo, dtype: float64
==========================
s6.index.name = 'demo index'
s6.index
---------------------------
Index(['C', 'D', 'B', 'E', 'A'], dtype='object', name='demo index')
Creating a DataFrame from the clipboard
# open the page in a browser, then copy the table to the clipboard
import webbrowser
link = 'http://www.tiobe.com/tiobe-index'
webbrowser.open(link)
----------------------------------------
True
========================================
# build a DataFrame from the clipboard contents
df = pd.read_clipboard()
Getting the columns
df.columns
----------
Index(['Nov 2018', 'Nov 2017',
'Change', 'Programming Language',
'Ratings', 'Change.1'], dtype='object')
Getting the values of a specific column
# get the values of the Ratings column
df.Ratings
----------
0 16.746%
1 14.396%
2 8.282%
3 7.683%
4 6.490%
5 3.952%
6 2.655%
Name: Ratings, dtype: object
Selecting several columns (filtering into a new DataFrame)
df_new = pd.DataFrame(df, columns=['Nov 2018', 'Programming Language'])
--------------------------------------------------------------------
Nov 2018 Programming Language
0 1 Java
1 2 C
2 3 C++
3 4 Python
4 5 Visual Basic .NET
5 6 C#
6 7 JavaScript
Selecting a column by name with [] (this also works when the name contains spaces); the result is a Series
df['Programming Language']
-------------------------
0 Java
1 C
2 C++
3 Python
4 Visual Basic .NET
5 C#
6 JavaScript
Name: Programming Language, dtype: object
=========================================
type(df['Programming Language'])
--------------------------------
pandas.core.series.Series
If the filtered DataFrame names a column that does not exist in the original, pandas fills that column with NaN
df_new2 = pd.DataFrame(df, columns=['Nov 2018', 'Sep 2018',
                                    'Programming Language'])
-------------------------------------------------------
Nov 2018 Sep 2018 Programming Language
0 1 NaN Java
1 2 NaN C
2 3 NaN C++
3 4 NaN Python
4 5 NaN Visual Basic .NET
5 6 NaN C#
6 7 NaN JavaScript
Filling the new column
With range
df_new2['Sep 2018'] = range(0, 7)
With a NumPy array (arange)
df_new2['Sep 2018'] = np.arange(10, 17)
With a Series
df_new2['Sep 2018'] = pd.Series(np.arange(20, 27))
Filling selected rows of a column with a Series
# fill only the rows with index 1 and 2 of the new column; all other rows become NaN
df_new2['Sep 2018'] = pd.Series([100, 200], index=[1, 2])
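A short sketch of why only some rows are filled, assuming a made-up frame with columns `lang` and `score`: assigning a Series to a column aligns on the index, so rows without a matching label receive NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default index 0..3.
df = pd.DataFrame({"lang": ["Java", "C", "C++", "Python"]})

# The Series only carries labels 1 and 2; rows 0 and 3 get NaN.
df["score"] = pd.Series([100, 200], index=[1, 2])

print(df)
```

Because NaN is a float, the new column is stored as floating point even though the assigned values were integers.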
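DataFrame
# sample data, with the values shown in the output below
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}
df1 = pd.DataFrame(data)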
------------------------
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
==========================================
# each column of a DataFrame is a Series; a DataFrame is made up of several Series
type(df1['Country'])
--------------------
pandas.core.series.Series
=========================
# iterrows() returns a generator of (index, row) pairs; use a for loop to read them
df1.iterrows()
for row in df1.iterrows():
    print(row)
--------------
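A minimal sketch of what each item from iterrows() looks like, using a cut-down, hypothetical version of the frame: every item is a pair of the row label and a Series keyed by column name, so it can be unpacked directly in the loop.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
})

pairs = []
for idx, row in df.iterrows():
    # idx is the row label; row is a Series indexed by the column names
    pairs.append((idx, row["Country"], row["Capital"]))

print(pairs)
```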
Creating a DataFrame from Series
# build three Series from data
s1 = pd.Series(data['Capital'])
s2 = pd.Series(data['Country'])
s3 = pd.Series(data['Population'])
# create the DataFrame from a list of Series
df_new = pd.DataFrame([s2, s1, s3], index=['Country', 'Capital', 'Population'])
# each Series becomes one row of the DataFrame
df_new
------
0 1 2
Country Belgium India Brazil
Capital Brussels New Delhi Brasilia
Population 11190846 1303171035 207847528
=========================================================
# transpose the DataFrame
df_new = df_new.T
-----------------
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
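One possible alternative, sketched here with hypothetical variable names: passing a dict of Series builds the columns directly, without the transpose. It also appears to keep numeric dtypes, whereas the row-wise build followed by `.T` leaves every column as `object`.

```python
import pandas as pd

country = pd.Series(["Belgium", "India", "Brazil"])
capital = pd.Series(["Brussels", "New Delhi", "Brasilia"])
population = pd.Series([11190846, 1303171035, 207847528])

# Row-wise build followed by a transpose, as in the notes.
by_rows = pd.DataFrame([country, capital, population],
                       index=["Country", "Capital", "Population"]).T

# Column-wise build from a dict of Series.
by_cols = pd.DataFrame({"Country": country,
                        "Capital": capital,
                        "Population": population})

print(by_rows.dtypes)
print(by_cols.dtypes)
```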
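DataFrame and Clipboard (reading from and writing to the clipboard)
# write the data to the clipboard
df1.to_clipboard()
DataFrame and CSV: index=False leaves the index out of the saved file
# save the DataFrame as a CSV file without the index column on the left
df1.to_csv('df1.csv', index=False)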
DataFrame and JSON
# to_json
df1.to_json()
-------------
# read_json
pd.read_json(df1.to_json())
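A small round-trip sketch with a made-up two-row frame. Wrapping the JSON text in `io.StringIO` is a choice made here, since newer pandas versions warn when a literal JSON string is passed to `read_json`.

```python
import io
import pandas as pd

# Hypothetical frame.
df = pd.DataFrame({"Country": ["Belgium", "India"],
                   "Population": [11190846, 1303171035]})

payload = df.to_json()                          # JSON text, keyed by column
restored = pd.read_json(io.StringIO(payload))   # parse it back into a DataFrame

print(payload)
print(restored)
```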
DataFrame and HTML
# to_html
df1.to_html()
DataFrame and Excel
# to_excel
df1.to_excel('df1.xlsx')
shape
# read a CSV file into a DataFrame
imdb = pd.read_csv('J:/csv/movie_metadata.csv')
imdb.shape
----------
(5043, 28)
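head / tail: return the first / last 5 rows by default
iloc: row and column selection by integer position, independent of the labels
# sub_df: assumed here to be the three imdb columns shown in the output
sub_df = imdb[['director_name', 'movie_title', 'imdb_score']]
# rows at positions 10 through 19 (the end position 20 is excluded), all columns
sub_df.iloc[10:20,:]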
--------------------
director_name movie_title imdb_score
10 Zack Snyder Batman v Superman: Dawn of Justice 6.9
11 Bryan Singer Superman Returns 6.1
12 Marc Forster Quantum of Solace 6.7
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest 7.3
14 Gore Verbinski The Lone Ranger 6.5
15 Zack Snyder Man of Steel 7.2
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian 6.6
17 Joss Whedon The Avengers 8.1
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides 6.7
19 Barry Men in Black 3 6.8
loc: row and column selection by label, independent of the integer position
# select by label; unlike iloc, the end label 17 is included
sub_df.loc[15:17,'movie_title']
-------------------------------
15 Man of Steel
16 The Chronicles of Narnia: Prince Caspian
17 The Avengers
Name: movie_title, dtype: object
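The difference between the two slices is easier to see when the labels do not start at 0. A minimal sketch with a hypothetical frame whose index runs from 10 to 15:

```python
import pandas as pd

# Hypothetical frame; labels 10..15 sit at positions 0..5.
df = pd.DataFrame({"title": list("abcdef")}, index=[10, 11, 12, 13, 14, 15])

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = df.loc[11:13]     # labels 11, 12 and 13 (end included)

print(by_position)
print(by_label)
```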
Series reindex: fill_value fills the newly added labels
s1 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
--------------------------------------------------------
A 1
B 2
C 3
D 4
dtype: int64
============
s1.reindex(index=['A', 'B', 'C', 'D','E'], fill_value=10)
------------------------------------------
A 1
B 2
C 3
D 4
E 10
dtype: int64
==============
s2 = pd.Series(['A', 'B', 'C'], index=[1, 5, 10])
----------------------------------------------
1 A
5 B
10 C
dtype: object
=============
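# ffill (forward fill): each new label takes the value of the nearest existing label before it; 0 stays NaN because nothing precedes it, 1-4 take the value at 1, 5-9 the value at 5, and 10-14 the value at 10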
s2.reindex(index=range(15), method='ffill')
-------------------------------------------
0 NaN
1 A
2 A
3 A
4 A
5 B
6 B
7 B
8 B
9 B
10 C
11 C
12 C
13 C
14 C
dtype: object
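The two filling strategies can be compared side by side. This is a sketch with hypothetical Series (`letters`, `nums`), not the exact objects above: `method='ffill'` copies the previous existing value forward, while `fill_value` inserts one constant for every new label.

```python
import pandas as pd

letters = pd.Series(["A", "B", "C"], index=[1, 5, 10])
nums = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Forward fill: label 0 has no predecessor and stays missing.
filled = letters.reindex(range(12), method="ffill")

# Constant fill: the integer dtype is kept because no NaN is introduced.
padded = nums.reindex(["A", "B", "C", "D", "E"], fill_value=10)

print(filled)
print(padded)
```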
DataFrame reindex
# reindex the rows (index) and the columns of a DataFrame in one call
df1.reindex(index=['A', 'B', 'C', 'D'],
columns=['c1', 'c2', 'c3', 'c4'])
---------------------------------------------------------
c1 c2 c3 c4
A 0.282241 0.535411 0.257932 NaN
B 0.105177 0.011686 0.285663 NaN
C 0.084748 0.407965 0.484152 NaN
D NaN NaN NaN NaN
Using reindex / drop for slicing
Series
s1.reindex(index=['A', 'B'])
----------------------------
A 1
B 2
dtype: int64
DataFrame
df1.reindex(index=['A', 'B'])
-----------------------------
c1 c2 c3
A 0.282241 0.535411 0.257932
B 0.105177 0.011686 0.285663
Drop
s1.drop('A')
------------
B 2
C 3
D 4
dtype: int64
============
# drop a row
df1.drop('A', axis=0)
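For completeness, a sketch with a hypothetical frame showing that `axis=1` drops a column in the same way, and that `drop` returns a new object rather than changing the original.

```python
import pandas as pd

# Hypothetical frame.
df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6]}, index=["A", "B", "C"])

no_row = df.drop("A", axis=0)    # remove the row labelled "A"
no_col = df.drop("c1", axis=1)   # remove the column labelled "c1"

print(no_row)
print(no_col)
print(df.shape)  # the original frame is unchanged
```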
Creating a NaN with NumPy
# create a NaN with NumPy
n = np.nan
type(n)
-------
float
Any arithmetic between a number and NaN gives NaN
# any arithmetic between a number and NaN gives NaN
m = 1
m + n
-----
nan
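Two related properties, sketched below: NaN does not compare equal to itself, so it has to be tested with a function such as `math.isnan`, and NumPy offers NaN-aware reductions such as `np.nansum` that skip it.

```python
import math
import numpy as np

n = np.nan
total = 1 + n                    # arithmetic with NaN propagates NaN

is_nan = math.isnan(total)       # the reliable way to test for NaN
self_equal = (n == n)            # False: NaN never equals itself
skipped = np.nansum([1.0, n, 2.0])  # NaN-aware sum ignores the NaN

print(is_nan, self_equal, skipped)
```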
NaN in Series
isnull / notnull: test every element for NaN; the result is a Boolean Series
s1.isnull()
dropna(): remove the entries whose value is NaN
s1.dropna()
NaN in DataFrame
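isnull / notnull: test every element for NaN; the result is a Boolean DataFrame
dframe.isnull()
dropna()
axis=0 works on rows: with how='all', a row is dropped only when all of its values are NaN (the default, how='any', drops a row that contains any NaN)
# how='all': drop a row / column only when every value in it is NaN
df1 = dframe.dropna(axis=0, how='all')
axis=1 works on columns: with how='all', a column is dropped only when all of its values are NaN
df2 = dframe.dropna(axis=1, how='all')
thresh=2: keep only the rows that have at least 2 non-NaN values
dframe2 = pd.DataFrame([[1, 2, 3], [np.nan, 5, 6], [7, np.nan, 9], [np.nan, np.nan, np.nan]])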
---------------------------------------------------------
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
3 NaN NaN NaN
===========================
# thresh=2: a row is dropped when it has fewer than 2 non-NaN values
df2 = dframe2.dropna(thresh=2)
------------------------------
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
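The options can be compared on the same data. A sketch with a frame equal to dframe2 above; the variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [np.nan, 5, 6],
                   [7, np.nan, 9],
                   [np.nan, np.nan, np.nan]])

any_nan = df.dropna()                 # default how="any": drop rows with any NaN
all_nan = df.dropna(how="all")        # drop rows where every value is NaN
at_least_two = df.dropna(thresh=2)    # keep rows with >= 2 non-NaN values
at_least_three = df.dropna(thresh=3)  # keep rows with >= 3 non-NaN values

print(list(any_nan.index), list(all_nan.index))
print(list(at_least_two.index), list(at_least_three.index))
```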
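fillna(): fill NaN entries. It returns a new DataFrame and leaves the original unchanged
value: the value used to fill the NaN entries
# with a dict, each key is a column label and its value fills the NaN entries of that column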
df2.fillna(value={0:0, 1:-1, 2:-2})
-----------------------------------
0 1 2
0 1.0 2.0 3.0
1 0.0 5.0 6.0
2 7.0 -1.0 9.0
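A brief sketch of two points, assuming a frame like df2 above: the original is left untouched, and a Series such as the column means can be passed instead of a dict to fill each column with its own statistic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [np.nan, 5, 6], [7, np.nan, 9]])

by_column = df.fillna(value={0: 0, 1: -1, 2: -2})  # per-column constants
with_mean = df.fillna(df.mean())                   # per-column mean

print(by_column)
print(with_mean)
print(int(df.isna().sum().sum()))  # the original still has its 2 NaN entries
```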
Series with a multi-level index
s1 = pd.Series(np.random.randn(6), index=[['1', '1', '1', '2', '2', '2'], ['a', 'b', 'c', 'a', 'b', 'c']])
-------------------------------------------
1 a 0.227699
b -0.137033
c -0.233315
2 a 0.201417
b 0.683764
c 0.693293
dtype: float64
==============
s1['1']
-------
a 0.227699
b -0.137033
c -0.233315
dtype: float64
==============
s1['1']['a']
------------
0.22769876479819515
===================
s1[:, 'a']
----------
1 0.227699
2 0.201417
dtype: float64
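The same selections can also be written with `.loc` and `xs`. This sketch uses fixed, made-up values instead of random ones so the results are predictable; a tuple key reads one element in a single lookup instead of the chained `s1['1']['a']`.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60],
              index=[["1", "1", "1", "2", "2", "2"],
                     ["a", "b", "c", "a", "b", "c"]])

outer = s.loc["1"]            # everything under outer label "1"
single = s.loc[("1", "a")]    # one element, addressed by a tuple
inner = s.loc[:, "a"]         # inner label "a" under every outer label
inner_xs = s.xs("a", level=1) # the same cross-section via xs

print(outer)
print(single)
print(inner)
```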
Converting between a multi-level Series and a DataFrame: unstack()
# multi-level Series to DataFrame
df1 = s1.unstack()
------------------
a b c
1 0.227699 -0.137033 -0.233315
2 0.201417 0.683764 0.693293
=================================================
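# DataFrame back to a multi-level Series; unstack() puts the column labels in the outer level
s1 = df1.unstack()
# transpose first so that the row labels stay in the outer level, as in the original s1
s2 = df1.T.unstack()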
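A sketch of the level order produced by each call, with fixed made-up values. It also uses `stack()`, which is not in the notes above but is the direct inverse of `unstack()` for this case.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=[["1", "1", "1", "2", "2", "2"],
                     ["a", "b", "c", "a", "b", "c"]])

wide = s.unstack()             # rows: outer level, columns: inner level
swapped = wide.unstack()       # Series keyed by (column, row)
restored = wide.T.unstack()    # Series keyed by (row, column), like s
stacked = wide.stack()         # same layout as `restored`

print(swapped)
print(restored)
```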
DataFrame with a multi-level index and multi-level columns
# multi-level DataFrame
df = pd.DataFrame(np.arange(16).reshape([4, 4]),
index=[['a','a','b','b'], [1,2,1,2]],
columns=[['BJ','BJ','SH','GZ'], ['A','B','C','D']])
---------------------------------------------------------------
BJ SH GZ
A B C D
a 1 0 1 2 3
2 4 5 6 7
b 1 8 9 10 11
2 12 13 14 15
=========================
df['BJ']
--------
A B
a 1 0 1
2 4 5
b 1 8 9
2 12 13
==================
df['BJ']['A']
-------------
a 1 0
2 4
b 1 8
2 12
Name: A, dtype: int32
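Instead of chaining two lookups, a tuple can address both column levels at once, and `.loc` can take one tuple for the row and one for the column. A sketch on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                  columns=[["BJ", "BJ", "SH", "GZ"], ["A", "B", "C", "D"]])

bj = df["BJ"]                          # all columns under "BJ"
col = df[("BJ", "A")]                  # one column, both levels in a tuple
cell = df.loc[("a", 2), ("BJ", "B")]   # one cell: row (a, 2), column (BJ, B)

print(bj)
print(col)
print(cell)
```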
DataFrame Mapping
# create a DataFrame; the columns are 城市 (city) and 人口 (population)
df1 = pd.DataFrame({"城市": ["北京", "上海", "广州"], "人口":[1000, 2000, 1500]})
--------------------------------------------------------
城市 人口
0 北京 1000
1 上海 2000
2 广州 1500
====================
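# adding the GDP column from a Series relies on index alignment: the Series gets the default index 0, 1, 2, so if the DataFrame's index differs, a matching index must be passed for the values to be filled in
# df1['GDP'] = pd.Series([1000, 2000, 1500])
# add the column with map instead: each city is looked up in the dict
gdp_map = {
"北京": 1000,
"上海": 2000,
"广州": 1500
}
df1['GDP'] = df1['城市'].map(gdp_map)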
------------------------------------
城市 人口 GDP
0 北京 1000 1000
1 上海 2000 2000
2 广州 1500 1500
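A sketch of the case that motivates `map`, using a hypothetical frame with English city names and a non-default index: the lookup is by value, so the index does not matter, and a key that is missing from the dict simply yields NaN.

```python
import pandas as pd

# Hypothetical frame whose index is not 0, 1, 2.
df = pd.DataFrame({"city": ["Beijing", "Shanghai", "Guangzhou"]},
                  index=[10, 20, 30])

# "Guangzhou" is deliberately left out of the mapping.
gdp_map = {"Beijing": 1000, "Shanghai": 2000}

df["GDP"] = df["city"].map(gdp_map)

print(df)
```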
Series Replace
# replace in Series
s1 = pd.Series(np.arange(10))
--------------------------
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
============
# replace a single value
s1.replace(1, np.nan)
--------------------
0 0.0
1 NaN
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
dtype: float64
==============
# replace using a dict
s1.replace({2:-2})
------------------
0 0
1 1
2 -2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
============
# replace several values
s1.replace([7,8,9], [-7,-8,-9])
-------------------------------
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 -7
8 -8
9 -9
dtype: int64
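A closing sketch that combines the dict and multi-value forms: one dict can carry several replacements, and, as with the calls above, the result is a new Series while the original stays as it was.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# Several replacements in one dict: old value -> new value.
multi = s.replace({7: -7, 8: -8, 9: -9})

print(multi.tolist())
print(s.tolist())  # the original Series is unchanged
```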