Pandas 库是一个免费、开源的第三方 Python 库,是 Python 数据分析必不可少的工具之一,它为 Python 数据分析提供了高性能,且易于使用的数据结构,即 Series 和 DataFrame。Pandas 自诞生后被应用于众多的领域,比如金融、统计学、社会科学、建筑工程等。
Pandas 名字衍生自术语 “panel data”(面板数据)和 “Python data analysis”(Python 数据分析)。
Pandas 库基于 Python NumPy 库开发而来,因此,它可以与 Python 的科学计算库配合使用。Pandas 提供了两种数据结构,分别是 Series(一维数组结构)与 DataFrame(二维数组结构),这两种数据结构极大地增强的了 Pandas 的数据分析能力。
内容提要:
Pandas Resource 资源
Pandas 的优势
Pandas data model 数据结构
Pandas 抽象数据类型:Index Object
Pandas 抽象数据类型:Series
Python 内置的聚合函数zip() 和 Unzip with zip ( *zipped_obj)
Pandas 抽象数据类型:DataFrame
Pandas 属性和核心信息方法
导入包并 check version,通常 pd 作为 pandas 的别名,更多信息可参考 pandas 官网。
import pandas as pd
version = pd.__version__
print(version)
输出:
1.1.1
1 维不可变数组
: 一个有序的,可切分的序列可哈希的对象 hashable objects
pd.Index(data=None, dtype=None, copy=False, ame=None, fastpath=False, tupleize_cols=True, **kwargs)
例如:
import pandas as pd
column_label = pd.Index(['name', 'age', 'salary'])
row_label = pd.Index(['ID'+ str(i) for i in range (1, 3)])
print("column label:{}".format(column_label))
print("the type of column label:{}".format(type(column_label)))
print("row label:{}".format(row_label))
输出:
column label:Index(['name', 'age', 'salary'], dtype='object')
the type of column label:<class 'pandas.core.indexes.base.Index'>
row label:Index(['ID1', 'ID2'], dtype='object')
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
默认 index 的 Series
Series 默认 index 是 np.arange()
import pandas as pd
s = pd.Series (data = ['Hunt', 'Smith'],name = "Last Name")
print("series:\n{}".format(s))
print("series index:{}".format(s.index))
输出:
series:
0 Hunt
1 Smith
Name: Last Name, dtype: object
series index:RangeIndex(start=0, stop=2, step=1)
带轴标签的 Series
import pandas as pd
row_label = pd.Index(['ID'+ str(i) for i in range (1, 3)])
s = pd.Series (data = ['Hunt', 'Smith'],name = "Last Name", index=row_label)
print("series:\n{}".format(s))
print("series index:{}".format(s.index))
输出:
series:
ID1 Hunt
ID2 Smith
Name: Last Name, dtype: object
series index:Index(['ID1', 'ID2'], dtype='object')
indexing 访问
可以通过位置或标签来访问 pandas 的 series 对象
例如:
import pandas as pd
row_label = pd.Index(['ID'+ str(i) for i in range (1, 3)])
s = pd.Series (data = ['Hunt', 'Smith'],name = "Last Name", index=row_label)
print("series:\n{}".format(s))
# Change
s[0] = ["kelly"]
print("s[0]:{}".format(s[0]))
print("s[1]:{}".format(s[1]))
print("s['ID1']:{}".format(s['ID1']))
print("s['ID1']:{}".format(s['ID2']))
输出:
series:
ID1 Hunt
ID2 Smith
Name: Last Name, dtype: object
s[0]:['kelly']
s[1]:Smith
s['ID1']:['kelly']
s['ID1']:Smith
zip() 和 zip ( *zipped_obj) 例子:
import pandas as pd
fnames = ['Ann', 'Jon']
lnames = ['Hunt', 'Smith']
ages = [33, 45]
zipped = list(zip(fnames, lnames, ages)) # list of tuples/records
print("zipped:{}".format(zipped))
# unzip into tuples
f, l, a = zip (*zipped)
print ("unzip f:{}".format(f))
print ("unzip l:{}".format(l))
print ("unzip a:{}".format(a))
输出:
zipped:[('Ann', 'Hunt', 33), ('Jon', 'Smith', 45)]
unzip f:('Ann', 'Jon')
unzip l:('Hunt', 'Smith')
unzip a:(33, 45)
zipping tuples 也是可以的
import pandas as pd
a = (1, 2)
b = (3, 4)
zipped_tuple = list (zip (a,b))
print(zipped_tuple)
输出:
[(1, 3), (2, 4)]
pd.DataFrame (data=None, index=None, columns=None, type=None, copy=False)
默认 index 和 columns 的 DataFrame
默认的 index 和 columns 是 np.arange()
例如:
import pandas as pd
fnames = ['Ann', 'Jon']
lnames = ['Hunt', 'Smith']
ages = [33, 45]
zipped = list(zip(fnames, lnames, ages))
print("zipped:{}".format(zipped))
df = pd.DataFrame(data = zipped)
print("df:\n{}".format(df))
print("df index:{}".format(df.index))
print("df column:{}".format(df.columns))
输出:
zipped:[('Ann', 'Hunt', 33), ('Jon', 'Smith', 45)]
df:
0 1 2
0 Ann Hunt 33
1 Jon Smith 45
df index:RangeIndex(start=0, stop=2, step=1)
df column:RangeIndex(start=0, stop=3, step=1)
import pandas as pd
fnames = ['Ann', 'Jon']
lnames = ['Hunt', 'Smith']
ages = [33, 45]
zipped = list(zip(fnames, lnames, ages))
row_labels = pd.Index (['ID'+str(i) for i in range (1, 3)])
col_labels = pd.Index (['First Name', 'Last Name', 'Age'])
df = pd.DataFrame(data = zipped, index=row_labels, columns=col_labels)
print("zipped:{}".format(zipped))
print("df:\n{}".format(df))
print("df index:{}".format(df.index))
print("df column:{}".format(df.columns))
输出:
zipped:[('Ann', 'Hunt', 33), ('Jon', 'Smith', 45)]
df:
First Name Last Name Age
ID1 Ann Hunt 33
ID2 Jon Smith 45
df index:Index(['ID1', 'ID2'], dtype='object')
df column:Index(['First Name', 'Last Name', 'Age'], dtype='object')
为最快的速度进行矢量运算,Pandas 数据的类型最好是一样,但不是必须要求一致的类型。
属性:
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
print("sales_df.shape:{}".format(sales_df.shape))
print("sales_df['Date'].shape:{}".format(sales_df['Date'].shape))
print("sales_df.axes:{}".format(sales_df.axes))
print("sales_df.axes[0]:{}".format(sales_df.axes[0]))
print("sales_df.axes[1]:{}".format(sales_df.axes[1]))
print("sales_df.index:{}".format(sales_df.index))
print("sales_df.columns:{}".format(sales_df.columns))
print("sales_df['UPS'].name:{}".format(sales_df['UPS'].name))
print("sales_df['UPS'].axes:{}".format(sales_df['UPS'].axes))
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_df.shape:(7, 4)
sales_df['Date'].shape:(7,)
sales_df.axes:[RangeIndex(start=0, stop=7, step=1), Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')]
sales_df.axes[0]:RangeIndex(start=0, stop=7, step=1)
sales_df.axes[1]:Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')
sales_df.index:RangeIndex(start=0, stop=7, step=1)
sales_df.columns:Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')
sales_df['UPS'].name:UPS
sales_df['UPS'].axes:[RangeIndex(start=0, stop=7, step=1)]
核心信息方法:
import pandas as pd
sales_df = pd.read_csv("../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/sales.csv")
print("sales_df:\n{}".format(sales_df))
print("sales_df.head(2):\n{}".format(sales_df.head(2)))
print("sales_df.tail(2):\n{}".format(sales_df.tail(2)))
print("sales_df info =======================")
sales_df.info()
输出:
sales_df:
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
2 1234 3.0 13.0 1/3/2014
3 789 1.0 2.0 1/1/2014
4 789 2.0 3.8 1/2/2014
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_df.head(2):
UPS Units Sales Date
0 1234 5.0 20.2 1/1/2014
1 1234 2.0 8.0 1/2/2014
sales_df.tail(2):
UPS Units Sales Date
5 789 NaN NaN 1/3/2014
6 789 1.0 1.8 1/5/2014
sales_df info =======================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UPS 7 non-null int64
1 Units 6 non-null float64
2 Sales 6 non-null float64
3 Date 7 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 352.0+ bytes