This note covers three parts:
1. Object creation
2. Viewing data
3. Selection
import numpy as np
import pandas as pd
Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
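The float64 dtype above comes from the missing value: NaN is a floating-point value, so one NaN turns a list of integers into a float Series. A minimal sketch of that behavior follows; the variable names and the string labels are illustrative assumptions, not part of the original note.

```python
import numpy as np
import pandas as pd

# Integers only: pandas keeps an integer dtype.
ints = pd.Series([1, 3, 5])

# One missing value makes the whole Series floating point,
# because NaN can only be stored as a float.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

# An explicit index can replace the default 0..n-1 integer index.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(ints.dtype)
print(with_nan.dtype)
print(list(labeled.index))
```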
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns. First create the datetime index:
# Equivalent ways to write the start date:
# dates = pd.date_range("20130101", periods=6)
# dates = pd.date_range("2013-01-01", periods=6)
dates = pd.date_range("2013/01/01", periods=6, freq="D")
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
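The three spellings of the start date shown above should all parse to the same index. A short sketch that checks this, with the names a, b and c chosen only for illustration:

```python
import pandas as pd

a = pd.date_range("20130101", periods=6)
b = pd.date_range("2013-01-01", periods=6)
c = pd.date_range("2013/01/01", periods=6, freq="D")

# All three describe six consecutive days starting on 2013-01-01.
print(a.equals(b) and b.equals(c))
print(a[0], a[-1])
```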
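# Create the DataFrame (the values are random, so they differ on every run)
# df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))  # equivalent column spec
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
df
| | A | B | C | D |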
---|---|---|---|---|
2013-01-01 | -0.520896 | -0.340412 | -1.265841 | -0.419562 |
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
2013-01-06 | -0.337006 | -0.329537 | 1.296696 | -2.602595 |
Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
The columns of the resulting DataFrame have different dtypes:
# Inspect the dtype of each column
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
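When a DataFrame is built from a dictionary, scalar entries are repeated for every row and each array-like entry keeps its own dtype. The sketch below illustrates this on a smaller, hypothetical frame (the name frame and the three-row length are assumptions for the example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": 1.0,                                            # scalar, repeated per row
        "C": pd.Series(1, index=range(3), dtype="float32"),  # sets the row index
        "D": np.array([3] * 3, dtype="int32"),
        "F": "foo",                                          # scalar string, repeated per row
    }
)

print(frame.shape)
print(frame.dtypes)
```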
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled: typing df2. and pressing Tab lists them. The listing is not reproduced here; the cell below calls one of the completed methods, describe(), on df2:
# Summary statistics for df2
df2.describe()
| | A | C | D |
---|---|---|---|
count | 4.0 | 4.0 | 4.0 |
mean | 1.0 | 1.0 | 3.0 |
std | 0.0 | 0.0 | 0.0 |
min | 1.0 | 1.0 | 3.0 |
25% | 1.0 | 1.0 | 3.0 |
50% | 1.0 | 1.0 | 3.0 |
75% | 1.0 | 1.0 | 3.0 |
max | 1.0 | 1.0 | 3.0 |
In the completion list, the columns A, B, C, and D appear alongside the DataFrame's public attributes. E and F are there as well; the rest of the attributes are omitted for brevity.
Here is how to view the top and bottom rows of the frame:
# First rows (head() returns 5 rows by default)
df.head()
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.520896 | -0.340412 | -1.265841 | -0.419562 |
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
# Last 3 rows
df.tail(3)
| | A | B | C | D |
---|---|---|---|---|
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
2013-01-06 | -0.337006 | -0.329537 | 1.296696 | -2.602595 |
Display the index and the columns:
# Show the index
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
# Show the column labels
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data:
df.to_numpy()  # needs pandas 0.24 or later; on older versions use df.values
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:
df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
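The dtype difference described above can be seen on two small frames. This is a minimal sketch with made-up frames named floats and mixed; it only inspects the dtype and shape of the resulting arrays and does not measure cost.

```python
import pandas as pd

# Every column is float64, so the array can stay float64.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# A float column next to a string column has no common numeric dtype,
# so the array falls back to object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(floats.to_numpy().dtype)
print(mixed.to_numpy().dtype)

# Only the values are exported; index and column labels are not part of the array.
print(floats.to_numpy().shape)
```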
describe() shows a quick statistical summary of your data:
df.describe()
| | A | B | C | D |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.336962 | -0.044547 | 0.204596 | -0.925862 |
std | 0.812650 | 1.096672 | 1.037980 | 0.961737 |
min | -0.520896 | -1.922205 | -1.265841 | -2.602595 |
25% | -0.320375 | -0.337693 | -0.384536 | -1.306375 |
50% | 0.225206 | 0.017060 | 0.134896 | -0.521093 |
75% | 0.966991 | 0.707100 | 1.064869 | -0.287179 |
max | 1.380236 | 1.139635 | 1.406446 | -0.133051 |
Transposing your data:
df.T
| | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 |
---|---|---|---|---|---|---|
A | -0.520896 | -0.270485 | 1.380236 | 1.049023 | 0.720896 | -0.337006 |
B | -0.340412 | 1.139635 | -1.922205 | 0.363657 | 0.821581 | -0.329537 |
C | -1.265841 | -0.099596 | 1.406446 | -0.479516 | 0.369389 | 1.296696 |
D | -0.419562 | -0.622623 | -1.534292 | -0.243051 | -0.133051 | -2.602595 |
Sorting by an axis:
# Sort by column labels, descending
df.sort_index(axis=1, ascending=False)
| | D | C | B | A |
---|---|---|---|---|
2013-01-01 | -0.419562 | -1.265841 | -0.340412 | -0.520896 |
2013-01-02 | -0.622623 | -0.099596 | 1.139635 | -0.270485 |
2013-01-03 | -1.534292 | 1.406446 | -1.922205 | 1.380236 |
2013-01-04 | -0.243051 | -0.479516 | 0.363657 | 1.049023 |
2013-01-05 | -0.133051 | 0.369389 | 0.821581 | 0.720896 |
2013-01-06 | -2.602595 | 1.296696 | -0.329537 | -0.337006 |
# Sort by row index, descending
df.sort_index(axis=0, ascending=False)
| | A | B | C | D |
---|---|---|---|---|
2013-01-06 | -0.337006 | -0.329537 | 1.296696 | -2.602595 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
2013-01-01 | -0.520896 | -0.340412 | -1.265841 | -0.419562 |
Sorting by values:
df.sort_values(by="B")
| | A | B | C | D |
---|---|---|---|---|
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
2013-01-01 | -0.520896 | -0.340412 | -1.265841 | -0.419562 |
2013-01-06 | -0.337006 | -0.329537 | 1.296696 | -2.602595 |
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
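Because the frame above is random, the sort order changes from run to run. The following sketch uses a small fixed frame (the name scores and its values are assumptions for illustration) to show sorting by labels, by one column, and by two columns:

```python
import pandas as pd

scores = pd.DataFrame(
    {"B": [0.5, -1.2, 0.5], "A": [3, 1, 2]},
    index=["r1", "r2", "r3"],
)

by_columns = scores.sort_index(axis=1)                   # column labels in ascending order
by_b_desc = scores.sort_values(by="B", ascending=False)  # largest B first
by_b_then_a = scores.sort_values(by=["B", "A"])          # ties in B are broken by A

print(list(by_columns.columns))
print(list(by_b_then_a.index))
```

Passing a list to by is one way to get a deterministic order when the first column contains ties.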
Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc. See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.
Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
2013-01-01 -0.520896
2013-01-02 -0.270485
2013-01-03 1.380236
2013-01-04 1.049023
2013-01-05 0.720896
2013-01-06 -0.337006
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
df[0:3]
| | A | B | C | D |
---|---|---|---|---|
2013-01-01 | -0.520896 | -0.340412 | -1.265841 | -0.419562 |
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
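# Label slice on the datetime index; the end label lies past the last row, so every row from 2013-01-02 onward is returned
df["2013-01-02":"2013-05-04"]
| | A | B | C | D |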
---|---|---|---|---|
2013-01-02 | -0.270485 | 1.139635 | -0.099596 | -0.622623 |
2013-01-03 | 1.380236 | -1.922205 | 1.406446 | -1.534292 |
2013-01-04 | 1.049023 | 0.363657 | -0.479516 | -0.243051 |
2013-01-05 | 0.720896 | 0.821581 | 0.369389 | -0.133051 |
2013-01-06 | -0.337006 | -0.329537 | 1.296696 | -2.602595 |
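The two slices above follow different rules: an integer slice is positional and excludes its end, while a label slice on the datetime index includes its end. A small sketch on a deterministic frame (filled with np.arange instead of random numbers, an assumption made so the result is reproducible):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df[0:3]                         # rows at positions 0, 1, 2
by_label = df["2013-01-02":"2013-01-04"]      # both end labels included
past_end = df["2013-01-02":"2013-05-04"]      # end label beyond the index

print(len(by_position), len(by_label), len(past_end))
```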
Selection by label
See more in Selection by Label.
For getting a cross section using a label:
df.loc[dates[0]]
A -0.520896
B -0.340412
C -1.265841
D -0.419562
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
df.loc[:, ["A", "B"]]
| | A | B |
---|---|---|
2013-01-01 | -0.520896 | -0.340412 |
2013-01-02 | -0.270485 | 1.139635 |
2013-01-03 | 1.380236 | -1.922205 |
2013-01-04 | 1.049023 | 0.363657 |
2013-01-05 | 0.720896 | 0.821581 |
2013-01-06 | -0.337006 | -0.329537 |
Showing label slicing, where both endpoints are included:
df.loc["20130102":"20130104", ["A", "B"]]
| | A | B |
---|---|---|
2013-01-02 | -0.270485 | 1.139635 |
2013-01-03 | 1.380236 | -1.922205 |
2013-01-04 | 1.049023 | 0.363657 |
Reduction in the dimensions of the returned object (a single row label yields a Series):
df.loc["20130102", ["A", "B"]]
A -0.270485
B 1.139635
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
df.loc[dates[0], "A"]
-0.52089556678858
For getting fast access to a scalar (equivalent to the prior method):
df.at[dates[0], "A"]
-0.52089556678858
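The label-based selections in this part return objects of different dimensionality depending on what is passed. A minimal sketch on a deterministic frame (np.arange values are an assumption so the results are reproducible):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                              # one row label -> Series
block = df.loc["20130102":"20130104", ["A", "B"]]   # label slice + column list -> DataFrame
one_row = df.loc["20130102", ["A", "B"]]            # one row label + column list -> Series
value = df.loc[dates[0], "A"]                       # row label + column label -> scalar
same_value = df.at[dates[0], "A"]                   # scalar access via .at

print(type(row).__name__, block.shape, type(one_row).__name__, value, same_value)
```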
Selection by position
See more in Selection by Position.
Select via the position of the passed integers:
df.iloc[3]
A 1.049023
B 0.363657
C -0.479516
D -0.243051
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to NumPy/Python:
df.iloc[3:5, 0:2]
By lists of integer position locations, similar to the NumPy/Python style:
df.iloc[[1, 2, 4], [0, 2]]
| | A | C |
---|---|---|
2013-01-02 | -0.440009 | -0.094901 |
2013-01-03 | -1.095589 | 1.443271 |
2013-01-05 | -0.826357 | 2.082919 |
For slicing rows explicitly:
df.iloc[1:3, :]
| | A | B | C | D |
---|---|---|---|---|
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 1.087610 |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 |
For slicing columns explicitly:
df.iloc[:, 1:3]
| | B | C |
---|---|---|
2013-01-01 | 0.129966 | 0.749187 |
2013-01-02 | 0.666086 | -0.094901 |
2013-01-03 | 0.708428 | 1.443271 |
2013-01-04 | -0.339991 | 0.584877 |
2013-01-05 | 0.072159 | 2.082919 |
2013-01-06 | -0.746247 | 0.195187 |
For getting a value explicitly:
df.iloc[1, 1]
0.6660861685291358
For getting fast access to a scalar (equivalent to the prior method):
df.iat[1, 1]
0.6660861685291358
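The positional selections above can be checked on a frame whose values are known in advance. In this sketch the frame is filled with np.arange (an assumption for reproducibility), so the value in row i and column j is 4*i + j:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                      # fourth row as a Series
block = df.iloc[3:5, 0:2]             # rows 3-4, columns A-B (ends excluded)
picked = df.iloc[[1, 2, 4], [0, 2]]   # explicit row and column positions
value = df.iloc[1, 1]                 # scalar at row 1, column 1
same_value = df.iat[1, 1]             # scalar access via .iat

print(row.tolist())
print(block.to_numpy().tolist())
print(picked.to_numpy().tolist())
print(value, same_value)
```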
Boolean indexing
Using a single column’s values to select data:
df[df["A"] > 0]
| | A | B | C | D |
---|---|---|---|---|
2013-02-28 | 1.671920 | -1.603149 | -0.154643 | -0.752101 |
2013-03-31 | 0.116484 | -0.519765 | 0.918146 | -0.717562 |
2013-04-30 | 0.383318 | 0.410609 | 0.071098 | -0.029965 |
2013-06-30 | 1.059115 | 0.402510 | 0.773409 | -1.164358 |
Selecting values from a DataFrame where a boolean condition is met:
df[df > 0]
| | A | B | C | D |
---|---|---|---|---|
2013-01-31 | NaN | NaN | 1.407725 | NaN |
2013-02-28 | 1.671920 | NaN | NaN | NaN |
2013-03-31 | 0.116484 | NaN | 0.918146 | NaN |
2013-04-30 | 0.383318 | 0.410609 | 0.071098 | NaN |
2013-05-31 | NaN | NaN | 0.362031 | NaN |
2013-06-30 | 1.059115 | 0.402510 | 0.773409 | NaN |
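The two boolean selections behave differently: a boolean Series drops whole rows, while a boolean DataFrame keeps the shape and puts NaN where the condition fails. A minimal sketch on a small hypothetical frame; the comparison with DataFrame.where is added here for illustration and is not part of the original note.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

rows = df[df["A"] > 0]   # keeps only the rows where A is positive
masked = df[df > 0]      # same shape; non-positive cells become NaN

print(rows.index.tolist())
print(masked)
print(masked.equals(df.where(df > 0)))
```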
Using the isin() method for filtering:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-01 | -0.633072 | 0.129966 | 0.749187 | 1.201542 | one |
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 1.087610 | one |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 | two |
2013-01-04 | -0.012166 | -0.339991 | 0.584877 | -0.930127 | three |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | -0.478526 | four |
2013-01-06 | -0.357370 | -0.746247 | 0.195187 | -1.009280 | three |
df2[df2["E"].isin(["two", "four"])]
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 | two |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | -0.478526 | four |
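isin() returns a boolean Series, so it can be negated with ~ to select the complement. A short sketch with a hypothetical frame (the column v is an assumption added so the selected rows are easy to identify):

```python
import pandas as pd

df = pd.DataFrame(
    {"E": ["one", "one", "two", "three", "four", "three"], "v": range(6)}
)

mask = df["E"].isin(["two", "four"])
selected = df[mask]     # rows whose E is "two" or "four"
others = df[~mask]      # every other row

print(selected["v"].tolist())
print(others["E"].tolist())
```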
Setting a new column automatically aligns the data by the indexes:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1
df
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | -0.633072 | 0.129966 | 0.749187 | 1.201542 | NaN |
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 1.087610 | 1.0 |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 | 2.0 |
2013-01-04 | -0.012166 | -0.339991 | 0.584877 | -0.930127 | 3.0 |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | -0.478526 | 4.0 |
2013-01-06 | -0.357370 | -0.746247 | 0.195187 | -1.009280 | 5.0 |
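Alignment means the new column is matched to the frame by index label, not by position. A minimal sketch, assuming a Series whose index starts one day after the frame's index: labels missing from the Series become NaN, and Series labels missing from the frame are dropped.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)

# The Series covers 2013-01-02 .. 2013-01-07, one day later than df.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

# 2013-01-01 has no match in s1 -> NaN; the value for 2013-01-07 is discarded.
print(df)
```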
Setting values by label:
df.at[dates[0], "A"] = 0
df
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.129966 | 0.749187 | 1.201542 | NaN |
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 1.087610 | 1.0 |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 | 2.0 |
2013-01-04 | -0.012166 | -0.339991 | 0.584877 | -0.930127 | 3.0 |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | -0.478526 | 4.0 |
2013-01-06 | -0.357370 | -0.746247 | 0.195187 | -1.009280 | 5.0 |
Setting values by position:
df.iat[0, 1] = 0
df
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.749187 | 1.201542 | NaN |
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 1.087610 | 1.0 |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | -0.012472 | 2.0 |
2013-01-04 | -0.012166 | -0.339991 | 0.584877 | -0.930127 | 3.0 |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | -0.478526 | 4.0 |
2013-01-06 | -0.357370 | -0.746247 | 0.195187 | -1.009280 | 5.0 |
Setting by assigning with a NumPy array:
df.loc[:, "D"] = np.array([5] * len(df))
df
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 0.749187 | 5 | NaN |
2013-01-02 | -0.440009 | 0.666086 | -0.094901 | 5 | 1.0 |
2013-01-03 | -1.095589 | 0.708428 | 1.443271 | 5 | 2.0 |
2013-01-04 | -0.012166 | -0.339991 | 0.584877 | 5 | 3.0 |
2013-01-05 | -0.826357 | 0.072159 | 2.082919 | 5 | 4.0 |
2013-01-06 | -0.357370 | -0.746247 | 0.195187 | 5 | 5.0 |
A where operation with setting:
df2 = df.copy()
# Flip the sign of every positive value
df2[df2 > 0] = -df2
df2
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.749187 | -5 | NaN |
2013-01-02 | -0.440009 | -0.666086 | -0.094901 | -5 | -1.0 |
2013-01-03 | -1.095589 | -0.708428 | -1.443271 | -5 | -2.0 |
2013-01-04 | -0.012166 | -0.339991 | -0.584877 | -5 | -3.0 |
2013-01-05 | -0.826357 | -0.072159 | -2.082919 | -5 | -4.0 |
2013-01-06 | -0.357370 | -0.746247 | -0.195187 | -5 | -5.0 |
# Replace every negative value x with -x + 1
df2[df2 < 0] = -df2 + 1
df2
| | A | B | C | D | F |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | 1.749187 | 6 | NaN |
2013-01-02 | 1.440009 | 1.666086 | 1.094901 | 6 | 2.0 |
2013-01-03 | 2.095589 | 1.708428 | 2.443271 | 6 | 3.0 |
2013-01-04 | 1.012166 | 1.339991 | 1.584877 | 6 | 4.0 |
2013-01-05 | 1.826357 | 1.072159 | 3.082919 | 6 | 5.0 |
2013-01-06 | 1.357370 | 1.746247 | 1.195187 | 6 | 6.0 |
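Assignment through a boolean mask only touches the cells where the mask is True; zeros and NaN are left alone because neither compares greater than zero. A minimal sketch on a small hypothetical frame, with the same result also written using DataFrame.where (added here for comparison):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [-3.0, 4.0, np.nan]})

df2 = df.copy()
df2[df2 > 0] = -df2   # only the positive cells are overwritten

# where() keeps cells where the condition holds and takes -df elsewhere.
same = df.where(~(df > 0), -df)

print(df2)
print(df2.equals(same))
```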