4. Missing data
5. Operations
6. Merge
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
import numpy as np
import pandas as pd
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1
| A | B | C | D | F |
2013-01-01 | 0.184624 | -1.042814 | 0.444349 | -0.259771 | NaN |
2013-01-02 | -0.744011 | -0.390294 | -0.133267 | 0.952179 | 1.0 |
2013-01-03 | 1.003910 | 0.718454 | -0.082483 | 2.182944 | 2.0 |
2013-01-04 | -2.222158 | -0.509435 | -0.367156 | 0.852158 | 3.0 |
2013-01-05 | -0.420209 | 2.178601 | 2.552643 | 0.733452 | 4.0 |
2013-01-06 | 0.450958 | 1.065650 | 0.171798 | 0.701391 | 5.0 |
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
| A | B | C | D | F | E |
2013-01-01 | 0.184624 | -1.042814 | 0.444349 | -0.259771 | NaN | NaN |
2013-01-02 | -0.744011 | -0.390294 | -0.133267 | 0.952179 | 1.0 | NaN |
2013-01-03 | 1.003910 | 0.718454 | -0.082483 | 2.182944 | 2.0 | NaN |
2013-01-04 | -2.222158 | -0.509435 | -0.367156 | 0.852158 | 3.0 | NaN |
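As a small reproducible sketch of the same idea (the frame `small` and its values are made up for illustration), reindexing to a label that does not exist yet fills it with NaN and leaves the original object untouched:

```python
import pandas as pd

# Hypothetical 3-row frame with fixed values so the result is reproducible.
idx = pd.date_range("20130101", periods=3)
small = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=idx)

# Keep the first two rows and request a column "E" that does not exist yet.
out = small.reindex(index=idx[:2], columns=["A", "E"])

print(out)          # column E is all NaN
print(small.shape)  # the original frame still has 3 rows and 1 column
```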
To drop any rows that have missing data:
df1.dropna(how="any")
A | B | C | D | F | E |
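The result above is empty because column E is NaN in every row, so every row contains at least one missing value. The sketch below, on a made-up frame, contrasts `how="any"`, `how="all"` and the `subset` parameter of `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 0 is partly missing.
frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

any_dropped = frame.dropna(how="any")        # drop rows with at least one NaN
all_dropped = frame.dropna(how="all")        # drop rows where every value is NaN
subset_dropped = frame.dropna(subset=["x"])  # only consider column "x"

# An all-NaN column makes how="any" remove every row, as in df1 above.
empty = frame.assign(e=np.nan).dropna(how="any")

print(any_dropped.index.tolist(), all_dropped.index.tolist(), len(empty))
```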
Filling missing data:
df1.fillna(value=5)
| A | B | C | D | F | E |
2013-01-01 | 0.184624 | -1.042814 | 0.444349 | -0.259771 | 5.0 | 5.0 |
2013-01-02 | -0.744011 | -0.390294 | -0.133267 | 0.952179 | 1.0 | 5.0 |
2013-01-03 | 1.003910 | 0.718454 | -0.082483 | 2.182944 | 2.0 | 5.0 |
2013-01-04 | -2.222158 | -0.509435 | -0.367156 | 0.852158 | 3.0 | 5.0 |
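A single constant is not always the right fill. As a sketch on made-up data, `fillna` also accepts a dict that maps column names to fill values, and it returns a new object rather than modifying the original:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
frame = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})

filled_const = frame.fillna(value=5)                   # same constant everywhere
filled_per_col = frame.fillna({"x": 0.0, "y": -1.0})   # one fill value per column

print(filled_const)
print(filled_per_col)
print(frame.isna().sum().sum())  # the original still holds its 2 NaNs
```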
To get the boolean mask where values are nan:
pd.isna(df1)
| A | B | C | D | F | E |
2013-01-01 | False | False | False | False | True | True |
2013-01-02 | False | False | False | False | False | True |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
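A boolean mask like this is mostly useful as input to further steps. The sketch below (made-up data) counts missing values per column and selects the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values in total.
frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

mask = pd.isna(frame)                        # same shape as frame, True where NaN
missing_per_column = mask.sum()              # True counts as 1
rows_with_missing = frame[mask.any(axis=1)]  # rows with at least one NaN

print(missing_per_column)
print(rows_with_missing)
```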
See the Basic section on Binary Ops.
Operations in general exclude missing data.
Performing a descriptive statistic:
df.mean()
A -0.291148
B 0.336694
C 0.430981
D 0.860392
F 3.000000
dtype: float64
Same operation on the other axis:
df.mean(axis=1)
2013-01-01 0.191630
2013-01-02 -0.114052
2013-01-03 0.071200
2013-01-04 -0.257770
2013-01-05 0.466199
2013-01-06 0.878283
Freq: D, dtype: float64
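To make "exclude missing data" concrete, here is a sketch on a made-up frame. By default `mean` skips NaN values; passing `skipna=False` propagates them instead:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has one missing value.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

col_means = frame.mean()           # a: (1 + 3) / 2, b: (4 + 5 + 6) / 3
row_means = frame.mean(axis=1)     # each row averaged over its non-missing values
strict = frame.mean(skipna=False)  # NaN propagates into the result for "a"

print(col_means)
print(row_means)
print(strict)
```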
Operations between objects of different dimensionality require alignment: pandas matches values on their labels and automatically broadcasts along the specified dimension:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
df.sub(s, axis="index")
| A | B | C | D | F |
2013-01-01 | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | 0.003910 | -0.281546 | -1.082483 | 1.182944 | 1.0 |
2013-01-04 | -5.222158 | -3.509435 | -3.367156 | -2.147842 | 0.0 |
2013-01-05 | -5.420209 | -2.821399 | -2.447357 | -4.266548 | -1.0 |
2013-01-06 | NaN | NaN | NaN | NaN | NaN |
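The role of the `axis` argument is easier to see on small numbers. In this sketch (all values made up), `axis="index"` matches the Series against row labels, while `axis="columns"` matches it against column names:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=3)
frame = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]}, index=idx)

# One value per row; the NaN makes the whole second row of the result NaN.
per_row = pd.Series([1.0, np.nan, 3.0], index=idx)
by_row = frame.sub(per_row, axis="index")

# One value per column, broadcast down every row.
per_col = pd.Series({"A": 1.0, "B": 2.0})
by_col = frame.sub(per_col, axis="columns")

print(by_row)
print(by_col)
```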
Applying functions to the data:
df.apply(np.cumsum)
| A | B | C | D | F |
2013-01-01 | 0.184624 | -1.042814 | 0.444349 | -0.259771 | NaN |
2013-01-02 | -0.559387 | -1.433107 | 0.311082 | 0.692408 | 1.0 |
2013-01-03 | 0.444523 | -0.714653 | 0.228599 | 2.875352 | 3.0 |
2013-01-04 | -1.777635 | -1.224088 | -0.138557 | 3.727510 | 6.0 |
2013-01-05 | -2.197844 | 0.954513 | 2.414086 | 4.460962 | 10.0 |
2013-01-06 | -1.746887 | 2.020164 | 2.585884 | 5.162353 | 15.0 |
df.apply(lambda x: x.max() - x.min())
A 3.226068
B 3.221415
C 2.919799
D 2.442716
F 4.000000
dtype: float64
df.apply(lambda x: x.max() - x.min(), axis=1)
2013-01-01 1.487163
2013-01-02 1.744011
2013-01-03 2.265428
2013-01-04 5.222158
2013-01-05 4.420209
2013-01-06 4.828202
Freq: D, dtype: float64
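For reference, a sketch on a tiny made-up frame shows what `apply` passes to the function: one column at a time by default, one row at a time with `axis=1`. For simple reductions like this range, the built-in vectorized methods give the same result:

```python
import pandas as pd

# Hypothetical frame with easy-to-check values.
frame = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [10.0, 30.0, 20.0]})

col_range = frame.apply(lambda x: x.max() - x.min())          # x is a column
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)  # x is a row

# Same column-wise result without apply.
vectorized = frame.max() - frame.min()

print(col_range)
print(row_range)
```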
See more at Histogramming and Discretization.
s = pd.Series(np.random.randint(0, 7, size=10))
0 5
1 2
2 6
3 6
4 4
5 1
6 2
7 3
8 1
9 2
dtype: int64
s.value_counts()
2 3
6 2
1 2
5 1
4 1
3 1
dtype: int64
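The counts above can be reproduced without random data by typing in the ten values shown. The sketch also uses `normalize=True`, which returns relative frequencies instead of counts:

```python
import pandas as pd

# The same ten values as the Series printed above, entered by hand.
s = pd.Series([5, 2, 6, 6, 4, 1, 2, 3, 1, 2])

counts = s.value_counts()                # sorted by count, most frequent first
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```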
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
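The regular-expression default matters in practice. In this sketch, `str.contains` treats its pattern as a regex unless `regex=False` is passed, and `na=False` decides what a missing value maps to. The two-element `dots` Series is made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

lowered = s.str.lower()  # NaN stays NaN

# Regex anchored at the start, case-insensitive; NaN is reported as False.
starts_with_a = s.str.contains(r"^a", case=False, na=False)

# "." is a regex wildcard by default; regex=False matches a literal dot.
dots = pd.Series(["a.b", "axb"])
as_regex = dots.str.contains(".")
as_literal = dots.str.contains(".", regex=False)

print(starts_with_a.tolist())
print(as_regex.tolist(), as_literal.tolist())
```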
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
6.1 Concat
Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10, 4))
| 0 | 1 | 2 | 3 |
0 | 0.488970 | 1.237504 | -1.640805 | -0.672117 |
1 | 0.390873 | 0.906830 | 0.260662 | 0.119989 |
2 | -0.854710 | -0.535410 | 1.641878 | 0.321487 |
3 | -0.134780 | 0.555554 | 1.024371 | -0.103164 |
4 | -1.241929 | -0.116488 | -0.922242 | -2.066726 |
5 | -0.432397 | 2.018692 | -0.536801 | 0.074576 |
6 | 1.452204 | -0.587196 | 0.918798 | 1.192130 |
7 | 0.819954 | 0.224358 | -0.022698 | -0.745293 |
8 | 0.266344 | -0.321944 | 1.251543 | 0.603333 |
9 | -0.491671 | 0.278449 | 0.194751 | 1.056218 |
pieces = [df[:3], df[3:7], df[7:]]
[ 0 1 2 3
0 0.488970 1.237504 -1.640805 -0.672117
1 0.390873 0.906830 0.260662 0.119989
2 -0.854710 -0.535410 1.641878 0.321487,
0 1 2 3
3 -0.134780 0.555554 1.024371 -0.103164
4 -1.241929 -0.116488 -0.922242 -2.066726
5 -0.432397 2.018692 -0.536801 0.074576
6 1.452204 -0.587196 0.918798 1.192130,
0 1 2 3
7 0.819954 0.224358 -0.022698 -0.745293
8 0.266344 -0.321944 1.251543 0.603333
9 -0.491671 0.278449 0.194751 1.056218]
| 0 | 1 | 2 | 3 |
0 | 0.488970 | 1.237504 | -1.640805 | -0.672117 |
1 | 0.390873 | 0.906830 | 0.260662 | 0.119989 |
2 | -0.854710 | -0.535410 | 1.641878 | 0.321487 |
pd.concat(pieces)
| 0 | 1 | 2 | 3 |
0 | 0.488970 | 1.237504 | -1.640805 | -0.672117 |
1 | 0.390873 | 0.906830 | 0.260662 | 0.119989 |
2 | -0.854710 | -0.535410 | 1.641878 | 0.321487 |
3 | -0.134780 | 0.555554 | 1.024371 | -0.103164 |
4 | -1.241929 | -0.116488 | -0.922242 | -2.066726 |
5 | -0.432397 | 2.018692 | -0.536801 | 0.074576 |
6 | 1.452204 | -0.587196 | 0.918798 | 1.192130 |
7 | 0.819954 | 0.224358 | -0.022698 | -0.745293 |
8 | 0.266344 | -0.321944 | 1.251543 | 0.603333 |
9 | -0.491671 | 0.278449 | 0.194751 | 1.056218 |
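A deterministic sketch of the same round trip, using made-up integer data instead of random numbers: slicing a frame into pieces and concatenating them restores the original, and `ignore_index=True` renumbers the rows when the original labels would repeat:

```python
import numpy as np
import pandas as pd

# Hypothetical 6x2 frame holding the integers 0..11.
frame = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["a", "b"])

pieces = [frame[:2], frame[2:4], frame[4:]]
rebuilt = pd.concat(pieces)  # row labels 0..5 are preserved

# Concatenating the same slice twice would repeat labels 0 and 1;
# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([frame[:2], frame[:2]], ignore_index=True)

print(rebuilt.equals(frame))
print(renumbered.index.tolist())
```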
Note: Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
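One way to follow that recommendation, sketched with made-up records: collect plain dicts in a Python list, which is cheap to append to, and build the DataFrame once at the end:

```python
import pandas as pd

# Accumulate rows as dicts; the field names "id" and "square" are illustrative.
records = []
for i in range(5):
    records.append({"id": i, "square": i * i})

# A single constructor call instead of growing a DataFrame row by row.
frame = pd.DataFrame(records)

print(frame)
```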
merge() enables SQL-style joins on specific columns. See the Database style joining section.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
| key | lval |
0 | foo | 1 |
1 | foo | 2 |
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
| key | rval |
0 | foo | 4 |
1 | foo | 5 |
pd.merge(left, right, on="key")
| key | lval | rval |
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
pd.merge(left, right)
| key | lval | rval |
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
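The four rows come from the duplicated key: each of the two "foo" rows on the left pairs with each of the two on the right. If that fan-out is unintended, the `validate` parameter of `merge` can guard against it, as in this sketch:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

merged = pd.merge(left, right, on="key")  # 2 x 2 = 4 rows

# validate="one_to_one" requires unique keys on both sides. pandas raises
# MergeError here, which is a ValueError subclass, so ValueError is caught.
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    raised = False
except ValueError:
    raised = True

print(len(merged), raised)
```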
Another example, this time with unique keys on both sides:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")
| key | lval | rval |
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
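With unique keys, each row matches at most once. The examples above use the default inner join. The sketch below adds made-up keys "baz" and "qux" that appear on only one side, to show how `how="outer"` and `indicator=True` change the result:

```python
import pandas as pd

# "baz" exists only on the left, "qux" only on the right (illustrative data).
left = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")  # only keys present in both frames
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# The _merge column records where each row came from.
origin = outer.set_index("key")["_merge"]

print(inner)
print(outer)
```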