Python Data Processing (12): dtypes

13 dtypes

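For the most part, pandas uses NumPy arrays and dtypes for Series and for the individual columns of a DataFrame.

NumPy provides support for the float, int, bool, timedelta64[ns] and datetime64[ns] data types.

Note: NumPy does not support timezone-aware datetimes.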

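This section introduces the pandas extension types, which extend NumPy's type system. The original post listed all of them in a table that was embedded as an image.

[Figure: table of pandas extension types]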

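pandas has two ways to store string data:

object dtype, which can hold any Python object, including strings
StringDtype, which is dedicated to strings

StringDtype is generally recommended. Arbitrary objects can be stored with the object dtype, but doing so leads to performance and interoperability problems and should be avoided where possible.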
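As a minimal sketch of the two storage options, the snippet below builds the same text data once with the object dtype and once with the "string" alias for StringDtype. The variable names are illustrative only, and the printed representation of the string dtype may differ between pandas versions.

```python
import pandas as pd

# Explicit object dtype: the column can hold arbitrary Python objects.
as_object = pd.Series(["a", "b", None], dtype=object)

# Explicit string dtype: a dedicated extension type for text.
as_string = pd.Series(["a", "b", None], dtype="string")

print(as_object.dtype)
print(as_string.dtype)

# The dedicated dtype is an instance of pd.StringDtype.
print(isinstance(as_string.dtype, pd.StringDtype))

# Missing entries are still detected by isna() in both cases.
print(as_string.isna().tolist())
```

The dtype is requested explicitly in both cases so that the result does not depend on how a given pandas version infers strings by default.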

A DataFrame has a convenient dtypes attribute that returns a Series with the data type of each column

In [347]: dft = pd.DataFrame(
   .....:     {
   .....:         "A": np.random.rand(3),
   .....:         "B": 1,
   .....:         "C": "foo",
   .....:         "D": pd.Timestamp("20010102"),
   .....:         "E": pd.Series([1.0] * 3).astype("float32"),
   .....:         "F": False,
   .....:         "G": pd.Series([1] * 3, dtype="int8"),
   .....:     }
   .....: )
   .....: 

In [348]: dft
Out[348]: 
          A  B    C          D    E      F  G
0  0.035962  1  foo 2001-01-02  1.0  False  1
1  0.701379  1  foo 2001-01-02  1.0  False  1
2  0.281885  1  foo 2001-01-02  1.0  False  1

In [349]: dft.dtypes
Out[349]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a Series object, use the dtype attribute.

In [350]: dft["A"].dtype
Out[350]: dtype('float64')

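If a pandas object holds values of several types in a single column, pandas picks a column dtype that can accommodate all of them (upcasting). object is the most general dtype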

# these ints are coerced to floats
In [351]: pd.Series([1, 2, 3, 4, 5, 6.0])
Out[351]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

# string data forces an ``object`` dtype
In [352]: pd.Series([1, 2, 3, 6.0, "foo"])
Out[352]: 
0      1
1      2
2      3
3    6.0
4    foo
dtype: object

The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.value_counts()

In [353]: dft.dtypes.value_counts()
Out[353]: 
float32           1
datetime64[ns]    1
float64           1
bool              1
int8              1
object            1
int64             1
dtype: int64

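Numeric dtypes propagate and can coexist in a DataFrame. If a dtype is specified, whether through the dtype parameter or by passing an ndarray or a Series, it is preserved in DataFrame operations.

Furthermore, different numeric dtypes are not combined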

In [354]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=["A"], dtype="float32")

In [355]: df1
Out[355]: 
          A
0  0.224364
1  1.890546
2  0.182879
3  0.787847
4 -0.188449
5  0.667715
6 -0.011736
7 -0.399073

In [356]: df1.dtypes
Out[356]: 
A    float32
dtype: object

In [357]: df2 = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Series(np.random.randn(8), dtype="float16"),
   .....:         "B": pd.Series(np.random.randn(8)),
   .....:         "C": pd.Series(np.array(np.random.randn(8), dtype="uint8")),
   .....:     }
   .....: )
   .....: 

In [358]: df2
Out[358]: 
          A         B    C
0  0.823242  0.256090    0
1  1.607422  1.426469    0
2 -0.333740 -0.416203  255
3 -0.063477  1.139976    0
4 -1.014648 -1.193477    0
5  0.678711  0.096706    0
6 -0.040863 -1.956850    1
7 -0.357422 -0.714337    0

In [359]: df2.dtypes
Out[359]: 
A    float16
B    float64
C      uint8
dtype: object

13.1 Defaults

By default, integer types are int64 and float types are float64.

The following all result in int64, regardless of whether the platform is 32-bit or 64-bit.

In [360]: pd.DataFrame([1, 2], columns=["a"]).dtypes
Out[360]: 
a    int64
dtype: object

In [361]: pd.DataFrame({"a": [1, 2]}).dtypes
Out[361]: 
a    int64
dtype: object

In [362]: pd.DataFrame({"a": 1}, index=list(range(2))).dtypes
Out[362]: 
a    int64
dtype: object

Note: NumPy chooses a platform-dependent type when it creates an array. The following code returns int32 on a 32-bit platform

In [363]: frame = pd.DataFrame(np.array([1, 2]))
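If a platform-independent result is needed, one option is to request the dtype explicitly when the array is built. The sketch below assumes that int64 is the desired type; nothing in the original text prescribes this choice.

```python
import numpy as np
import pandas as pd

# Passing an explicit dtype removes the dependence on the platform default.
arr = np.array([1, 2], dtype="int64")
frame = pd.DataFrame(arr)

print(frame.dtypes)
```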

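13.2 Upcasting

When combined with other types, a type may be implicitly upcast, meaning it is promoted from its current type to a more general one, for example from int to float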

In [364]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [365]: df3
Out[365]: 
          A         B      C
0  1.047606  0.256090    0.0
1  3.497968  1.426469    0.0
2 -0.150862 -0.416203  255.0
3  0.724370  1.139976    0.0
4 -1.203098 -1.193477    0.0
5  1.346426  0.096706    0.0
6 -0.052599 -1.956850    1.0
7 -0.756495 -0.714337    0.0

In [366]: df3.dtypes
Out[366]: 
A    float32
B    float64
C    float64
dtype: object

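DataFrame.to_numpy() returns the lowest common denominator of the column dtypes, that is, a dtype that can accommodate every column in the resulting homogeneous NumPy array. This can force some upcasting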

In [367]: df3.to_numpy().dtype
Out[367]: dtype('float64')
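To illustrate how to_numpy() settles on a single dtype, here is a small sketch with two hypothetical frames: one mixing integers and floats, and one mixing integers and text.

```python
import numpy as np
import pandas as pd

ints_and_floats = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
with_text = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})

# Integers and floats share float64 as a common dtype.
print(ints_and_floats.to_numpy().dtype)  # float64

# Numbers and text have no common numeric dtype, so object is used.
print(with_text.to_numpy().dtype)  # object
```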

13.3 astype

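You can use the astype() method to explicitly convert a dtype from one type to another.

By default, it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior)

In addition, astype raises an exception if the operation is invalid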

In [368]: df3
Out[368]: 
          A         B      C
0  1.047606  0.256090    0.0
1  3.497968  1.426469    0.0
2 -0.150862 -0.416203  255.0
3  0.724370  1.139976    0.0
4 -1.203098 -1.193477    0.0
5  1.346426  0.096706    0.0
6 -0.052599 -1.956850    1.0
7 -0.756495 -0.714337    0.0

In [369]: df3.dtypes
Out[369]: 
A    float32
B    float64
C    float64
dtype: object

# conversion of dtypes
In [370]: df3.astype("float32").dtypes
Out[370]: 
A    float32
B    float32
C    float32
dtype: object
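The statement that an invalid astype raises an exception can be checked with a short sketch. The sample data is made up, and the exact error message depends on the pandas version.

```python
import pandas as pd

# "three" cannot be parsed as an integer.
s = pd.Series(["1", "2", "three"], dtype=object)

try:
    s.astype("int64")
except ValueError as exc:
    failed = True
    print(f"conversion failed: {exc}")
else:
    failed = False

print(failed)  # True
```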

Convert a subset of columns to a specified type using astype().

In [371]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

In [372]: dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)

In [373]: dft
Out[373]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9

In [374]: dft.dtypes
Out[374]: 
a    uint8
b    uint8
c    int64
dtype: object

Convert certain columns to specific dtypes by passing a dict to astype()

In [375]: dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})

In [376]: dft1 = dft1.astype({"a": np.bool_, "c": np.float64})

In [377]: dft1
Out[377]: 
       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0

In [378]: dft1.dtypes
Out[378]: 
a       bool
b      int64
c    float64
dtype: object

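Note

When you try to convert a subset of columns to a specified type using astype() together with loc, upcasting occurs. loc tries to fit the assigned values into the existing column dtypes, whereas [] overwrites the columns and takes the dtype from the right-hand side.

Therefore, the following code produces an unintended result: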
In [379]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

In [380]: dft.loc[:, ["a", "b"]].astype(np.uint8).dtypes
Out[380]: 
a    uint8
b    uint8
dtype: object

In [381]: dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)

In [382]: dft.dtypes
Out[382]: 
a    int64
b    int64
c    int64
dtype: object

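13.4 Object conversion

pandas offers various functions that try to force conversion from the object dtype to other types.

If the data is already of the correct type but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can convert it to the correct type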

In [383]: import datetime

In [384]: df = pd.DataFrame(
   .....:     [
   .....:         [1, 2],
   .....:         ["a", "b"],
   .....:         [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
   .....:     ]
   .....: )
   .....: 

In [385]: df = df.T

In [386]: df
Out[386]: 
   0  1          2
0  1  a 2016-03-02
1  2  b 2016-03-02

In [387]: df.dtypes
Out[387]: 
0            object
1            object
2    datetime64[ns]
dtype: object

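Because the data was transposed, the original inference stored the columns as object, which infer_objects can correct

In [388]: df.infer_objects().dtypes
Out[388]: 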
0             int64
1            object
2    datetime64[ns]
dtype: object

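The following functions work on one-dimensional object arrays or scalars and convert to a specified type:

to_numeric() (conversion to numeric dtypes)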

In [389]: m = ["1.1", 2, 3]

In [390]: pd.to_numeric(m)
Out[390]: array([1.1, 2. , 3. ])

to_datetime() (conversion to datetime objects)

In [391]: import datetime

In [392]: m = ["2016-07-09", datetime.datetime(2016, 3, 2)]

In [393]: pd.to_datetime(m)
Out[393]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta() (conversion to timedelta objects)

In [394]: m = ["5us", pd.Timedelta("1day")]

In [395]: pd.to_timedelta(m)
Out[395]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

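To force a conversion, you can pass an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object

By default, errors='raise', meaning that any error encountered during conversion raises an exception

However, if errors='coerce', these errors are ignored and pandas converts the problematic elements to pd.NaT or np.nan

This is useful when most of your data has the expected type but a few elements do not conform, and you would rather mark them as missing than raise an exception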

In [396]: import datetime

In [397]: m = ["apple", datetime.datetime(2016, 3, 2)]

In [398]: pd.to_datetime(m, errors="coerce")
Out[398]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [399]: m = ["apple", 2, 3]

In [400]: pd.to_numeric(m, errors="coerce")
Out[400]: array([nan,  2.,  3.])

In [401]: m = ["apple", pd.Timedelta("1day")]

In [402]: pd.to_timedelta(m, errors="coerce")
Out[402]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
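A common follow-up to errors="coerce" is to find out which entries failed to parse. The sketch below uses made-up sample data and assumes the input had no missing values to begin with, so every NaN in the result marks a failed conversion.

```python
import pandas as pd

raw = pd.Series(["10", "20", "n/a", "40"], dtype=object)

# Unparseable entries become NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# Use the NaN positions to look up the offending original values.
bad_rows = raw[numeric.isna()]

print(numeric.tolist())
print(bad_rows.tolist())  # ['n/a']
```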

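With errors='ignore', if any error occurs during the conversion, the function simply returns the input data unchanged. Recent pandas releases deprecate this option in favor of catching the exception explicitly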

In [403]: import datetime

In [404]: m = ["apple", datetime.datetime(2016, 3, 2)]

In [405]: pd.to_datetime(m, errors="ignore")
Out[405]: Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [406]: m = ["apple", 2, 3]

In [407]: pd.to_numeric(m, errors="ignore")
Out[407]: array(['apple', 2, 3], dtype=object)

In [408]: m = ["apple", pd.Timedelta("1day")]

In [409]: pd.to_timedelta(m, errors="ignore")
Out[409]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
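Since errors="ignore" is, to my knowledge, deprecated from pandas 2.2 onward, the same "return the input on failure" behavior can be approximated by catching the exception. The helper name to_numeric_or_original is hypothetical and not part of pandas.

```python
import pandas as pd


def to_numeric_or_original(values):
    """Hypothetical helper: numeric data on success, the input unchanged on failure."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        # Conversion failed, so hand back exactly what was passed in.
        return values


converted = to_numeric_or_original(["1", 2, 3])
untouched = to_numeric_or_original(["apple", 2, 3])

print(converted)
print(untouched)  # ['apple', 2, 3]
```

Handling the exception yourself also makes the fallback explicit, which is easier to spot in review than a silently unconverted result.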

In addition to object conversion, to_numeric() provides another parameter, downcast, which downcasts numeric data to a smaller dtype to save memory

In [410]: m = ["1", 2, 3]

In [411]: pd.to_numeric(m, downcast="integer")  # smallest signed int dtype
Out[411]: array([1, 2, 3], dtype=int8)

In [412]: pd.to_numeric(m, downcast="signed")  # same as 'integer'
Out[412]: array([1, 2, 3], dtype=int8)

In [413]: pd.to_numeric(m, downcast="unsigned")  # smallest unsigned int dtype
Out[413]: array([1, 2, 3], dtype=uint8)

In [414]: pd.to_numeric(m, downcast="float")  # smallest float dtype
Out[414]: array([1., 2., 3.], dtype=float32)
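To give a rough idea of the memory effect of downcast, the following sketch compares the byte size of an int64 Series before and after downcasting. The values 0 to 999 are an arbitrary choice that fits in int16 but not in int8.

```python
import numpy as np
import pandas as pd

big = pd.Series(np.arange(1000, dtype="int64"))

# downcast="integer" picks the smallest signed integer dtype that fits.
small = pd.to_numeric(big, downcast="integer")

print(big.dtype, big.nbytes)      # int64 8000
print(small.dtype, small.nbytes)  # int16 2000
```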

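These methods work only on one-dimensional arrays, lists or scalars, so they cannot be used directly on multi-dimensional objects such as a DataFrame. However, apply lets us run them on each column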

In [415]: import datetime

In [416]: df = pd.DataFrame([["2016-07-09", datetime.datetime(2016, 3, 2)]] * 2, dtype="O")

In [417]: df
Out[417]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00

In [418]: df.apply(pd.to_datetime)
Out[418]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [419]: df = pd.DataFrame([["1.1", 2, 3]] * 2, dtype="O")

In [420]: df
Out[420]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [421]: df.apply(pd.to_numeric)
Out[421]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [422]: df = pd.DataFrame([["5us", pd.Timedelta("1day")]] * 2, dtype="O")

In [423]: df
Out[423]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00

In [424]: df.apply(pd.to_timedelta)
Out[424]: 
                       0      1
0 0 days 00:00:00.000005 1 days
1 0 days 00:00:00.000005 1 days
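apply forwards extra keyword arguments to the function it calls, so the errors option can be combined with a column-wise conversion. The sketch below converts only a chosen subset of columns; the frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": ["1", "2", "bad"],
        "y": ["4.5", "5", "6"],
        "label": ["a", "b", "c"],
    }
)

numeric_cols = ["x", "y"]  # columns we expect to be numeric
converted = df.copy()

# errors="coerce" is passed through apply to pd.to_numeric for each column.
converted[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```

The assignment uses [] rather than loc so that the converted columns keep the dtype of the right-hand side.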

13.5 Gotchas

Performing selection operations on integer data can easily upcast it to float. When no NaN values are introduced, the dtype of the input data is preserved.

In [425]: dfi = df3.astype("int32")

In [426]: dfi["E"] = 1

In [427]: dfi
Out[427]: 
   A  B    C  E
0  1  0    0  1
1  3  1    0  1
2  0  0  255  1
3  0  1    0  1
4 -1 -1    0  1
5  1  0    0  1
6  0 -1    1  1
7  0  0    0  1

In [428]: dfi.dtypes
Out[428]: 
A    int32
B    int32
C    int32
E    int64
dtype: object

In [429]: casted = dfi[dfi > 0]

In [430]: casted
Out[430]: 
     A    B      C  E
0  1.0  NaN    NaN  1
1  3.0  1.0    NaN  1
2  NaN  NaN  255.0  1
3  NaN  1.0    NaN  1
4  NaN  NaN    NaN  1
5  1.0  NaN    NaN  1
6  NaN  NaN    1.0  1
7  NaN  NaN    NaN  1

In [431]: casted.dtypes
Out[431]: 
A    float64
B    float64
C    float64
E      int64
dtype: object

Float dtypes, by contrast, are unchanged

In [432]: dfa = df3.copy()

In [433]: dfa["A"] = dfa["A"].astype("float32")

In [434]: dfa.dtypes
Out[434]: 
A    float32
B    float64
C    float64
dtype: object

In [435]: casted = dfa[df2 > 0]

In [436]: casted
Out[436]: 
          A         B      C
0  1.047606  0.256090    NaN
1  3.497968  1.426469    NaN
2       NaN       NaN  255.0
3       NaN  1.139976    NaN
4       NaN       NaN    NaN
5  1.346426  0.096706    NaN
6       NaN       NaN    1.0
7       NaN       NaN    NaN

In [437]: casted.dtypes
Out[437]: 
A    float32
B    float64
C    float64
dtype: object
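If integer columns need to stay integers even when missing values appear, one possible workaround is the nullable Int64 extension dtype. This goes slightly beyond the examples above and is offered only as a sketch; the Series names are illustrative.

```python
import pandas as pd

plain = pd.Series([1, -2, 3], dtype="int64")
nullable = pd.Series([1, -2, 3], dtype="Int64")

# NumPy-backed integers are upcast to float once a missing value appears.
plain_masked = plain.where(plain > 0)

# The nullable extension dtype keeps integers and marks the gap as missing.
nullable_masked = nullable.where(nullable > 0)

print(plain_masked.dtype)     # float64
print(nullable_masked.dtype)  # Int64
```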

14 Selecting columns based on dtype

The select_dtypes() method extracts columns based on their dtype.

First, let's create a DataFrame with a variety of dtypes

In [438]: df = pd.DataFrame(
   .....:     {
   .....:         "string": list("abc"),
   .....:         "int64": list(range(1, 4)),
   .....:         "uint8": np.arange(3, 6).astype("u1"),
   .....:         "float64": np.arange(4.0, 7.0),
   .....:         "bool1": [True, False, True],
   .....:         "bool2": [False, True, False],
   .....:         "dates": pd.date_range("now", periods=3),
   .....:         "category": pd.Series(list("ABC")).astype("category"),
   .....:     }
   .....: )
   .....: 

In [439]: df["tdeltas"] = df.dates.diff()

In [440]: df["uint64"] = np.arange(3, 6).astype("u8")

In [441]: df["other_dates"] = pd.date_range("20130101", periods=3)

In [442]: df["tz_aware_dates"] = pd.date_range("20130101", periods=3, tz="US/Eastern")

In [443]: df
Out[443]: 
  string  int64  uint8  float64  bool1  ...  category tdeltas uint64 other_dates            tz_aware_dates
0      a      1      3      4.0   True  ...         A     NaT      3  2013-01-01 2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False  ...         B  1 days      4  2013-01-02 2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  ...         C  1 days      5  2013-01-03 2013-01-03 00:00:00-05:00

[3 rows x 12 columns]

The dtypes of all columns

In [444]: df.dtypes
Out[444]: 
string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters:

include: select columns with these dtypes
exclude: leave out columns with these dtypes

For example, to select bool columns

In [445]: df.select_dtypes(include=[bool])
Out[445]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

You can also pass the name of a dtype from the NumPy dtype hierarchy

In [446]: df.select_dtypes(include=["bool"])
Out[446]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

select_dtypes() also works with generic dtypes

For example, to select all numeric and boolean columns while excluding unsigned integers

In [447]: df.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])
Out[447]: 
   int64  float64  bool1  bool2 tdeltas
0      1      4.0   True  False     NaT
1      2      5.0  False   True  1 days
2      3      6.0   True  False  1 days

To select string columns, you must use the object dtype

In [448]: df.select_dtypes(include=["object"])
Out[448]: 
  string
0      a
1      b
2      c
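select_dtypes() also accepts string names for datetime and pandas-specific dtypes. The sketch below uses the "datetime64", "datetimetz" and "category" names as I understand them from the select_dtypes documentation; the frame and its column names are made up.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "when": pd.date_range("20130101", periods=3),
        "when_tz": pd.date_range("20130101", periods=3, tz="UTC"),
        "kind": pd.Series(list("ABC")).astype("category"),
        "value": [1.0, 2.0, 3.0],
    }
)

# Timezone-naive datetimes
naive_cols = list(frame.select_dtypes(include=["datetime64"]).columns)

# Timezone-aware datetimes
aware_cols = list(frame.select_dtypes(include=["datetimetz"]).columns)

# pandas categoricals
category_cols = list(frame.select_dtypes(include=["category"]).columns)

print(naive_cols, aware_cols, category_cols)
```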

To see all the child dtypes of a generic dtype, you can define a function like the following, which returns a tree of child dtypes

In [449]: def subdtypes(dtype):
   .....:     subs = dtype.__subclasses__()
   .....:     if not subs:
   .....:         return dtype
   .....:     return [dtype, [subdtypes(dt) for dt in subs]]

In [450]: subdtypes(np.generic)
Out[450]: 
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.longlong,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.ulonglong]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

Note

pandas also defines the category and datetime64[ns, tz] types, which are not integrated into the regular NumPy hierarchy and therefore do not appear in the result above
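For a quick check of where a concrete dtype sits in the NumPy hierarchy, np.issubdtype can be used instead of walking the whole tree. The dtypes tested below are arbitrary picks.

```python
import numpy as np

checks = {
    "uint8 is unsignedinteger": np.issubdtype(np.uint8, np.unsignedinteger),
    "int64 is number": np.issubdtype(np.int64, np.number),
    "float32 is integer": np.issubdtype(np.float32, np.integer),
    "bool_ is number": np.issubdtype(np.bool_, np.number),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

This also explains why include=["number"] does not pick up bool columns: bool_ sits outside the number branch of the hierarchy.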
