Converting Between Series, DataFrame (pandas), and ndarray (numpy)

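I work in data analysis and use pandas and numpy all the time. Even after using them for a long while, a few things still confuse me, so I am taking some time to sort them out.

The rest of this post covers two angles: what these objects are (what) and how to convert between them (how).

As usual: talk is cheap, show me the code.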

Ⅰ. What

1.1 numpy.ndarray

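numpy.ndarray (ndarray below) can be understood as a multidimensional, homogeneous array; the name breaks down into n (multi) - d (dimension) - array. It has two parts:

  1. array object: the data stored in the array;
  2. data-type object: the metadata that describes that data.

The data has the following properties:

  • multidimensional
  • homogeneous (every element has the same data type)
  • fixed-size

The metadata mainly covers byte order, the number of bytes each element occupies, the data type, and so on.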
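To make the data/metadata split concrete, here is a minimal sketch that reads the metadata off an array's dtype. The names `arr` and `dt` are mine, and I pin the dtype to 32-bit integers explicitly so the numbers do not depend on the platform default.

```python
import numpy as np

# Explicit 32-bit integers, so itemsize is the same on every platform.
arr = np.arange(12, dtype="<i4").reshape(3, 4)

dt = arr.dtype
print(dt.kind)        # 'i' -> signed integer
print(dt.itemsize)    # 4 bytes per element
print(dt.byteorder)   # '=' means native order; '<' / '>' mean little / big endian

# The array object itself only adds shape and layout on top of that metadata.
print(arr.nbytes)     # 12 elements * 4 bytes = 48
print(arr.strides)    # (16, 4): bytes to step to the next row / next column
```

The strides follow directly from the metadata: one row is 4 elements of 4 bytes each, so moving down a row skips 16 bytes.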

Here is the description from the official documentation:

numpy.ndarray

An array object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)

Parameters

class numpy.ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None)

Examples

ndarray is typically used to create and manipulate matrices. Below we create a simple ndarray object.

import numpy as np

# nda = np.array(range(12)).reshape(3, -1)  # same result as the line below
nda = np.arange(12).reshape(3, -1)
nda[1]                    # second row: array([4, 5, 6, 7])
nda[1, 1] == nda[1][1]    # True

Inspect the object, its type, its shape, and its dtype:

>>> nda
Out[72]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>> type(nda)
Out[73]: numpy.ndarray
    
>>> nda.shape
Out[74]: (3, 4)
    
>>> nda.dtype
Out[75]: dtype('int32')
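A note on the last output: the default integer dtype depends on the platform and NumPy version, so you may well see int64 instead of int32. The sketch below also illustrates the "homogeneous" and "fixed-size" properties from the documentation quote; the variable names are my own.

```python
import numpy as np

# Homogeneous: mixing int and float upcasts every element to one common type.
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])
print(ints.dtype.kind)    # 'i' (int32 or int64, depending on the platform)
print(mixed.dtype)        # float64

# Fixed-size: "appending" does not grow the array, it builds a new one.
grown = np.append(ints, 4)
print(ints.shape)         # (3,)
print(grown.shape)        # (4,)
print(grown is ints)      # False
```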

1.2 pandas.Series

A Series is a one-dimensional array with axis labels. Its values do not have to share one type, though: when they differ, the Series is stored with dtype object.

Official documentation: pandas.Series

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Parameters

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
  • data array-like, Iterable, dict, or scalar value

    Contains data stored in Series. If data is a dict, argument order is maintained.

  • index array-like or Index (1d)

    Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.

  • dtype str, numpy.dtype, or ExtensionDtype, optional

    Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.

  • name str, optional

    The name to give to the Series.

  • copy bool, default False

    Copy input data.

Examples

Below we create a couple of Series.

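import pandas as pd

d1 = {'a': 1, 'b': 2, 'c': 3}
# d1 = {'a': 1, 'b': 2, 'c': 'hello'}  # mixed value types are allowed, but generally not recommended
ser1 = pd.Series(data=d1, index=['a', 'b', 'c', 'd'])  # 'd' is not in d1, so it becomes NaN

d2 = [['python', 10, 99, 'male'],
      ['java', 14, 92, 'female'],
      ['c', 18, 97, 'male'],
      ['go', 22, 90, 'female']]
ser2 = pd.Series(data=d2, index=['lst', '2nd', '3rd', '4th'])
ser1.iloc[1]  # positional access; plain ser1[1] on a label index is deprecated in recent pandas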

Check the output:

>>> ser1
Out[88]: 
a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64
>>> ser2
Out[89]: 
lst    [python, 10, 99, male]
2nd    [java, 14, 92, female]
3rd         [c, 18, 97, male]
4th      [go, 22, 90, female]
dtype: object
>>> type(ser1)
Out[92]: pandas.core.series.Series
>>> type(ser2)
Out[93]: pandas.core.series.Series
>>> ser1.iloc[1]
Out[94]: 2.0
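The documentation quote above mentions label- and integer-based indexing and statistics that skip missing data. Here is a small sketch of both, plus the mixed-type case; `ser` and `mixed` are names I made up for the illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series({"a": 1, "b": 2, "c": 3}, index=["a", "b", "c", "d"])

# Label-based and position-based access reach the same element.
print(ser.loc["b"])     # 2.0
print(ser.iloc[1])      # 2.0

# Series statistics skip NaN by default; a plain ndarray mean does not.
print(ser.mean())                         # 2.0
print(np.isnan(ser.to_numpy().mean()))    # True

# Mixed value types force the generic object dtype.
mixed = pd.Series({"a": 1, "b": 2, "c": "hello"})
print(mixed.dtype)      # object
```

Using `.loc` and `.iloc` explicitly avoids the ambiguity of `ser[1]`, which could mean either a label or a position.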

1.3 pandas.DataFrame

A DataFrame is a two-dimensional, size-mutable table whose columns can hold different data types. You can think of a DataFrame as a MySQL table.

Official documentation: pandas.DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
  • data ndarray (structured or homogeneous), Iterable, dict, or DataFrame

    Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.

  • index Index or array-like

    Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.

  • columns Index or array-like

    Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

  • dtype dtype, default None

    Data type to force. Only a single dtype is allowed. If None, infer.

  • copy bool, default False

    Copy data from inputs. Only affects DataFrame / 2d ndarray input.

Examples

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

d2 = [['python', 10, 99, 'male'],
      ['java', 14, 92, 'female'],
      ['c', 18, 97, 'male'],
      ['go', 22, 90, 'female']]
df = pd.DataFrame(data=d2, columns=['lang', 'age', 'popular', 'sex'], index=['lst', '2nd', '3rd', '4th'])
>>> df
Out[110]: 
       lang  age  popular     sex
lst  python   10       99    male
2nd    java   14       92  female
3rd       c   18       97    male
4th      go   22       90  female
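To connect this back to the description, the sketch below checks three of the claims on a cut-down version of the table: each column carries its own dtype, a single column is a Series, and the frame can grow in place. The names `rows`, `frame`, and the `senior` column are hypothetical.

```python
import pandas as pd

rows = [["python", 10, 99, "male"],
        ["java", 14, 92, "female"]]
frame = pd.DataFrame(rows, columns=["lang", "age", "popular", "sex"])

# Heterogeneous: dtypes are tracked per column, not per table.
print(frame.dtypes)

# "Dict-like container for Series": selecting a column returns a Series.
col = frame["age"]
print(type(col).__name__)    # Series
print(col.name)              # age

# Size-mutable: a new column can be added in place.
frame["senior"] = frame["age"] > 12
print(frame.shape)           # (2, 5)
```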

Ⅱ. How

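Below is a demonstration of how to convert between Series, DataFrame, and ndarray.

The conversions are simple: to get an ndarray, pass the object to np.array().

To get a pandas object, pass the data to pd.Series() or pd.DataFrame().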

# ndarray => Series
npa = np.arange(12)
ser = pd.Series(npa)
# Series => ndarray
npa_s = np.array(ser)

# ndarray => DataFrame
npa2 = npa.reshape(3, -1)
df = pd.DataFrame(npa2)
# DataFrame => ndarray
npa_d = np.array(df)
npa_v = df.values  # npa_d and npa_v hold the same values

# DataFrame -> Series
type(df[0])  # pandas.core.series.Series
# Series -> DataFrame
pd.DataFrame(ser)
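A few details are worth checking when doing these conversions on labeled data. The following is a sketch under my own assumptions (the labels `x`, `y`, `r1`..`r3` are invented); it uses `to_numpy()` and `to_frame()`, which are the method forms of the conversions above.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr, columns=["x", "y"], index=["r1", "r2", "r3"])

# DataFrame => ndarray: the labels are dropped, only the values survive.
back = df.to_numpy()
print(np.array_equal(back, arr))    # True

# DataFrame => Series: a column, or a row via .loc.
ser = df["x"]
row = df.loc["r2"]
print(list(row.index))              # ['x', 'y']

# Series => DataFrame: the Series name becomes the column label.
one_col = ser.to_frame()
print(list(one_col.columns))        # ['x']

# Columns of different types collapse to a single object array.
mixed = pd.DataFrame({"lang": ["python", "go"], "age": [10, 22]})
print(mixed.to_numpy().dtype)       # object
```

The last case is the one to watch: an ndarray is homogeneous, so converting a mixed-type DataFrame loses the per-column dtypes.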
