pandas表格处理

使用pandas处理表格数据时,会经常的用到Series.map,Series.apply,Pandas.apply三个方法,熟悉这几个方法的使用,可以极大的提高工作效率。

一、常用的处理表格的函数

pd.read_csv('xx.csv',encoding='utf-8')
pd.read_excel('xxx.xls',encoding='gbk')
to_csv('xxx.csv',encoding='gbk')
to_excel('xxx.xls',encoding='gbk')
pd.concat([df1,df2],ignore_index=True,axis=1)

二、pandas的遍历

pandas提供了iter*系列函数,来遍历DataFrame

pandas表格处理_第1张图片
使用iterrows遍历DataFrame

二、Series方法

在jupyter notebook中的在某个方法后输入?,然后Ctrl+Enter会显示出文档

Series.map

Series.map的文档

Signature: a.map(arg, na_action=None)
Docstring:
Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value,
that may be derived from a function, a ``dict`` or
a :class:`Series`.

Parameters
----------
arg : function, dict, or Series
    Mapping correspondence.
na_action : {None, 'ignore'}, default None
    If 'ignore', propagate NaN values, without passing them to the
    mapping correspondence.

Returns
-------
Series
    Same index as caller.

See Also
--------
Series.apply : For applying more complex functions on a Series.
DataFrame.apply : Apply a function row-/column-wise.
DataFrame.applymap : Apply a function elementwise on a whole DataFrame.

Notes
-----
When ``arg`` is a dictionary, values in Series that are not in the
dictionary (as keys) are converted to ``NaN``. However, if the
dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
provides a method for default values), then this default is used
rather than ``NaN``.

Examples
--------
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
File:      c:\anaconda3\lib\site-packages\pandas\core\series.py
Type:      method

示例

#Series.map的参数为一个字典
data['gender1'] = data['gender'].map({'男':1,'女':0})
#Series.map的参数为一个函数
data['gender2'] = data['gender'].map(lambda x: 1 if x == '男' else 0)
data.head()

结果

pandas表格处理_第2张图片
map函数的执行结果

Series.apply

文档

Signature: a.apply(func, convert_dtype=True, args=(), **kwds)
Docstring:
Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series)
or a Python function that only works on single values.

Parameters
----------
func : function
    Python function or NumPy ufunc to apply.
convert_dtype : bool, default True
    Try to find better dtype for elementwise function results. If
    False, leave as dtype=object.
args : tuple
    Positional arguments passed to func after the series value.
**kwds
    Additional keyword arguments passed to func.

Returns
-------
Series or DataFrame
    If func returns a Series object the result will be a DataFrame.

See Also
--------
Series.map: For element-wise operations.
Series.agg: Only perform aggregating type operations.
Series.transform: Only perform transforming type operations.

Examples
--------
Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an
argument to ``apply()``.

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an
argument to ``apply()``.

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value

>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x

>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the Numpy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
File:      c:\anaconda3\lib\site-packages\pandas\core\series.py
Type:      method

Series.agg

Signature: a.agg(func, axis=0, *args, **kwargs)
Docstring:
Aggregate using one or more operations over the specified axis.

.. versionadded:: 0.20.0

Parameters
----------
func : function, str, list or dict
    Function to use for aggregating the data. If a function, must either
    work when passed a Series or when passed to Series.apply.

    Accepted combinations are:

    - function
    - string function name
    - list of functions and/or function names, e.g. ``[np.sum, 'mean']``
    - dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index'}
        Parameter needed for compatibility with DataFrame.
*args
    Positional arguments to pass to `func`.
**kwargs
    Keyword arguments to pass to `func`.

Returns
-------
DataFrame, Series or scalar
    if DataFrame.agg is called with a single function, returns a Series
    if DataFrame.agg is called with several functions, returns a DataFrame
    if Series.agg is called with single function, returns a scalar
    if Series.agg is called with several functions, returns a Series


See Also
--------
Series.apply : Invoke function on a Series.
Series.transform : Transform function producing a Series with like indexes.


Notes
-----
`agg` is an alias for `aggregate`. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.


Examples
--------
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64
File:      c:\anaconda3\lib\site-packages\pandas\core\series.py
Type:      method

四、pandas方法

pandas.apply

Signature:
data.apply(
    func,
    axis=0,
    broadcast=None,
    raw=False,
    reduce=None,
    result_type=None,
    args=(),
    **kwds,
)
Docstring:
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is
either the DataFrame's index (``axis=0``) or the DataFrame's columns
(``axis=1``). By default (``result_type=None``), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the `result_type` argument.

Parameters
----------
func : function
    Function to apply to each column or row.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis along which the function is applied:

    * 0 or 'index': apply function to each column.
    * 1 or 'columns': apply function to each row.
broadcast : bool, optional
    Only relevant for aggregation functions:

    * ``False`` or ``None`` : returns a Series whose length is the
      length of the index or the number of columns (based on the
      `axis` parameter)
    * ``True`` : results will be broadcast to the original shape
      of the frame, the original index and columns will be retained.

    .. deprecated:: 0.23.0
       This argument will be removed in a future version, replaced
       by result_type='broadcast'.

raw : bool, default False
    * ``False`` : passes each row or column as a Series to the
      function.
    * ``True`` : the passed function will receive ndarray objects
      instead.
      If you are just applying a NumPy reduction function this will
      achieve much better performance.
reduce : bool or None, default None
    Try to apply reduction procedures. If the DataFrame is empty,
    `apply` will use `reduce` to determine whether the result
    should be a Series or a DataFrame. If ``reduce=None`` (the
    default), `apply`'s return value will be guessed by calling
    `func` on an empty Series
    (note: while guessing, exceptions raised by `func` will be
    ignored).
    If ``reduce=True`` a Series will always be returned, and if
    ``reduce=False`` a DataFrame will always be returned.

    .. deprecated:: 0.23.0
       This argument will be removed in a future version, replaced
       by ``result_type='reduce'``.

result_type : {'expand', 'reduce', 'broadcast', None}, default None
    These only act when ``axis=1`` (columns):

    * 'expand' : list-like results will be turned into columns.
    * 'reduce' : returns a Series if possible rather than expanding
      list-like results. This is the opposite of 'expand'.
    * 'broadcast' : results will be broadcast to the original shape
      of the DataFrame, the original index and columns will be
      retained.

    The default behaviour (None) depends on the return value of the
    applied function: list-like results will be returned as a Series
    of those. However if the apply function returns a Series these
    are expanded to columns.

    .. versionadded:: 0.23.0

args : tuple
    Positional arguments to pass to `func` in addition to the
    array/series.
**kwds
    Additional keyword arguments to pass as keywords arguments to
    `func`.

Returns
-------
applied : Series or DataFrame

See Also
--------
DataFrame.applymap: For elementwise operations.
DataFrame.aggregate: Only perform aggregating type operations.
DataFrame.transform: Only perform transforming type operations.

Notes
-----
In the current implementation apply calls `func` twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if `func` has
side-effects, as they will take effect twice for the first
column/row.

Examples
--------

>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

Using a numpy universal function (in this case the same as
``np.sqrt(df)``):

>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

Using a reducing function on either axis

>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

Retuning a list-like will result in a Series

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

Passing result_type='expand' will expand list-like results
to columns of a Dataframe

>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand')
   0  1
0  1  2
1  1  2
2  1  2

Returning a Series inside the function is similar to passing
``result_type='expand'``. The resulting column names
will be the Series index.

>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
   foo  bar
0    1    2
1    1    2
2    1    2

Passing ``result_type='broadcast'`` will ensure the same shape
result, whether list-like or scalar is returned by the function,
and broadcast it along the axis. The resulting column names will
be the originals.

>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
File:      c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type:      method

pandas.applymap

Signature: data.applymap(func)
Docstring:
Apply a function to a Dataframe elementwise.

This method applies a function that accepts and returns a scalar
to every element of a DataFrame.

Parameters
----------
func : callable
    Python function, returns a single value from a single value.

Returns
-------
DataFrame
    Transformed DataFrame.

See Also
--------
DataFrame.apply : Apply a function along input axis of DataFrame.

Notes
-----
In the current implementation applymap calls `func` twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if `func` has
side-effects, as they will take effect twice for the first
column/row.

Examples
--------
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.applymap(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5

Note that a vectorized version of `func` often exists, which will
be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it's better to avoid applymap in that case.

>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
File:      c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type:      method

pandas.agg

Signature: data.agg(func, axis=0, *args, **kwargs)
Docstring:
Aggregate using one or more operations over the specified axis.

.. versionadded:: 0.20.0

Parameters
----------
func : function, str, list or dict
    Function to use for aggregating the data. If a function, must either
    work when passed a DataFrame or when passed to DataFrame.apply.

    Accepted combinations are:

    - function
    - string function name
    - list of functions and/or function names, e.g. ``[np.sum, 'mean']``
    - dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index', 1 or 'columns'}, default 0
        If 0 or 'index': apply function to each column.
        If 1 or 'columns': apply function to each row.
*args
    Positional arguments to pass to `func`.
**kwargs
    Keyword arguments to pass to `func`.

Returns
-------
DataFrame, Series or scalar
    if DataFrame.agg is called with a single function, returns a Series
    if DataFrame.agg is called with several functions, returns a DataFrame
    if Series.agg is called with single function, returns a scalar
    if Series.agg is called with several functions, returns a Series


The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
`numpy` aggregation functions (`mean`, `median`, `prod`, `sum`, `std`,
`var`), where the default is to compute the aggregation of the flattened
array, e.g., ``numpy.mean(arr_2d)`` as opposed to ``numpy.mean(arr_2d,
axis=0)``.

`agg` is an alias for `aggregate`. Use the alias.

See Also
--------
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.core.groupby.GroupBy : Perform operations over groups.
pandas.core.resample.Resampler : Perform operations over resampled bins.
pandas.core.window.Rolling : Perform operations over rolling window.
pandas.core.window.Expanding : Perform operations over expanding window.
pandas.core.window.EWM : Perform operation over exponential weighted
    window.


Notes
-----
`agg` is an alias for `aggregate`. Use the alias.

A passed user-defined-function will be passed a Series for evaluation.


Examples
--------
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
max   NaN  8.0
min   1.0  2.0
sum  12.0  NaN

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
File:      c:\anaconda3\lib\site-packages\pandas\core\frame.py
Type:      method

你可能感兴趣的:(python,过拟合,xss,dos,办公软件)