Table of Contents
- Objective : Pandas for Data Wrangling
- Introduction to Data Wrangling & Pandas
- Series & DataFrames
- Objective : Loading Data into DataFrames
- Sources from which dataframes can be created
- Reading CSV
- Loading from JSON
- Loading Excel
- Creating & Loading Pickled Data
- Loading from databases
- Objective : Indexing & Selecting Data
- Indexing using loc
- Indexing using iloc
- Indexing using [ ]
- Selecting with isin
- Selecting data using where method
- Selecting Data using Query
- Set/Reset Index
- Selecting columns by Type
- Accessing Multi-Index Data
- Objective : Working on TimeSeries Data
- Objective : Combining DataFrames
- Concatenate
- Append
- Merge
- Join
- TimeSeries Friendly Operations
- Objective : Shaping & Structuring
- Pivoting
- Pivot Table
- Stacking & Unstacking
- Melting
- GroupBy
- Cross Tabulations
- Tiling
Objective : Pandas for Data Wrangling
- Introduction to Data Wrangling & Pandas
- Series & DataFrames
- Loading data into DataFrames
- Indexing and selecting data
- Working on TimeSeries Data
- Merge, join, and concatenate dataframes
- Reshaping and pivot tables
- Working with text data
- Working with missing data
- Styling Pandas Table
- Computational tools
- Group By: split-apply-combine
- Options and settings
- Enhancing performance
Introduction to Data Wrangling & Pandas
Data Wrangling
- Getting & Reading data from different sources.
- Cleaning Data
- Shaping & Structuring Data
- Storing Data
There are many tools and libraries available for data wrangling, for example tools such as RapidMiner and libraries such as pandas. Organizations often find libraries more suitable because of their flexibility.
Pandas
- High-performance, easy-to-use open source library for data analysis
- Creates a tabular format of data from different sources like CSV, JSON and databases
- Has utilities for descriptive statistics, aggregation and handling missing data
- Database utilities like merge and join are available
- A fast, programmable and easy alternative to spreadsheets
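A quick taste of a few of these utilities on a tiny, made-up DataFrame (the column names and values below are invented purely for illustration):
import pandas as pd
import numpy as np

# Small, made-up dataset with one missing value
scores = pd.DataFrame({'student': ['a', 'b', 'c', 'd'],
                       'group': ['x', 'x', 'y', 'y'],
                       'score': [80, 90, np.nan, 70]})

print(scores.describe())                            # descriptive statistics
print(scores.groupby('group')['score'].mean())      # aggregation
print(scores.fillna(scores['score'].mean()))        # handling missing data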
Series & DataFrames
import pandas as pd
import numpy as np
pd.__version__
'0.25.1'
Series
- The Series data structure represents a single column.
- Each column has a data type.
- Combining multiple columns creates a table (i.e. a DataFrame).
ser1 = pd.Series(data=[1,2,3,4,5], index=list('abcde'))
ser1
a 1
b 2
c 3
d 4
e 5
dtype: int64
ser1.dtype
dtype('int64')
ser2 = pd.Series(data=[11,22,33,44,55], index=list('abcde'))
ser2
a 11
b 22
c 33
d 44
e 55
dtype: int64
DataFrame
- A DataFrame is a tabular representation of data.
- Combine multiple Series to create a DataFrame.
- Data corresponding to the same index belongs to the same row.
df = pd.DataFrame({'A':ser1, 'B':ser2})
df
|   | A | B |
|---|---|---|
| a | 1 | 11 |
| b | 2 | 22 |
| c | 3 | 33 |
| d | 4 | 44 |
| e | 5 | 55 |
df = pd.DataFrame(data=np.random.randint(1,10,size=(10,10)),
index=list('ABCDEFGHIJ'),
columns=list('abcdefghij'))
df
|   | a | b | c | d | e | f | g | h | i | j |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 3 | 3 | 4 | 1 | 1 | 9 | 3 | 1 | 2 | 3 |
| B | 7 | 6 | 4 | 6 | 5 | 2 | 7 | 3 | 8 | 6 |
| C | 1 | 5 | 5 | 1 | 6 | 5 | 3 | 8 | 9 | 6 |
| D | 4 | 7 | 7 | 2 | 9 | 7 | 5 | 8 | 5 | 1 |
| E | 1 | 4 | 5 | 9 | 8 | 4 | 7 | 2 | 6 | 9 |
| F | 6 | 2 | 8 | 3 | 7 | 5 | 2 | 3 | 5 | 8 |
| G | 6 | 8 | 4 | 9 | 3 | 2 | 1 | 3 | 7 | 9 |
| H | 9 | 5 | 3 | 1 | 7 | 8 | 2 | 6 | 6 | 3 |
| I | 9 | 8 | 6 | 9 | 4 | 6 | 3 | 3 | 7 | 6 |
| J | 4 | 8 | 1 | 4 | 3 | 3 | 9 | 9 | 7 | 8 |
Objective : Loading Data into DataFrames
- Sources from which dataframes can be created
- Loading from CSV
- Loading from JSON - Structured & Unstructured
- Loading from Excel
- Creating pickled data & Loading from Pickled Data
- Loading from Database
Sources from which dataframes can be created
- Pandas can read data from many different sources; the main ones are covered below.
- It also includes utilities for writing data back to these sources, as sketched in the example that follows.
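Most pd.read_* readers have a matching DataFrame.to_* writer. A minimal round-trip sketch (the file names here are hypothetical):
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_csv('example.csv', index=False)            # write CSV
df.to_json('example.json', orient='records')     # write JSON
same_df = pd.read_csv('example.csv')             # read it back
print(same_df)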
Reading CSV
import pandas as pd
rental_data = pd.read_csv('../Data/house_rental_data.csv.txt')
rental_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645 entries, 0 to 644
Data columns (total 8 columns):
Unnamed: 0 645 non-null int64
Sqft 645 non-null float64
Floor 645 non-null int64
TotalFloor 645 non-null int64
Bedroom 645 non-null int64
Living.Room 645 non-null int64
Bathroom 645 non-null int64
Price 645 non-null int64
dtypes: float64(1), int64(7)
memory usage: 40.4 KB
rental_data.head()
|   | Unnamed: 0 | Sqft | Floor | TotalFloor | Bedroom | Living.Room | Bathroom | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1177.698 | 2 | 7 | 2 | 2 | 2 | 62000 |
| 1 | 2 | 2134.800 | 5 | 7 | 4 | 2 | 2 | 78000 |
| 2 | 3 | 1138.560 | 5 | 7 | 2 | 2 | 1 | 58000 |
| 3 | 4 | 1458.780 | 2 | 7 | 3 | 2 | 2 | 45000 |
| 4 | 5 | 967.776 | 11 | 14 | 3 | 2 | 2 | 45000 |
rental_data = pd.read_csv('../Data/house_rental_data.csv.txt', index_col = 'Unnamed: 0')
rental_data.head()
|   | Sqft | Floor | TotalFloor | Bedroom | Living.Room | Bathroom | Price |
|---|---|---|---|---|---|---|---|
| 1 | 1177.698 | 2 | 7 | 2 | 2 | 2 | 62000 |
| 2 | 2134.800 | 5 | 7 | 4 | 2 | 2 | 78000 |
| 3 | 1138.560 | 5 | 7 | 2 | 2 | 1 | 58000 |
| 4 | 1458.780 | 2 | 7 | 3 | 2 | 2 | 45000 |
| 5 | 967.776 | 11 | 14 | 3 | 2 | 2 | 45000 |
rental_data = pd.read_csv('../Data/house_rental_data.csv.txt', usecols=lambda c: c.startswith('B'))
rental_data.head()
|   | Bedroom | Bathroom |
|---|---|---|
| 0 | 2 | 2 |
| 1 | 4 | 2 |
| 2 | 2 | 1 |
| 3 | 3 | 2 |
| 4 | 3 | 2 |
'''
nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
'''
rental_data = pd.read_csv('../Data/house_rental_data.csv.txt', nrows=10)
rental_data.shape
(10, 8)
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
Read a comma-separated values (csv) file into DataFrame.
Also supports optionally iterating or breaking of the file
into chunks.
Additional help can be found in the online docs for
`IO Tools `_.
Parameters
----------
filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid
URL schemes include http, ftp, s3, and file. For file URLs, a host is
expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any ``os.PathLike``.
By file-like object, we refer to objects with a ``read()`` method, such as
a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect
the separator, but the Python parsing engine can, meaning the latter will
be used and automatically detect the separator by Python's builtin sniffer
tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
different from ``'\s+'`` will be interpreted as regular expressions and
will also force the use of the Python parsing engine. Note that regex
delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
delimiter : str, default ``None``
Alias for sep.
header : int, list of int, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so ``header=0`` denotes the first line of
data rather than the first line of the file.
names : array-like, optional
List of column names to use. If file contains no header row, then you
should explicitly pass ``header=None``. Duplicates in this list are not
allowed.
index_col : int, str, sequence of int / str, or False, default ``None``
Column(s) to use as the row labels of the ``DataFrame``, either given as
string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
Note: ``index_col=False`` can be used to force pandas to *not* use the first
column as the index, e.g. when you have a malformed file with delimiters at
the end of each line.
usecols : list-like or callable, optional
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid list-like
`usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
To instantiate a DataFrame from ``data`` with element order preserved use
``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
in ``['foo', 'bar']`` order or
``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
for ``['bar', 'foo']`` order.
If callable, the callable function will be evaluated against the column
names, returning names where the callable function evaluates to True. An
example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
parsing time and lower memory usage.
squeeze : bool, default False
If the parsed data only contains one column then return a Series.
prefix : str, optional
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : bool, default True
Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
'X'...'X'. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
'c': 'Int64'}
Use `str` or `object` together with suitable `na_values` settings
to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD
of dtype conversion.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
true_values : list, optional
Values to consider as True.
false_values : list, optional
Values to consider as False.
skipinitialspace : bool, default False
Skip spaces after delimiter.
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file.
If callable, the callable function will be evaluated against the row
indices, returning True if the row should be skipped and False otherwise.
An example of a valid callable argument would be ``lambda x: x in [0, 2]``.
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c').
nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
'1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
'null'.
keep_default_na : bool, default True
Whether or not to include the default NaN values when parsing the data.
Depending on whether `na_values` is passed in, the behavior is as follows:
* If `keep_default_na` is True, and `na_values` are specified, `na_values`
is appended to the default NaN values used for parsing.
* If `keep_default_na` is True, and `na_values` are not specified, only
the default NaN values are used for parsing.
* If `keep_default_na` is False, and `na_values` are specified, only
the NaN values specified `na_values` are used for parsing.
* If `keep_default_na` is False, and `na_values` are not specified, no
strings will be parsed as NaN.
Note that if `na_filter` is passed in as False, the `keep_default_na` and
`na_values` parameters will be ignored.
na_filter : bool, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file.
verbose : bool, default False
Indicate number of NA values placed in non-numeric columns.
skip_blank_lines : bool, default True
If True, skip over blank lines rather than interpreting as NaN values.
parse_dates : bool or list of int or names or list of lists or dict, default False
The behavior is as follows:
* boolean. If True -> try parsing the index.
* list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call
result 'foo'
If a column or index cannot be represented as an array of datetimes,
say because of an unparseable value or a mixture of timezones, the column
or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use ``pd.to_datetime`` after
``pd.read_csv``. To parse an index or column with a mixture of timezones,
specify ``date_parser`` to be a partially-applied
:func:`pandas.to_datetime` with ``utc=True``. See
:ref:`io.csv.mixed_timezones` for more.
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : bool, default False
If True and `parse_dates` is enabled, pandas will attempt to infer the
format of the datetime strings in the columns, and if it can be inferred,
switch to a faster method of parsing them. In some cases this can increase
the parsing speed by 5-10x.
keep_date_col : bool, default False
If True and `parse_dates` specifies combining multiple columns then
keep the original columns.
date_parser : function, optional
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses ``dateutil.parser.parser`` to do the
conversion. Pandas will try to call `date_parser` in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by `parse_dates` into a single array
and pass that; and 3) call `date_parser` once for each row using one or
more strings (corresponding to the columns defined by `parse_dates`) as
arguments.
dayfirst : bool, default False
DD/MM format dates, international and European format.
cache_dates : boolean, default True
If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate
date strings, especially ones with timezone offsets.
.. versionadded:: 0.25.0
iterator : bool, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
chunksize : int, optional
Return TextFileReader object for iteration.
See the `IO Tools docs
`_
for more information on ``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and
`filepath_or_buffer` is path-like, then detect compression from the
following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
decompression). If using 'zip', the ZIP file must contain only one data
file to be read in. Set to None for no decompression.
.. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
thousands : str, optional
Thousands separator.
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
lineterminator : str (length 1), optional
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : bool, default ``True``
When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
whether or not to interpret two consecutive quotechar elements INSIDE a
field as a single ``quotechar`` element.
escapechar : str (length 1), optional
One-character string used to escape other characters.
comment : str, optional
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as ``skip_blank_lines=True``),
fully commented lines are ignored by the parameter `header` but not by
`skiprows`. For example, if ``comment='#'``, parsing
``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in 'a,b,c' being
treated as the header.
encoding : str, optional
Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
standard encodings
`_ .
dialect : str or csv.Dialect, optional
If provided, this parameter will override values (default or not) for the
following parameters: `delimiter`, `doublequote`, `escapechar`,
`skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
override values, a ParserWarning will be issued. See csv.Dialect
documentation for more details.
error_bad_lines : bool, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will dropped from the DataFrame that is
returned.
warn_bad_lines : bool, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output.
delim_whitespace : bool, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``' '``) will be
used as the sep. Equivalent to setting ``sep='\s+'``. If this option
is set to True, nothing should be passed in for the ``delimiter``
parameter.
.. versionadded:: 0.18.1 support for the Python parser.
low_memory : bool, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless,
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser).
memory_map : bool, default False
If a filepath is provided for `filepath_or_buffer`, map the file object
directly onto memory and access the data directly from there. Using this
option can improve performance because there is no longer any I/O overhead.
float_precision : str, optional
Specifies which converter the C engine should use for floating-point
values. The options are `None` for the ordinary converter,
`high` for the high-precision converter, and `round_trip` for the
round-trip converter.
Returns
-------
DataFrame or TextParser
A comma-separated values (csv) file is returned as two-dimensional
data structure with labeled axes.
See Also
--------
to_csv : Write DataFrame to a comma-separated values (csv) file.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_fwf : Read a table of fixed-width formatted lines into DataFrame.
Examples
--------
>>> pd.read_csv('data.csv') # doctest: +SKIP
'''
chunksize : int, optional
Return TextFileReader object for iteration.
See the IO Tools docs
http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking
for more information on ``iterator`` and ``chunksize``.
'''
rental_data_itr = pd.read_csv('../Data/house_rental_data.csv.txt', chunksize=300)
for data in rental_data_itr:
    print(data.count())
Unnamed: 0 300
Sqft 300
Floor 300
TotalFloor 300
Bedroom 300
Living.Room 300
Bathroom 300
Price 300
dtype: int64
Unnamed: 0 300
Sqft 300
Floor 300
TotalFloor 300
Bedroom 300
Living.Room 300
Bathroom 300
Price 300
dtype: int64
Unnamed: 0 45
Sqft 45
Floor 45
TotalFloor 45
Bedroom 45
Living.Room 45
Bathroom 45
Price 45
dtype: int64
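Instead of looping over every chunk, iterator=True returns a TextFileReader from which pieces can be pulled on demand; a small sketch, assuming the same file as above:
reader = pd.read_csv('../Data/house_rental_data.csv.txt', iterator=True)
first_100 = reader.get_chunk(100)    # first 100 rows
next_50 = reader.get_chunk(50)       # the following 50 rows
print(first_100.shape, next_50.shape)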
titanic_data = pd.read_csv('../Data/titanic-train.csv.txt', index_col = 'PassengerId', na_values={'Ticket':'PC 17599'})
titanic_data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | NaN | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.read_csv('../Data/sales-data.csv').head()
|   | Month | Sales |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
| 2 | 1-03 | 183.1 |
| 3 | 1-04 | 119.3 |
| 4 | 1-05 | 180.3 |
from datetime import datetime
def parser(x):
    return datetime.strptime('200'+x, '%Y-%m')
data = pd.read_csv('../Data/sales-data.csv', header=0, parse_dates=[0], index_col=0, date_parser=parser)
data.head()
| Month | Sales |
|---|---|
| 2001-01-01 | 266.0 |
| 2001-02-01 | 145.9 |
| 2001-03-01 | 183.1 |
| 2001-04-01 | 119.3 |
| 2001-05-01 | 180.3 |
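The same result can also be obtained without a custom date_parser: read the file as plain strings and convert afterwards with pd.to_datetime. A sketch, assuming the same file:
sales = pd.read_csv('../Data/sales-data.csv')
# '200' + '1-01' -> '2001-01', then parse with an explicit format
sales['Month'] = pd.to_datetime('200' + sales['Month'], format='%Y-%m')
sales = sales.set_index('Month')
print(sales.head())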
Loading from JSON
pd.read_json('https://raw.githubusercontent.com/corysimmons/colors.json/master/colors.json', orient='records').T.head()
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| aliceblue | 240 | 248 | 255 | 1 |
| antiquewhite | 250 | 235 | 215 | 1 |
| aqua | 0 | 255 | 255 | 1 |
| aquamarine | 127 | 255 | 212 | 1 |
| azure | 240 | 255 | 255 | 1 |
pd.set_option('display.max_colwidth', -1)
pd.read_json('../Data/raw_nyc_phil.json').head(1)
|
programs |
0 |
{'season': '1842-43', 'orchestra': 'New York Philharmonic', 'concerts': [{'Date': '1842-12-07T05:00:00Z', 'eventType': 'Subscription Season', 'Venue': 'Apollo Rooms', 'Location': 'Manhattan, NY', 'Time': '8:00PM'}], 'programID': '3853', 'works': [{'workTitle': 'SYMPHONY NO. 5 IN C MINOR, OP.67', 'conductorName': 'Hill, Ureli Corelli', 'ID': '52446*', 'soloists': [], 'composerName': 'Beethoven, Ludwig van'}, {'workTitle': 'OBERON', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Timm, Henry C.', 'ID': '8834*4', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': '"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II'}, {'workTitle': 'QUINTET, PIANO, D MINOR, OP. 74', 'ID': '3642*', 'soloists': [{'soloistName': 'Scharfenberg, William', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}, {'soloistName': 'Hill, Ureli Corelli', 'soloistRoles': 'A', 'soloistInstrument': 'Violin'}, {'soloistName': 'Derwort, G. H.', 'soloistRoles': 'A', 'soloistInstrument': 'Viola'}, {'soloistName': 'Boucher, Alfred', 'soloistRoles': 'A', 'soloistInstrument': 'Cello'}, {'soloistName': 'Rosier, F. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Contrabass'}], 'composerName': 'Hummel, Johann'}, {'interval': 'Intermission', 'ID': '0*', 'soloists': []}, {'workTitle': 'OBERON', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Etienne, Denis G.', 'ID': '8834*3', 'soloists': [], 'movement': 'Overture'}, {'workTitle': 'ARMIDA', 'composerName': 'Rossini, Gioachino', 'conductorName': 'Timm, Henry C.', 'ID': '8835*1', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}, {'soloistName': 'Horn, Charles Edward', 'soloistRoles': 'S', 'soloistInstrument': 'Tenor'}], 'movement': 'Duet'}, {'workTitle': 'FIDELIO, OP. 72', 'composerName': 'Beethoven, Ludwig van', 'conductorName': 'Timm, Henry C.', 'ID': '8837*6', 'soloists': [{'soloistName': 'Horn, Charles Edward', 'soloistRoles': 'S', 'soloistInstrument': 'Tenor'}], 'movement': '"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)'}, {'workTitle': 'ABDUCTION FROM THE SERAGLIO,THE, K.384', 'composerName': 'Mozart, Wolfgang Amadeus', 'conductorName': 'Timm, Henry C.', 'ID': '8336*4', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': '"Ach Ich liebte," Konstanze (aria)'}, {'workTitle': 'OVERTURE NO. 1, D MINOR, OP. 38', 'conductorName': 'Timm, Henry C.', 'ID': '5543*', 'soloists': [], 'composerName': 'Kalliwoda, Johann W.'}], 'id': '38e072a7-8fc9-4f9a-8eac-3957905c0002'} |
import json
with open('../Data/raw_nyc_phil.json') as f:
    d = json.load(f)
nycphil = pd.io.json.json_normalize(d['programs'])
nycphil.head(3)
|
season |
orchestra |
concerts |
programID |
works |
id |
0 |
1842-43 |
New York Philharmonic |
[{'Date': '1842-12-07T05:00:00Z', 'eventType': 'Subscription Season', 'Venue': 'Apollo Rooms', 'Location': 'Manhattan, NY', 'Time': '8:00PM'}] |
3853 |
[{'workTitle': 'SYMPHONY NO. 5 IN C MINOR, OP.67', 'conductorName': 'Hill, Ureli Corelli', 'ID': '52446*', 'soloists': [], 'composerName': 'Beethoven, Ludwig van'}, {'workTitle': 'OBERON', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Timm, Henry C.', 'ID': '8834*4', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': '"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II'}, {'workTitle': 'QUINTET, PIANO, D MINOR, OP. 74', 'ID': '3642*', 'soloists': [{'soloistName': 'Scharfenberg, William', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}, {'soloistName': 'Hill, Ureli Corelli', 'soloistRoles': 'A', 'soloistInstrument': 'Violin'}, {'soloistName': 'Derwort, G. H.', 'soloistRoles': 'A', 'soloistInstrument': 'Viola'}, {'soloistName': 'Boucher, Alfred', 'soloistRoles': 'A', 'soloistInstrument': 'Cello'}, {'soloistName': 'Rosier, F. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Contrabass'}], 'composerName': 'Hummel, Johann'}, {'interval': 'Intermission', 'ID': '0*', 'soloists': []}, {'workTitle': 'OBERON', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Etienne, Denis G.', 'ID': '8834*3', 'soloists': [], 'movement': 'Overture'}, {'workTitle': 'ARMIDA', 'composerName': 'Rossini, Gioachino', 'conductorName': 'Timm, Henry C.', 'ID': '8835*1', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}, {'soloistName': 'Horn, Charles Edward', 'soloistRoles': 'S', 'soloistInstrument': 'Tenor'}], 'movement': 'Duet'}, {'workTitle': 'FIDELIO, OP. 72', 'composerName': 'Beethoven, Ludwig van', 'conductorName': 'Timm, Henry C.', 'ID': '8837*6', 'soloists': [{'soloistName': 'Horn, Charles Edward', 'soloistRoles': 'S', 'soloistInstrument': 'Tenor'}], 'movement': '"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)'}, {'workTitle': 'ABDUCTION FROM THE SERAGLIO,THE, K.384', 'composerName': 'Mozart, Wolfgang Amadeus', 'conductorName': 'Timm, Henry C.', 'ID': '8336*4', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': '"Ach Ich liebte," Konstanze (aria)'}, {'workTitle': 'OVERTURE NO. 1, D MINOR, OP. 38', 'conductorName': 'Timm, Henry C.', 'ID': '5543*', 'soloists': [], 'composerName': 'Kalliwoda, Johann W.'}] |
38e072a7-8fc9-4f9a-8eac-3957905c0002 |
1 |
1842-43 |
New York Philharmonic |
[{'Date': '1843-02-18T05:00:00Z', 'eventType': 'Subscription Season', 'Venue': 'Apollo Rooms', 'Location': 'Manhattan, NY', 'Time': '8:00PM'}] |
5178 |
[{'workTitle': 'SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)', 'conductorName': 'Hill, Ureli Corelli', 'ID': '52437*', 'soloists': [], 'composerName': 'Beethoven, Ludwig van'}, {'workTitle': 'I PURITANI', 'composerName': 'Bellini, Vincenzo', 'conductorName': 'Hill, Ureli Corelli', 'ID': '8838*2', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': 'Elvira (aria): "Qui la voce...Vien, diletto"'}, {'workTitle': 'CELEBRATED ELEGIE', 'conductorName': 'Hill, Ureli Corelli', 'ID': '3659*', 'soloists': [{'soloistName': 'Boucher, Alfred', 'soloistRoles': 'S', 'soloistInstrument': 'Cello'}], 'composerName': 'Romberg, Bernhard'}, {'interval': 'Intermission', 'ID': '0*', 'soloists': []}, {'workTitle': 'WILLIAM TELL', 'composerName': 'Rossini, Gioachino', 'conductorName': 'Alpers, William', 'ID': '8839*2', 'soloists': [], 'movement': 'Overture'}, {'workTitle': 'STABAT MATER', 'composerName': 'Rossini, Gioachino', 'conductorName': 'Alpers, William', 'ID': '53076*2', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}], 'movement': 'Inflammatus et Accensus (Aria with Chorus)'}, {'workTitle': 'CONCERTO, PIANO, A-FLAT MAJOR, OP. 113', 'composerName': 'Hummel, Johann', 'conductorName': 'Alpers, William', 'ID': '51568*2', 'soloists': [{'soloistName': 'Timm, Henry C.', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}], 'movement': 'Romanza: Larghetto con moto'}, {'workTitle': 'CONCERTO, PIANO, A-FLAT MAJOR, OP. 113', 'composerName': 'Hummel, Johann', 'conductorName': 'Alpers, William', 'ID': '51568*3', 'soloists': [{'soloistName': 'Timm, Henry C.', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}], 'movement': 'Rondo alla spagniola: Allegro moderato'}, {'workTitle': 'FREISCHUTZ, DER', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Alpers, William', 'ID': '6709*16', 'soloists': [], 'movement': 'Overture'}] |
c7b2b95c-5e0b-431c-a340-5b37fc860b34 |
2 |
1842-43 |
Musicians from the New York Philharmonic |
[{'Date': '1843-04-07T05:00:00Z', 'eventType': 'Special', 'Venue': 'Apollo Rooms', 'Location': 'Manhattan, NY', 'Time': '8:00PM'}] |
10785 |
[{'workTitle': 'EGMONT, OP.84', 'composerName': 'Beethoven, Ludwig van', 'conductorName': 'Hill, Ureli Corelli', 'ID': '52364*1', 'soloists': [], 'movement': 'Overture'}, {'workTitle': 'OBERON', 'composerName': 'Weber, Carl Maria Von', 'conductorName': 'Not conducted', 'ID': '8834*4', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}, {'soloistName': 'Timm, Henry C.', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}], 'movement': '"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II'}, {'workTitle': 'CONCERTO, PIANO, A MINOR, OP. 85', 'conductorName': 'Hill, Ureli Corelli', 'ID': '4567*', 'soloists': [{'soloistName': 'Scharfenberg, William', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}], 'composerName': 'Hummel, Johann'}, {'workTitle': 'O HAPPY HAPPY HOUR', 'conductorName': 'Not conducted', 'ID': '5150*', 'soloists': [{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}, {'soloistName': 'Timm, Henry C.', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}], 'composerName': 'Pacini, Giovanni'}, {'workTitle': 'FANTASIA ON SWEEDISH AIRS', 'conductorName': 'Not conducted', 'ID': '5161*', 'soloists': [{'soloistName': 'Boucher, Alfred', 'soloistRoles': 'S', 'soloistInstrument': 'Cello'}], 'composerName': 'Romberg, Bernhard'}, {'workTitle': 'SEXTET IN E FLAT MAJOR, OP. 30', 'composerName': 'Onslow, George', 'conductorName': 'Not conducted', 'ID': '5162*2', 'soloists': [{'soloistName': 'Scharfenberg, William', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}, {'soloistName': 'Lehman', 'soloistRoles': 'A', 'soloistInstrument': 'Flute'}, {'soloistName': 'Groneveldt, Theodore W.', 'soloistRoles': 'A', 'soloistInstrument': 'Clarinet'}, {'soloistName': 'Hegelund, H. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Bassoon'}, {'soloistName': 'Woehning, F. C.', 'soloistRoles': 'A', 'soloistInstrument': 'French Horn'}, {'soloistName': 'Rosier, F. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Contrabass'}], 'movement': 'Andante con variazioni'}, {'workTitle': 'SEXTET IN E FLAT MAJOR, OP. 30', 'composerName': 'Onslow, George', 'conductorName': 'Not conducted', 'ID': '5162*3', 'soloists': [{'soloistName': '', 'soloistRoles': '', 'soloistInstrument': ''}, {'soloistName': 'Scharfenberg, William', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}, {'soloistName': 'Lehman', 'soloistRoles': 'A', 'soloistInstrument': 'Flute'}, {'soloistName': 'Groneveldt, Theodore W.', 'soloistRoles': 'A', 'soloistInstrument': 'Clarinet'}, {'soloistName': 'Hegelund, H. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Bassoon'}, {'soloistName': 'Woehning, F. C.', 'soloistRoles': 'A', 'soloistInstrument': 'French Horn'}, {'soloistName': 'Rosier, F. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Contrabass'}], 'movement': 'Minuetto'}, {'workTitle': 'WILLIAM TELL', 'composerName': 'Rossini, Gioachino', 'conductorName': 'Alpers, William', 'ID': '8839*2', 'soloists': [], 'movement': 'Overture'}, {'workTitle': 'FANTASIA AND VARIATIONS ON THEMES FROM NORMA, OP. 
12 (FOUR HANDS)', 'conductorName': 'Not conducted', 'ID': '5166*', 'soloists': [{'soloistName': '', 'soloistRoles': '', 'soloistInstrument': ''}, {'soloistName': 'Rakeman, Frederick', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}, {'soloistName': 'Scharfenberg, William', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}], 'composerName': 'Thalberg, Sigismond'}, {'workTitle': 'MAGIC FLUTE, THE, K.620', 'composerName': 'Mozart, Wolfgang Amadeus', 'conductorName': 'Not conducted', 'ID': '8955*13', 'soloists': [{'soloistName': '', 'soloistRoles': '', 'soloistInstrument': ''}, {'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}, {'soloistName': 'Timm, Henry C.', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}], 'movement': 'Aria (unspecified)'}, {'workTitle': 'INTRODUCTION AND VARIATIONS ON THE ROMANCE OF JOSEPH, OP. 20', 'conductorName': 'Alpers, William', 'ID': '5172*', 'soloists': [{'soloistName': '', 'soloistRoles': '', 'soloistInstrument': ''}, {'soloistName': 'Scharfenberg, William', 'soloistRoles': 'S', 'soloistInstrument': 'Piano'}], 'composerName': 'Herz, Henri'}, {'workTitle': 'QUINTET FOR WINDS AND ORCHESTRA', 'conductorName': 'Not conducted', 'ID': '5174*', 'soloists': [{'soloistName': 'Lehman', 'soloistRoles': 'A', 'soloistInstrument': 'Flute'}, {'soloistName': 'Wiese, Frederick', 'soloistRoles': 'A', 'soloistInstrument': 'Oboe'}, {'soloistName': 'Groneveldt, Theodore W.', 'soloistRoles': 'A', 'soloistInstrument': 'Clarinet'}, {'soloistName': 'Hegelund, H. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Bassoon'}, {'soloistName': 'Woehning, F. C.', 'soloistRoles': 'A', 'soloistInstrument': 'French Horn'}], 'composerName': 'Lindpaintner, Peter Von'}] |
894e1a52-1ae5-4fa7-aec0-b99997555a37 |
works_data = pd.io.json.json_normalize(data=d['programs'], record_path='works',
meta=['id', 'orchestra','programID', 'season'])
works_data.head(3)
|
workTitle |
conductorName |
ID |
soloists |
composerName |
movement |
interval |
movement.em |
movement._ |
workTitle.em |
workTitle._ |
id |
orchestra |
programID |
season |
0 |
SYMPHONY NO. 5 IN C MINOR, OP.67 |
Hill, Ureli Corelli |
52446* |
[] |
Beethoven, Ludwig van |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
38e072a7-8fc9-4f9a-8eac-3957905c0002 |
New York Philharmonic |
3853 |
1842-43 |
1 |
OBERON |
Timm, Henry C. |
8834*4 |
[{'soloistName': 'Otto, Antoinette', 'soloistRoles': 'S', 'soloistInstrument': 'Soprano'}] |
Weber, Carl Maria Von |
"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II |
NaN |
NaN |
NaN |
NaN |
NaN |
38e072a7-8fc9-4f9a-8eac-3957905c0002 |
New York Philharmonic |
3853 |
1842-43 |
2 |
QUINTET, PIANO, D MINOR, OP. 74 |
NaN |
3642* |
[{'soloistName': 'Scharfenberg, William', 'soloistRoles': 'A', 'soloistInstrument': 'Piano'}, {'soloistName': 'Hill, Ureli Corelli', 'soloistRoles': 'A', 'soloistInstrument': 'Violin'}, {'soloistName': 'Derwort, G. H.', 'soloistRoles': 'A', 'soloistInstrument': 'Viola'}, {'soloistName': 'Boucher, Alfred', 'soloistRoles': 'A', 'soloistInstrument': 'Cello'}, {'soloistName': 'Rosier, F. W.', 'soloistRoles': 'A', 'soloistInstrument': 'Contrabass'}] |
Hummel, Johann |
NaN |
NaN |
NaN |
NaN |
NaN |
NaN |
38e072a7-8fc9-4f9a-8eac-3957905c0002 |
New York Philharmonic |
3853 |
1842-43 |
works_data = pd.io.json.json_normalize(data=d['programs'], record_path='concerts',
meta=['id', 'orchestra','programID', 'season'])
works_data.head(3)
|   | Date | eventType | Venue | Location | Time | id | orchestra | programID | season |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1842-12-07T05:00:00Z | Subscription Season | Apollo Rooms | Manhattan, NY | 8:00PM | 38e072a7-8fc9-4f9a-8eac-3957905c0002 | New York Philharmonic | 3853 | 1842-43 |
| 1 | 1843-02-18T05:00:00Z | Subscription Season | Apollo Rooms | Manhattan, NY | 8:00PM | c7b2b95c-5e0b-431c-a340-5b37fc860b34 | New York Philharmonic | 5178 | 1842-43 |
| 2 | 1843-04-07T05:00:00Z | Special | Apollo Rooms | Manhattan, NY | 8:00PM | 894e1a52-1ae5-4fa7-aec0-b99997555a37 | Musicians from the New York Philharmonic | 10785 | 1842-43 |
soloist_data = pd.io.json.json_normalize(data=d['programs'], record_path=['works', 'soloists'],
meta=['id'])
soloist_data.head(3)
|   | soloistName | soloistRoles | soloistInstrument | id |
|---|---|---|---|---|
| 0 | Otto, Antoinette | S | Soprano | 38e072a7-8fc9-4f9a-8eac-3957905c0002 |
| 1 | Scharfenberg, William | A | Piano | 38e072a7-8fc9-4f9a-8eac-3957905c0002 |
| 2 | Hill, Ureli Corelli | A | Violin | 38e072a7-8fc9-4f9a-8eac-3957905c0002 |
Loading Excel
sales_data = pd.read_excel('../Data/sales-funnel.xlsx')
sales_data.head()
|   | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
| 1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
| 2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
| 3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
| 4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
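read_excel accepts many of the same options as read_csv, for example selecting a sheet and a subset of columns while reading. A small sketch (the sheet index and column subset below are illustrative assumptions):
funnel = pd.read_excel('../Data/sales-funnel.xlsx',
                       sheet_name=0,                          # first sheet
                       usecols=['Account', 'Name', 'Price'])  # only these columns
print(funnel.head())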
Creating & Loading Pickled Data
- Python pickle module is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk.
- What pickle does is that it “serializes” the object first before writing it to file. Pickling is a way to convert a python object (list, dict, etc.) into a character stream.
- The idea is that this character stream contains all the information necessary to reconstruct the object in another python script.
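Since the points above describe the standard-library pickle module, here is a minimal sketch of pickling an arbitrary Python object directly (the file name and object are made up); pandas' to_pickle/read_pickle, used next, wrap the same mechanism for DataFrames:
import pickle

record = {'name': 'example', 'values': [1, 2, 3]}

# Serialize the object to disk ...
with open('record.pkl', 'wb') as f:
    pickle.dump(record, f)

# ... and reconstruct it later, possibly from another script
with open('record.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored)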
sales_data.to_pickle('sales.pkl')
pd.read_pickle('sales.pkl')
|   | Account | Name | Rep | Manager | Product | Quantity | Price | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
| 1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
| 2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
| 3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
| 4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
| 5 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 | pending |
| 6 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 | presented |
| 7 | 412290 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 | pending |
| 8 | 740150 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 | declined |
| 9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
| 10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
| 11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
| 12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
| 13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
| 14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |
| 15 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | CPU | 2 | 65000 | declined |
| 16 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | Monitor | 2 | 5000 | presented |
Loading from databases
- No matter what your database is, all you need to do is create a connection object.
import sqlite3
- Create connection object for sqlite3
conn = sqlite3.connect('../Data/chinook.db')
albums = pd.read_sql_query("select * from albums", conn, index_col='AlbumId')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 albums = pd.read_sql_query("select * from albums", conn, index_col='AlbumId')
NameError: name 'conn' is not defined
The NameError above simply means this cell was executed before the connection cell; once conn has been created, re-running the same query succeeds and produces the output below.
albums.head()
| AlbumId | Title | ArtistId |
|---|---|---|
| 1 | For Those About To Rock We Salute You | 1 |
| 2 | Balls to the Wall | 2 |
| 3 | Restless and Wild | 2 |
| 4 | Let There Be Rock | 1 |
| 5 | Big Ones | 3 |
import MySQLdb
mysql_cn= MySQLdb.connect(host='myhost',
port=3306,user='myusername', passwd='mypassword',
db='information_schema')
df_mysql = pd.read_sql('select * from VIEWS;', con=mysql_cn)
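The same pattern works with any DB-API connection or SQLAlchemy engine; a sketch (the connection string and table name below are hypothetical):
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')
orders = pd.read_sql('select * from orders', con=engine)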
Objective : Indexing & Selecting Data
- Indexing using loc
- Indexing using iloc
- Accessing with [ ]
- Selecting data using isin
- Selecting data using where
- Selecting data using query
- In & not in Operator
- set & reset index
- Selecting columns by type
- Accessing multiIndex data
import pandas as pd
Indexing using loc
- loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
df = pd.read_csv('../Data/titanic-train.csv.txt', index_col = 'PassengerId')
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
df.loc[2]
Survived 1
Pclass 1
Name Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex female
Age 38
SibSp 1
Parch 0
Ticket PC 17599
Fare 71.2833
Cabin C85
Embarked C
Name: 2, dtype: object
- A list or array of labels [‘a’, ‘b’, ‘c’].
df.loc[[2,3,4]]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
- A slice object with labels ‘a’:‘f’ (Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)
df.loc[2:5]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- A boolean array of the same length as the index being sliced.
a = pd.Series(False, df.index)
a[2] = True
a[3] = True
a
PassengerId
1 False
2 True
3 True
4 False
5 False
...
887 False
888 False
889 False
890 False
891 False
Length: 891, dtype: bool
a[:5]
PassengerId
1 False
2 True
3 True
4 False
5 False
dtype: bool
df.loc[a]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
def func(e):
    return e.Sex == 'female'
df.loc[func].head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
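loc also accepts a column selector after the row selector; a small sketch on the same DataFrame:
df.loc[2:4, ['Name', 'Age']]   # row labels 2-4 (inclusive) and two columns
df.loc[2, 'Name']              # a single cell by row label and column label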
Indexing using iloc
- iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.iloc[2]
Survived 1
Pclass 3
Name Heikkinen, Miss. Laina
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 3, dtype: object
- A list or array of integers
df.iloc[[1,2,3]]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
df.iloc[1:7]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
- A boolean array. iloc works on positions, so a boolean Series must be passed as a plain array (e.g. via .values) rather than with its index.
a[:5]
PassengerId
1 False
2 True
3 True
4 False
5 False
dtype: bool
df.iloc[a.values]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
def func(e):
    res = e.Sex == 'female'
    return res.values
df.iloc[func].head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
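Like loc, iloc accepts a second, purely positional column selector; a small sketch on the same DataFrame:
df.iloc[0:3, 0:2]   # first three rows and first two columns, by position
df.iloc[2, 2]       # a single cell: third row, third column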
Indexing using [ ]
- Accessing elements of a Series and columns of a DataFrame
df['Name'][:5]
PassengerId
1 Braund, Mr. Owen Harris
2 Cumings, Mrs. John Bradley (Florence Briggs Th...
3 Heikkinen, Miss. Laina
4 Futrelle, Mrs. Jacques Heath (Lily May Peel)
5 Allen, Mr. William Henry
Name: Name, dtype: object
df.Name[5]
'Allen, Mr. William Henry'
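Square brackets also accept a list of column names or a boolean mask; a small sketch on the same DataFrame:
df[['Name', 'Age']].head()   # a list of column names returns those columns
df[df.Age > 60].head()       # a boolean mask selects rows, as with loc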
Selecting with isin
- The isin() method of Series returns a boolean vector that is True wherever the Series elements exist in the passed list. This lets you select rows where one or more columns have values you want:
s = df.Age
s
PassengerId
1 22.0
2 38.0
3 26.0
4 35.0
5 35.0
...
887 27.0
888 19.0
889 NaN
890 26.0
891 32.0
Name: Age, Length: 891, dtype: float64
matches = s.isin([10,20,30])
matches
PassengerId
1 False
2 False
3 False
4 False
5 False
...
887 False
888 False
889 False
890 False
891 False
Name: Age, Length: 891, dtype: bool
df[matches]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
| 80 | 1 | 3 | Dowdell, Miss. Elizabeth | female | 30.0 | 0 | 0 | 364516 | 12.4750 | NaN | S |
| 92 | 0 | 3 | Andreasson, Mr. Paul Edvin | male | 20.0 | 0 | 0 | 347466 | 7.8542 | NaN | S |
| 114 | 0 | 3 | Jussila, Miss. Katriina | female | 20.0 | 1 | 0 | 4136 | 9.8250 | NaN | S |
| 132 | 0 | 3 | Coelho, Mr. Domingos Fernandeo | male | 20.0 | 0 | 0 | SOTON/O.Q. 3101307 | 7.0500 | NaN | S |
| 158 | 0 | 3 | Corn, Mr. Harry | male | 30.0 | 0 | 0 | SOTON/OQ 392090 | 8.0500 | NaN | S |
| 179 | 0 | 2 | Hale, Mr. Reginald | male | 30.0 | 0 | 0 | 250653 | 13.0000 | NaN | S |
| 214 | 0 | 2 | Givard, Mr. Hans Kristensen | male | 30.0 | 0 | 0 | 250646 | 13.0000 | NaN | S |
| 220 | 0 | 2 | Harris, Mr. Walter | male | 30.0 | 0 | 0 | W/C 14208 | 10.5000 | NaN | S |
| 245 | 0 | 3 | Attalah, Mr. Sleiman | male | 30.0 | 0 | 0 | 2694 | 7.2250 | NaN | C |
| 254 | 0 | 3 | Lobb, Mr. William Arthur | male | 30.0 | 1 | 0 | A/5. 3336 | 16.1000 | NaN | S |
| 258 | 1 | 1 | Cherry, Miss. Gladys | female | 30.0 | 0 | 0 | 110152 | 86.5000 | B77 | S |
| 287 | 1 | 3 | de Mulder, Mr. Theodore | male | 30.0 | 0 | 0 | 345774 | 9.5000 | NaN | S |
| 309 | 0 | 2 | Abelson, Mr. Samuel | male | 30.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
| 310 | 1 | 1 | Francatelli, Miss. Laura Mabel | female | 30.0 | 0 | 0 | PC 17485 | 56.9292 | E36 | C |
| 323 | 1 | 2 | Slayter, Miss. Hilda Mary | female | 30.0 | 0 | 0 | 234818 | 12.3500 | NaN | Q |
| 366 | 0 | 3 | Adahl, Mr. Mauritz Nils Martin | male | 30.0 | 0 | 0 | C 7076 | 7.2500 | NaN | S |
| 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.0 | 0 | 0 | 2648 | 4.0125 | NaN | C |
| 405 | 0 | 3 | Oreskovic, Miss. Marija | female | 20.0 | 0 | 0 | 315096 | 8.6625 | NaN | S |
| 419 | 0 | 2 | Matthews, Mr. William John | male | 30.0 | 0 | 0 | 28228 | 13.0000 | NaN | S |
| 420 | 0 | 3 | Van Impe, Miss. Catharina | female | 10.0 | 0 | 2 | 345773 | 24.1500 | NaN | S |
| 442 | 0 | 3 | Hampe, Mr. Leon | male | 20.0 | 0 | 0 | 345769 | 9.5000 | NaN | S |
| 453 | 0 | 1 | Foreman, Mr. Benjamin Laventall | male | 30.0 | 0 | 0 | 113051 | 27.7500 | C111 | C |
| 489 | 0 | 3 | Somerton, Mr. Francis William | male | 30.0 | 0 | 0 | A.5. 18509 | 8.0500 | NaN | S |
| 521 | 1 | 1 | Perreault, Miss. Anne | female | 30.0 | 0 | 0 | 12749 | 93.5000 | B73 | S |
| 535 | 0 | 3 | Cacic, Miss. Marija | female | 30.0 | 0 | 0 | 315084 | 8.6625 | NaN | S |
| 538 | 1 | 1 | LeRoy, Miss. Bertha | female | 30.0 | 0 | 0 | PC 17761 | 106.4250 | NaN | C |
| 607 | 0 | 3 | Karaic, Mr. Milan | male | 30.0 | 0 | 0 | 349246 | 7.8958 | NaN | S |
| 623 | 1 | 3 | Nakid, Mr. Sahid | male | 20.0 | 1 | 1 | 2653 | 15.7417 | NaN | C |
| 641 | 0 | 3 | Jensen, Mr. Hans Peder | male | 20.0 | 0 | 0 | 350050 | 7.8542 | NaN | S |
| 665 | 1 | 3 | Lindqvist, Mr. Eino William | male | 20.0 | 1 | 0 | STON/O 2. 3101285 | 7.9250 | NaN | S |
| 683 | 0 | 3 | Olsvigen, Mr. Thor Anderson | male | 20.0 | 0 | 0 | 6563 | 9.2250 | NaN | S |
| 726 | 0 | 3 | Oreskovic, Mr. Luka | male | 20.0 | 0 | 0 | 315094 | 8.6625 | NaN | S |
| 727 | 1 | 2 | Renouf, Mrs. Peter Henry (Lillian Jefferys) | female | 30.0 | 3 | 0 | 31027 | 21.0000 | NaN | S |
| 748 | 1 | 2 | Sinkkonen, Miss. Anna | female | 30.0 | 0 | 0 | 250648 | 13.0000 | NaN | S |
| 763 | 1 | 3 | Barah, Mr. Hanna Assi | male | 20.0 | 0 | 0 | 2663 | 7.2292 | NaN | C |
| 799 | 0 | 3 | Ibrahim Shawah, Mr. Yousseff | male | 30.0 | 0 | 0 | 2685 | 7.2292 | NaN | C |
| 800 | 0 | 3 | Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... | female | 30.0 | 1 | 1 | 345773 | 24.1500 | NaN | S |
| 820 | 0 | 3 | Skoog, Master. Karl Thorsten | male | 10.0 | 3 | 2 | 347088 | 27.9000 | NaN | S |
| 841 | 0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.9250 | NaN | S |
| 843 | 1 | 1 | Serepeca, Miss. Augusta | female | 30.0 | 0 | 0 | 113798 | 31.0000 | NaN | C |
| 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
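The "not in" counterpart mentioned in the table of contents is simply the negation of the isin mask with ~; a small sketch on the same data:
df[~df.Age.isin([10, 20, 30])].head()   # rows whose Age is NOT in the list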
- DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or dict.
- If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.
- Just make values a dict where the key is the column, and the value is a list of items you want to check for.
matches = {'Pclass':[3], 'Age':[20,26]}
df[['Pclass','Age']].isin(matches).all(axis=1)[:5]
PassengerId
1 False
2 False
3 True
4 False
5 False
dtype: bool
df[df[['Pclass','Age']].isin(matches).all(axis=1)]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
| 70 | 0 | 3 | Kink, Mr. Vincenz | male | 26.0 | 2 | 0 | 315151 | 8.6625 | NaN | S |
| 74 | 0 | 3 | Chronopoulos, Mr. Apostolos | male | 26.0 | 1 | 0 | 2680 | 14.4542 | NaN | C |
| 92 | 0 | 3 | Andreasson, Mr. Paul Edvin | male | 20.0 | 0 | 0 | 347466 | 7.8542 | NaN | S |
| 94 | 0 | 3 | Dean, Mr. Bertram Frank | male | 26.0 | 1 | 2 | C.A. 2315 | 20.5750 | NaN | S |
| 114 | 0 | 3 | Jussila, Miss. Katriina | female | 20.0 | 1 | 0 | 4136 | 9.8250 | NaN | S |
| 132 | 0 | 3 | Coelho, Mr. Domingos Fernandeo | male | 20.0 | 0 | 0 | SOTON/O.Q. 3101307 | 7.0500 | NaN | S |
| 163 | 0 | 3 | Bengtsson, Mr. John Viktor | male | 26.0 | 0 | 0 | 347068 | 7.7750 | NaN | S |
| 208 | 1 | 3 | Albimona, Mr. Nassef Cassem | male | 26.0 | 0 | 0 | 2699 | 18.7875 | NaN | C |
| 316 | 1 | 3 | Nilsson, Miss. Helmina Josefina | female | 26.0 | 0 | 0 | 347470 | 7.8542 | NaN | S |
| 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.0 | 0 | 0 | 2648 | 4.0125 | NaN | C |
| 402 | 0 | 3 | Adams, Mr. John | male | 26.0 | 0 | 0 | 341826 | 8.0500 | NaN | S |
| 405 | 0 | 3 | Oreskovic, Miss. Marija | female | 20.0 | 0 | 0 | 315096 | 8.6625 | NaN | S |
| 442 | 0 | 3 | Hampe, Mr. Leon | male | 20.0 | 0 | 0 | 345769 | 9.5000 | NaN | S |
| 510 | 1 | 3 | Lang, Mr. Fang | male | 26.0 | 0 | 0 | 1601 | 56.4958 | NaN | S |
| 618 | 0 | 3 | Lobb, Mrs. William Arthur (Cordelia K Stanlick) | female | 26.0 | 1 | 0 | A/5. 3336 | 16.1000 | NaN | S |
| 623 | 1 | 3 | Nakid, Mr. Sahid | male | 20.0 | 1 | 1 | 2653 | 15.7417 | NaN | C |
| 629 | 0 | 3 | Bostandyeff, Mr. Guentcho | male | 26.0 | 0 | 0 | 349224 | 7.8958 | NaN | S |
| 641 | 0 | 3 | Jensen, Mr. Hans Peder | male | 20.0 | 0 | 0 | 350050 | 7.8542 | NaN | S |
| 665 | 1 | 3 | Lindqvist, Mr. Eino William | male | 20.0 | 1 | 0 | STON/O 2. 3101285 | 7.9250 | NaN | S |
| 683 | 0 | 3 | Olsvigen, Mr. Thor Anderson | male | 20.0 | 0 | 0 | 6563 | 9.2250 | NaN | S |
| 705 | 0 | 3 | Hansen, Mr. Henrik Juul | male | 26.0 | 1 | 0 | 350025 | 7.8542 | NaN | S |
| 726 | 0 | 3 | Oreskovic, Mr. Luka | male | 20.0 | 0 | 0 | 315094 | 8.6625 | NaN | S |
| 763 | 1 | 3 | Barah, Mr. Hanna Assi | male | 20.0 | 0 | 0 | 2663 | 7.2292 | NaN | C |
| 811 | 0 | 3 | Alexander, Mr. William | male | 26.0 | 0 | 0 | 3474 | 7.8875 | NaN | S |
| 841 | 0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.9250 | NaN | S |
| 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S |
| 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
Selecting data using where method|使用where方法选择数据
- Selecting values from a Series with a boolean vector generally returns a subset of the data.
- To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.
- 用布尔向量从Series中选择值,通常会返回数据的子集。
- 为了保证选择输出与原始数据的形状相同,可以在Series和DataFrame中使用where方法。
import numpy as np
df = pd.DataFrame(np.random.randn(20).reshape(4,5))
df
|
0 |
1 |
2 |
3 |
4 |
0 |
-1.141670 |
1.575618 |
0.407986 |
0.432170 |
1.641420 |
1 |
-0.392605 |
0.102139 |
-0.264050 |
-1.397058 |
-0.176585 |
2 |
-0.418739 |
-0.932027 |
1.775478 |
0.145980 |
0.355938 |
3 |
-1.155615 |
0.853764 |
-0.871912 |
0.346349 |
0.558242 |
df.where(df > 0,-df)
|
0 |
1 |
2 |
3 |
4 |
0 |
1.141670 |
1.575618 |
0.407986 |
0.432170 |
1.641420 |
1 |
0.392605 |
0.102139 |
0.264050 |
1.397058 |
0.176585 |
2 |
0.418739 |
0.932027 |
1.775478 |
0.145980 |
0.355938 |
3 |
1.155615 |
0.853764 |
0.871912 |
0.346349 |
0.558242 |
help(pd.DataFrame.where)
Help on function where in module pandas.core.generic:
where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
Replace values where the condition is False.
Parameters
----------
cond : boolean Series/DataFrame, array-like, or callable
Where `cond` is True, keep the original value. Where
False, replace with corresponding value from `other`.
If `cond` is callable, it is computed on the Series/DataFrame and
should return boolean Series/DataFrame or array. The callable must
not change input Series/DataFrame (though pandas doesn't check it).
.. versionadded:: 0.18.1
A callable can be used as cond.
other : scalar, Series/DataFrame, or callable
Entries where `cond` is False are replaced with
corresponding value from `other`.
If other is callable, it is computed on the Series/DataFrame and
should return scalar or Series/DataFrame. The callable must not
change input Series/DataFrame (though pandas doesn't check it).
.. versionadded:: 0.18.1
A callable can be used as other.
inplace : bool, default False
Whether to perform the operation in place on the data.
axis : int, default None
Alignment axis if needed.
level : int, default None
Alignment level if needed.
errors : str, {'raise', 'ignore'}, default 'raise'
Note that currently this parameter won't affect
the results and will always coerce to a suitable dtype.
- 'raise' : allow exceptions to be raised.
- 'ignore' : suppress exceptions. On error return original object.
try_cast : bool, default False
Try to cast the result back to the input type (if possible).
Returns
-------
Same type as caller
See Also
--------
:func:`DataFrame.mask` : Return an object of same shape as
self.
Notes
-----
The where method is an application of the if-then idiom. For each
element in the calling DataFrame, if ``cond`` is ``True`` the
element is used; otherwise the corresponding element from the DataFrame
``other`` is used.
The signature for :func:`DataFrame.where` differs from
:func:`numpy.where`. Roughly ``df1.where(m, df2)`` is equivalent to
``np.where(m, df1, df2)``.
For further details and examples see the ``where`` documentation in
:ref:`indexing `.
Examples
--------
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
>>> s.mask(s > 0)
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
>>> s.where(s > 1, 10)
0 10
1 10
2 2
3 3
4 4
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
>>> m = df % 3 == 0
>>> df.where(m, -df)
A B
0 0 -1
1 -2 3
2 -4 -5
3 6 -7
4 -8 9
>>> df.where(m, -df) == np.where(m, df, -df)
A B
0 True True
1 True True
2 True True
3 True True
4 True True
>>> df.where(m, -df) == df.mask(~m, -df)
A B
0 True True
1 True True
2 True True
3 True True
4 True True
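- cond and other can also be callables, which is handy in method chains; a minimal sketch, reusing the random df above:
df.where(lambda x: x > 0, 0)  # keep positive entries, replace the rest with 0 (callable cond, scalar other)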
Selecting Data using Query|使用查询选择数据
- DataFrame objects have a query() method that allows selection using an expression.
- The same query string can be applied to multiple DataFrames
- Often slightly faster than the equivalent plain-Python boolean indexing
- DataFrame对象有一个query()方法,可以使用表达式进行选择
- 同一查询到多个数据框
- 比python的方法略快一些
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=list('bc'))
df1 = pd.DataFrame(np.random.randint(15, size=(15, 2)), columns=list('bc'))
df
|
b |
c |
0 |
5 |
9 |
1 |
5 |
6 |
2 |
0 |
9 |
3 |
6 |
5 |
4 |
0 |
5 |
5 |
8 |
1 |
6 |
5 |
2 |
7 |
5 |
5 |
8 |
0 |
6 |
9 |
0 |
4 |
df.query('b < c')
|
b |
c |
0 |
5 |
9 |
1 |
5 |
6 |
2 |
0 |
9 |
4 |
0 |
5 |
8 |
0 |
6 |
9 |
0 |
4 |
res = map(lambda f: f.query('b < c'),[df,df1])
list(res)
[ b c
0 5 9
1 5 6
2 0 9
4 0 5
8 0 6
9 0 4, b c
2 2 7
3 7 9
4 1 2
5 1 14
8 0 6
9 5 7
13 2 13]
for elem in res:
print (elem)
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
'c': np.random.randint(5, size=12),
'd': np.random.randint(9, size=12)})
df
|
a |
b |
c |
d |
0 |
a |
a |
1 |
8 |
1 |
a |
a |
4 |
0 |
2 |
b |
a |
1 |
0 |
3 |
b |
a |
1 |
3 |
4 |
c |
b |
1 |
5 |
5 |
c |
b |
1 |
6 |
6 |
d |
b |
3 |
1 |
7 |
d |
b |
1 |
3 |
8 |
e |
c |
0 |
2 |
9 |
e |
c |
1 |
2 |
10 |
f |
c |
3 |
3 |
11 |
f |
c |
0 |
0 |
df.query('a in b')
|
a |
b |
c |
d |
0 |
a |
a |
1 |
8 |
1 |
a |
a |
4 |
0 |
2 |
b |
a |
1 |
0 |
3 |
b |
a |
1 |
3 |
4 |
c |
b |
1 |
5 |
5 |
c |
b |
1 |
6 |
df.query('a not in b')
|
a |
b |
c |
d |
6 |
d |
b |
3 |
1 |
7 |
d |
b |
1 |
3 |
8 |
e |
c |
0 |
2 |
9 |
e |
c |
1 |
2 |
10 |
f |
c |
3 |
3 |
11 |
f |
c |
0 |
0 |
df.query('b == ["b","c"]')
|
a |
b |
c |
d |
4 |
c |
b |
1 |
5 |
5 |
c |
b |
1 |
6 |
6 |
d |
b |
3 |
1 |
7 |
d |
b |
1 |
3 |
8 |
e |
c |
0 |
2 |
9 |
e |
c |
1 |
2 |
10 |
f |
c |
3 |
3 |
11 |
f |
c |
0 |
0 |
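- query() can also reference Python variables from the surrounding scope by prefixing them with @; a small sketch on the same df (the threshold value here is arbitrary):
threshold = 2
df.query('c > @threshold')  # rows where column c exceeds the local variable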
Set/Reset Index|设置/重置索引
- The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed.
- However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.
- Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index:
- pandas Index类及其子类可以被视为实现了一个有序的多集。允许重复。
- 但是,如果您尝试将一个带有重复条目的 Index 对象转换为一个集合,将会出现异常。
- Index 还提供了查找、数据对齐和重新索引所需的基础架构。直接创建 Index 的最简单的方法是将一个列表或其他序列传递给 Index。
index = pd.Index(['e', 'd', 'a', 'b'])
df = pd.DataFrame([1,2,3,4])
df.index = index
df
sales_data = pd.read_excel('../Data/sales-funnel.xlsx')
print(sales_data)
sales_data.set_index(['Manager','Rep'])
Account Name Rep Manager \
0 714466 Trantow-Barrows Craig Booker Debra Henley
1 714466 Trantow-Barrows Craig Booker Debra Henley
2 714466 Trantow-Barrows Craig Booker Debra Henley
3 737550 Fritsch, Russel and Anderson Craig Booker Debra Henley
4 146832 Kiehn-Spinka Daniel Hilton Debra Henley
5 218895 Kulas Inc Daniel Hilton Debra Henley
6 218895 Kulas Inc Daniel Hilton Debra Henley
7 412290 Jerde-Hilpert John Smith Debra Henley
8 740150 Barton LLC John Smith Debra Henley
9 141962 Herman LLC Cedric Moss Fred Anderson
10 163416 Purdy-Kunde Cedric Moss Fred Anderson
11 239344 Stokes LLC Cedric Moss Fred Anderson
12 239344 Stokes LLC Cedric Moss Fred Anderson
13 307599 Kassulke, Ondricka and Metz Wendy Yule Fred Anderson
14 688981 Keeling LLC Wendy Yule Fred Anderson
15 729833 Koepp Ltd Wendy Yule Fred Anderson
16 729833 Koepp Ltd Wendy Yule Fred Anderson
Product Quantity Price Status
0 CPU 1 30000 presented
1 Software 1 10000 presented
2 Maintenance 2 5000 pending
3 CPU 1 35000 declined
4 CPU 2 65000 won
5 CPU 2 40000 pending
6 Software 1 10000 presented
7 Maintenance 2 5000 pending
8 CPU 1 35000 declined
9 CPU 2 65000 won
10 CPU 1 30000 presented
11 Maintenance 1 5000 pending
12 Software 1 10000 presented
13 Maintenance 3 7000 won
14 CPU 5 100000 won
15 CPU 2 65000 declined
16 Monitor 2 5000 presented
|
|
Account |
Name |
Product |
Quantity |
Price |
Status |
Manager |
Rep |
|
|
|
|
|
|
Debra Henley |
Craig Booker |
714466 |
Trantow-Barrows |
CPU |
1 |
30000 |
presented |
Craig Booker |
714466 |
Trantow-Barrows |
Software |
1 |
10000 |
presented |
Craig Booker |
714466 |
Trantow-Barrows |
Maintenance |
2 |
5000 |
pending |
Craig Booker |
737550 |
Fritsch, Russel and Anderson |
CPU |
1 |
35000 |
declined |
Daniel Hilton |
146832 |
Kiehn-Spinka |
CPU |
2 |
65000 |
won |
Daniel Hilton |
218895 |
Kulas Inc |
CPU |
2 |
40000 |
pending |
Daniel Hilton |
218895 |
Kulas Inc |
Software |
1 |
10000 |
presented |
John Smith |
412290 |
Jerde-Hilpert |
Maintenance |
2 |
5000 |
pending |
John Smith |
740150 |
Barton LLC |
CPU |
1 |
35000 |
declined |
Fred Anderson |
Cedric Moss |
141962 |
Herman LLC |
CPU |
2 |
65000 |
won |
Cedric Moss |
163416 |
Purdy-Kunde |
CPU |
1 |
30000 |
presented |
Cedric Moss |
239344 |
Stokes LLC |
Maintenance |
1 |
5000 |
pending |
Cedric Moss |
239344 |
Stokes LLC |
Software |
1 |
10000 |
presented |
Wendy Yule |
307599 |
Kassulke, Ondricka and Metz |
Maintenance |
3 |
7000 |
won |
Wendy Yule |
688981 |
Keeling LLC |
CPU |
5 |
100000 |
won |
Wendy Yule |
729833 |
Koepp Ltd |
CPU |
2 |
65000 |
declined |
Wendy Yule |
729833 |
Koepp Ltd |
Monitor |
2 |
5000 |
presented |
sales_data.reset_index()
|
index |
Account |
Name |
Rep |
Manager |
Product |
Quantity |
Price |
Status |
0 |
0 |
714466 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
CPU |
1 |
30000 |
presented |
1 |
1 |
714466 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
Software |
1 |
10000 |
presented |
2 |
2 |
714466 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
Maintenance |
2 |
5000 |
pending |
3 |
3 |
737550 |
Fritsch, Russel and Anderson |
Craig Booker |
Debra Henley |
CPU |
1 |
35000 |
declined |
4 |
4 |
146832 |
Kiehn-Spinka |
Daniel Hilton |
Debra Henley |
CPU |
2 |
65000 |
won |
5 |
5 |
218895 |
Kulas Inc |
Daniel Hilton |
Debra Henley |
CPU |
2 |
40000 |
pending |
6 |
6 |
218895 |
Kulas Inc |
Daniel Hilton |
Debra Henley |
Software |
1 |
10000 |
presented |
7 |
7 |
412290 |
Jerde-Hilpert |
John Smith |
Debra Henley |
Maintenance |
2 |
5000 |
pending |
8 |
8 |
740150 |
Barton LLC |
John Smith |
Debra Henley |
CPU |
1 |
35000 |
declined |
9 |
9 |
141962 |
Herman LLC |
Cedric Moss |
Fred Anderson |
CPU |
2 |
65000 |
won |
10 |
10 |
163416 |
Purdy-Kunde |
Cedric Moss |
Fred Anderson |
CPU |
1 |
30000 |
presented |
11 |
11 |
239344 |
Stokes LLC |
Cedric Moss |
Fred Anderson |
Maintenance |
1 |
5000 |
pending |
12 |
12 |
239344 |
Stokes LLC |
Cedric Moss |
Fred Anderson |
Software |
1 |
10000 |
presented |
13 |
13 |
307599 |
Kassulke, Ondricka and Metz |
Wendy Yule |
Fred Anderson |
Maintenance |
3 |
7000 |
won |
14 |
14 |
688981 |
Keeling LLC |
Wendy Yule |
Fred Anderson |
CPU |
5 |
100000 |
won |
15 |
15 |
729833 |
Koepp Ltd |
Wendy Yule |
Fred Anderson |
CPU |
2 |
65000 |
declined |
16 |
16 |
729833 |
Koepp Ltd |
Wendy Yule |
Fred Anderson |
Monitor |
2 |
5000 |
presented |
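- set_index() and reset_index() take a few useful options; a brief sketch (purely illustrative):
sales_data.set_index(['Manager','Rep'], drop=False).head()  # keep the indexed columns in the frame too
sales_data.reset_index(drop=True).head()                    # discard the old index instead of adding it back as a column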
Selecting columns by Type
sales_data.select_dtypes(include=['object'])
|
Name |
Rep |
Manager |
Product |
Status |
0 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
CPU |
presented |
1 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
Software |
presented |
2 |
Trantow-Barrows |
Craig Booker |
Debra Henley |
Maintenance |
pending |
3 |
Fritsch, Russel and Anderson |
Craig Booker |
Debra Henley |
CPU |
declined |
4 |
Kiehn-Spinka |
Daniel Hilton |
Debra Henley |
CPU |
won |
5 |
Kulas Inc |
Daniel Hilton |
Debra Henley |
CPU |
pending |
6 |
Kulas Inc |
Daniel Hilton |
Debra Henley |
Software |
presented |
7 |
Jerde-Hilpert |
John Smith |
Debra Henley |
Maintenance |
pending |
8 |
Barton LLC |
John Smith |
Debra Henley |
CPU |
declined |
9 |
Herman LLC |
Cedric Moss |
Fred Anderson |
CPU |
won |
10 |
Purdy-Kunde |
Cedric Moss |
Fred Anderson |
CPU |
presented |
11 |
Stokes LLC |
Cedric Moss |
Fred Anderson |
Maintenance |
pending |
12 |
Stokes LLC |
Cedric Moss |
Fred Anderson |
Software |
presented |
13 |
Kassulke, Ondricka and Metz |
Wendy Yule |
Fred Anderson |
Maintenance |
won |
14 |
Keeling LLC |
Wendy Yule |
Fred Anderson |
CPU |
won |
15 |
Koepp Ltd |
Wendy Yule |
Fred Anderson |
CPU |
declined |
16 |
Koepp Ltd |
Wendy Yule |
Fred Anderson |
Monitor |
presented |
sales_data.select_dtypes(include=['int64'])
|
Account |
Quantity |
Price |
0 |
714466 |
1 |
30000 |
1 |
714466 |
1 |
10000 |
2 |
714466 |
2 |
5000 |
3 |
737550 |
1 |
35000 |
4 |
146832 |
2 |
65000 |
5 |
218895 |
2 |
40000 |
6 |
218895 |
1 |
10000 |
7 |
412290 |
2 |
5000 |
8 |
740150 |
1 |
35000 |
9 |
141962 |
2 |
65000 |
10 |
163416 |
1 |
30000 |
11 |
239344 |
1 |
5000 |
12 |
239344 |
1 |
10000 |
13 |
307599 |
3 |
7000 |
14 |
688981 |
5 |
100000 |
15 |
729833 |
2 |
65000 |
16 |
729833 |
2 |
5000 |
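- select_dtypes() also accepts an exclude argument, which can be combined with include:
sales_data.select_dtypes(exclude=['object'])  # everything except the string columns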
Accessing Multi-Index Data
- Excel data often comes with a hierarchical (multi-level) index
sales_data = pd.read_excel('../Data/sales-funnel.xlsx',index_col=[0,1])
sales_data
|
|
Account |
Name |
Product |
Quantity |
Price |
Status |
Manager |
Rep |
|
|
|
|
|
|
Debra Henley |
Craig Booker |
714466 |
Trantow-Barrows |
CPU |
1 |
30000 |
presented |
Craig Booker |
714466 |
Trantow-Barrows |
Software |
1 |
10000 |
presented |
Craig Booker |
714466 |
Trantow-Barrows |
Maintenance |
2 |
5000 |
pending |
Craig Booker |
737550 |
Fritsch, Russel and Anderson |
CPU |
1 |
35000 |
declined |
Daniel Hilton |
146832 |
Kiehn-Spinka |
CPU |
2 |
65000 |
won |
Daniel Hilton |
218895 |
Kulas Inc |
CPU |
2 |
40000 |
pending |
Daniel Hilton |
218895 |
Kulas Inc |
Software |
1 |
10000 |
presented |
John Smith |
412290 |
Jerde-Hilpert |
Maintenance |
2 |
5000 |
pending |
John Smith |
740150 |
Barton LLC |
CPU |
1 |
35000 |
declined |
Fred Anderson |
Cedric Moss |
141962 |
Herman LLC |
CPU |
2 |
65000 |
won |
Cedric Moss |
163416 |
Purdy-Kunde |
CPU |
1 |
30000 |
presented |
Cedric Moss |
239344 |
Stokes LLC |
Maintenance |
1 |
5000 |
pending |
Cedric Moss |
239344 |
Stokes LLC |
Software |
1 |
10000 |
presented |
Wendy Yule |
307599 |
Kassulke, Ondricka and Metz |
Maintenance |
3 |
7000 |
won |
Wendy Yule |
688981 |
Keeling LLC |
CPU |
5 |
100000 |
won |
Wendy Yule |
729833 |
Koepp Ltd |
CPU |
2 |
65000 |
declined |
Wendy Yule |
729833 |
Koepp Ltd |
Monitor |
2 |
5000 |
presented |
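- With the MultiIndex in place, rows can be selected by one or both levels; a minimal sketch (the MultiIndex is rebuilt explicitly so the selections below are unambiguous):
indexed = sales_data.reset_index().set_index(['Manager','Rep'])
indexed.loc['Debra Henley']                      # all rows for one manager (outer level)
indexed.loc[('Debra Henley','Craig Booker')]     # a single (Manager, Rep) pair
indexed.xs('Craig Booker', level='Rep')          # cross-section on the inner level only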
Objective : Working on TimeSeries Data
- Overview
- Timestamps vs. Time Spans
- Converting to timestamps
- Generating ranges of timestamps
- Timestamp limitations
- Indexing
- Time/date components
- DateOffset objects
- Time Series-Related Instance Methods
- Resampling
- Time span representation
- Converting between representations
- Representing out-of-bounds spans
- Time zone handling
Overview
- Pandas contains extensive capabilities and features for working with time series data for all domains.
- Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.
- pandas captures 4 general time related concepts:
- Date times: A specific date and time with timezone support. Similar to datetime.datetime from the standard library.
- Time deltas: An absolute time duration. Similar to datetime.timedelta from the standard library.
- Time spans: A span of time defined by a point in time and its associated frequency.
- Date offsets: A relative time duration that respects calendar arithmetic. Similar to dateutil.relativedelta.relativedelta from the dateutil package.
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
Timestamp('2015-07-04 00:00:00')
date.strftime('%A')
'Saturday'
pd.Series(range(3), index=pd.date_range('2000', freq='D', periods=3))
2000-01-01 0
2000-01-02 1
2000-01-03 2
Freq: D, dtype: int64
pd.Series(pd.period_range('1/1/2011', freq='M', periods=3))
0 2011-01
1 2011-02
2 2011-03
dtype: period[M]
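- Time deltas and date offsets differ around calendar irregularities such as DST; a short sketch (the timezone is chosen only for illustration):
ts = pd.Timestamp('2016-10-30 00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)    # exactly 24 hours later
ts + pd.DateOffset(days=1)   # same wall-clock time on the next calendar day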
Objective : Combining DataFrames
- Concatenate
- Append
- Database Style Merge
- Database Style Join
- Working on TimeSeries Data
1. Concatenate
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
print(df1)
df2 = pd.DataFrame({'A':[11,12,13],'B':[14,15,16],'C':[17,18,19]})
print(df2)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
A B C
0 11 14 17
1 12 15 18
2 13 16 19
- Combining DataFrames
- The index is not reset by default
- The default axis is 0 (rows are stacked)
pd.concat([df1,df2])
|
A |
B |
C |
0 |
1 |
4 |
7 |
1 |
2 |
5 |
8 |
2 |
3 |
6 |
9 |
0 |
11 |
14 |
17 |
1 |
12 |
15 |
18 |
2 |
13 |
16 |
19 |
pd.concat([df1,df2],ignore_index=True)
|
A |
B |
C |
0 |
1 |
4 |
7 |
1 |
2 |
5 |
8 |
2 |
3 |
6 |
9 |
3 |
11 |
14 |
17 |
4 |
12 |
15 |
18 |
5 |
13 |
16 |
19 |
- Concatenating along columns (axis=1)
- Rows are aligned on the index when concatenating side by side
pd.concat([df1,df2],axis=1)
|
A |
B |
C |
A |
B |
C |
0 |
1 |
4 |
7 |
11 |
14 |
17 |
1 |
2 |
5 |
8 |
12 |
15 |
18 |
2 |
3 |
6 |
9 |
13 |
16 |
19 |
df2 = pd.DataFrame({'A':[11,12,13],'B':[14,15,16],'C':[17,18,19]}, index=[2,3,4])
df2
|
A |
B |
C |
2 |
11 |
14 |
17 |
3 |
12 |
15 |
18 |
4 |
13 |
16 |
19 |
- By default, an outer join is performed on the index
pd.concat([df1,df2],axis=1)
|
A |
B |
C |
A |
B |
C |
0 |
1.0 |
4.0 |
7.0 |
NaN |
NaN |
NaN |
1 |
2.0 |
5.0 |
8.0 |
NaN |
NaN |
NaN |
2 |
3.0 |
6.0 |
9.0 |
11.0 |
14.0 |
17.0 |
3 |
NaN |
NaN |
NaN |
12.0 |
15.0 |
18.0 |
4 |
NaN |
NaN |
NaN |
13.0 |
16.0 |
19.0 |
pd.concat([df1,df2],axis=1,join='inner')
|
A |
B |
C |
A |
B |
C |
2 |
3 |
6 |
9 |
11 |
14 |
17 |
- Retaining the source of each row after concatenation by passing keys
pd.concat([df1,df2], keys=['df1','df2'])
|
|
A |
B |
C |
df1 |
0 |
1 |
4 |
7 |
1 |
2 |
5 |
8 |
2 |
3 |
6 |
9 |
df2 |
2 |
11 |
14 |
17 |
3 |
12 |
15 |
18 |
4 |
13 |
16 |
19 |
Append
- append() predates concat()
- Avoid append and prefer concat; a minimal equivalence sketch follows
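- A minimal equivalence sketch, using the df1 and df2 defined above:
df1.append(df2, ignore_index=True)        # legacy spelling
pd.concat([df1, df2], ignore_index=True)  # preferred, same result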
Merge
- pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.
- These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations (like base::merge.data.frame in R).
- The reason for this is careful algorithmic design and the internal layout of the data in DataFrame.
Type of merges
- one-to-one : for example when joining two DataFrame objects on their indexes (which must contain unique values).
- many-to-one : for example when joining an index (unique) to one or more columns in a different DataFrame.
- many-to-many : joining columns on columns.
df1 = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
print(df1)
df2 = pd.DataFrame({'A':[1,2,3],'D':[14,15,16],'E':[17,18,19]})
print(df2)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
A D E
0 1 14 17
1 2 15 18
2 3 16 19
- validate='one_to_one' checks that the merge keys are unique in both frames
- on='A' explicitly names the join column
df1.merge(df2, on='A', validate='one_to_one')
|
A |
B |
C |
D |
E |
0 |
1 |
4 |
7 |
14 |
17 |
1 |
2 |
5 |
8 |
15 |
18 |
2 |
3 |
6 |
9 |
16 |
19 |
- left : LEFT OUTER JOIN : Use keys from left frame only
- right : RIGHT OUTER JOIN : Use keys from right frame only
- outer : FULL OUTER JOIN : Use union of keys from both frames
- inner : INNER JOIN : Use intersection of keys from both frames : default
- validate='one_to_many' checks that the left-hand keys are unique
- Here we perform an outer join
df1.merge(df2, on='A', validate='one_to_many', how='outer')
|
A |
B |
C |
D |
E |
0 |
1 |
4 |
7 |
14 |
17 |
1 |
2 |
5 |
8 |
15 |
18 |
2 |
3 |
6 |
9 |
16 |
19 |
df1.merge(df2, on='A', validate='one_to_many', how='inner')
|
A |
B |
C |
D |
E |
0 |
1 |
4 |
7 |
14 |
17 |
1 |
2 |
5 |
8 |
15 |
18 |
2 |
3 |
6 |
9 |
16 |
19 |
- Another way of doing this, using the top-level pd.merge function
pd.merge(df1,df2,on=['A'])
|
A |
B |
C |
D |
E |
0 |
1 |
4 |
7 |
14 |
17 |
1 |
2 |
5 |
8 |
15 |
18 |
2 |
3 |
6 |
9 |
16 |
19 |
print(df1)
df2 = pd.DataFrame({'A':[2,2,3],'D':[1,2,3],'E':[17,18,19]})
print(df2)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
A D E
0 2 1 17
1 2 2 18
2 3 3 19
- Merging on differently named key columns with left_on / right_on
- Adding suffixes to distinguish the overlapping column names
df1.merge(df2,left_on='A',right_on='D', suffixes=['_df1','_df2'])
|
A_df1 |
B |
C |
A_df2 |
D |
E |
0 |
1 |
4 |
7 |
2 |
1 |
17 |
1 |
2 |
5 |
8 |
2 |
2 |
18 |
2 |
3 |
6 |
9 |
3 |
3 |
19 |
Join
- join() is a convenient method for combining the columns of two DataFrames, aligned on their indexes, into a single result DataFrame
print(df1)
print(df2)
df1.join(df2,lsuffix='_df1',rsuffix='_df2')
A B C
0 1 4 7
1 2 5 8
2 3 6 9
A D E
0 2 1 17
1 2 2 18
2 3 3 19
|
A_df1 |
B |
C |
A_df2 |
D |
E |
0 |
1 |
4 |
7 |
2 |
1 |
17 |
1 |
2 |
5 |
8 |
2 |
2 |
18 |
2 |
3 |
6 |
9 |
3 |
3 |
19 |
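- join() defaults to a left join on the index; the how parameter works as in merge. A small sketch with partially overlapping indexes (the frames here are made up for illustration):
left = pd.DataFrame({'X': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'Y': [10, 20]}, index=['b', 'd'])
left.join(right)                 # left join: keeps a, b, c
left.join(right, how='inner')    # only the common label b
left.join(right, how='outer')    # union of labels a, b, c, d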
TimeSeries Friendly Operations
- merge_ordered : Merging ordered data like time series
df1 = pd.DataFrame({'Date':pd.date_range('2000', freq='D', periods=3), 'Sales':[10,20,30]} )
df1
|
Date |
Sales |
0 |
2000-01-01 |
10 |
1 |
2000-01-02 |
20 |
2 |
2000-01-03 |
30 |
df2 = pd.DataFrame({'Date':pd.date_range('2001', freq='D', periods=3), 'Sales':[20,40,50]} )
df2
|
Date |
Sales |
0 |
2001-01-01 |
20 |
1 |
2001-01-02 |
40 |
2 |
2001-01-03 |
50 |
pd.merge_ordered(df1,df2)
|
Date |
Sales |
0 |
2000-01-01 |
10 |
1 |
2000-01-02 |
20 |
2 |
2000-01-03 |
30 |
3 |
2001-01-01 |
20 |
4 |
2001-01-02 |
40 |
5 |
2001-01-03 |
50 |
pd.merge_ordered(df1,df2,suffixes=['_df1','_df2'], on='Date')
|
Date |
Sales_df1 |
Sales_df2 |
0 |
2000-01-01 |
10.0 |
NaN |
1 |
2000-01-02 |
20.0 |
NaN |
2 |
2000-01-03 |
30.0 |
NaN |
3 |
2001-01-01 |
NaN |
20.0 |
4 |
2001-01-02 |
NaN |
40.0 |
5 |
2001-01-03 |
NaN |
50.0 |
- merge_asof : is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the left DataFrame, we select the last row in the right DataFrame whose on key is less than the left’s key. Both DataFrames must be sorted by the key.
trades = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.038',
'20160525 13:30:00.048',
'20160525 13:30:00.048',
'20160525 13:30:00.048']),
'ticker': ['MSFT', 'MSFT',
'GOOG', 'GOOG', 'AAPL'],
'price': [51.95, 51.95,
720.77, 720.92, 98.00],
'quantity': [75, 155,
100, 100, 100]},
columns=['time', 'ticker', 'price', 'quantity'])
trades
|
time |
ticker |
price |
quantity |
0 |
2016-05-25 13:30:00.023 |
MSFT |
51.95 |
75 |
1 |
2016-05-25 13:30:00.038 |
MSFT |
51.95 |
155 |
2 |
2016-05-25 13:30:00.048 |
GOOG |
720.77 |
100 |
3 |
2016-05-25 13:30:00.048 |
GOOG |
720.92 |
100 |
4 |
2016-05-25 13:30:00.048 |
AAPL |
98.00 |
100 |
quotes = pd.DataFrame({
'time': pd.to_datetime(['20160525 13:30:00.023',
'20160525 13:30:00.023',
'20160525 13:30:00.030',
'20160525 13:30:00.041',
'20160525 13:30:00.048',
'20160525 13:30:00.049',
'20160525 13:30:00.072',
'20160525 13:30:00.075']),
'ticker': ['GOOG', 'MSFT', 'MSFT',
'MSFT', 'GOOG', 'AAPL', 'GOOG',
'MSFT'],
'bid': [720.50, 51.95, 51.97, 51.99,
720.50, 97.99, 720.50, 52.01],
'ask': [720.93, 51.96, 51.98, 52.00,
720.93, 98.01, 720.88, 52.03]},
columns=['time', 'ticker', 'bid', 'ask'])
quotes
|
time |
ticker |
bid |
ask |
0 |
2016-05-25 13:30:00.023 |
GOOG |
720.50 |
720.93 |
1 |
2016-05-25 13:30:00.023 |
MSFT |
51.95 |
51.96 |
2 |
2016-05-25 13:30:00.030 |
MSFT |
51.97 |
51.98 |
3 |
2016-05-25 13:30:00.041 |
MSFT |
51.99 |
52.00 |
4 |
2016-05-25 13:30:00.048 |
GOOG |
720.50 |
720.93 |
5 |
2016-05-25 13:30:00.049 |
AAPL |
97.99 |
98.01 |
6 |
2016-05-25 13:30:00.072 |
GOOG |
720.50 |
720.88 |
7 |
2016-05-25 13:30:00.075 |
MSFT |
52.01 |
52.03 |
trades.merge(quotes, on=['time','ticker'])
|
time |
ticker |
price |
quantity |
bid |
ask |
0 |
2016-05-25 13:30:00.023 |
MSFT |
51.95 |
75 |
51.95 |
51.96 |
1 |
2016-05-25 13:30:00.048 |
GOOG |
720.77 |
100 |
720.50 |
720.93 |
2 |
2016-05-25 13:30:00.048 |
GOOG |
720.92 |
100 |
720.50 |
720.93 |
- Merging on an approximate rather than exact key
- direction='nearest' matches each trade to the closest quote in time
pd.merge_asof(trades, quotes,
on='time',
by='ticker', direction='nearest')
|
time |
ticker |
price |
quantity |
bid |
ask |
0 |
2016-05-25 13:30:00.023 |
MSFT |
51.95 |
75 |
51.95 |
51.96 |
1 |
2016-05-25 13:30:00.038 |
MSFT |
51.95 |
155 |
51.99 |
52.00 |
2 |
2016-05-25 13:30:00.048 |
GOOG |
720.77 |
100 |
720.50 |
720.93 |
3 |
2016-05-25 13:30:00.048 |
GOOG |
720.92 |
100 |
720.50 |
720.93 |
4 |
2016-05-25 13:30:00.048 |
AAPL |
98.00 |
100 |
97.99 |
98.01 |
- Limiting how far apart matched keys may be with a tolerance
pd.merge_asof(trades, quotes,
on='time',
by='ticker', tolerance=pd.Timedelta('2ms'))
|
time |
ticker |
price |
quantity |
bid |
ask |
0 |
2016-05-25 13:30:00.023 |
MSFT |
51.95 |
75 |
51.95 |
51.96 |
1 |
2016-05-25 13:30:00.038 |
MSFT |
51.95 |
155 |
NaN |
NaN |
2 |
2016-05-25 13:30:00.048 |
GOOG |
720.77 |
100 |
720.50 |
720.93 |
3 |
2016-05-25 13:30:00.048 |
GOOG |
720.92 |
100 |
720.50 |
720.93 |
4 |
2016-05-25 13:30:00.048 |
AAPL |
98.00 |
100 |
NaN |
NaN |
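- The direction argument controls the search side: 'backward' (the default) takes the last quote at or before each trade, 'forward' the first one at or after it:
pd.merge_asof(trades, quotes, on='time', by='ticker', direction='backward')  # default behaviour made explicit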
Objective : Shaping & Structuring
- Pivoting
- Pivot Tables
- Stacking & Unstacking
- Melting
- GroupBy
- Cross Tab
- Tiling
- Computing Dummy Variables
- Factorize
- Exploding Data
import pandas as pd
import numpy as np
gap_data = pd.read_csv('../Data/gapminder-FiveYearData.csv')
gap_data.sample(10)
|
country |
year |
pop |
continent |
lifeExp |
gdpPercap |
398 |
Czech Republic |
1962 |
9620282.0 |
Europe |
69.900 |
10136.867130 |
446 |
Ecuador |
1962 |
4681707.0 |
Americas |
54.640 |
4086.114078 |
1505 |
Taiwan |
1977 |
16785196.0 |
Asia |
70.590 |
5596.519826 |
86 |
Bahrain |
1962 |
171863.0 |
Asia |
56.923 |
12753.275140 |
1078 |
Nepal |
2002 |
25873917.0 |
Asia |
61.340 |
1057.206311 |
162 |
Botswana |
1982 |
970347.0 |
Africa |
61.484 |
4551.142150 |
1228 |
Poland |
1972 |
33039545.0 |
Europe |
70.850 |
8006.506993 |
1089 |
Netherlands |
1997 |
15604464.0 |
Europe |
78.030 |
30246.130630 |
1645 |
Vietnam |
1957 |
28998543.0 |
Asia |
42.887 |
676.285448 |
451 |
Ecuador |
1987 |
9545158.0 |
Americas |
67.231 |
6481.776993 |
help(pd.DataFrame.sample)
Help on function sample in module pandas.core.generic:
sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Return a random sample of items from an axis of object.
You can use `random_state` for reproducibility.
Parameters
----------
n : int, optional
Number of items from axis to return. Cannot be used with `frac`.
Default = 1 if `frac` = None.
frac : float, optional
Fraction of axis items to return. Cannot be used with `n`.
replace : bool, default False
Sample with or without replacement.
weights : str or ndarray-like, optional
Default 'None' results in equal probability weighting.
If passed a Series, will align with target object on index. Index
values in weights not found in sampled object will be ignored and
index values in sampled object not in weights will be assigned
weights of zero.
If called on a DataFrame, will accept the name of a column
when axis = 0.
Unless weights are a Series, weights must be same length as axis
being sampled.
If weights do not sum to 1, they will be normalized to sum to 1.
Missing values in the weights column will be treated as zero.
Infinite values not allowed.
random_state : int or numpy.random.RandomState, optional
Seed for the random number generator (if int), or numpy RandomState
object.
axis : int or string, optional
Axis to sample. Accepts axis number or name. Default is stat axis
for given data type (0 for Series and DataFrames).
Returns
-------
Series or DataFrame
A new object of same type as caller containing `n` items randomly
sampled from the caller object.
See Also
--------
numpy.random.choice: Generates a random sample from a given 1-D numpy
array.
Examples
--------
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
... 'num_wings': [2, 0, 0, 0],
... 'num_specimen_seen': [10, 2, 1, 8]},
... index=['falcon', 'dog', 'spider', 'fish'])
>>> df
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
Extract 3 random elements from the ``Series`` ``df['num_legs']``:
Note that we use `random_state` to ensure the reproducibility of
the examples.
>>> df['num_legs'].sample(n=3, random_state=1)
fish 0
spider 8
falcon 2
Name: num_legs, dtype: int64
A random 50% sample of the ``DataFrame`` with replacement:
>>> df.sample(frac=0.5, replace=True, random_state=1)
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
Using a DataFrame column as weights. Rows with larger value in the
`num_specimen_seen` column are more likely to be sampled.
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
- Reshaping using dataframes means the transformation of the structure of a table or vector (i.e. DataFrame or Series) to make it suitable for further analysis. We will study 10 techniques for this.
Pivoting
- Creates a new derived table out of a given table.
- pivot() takes three parameters (index, columns, values), each naming columns of the original table; values may list several columns.
- The table below shows how life expectancy evolves over the years for each country.
- Constraint: there cannot be more than one value for any (country, year) pair.
gap_data.pivot(index='country',columns='year', values=['lifeExp'])
|
lifeExp |
year |
1952 |
1957 |
1962 |
1967 |
1972 |
1977 |
1982 |
1987 |
1992 |
1997 |
2002 |
2007 |
country |
|
|
|
|
|
|
|
|
|
|
|
|
Afghanistan |
28.801 |
30.332 |
31.997 |
34.020 |
36.088 |
38.438 |
39.854 |
40.822 |
41.674 |
41.763 |
42.129 |
43.828 |
Albania |
55.230 |
59.280 |
64.820 |
66.220 |
67.690 |
68.930 |
70.420 |
72.000 |
71.581 |
72.950 |
75.651 |
76.423 |
Algeria |
43.077 |
45.685 |
48.303 |
51.407 |
54.518 |
58.014 |
61.368 |
65.799 |
67.744 |
69.152 |
70.994 |
72.301 |
Angola |
30.015 |
31.999 |
34.000 |
35.985 |
37.928 |
39.483 |
39.942 |
39.906 |
40.647 |
40.963 |
41.003 |
42.731 |
Argentina |
62.485 |
64.399 |
65.142 |
65.634 |
67.065 |
68.481 |
69.942 |
70.774 |
71.868 |
73.275 |
74.340 |
75.320 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Vietnam |
40.412 |
42.887 |
45.363 |
47.838 |
50.254 |
55.764 |
58.816 |
62.820 |
67.662 |
70.672 |
73.017 |
74.249 |
West Bank and Gaza |
43.160 |
45.671 |
48.127 |
51.631 |
56.532 |
60.765 |
64.406 |
67.046 |
69.718 |
71.096 |
72.370 |
73.422 |
Yemen Rep. |
32.548 |
33.970 |
35.180 |
36.984 |
39.848 |
44.175 |
49.113 |
52.922 |
55.599 |
58.020 |
60.308 |
62.698 |
Zambia |
42.038 |
44.077 |
46.023 |
47.768 |
50.107 |
51.386 |
51.821 |
50.821 |
46.100 |
40.238 |
39.193 |
42.384 |
Zimbabwe |
48.451 |
50.469 |
52.358 |
53.995 |
55.635 |
57.674 |
60.363 |
62.351 |
60.377 |
46.809 |
39.989 |
43.487 |
142 rows × 12 columns
gap_data.pivot(index='continent',columns='year', values=['lifeExp'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
1 #Error: Since multiple values for tuple (continent,year)
----> 2 gap_data.pivot(index='continent',columns='year', values=['lifeExp'])
d:\Anaconda3\lib\site-packages\pandas\core\frame.py in pivot(self, index, columns, values)
5917 from pandas.core.reshape.pivot import pivot
5918
-> 5919 return pivot(self, index=index, columns=columns, values=values)
5920
5921 _shared_docs[
d:\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot(data, index, columns, values)
428 else:
429 indexed = data._constructor_sliced(data[values].values, index=index)
--> 430 return indexed.unstack(columns)
431
432
d:\Anaconda3\lib\site-packages\pandas\core\frame.py in unstack(self, level, fill_value)
6376 from pandas.core.reshape.reshape import unstack
6377
-> 6378 return unstack(self, level, fill_value)
6379
6380 _shared_docs[
d:\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in unstack(obj, level, fill_value)
410 if isinstance(obj, DataFrame):
411 if isinstance(obj.index, MultiIndex):
--> 412 return _unstack_frame(obj, level, fill_value=fill_value)
413 else:
414 return obj.T.stack(dropna=False)
d:\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in _unstack_frame(obj, level, fill_value)
440 value_columns=obj.columns,
441 fill_value=fill_value,
--> 442 constructor=obj._constructor,
443 )
444 return unstacker.get_result()
d:\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
140
141 self._make_sorted_values_labels()
--> 142 self._make_selectors()
143
144 def _make_sorted_values_labels(self):
d:\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in _make_selectors(self)
178
179 if mask.sum() < len(self.index):
--> 180 raise ValueError("Index contains duplicate entries, " "cannot reshape")
181
182 self.group_index = comp_index
ValueError: Index contains duplicate entries, cannot reshape
Pivot Table
- Pivot tables solve the previous problem.
- They can aggregate overlapping values.
- In this data each continent repeats across many countries and years.
- The aggregation function here is sum.
gap_data.pivot_table(index='continent',columns='year',values='pop', aggfunc=np.sum)
year |
1952 |
1957 |
1962 |
1967 |
1972 |
1977 |
1982 |
1987 |
1992 |
1997 |
2002 |
2007 |
continent |
|
|
|
|
|
|
|
|
|
|
|
|
Africa |
2.376405e+08 |
2.648377e+08 |
2.965169e+08 |
3.352895e+08 |
3.798795e+08 |
4.330610e+08 |
4.993486e+08 |
5.748341e+08 |
6.590815e+08 |
7.438330e+08 |
8.337239e+08 |
9.295397e+08 |
Americas |
3.451524e+08 |
3.869539e+08 |
4.332703e+08 |
4.807466e+08 |
5.293842e+08 |
5.780677e+08 |
6.302909e+08 |
6.827540e+08 |
7.392741e+08 |
7.969004e+08 |
8.497728e+08 |
8.988712e+08 |
Asia |
1.395357e+09 |
1.562781e+09 |
1.696357e+09 |
1.905663e+09 |
2.150972e+09 |
2.384514e+09 |
2.610136e+09 |
2.871221e+09 |
3.133292e+09 |
3.383286e+09 |
3.601802e+09 |
3.811954e+09 |
Europe |
4.181208e+08 |
4.378904e+08 |
4.603552e+08 |
4.811790e+08 |
5.006351e+08 |
5.171645e+08 |
5.312669e+08 |
5.430942e+08 |
5.581428e+08 |
5.689441e+08 |
5.782239e+08 |
5.860985e+08 |
Oceania |
1.068601e+07 |
1.194198e+07 |
1.328352e+07 |
1.460041e+07 |
1.610610e+07 |
1.723900e+07 |
1.839485e+07 |
1.957442e+07 |
2.091965e+07 |
2.224143e+07 |
2.345483e+07 |
2.454995e+07 |
- Adding margins gives row and column totals
- The margin label can be customized with margins_name
gap_data.pivot_table(index='continent',columns='year',values='pop', aggfunc=np.sum,margins=True, margins_name='Total')
year |
1952 |
1957 |
1962 |
1967 |
1972 |
1977 |
1982 |
1987 |
1992 |
1997 |
2002 |
2007 |
Total |
continent |
|
|
|
|
|
|
|
|
|
|
|
|
|
Africa |
2.376405e+08 |
2.648377e+08 |
2.965169e+08 |
3.352895e+08 |
3.798795e+08 |
4.330610e+08 |
4.993486e+08 |
5.748341e+08 |
6.590815e+08 |
7.438330e+08 |
8.337239e+08 |
9.295397e+08 |
6.187586e+09 |
Americas |
3.451524e+08 |
3.869539e+08 |
4.332703e+08 |
4.807466e+08 |
5.293842e+08 |
5.780677e+08 |
6.302909e+08 |
6.827540e+08 |
7.392741e+08 |
7.969004e+08 |
8.497728e+08 |
8.988712e+08 |
7.351438e+09 |
Asia |
1.395357e+09 |
1.562781e+09 |
1.696357e+09 |
1.905663e+09 |
2.150972e+09 |
2.384514e+09 |
2.610136e+09 |
2.871221e+09 |
3.133292e+09 |
3.383286e+09 |
3.601802e+09 |
3.811954e+09 |
3.050733e+10 |
Europe |
4.181208e+08 |
4.378904e+08 |
4.603552e+08 |
4.811790e+08 |
5.006351e+08 |
5.171645e+08 |
5.312669e+08 |
5.430942e+08 |
5.581428e+08 |
5.689441e+08 |
5.782239e+08 |
5.860985e+08 |
6.181115e+09 |
Oceania |
1.068601e+07 |
1.194198e+07 |
1.328352e+07 |
1.460041e+07 |
1.610610e+07 |
1.723900e+07 |
1.839485e+07 |
1.957442e+07 |
2.091965e+07 |
2.224143e+07 |
2.345483e+07 |
2.454995e+07 |
2.129921e+08 |
Total |
2.406957e+09 |
2.664405e+09 |
2.899783e+09 |
3.217478e+09 |
3.576977e+09 |
3.930046e+09 |
4.289437e+09 |
4.691477e+09 |
5.110710e+09 |
5.515204e+09 |
5.886978e+09 |
6.251013e+09 |
5.044047e+10 |
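- pivot_table can apply several aggregation functions at once and fill missing cells; a short sketch:
gap_data.pivot_table(index='continent', columns='year', values='pop',
                     aggfunc=[np.mean, np.max], fill_value=0)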
Stacking & Unstacking
- Let us assume we have a DataFrame with MultiIndices on the rows and columns.
- Stacking a DataFrame means moving (also rotating or pivoting) the innermost column index to become the innermost row index.
- The inverse operation is called unstacking. It means moving the innermost row index to become the innermost column index.
- Stacking makes the DataFrame taller and can yield useful insights.
- Unstacking makes the DataFrame wider and can yield useful observations.
index = pd.MultiIndex.from_product([[2013, 2014], ['yes','no']],
names=['year', 'death'])
columns = pd.MultiIndex.from_product([['Mumbai', 'Delhi', 'Bangalore'],
['two-wheeler', 'four-wheeler']],
names=['city', 'type'])
data = np.random.randint(1,100,(4,6))
accident_data = pd.DataFrame(data, index=index, columns=columns)
accident_data
|
city |
Mumbai |
Delhi |
Bangalore |
|
type |
two-wheeler |
four-wheeler |
two-wheeler |
four-wheeler |
two-wheeler |
four-wheeler |
year |
death |
|
|
|
|
|
|
2013 |
yes |
93 |
31 |
2 |
17 |
95 |
23 |
no |
12 |
7 |
65 |
54 |
5 |
29 |
2014 |
yes |
83 |
54 |
51 |
45 |
36 |
71 |
no |
47 |
65 |
25 |
84 |
44 |
30 |
accident_data.stack()
|
|
city |
Bangalore |
Delhi |
Mumbai |
year |
death |
type |
|
|
|
2013 |
yes |
four-wheeler |
23 |
17 |
31 |
two-wheeler |
95 |
2 |
93 |
no |
four-wheeler |
29 |
54 |
7 |
two-wheeler |
5 |
65 |
12 |
2014 |
yes |
four-wheeler |
71 |
45 |
54 |
two-wheeler |
36 |
51 |
83 |
no |
four-wheeler |
30 |
84 |
65 |
two-wheeler |
44 |
25 |
47 |
accident_data.unstack()
city |
Mumbai |
Delhi |
Bangalore |
type |
two-wheeler |
four-wheeler |
two-wheeler |
four-wheeler |
two-wheeler |
four-wheeler |
death |
no |
yes |
no |
yes |
no |
yes |
no |
yes |
no |
yes |
no |
yes |
year |
|
|
|
|
|
|
|
|
|
|
|
|
2013 |
12 |
93 |
7 |
31 |
65 |
2 |
54 |
17 |
5 |
95 |
29 |
23 |
2014 |
47 |
83 |
65 |
54 |
25 |
51 |
84 |
45 |
44 |
36 |
30 |
71 |
countries = ['India','India','US','US','Australia','Australia','Japan','Japan']
gender = ['male','female','male','female','male','female','male','female']
list(zip(countries,gender))
[('India', 'male'),
('India', 'female'),
('US', 'male'),
('US', 'female'),
('Australia', 'male'),
('Australia', 'female'),
('Japan', 'male'),
('Japan', 'female')]
index = pd.MultiIndex.from_tuples(list(zip(countries,gender)), names=['country', 'gender'])
index
MultiIndex([( 'India', 'male'),
( 'India', 'female'),
( 'US', 'male'),
( 'US', 'female'),
('Australia', 'male'),
('Australia', 'female'),
( 'Japan', 'male'),
( 'Japan', 'female')],
names=['country', 'gender'])
fake_phd_data = pd.DataFrame([10,20,13,15,16,20,33,12], index=index, columns=['PhDs'])
fake_phd_data
|
|
PhDs |
country |
gender |
|
India |
male |
10 |
female |
20 |
US |
male |
13 |
female |
15 |
Australia |
male |
16 |
female |
20 |
Japan |
male |
33 |
female |
12 |
fake_phd_data.unstack()
|
PhDs |
gender |
female |
male |
country |
|
|
Australia |
20 |
16 |
India |
20 |
10 |
Japan |
12 |
33 |
US |
15 |
13 |
fake_phd_data.T.stack()
|
country |
Australia |
India |
Japan |
US |
|
gender |
|
|
|
|
PhDs |
female |
20 |
20 |
12 |
15 |
male |
16 |
10 |
33 |
13 |
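- Both stack and unstack accept a level argument, so any level of a MultiIndex can be moved, not just the innermost one:
accident_data.stack(level='city')     # move the 'city' column level into the rows
accident_data.unstack(level='year')   # move the outer 'year' row level into the columns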
Melting
- Unpivot a DataFrame from wide format to long format.
res = gap_data.pivot_table(index='continent',columns='year',values='gdpPercap', aggfunc=np.sum).round(1)
res
year |
1952 |
1957 |
1962 |
1967 |
1972 |
1977 |
1982 |
1987 |
1992 |
1997 |
2002 |
2007 |
continent |
|
|
|
|
|
|
|
|
|
|
|
|
Africa |
65133.8 |
72032.3 |
83100.1 |
106618.9 |
121660.0 |
134468.8 |
129042.8 |
118698.8 |
118654.1 |
123695.5 |
135168.0 |
160629.7 |
Americas |
101976.6 |
115401.1 |
122538.5 |
141706.3 |
162283.4 |
183800.2 |
187668.4 |
194835.0 |
201123.4 |
222232.5 |
232191.9 |
275075.8 |
Asia |
171451.0 |
190995.2 |
189069.2 |
197048.7 |
270186.5 |
257113.4 |
245326.5 |
251071.5 |
285109.8 |
324525.1 |
335745.0 |
411609.9 |
Europe |
169831.7 |
208890.4 |
250964.6 |
304314.7 |
374387.3 |
428519.4 |
468536.9 |
516429.3 |
511847.0 |
572303.5 |
651352.0 |
751634.4 |
Oceania |
20596.2 |
23197.0 |
25392.9 |
28990.0 |
32834.7 |
34567.9 |
37109.4 |
40896.1 |
41788.1 |
48048.4 |
53877.6 |
59620.4 |
res.reset_index(inplace=True)
res
year |
continent |
1952 |
1957 |
1962 |
1967 |
1972 |
1977 |
1982 |
1987 |
1992 |
1997 |
2002 |
2007 |
0 |
Africa |
65133.8 |
72032.3 |
83100.1 |
106618.9 |
121660.0 |
134468.8 |
129042.8 |
118698.8 |
118654.1 |
123695.5 |
135168.0 |
160629.7 |
1 |
Americas |
101976.6 |
115401.1 |
122538.5 |
141706.3 |
162283.4 |
183800.2 |
187668.4 |
194835.0 |
201123.4 |
222232.5 |
232191.9 |
275075.8 |
2 |
Asia |
171451.0 |
190995.2 |
189069.2 |
197048.7 |
270186.5 |
257113.4 |
245326.5 |
251071.5 |
285109.8 |
324525.1 |
335745.0 |
411609.9 |
3 |
Europe |
169831.7 |
208890.4 |
250964.6 |
304314.7 |
374387.3 |
428519.4 |
468536.9 |
516429.3 |
511847.0 |
572303.5 |
651352.0 |
751634.4 |
4 |
Oceania |
20596.2 |
23197.0 |
25392.9 |
28990.0 |
32834.7 |
34567.9 |
37109.4 |
40896.1 |
41788.1 |
48048.4 |
53877.6 |
59620.4 |
melted_df = pd.melt(res, id_vars=['continent'])
melted_df
|
continent |
year |
value |
0 |
Africa |
1952 |
65133.8 |
1 |
Americas |
1952 |
101976.6 |
2 |
Asia |
1952 |
171451.0 |
3 |
Europe |
1952 |
169831.7 |
4 |
Oceania |
1952 |
20596.2 |
5 |
Africa |
1957 |
72032.3 |
6 |
Americas |
1957 |
115401.1 |
7 |
Asia |
1957 |
190995.2 |
8 |
Europe |
1957 |
208890.4 |
9 |
Oceania |
1957 |
23197.0 |
10 |
Africa |
1962 |
83100.1 |
11 |
Americas |
1962 |
122538.5 |
12 |
Asia |
1962 |
189069.2 |
13 |
Europe |
1962 |
250964.6 |
14 |
Oceania |
1962 |
25392.9 |
15 |
Africa |
1967 |
106618.9 |
16 |
Americas |
1967 |
141706.3 |
17 |
Asia |
1967 |
197048.7 |
18 |
Europe |
1967 |
304314.7 |
19 |
Oceania |
1967 |
28990.0 |
20 |
Africa |
1972 |
121660.0 |
21 |
Americas |
1972 |
162283.4 |
22 |
Asia |
1972 |
270186.5 |
23 |
Europe |
1972 |
374387.3 |
24 |
Oceania |
1972 |
32834.7 |
25 |
Africa |
1977 |
134468.8 |
26 |
Americas |
1977 |
183800.2 |
27 |
Asia |
1977 |
257113.4 |
28 |
Europe |
1977 |
428519.4 |
29 |
Oceania |
1977 |
34567.9 |
30 |
Africa |
1982 |
129042.8 |
31 |
Americas |
1982 |
187668.4 |
32 |
Asia |
1982 |
245326.5 |
33 |
Europe |
1982 |
468536.9 |
34 |
Oceania |
1982 |
37109.4 |
35 |
Africa |
1987 |
118698.8 |
36 |
Americas |
1987 |
194835.0 |
37 |
Asia |
1987 |
251071.5 |
38 |
Europe |
1987 |
516429.3 |
39 |
Oceania |
1987 |
40896.1 |
40 |
Africa |
1992 |
118654.1 |
41 |
Americas |
1992 |
201123.4 |
42 |
Asia |
1992 |
285109.8 |
43 |
Europe |
1992 |
511847.0 |
44 |
Oceania |
1992 |
41788.1 |
45 |
Africa |
1997 |
123695.5 |
46 |
Americas |
1997 |
222232.5 |
47 |
Asia |
1997 |
324525.1 |
48 |
Europe |
1997 |
572303.5 |
49 |
Oceania |
1997 |
48048.4 |
50 |
Africa |
2002 |
135168.0 |
51 |
Americas |
2002 |
232191.9 |
52 |
Asia |
2002 |
335745.0 |
53 |
Europe |
2002 |
651352.0 |
54 |
Oceania |
2002 |
53877.6 |
55 |
Africa |
2007 |
160629.7 |
56 |
Americas |
2007 |
275075.8 |
57 |
Asia |
2007 |
411609.9 |
58 |
Europe |
2007 |
751634.4 |
59 |
Oceania |
2007 |
59620.4 |
melted_df.sort_values(['continent','year']).round(1)
|
continent |
year |
value |
0 |
Africa |
1952 |
65133.8 |
5 |
Africa |
1957 |
72032.3 |
10 |
Africa |
1962 |
83100.1 |
15 |
Africa |
1967 |
106618.9 |
20 |
Africa |
1972 |
121660.0 |
25 |
Africa |
1977 |
134468.8 |
30 |
Africa |
1982 |
129042.8 |
35 |
Africa |
1987 |
118698.8 |
40 |
Africa |
1992 |
118654.1 |
45 |
Africa |
1997 |
123695.5 |
50 |
Africa |
2002 |
135168.0 |
55 |
Africa |
2007 |
160629.7 |
1 |
Americas |
1952 |
101976.6 |
6 |
Americas |
1957 |
115401.1 |
11 |
Americas |
1962 |
122538.5 |
16 |
Americas |
1967 |
141706.3 |
21 |
Americas |
1972 |
162283.4 |
26 |
Americas |
1977 |
183800.2 |
31 |
Americas |
1982 |
187668.4 |
36 |
Americas |
1987 |
194835.0 |
41 |
Americas |
1992 |
201123.4 |
46 |
Americas |
1997 |
222232.5 |
51 |
Americas |
2002 |
232191.9 |
56 |
Americas |
2007 |
275075.8 |
2 |
Asia |
1952 |
171451.0 |
7 |
Asia |
1957 |
190995.2 |
12 |
Asia |
1962 |
189069.2 |
17 |
Asia |
1967 |
197048.7 |
22 |
Asia |
1972 |
270186.5 |
27 |
Asia |
1977 |
257113.4 |
32 |
Asia |
1982 |
245326.5 |
37 |
Asia |
1987 |
251071.5 |
42 |
Asia |
1992 |
285109.8 |
47 |
Asia |
1997 |
324525.1 |
52 |
Asia |
2002 |
335745.0 |
57 |
Asia |
2007 |
411609.9 |
3 |
Europe |
1952 |
169831.7 |
8 |
Europe |
1957 |
208890.4 |
13 |
Europe |
1962 |
250964.6 |
18 |
Europe |
1967 |
304314.7 |
23 |
Europe |
1972 |
374387.3 |
28 |
Europe |
1977 |
428519.4 |
33 |
Europe |
1982 |
468536.9 |
38 |
Europe |
1987 |
516429.3 |
43 |
Europe |
1992 |
511847.0 |
48 |
Europe |
1997 |
572303.5 |
53 |
Europe |
2002 |
651352.0 |
58 |
Europe |
2007 |
751634.4 |
4 |
Oceania |
1952 |
20596.2 |
9 |
Oceania |
1957 |
23197.0 |
14 |
Oceania |
1962 |
25392.9 |
19 |
Oceania |
1967 |
28990.0 |
24 |
Oceania |
1972 |
32834.7 |
29 |
Oceania |
1977 |
34567.9 |
34 |
Oceania |
1982 |
37109.4 |
39 |
Oceania |
1987 |
40896.1 |
44 |
Oceania |
1992 |
41788.1 |
49 |
Oceania |
1997 |
48048.4 |
54 |
Oceania |
2002 |
53877.6 |
59 |
Oceania |
2007 |
59620.4 |
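- melt can also name the resulting identifier and value columns and restrict which columns are unpivoted; a small sketch on the same res table:
pd.melt(res, id_vars=['continent'], value_vars=[1952, 2007],
        var_name='year', value_name='total_gdpPercap')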
GroupBy
titanic_data = pd.read_csv('../Data/titanic-train.csv.txt', index_col='PassengerId')
titanic_data.groupby(['Pclass','Sex','Survived']).size()
Pclass Sex Survived
1 female 0 3
1 91
male 0 77
1 45
2 female 0 6
1 70
male 0 91
1 17
3 female 0 72
1 72
male 0 300
1 47
dtype: int64
gap_data.groupby(['continent']).lifeExp.mean()
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
gap_data.groupby(['continent','country']).lifeExp.mean().round(1).sort_values()
continent country
Africa Sierra Leone 36.8
Asia Afghanistan 37.5
Africa Angola 37.9
Guinea-Bissau 39.2
Mozambique 40.4
...
Europe Netherlands 75.6
Switzerland 75.6
Norway 75.8
Sweden 76.2
Iceland 76.5
Name: lifeExp, Length: 142, dtype: float64
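- groupby followed by agg computes several statistics per group in one pass; a brief sketch:
gap_data.groupby('continent').agg({'lifeExp': ['mean', 'max'], 'pop': 'sum'})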
Cross Tabulations
- crosstab() is used to compute a cross-tabulation of two (or more) factors. By default, crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.
- crosstab is one of the easiest ways to get quick results compared to pivot_table and other options.
- The question still remains, why even use a crosstab function? The short answer is that it provides a couple of handy functions to more easily format and summarize the data.
- The longer answer is that sometimes it can be tough to remember all the steps to make this happen on your own. The simple crosstab API is the quickest route to the solution and provides some useful shortcuts for certain types of analysis.
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
header=None, names=headers, na_values="?" )
df_raw
|
symboling |
normalized_losses |
make |
fuel_type |
aspiration |
num_doors |
body_style |
drive_wheels |
engine_location |
wheel_base |
... |
engine_size |
fuel_system |
bore |
stroke |
compression_ratio |
horsepower |
peak_rpm |
city_mpg |
highway_mpg |
price |
0 |
3 |
NaN |
alfa-romero |
gas |
std |
two |
convertible |
rwd |
front |
88.6 |
... |
130 |
mpfi |
3.47 |
2.68 |
9.0 |
111.0 |
5000.0 |
21 |
27 |
13495.0 |
1 |
3 |
NaN |
alfa-romero |
gas |
std |
two |
convertible |
rwd |
front |
88.6 |
... |
130 |
mpfi |
3.47 |
2.68 |
9.0 |
111.0 |
5000.0 |
21 |
27 |
16500.0 |
2 |
1 |
NaN |
alfa-romero |
gas |
std |
two |
hatchback |
rwd |
front |
94.5 |
... |
152 |
mpfi |
2.68 |
3.47 |
9.0 |
154.0 |
5000.0 |
19 |
26 |
16500.0 |
3 |
2 |
164.0 |
audi |
gas |
std |
four |
sedan |
fwd |
front |
99.8 |
... |
109 |
mpfi |
3.19 |
3.40 |
10.0 |
102.0 |
5500.0 |
24 |
30 |
13950.0 |
4 |
2 |
164.0 |
audi |
gas |
std |
four |
sedan |
4wd |
front |
99.4 |
... |
136 |
mpfi |
3.19 |
3.40 |
8.0 |
115.0 |
5500.0 |
18 |
22 |
17450.0 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
200 |
-1 |
95.0 |
volvo |
gas |
std |
four |
sedan |
rwd |
front |
109.1 |
... |
141 |
mpfi |
3.78 |
3.15 |
9.5 |
114.0 |
5400.0 |
23 |
28 |
16845.0 |
201 |
-1 |
95.0 |
volvo |
gas |
turbo |
four |
sedan |
rwd |
front |
109.1 |
... |
141 |
mpfi |
3.78 |
3.15 |
8.7 |
160.0 |
5300.0 |
19 |
25 |
19045.0 |
202 |
-1 |
95.0 |
volvo |
gas |
std |
four |
sedan |
rwd |
front |
109.1 |
... |
173 |
mpfi |
3.58 |
2.87 |
8.8 |
134.0 |
5500.0 |
18 |
23 |
21485.0 |
203 |
-1 |
95.0 |
volvo |
diesel |
turbo |
four |
sedan |
rwd |
front |
109.1 |
... |
145 |
idi |
3.01 |
3.40 |
23.0 |
106.0 |
4800.0 |
26 |
27 |
22470.0 |
204 |
-1 |
95.0 |
volvo |
gas |
turbo |
four |
sedan |
rwd |
front |
109.1 |
... |
141 |
mpfi |
3.78 |
3.15 |
9.5 |
114.0 |
5400.0 |
19 |
25 |
22625.0 |
205 rows × 26 columns
pd.crosstab(df_raw.make, df_raw.body_style)
body_style |
convertible |
hardtop |
hatchback |
sedan |
wagon |
make |
|
|
|
|
|
alfa-romero |
2 |
0 |
1 |
0 |
0 |
audi |
0 |
0 |
1 |
5 |
1 |
bmw |
0 |
0 |
0 |
8 |
0 |
chevrolet |
0 |
0 |
2 |
1 |
0 |
dodge |
0 |
0 |
5 |
3 |
1 |
honda |
0 |
0 |
7 |
5 |
1 |
isuzu |
0 |
0 |
1 |
3 |
0 |
jaguar |
0 |
0 |
0 |
3 |
0 |
mazda |
0 |
0 |
10 |
7 |
0 |
mercedes-benz |
1 |
2 |
0 |
4 |
1 |
mercury |
0 |
0 |
1 |
0 |
0 |
mitsubishi |
0 |
0 |
9 |
4 |
0 |
nissan |
0 |
1 |
5 |
9 |
3 |
peugot |
0 |
0 |
0 |
7 |
4 |
plymouth |
0 |
0 |
4 |
2 |
1 |
porsche |
1 |
2 |
2 |
0 |
0 |
renault |
0 |
0 |
1 |
0 |
1 |
saab |
0 |
0 |
3 |
3 |
0 |
subaru |
0 |
0 |
3 |
5 |
4 |
toyota |
1 |
3 |
14 |
10 |
4 |
volkswagen |
1 |
0 |
1 |
9 |
1 |
volvo |
0 |
0 |
0 |
8 |
3 |
pd.crosstab(df_raw.make, df_raw.num_doors, margins=True, margins_name="Total")
num_doors |
four |
two |
Total |
make |
|
|
|
alfa-romero |
0 |
3 |
3 |
audi |
5 |
2 |
7 |
bmw |
5 |
3 |
8 |
chevrolet |
1 |
2 |
3 |
dodge |
4 |
4 |
8 |
honda |
5 |
8 |
13 |
isuzu |
2 |
2 |
4 |
jaguar |
2 |
1 |
3 |
mazda |
7 |
9 |
16 |
mercedes-benz |
5 |
3 |
8 |
mercury |
0 |
1 |
1 |
mitsubishi |
4 |
9 |
13 |
nissan |
9 |
9 |
18 |
peugot |
11 |
0 |
11 |
plymouth |
4 |
3 |
7 |
porsche |
0 |
5 |
5 |
renault |
1 |
1 |
2 |
saab |
3 |
3 |
6 |
subaru |
9 |
3 |
12 |
toyota |
18 |
14 |
32 |
volkswagen |
8 |
4 |
12 |
volvo |
11 |
0 |
11 |
Total |
114 |
89 |
203 |
pd.crosstab(df_raw.make, df_raw.body_style, values=df_raw.price, aggfunc='mean').round(1)
body_style |
convertible |
hardtop |
hatchback |
sedan |
wagon |
make |
|
|
|
|
|
alfa-romero |
14997.5 |
NaN |
16500.0 |
NaN |
NaN |
audi |
NaN |
NaN |
NaN |
17647.0 |
18920.0 |
bmw |
NaN |
NaN |
NaN |
26118.8 |
NaN |
chevrolet |
NaN |
NaN |
5723.0 |
6575.0 |
NaN |
dodge |
NaN |
NaN |
7819.8 |
7619.7 |
8921.0 |
honda |
NaN |
NaN |
7054.4 |
9945.0 |
7295.0 |
isuzu |
NaN |
NaN |
11048.0 |
6785.0 |
NaN |
jaguar |
NaN |
NaN |
NaN |
34600.0 |
NaN |
mazda |
NaN |
NaN |
10085.0 |
11464.1 |
NaN |
mercedes-benz |
35056.0 |
36788.0 |
NaN |
33074.0 |
28248.0 |
mercury |
NaN |
NaN |
16503.0 |
NaN |
NaN |
mitsubishi |
NaN |
NaN |
9597.9 |
8434.0 |
NaN |
nissan |
NaN |
8249.0 |
14409.0 |
8604.6 |
9915.7 |
peugot |
NaN |
NaN |
NaN |
15758.6 |
15017.5 |
plymouth |
NaN |
NaN |
8130.5 |
7150.5 |
8921.0 |
porsche |
37028.0 |
33278.0 |
22018.0 |
NaN |
NaN |
renault |
NaN |
NaN |
9895.0 |
NaN |
9295.0 |
saab |
NaN |
NaN |
15013.3 |
15433.3 |
NaN |
subaru |
NaN |
NaN |
6591.3 |
9070.6 |
9342.0 |
toyota |
17669.0 |
9762.3 |
9616.0 |
9542.2 |
9836.0 |
volkswagen |
11595.0 |
NaN |
9980.0 |
9673.9 |
12290.0 |
volvo |
NaN |
NaN |
NaN |
18726.9 |
16293.3 |
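- crosstab can also normalize the counts, turning the frequency table into proportions; a short sketch:
pd.crosstab(df_raw.make, df_raw.body_style, normalize='index').round(2)  # each row sums to 1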
Tiling
- Transforms continuous values into discrete bins
titanic_data.fillna({'Age':29}, inplace=True)
titanic_data.head()
|
Survived |
Pclass |
Name |
Sex |
Age |
SibSp |
Parch |
Ticket |
Fare |
Cabin |
Embarked |
PassengerId |
|
|
|
|
|
|
|
|
|
|
|
1 |
0 |
3 |
Braund, Mr. Owen Harris |
male |
22.0 |
1 |
0 |
A/5 21171 |
7.2500 |
NaN |
S |