NumPy/pandas basic operations

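NumPy is an extension library for Python. It supports large multi-dimensional arrays and matrices, and it provides a large collection of mathematical functions that operate on those arrays. Many NumPy operations run in compiled code and release Python's GIL (global interpreter lock) while they execute, so they are fast. NumPy is a foundation library for many machine learning frameworks.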

pandas is an open-source Python library for data analysis, built on top of NumPy.

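For machine learning work you can simply install the Anaconda distribution (an open-source Python distribution that bundles conda, Python, and more than 180 scientific packages together with their dependencies).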

Import the packages:

import numpy as np
import pandas as pd

Basic operations:

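# Read the Iris dataset. The header parameter gives the row that holds the column names.
# It defaults to 0; pass None if the file has no header row.
data = pd.read_csv(r"Iris.csv", header=0)
# Show the first n records. n defaults to 5.
# data.head()
# Show the last n records. n defaults to 5.
# data.tail()
# Draw a random sample. One row is drawn by default; pass n to draw more.
# display() is available in Jupyter/IPython; use print() in a plain script.
display(data.sample())
# Map the class labels (text) to numeric values.
data["Species"] = data["Species"].map({"Iris-virginica": 0, "Iris-setosa": 1, "Iris-versicolor": 2})
# Drop the Id column, which is not needed.
data.drop("Id", axis=1, inplace=True)
# Check whether the dataset contains any duplicate records.
# data.duplicated().any()
# Check the number of records in the dataset.
# len(data)
# Remove duplicate records.
data.drop_duplicates(inplace=True)
# len(data)
# Count how many records each iris class has.
data["Species"].value_counts()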
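
If Iris.csv is not at hand, the same steps can be tried on a tiny in-memory frame. This is only a sketch: the three rows below are made up, and only the column names Id and Species come from the code above (SepalLengthCm is an assumed name).

```python
import pandas as pd

# Hypothetical stand-in for Iris.csv; the values are invented for illustration.
data = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 5.1, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Map the class labels to numbers, as in the notes above.
data["Species"] = data["Species"].map(
    {"Iris-virginica": 0, "Iris-setosa": 1, "Iris-versicolor": 2}
)

# Drop the Id column; without it the first two rows become identical.
data = data.drop("Id", axis=1)
had_duplicates = data.duplicated().any()

# Remove the duplicate row and count the records per class.
data = data.drop_duplicates()
counts = data["Species"].value_counts()
print(had_duplicates, len(data))
print(counts)
```

Assigning the result back (data = data.drop(...)) is used here instead of inplace=True; both styles leave data in the same state.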

pandas.DataFrame:

Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
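
A minimal sketch of two points in this description: columns behave like dict entries holding Series, and arithmetic aligns on labels. The frames, labels, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames whose row and column labels only partly overlap.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["y", "z"])

# Each column is a Series, looked up by name like a dict entry.
col = left["a"]
print(type(col).__name__)

# Addition aligns on both row and column labels. Only the cell (y, b)
# exists in both frames; every other cell of the result is NaN.
total = left + right
print(total)
```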


pandas.DataFrame.at/iat/loc/iloc:

at and loc select a single value or a subset of the data by label.
iat and iloc select a single value or a subset of the data by integer position.

DataFrame.at:Access a single value for a row/column label pair.
DataFrame.iat:Access a single value for a row/column pair by integer position.
DataFrame.loc:Access a group of rows and columns by label(s) or a boolean array.
DataFrame.iloc:Purely integer-location based indexing for selection by position.
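
The four accessors can be compared side by side. The sketch below uses a made-up frame; the row labels r0..r2 and the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"length": [5.1, 4.9, 6.3], "width": [3.5, 3.0, 3.3]},
    index=["r0", "r1", "r2"],
)

single_by_label = df.at["r1", "width"]           # one value, by label
single_by_pos = df.iat[1, 1]                     # the same cell, by integer position
rows_by_label = df.loc[["r0", "r2"], "length"]   # several rows, by label
rows_by_pos = df.iloc[0:2, 0]                    # by position; the end of the slice is excluded
mask_rows = df.loc[df["length"] > 5.0]           # loc also accepts a boolean mask

print(single_by_label, single_by_pos)
print(rows_by_label)
print(rows_by_pos)
print(mask_rows)
```

at and iat only ever return one scalar, which is why they are the usual choice inside a loop that touches individual cells.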

columnsNum=len(data.columns)
i=0
while(i

pandas.DataFrame.any/all:

DataFrame.any:Return whether any element is True over requested axis.
DataFrame.all:Return whether all elements are True.
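
A short sketch of how the axis argument changes the result, using a made-up boolean frame. The last line mirrors the data.duplicated().any() check from the basic operations above, where any() is called on a Series.

```python
import pandas as pd

# Hypothetical boolean frame.
df = pd.DataFrame({"a": [True, False], "b": [True, True]})

any_per_column = df.any()      # default axis=0: one result per column
all_per_column = df.all()      # default axis=0: one result per column
all_per_row = df.all(axis=1)   # one result per row

# duplicated() returns a boolean Series; any() reduces it to a single flag.
has_duplicates = pd.DataFrame({"v": [1, 1, 2]}).duplicated().any()

print(any_per_column)
print(all_per_column)
print(all_per_row)
print(has_duplicates)
```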


numpy.asarray/asmatrix:

numpy.asarray:Convert the input to an array.
numpy.asmatrix:Interpret the input as a matrix.
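
The practical difference shows up in the * operator. The sketch below uses a made-up 2x2 input. Note that the NumPy documentation no longer recommends the matrix class for new code, so asmatrix is shown here only to illustrate the notes.

```python
import numpy as np

values = [[1, 2], [3, 4]]

arr = np.asarray(values)   # list -> ndarray
same = np.asarray(arr)     # input is already an ndarray, so no copy is made
mat = np.asmatrix(arr)     # matrix that shares memory with arr

print(same is arr)
print(arr * arr)   # ndarray: element-wise product
print(mat * mat)   # matrix: matrix product
```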


numpy.reshape:

Gives a new shape to an array without changing its data.
One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
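
A minimal sketch of the -1 placeholder, using a made-up array of 12 elements:

```python
import numpy as np

a = np.arange(12)            # 12 elements: 0..11

b = a.reshape(3, -1)         # -1 is inferred as 4, because 12 / 3 = 4
c = np.reshape(a, (-1, 6))   # -1 is inferred as 2, because 12 / 6 = 2

print(b.shape, c.shape)
# Reshaping this contiguous array returns a view: the data is shared, not copied.
print(np.shares_memory(a, b))
```

Only one dimension may be -1; the remaining dimensions must divide the total number of elements exactly, otherwise reshape raises a ValueError.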


ndarray/array:

What is the difference between ndarray and array in Numpy? And where can I find the implementations in the numpy source code?
Well, np.array is just a convenience function to create an ndarray, it is not a class itself.
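
This can be checked directly in an interpreter. A small sketch:

```python
import numpy as np

a = np.array([1, 2, 3])

created_type_is_ndarray = type(a) is np.ndarray      # np.array() returns an ndarray
ndarray_is_class = isinstance(np.ndarray, type)      # ndarray is a class
array_is_class = isinstance(np.array, type)          # array is a function, not a class

print(created_type_is_ndarray, ndarray_is_class, array_is_class)
```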


ndarray.T/matrix.I:

ndarray.T:Same as self.transpose(), except that self is returned if self.ndim < 2.
matrix.I:Returns the (multiplicative) inverse of invertible self.

>>> m = np.matrix('[1, 2; 3, 4]'); m
matrix([[1, 2],
        [3, 4]])
>>> m.getI()
matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])
>>> m.getI() * m
matrix([[ 1.,  0.],
        [ 0.,  1.]])
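
The example above covers .I only. For .T, a small sketch with made-up arrays shows that transposing has no visible effect on a 1-D array and swaps the axes of a 2-D array:

```python
import numpy as np

v = np.array([1, 2, 3])                 # 1-D array
a = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array with shape (2, 3)

print(v.T.shape)   # unchanged for a 1-D array
print(a.T.shape)   # axes swapped for a 2-D array
print(a.T)
```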

numpy.ravel:

Return a contiguous flattened array.

>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> print(np.ravel(x))
[1 2 3 4 5 6]

>>> print(x.reshape(-1))
[1 2 3 4 5 6]

pandas.DataFrame.insert:

Insert column into DataFrame at specified location.
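
A minimal sketch with a made-up frame. The first argument of insert is the integer position of the new column; the call modifies the frame in place and returns None.

```python
import pandas as pd

# Hypothetical frame with columns "a" and "c".
df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Insert column "b" at position 1, between "a" and "c".
result = df.insert(1, "b", [3, 4])

print(df)
print(result)
```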


pandas.Series.reindex/pandas.DataFrame.reindex:

pandas.Series.reindex:
Conform Series to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
pandas.DataFrame.reindex:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

pandas.DataFrame.reindex (index and columns):
>>> df.reindex(index=[date1, date2, date3], columns=['A', 'B', 'C'])

pandas.DataFrame.reindex (columns only):
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
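
The calls above all operate on a DataFrame named df. For a Series the behaviour is the same along its single index. A sketch with made-up labels and values:

```python
import pandas as pd

# Hypothetical Series with three labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# "d" does not exist in the old index, so it is filled with NaN.
r = s.reindex(["c", "a", "d"])

# fill_value replaces the default NaN for missing labels.
filled = s.reindex(["c", "a", "d"], fill_value=0)

print(r)
print(filled)
```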

numpy.argmax:

Returns the indices of the maximum values along an axis.

>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])

numpy.random.randint:

Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high). If high is None (the default), then results are from [0, low).

>>> np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
>>> np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

>>> np.random.randint(5, size=(2, 4))
array([[4, 0, 2, 1],
       [3, 2, 2, 0]])
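
The examples above pass only one bound. When both low and high are given, high is excluded, which is easy to get wrong. A sketch simulating die rolls (the sample size 1000 is arbitrary):

```python
import numpy as np

# Integers are drawn from [1, 7), so the possible values are 1..6.
rolls = np.random.randint(1, 7, size=1000)

print(rolls.min(), rolls.max())
print(rolls[:10])
```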
