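NumPy is an extension library for the Python language. It supports large multi-dimensional arrays and matrix operations, and it provides a large collection of mathematical functions for operating on arrays. Many NumPy routines release Python's GIL (global interpreter lock) internally and run in compiled code, so they are very efficient, which is why NumPy serves as the foundation for many machine learning frameworks.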
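Pandas is an open-source Python library for data analysis, built on top of NumPy.
For machine learning work you can simply install the Anaconda distribution (an open-source Python distribution that bundles conda, Python, and more than 180 scientific packages with their dependencies).
Importing the packages: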
import numpy as np
import pandas as pd
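Basic operations:
# Read the Iris dataset. The header parameter gives the row to use as column names.
# It defaults to "infer", which behaves like 0 when no names are passed. If the file has no header row, pass None.
data = pd.read_csv(r"Iris.csv", header=0)
# Show the first n rows; n defaults to 5.
# data.head()
# Show the last n rows; n defaults to 5.
# data.tail()
# Draw a random sample. One row is drawn by default; the n parameter sets how many rows to draw.
# display() is provided by IPython/Jupyter; use print() in a plain script.
display(data.sample())
# Map the class labels (text) to numeric values.
data["Species"] = data["Species"].map({"Iris-virginica": 0, "Iris-setosa": 1, "Iris-versicolor": 2})
# Drop the Id column, which is not needed.
data.drop("Id", axis=1, inplace=True)
# Check whether the dataset contains any duplicate rows.
# data.duplicated().any()
# Check the number of records in the dataset.
# len(data)
# Remove duplicate records.
data.drop_duplicates(inplace=True)
# len(data)
# Count how many records each Iris species has.
data["Species"].value_counts()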
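The steps above can be tried without the real file. The following is a minimal sketch that assumes a tiny, made-up stand-in for Iris.csv (the column names and the five rows are hypothetical). It uses non-inplace calls purely as a stylistic choice.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for Iris.csv; the values are made up for illustration.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
4,6.3,3.3,6.0,2.5,Iris-virginica
5,5.1,3.5,1.4,0.2,Iris-setosa
"""

data = pd.read_csv(io.StringIO(csv_text), header=0)

# Map the class labels to integers, as in the notes above.
data["Species"] = data["Species"].map(
    {"Iris-virginica": 0, "Iris-setosa": 1, "Iris-versicolor": 2}
)

# Drop the Id column; rows 1 and 5 become identical once Id is gone.
data = data.drop(columns="Id")

has_duplicates = bool(data.duplicated().any())
data = data.drop_duplicates()

# Number of records per class after de-duplication.
counts = data["Species"].value_counts()
print(has_duplicates, len(data))
print(counts)
```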
pandas.DataFrame:
Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
pandas.DataFrame.at/iat/loc/iloc:
at and loc select a single value or a subset of the data by label.
iat and iloc select a single value or a subset of the data by integer position.
DataFrame.at:Access a single value for a row/column label pair.
DataFrame.iat:Access a single value for a row/column pair by integer position.
DataFrame.loc:Access a group of rows and columns by label(s).
DataFrame.iloc:Access a group of rows and columns by integer position(s) or a boolean array.
columnsNum=len(data.columns)
i=0
while(i
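The four accessors described in this section can be compared on a small frame. This is only an illustrative sketch; the frame `df` and its labels are made up for the example.

```python
import pandas as pd

# Hypothetical frame with string row labels so that label and position differ.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Single value by label pair (row label, column label).
v_at = df.at["y", "b"]

# Single value by integer position (row 1, column 1).
v_iat = df.iat[1, 1]

# Group of rows/columns by label; a list of labels selects several rows.
rows_loc = df.loc[["x", "z"], "a"]

# Group of rows/columns by integer position; the slice end is exclusive.
rows_iloc = df.iloc[0:2, 0]

# loc also accepts a boolean mask.
mask_loc = df.loc[df["a"] > 1]

print(v_at, v_iat)
print(rows_loc.tolist(), rows_iloc.tolist())
print(list(mask_loc.index))
```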
pandas.DataFrame.any/all:
DataFrame.any:Return whether any element is True over requested axis.
DataFrame.all:Return whether all elements are True.
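As a small illustration of how the axis argument changes the result, here is a sketch on a made-up boolean frame. By default both methods reduce over the rows and return one value per column.

```python
import pandas as pd

# Hypothetical boolean frame for illustration.
df = pd.DataFrame({"a": [True, False], "b": [True, True]})

any_by_col = df.any()          # one result per column
all_by_col = df.all()          # one result per column
all_by_row = df.all(axis=1)    # one result per row

# axis=None reduces over the whole frame to a single value.
all_overall = bool(df.all(axis=None))
any_overall = bool(df.any(axis=None))

print(any_by_col.tolist(), all_by_col.tolist(), all_by_row.tolist())
print(all_overall, any_overall)
```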
numpy.asarray/asmatrix:
numpy.asarray:Convert the input to an array.
numpy.asmatrix:Interpret the input as a matrix.
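The practical difference shows up in two places: neither function copies an input that is already an ndarray, and `*` means different things on the two types. A minimal sketch with made-up values:

```python
import numpy as np

data = [[1, 2], [3, 4]]

arr = np.asarray(data)        # list -> new ndarray
same = np.asarray(arr)        # already an ndarray -> returned as-is, no copy
mat = np.asmatrix(arr)        # matrix view on the same memory

elementwise = arr * arr       # ndarray: element-by-element product
matrix_product = mat * mat    # matrix: true matrix multiplication

# Because mat is a view, a change to arr is visible through mat.
arr[0, 0] = 99
shared_value = mat[0, 0]

print(elementwise.tolist())
print(matrix_product.tolist())
print(shared_value)
```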
numpy.reshape:
Gives a new shape to an array without changing its data.
One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
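A short sketch of the -1 rule: the inferred dimension is the total number of elements divided by the product of the other dimensions. The array below is just an example.

```python
import numpy as np

a = np.arange(12)               # 12 elements

b = a.reshape(3, 4)             # explicit shape
c = a.reshape(-1, 6)            # -1 is inferred as 12 / 6 = 2
d = np.reshape(a, (2, -1, 3))   # -1 is inferred as 12 / (2 * 3) = 2

print(b.shape, c.shape, d.shape)

# The data is unchanged; for a contiguous array, reshape returns a view.
print(np.shares_memory(a, b))
```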
ndarray/array:
What is the difference between ndarray and array in Numpy? And where can I find the implementations in the numpy source code?
Well, np.array is just a convenience function to create an ndarray, it is not a class itself.
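The answer quoted above can be checked directly in an interpreter. A minimal sketch:

```python
import numpy as np

a = np.array([1, 2, 3])

# The object returned by np.array is an instance of the ndarray class.
is_ndarray = type(a) is np.ndarray

# np.ndarray is a class; np.array is a callable function, not a class.
ndarray_is_class = isinstance(np.ndarray, type)
array_is_class = isinstance(np.array, type)
array_is_callable = callable(np.array)

print(is_ndarray, ndarray_is_class, array_is_class, array_is_callable)
```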
numpy.ndarray.T/numpy.matrix.I:
ndarray.T:Same as self.transpose(), except that self is returned if self.ndim < 2.
matrix.I:Returns the (multiplicative) inverse of invertible self. This attribute exists only on np.matrix, not on a plain ndarray.
>>> m = np.matrix('[1, 2; 3, 4]'); m
matrix([[1, 2],
        [3, 4]])
>>> m.getI()
matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])
>>> m.getI() * m
matrix([[ 1.,  0.],
        [ 0.,  1.]])
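For plain ndarrays, the same result can be obtained with `np.linalg.inv` and the `@` operator. This is an alternative sketch using the same 2x2 values as the matrix example, not something taken from the notes themselves.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

a_t = a.T                   # transpose: rows and columns swapped
a_inv = np.linalg.inv(a)    # inverse of an invertible square ndarray
identity = a_inv @ a        # @ is matrix multiplication for ndarrays

# Transposing a 1-D array leaves its shape unchanged.
v = np.array([1, 2, 3])
v_t_shape = v.T.shape

print(a_t.tolist())
print(np.allclose(identity, np.eye(2)))
print(v_t_shape)
```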
numpy.ravel:
Return a contiguous flattened array.
>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> print(np.ravel(x))
[1 2 3 4 5 6]
>>> print(x.reshape(-1))
[1 2 3 4 5 6]
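ravel copies the data only when it has to. The sketch below illustrates this with `np.shares_memory`: a C-contiguous array is flattened as a view, while its transpose needs a copy.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

flat = x.ravel()        # contiguous input: no copy needed
flat_t = x.T.ravel()    # transposed input is not C-contiguous: a copy is made

print(flat.tolist(), np.shares_memory(x, flat))
print(flat_t.tolist(), np.shares_memory(x, flat_t))
```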
pandas.DataFrame.insert:
Insert column into DataFrame at specified location.
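A minimal sketch of insert on a made-up frame. Note that the method modifies the frame in place and returns None, and that it refuses to insert a column name that already exists unless allow_duplicates=True is passed.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Insert column "b" at position 1, between "a" and "c".
result = df.insert(1, "b", [3, 4])

# Inserting an existing name raises ValueError by default.
try:
    df.insert(0, "a", [0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(list(df.columns), result, duplicate_rejected)
```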
pandas.Series.reindex/pandas.DataFrame.reindex:
pandas.Series.reindex:
Conform Series to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
pandas.DataFrame.reindex:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
pandas.Series.reindex:
>>> df.reindex(index=[date1, date2, date3], columns=['A', 'B', 'C'])
pandas.DataFrame.reindex:
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
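The filling behaviour can be seen on a small Series and frame. This is a sketch with made-up data; the label "d" and the column "user_agent" do not exist in the originals, so they are filled with NaN unless fill_value is given.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Reorder and add a label that was not in the old index.
r = s.reindex(["c", "a", "d"])                   # "d" becomes NaN
r_filled = s.reindex(["c", "a", "d"], fill_value=0)

# Hypothetical frame with a single existing column.
df = pd.DataFrame({"http_status": [200, 404]}, index=["Firefox", "Safari"])
df2 = df.reindex(columns=["http_status", "user_agent"])

print(r)
print(r_filled.tolist())
print(df2)
```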
numpy.argmax:
Returns the indices of the maximum values along an axis.
>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])
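Without an axis, argmax returns an index into the flattened array. A common follow-up, sketched here with `np.unravel_index`, is converting that flat index back into a (row, column) pair. When the maximum occurs more than once, the first occurrence is returned.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_index = np.argmax(a)                           # index into the flattened array
row, col = np.unravel_index(flat_index, a.shape)    # back to 2-D coordinates

# With ties, argmax reports the first position of the maximum.
first_of_ties = np.argmax(np.array([3, 7, 7, 1]))

print(flat_index, (int(row), int(col)), first_of_ties)
```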
numpy.random.randint:
Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high). If high is None (the default), then results are from [0, low).
>>> np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
>>> np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> np.random.randint(5, size=(2, 4))
array([[4, 0, 2, 1],
       [3, 2, 2, 0]])
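The half-open interval means the upper bound is never drawn. A minimal sketch, assuming a die-roll use case: pass low=1 and high=7 to get values 1 through 6, and call `np.random.seed` beforehand when the draws need to be reproducible.

```python
import numpy as np

np.random.seed(0)
rolls = np.random.randint(1, 7, size=1000)    # values from [1, 7), i.e. 1..6

# Re-seeding with the same value reproduces the same sequence.
np.random.seed(0)
rolls_again = np.random.randint(1, 7, size=1000)

# With a single argument the interval is [0, low); low=1 can only yield 0.
zeros = np.random.randint(1, size=5)

print(rolls.min(), rolls.max())
print(np.array_equal(rolls, rolls_again))
print(zeros.tolist())
```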