Python Miscellaneous Notes

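1. The difference between ".fit_transform" and ".transform"
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# The three calls below are alternatives shown side by side, not a sequence.
# Learn the label-encoding rule from data_y and return data_y encoded with it
data_y = le.fit_transform(data_y)
# Only learn the label-encoding rule from data_y (nothing is encoded yet)
le.fit(data_y)
# Encode data_y with the previously learned rule and assign the encoded labels back to data_y
data_y = le.transform(data_y)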
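A minimal sketch of the difference, using small made-up label lists (train_labels and test_labels are illustrative and not from the original notes): fit_transform learns the mapping and encodes in one step, while transform reuses a mapping that has already been learned.

```python
from sklearn import preprocessing

# Hypothetical data, for illustration only.
train_labels = ["cat", "dog", "bird", "dog"]
test_labels = ["dog", "bird"]

le = preprocessing.LabelEncoder()

# fit_transform: learn the label -> integer mapping and encode in one call.
train_encoded = le.fit_transform(train_labels)

# transform: encode new data with the mapping learned above (no re-fitting).
test_encoded = le.transform(test_labels)

classes = list(le.classes_)          # classes are stored in sorted order
print(classes)                       # ['bird', 'cat', 'dog']
print(train_encoded.tolist())        # [1, 2, 0, 2]
print(test_encoded.tolist())         # [2, 0]
```

Calling transform with a label that was not seen during fitting raises an error, which is why the mapping is normally learned once on the training labels and then reused.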

2. Printing a cross-tabulation
import pandas as pd
print(pd.crosstab(data['FAULT_TYPE_3'], data['ORG_NO_5'], margins=True))
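To make the call concrete, here is a small sketch on made-up data. The column names are reused from the line above, but the values are invented for illustration. With margins=True, crosstab appends a row and a column named "All" holding the totals.

```python
import pandas as pd

# Hypothetical values; only the column names come from the notes.
data = pd.DataFrame({
    "FAULT_TYPE_3": ["a", "a", "b", "b", "b"],
    "ORG_NO_5": ["x", "y", "x", "x", "y"],
})

# Count how often each (fault type, org) pair occurs, plus row/column totals.
table = pd.crosstab(data["FAULT_TYPE_3"], data["ORG_NO_5"], margins=True)
print(table)

b_x_count = int(table.loc["b", "x"])        # occurrences of ("b", "x")
grand_total = int(table.loc["All", "All"])  # total number of rows in data
```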


3. Generating random matrices
>>> from numpy import random
>>> data = random.random(size=(5, 4))
>>> data
array([[ 8.83326804e-01, 4.62247133e-01, 7.00437565e-04,
6.06600334e-02],
[ 9.76011953e-01, 9.28506787e-01, 6.00816917e-01,
3.81064458e-01],
[ 9.46751253e-01, 4.25659552e-01, 3.25210318e-01,
7.47624195e-01],
[ 6.71764806e-01, 2.65358764e-01, 1.84557967e-01,
4.33813712e-01],
[ 6.02910969e-01, 3.82080865e-01, 6.20733312e-01,
8.27651438e-01]])
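The random function takes a single argument, size: a tuple giving the shape of the matrix to generate. The code above returns a random matrix with five rows and four columns whose values lie in the half-open interval [0, 1); the result is a numpy.ndarray. Besides random, there is also randint, which generates matrices of random integers. Its upper bound is exclusive, so randint(1, 100, ...) produces values from 1 to 99.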
>>> data = random.randint(1, 100, size=(5, 4))
>>> data
array([[95, 53, 98, 55],
[94, 93, 44, 62],
[52, 47, 42, 13],
[97, 74, 50, 34],
[53, 4, 25, 11]])
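The outputs above differ on every run. A short sketch that checks the shape and value ranges instead of exact values (the fixed seed is an assumption added here for reproducibility, not part of the original notes):

```python
from numpy import random

random.seed(0)  # assumption: seed fixed only so repeated runs match

floats = random.random(size=(5, 4))          # floats in [0, 1)
ints = random.randint(1, 100, size=(5, 4))   # integers in [1, 100), i.e. 1..99

print(floats.shape, floats.min() >= 0.0, floats.max() < 1.0)
print(ints.shape, ints.min() >= 1, ints.max() <= 99)
```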

4. Converting a matrix into a DataFrame
>>> from pandas import DataFrame
>>> df = DataFrame(data,index=['one','two','three','four','five'],
columns=['year','state','pop','debt'])
>>> df
       year  state  pop  debt
one      95     53   98    55
two      94     93   44    62
three    52     47   42    13
four     97     74   50    34
five     53      4   25    11


5. Indexing and slicing
Note: the index of a pandas object is not limited to integers.
Series
>>> df['year']
one 95
two 94
three 52
four 97
five 53
Name: year, dtype: int32
① Slicing with integers: positions start at 0 and the right endpoint is excluded
>>> df['year'][2:4]
three 52
four 97
Name: year, dtype: int32
② Slicing with non-integer labels: the end label is included
>>> df['year']['two':'four']
two 94
three 52
four 97
Name: year, dtype: int32
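The two behaviours can also be written with explicit indexers, which makes the intent unambiguous: .iloc for positions (right endpoint excluded) and .loc for labels (end label included). A sketch that rebuilds the same data as in section 4:

```python
import pandas as pd

df = pd.DataFrame(
    [[95, 53, 98, 55], [94, 93, 44, 62], [52, 47, 42, 13],
     [97, 74, 50, 34], [53, 4, 25, 11]],
    index=["one", "two", "three", "four", "five"],
    columns=["year", "state", "pop", "debt"],
)
year = df["year"]

by_position = year.iloc[2:4]          # positions 2 and 3; position 4 is excluded
by_label = year.loc["two":"four"]     # labels two..four; "four" is included

print(by_position.tolist())  # [52, 97]
print(by_label.tolist())     # [94, 52, 97]
```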
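DataFrame
In the pandas version these notes were written against, the standard slicing syntax for a DataFrame was .ix[::, ::]. The indexer accepts two slices, one along the rows (axis=0) and one along the columns (axis=1). Note that .ix was deprecated in pandas 0.20 and removed in 1.0; on current versions use .iloc for positional slices (as here) or .loc for label slices.
>>> df.iloc[2:4, 0:3]
       year  state  pop
three    52     47   42
four     97     74   50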
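On current pandas versions, where .ix is no longer available, the same two-axis slice can be written with .iloc (positions) or .loc (labels). A sketch, rebuilding the df from section 4:

```python
import pandas as pd

df = pd.DataFrame(
    [[95, 53, 98, 55], [94, 93, 44, 62], [52, 47, 42, 13],
     [97, 74, 50, 34], [53, 4, 25, 11]],
    index=["one", "two", "three", "four", "five"],
    columns=["year", "state", "pop", "debt"],
)

# Rows at positions 2..3 and columns at positions 0..2 (right ends excluded).
by_position = df.iloc[2:4, 0:3]

# The same block selected by labels (both end labels included).
by_label = df.loc["three":"four", "year":"pop"]

print(by_position)
print(by_position.equals(by_label))  # True
```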
Slicing a DataFrame directly, without an indexer, is a special case:
  • Indexing (a single label) selects a column
  • Slicing selects rows
Indexing
>>> df['year']
one 95
two 94
three 52
four 97
five 53
Name: year, dtype: int32
Slicing
>>> df[2:4]
       year  state  pop  debt
three    52     47   42    13
four     97     74   50    34
>>> df['two':'four']
       year  state  pop  debt
two      94     93   44    62
three    52     47   42    13
four     97     74   50    34
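The asymmetry described above (a single label picks a column, a slice picks rows) can be checked directly. A sketch using the same data as section 4:

```python
import pandas as pd

df = pd.DataFrame(
    [[95, 53, 98, 55], [94, 93, 44, 62], [52, 47, 42, 13],
     [97, 74, 50, 34], [53, 4, 25, 11]],
    index=["one", "two", "three", "four", "five"],
    columns=["year", "state", "pop", "debt"],
)

column = df["year"]               # single label: selects a column (a Series)
rows_by_pos = df[2:4]             # integer slice: selects rows by position
rows_by_label = df["two":"four"]  # label slice: selects rows, end label included

print(type(column).__name__)      # Series
print(list(rows_by_pos.index))    # ['three', 'four']
print(list(rows_by_label.index))  # ['two', 'three', 'four']
```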

6. One-hot encoding categorical attributes with pandas' get_dummies
The source is shown below (the trailing # comments are my own annotations):
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
columns=None, sparse=False):
"""
Convert categorical variable into dummy/indicator variables

Parameters
----------
data : array-like, Series, or DataFrame # the dataset
prefix : string, list of strings, or dict of strings, default None # prefix added to the encoded column names; default None. You can use one prefix for everything, e.g. prefix='col', or one per original column, e.g. prefix=['colA','colB']
String to append DataFrame column names
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternativly, `prefix`
can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_' # separator between the prefix and the category value in each encoded column name; default '_', can be set to something else
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix.`
dummy_na : bool, default False # boolean: whether to add a column that marks NaN entries; default False
Add a column to indicate NaNs, if False NaNs are ignored.
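columns : list-like, default None # the columns to one-hot encode; default None. My understanding: with the default, every categorical (object/category) column is encoded, whereas columns restricts encoding to the listed columns; prefix only affects how the new columns are named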
Column names in the DataFrame to be encoded.
If `columns` is None then all the columns with
`object` or `category` dtype will be converted.
sparse : bool, default False # boolean: whether to return a sparse DataFrame; default False
Whether the returned DataFrame should be sparse or not.

.. versionadded:: 0.16.1

Returns
-------
dummies : DataFrame

Examples
--------
>>> import pandas as pd
>>> s = pd.Series(list('abca'))

>>> get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0

>>> s1 = ['a', 'b', np.nan]

>>> get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0

>>> get_dummies(s1, dummy_na=True)
a b NaN
0 1 0 0
1 0 1 0
2 0 0 1

>>> df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})

>>> get_dummies(df, prefix=['col1', 'col2']):
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1

See also ``Series.str.get_dummies``.

"""
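PS: a few experiments with the columns parameter, using the df from the docstring example. Recent pandas versions require a list-like for columns, so the single-column case is written as ['A'].
>>> pd.get_dummies(df, prefix_sep='.', columns=['A'])
   B  C  A.a  A.b
0  b  1    1    0
1  a  2    0    1
2  c  3    1    0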
>>> pd.get_dummies(df, prefix_sep='.',columns='A','B')
SyntaxError: non-keyword arg after keyword arg
>>> pd.get_dummies(df, prefix_sep='.', columns=['A', 'B'])
   C  A.a  A.b  B.a  B.b  B.c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1
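A self-contained sketch of the parameters discussed above. It passes dtype=int, an addition here that is not in the original notes, so that the dummy columns hold 0/1 regardless of the pandas version's default dummy dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"], "C": [1, 2, 3]})

# Encode only column A; untouched columns come first, then the dummy columns.
only_a = pd.get_dummies(df, columns=["A"], prefix_sep=".", dtype=int)
print(list(only_a.columns))   # ['B', 'C', 'A.a', 'A.b']

# Encode A and B with custom prefixes, one prefix per encoded column.
both = pd.get_dummies(df, columns=["A", "B"], prefix=["col1", "col2"], dtype=int)
print(list(both.columns))

# dummy_na=True adds an indicator column for missing values.
with_nan = pd.get_dummies(pd.Series(["a", "b", np.nan]), dummy_na=True, dtype=int)
print(with_nan.values.tolist())   # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```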

