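1, Read CSV
training_raw = pd.read_csv('dataset/adult.data', header=None, names=headers, sep=r',\s', na_values=["?"], engine='python')
sep is the delimiter (here a regex, which requires engine='python'); na_values=["?"] makes "?" read as NaN
dtype={'onpromotion': bool}, sets the dtype of a column
converters={'unit_sales': lambda u: np.log1p(float(u)) if float(u) > 0 else 0}, transforms each raw value of a column while reading
parse_dates=["date"], parses the column as datetime (the column keeps its position)
skiprows=range(1, 66458909) skips rows 1 through 66458908 and keeps the header in row 0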
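The read_csv options above can be tried on a small in-memory file. This is a minimal sketch: the CSV text and the short skiprows range are made up for illustration, and only the column names are borrowed from the snippets above.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file.
csv_text = (
    "date,unit_sales,onpromotion\n"
    "2017-01-01,3,True\n"
    "2017-01-02,-1,False\n"
    "2017-01-03,?,True\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["?"],              # "?" is read as NaN
    dtype={"onpromotion": bool},  # force the dtype of one column
    parse_dates=["date"],         # parse the column as datetime
    skiprows=range(1, 2),         # skip line 1 only; the header (line 0) is kept
)
print(df.dtypes)
```

Two data rows remain, and unit_sales becomes a float column because it now holds a NaN.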
2, Write CSV
df1.to_csv('test.csv', encoding='utf-8', index=False)
index=False leaves the index out of the file
3, Read Excel
pd.read_excel
sheet_name selects the sheet, e.g. sheet_name='Sheet1',
keep_default_na=False reads empty cells as '' instead of NaN,
header=None means the sheet has no header row
4, DataFrame
pd.DataFrame(x_test, columns=columns)
x_test supplies the table contents and columns supplies the column names
DataFrame.columns returns all column names of the table
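5, loc, iloc
df1.loc[:, '营销是否成功'] = y_test
loc can assign to a column label that does not exist yet (the column is created); iloc can only address positions that already exist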
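The difference between loc and iloc on assignment can be checked with a tiny frame. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# .loc is label-based: assigning to an unknown column label creates the column.
df.loc[:, "b"] = [10, 20, 30]

# .iloc is position-based: column position 2 does not exist, so this raises.
try:
    df.iloc[:, 2] = [0, 0, 0]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(df.columns.tolist(), iloc_failed)
```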
6, Get values by position
df.iloc[:, :-1].values
Values of all rows, for every column except the last
7, Get all index values
index_list = data.index.values
8, Change a value in the table
data.iloc[i + 1, 0] = name
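9, Show numeric and categorical features
Numeric: values you can do arithmetic on.
Categorical: any feature that holds categories or text.
# Summarize all numeric features
dataset_raw.describe()
# Summarize categorical (object dtype) features; the argument is the letter 'O', not zero
dataset_raw.describe(include=['O'])
10, Show the dtype of a column
dataset_raw.dtypes['fnlwgt']
11, Set predclass to 1 where it equals '>50K'
dataset_raw.loc[dataset_raw['predclass'] == '>50K', 'predclass'] = 1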
12, cut: binning
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10)
10 is the number of equal-width bins
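A small sketch of how pd.cut assigns bins, using made-up ages. Passing an integer asks for equal-width bins; passing explicit edges (the edges and labels here are assumptions for illustration) gives named bins.

```python
import pandas as pd

ages = pd.Series([18, 25, 40, 61, 78])

# Three equal-width intervals spanning min..max of the data.
binned = pd.cut(ages, 3)

# Explicit bin edges with hypothetical labels.
labeled = pd.cut(ages, bins=[0, 30, 60, 100], labels=["young", "mid", "senior"])

print(binned.cat.codes.tolist())
print(labeled.tolist())
```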
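13, pandas' function for one-hot encoding is pd.get_dummies()
dataset_bin_enc = pd.get_dummies(dataset_bin, columns=one_hot_cols)
14, astype: set the dtype
dataset_con = dataset_con.astype(str)
Convert a non-numeric feature to the category dtype (the integer codes are then available through .cat.codes)
grid_df[col] = grid_df[col].astype('category')
15, Drop a column (returns a new DataFrame; the original is unchanged)
dataset_con_enc.drop('predclass', axis=1)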
16, Sort by a column
importance.sort_values(by='Importance', ascending=True)
17, Drop rows with missing values
train = train.dropna(axis=0)
# subset limits the check to the listed columns: drop every row where age or sex is missing (np.nan)
new_titanic_survival = titanic_survival.dropna(subset=["age","sex"])
18, Fill missing values
dataset.fillna(-1, inplace=True)
19, groupby: grouping
https://www.jianshu.com/p/50fb023f208c
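A minimal groupby sketch on a made-up sales table: rows are split by the key column, and a reduction is applied to each group.

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"], "sales": [10, 20, 5]})

# One total per store; the group keys become the index of the result.
totals = df.groupby("store")["sales"].sum()
print(totals)
```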
20, reset_index: reset the index to the default 0..n-1 integers
https://blog.csdn.net/weixin_43655282/article/details/97889398
# drop=True: discard the old index instead of keeping it as a column
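A short sketch of the effect of drop, using a made-up frame with a non-default index:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])

kept = df.reset_index()              # old index is kept as a column named "index"
dropped = df.reset_index(drop=True)  # old index is thrown away

print(kept.columns.tolist())
print(dropped.index.tolist())
```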
21, merge: join two tables on key columns
https://blog.csdn.net/Asher117/article/details/84725199
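A minimal merge sketch with two made-up tables that share an id column. The default is an inner join; how="left" keeps every row of the left table.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

inner = left.merge(right, on="id")                  # only ids present in both
left_join = left.merge(right, on="id", how="left")  # all left rows, NaN where no match

print(inner)
print(left_join)
```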
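22, value_counts(): count how many times each distinct value occurs in a column
dropna=False keeps missing values in the count; normalize=True returns each value's share instead of its count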
https://www.jianshu.com/p/f773b4b82c66
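The two options can be compared on a small made-up Series that contains one missing value:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", None])

counts = s.value_counts()                # missing values are ignored by default
with_na = s.value_counts(dropna=False)   # missing values get their own row
shares = s.value_counts(normalize=True)  # fractions of the non-missing values

print(counts, with_na, shares, sep="\n")
```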
23,iterrows
https://blog.csdn.net/Softdiamonds/article/details/80218777
24, pandas groupby grouping and agg aggregation
https://blog.csdn.net/u012706792/article/details/80892510
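A sketch of agg on grouped data, using a made-up sales table. A list of function names gives one output column per function; the keyword form (named aggregation) lets you choose the output column names, which are hypothetical here.

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"], "sales": [10, 20, 5]})

# Several aggregations of one column at once.
stats = df.groupby("store")["sales"].agg(["sum", "mean", "max"])

# Named aggregation: output_name=(input_column, function).
named = df.groupby("store").agg(total=("sales", "sum"), biggest=("sales", "max"))

print(stats)
print(named)
```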
25,map,apply
https://blog.csdn.net/u010814042/article/details/76401133
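A minimal sketch of the two methods on a made-up frame: Series.map works element by element (a dict acts as a lookup table), while DataFrame.apply passes whole rows or columns to the function.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female"], "a": [1, 2], "b": [3, 4]})

# map: element-wise lookup on a single column.
df["sex_code"] = df["sex"].map({"male": 0, "female": 1})

# apply with axis=1: the function receives one row at a time.
df["row_sum"] = df[["a", "b"]].apply(lambda row: row["a"] + row["b"], axis=1)

# apply with the default axis=0: the function receives one column at a time.
col_max = df[["a", "b"]].apply(lambda col: col.max())

print(df)
print(col_max)
```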
26,quantile
# quantile(q) returns the q-th quantile; here every value below the 5th percentile is raised to it
group[group < group.quantile(.05)] = group.quantile(.05)
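The same flooring pattern on made-up values 0..100, next to Series.clip, which gives the same result without masked assignment. This is a sketch, not a claim about how group is built in the original code.

```python
import pandas as pd

group = pd.Series(range(101), dtype=float)  # made-up values 0..100
floor = group.quantile(0.05)                # 5th percentile

manual = group.copy()
manual[manual < floor] = floor              # same pattern as the line above

clipped = group.clip(lower=floor)           # equivalent one-liner

print(floor, manual.min(), clipped.min())
```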
27,transform
https://www.jianshu.com/p/509d7b97088c
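Unlike an aggregation, transform returns a result with the same length as the input, so it can be written straight back as a new column. A minimal sketch on a made-up table:

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"], "sales": [10, 20, 5]})

# Each row receives the mean of its own group.
df["store_mean"] = df.groupby("store")["sales"].transform("mean")
print(df)
```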
28, drop_duplicates: remove duplicate rows
https://blog.csdn.net/ghr5582/article/details/80693882
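A short sketch with a made-up frame. Without arguments, whole rows are compared and the first occurrence is kept; subset and keep change which columns are compared and which copy survives.

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "v": ["a", "a", "b"]})

deduped = df.drop_duplicates()                             # compare full rows, keep first
last_per_key = df.drop_duplicates(subset=["k"], keep="last")  # compare k only, keep last

print(deduped.index.tolist())
print(last_per_key.index.tolist())
```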
29, nunique: returns the number of distinct values
https://blog.csdn.net/feizxiang3/article/details/93380525
30, sample: shuffle the rows (frac=1 samples every row, in random order)
x_data = x_data.sample(frac=1, random_state=1).reset_index(drop=True)
https://www.cnblogs.com/webRobot/p/11484648.html
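31, tail
tail() shows rows starting from the end of the dataset; like head(), it defaults to 5 rows and the count can be changed.
32, Correlation coefficients, corr()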
https://blog.csdn.net/walking_visitor/article/details/85128461
32, as_matrix (removed in pandas 1.0; use .to_numpy() or .values instead)
https://www.cnblogs.com/key221/p/9394051.html
33, .transpose
Swap rows and columns
pd.DataFrame(deck_percentages).transpose()
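34, .levels
The levels of a hierarchical index (MultiIndex); in practice mostly met after a groupby on several keys
35, qcut: quantile-based binning (each bin holds roughly the same number of rows)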
pd.qcut(df_all['Fare'], 13)
36, Split strings: split
df_all['Title'] = df_all['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
# expand: bool, default False. If True, return a DataFrame (or MultiIndex); if False, return a Series (or Index)
37, .str.cat: concatenate strings
https://blog.csdn.net/zbrj12345/article/details/81181015
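String concatenation lives on the .str accessor (the bare .cat accessor belongs to categorical columns). A minimal sketch with made-up strings:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

joined = s.str.cat(sep="-")                   # collapse the whole Series into one string
paired = s.str.cat(["x", "y", "z"], sep="_")  # element-wise join with another sequence

print(joined)
print(paired.tolist())
```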
38,melt
index_columns = ['id','item_id','dept_id','cat_id','store_id','state_id']
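# id_vars are the identifier columns that stay fixed; every other column is unpivoted into rows, with the old column name stored under var_name and its value under value_name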
train_df = train_df.melt(id_vars = index_columns,var_name='d',value_name='sales')
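The wide-to-long reshaping can be seen on a tiny made-up table with one id column and two day columns (d_1, d_2 are hypothetical names):

```python
import pandas as pd

wide = pd.DataFrame({"id": ["i1", "i2"], "d_1": [0, 3], "d_2": [1, 0]})

# Each (id, day column) pair becomes one row.
long = wide.melt(id_vars=["id"], var_name="d", value_name="sales")

print(wide)   # before: 2 rows x 3 columns
print(long)   # after: 4 rows x 3 columns (id, d, sales)
```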
34, shift
Shift the data up or down within the DataFrame
https://www.cnblogs.com/liulangmao/p/9301032.html
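A minimal shift sketch on made-up values. A positive period moves values down (each row sees the previous one), a negative period moves them up, and the vacated slots become NaN.

```python
import pandas as pd

s = pd.Series([1, 3, 6])

prev = s.shift(1)    # value of the previous row
nxt = s.shift(-1)    # value of the next row

print(prev.tolist())
print(nxt.tolist())
```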
35, rolling: moving-window calculations for time series
https://blog.csdn.net/liuhaolei1992/article/details/89421212
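A sketch of a 3-row moving average on made-up values. The first window-1 results are NaN because the window is not yet full.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

moving_avg = s.rolling(window=3).mean()
print(moving_avg.tolist())
```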
36, reindex: conform the data to a new index, which can add, drop, or reorder entries
https://blog.csdn.net/missyougoon/article/details/83409717
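A minimal reindex sketch with made-up labels. Existing labels keep their values, the order follows the new index, and labels that were not present are filled (NaN by default, or fill_value if given).

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# "d" is new, "b" is left out, and the order changes.
r = s.reindex(["c", "a", "d"], fill_value=0)
print(r)
```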
37, diff: the difference between an element and another element of the same column (by default the previous row)
https://jingyan.baidu.com/article/2a13832852b1d1464a134f90.html
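A short diff sketch on made-up values, including the check that diff() matches subtracting the shifted Series:

```python
import pandas as pd

s = pd.Series([1, 3, 6])

d1 = s.diff()     # difference from the previous row
d2 = s.diff(2)    # difference from the row two steps back

print(d1.tolist())
print(d2.tolist())
```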
38, add_prefix
Prefix the labels with a string
https://www.cjavapy.com/article/276/
39, resample: resampling (change the frequency of a time series)
https://www.jb51.net/article/164438.htm
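A minimal resample sketch: four made-up daily values are downsampled into two-day buckets and summed. resample needs a datetime-like index.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=4, freq="D")
s = pd.Series([1, 2, 3, 4], index=idx)

two_day = s.resample("2D").sum()   # buckets: Jan 1-2 and Jan 3-4
print(two_day)
```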
40, slice: slice data
https://blog.csdn.net/claroja/article/details/64925356
41, assign: add a column to a DataFrame directly (returns a new DataFrame)
https://www.cnblogs.com/jason--/p/11502710.html
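A short assign sketch on a made-up frame. The new column can be given as a value or as a function of the frame, and the original frame is left untouched.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# b is computed from the frame that assign is called on.
out = df.assign(b=lambda d: d["a"] * 2)

print(out)
print(df.columns.tolist())   # the original still has only "a"
```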
42, to_pickle
Save data to disk
43, Date features
grid_df['date'] = pd.to_datetime(grid_df['date'])
grid_df['tm_d'] = grid_df['date'].dt.day.astype(np.int8)
grid_df['tm_w'] = grid_df['date'].dt.isocalendar().week.astype(np.int8)  # ISO week of the year; .dt.week was removed in pandas 2.0
grid_df['tm_m'] = grid_df['date'].dt.month.astype(np.int8)
grid_df['tm_y'] = grid_df['date'].dt.year
grid_df['tm_y'] = (grid_df['tm_y'] - grid_df['tm_y'].min()).astype(np.int8)
grid_df['tm_wm'] = grid_df['tm_d'].apply(lambda x: ceil(x/7)).astype(np.int8)  # week of the month (needs: from math import ceil)
grid_df['tm_dw'] = grid_df['date'].dt.dayofweek.astype(np.int8)
grid_df['tm_w_end'] = (grid_df['tm_dw']>=5).astype(np.int8)  # 1 if the day is a weekend (Saturday or Sunday)
44, train['SalePrice'].skew() skewness
train['SalePrice'].kurt() kurtosis
45, crosstab: cross tabulation
https://www.cnblogs.com/rachelross/p/10468589.html
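A minimal pd.crosstab sketch on made-up data: it counts how often each combination of two columns occurs, with one column's values as rows and the other's as columns.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["m", "f", "m", "f"],
    "survived": [0, 1, 0, 0],
})

table = pd.crosstab(df["sex"], df["survived"])
print(table)
```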