Notes on Python pandas (pd):
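# Read the input file, split fields on the tab delimiter, and assign column names
import pandas as pd
data = pd.read_csv('train.csv', sep="\t", names=['label', 'msg'])
# Inspect the loaded data
print(data.shape)
print(data.head(10))
# Shuffle the rows randomly
data = data.sample(frac=1, random_state=42)
# Put a space between characters so that each character becomes one token
data['msg'] = data['msg'].apply(lambda x: ' '.join(x))
# Count the samples in each class
print(data['label'].value_counts())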
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    data['msg'],
    data['label'],
    test_size=0.3,
    random_state=42
)
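The shuffle, character split, and train/test split above can be tried on a small in-memory frame before pointing them at a real file. The sketch below is only an illustration: the `toy` DataFrame and its contents are made up and stand in for `train.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    'label': [0, 1] * 5,
    'msg': [f'text{i}' for i in range(10)],
})

# frac=1 returns every row in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=42)

# ' '.join on a string puts a space between its characters
shuffled['msg'] = shuffled['msg'].apply(lambda x: ' '.join(x))

x_train, x_test, y_train, y_test = train_test_split(
    shuffled['msg'], shuffled['label'], test_size=0.3, random_state=42
)
print(len(x_train), len(x_test))
```

Note that `sample` keeps the original index labels, so `shuffled.loc[0, 'msg']` still refers to the row that was first in `toy`.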
1. TF-IDF model
# Saving and loading a TF-IDF model
# Fit the TF-IDF vectorizer
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_word = TfidfVectorizer(
    max_features=800000,
    token_pattern=r"(?u)\b\w+\b",  # also keeps single-character tokens
    min_df=1,
    # max_df=0.1,
    analyzer='word',
    ngram_range=(1, 5)
)
vectorizer_word.fit(x_train)
# Save the model to a file
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(vectorizer_word, f)
# Load the model from the file
with open('tfidf_model.pkl', 'rb') as f:
    tfidf_model = pickle.load(f)
tfidf_test = tfidf_model.transform(x_test)
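To check that a pickled vectorizer behaves the same after loading, a round trip on a tiny corpus is enough. This is a minimal sketch with assumptions: the `corpus` list is invented, the n-gram range is shortened to (1, 2), and a temporary directory is used so no file is left behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; tokens are already separated by spaces
corpus = ['a b c', 'a b d', 'b c d e']

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 2))
vectorizer.fit(corpus)
before = vectorizer.transform(corpus)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tfidf_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(vectorizer, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

after = restored.transform(corpus)

# The restored vectorizer keeps the vocabulary and reproduces the matrix
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
same_matrix = np.allclose(before.toarray(), after.toarray())
print(same_vocab, same_matrix)
```

A pickle file should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote it.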
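2. LR (logistic regression) model:
import joblib
from sklearn.linear_model import LogisticRegression
# Vectorize the training text with the fitted TF-IDF vectorizer
tfidf_train = vectorizer_word.transform(x_train)
# Train the logistic regression model
lr_word = LogisticRegression(
    solver='sag',
    verbose=2)
lr_word.fit(tfidf_train, y_train)
# Save the model so it can be reused directly next time
joblib.dump(lr_word, 'lr_word_ngram.pkl')
# Load the model
model = joblib.load(filename="lr_word_ngram.pkl")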
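The same save-and-load pattern can be verified on a few hand-written points. This is a sketch under stated assumptions: the features `X` and labels `y` are invented, and the solver is swapped to `lbfgs` because the data set is tiny; it is not the `sag` configuration used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, linearly separable toy features and labels
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression(solver='lbfgs')  # lbfgs chosen only for this tiny example
clf.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lr_toy.pkl')
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model carries the same coefficients and predictions
same_coef = np.array_equal(clf.coef_, restored.coef_)
same_pred = np.array_equal(clf.predict(X), restored.predict(X))
print(same_coef, same_pred)
```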
3. WOE model:
# Bundle the WOE table and the selected feature list into one dict
mid_result = {"woe_dic": woe_dic,
              "select_features": selected_features}
# Save the model
joblib.dump(mid_result, 'cheat_model/nnt_buyer_model_woe_'+formatted_today+'.pkl')
# Load the model
mid_result = joblib.load('cheat_model/nnt_woe_'+model_version+'.pkl')
woe_dic_new = mid_result.get("woe_dic")
selected_features = mid_result.get("select_features")
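Bundling several intermediate objects into one dict and saving it with joblib can be sketched as follows. Everything here is illustrative: the `woe_dic` contents, the `version` date format, and the file name are assumptions, since the notes do not show how `woe_dic`, `formatted_today`, or `model_version` are built.

```python
import os
import tempfile
from datetime import date

import joblib

# Hypothetical WOE table: feature -> {bin label -> WOE value}
woe_dic = {'age': {'<30': -0.25, '>=30': 0.4}}
selected_features = ['age']

bundle = {'woe_dic': woe_dic, 'select_features': selected_features}

# Assumed version tag format (YYYYMMDD); not taken from the original notes
version = date.today().strftime('%Y%m%d')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'woe_' + version + '.pkl')
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

print(loaded.get('woe_dic') == woe_dic)
print(loaded.get('select_features') == selected_features)
```

The file name used when loading has to match the one used when saving, including the version suffix.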
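# Predict on the test set (assumed step; the notes do not show how y_pred_word is produced)
y_pred_word = model.predict(tfidf_test)
# Print the results
for i in range(10):
    print(y_pred_word[i], x_test.iloc[i])
# Save to a CSV file
predict_df = pd.DataFrame({'y_pred_word': y_pred_word, 'x_test': x_test})
predict_df.to_csv('predict_test.csv', index=False, sep="\t")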
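Writing predictions with a tab separator and reading them back can be checked in memory. In this sketch the predictions and texts are made up, and an `io.StringIO` buffer replaces the file on disk.

```python
import io

import pandas as pd

# Hypothetical predictions and texts; the Series keeps a shuffled index
x_test = pd.Series(['a b c', 'd e f'], index=[7, 3])
y_pred = [1, 0]

predict_df = pd.DataFrame({'y_pred_word': y_pred, 'x_test': x_test})

# index=False drops the row labels from the output
buffer = io.StringIO()
predict_df.to_csv(buffer, index=False, sep='\t')

buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')
print(restored.shape)
```

Because the index is not written, the frame read back gets a fresh 0-based index; the same `sep` has to be passed to `read_csv`.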
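iloc: position-based; indexes by row number and column number. The "i" can be read as "int", so iloc takes integer positions rather than labels, for example data.iloc[0:2, :]
loc: label-based; indexes by row name and column name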
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=['row1', 'row2', 'row3', 'row4', 'row5'],
                    columns=['col1', 'col2', 'col3', 'col4', 'col5'])
data
      col1  col2  col3  col4  col5
row1     0     1     2     3     4
row2     5     6     7     8     9
row3    10    11    12    13    14
row4    15    16    17    18    19
row5    20    21    22    23    24
1. Index directly by column name (not recommended)
One column: data['col1'] selects the first column and returns a Series.
Several columns: data[['col1', 'col2']]
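The difference between a single label and a list of labels shows up in the return type. A short sketch on the same 5x5 frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=['row1', 'row2', 'row3', 'row4', 'row5'],
                    columns=['col1', 'col2', 'col3', 'col4', 'col5'])

one_col = data['col1']             # single label -> Series
two_cols = data[['col1', 'col2']]  # list of labels -> DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(two_cols.shape)
```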
2. Index by row number and column number with iloc (recommended)
1) One row: data.iloc[0], data.iloc[0, :]
2) Several rows: data.iloc[[0, 2]], data.iloc[[0, 2], :]
3) Consecutive rows: data.iloc[0:2], data.iloc[0:2, :]
4) One column: data.iloc[:, 0]
5) Several columns: data.iloc[:, [0, 2]]
6) Consecutive columns: data.iloc[:, 0:2]
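The iloc forms listed above can be run together on the 5x5 frame. A minimal sketch; the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=['row1', 'row2', 'row3', 'row4', 'row5'],
                    columns=['col1', 'col2', 'col3', 'col4', 'col5'])

row = data.iloc[0]           # first row, returned as a Series
rows = data.iloc[[0, 2]]     # rows at positions 0 and 2
head = data.iloc[0:2]        # positions 0 and 1; the end position is excluded
col = data.iloc[:, 0]        # first column, returned as a Series
cols = data.iloc[:, [0, 2]]  # columns at positions 0 and 2
block = data.iloc[:, 0:2]    # columns at positions 0 and 1

print(rows.index.tolist())
print(head.shape, block.shape)
```

Position slices follow ordinary Python slicing, so `0:2` covers two rows, not three.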
3. Index by row name and column name with loc
1) One row: data.loc['row1']
2) Several rows: data.loc[['row1', 'row3']], data.loc[['row1', 'row3'], :]
3) Consecutive rows: data.loc['row1':'row3'], data.loc['row1':'row3', :]
4) One column: data.loc[:, 'col1']
5) Several columns: data.loc[:, ['col1', 'col3']], data.loc[:, ['col2', 'col3']]
6) Consecutive columns: data.loc[:, 'col1':'col3']
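One detail worth checking by hand is that label slices in loc include the end label, unlike position slices in iloc. A short sketch on the same frame; the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=['row1', 'row2', 'row3', 'row4', 'row5'],
                    columns=['col1', 'col2', 'col3', 'col4', 'col5'])

row = data.loc['row1']                   # one row, returned as a Series
rows = data.loc[['row1', 'row3']]        # two rows picked by label
row_span = data.loc['row1':'row3']       # row1, row2 and row3: end label included
col = data.loc[:, 'col1']                # one column, returned as a Series
col_span = data.loc[:, 'col1':'col3']    # col1, col2 and col3
cell = data.loc['row2', 'col3']          # a single value

print(row_span.shape, col_span.shape, cell)
```

So `data.loc['row1':'row3']` returns three rows, while `data.iloc[0:2]` returns two.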