Python pandas programming notes (2024-01-04)

Key points when working with Python pandas (pd):

1. Read an input file and convert it to a pandas DataFrame

# Read the input file, split fields on the separator, and assign column names
import pandas as pd
data = pd.read_csv('train.csv', sep = "\t", names=['label', 'msg'])

# Inspect the loaded data
print(data.shape)
print(data.head(10))
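
To try the call without a real train.csv, the same read can be sketched against an in-memory buffer. The three sample rows below are made up for illustration and only assume the same layout (a label, a tab, then the message text).

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for train.csv
raw = "1\thello world\n0\tfree prize now\n1\tsee you soon\n"

# Because names= is given and header= is not, the first row is treated as data
sample = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "msg"])

print(sample.shape)
print(sample.head())
```

If the real file has a header row, passing header=0 together with names= replaces that header instead of reading it as a data row.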

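2. Shuffle the training samples

# Shuffle the rows randomly (frac=1 keeps every row, random_state makes the order reproducible)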
data = data.sample(frac=1, random_state=42)
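
One detail worth knowing: sample() moves the original index labels along with the rows. The sketch below, on a small made-up DataFrame, shows the optional reset_index(drop=True) step, which is an addition and not part of the original notes.

```python
import pandas as pd

# Small made-up frame used only for illustration
df = pd.DataFrame({"label": [0, 1, 0, 1, 0], "msg": list("abcde")})

shuffled = df.sample(frac=1, random_state=42)
# The old index labels are still attached to the shuffled rows
old_index = shuffled.index.tolist()
print(old_index)

# Renumber the rows 0..n-1 and discard the old labels
shuffled = shuffled.reset_index(drop=True)
print(shuffled.index.tolist())
```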

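3. Transform a column

# Joining a string with ' ' puts a space between every character,
# so each character of msg becomes a separate token
data['msg'] = data['msg'].apply(lambda x: ' '.join(x))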
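
A minimal sketch of what this transformation does to individual values, using two made-up strings. Note that it assumes every value is a string; a missing value (NaN, which is a float) would raise a TypeError inside join.

```python
import pandas as pd

# Made-up sample messages
msgs = pd.Series(["abc", "你好"])

# str is iterable character by character, so join inserts a space between characters
spaced = msgs.apply(lambda x: " ".join(x))
print(spaced.tolist())  # ['a b c', '你 好']
```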

4. Check the distribution of sample labels

print(data['label'].value_counts())
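
When the class balance matters more than the raw counts, value_counts can also return proportions. A small sketch on made-up labels:

```python
import pandas as pd

# Made-up labels: three negatives, one positive
labels = pd.Series([0, 0, 0, 1], name="label")

counts = labels.value_counts()                # absolute counts per label
ratios = labels.value_counts(normalize=True)  # share of each label

print(counts)
print(ratios)
```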

5. Split the samples into a training set and a test set by ratio

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = \
    train_test_split(data['msg'],
                     data['label'],
                     test_size=0.3,
                     random_state=42
                     )
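
If the labels are imbalanced, it may help to keep the label ratio the same in both splits. The sketch below uses the stratify parameter of train_test_split for that; the data is made up, and stratification is an optional addition to the call above, not something the original notes use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples of each label
df = pd.DataFrame({
    "msg": [f"text {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 10,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    df["msg"],
    df["label"],
    test_size=0.3,
    random_state=42,
    stratify=df["label"],  # keep the label proportions in both splits
)

print(len(x_tr), len(x_te))
print(y_te.value_counts())
```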

6. Save and load trained models

1. TF-IDF model

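# Train a TF-IDF model, then save it and load it back
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_word = TfidfVectorizer(
    max_features=800000,
    token_pattern=r"(?u)\b\w+\b",  # also keeps single-character tokens
    min_df=1,
    # max_df=0.1,
    analyzer='word',
    ngram_range=(1, 5)
)

# Fit on the training text and build the training matrix used by the LR model
tfidf_train = vectorizer_word.fit_transform(x_train)

# Save the model to a file
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(vectorizer_word, f)

# Load the model from the file
with open('tfidf_model.pkl', 'rb') as f:
    tfidf_model = pickle.load(f)
tfidf_test = tfidf_model.transform(x_test)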
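
A quick way to check that a pickled vectorizer behaves the same after loading is to compare its output before and after a round trip. This is a minimal sketch on three made-up documents, using pickle.dumps/pickle.loads in memory instead of a file. The token_pattern matters here: scikit-learn's default pattern only keeps tokens of two or more characters, so single-character tokens like the ones produced in step 3 would otherwise be dropped.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents whose tokens are single characters
docs = ["a b c", "b c d", "c d e"]

vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 2))
vec.fit(docs)

# Serialize to bytes and restore, without touching the disk
restored = pickle.loads(pickle.dumps(vec))

before = vec.transform(["b c"]).toarray()
after = restored.transform(["b c"]).toarray()
same = bool((before == after).all())
print(same)
```

Pickle files are generally only safe to load from trusted sources, and a model is best loaded with the same scikit-learn version that saved it.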

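2. LR model:

import joblib
from sklearn.linear_model import LogisticRegression

# Train the logistic regression model
lr_word = LogisticRegression(
    solver='sag',
    verbose=2)
lr_word.fit(tfidf_train, y_train)

# Save the model so it can be reused directly next time
joblib.dump(lr_word, 'lr_word_ngram.pkl')

# Load the model
model = joblib.load(filename="lr_word_ngram.pkl")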

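3. WOE model:

# woe_dic, selected_features, formatted_today and model_version are not defined in these notes
mid_result = {"woe_dic": woe_dic,
              "select_features": selected_features}
# Save the model
joblib.dump(mid_result, 'cheat_model/nnt_buyer_model_woe_'+formatted_today+'.pkl')
# Load the model (the path must point to a file that was saved earlier)
mid_result = joblib.load('cheat_model/nnt_woe_'+model_version+'.pkl')
woe_dic_new = mid_result.get("woe_dic")
selected_features = mid_result.get("select_features")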
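
The same pattern, bundling several intermediate results in one dict and saving it with joblib, can be sketched in a self-contained way. The dictionary contents and the file name below are hypothetical placeholders, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile
import joblib

# Hypothetical bundle; the real woe_dic and feature list come from training
bundle = {
    "woe_dic": {"age": {"(0, 30]": 0.12, "(30, 60]": -0.05}},
    "select_features": ["age"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "woe_bundle.pkl")  # hypothetical file name
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

# dict.get returns None for a missing key instead of raising KeyError
print(loaded.get("select_features"))
print(loaded.get("missing_key"))
```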

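7. Assemble the results into a pandas DataFrame and write them to an output file

# Print the first 10 predictions (y_pred_word holds the predictions for x_test)
for i in range(10):
    print(y_pred_word[i], x_test.iloc[i])

# Save to a CSV file
predict_df = pd.DataFrame({'y_pred_word': y_pred_word, 'x_test': x_test})
predict_df.to_csv('predict_test.csv', index=False, sep="\t")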
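
The sketch below shows how the DataFrame is assembled when one column is a NumPy array and the other is a Series with a shuffled index, as x_test is after the split. The values are made up, and an in-memory buffer stands in for predict_test.csv.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-ins: a Series with a non-sequential index and an array of predictions
x_part = pd.Series(["a b", "c d", "e f"], index=[7, 2, 5])
y_pred = np.array([1, 0, 1])

# The array is placed positionally; the Series supplies the row index
out = pd.DataFrame({"y_pred_word": y_pred, "x_test": x_part})
print(out.index.tolist())

# Write tab-separated text without the index, then read it back
buf = io.StringIO()
out.to_csv(buf, index=False, sep="\t")
buf.seek(0)
back = pd.read_csv(buf, sep="\t")
print(back)
```

If the predictions were a Series with a different index instead of an array, pandas would align the two columns by label, which can produce NaN values, so passing a plain array as done here avoids that.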

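8. The difference between iloc and loc

iloc: position-based; indexes by row number and column number. The "i" can be read as "int", so iloc takes integer positions, e.g. data.iloc[0:2,:]
loc: label-based; indexes by row name and column name. Unlike an iloc slice, a loc slice includes its end label.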

import pandas as pd
import numpy  as np
data = pd.DataFrame(np.arange(25).reshape(5, 5), 
                  index = ['row1', 'row2','row3','row4','row5'], 
                  columns=['col1', 'col2','col3','col4', 'col5'])
data 
      col1  col2  col3  col4  col5
row1     0     1     2     3     4
row2     5     6     7     8     9
row3    10    11    12    13    14
row4    15    16    17    18    19
row5    20    21    22    23    24

1. Index directly by column name (not recommended)

Select one column: data['col1'] returns the first column as a Series object.

Select multiple columns: data[['col1','col2']]
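
A short sketch of the return types, rebuilding the same 5x5 example frame so the snippet runs on its own: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one element.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=["row1", "row2", "row3", "row4", "row5"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

one = data["col1"]             # Series
two = data[["col1", "col2"]]   # DataFrame with two columns
single_df = data[["col1"]]     # DataFrame with one column

print(type(one).__name__, type(two).__name__, type(single_df).__name__)
```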

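2. iloc indexes by row number and column number (recommended)
1) Select one row: data.iloc[0], data.iloc[0,:]
2) Select multiple rows: data.iloc[[0,2]], data.iloc[[0,2],:]
3) Select consecutive rows: data.iloc[0:2], data.iloc[0:2,:]
4) Select one column: data.iloc[:,0]
5) Select multiple columns: data.iloc[:,[0,2]]
6) Select consecutive columns: data.iloc[:,0:2]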
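
The iloc forms listed above can be checked against the example frame. The snippet rebuilds the frame so it is self-contained; the variable names are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=["row1", "row2", "row3", "row4", "row5"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

first_row = data.iloc[0]          # Series holding row1
two_rows = data.iloc[[0, 2]]      # row1 and row3
block = data.iloc[0:2, 0:2]       # end positions are excluded, so a 2x2 block
two_cols = data.iloc[:, [0, 2]]   # col1 and col3

print(first_row.tolist())
print(block)
```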

3. loc indexes by row name and column name
1) Select one row: data.loc['row1']
2) Select multiple rows: data.loc[['row1', 'row3']], data.loc[['row1', 'row3'],:]
3) Select consecutive rows: data.loc['row1':'row3'], data.loc['row1':'row3',:]
4) Select one column: data.loc[:,'col1']
5) Select multiple columns: data.loc[:,['col1', 'col3']], data.loc[:,['col2', 'col3']]
6) Select consecutive columns: data.loc[:, 'col1':'col3']
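
One practical difference between the two indexers is how slices treat their end point: a loc slice includes the end label, while an iloc slice excludes the end position. A small sketch on the same example frame, rebuilt here so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=["row1", "row2", "row3", "row4", "row5"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# Label slice: row1..row3 and col1..col3, both ends included
by_label = data.loc["row1":"row3", "col1":"col3"]

# Position slice: 0:3 stops before position 3, so it covers the same three rows and columns
by_pos = data.iloc[0:3, 0:3]

print(by_label.shape, by_pos.shape)
print(by_label.equals(by_pos))
```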
