1 Competition page

The "Daguan Cup" (达观杯) Text Intelligent Processing Challenge

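2 Data

The data consists of two CSV files:

train_set.csv: used to train the model; each row corresponds to one article. The articles have been anonymized at both the character level and the word level. There are four columns:
Column 1: the index of the article (id)
Column 2: the article body at the character level, i.e. the text as a sequence of separated character IDs (article)
Column 3: the article body at the word level, i.e. the text as a sequence of separated word IDs (word_seg)
Column 4: the label of the article (class)
Note: each number stands for one character, one word, or one punctuation mark. Character IDs and word IDs are numbered independently!
test_set.csv: used for testing. The format is the same as train_set.csv, but there is no class column.
Note: the article ids in test_set and train_set are numbered independently.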
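
To make the layout concrete, here is a small sketch that builds a toy table with the same four columns. All numbers are invented for illustration and are not taken from the real files, and the example assumes that the IDs inside a cell are separated by spaces.

```python
import pandas as pd

# Toy rows laid out like train_set.csv (every value here is made up).
toy_train = pd.DataFrame({
    "id": [0, 1],
    "article": ["7368 1252 69", "581 131 7368"],    # character-level IDs
    "word_seg": ["816903 597526", "90540 816903"],  # word-level IDs
    "class": [14, 3],
})

# Each word_seg cell is a single string; splitting on whitespace
# (an assumption about the separator) yields the individual word IDs.
first_words = toy_train.loc[0, "word_seg"].split()
print(first_words)
```

The point is only that every article is one string of ID tokens, which is the shape of input a text vectorizer expects.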

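3 Classification with logistic regression

# Import the required packages
import pandas as pd  # data loading and manipulation
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.feature_extraction.text import CountVectorizer  # turns text into token-count features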

# Data preprocessing
df_train = pd.read_csv("./train_set.csv")  # load the training set
df_test = pd.read_csv("./test_set.csv")  # load the test set
df_train.drop(columns=['article', 'id'], inplace=True)  # drop the article and id columns
df_test.drop(columns=['article'], inplace=True)  # drop the article column; id is kept for the result file

"""
After this preprocessing:
1. The training set keeps only the word-level feature (word_seg) and the label (class)
2. The test set keeps only the word-level feature (word_seg) and the id
"""

# Feature engineering
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, max_features=100000)
vectorizer.fit(df_train['word_seg'])  # learn the vocabulary from word_seg in the training set
x_train = vectorizer.transform(df_train['word_seg'])  # training set as a sparse count matrix
y_train = df_train["class"] - 1  # training labels, shifted down by 1 (restored when saving)
x_test = vectorizer.transform(df_test['word_seg'])  # test set encoded with the same vocabulary

"""
I had not used CountVectorizer before, so at this point I do not really know what the training and test data look like after it has processed them. To be analyzed later.
"""

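# Train a LogisticRegression classifier
# dual=True is only implemented for the liblinear solver, so the solver is set explicitly.
# Recent scikit-learn releases deprecate liblinear on multiclass targets, hence the
# one-vs-rest wrapper (the scheme liblinear applies to multiclass data anyway).
from sklearn.multiclass import OneVsRestClassifier
lg = OneVsRestClassifier(LogisticRegression(C=4, dual=True, solver="liblinear"))
lg.fit(x_train, y_train)  # fit the model on the training set
y_test = lg.predict(x_test)  # predict the class of each test article

"""
What C and dual do in LogisticRegression is also a mystery to me at this point.
"""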

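# Save the result locally
df_test["class"] = y_test.tolist()
df_test["class"] = df_test["class"] + 1  # undo the -1 shift applied to the training labels
df_result = df_test.loc[:, ["id", "class"]]
df_result.to_csv("./result_lg.csv", index=False)

"""
Don't worry too much about this saving step. y_test already holds the result we want; this only writes it out in the format the competition expects.
"""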

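As a partial answer to the question of what the data looks like after CountVectorizer, the following sketch runs it with default settings on three made-up documents of word IDs. The documents and variable names are assumptions for illustration only, not part of the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents written as word IDs, in the style of word_seg.
docs = ["11 22 33", "11 22", "11 44"]

vec = CountVectorizer()      # default settings: single tokens only
X = vec.fit_transform(docs)  # sparse matrix with one row per document

# vocabulary_ maps every token to the index of its column.
print(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))
# Dense view: entry (i, j) counts how often token j occurs in document i.
print(X.toarray())
```

Each article therefore becomes a row of counts over a fixed vocabulary. One detail to keep in mind is that the default token_pattern only keeps tokens of two or more characters, so a single-digit ID would be dropped.
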
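4 Parameters of the CountVectorizer class

The English documentation can be viewed with help(CountVectorizer). The relevant excerpts are quoted below, each followed by a plain-language reading.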

ngram_range

ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used.

ngram_range gives the smallest and the largest n-gram size to extract; every n with min_n <= n <= max_n is used. With ngram_range=(1, 2), as in the code above, both single tokens and pairs of adjacent tokens become features.
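
A minimal sketch of the effect, using one made-up document of three word IDs (the data and variable names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["11 22 33"]  # one made-up document of three word IDs

# Same document, two different n-gram ranges.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))    # ['11', '22', '33']
print(sorted(uni_and_bi.vocabulary_))  # ['11', '11 22', '22', '22 33', '33']
```

The bigrams "11 22" and "22 33" keep some local word order that single tokens lose, at the price of a larger vocabulary.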

min_df

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

When the vocabulary is built, terms whose document frequency is strictly below the threshold are ignored. A float is read as a proportion of documents, an integer as an absolute number of documents. The parameter has no effect if an explicit vocabulary is passed. With min_df=3, as in the code above, a term must occur in at least 3 articles to be kept.
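
The cut-off can be seen on a toy corpus. The three documents below are made up, and min_df=2 is chosen only so that the effect shows on so little data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Document frequencies: "11" -> 3, "22" -> 2, "33" -> 1, "44" -> 1
docs = ["11 22", "11 33", "11 22 44"]

# Keep only terms that appear in at least 2 documents.
vec_min = CountVectorizer(min_df=2).fit(docs)
print(sorted(vec_min.vocabulary_))  # ['11', '22']
```

Rare terms such as "33" and "44" are removed, which shrinks the feature space and drops tokens too infrequent to generalize from.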

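max_df

max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document
    frequency strictly higher than the given threshold (corpus-specific
    stop words).
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

When the vocabulary is built, terms whose document frequency is strictly above the threshold are ignored; these act as corpus-specific stop words. A float is read as a proportion of documents, an integer as an absolute number of documents. The parameter has no effect if an explicit vocabulary is passed. With max_df=0.9, as in the code above, a term that occurs in more than 90% of the articles is dropped.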
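
A small sketch with four made-up documents, using the same max_df=0.9 as the code above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# "11" occurs in all 4 documents (100% > 90%); every other term is rarer.
docs = ["11 22", "11 33", "11 22 44", "11 55"]

vec_max = CountVectorizer(max_df=0.9).fit(docs)
print(sorted(vec_max.vocabulary_))  # ['22', '33', '44', '55']
```

The token "11" carries no information for telling the documents apart, so removing it behaves like an automatically derived stop-word list.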

max_features

max_features : int or None, default=None
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

If max_features is not None, the vocabulary keeps only the max_features terms with the highest total frequency across the corpus. The parameter has no effect if an explicit vocabulary is passed. The code above caps the vocabulary at 100000 terms.
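
As a rough illustration on made-up data, keeping only the two most frequent terms looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Total term counts across the corpus: "11" -> 4, "22" -> 2, "33" -> 1, "44" -> 1
docs = ["11 11 11 22", "22 33", "11 44"]

vec_top = CountVectorizer(max_features=2).fit(docs)
print(sorted(vec_top.vocabulary_))  # ['11', '22']
```

Note that the ranking uses the total count over the whole corpus, not the number of documents a term appears in.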

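5 Parameters of the LogisticRegression class

C

C : float, default: 1.0
    Inverse of regularization strength; must be a positive float.
    Like in support vector machines, smaller values specify stronger
    regularization.

C is the inverse of the regularization strength and must be a positive float. As with support vector machines, a smaller value means stronger regularization. The code above uses C=4, which regularizes less than the default of 1.0.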
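
One way to see this is to fit the same toy problem with a small and a large C and compare the size of the learned coefficients. The data below is made up, and the two C values are arbitrary choices for the sake of contrast:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up binary problem: the class depends on how large the two features are.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [2, 1], [1, 2], [2, 2], [3, 3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

strong = LogisticRegression(C=0.01).fit(X, y)  # small C: strong regularization
weak = LogisticRegression(C=100.0).fit(X, y)   # large C: weak regularization

# Stronger regularization pulls the coefficients toward zero.
norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```

The coefficient norm for C=0.01 comes out smaller than for C=100, which is what "smaller values specify stronger regularization" means in practice.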

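dual

dual : bool, default: False
    Dual or primal formulation. Dual formulation is only implemented for
    l2 penalty with liblinear solver. Prefer dual=False when
    n_samples > n_features.

dual selects whether the optimization problem is solved in its dual or in its primal formulation. The dual formulation is only implemented for the l2 penalty with the liblinear solver, and dual=False is preferable when there are more samples than features.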
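
The sketch below fits the primal and the dual formulation on the same made-up, well-separated binary data. It is only meant to show that both settings target the same model and that dual=True has to be paired with solver="liblinear"; the data and names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated made-up clusters, one per class.
X = np.array([[0, 0], [0, 1], [1, 0],
              [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# The dual formulation requires the l2 penalty (the default) and liblinear.
primal = LogisticRegression(C=4, dual=False, solver="liblinear").fit(X, y)
dual = LogisticRegression(C=4, dual=True, solver="liblinear").fit(X, y)

pred_primal = primal.predict(X)
pred_dual = dual.predict(X)
print(pred_primal, pred_dual)
```

Here there are more samples than features, so following the documentation dual=False would be the usual choice; the dual form tends to pay off when the number of features is larger than the number of samples.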
