NLP Practice Notes

Day1

1. Download the dataset
2. Inspect the dataset contents and understand the task

The task is as follows:

[Image 1: task description]

[Image 2: task description]

Day2

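import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Load the tab-separated training set. nrows=100 reads only the first 100 rows
# for a quick look; pass nrows=None to read the whole file. The counts quoted
# in the comments further down are too large for 100 rows, so they presumably
# come from a run on the full file.
train_df = pd.read_csv('./train_set.csv', sep='\t', nrows=100)

# Number of space-separated tokens in each text
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

# Histogram of text lengths
_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")
plt.show()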


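# Concatenate all texts and count how often each token occurs
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
# Sort the (token, count) pairs by count, most frequent first
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# Number of distinct tokens
print(len(word_count))
# 6869

# Most frequent token and its count
print(word_count[0])
# ('3750', 7482224)

# Least frequent token and its count
print(word_count[-1])
# ('3133', 1)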
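
The token counting above can be tried on a tiny made-up corpus first. This is only a minimal sketch, assuming space-separated token IDs like those in the `text` column; the `docs` list and the `token_frequencies` helper are hypothetical, and `Counter.most_common()` stands in for the manual `sorted` call.

```python
from collections import Counter

def token_frequencies(docs):
    """Return (token, count) pairs sorted from most to least frequent."""
    counter = Counter()
    for doc in docs:
        # split() with no argument also tolerates repeated spaces
        counter.update(doc.split())
    return counter.most_common()

# Hypothetical toy corpus, not taken from train_set.csv
docs = ["3750 900 3750", "648 3750", "900 12"]
freqs = token_frequencies(docs)

print(len(freqs))   # number of distinct tokens
print(freqs[0])     # most frequent token and its count
print(freqs[-1])    # one of the least frequent tokens
```

Updating the counter text by text avoids building one very large joined string, which may matter when the full file is loaded.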

# Deduplicate tokens within each text so that every token counts once per text;
# the totals are then the number of texts that contain each token
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(set(x.split(' '))))
all_lines = ' '.join(train_df['text_unique'])
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# The three tokens that occur in the most texts
print(word_count[0])
# ('3750', 197997)

print(word_count[1])
# ('900', 197653)

print(word_count[2])
# ('648', 191975)
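
The per-text deduplication above amounts to counting, for each token, how many texts contain it. A minimal sketch of that idea on made-up data follows; `docs`, `document_frequencies`, and `coverage` are hypothetical names used only for illustration.

```python
from collections import Counter

def document_frequencies(docs):
    """Map each token to the number of texts that contain it at least once."""
    counter = Counter()
    for doc in docs:
        # set() removes repeats, so each text contributes at most 1 per token
        counter.update(set(doc.split()))
    return dict(counter)

# Hypothetical toy corpus, not taken from train_set.csv
docs = ["3750 900 3750 3750", "648 3750", "900 12 12 12 12"]
doc_freqs = document_frequencies(docs)

# Share of texts that contain each token
coverage = {tok: n / len(docs) for tok, n in doc_freqs.items()}

for tok, share in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(tok, round(share, 2))
```

Token `12` occurs four times in the toy corpus but in only one text, so it ranks low here even though a raw count would rank it first. Tokens whose share is close to 1, as the counts for `3750`, `900`, and `648` suggest, may well be punctuation, though the anonymized IDs do not confirm that.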

[Image 3]

[Image 4]

Day3

Left blank for now.
