[Data Analysis in Practice] - Text Analysis - U.S. Patent Phrase - 1

Data Background

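The United States Patent and Trademark Office (USPTO) offers one of the largest repositories of scientific, technical, and commercial information in the world through its Open Data Portal. A patent is a form of intellectual property granted in exchange for the public disclosure of new and useful inventions. Because patents undergo a rigorous examination process before they are granted, and because the history of U.S. innovation spans two centuries and 11 million patents, the U.S. patent archive is a rare combination of data volume, quality, and diversity.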

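“The USPTO serves America's innovation machine by granting patents, registering trademarks, and promoting intellectual property around the globe. From the light bulb to quantum computers, the USPTO has shared more than 200 years of human ingenuity with the world. Combined with the creativity of the data science community, USPTO datasets have unlimited potential to empower AI and ML models that will benefit the progress of science and society as a whole.”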


Data Description

Dataset source: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data

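  • id - a unique identifier for a pair of phrases
  • anchor - the first phrase
  • target - the second phrase
  • context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
  • score - the similarity, derived from a combination of one or more manual expert ratings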


import pandas as pd
from termcolor import colored
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import numpy as np
import bq_helper
from bq_helper import BigQueryHelper

import warnings
warnings.filterwarnings("ignore")

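In this competition, models are trained on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. During patent search and examination, determining the semantic similarity between phrases is critical for deciding whether an invention has already been described.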

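For example, if one invention claims a “television set” and a prior publication describes a “TV set”, a model would ideally recognize that the two are the same and help a patent attorney or examiner retrieve the relevant documents.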
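
To make the “television set” versus “TV set” example concrete, the sketch below scores the pair with plain word overlap (Jaccard similarity). The helper `token_jaccard` is a hypothetical illustration, not part of the dataset or of any method used in this post. It only shows why surface matching undervalues phrases that mean the same thing.

```python
def token_jaccard(phrase_a: str, phrase_b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two phrases."""
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty phrases share no words; treat the overlap as zero.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Only "set" is shared, so the lexical overlap is low even though
# both phrases describe the same thing.
overlap = token_jaccard("television set", "TV set")
print(f"word overlap: {overlap:.2f}")
```

A semantic model would be expected to rate this pair far higher than the word overlap suggests, which is the gap the task targets.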

train_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
test_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")

print(f"Number of observations in TRAIN: {colored(train_df.shape, 'yellow')}")
print(f"Number of observations in TEST: {colored(test_df.shape, 'yellow')}")

#Number of observations in TRAIN: (36473, 5)
#Number of observations in TEST: (36, 4)


# Look at 10 randomly sampled observations from the training set.
train_df.sample(10)


ANCHOR COLUMN

print(f"Number of unique values in ANCHOR column: {colored(train_df.anchor.nunique(), 'yellow')}")
#Number of unique values in ANCHOR column: 733

train_df.anchor.value_counts().head(20)


pattern = 'base'
mask = train_df['target'].str.contains(pattern, case=False, na=False)
# Combine both conditions on the full frame so the boolean index aligns with train_df
train_df[mask & (train_df['anchor'] == 'component composite coating')]
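
The filter above keeps rows that match one anchor and whose target contains a substring. The sketch below runs the same kind of filter on a small toy frame. The rows in `toy_df` are invented for illustration, and both masks are built on the same frame so their indices line up.

```python
import pandas as pd

# Hypothetical rows; the real data comes from train.csv.
toy_df = pd.DataFrame({
    "anchor": ["component composite coating",
               "component composite coating",
               "abatement"],
    "target": ["base coating", "metal layer", "Base reduction"],
})

# Each mask is a boolean Series indexed like toy_df.
anchor_mask = toy_df["anchor"] == "component composite coating"
target_mask = toy_df["target"].str.contains("base", case=False, na=False)

# Element-wise AND keeps rows that satisfy both conditions.
result = toy_df[anchor_mask & target_mask]
print(result)
```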


anchor_desc = train_df[train_df.anchor.notnull()].anchor.values
stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 800, 
                      height = 800,
                      background_color ='white',
                      min_font_size = 10,
                      stopwords = stopwords,).generate(' '.join(anchor_desc)) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show()
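
A word cloud sizes each word by how often it occurs in the joined text. The sketch below shows that counting step on its own. The phrases and the stopword set are toy assumptions, whereas the notebook uses the anchor column and `wordcloud.STOPWORDS`. The library may also count two-word collocations by default, which this sketch ignores.

```python
from collections import Counter

# Hypothetical inputs standing in for the anchor column and STOPWORDS.
toy_phrases = [
    "transparent liquid crystal display",
    "reflection type liquid crystal display",
    "dissolve in solvent system",
]
toy_stopwords = {"in", "of", "to", "with"}

# Count every non-stopword token across all phrases.
word_counts = Counter(
    word
    for phrase in toy_phrases
    for word in phrase.lower().split()
    if word not in toy_stopwords
)

# The most frequent words would be drawn largest in the cloud.
print(word_counts.most_common(3))
```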


train_df['anchor_len'] = train_df['anchor'].str.split().str.len()

print(f"Anchors with maximum length of 5: \n{colored(train_df.query('anchor_len == 5')['anchor'].unique(), 'yellow')}")
print(f"\nAnchors with length of 4: \n{colored(train_df.query('anchor_len == 4')['anchor'].unique(), 'green')}")

Anchors with maximum length of 5: ['make of high density polyethylene' 'produce by recombinant dna technology' 'reflection type liquid crystal display' 'rotate on its longitudinal axis']

Anchors with length of 4: ['align with input shaft' 'apply to anode electrode' 'average power ratio reduction' 'coat with conducting layer' 'combine with optical elements' 'connect to common conductor' 'connect to electrode structure' 'consist of oxalic acid' 'disk type recording medium' 'disperse in plastic material' 'dissolve in solvent system' 'engage in guide slot' 'extend from groove bottom' 'fall to low value' 'high gradient magnetic separators' 'operate internal combustion engine' 'peripheral nervous system stimulation' 'pulse width modulated control' 'recover from reaction product' 'reflect by reflection mirror' 'remain below threshold value' 'send to control node' 'show in chemical formula' 'transparent liquid crystal display' 'use as cooling fluid' 'use physically unclonable functions']

train_df.anchor_len.hist(orientation='horizontal', color='#FFCF56')
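
The histogram summarizes how many phrases have each word count. As a rough sketch of that tally in plain Python, the block below counts whitespace-separated words for a few anchors taken from the output above (plus one assumed single-word phrase) and groups them by length.

```python
from collections import Counter

# A handful of phrases; "abatement" is an assumed single-word example.
toy_anchors = [
    "abatement",
    "average power ratio reduction",
    "rotate on its longitudinal axis",
    "align with input shaft",
]

# Same idea as str.split().str.len(): words are whitespace-separated tokens.
length_counts = Counter(len(anchor.split()) for anchor in toy_anchors)

for n_words, n_phrases in sorted(length_counts.items()):
    print(f"{n_words} word(s): {n_phrases} phrase(s)")
```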


pattern = '[0-9]'
mask = train_df['anchor'].str.contains(pattern, na=False)
train_df['num_anchor'] = mask
train_df[mask]['anchor'].value_counts()
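
The pattern `[0-9]` flags any phrase that contains at least one digit. The sketch below applies the same regular expression with the standard `re` module to a few toy phrases. The list is illustrative and not drawn from the dataset, apart from '1 multiplexer', which this post looks up in the target column.

```python
import re

DIGIT_PATTERN = re.compile(r"[0-9]")

# Hypothetical phrases for illustration.
toy_phrases = ["h2o product", "1 multiplexer", "abatement"]

# search() finds a digit anywhere in the string, like str.contains().
has_digit = [bool(DIGIT_PATTERN.search(phrase)) for phrase in toy_phrases]

for phrase, flag in zip(toy_phrases, has_digit):
    print(f"{phrase!r}: {flag}")
```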


TARGET COLUMN

print(f"Number of unique values in TARGET column: {colored(train_df.target.nunique(), 'yellow')}")
#Number of unique values in TARGET column: 29340

train_df.target.value_counts().head(20)


target_desc = train_df[train_df.target.notnull()].target.values
stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 800, 
                      height = 800,
                      background_color ='white',
                      min_font_size = 10,
                      stopwords = stopwords,).generate(' '.join(target_desc)) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show() 


train_df['target_len'] = train_df['target'].str.split().str.len()
train_df.target_len.value_counts()


print(f"Targets with maximum length of 11: \n{colored(train_df.query('target_len == 11')['target'].unique(), 'yellow')}")
print(f"\nTargets with length of 10: \n{colored(train_df.query('target_len == 10')['target'].unique(), 'green')}")
print(f"\nTargets with length of 9: \n{colored(train_df.query('target_len == 9')['target'].unique(), 'yellow')}")
print(f"\nTargets with length of 8: \n{colored(train_df.query('target_len == 8')['target'].unique(), 'green')}")

Targets with maximum length of 11: ['n 9 fluorenylmethyloxycarbonyl 3 amino 3 45 dimethoxy 2 nitrophenylpropionic acid']

Targets with length of 10: ['a substance used as a reagent in a rocket engine' 'heating calcium oxide and aluminium oxide together at high temperatures' 'a quadric surface that has exactly one axis of symmetry']

Targets with length of 9: ['testing the life of a leakage current protection device' 'a quadric surface that has no center of symmetry' 'machine that converts the kinetic energy of a fluid']

Targets with length of 8: ['loading sequence of a breech loading naval gun' 'loading sequence of a breech loading small arm' 'gearbox has two clutches but no clutch pedal' 'partial displacement of a bone from its joint' 'inflatable curtain module for use in a vehicle' 'conveyors are used to convey larger sized items' 'ability of an article to withstand prolonged wear']

# Checking numbers in target feature

pattern = '[0-9]'
mask = train_df['target'].str.contains(pattern, na=False)
train_df['num_target'] = mask
train_df[mask]['target'].value_counts()


pattern = '1 multiplexer'
mask = train_df['target'].str.contains(pattern, na=False)
train_df[mask]


CONTEXT COLUMN

Source: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

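The first letter is the “section symbol”: a letter from “A” (“Human Necessities”) to “H” (“Electricity”), or “Y” for emerging cross-sectional technologies. It is followed by a two-digit number that gives the “class symbol” (“A01” stands for “Agriculture; forestry; animal husbandry; trapping; fishing”).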

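  • A: Human Necessities
  • B: Operations and Transport
  • C: Chemistry and Metallurgy
  • D: Textiles
  • E: Fixed Constructions
  • F: Mechanical Engineering
  • G: Physics
  • H: Electricity
  • Y: Emerging Cross-Sectional Technologies
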
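
The structure described above (one section letter followed by a two-digit class) can be sketched as a small parser. `split_context` and `CPC_SECTIONS` are hypothetical names introduced here for illustration. The section titles follow the list above, and the validation rules are an assumption about what a well-formed context code looks like.

```python
from typing import Tuple

# Section letters and titles as listed above.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}


def split_context(context: str) -> Tuple[str, str, str]:
    """Split a code such as 'A01' into (section letter, class digits, section title)."""
    code = context.strip().upper()
    if len(code) < 2:
        raise ValueError(f"context code too short: {context!r}")
    section, class_digits = code[0], code[1:]
    if section not in CPC_SECTIONS:
        raise ValueError(f"unknown CPC section: {section!r}")
    return section, class_digits, CPC_SECTIONS[section]


print(split_context("A01"))
```

Keeping the class digits as a string preserves leading zeros, so “A01” and “A10” stay distinct.
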
print(f"Number of unique values in CONTEXT column: {colored(train_df.context.nunique(), 'yellow')}")
#Number of unique values in CONTEXT column: 106

train_df.context.value_counts().head(20)


train_df['section'] = train_df['context'].astype(str).str[0]
train_df['classes'] = train_df['context'].astype(str).str[1:]
train_df.head(10)


print(f"Number of unique SECTIONS: {colored(train_df.section.nunique(), 'yellow')}")
print(f"Number of unique CLASSES: {colored(train_df.classes.nunique(), 'yellow')}")
#Number of unique SECTIONS: 8
#Number of unique CLASSES: 44

di = {"A" : "A - Human Necessities", 
      "B" : "B - Operations and Transport",
      "C" : "C - Chemistry and Metallurgy",
      "D" : "D - Textiles",
      "E" : "E - Fixed Constructions",
      "F" : "F - Mechanical Engineering",
      "G" : "G - Physics",
      "H" : "H - Electricity",
      "Y" : "Y - Emerging Cross-Sectional Technologies"}
      
train_df.replace({"section": di}).section.hist(orientation='horizontal', color='#FFCF56')


train_df.classes.value_counts().head(15)


SCORE COLUMN

train_df.score.hist(color='#FFCF56')
train_df.score.value_counts()


train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==1.0')


train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==0.0')
