Keyword tree: graph analysis for semantic extraction

[Figure 1]

This post is a short abstract of a full-scale research project focused on keyword recognition. The semantics-extraction technique was initially applied in social media research on depressive patterns; here I focus on the NLP and mathematical aspects without psychological interpretation. It is clear that analysing single-word frequencies is not enough: repeatedly shuffling a collection at random leaves the relative frequencies intact yet destroys the information completely (the bag-of-words effect). We need a more accurate approach for mining semantic attractors.

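To make the bag-of-words point concrete, here is a minimal Python sketch with a made-up toy sentence: random shuffling leaves the single-word frequencies untouched, while the word-to-word links that carry the structure disappear.

```python
import random
from collections import Counter

# A made-up toy token stream (for illustration only).
tokens = ("life gives man a family and a family gives man a life "
          "man values life and family").split()

def unigrams(seq):
    return Counter(seq)

def bigrams(seq):
    return Counter(zip(seq, seq[1:]))

shuffled = tokens[:]
random.seed(42)
random.shuffle(shuffled)

# Single-word frequencies survive the shuffle unchanged...
print(unigrams(tokens) == unigrams(shuffled))   # True

# ...but the word-to-word links (bigrams) are mostly destroyed.
preserved = sum((bigrams(tokens) & bigrams(shuffled)).values())
print(preserved, "of", sum(bigrams(tokens).values()), "bigrams preserved")
```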

According to Relational Frame Theory (RFT), bidirectional links between entities are basic cognitive elements, so the hypothesis of a bigram dictionary has been tested. We explored the top Russian-speaking Wall of Help, a site with about 150,000 visits per day. The response/request collections were parsed: 25,000 records from 2018.

[Figure 2]

Text cleaning included age, sex, and message-length standardization. Sex standardization was achieved through name-to-sex recognition. Morphological cleaning and tokenization yielded nouns in standard form, from which a vocabulary of bigrams with their frequencies was mined. The bigram sets are ordered by frequency and normalized to equal volume in both groups by a cutoff criterion, so each group (Request/Response) is characterized by its own bigram matrix. The gain in information (taken as the inverse of Shannon entropy) is about 30% for bigrams over single words; I(3) - I(2) = 6% for trigrams, I(4) - I(3) = 2%, and less than 1% for N > 4.

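A minimal sketch of the counting step, assuming the tokens have already been reduced to lemmatized nouns (the cleaning and morphological normalization are not shown); the toy token stream and the cutoff value are purely illustrative. It mines the bigram vocabulary with frequencies and prints the Shannon entropy of the n-gram distributions, whose diminishing increments motivate stopping at bigrams.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Frequency dictionary of n-grams over a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def shannon_entropy(counts):
    """Shannon entropy (in bits) of an empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# `tokens` stands for the cleaned, lemmatized noun stream; the morphological
# normalization itself (e.g. with a Russian lemmatizer) is not shown here.
tokens = "life man family life job man procreation life family man".split()

# Bigram vocabulary ordered by frequency; CUTOFF is a hypothetical threshold
# used to normalize both groups to equal volume.
CUTOFF = 3
print(ngram_counts(tokens, 2).most_common(CUTOFF))

# Entropy by n-gram order: since information is treated as the inverse of
# entropy, shrinking differences between orders mean diminishing returns.
for n in range(1, 5):
    print(f"H({n}) = {shannon_entropy(ngram_counts(tokens, n)):.2f} bits")
```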

[Figure 3]

The bigram matrix was used as the generator of a weighted, undirected 3D graph. The conversion was carried out with the OpenOrd force-directed layout algorithm, which transforms the 2D matrix into a tree-based topology. The weight of each node corresponds to the single-word frequency (not shown), while the edge length is an inverse function of the bigram frequency. I considered betweenness centrality (BC) and a modified closest-neighbours measure. Entities with extra-high BC can be treated as information hubs that shape the semantics: removing them damages the information the most. Closest neighbours are based on co-occurrence frequency analysis, with a modified neighbour ordering: a neighbour's BC divided by its co-occurrence distance (CD) is used as the weighting function, BC/CD.

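A sketch of the graph step with networkx, under stated assumptions: the bigram counts are hypothetical stand-ins for the mined matrix, and a generic spring layout stands in for Gephi's OpenOrd algorithm. Only the weighting scheme (edge distance as the inverse of bigram frequency) and the BC/CD neighbour ranking follow the description above.

```python
import networkx as nx
from collections import Counter

# Hypothetical bigram counts standing in for the mined Request/Response matrix.
bigram_counts = Counter({
    ("life", "man"): 42, ("life", "procreation"): 17, ("life", "family"): 15,
    ("man", "family"): 12, ("man", "job"): 9, ("family", "procreation"): 7,
    ("job", "money"): 5, ("life", "job"): 4,
})

G = nx.Graph()
for (a, b), freq in bigram_counts.items():
    # Edge weight = co-occurrence frequency; its inverse plays the role of the
    # edge length, so frequent pairs end up close together in the layout.
    G.add_edge(a, b, weight=freq, distance=1.0 / freq)

# Betweenness centrality computed over the inverse-frequency distances.
bc = nx.betweenness_centrality(G, weight="distance")

# A generic 3D force-directed embedding (spring layout standing in for OpenOrd).
pos = nx.spring_layout(G, dim=3, weight="weight", seed=1)

def rank_neighbours(root):
    """Order the root's neighbours by BC / CD (co-occurrence distance)."""
    scores = {nb: bc[nb] / G[root][nb]["distance"] for nb in G.neighbors(root)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_neighbours("life"))
```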

[Figure 4]

We investigated the closest neighbours in the vicinity of the selected BC root, #Life. The #Man (№1) value is almost fused with the #Life attractor; #Procreation (№2) and #Family (№3) are the next closest entities, with lower BC/CD grades. The response values come in the following order: #Man №1, #Job №2, #Procreation №3. It should be noted that topic bias is clearly present in the response group; nevertheless, the separation of personal and group values (#Man versus #Life) is remarkable in spite of the topic noise. The graph was based on the 10,000 most frequent bigrams (44% of the data), yet the top five entities ranked by BC/CD do not change after rescaling to 50% and 88% of the bigram dictionary.

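The rescaling check can be sketched by rebuilding the graph at several dictionary cutoffs and comparing the top neighbours of the root; this continues the previous snippet and reuses its hypothetical `bigram_counts`.

```python
# Robustness-check sketch, continuing the previous snippet: it assumes that
# `bigram_counts`, `networkx as nx`, and `Counter` are already in scope.

def top_neighbours(counts, root, k=5):
    """Top-k neighbours of `root` by BC/CD in a graph built from `counts`."""
    g = nx.Graph()
    for (a, b), freq in counts.items():
        g.add_edge(a, b, weight=freq, distance=1.0 / freq)
    bc = nx.betweenness_centrality(g, weight="distance")
    ranked = sorted(((nb, bc[nb] / g[root][nb]["distance"])
                     for nb in g.neighbors(root)),
                    key=lambda kv: kv[1], reverse=True)
    return [nb for nb, _ in ranked[:k]]

full = bigram_counts.most_common()
for share in (0.44, 0.50, 0.88):   # fractions of the bigram dictionary kept
    subset = Counter(dict(full[:max(1, int(len(full) * share))]))
    print(f"{share:.0%}: {top_neighbours(subset, 'life')}")
```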

The results correlate with empirical observations in psychology and therefore give preliminary confirmation of the chosen BC/CD ranking algorithm for recognizing semantic attractors. The approach is convenient when you deal with big, noisy text or speech data, and it can be used to mine keywords either relative to a selected entity or in absolute terms. You may read more here. The instrument may also have applications in HR evaluation. The authors are conducting related research on the English-speaking segment and are looking for collaboration. The full version of the research is under review at a peer-reviewed journal; however, you may request a draft personally. Thank you.

I would like to thank Dmitry Vodyanov for the fruitful discussion.

Translated from: https://habr.com/en/post/470301/
