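Introduction
TextRank is inspired by Google's PageRank. It splits a text into units (words or sentences), builds a graph model over them, and ranks the important units with a voting mechanism, so keyword extraction and summarization need only the information in a single document. Unlike models such as LDA and HMM, TextRank does not have to be trained in advance on a collection of documents, and it is widely used because it is simple and effective.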
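The general TextRank model can be represented as a directed weighted graph G = (V, E), made up of a vertex set V and an edge set E, where E is a subset of V × V. The weight of the edge from vertex Vj to vertex Vi is wji. For a given vertex Vi, In(Vi) is the set of vertices that point to it, and Out(Vi) is the set of vertices that Vi points to. The score of vertex Vi is defined as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
Here d is the damping factor, with a value between 0 and 1. It represents the probability of following an edge from the current vertex to one of its neighbors, while 1 - d is the probability of jumping to an arbitrary vertex in the graph; d is usually set to 0.85. To compute the scores with TextRank, every vertex is given an arbitrary initial value and the scores are recomputed iteratively until convergence, that is, until the change at every vertex falls below a given threshold, which is usually 0.0001.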
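The iteration described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used by textrank4zh: the function name textrank_scores, the dict-of-dicts graph layout, the initial score of 1.0, and the max_iter safety cap are all assumptions made for the example.

```python
def textrank_scores(weights, d=0.85, tol=1e-4, max_iter=200):
    """Iterate the TextRank score formula until the scores stop changing.

    weights[j][i] is the weight w_ji of the edge from vertex j to vertex i
    (hypothetical layout chosen for this sketch).
    """
    # Collect every vertex that appears as a source or a target.
    nodes = set(weights)
    for targets in weights.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Sum of outgoing weights for each vertex: the denominator in the formula.
    out_sum = {j: sum(weights.get(j, {}).values()) for j in nodes}

    # Arbitrary initial value for every vertex.
    scores = {v: 1.0 for v in nodes}

    for _ in range(max_iter):
        new_scores = {}
        for i in nodes:
            rank = 0.0
            for j in nodes:
                w_ji = weights.get(j, {}).get(i, 0.0)
                if w_ji > 0 and out_sum[j] > 0:
                    rank += w_ji / out_sum[j] * scores[j]
            new_scores[i] = (1 - d) + d * rank
        # Largest change at any single vertex in this pass.
        delta = max(abs(new_scores[v] - scores[v]) for v in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == '__main__':
    # A small "star" graph: a is linked with b and c in both directions.
    star = {'a': {'b': 1.0, 'c': 1.0}, 'b': {'a': 1.0}, 'c': {'a': 1.0}}
    print(textrank_scores(star))
```

In the star graph the hub vertex a ends up with a higher score than b and c, which is the "voting" effect: a receives all of the weight that b and c send out, while b and c each receive only half of what a sends.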
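With the TextRank algorithm we can extract keywords from an article and generate an automatic summary.
For more detail, see: a detailed walkthrough of the algorithm, exactly as implemented in the textrank4ZH code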
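For keyword extraction, the graph is usually built from word co-occurrence: two words are linked when they appear close to each other in the text. The sketch below shows one way to build such a graph, assuming the text has already been segmented into a list of words. The function name cooccurrence_graph and the exact window convention (a window of 2 links only adjacent words) are assumptions for illustration and are not taken from the textrank4zh source.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Build an undirected, weighted co-occurrence graph from a word list.

    Two words are linked when they fall inside the same sliding window of
    `window` consecutive words; the edge weight counts how often that happens.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(words):
        # Words that follow w1 inside the same window.
        for w2 in words[i + 1:i + window]:
            if w1 == w2:
                continue  # no self-loops
            graph[w1][w2] += 1.0
            graph[w2][w1] += 1.0
    # Convert to plain dicts so the result is easy to print and inspect.
    return {w: dict(neighbors) for w, neighbors in graph.items()}


if __name__ == '__main__':
    tokens = ['aws', 'exam', 'aws', 'topic']
    print(cooccurrence_graph(tokens, window=2))
```

The resulting dict of neighbor weights is the kind of graph that a TextRank-style scoring pass would then rank.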
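Installing the required package
The required package is textrank4zh, which extracts keywords and key sentences (used as the summary) from Chinese articles.
Installation is simple: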
pip install textrank4zh
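Summarizing a Chinese article
As an example, I will use an article I wrote earlier: AWS认证备考心得 (notes on preparing for the AWS certification). Yes, a shameless plug, haha.
Using TextRank takes only a few lines of code: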
# coding:utf-8
from textrank4zh import TextRank4Keyword, TextRank4Sentence
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

if __name__ == '__main__':
    # Read the whole article as one UTF-8 string
    with open('./aws.txt', mode='r', encoding='utf-8') as f:
        text = f.read()

    # Keyword extraction with a co-occurrence window of 5 words
    tr4w = TextRank4Keyword()
    tr4w.analyze(text=text, lower=True, window=5)
    print('Keywords:')
    for item in tr4w.get_keywords(10, word_min_len=1):
        print(item['word'], item['weight'])

    # Key-sentence extraction, comparing sentences after stop-word removal
    tr4s = TextRank4Sentence()
    tr4s.analyze(text=text, lower=True, source='no_stop_words')

    # Plot the weight of every ranked sentence
    data = pd.DataFrame(data=tr4s.key_sentences)
    mpl.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(data['weight'], 'ro-', lw=2, ms=5, alpha=0.7, mec='#404040')
    # Pass True positionally; the b= keyword was renamed to visible= in newer Matplotlib
    plt.grid(True, ls=':', color='#606060')
    plt.xlabel('Sentence', fontsize=12)
    plt.ylabel('Importance', fontsize=12)
    plt.title('Sentence importance curve', fontsize=15)
    plt.show()

    # Print the 10 highest-weighted sentences
    key_sentences = tr4s.get_key_sentences(num=10, sentence_min_len=2)
    for sentence in key_sentences:
        print(sentence['weight'], sentence['sentence'])
The results are as follows:
Keywords
Format: keyword + weight
考试 (exam) 0.027935018564767485
相关 (related) 0.019789154134256207
aws 0.016636259077242657
题目 (question) 0.01407576775822367
看 (read) 0.013558049263992051
备考 (exam preparation) 0.01328776426934064
知识点 (knowledge point) 0.010689853883336327
认证 (certification) 0.010566520363575575
会 (will) 0.009300281741320988
应该 (should) 0.009139419618958523
Top 10 sentences by weight
Format: weight + sentence
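0.024624674608926533 If you can reach 90% accuracy on these questions before the exam, you have basically mastered the material; if you are still unsure, you can pay 20 US dollars on the AWS certification site for a practice exam beforehand: about 20 questions, with 65% as the passing mark
0.024016249587116383 However, some exam topics are missing from this book or barely mentioned, for example Lambda, ECS, Elastic Beanstalk, CloudFormation, and API Gateway; several questions in my exam touched on them, so it is still worth reading up on them, at least enough to know what they do
0.022585396746875635 Based on my preparation experience, this book is a must-read; it should cover most of the exam content and is well organized: each chapter has a detailed introduction, then the key points to watch for in the exam, and then practice questions on the related content
0.021385870302766598 In short, there are no shortcuts; it is best to go through every item above carefully. If you have one month to prepare and study from roughly 8 pm to 10 pm each day, spend the first half month or so, about 30 study hours, finishing the official study guide
0.02124521540684917 Note that the AWS certification exam connects to servers in the United States, so it is best to phone ahead and check the network conditions for the AWS exam that day; on the day I took the exam the network dropped four times, but each time it came back my progress had been saved online and the exam time was not affected
0.02081627266255728 Because of work requirements, I prepared for the AWS Solution Architecture Associate certification exam at the beginning of this year
0.02027674881999819 AWS materials contain many topology and architecture diagrams; first understand the architecture-related concepts, then study the individual elements that make up each architecture, for example learn to read the diagram below and master VPC, AZone, subnet, load balancer, auto scaling
0.01950681834056804 The exam tests not only how familiar candidates are with AWS cloud services; many questions also come from real-world problems and require broad architecture design experience, so the certification is highly valued
0.019032121092932345 3. Exam registration: https://aws.amazon.com/cn/certification/certified-solutions-architect-associate/
0.018858895659387472 Choose the questions related to AWS SAA; there are a lot of them, more than 500 in total, and during the exam I found that 5 questions came from here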
Sentence importance curve
The results look reliable. By what standard? Well, I am the author, after all.
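Summarizing an English article
In fact, textrank4zh also does a very good job of summarizing English articles. The algorithm itself should be language-independent; the textrank4zh package simply happens to have very good support for Chinese word segmentation.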
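One way to see why the ranking step is language-independent: once a sentence has been split into words, the similarity between two sentences depends only on which tokens they share. The sketch below uses the word-overlap similarity from the original TextRank paper (number of shared words divided by the sum of the logs of the two sentence lengths). The function name sentence_similarity is made up for this example, and whether a given textrank4zh version computes exactly this is an assumption to verify against its source.

```python
import math


def sentence_similarity(words_a, words_b):
    """Word-overlap similarity between two tokenized sentences.

    similarity = |shared words| / (log(len(a)) + log(len(b)))
    Returns 0.0 when the value is undefined (empty or one-word sentences).
    """
    if not words_a or not words_b:
        return 0.0
    shared = set(words_a) & set(words_b)
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if not shared or denominator <= 0:
        return 0.0
    return len(shared) / denominator


if __name__ == '__main__':
    # The same function works for English and for segmented Chinese tokens.
    print(sentence_similarity(['the', 'merger', 'date'], ['the', 'merger', 'plan']))
    print(sentence_similarity(['考试', '题目'], ['考试', '认证']))
```

Only the tokenization step is language-specific, which is where the Chinese word segmentation support matters.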
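Take a document publicly disclosed by the SEC as an example.
At a glance, it has something to do with a share merger.
Let's look at the TextRank results. The code is the same as above; only the input is this English document.
Keywords
Format: keyword + weight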
portfolio 0.02184658057602816
policy 0.01727234660766818
merger 0.017022765571195125
fund 0.01688258532487905
date 0.01650937567134385
reorganization 0.014920320912647621
merging 0.014717803511731473
acquired 0.012681283504756821
prospectus 0.012258546110201685
surviving 0.010469987795383606
Top 10 sentences by weight
The content appears to focus on the merger, so the guess was close.
Format: weight + sentence
0.030845544238077883 After the Merger Date. Immediately following the Merger Date, the Acquired Portfolio will no longer be available as an investment option under the policies. In addition, for the sixty (60) days following the Merger Date, you may transfer all or a portion of your accumulation unit value out of the Investment Division for the Acquiring Portfolio to another investment option without any charge or limitation (except potentially harmful transfers (see the “Limits on Transfers” section in the Prospectus for your policy)) and without the transfer counting toward the number of free transfers that otherwise may be made in a given Policy Year. Such transfers will be based on the accumulation unit value of the Investment Division for the Acquiring Portfolio as of the close of the Business Day that we receive the transfer request. All other transfers are subject to limitations, and may be subject to charges, as described in the Prospectus for your policy. Please see the Prospectus for your policy for information on how to complete transfers from the Acquiring Portfolio to other investment options that we currently offer.
0.030211569807805698 Prior to the Merger Date. For forty-five (45) days before the Merger Date, if you have allocations to the Acquired Portfolio, you may transfer such allocations to any other available investment option without any charge or limitation (except potentially harmful transfers (see the “Limits on Transfers” section in the Prospectus for your policy)) and without the transfer counting toward the number of free transfers that otherwise may be made in a given Policy Year. Such transfers will be based on the accumulation unit value of the Investment Division for the Acquired Portfolio as of the close of the Business Day that we receive the transfer request. All other transfers are subject to limitations, and may be subject to charges, as described in the Prospectus for your policy. Please see the Prospectus for your policy for information on how to complete transfers from the Investment Division for the Acquired Portfolio to other investment options that we currently offer.
0.02840028243826242 Prior to the Reorganization Date. With the exception listed below, for forty-five (45) days before the Reorganization Date, if you have allocations to the Merging Fund, you may transfer such allocations to any other available investment option without any charge or limitation (except potentially harmful transfers (see the “Limits on Transfers” section in the Prospectus for your policy)) and without the transfer counting toward the number of free transfers that otherwise may be made in a given Policy Year. Such transfers will be based on the accumulation unit value of the Investment Division for the Merging Fund as of the close of the Business Day that we receive the transfer request. All other transfers are subject to limitations, and may be subject to charges, as described in the Prospectus for your policy. Please see the Prospectus for your policy for information on how to complete transfers from the Investment Division for the Merging Fund to other investment options that we currently offer.
0.0274712995401908 the sixty (60) days following the Reorganization Date, you may transfer all or a portion of your accumulation unit value out of the Investment Division for the Surviving Fund to another investment option without any charge or limitation (except potentially harmful transfers (see the “Limits on Transfers” section in the Prospectus for your policy)) and without the transfer counting toward the number of free transfers that otherwise may be made in a given Policy Year. Such transfers will be based on the accumulation unit value of the Investment Division for the Surviving Fund as of the close of the Business Day that we receive the transfer request. All other transfers are subject to limitations, and may be subject to charges, as described in the Prospectus for your policy. Please see the Prospectus for your policy for information on how to complete transfers from the Surviving Fund to other investment options that we currently offer.
0.02445596932569474 On the Merger Date. Any of your allocations that remain in the Acquired Portfolio will be redeemed. Those redemptions will then be used to purchase accumulation units in the Investment Division for the Acquiring Portfolio. All policyowners affected by the Merger will receive written confirmation of the transaction. The redemption and subsequent repurchase transactions required to effectuate the Merger will not be treated as transfers that count toward the number of free transfers that may otherwise be made in a given Policy Year.
0.021829344645832766 At a meeting held on December 13, 2018, the Board of Trustees of Neuberger Berman Advisers Management Trust approved a Plan of Reorganization and Dissolution (the “Plan”) as of on or about April 30, 2019 (the “Reorganization Date”). As a result of the Reorganization, on the Reorganization Date, policyowners that have allocated cash value to the Large Cap Value Portfolio (the “Merging Fund”) will transfer all of its assets to the Sustainable Equity Portfolio (the “Surviving Fund”) in exchange for shares of the Surviving Fund. In addition to the Reorganization of the Merging Fund, the Board of Trustees also approved a separate reorganization of Neuberger Berman AMT Guardian Portfolio with and into the Surviving Fund.
0.021154204652431365 Until the Merger Date, we will continue to process automatic transactions (such as dollar cost averaging and automatic asset rebalancing) involving the Acquired Portfolio, unless you provide us with alternate allocation instructions. Also note that the Acquired Portfolio will not accept new premium payment allocations or transfers as of the Merger Date.
0.02092614650125736 At a meeting held on December 10-12, 2018, the Board of Trustees of the MainStay VP Funds Trust (the “Board”) approved the Merger of the MainStay VP Epoch U.S. Small Cap Portfolio (the “Acquired Portfolio”) into the MainStay VP MacKay Small Cap Core Portfolio (the “Acquiring Portfolio”), followed by the complete liquidation of the Acquired Portfolio. Shareholder approval is required for the Merger to take place. If shareholder approval is obtained on or about April 22, 2019, the effective date established for the Merger is expected to be on or about May 1, 2019 (the “Merger Date”).
0.020559526950775668 Policy Proceeds are payable under the policy to the named Beneficiary when the Insured dies. Upon receiving due proof of death at our Service Office in Good Order, We will pay the Beneficiary the Life Insurance Benefit determined as of the date we receive due proof of death in Good Order as part of the Policy Proceeds. All or part of the Policy Proceeds can be paid in cash or applied under one or more of Our payment options described under ‘Policy Payment Information—When We Pay Policy Proceeds—Payment Options.’
0.02049239852496608 As a result of the Merger, the Acquiring Portfolio will be available as an Investment Division under your policy on the Merger Date.
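Sentence importance curve
Summarizing a larger English article
Here "larger" means a document with more than 3,500 sentences and more than 360,000 words.
Skimming it, I could not tell what it is mainly about either.
So let's go straight to the results.
Keywords
Format: keyword + weight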
policy 0.015148000545573168
insurance 0.0076696873841799375
investment 0.006696896916816165
life 0.0061127787815478205
amount 0.005691912888323185
benefit 0.005606466435793987
rider 0.005400878053318155
premium 0.004975816146068638
death 0.004657684469154609
portfolio 0.004598998468890753
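The keywords seem to hit the nail on the head.
Top 10 sentences by weight
Some of the sentences look like duplicates. That is because these sentences really do appear several times in the document, and TextRank also considers them important. Whether they truly deserve such high weights needs the professional opinion of domain experts.
Still, for documents with several thousand sentences, I suggest outputting the Top 20 or more sentences to get a more objective picture.
Format: weight + sentence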
0.0009159391391327122 3 The mortality and expense risk face amount charge rate is based on the Age and Risk Class of the Insured and the Face Amount on the Policy Date. It also varies with the Death Benefit Option you choose. Each Coverage Segment will have a corresponding face amount charge related to the amount of the increase, based on the Age and Risk Class of the Insured at the time of the increase. The charge is level for 10 Policy Years from the effective date of the Coverage Segment, then is reduced in Policy Year 11 and thereafter. The mortality and expense risk face amount charges shown in the table may not be typical of the charges you will pay. Ask your life insurance producer for information regarding this charge for your Policy. The mortality and expense risk face amount charge for your Policy will be stated in the Policy Specifications.
0.0009159391391327117 3 The mortality and expense risk face amount charge rate is based on the Age and Risk Class of the Insured and the Face Amount on the Policy Date. It also varies with the Death Benefit Option you choose. Each Coverage Segment will have a corresponding face amount charge related to the amount of the increase, based on the Age and Risk Class of the Insured at the time of the increase. The charge is level for 10 Policy Years from the effective date of the Coverage Segment, then is reduced in Policy Year 11 and thereafter. The mortality and expense risk face amount charges shown in the table may not be typical of the charges you will pay. Ask your life insurance producer for information regarding this charge for your Policy. The mortality and expense risk face amount charge for your Policy will be stated in the Policy Specifications.
0.0009101920742634353 1 Cost of insurance rates apply uniformly to all members of the same Class. The cost of insurance charges shown in the table may not be typical of the charges you will pay. Your Policy Specifications will indicate the guaranteed cost of insurance charge applicable to your Policy, and more detailed information concerning your cost of insurance charges is available on request from your life insurance producer or us. Also, before you purchase the Policy, you may request personalized Illustrations of your future benefits under the Policy based upon the Insured’s Class, the Death Benefit Option, Face Amount, planned periodic premiums, and any Riders requested. Cost of insurance rates for your Policy will be stated in the Policy Specifications and calculated using the Net Amount At Risk.
0.0009085211753153032 1 Cost of insurance rates apply uniformly to all members of the same Class. The cost of insurance charges shown in the table may not be typical of the charges you will pay. Your Policy Specifications pages will indicate the guaranteed cost of insurance charge applicable to your Policy, and more detailed information concerning your cost of insurance charges is available on request from your life insurance producer or us. Also, before you purchase the Policy, you may request personalized Illustrations of your future benefits under the Policy based upon the Insured’s Class, the Death Benefit Option, Face Amount, planned periodic premiums, and any Riders requested. Cost of insurance rates for your Policy will be stated in the Policy Specifications and calculated using the Net Amount At Risk.
0.0009071944047640959 1 Cost of insurance rates apply uniformly to all members of the same Class. The cost of insurance charges shown in the table may not be typical of the charges you will pay. Your Policy Specifications pages will indicate the guaranteed cost of insurance charge applicable to your Policy, and more detailed information concerning your cost of insurance charges is available on request from your life insurance producer or us. Also, before you purchase the Policy, you may request personalized illustrations of your future benefits under the Policy based upon the Insured’s Class, the Death Benefit Option, Face Amount, planned periodic premiums, and any Riders requested. Cost of insurance rates for your Policy will be stated in the Policy Specifications and calculated per $1.00 of Net Amount At Risk.
0.0008978323334814181 · M&E risk asset charge We deduct a risk asset charge every month at a guaranteed maximum annual rate of 0.45% (0.0375% monthly) on the first $25,000 of your Policy’s Accumulated Value in the Investment Options plus an annual rate of 0.05% (0.0042% monthly) of the Accumulated Value in the Investment Options that exceeds $25,000. We may charge a lower annual rate for the M&E risk asset charge. For the purposes of this charge, the amount of Accumulated Value is calculated on the Monthly Payment Date before we deduct the monthly charge, but after we deduct any Policy Debt or allocate any new Net Premiums, withdrawals or loans. When the Insured reaches Age 100, the annual rate is reduced to 0%.
0.0008645495698352475 3 The mortality and expense risk face amount charge rate is based on the Age and Risk Class of the Insured and the Face Amount on the Policy Date. It also varies with the Death Benefit Option you choose. Each Coverage Segment will have a corresponding face amount charge related to the amount of the increase, based on the Age and Risk Class of the Insured at the time of the increase. The mortality and expense risk face amount charges shown in the table may not be typical of the charges you will pay. Ask your life insurance producer for information regarding this charge for your Policy. The mortality and expense risk face amount charge for your Policy will be stated in the Policy Specifications.
0.0008583041949445304 2 The surrender charge is based on the Age and Risk Class of the Insured, as well as the Death Benefit Option you choose. The surrender charge reduces to $0 after 10 years from the effective date of each Coverage Segment. The surrender charge shown in the table may not be typical of the surrender charge you will pay. Ask your life insurance producer for information on this charge for your Policy. The surrender charge and maximum surrender charge for your Policy will be stated in the Policy Specifications.
0.0008583041949445304 2 The surrender charge is based on the Age and Risk Class of the Insured, as well as the Death Benefit Option you choose. The surrender charge reduces to $0 after 10 years from the effective date of each Coverage Segment. The surrender charge shown in the table may not be typical of the surrender charge you will pay. Ask your life insurance producer for information on this charge for your Policy. The surrender charge and maximum surrender charge for your Policy will be stated in the Policy Specifications.
0.0008583041949445304 2 The surrender charge is based on the Age and Risk Class of the Insured, as well as the Death Benefit Option you choose. The surrender charge reduces to $0 after 10 years from the effective date of each Coverage Segment. The surrender charge shown in the table may not be typical of the surrender charge you will pay. Ask your life insurance producer for information on this charge for your Policy. The surrender charge and maximum surrender charge for your Policy will be stated in the Policy Specifications.
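Since the same footnote text shows up several times in this Top 10, one option is to drop duplicate and near-duplicate sentences before taking the top N. The sketch below is only a suggestion, not a feature of textrank4zh: the function name dedupe_key_sentences and the 0.9 similarity threshold are assumptions, and it expects items shaped like the results printed above (each with a 'sentence' and a 'weight').

```python
from difflib import SequenceMatcher


def dedupe_key_sentences(items, num=10, threshold=0.9):
    """Return up to `num` highest-weighted sentences, skipping near-duplicates.

    A sentence is skipped when its character-level similarity to an already
    kept sentence is at least `threshold` (1.0 means identical text).
    """
    kept = []  # list of (normalized_text, original_item)
    for item in sorted(items, key=lambda it: it['weight'], reverse=True):
        # Collapse runs of whitespace so formatting differences are ignored.
        text = ' '.join(item['sentence'].split())
        is_duplicate = any(
            SequenceMatcher(None, text, kept_text).ratio() >= threshold
            for kept_text, _ in kept
        )
        if is_duplicate:
            continue
        kept.append((text, item))
        if len(kept) == num:
            break
    return [item for _, item in kept]


if __name__ == '__main__':
    sample = [
        {'sentence': 'The surrender charge is based on age.', 'weight': 0.3},
        {'sentence': 'The surrender charge  is based on age.', 'weight': 0.2},
        {'sentence': 'Cost of insurance rates apply uniformly.', 'weight': 0.1},
    ]
    for row in dedupe_key_sentences(sample, num=10):
        print(row['weight'], row['sentence'])
```

Comparing every candidate with every kept sentence is slow for long lists, so in practice it makes sense to run this only on the top few dozen ranked sentences. Whether repeated text should count as more important is still a question for domain experts.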