A Way to Extract Tags from a Large Number of Article Titles

Data

The data is a large collection of research paper titles.

Approach

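A tag is a feature that several articles have in common, so we cannot look at a single title in isolation; we have to consider the whole collection.
Titles contain many stop words and punctuation marks, which should be removed.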
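Use an n-gram model: slide a window of n consecutive words over the text. With n = 2 (bigrams), every pair of adjacent words becomes a candidate tag.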
With these pieces in place, we can make a rough first pass at finding tags for the articles.
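The idea can be sketched with nothing but the standard library. This is a minimal illustration, not the code used in this post: the sample titles and the tiny stop word set are made up, and the bigrams are counted per title, which avoids pairs that straddle the boundary between two titles.

```python
import re
from collections import Counter

# Made-up sample titles and a tiny stop word set, for illustration only
TITLES = [
    "Design and Implementation of a Distributed File System",
    "A Fast File System for Flash Storage",
    "Fault Tolerance in a Distributed Operating System",
]
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def title_bigrams(title):
    """Lowercase a title, drop punctuation and stop words, return adjacent word pairs."""
    words = re.sub(r"[:,.?]+", " ", title.lower()).split()
    words = [w for w in words if w not in STOP_WORDS]
    return list(zip(words, words[1:]))


# Count every bigram across the whole collection of titles
counts = Counter()
for title in TITLES:
    counts.update(title_bigrams(title))

print(counts.most_common(3))
```

A pair that shows up in many titles, such as ("file", "system") here, is a candidate tag.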

Tools

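Python and its NLTK natural language processing library make this very convenient.
The Chinese-language NLTK documentation is a useful reference.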

Method 1

Follow the approach above step by step, with the help of NLTK.

import re

# English tokenizer, bigram helper and frequency counter.
# word_tokenize and stopwords need the NLTK data packages, installed via nltk.download().
from nltk import FreqDist, bigrams, word_tokenize
from nltk.corpus import stopwords
from pymongo import MongoClient

client = MongoClient("192.168.33.131", 27017)  # connect to the database that holds the data
text = ""
for a in client.ccf.article.find():
    text += " " + a['title']  # join all titles into one text
client.close()

stop_words = set(stopwords.words('english'))  # English stop words
text = text.lower()  # lowercase to make the analysis case-insensitive
text = re.sub(r'[:,.?]+', ' ', text)  # replace unwanted punctuation with spaces
tokens = word_tokenize(text)  # tokenize
tokens = [w for w in tokens if w not in stop_words]  # drop stop words

# Frequency count of adjacent word pairs; FreqDist is essentially a dict of counts
bigram_dist = FreqDist(bigrams(tokens))
print(bigram_dist.most_common(100))  # print the 100 most frequent bigrams

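This is the output. The bigram ('file', 'system') occurs 15 times and ('operating', 'system') 14 times, so roughly 15 titles are about file systems, 14 are about operating systems, and so on.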

[(('file', 'system'), 15), (('operating', 'system'), 14), (('distributed', 'systems'), 13), (('fault', 'tolerance'), 7), (('preface', 'special'), 6), (('special', 'issue'), 5), (('virtual', 'memory'), 5), (('mutual', 'exclusion'), 5), (('design', 'implementation'), 4), (('distributed', 'file'), 4), (('shared', 'memory'), 4), (('reuse', 'distance'), 3), (('storage', 'system'), 3), (('operating', 'systems'), 3), (('memory', 'management'), 3), (('run-time', 'support'), 3), (('distributed', 'system'), 3), (('shared-memory', 'multiprocessors'), 3), (('network', 'file'), 3), (('issue', 'operating'), 3), (('distributed', 'mutual'), 3), (('interprocess', 'communication'), 3), (('optimal', 'parallel'), 2), (('warehouse-scale', 'computers'), 2), (('power', 'energy'), 2), (('ix', 'operating'), 2), (('system', 'combining'), 2), (('combining', 'low'), 2), (('low', 'latency'), 2), (('latency', 'high'), 2), (('high', 'throughput'), 2), (('throughput', 'efficiency'), 2), (('efficiency', 'protected'), 2), (('protected', 'dataplane'), 2), (('cache', 'hierarchies'), 2), (('distance', 'analysis'), 2), (('value', 'prediction'), 2), (('virtual', 'machine'), 2), (('content-based', 'publish/subscribe'), 2), (('scheduling', 'improve'), 2), (('multicore', 'systems'), 2), (('memory', 'systems'), 2), (('garbage', 'collection'), 2), (('networks', 'efficient'), 2), (('wireless', 'ad'), 2), (('ad', 'hoc'), 2), (('hoc', 'networks'), 2), (('load', 'balancing'), 2), (('byzantine', 'fault'), 2), (('thread-level', 'speculation'), 2), (('membership', 'service'), 2), (('multiprocessor', 'cache'), 2), (('cache', 'miss'), 2), (('real-time', 'systems'), 2), (('case', 'study'), 2), (('performance', 'analysis'), 2), (('replicated', 'services'), 2), (('multimedia', 'applications'), 2), (('speculative', 'execution'), 2), (('system', 'using'), 2), (('commodity', 'operating'), 2), (('data', 'structures'), 2), (('branch', 'prediction'), 2), (('area', 'networks'), 2), (('storage', 'systems'), 2), (('performance', 
'prediction'), 2), (('hardware', 'support'), 2), (('support', 'network'), 2), (('secure', 'distributed'), 2), (('design', 'evaluation'), 2), (('shared', 'virtual'), 2), (('network', 'interface'), 2), (('file', 'systems'), 2), (('automatically', 'parallelized'), 2), (('parallelized', 'programs'), 2), (('programs', 'using'), 2), (('performance', 'evaluation'), 2), (('system', 'based'), 2), (('traffic', 'control'), 2), (('control', 'systems'), 2), (('disk', 'scheduling'), 2), (('heterogeneous', 'distributed'), 2), (('systems', 'using'), 2), (('lightweight', 'recoverable'), 2), (('recoverable', 'virtual'), 2), (('i/o', 'performance'), 2), (('kernel', 'support'), 2), (('continuous', 'media'), 2), (('multiprocessors', 'preface'), 2), (('architectural', 'support'), 2), (('system', 'principles'), 2), (('concurrency', 'control'), 2), (('data', 'types'), 2), (('exclusion', 'algorithms'), 2), (('special', 'section'), 2), (('measurement', 'modeling'), 2), (('modeling', 'computer'), 2), (('cache', 'performance'), 2), (('systems', 'disk'), 2), (('naming', 'service'), 2)]
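The raw counts split some topics across singular and plural forms, for example ('file', 'system') and ('file', 'systems'). One possible clean-up step is sketched below. It is an assumption on my part rather than part of the original method: the `normalize` rule is a deliberately naive singularization (a real stemmer or lemmatizer would do better), and `min_count=3` is an arbitrary threshold.

```python
from collections import Counter

# A few (bigram, count) pairs copied from the output above
bigram_counts = [
    (("file", "system"), 15),
    (("operating", "system"), 14),
    (("distributed", "systems"), 13),
    (("operating", "systems"), 3),
    (("distributed", "system"), 3),
    (("file", "systems"), 2),
    (("case", "study"), 2),
]


def normalize(bigram):
    """Join a bigram into one tag, dropping a plural 's' from the last word (naive rule)."""
    first, last = bigram
    if last.endswith("s") and not last.endswith(("ss", "is", "us")):
        last = last[:-1]
    return first + " " + last


def candidate_tags(pairs, min_count=3):
    """Merge counts of bigrams that normalize to the same tag; keep the frequent ones."""
    merged = Counter()
    for bigram, count in pairs:
        merged[normalize(bigram)] += count
    return [(tag, n) for tag, n in merged.most_common() if n >= min_count]


print(candidate_tags(bigram_counts))
```

The same pass is a natural place to drop noise such as ('preface', 'special'), which is frequent but says nothing about the topic.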

Method 2

A ready-made NLTK method: fully automatic?
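After tokenizing the text, wrap the tokens in NLTK's Text class. That unlocks the collocations method, which does the analysis automatically and filters out stop words and punctuation on its own.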

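from nltk import Text, word_tokenize  # NLTK, a Python natural language processing library
from pymongo import MongoClient

import util  # local module (not shown); util.mongodb holds the MongoDB host


def get_article_tags(num=1000):
    client = MongoClient(util.mongodb, 27017)  # connect to the MongoDB database
    text = ""
    for a in client.ccf.article.find():  # join all titles into one text
        text += " " + a['title']
    client.close()
    text = text.lower()  # lowercase so that duplicates merge
    tokens = word_tokenize(text)  # tokenize
    # Build an NLTK Text and call the method directly.
    # collocations() prints its result and returns None, so it is not wrapped in print().
    Text(tokens).collocations(num=num)


get_article_tags()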

The output is much the same.

operating system; file system; fault tolerance; mutual exclusion;
distributed systems; special issue; reuse distance; virtual memory;
interprocess communication; automatically parallelized; content-based
publish/subscribe; garbage collection; protected dataplane; warehouse-
scale computers; load balancing; thread-level speculation; run-time
support; shared memory; case study; shared-memory multiprocessors;
continuous media; combining low; branch prediction; lightweight
recoverable; parallelized programs; special section; low latency;
byzantine fault; recoverable virtual; virtual machine; naming service;
replicated services; area networks; hoc networks; multimedia
applications; value prediction; data types; cache hierarchies;
speculative execution; commodity operating; high throughput;
concurrency control; distributed mutual; distance analysis; optimal
parallel; traffic control; data structures; membership service; cache
miss; network interface; replicated data; memory management; network
file; architectural support; kernel support; multiprocessor cache;
distributed file; shared virtual; hardware support; disk scheduling;
fault-tolerant distributed; system principles; heterogeneous
distributed; secure distributed; programs using; performance
prediction; storage system; operating systems; performance evaluation;
system based; real-time systems; i/o performance; performance
analysis; control systems; multicore systems; cache performance;
storage systems; distributed system; memory systems; file systems
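For readers curious about what happens behind the scenes, the sketch below builds a similar ranking by hand with `BigramCollocationFinder`. As far as I can tell from the NLTK source, `Text.collocations` works along these lines (a minimum frequency of 2, a filter on short words and stop words, ranking by likelihood ratio), but treat those details as an assumption. The token list and the small stop word set are made up for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Made-up tokens and a tiny stop word set, for illustration only
tokens = ("file system design for a file system cache and "
          "operating system kernel of an operating system").split()
STOP_WORDS = {"a", "an", "and", "for", "of", "the"}

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop pairs that occur fewer than 2 times
# Drop pairs that contain a very short word or a stop word
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in STOP_WORDS)
# Rank the remaining pairs by the likelihood-ratio association measure
top = finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)

print("; ".join(" ".join(pair) for pair in top))
```

Unlike Method 1, the ranking here is by association strength rather than raw frequency, which is why the two outputs list the same pairs in a different order.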

Once We Have the Tags

Tag the articles in the database.

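Run an exact phrase search against a text index to find the articles that match each tag, then write the tag onto those articles.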
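A minimal sketch of that step with pymongo follows. It assumes a text index on the `title` field and stores the tags in an array field named `tags`; the field name `tags` and both function names are my own choices, not something taken from the original database. Wrapping the phrase in double quotes inside `$search` asks MongoDB for an exact phrase match.

```python
def phrase_query(tag):
    """Build a $text filter that matches the tag as an exact phrase."""
    # The escaped double quotes turn the search string into a phrase search
    return {"$text": {"$search": '"{}"'.format(tag)}}


def tag_articles(collection, tags):
    """Add each tag to every article whose title contains it as a phrase.

    `collection` is expected to be a pymongo Collection.
    """
    # A text index on the title field is required before $text can be used
    collection.create_index([("title", "text")])
    for tag in tags:
        # $addToSet avoids storing the same tag twice on one article
        collection.update_many(phrase_query(tag), {"$addToSet": {"tags": tag}})
```

With the database from this post, the call would look something like `tag_articles(MongoClient(host, 27017).ccf.article, ["file system", "operating system"])`, where `host` stands for the MongoDB address.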
