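Today I am starting on HTML file processing, mainly following the book Natural Language Processing with Python.
1. Reading a local file and displaying its contents.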
>>> f=open('try111.txt',encoding='utf-8')
>>> raw=f.read()
>>> print(raw)
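Alternatively, to print the text line by line without doubled line breaks (each line's own trailing newline has to be stripped, because print adds one), do the following:
f=open('try111.txt',encoding='utf-8')
for line in f:
    print(line.strip())
The output is the content of the text file try111.txt: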
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 24 10:54:42 2018
@author: Jasmine
"""
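Today is March 24, 2018, a Saturday, with light rain
Today I am studying Chapter 3 of the Python natural language processing book
About processing HTML files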
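Both snippets above leave the file handle open. A minimal sketch of the same read using a with block, which closes the file automatically. The helper name read_stripped_lines and the throwaway demo file are assumptions for illustration, not part of the original notes:

```python
import os
import tempfile


def read_stripped_lines(path, encoding="utf-8"):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, encoding=encoding) as f:  # closed automatically on exit
        return [line.rstrip("\n") for line in f]


# Build a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

demo_lines = read_stripped_lines(demo_path)
for line in demo_lines:
    print(line)

os.remove(demo_path)  # clean up the throwaway file
```

Using rstrip("\n") instead of strip() keeps any leading indentation in the line, which matters if the file being read is itself source code.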
2. A small example using print
>>> s=input('enter some text:')
enter some text:today is 2018 03 24,the weather is very good.
>>> print("you typed",len(s),"characters")
you typed 45 characters
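Note that len(s) counts characters, not words. A minimal sketch that reports both, using str.split() as a rough stand-in for a real tokenizer. The helper name count_chars_and_words is hypothetical:

```python
def count_chars_and_words(text):
    """Return (number of characters, number of whitespace-separated tokens)."""
    return len(text), len(text.split())


sample = "today is 2018 03 24,the weather is very good."
n_chars, n_words = count_chars_and_words(sample)
print("you typed", n_chars, "characters and", n_words, "words")
```

Because split() only breaks on whitespace, "24,the" counts as a single token here; a proper tokenizer would separate the punctuation.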
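3. Accessing individual characters: a small experiment with the Gutenberg corpus from NLTK (this assumes NLTK is installed and the corpus data has been downloaded, e.g. with nltk.download('gutenberg'))
>>> import nltk
>>> from nltk.corpus import gutenberg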
>>> raw=gutenberg.raw('melville-moby_dick.txt')
>>> fdist=nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
>>> fdist.keys()
dict_keys(['m', 'o', 'b', 'y', 'd', 'i', 'c', 'k', 'h', 'e', 'r', 'a', 'n', 'l', 'v', 't', 'g', 's', 'u', 'p', 'w', 'x', 'q', 'f', 'j', 'z'])
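The same letter count can be tried without downloading any corpus. The sketch below swaps in collections.Counter from the standard library for nltk.FreqDist and uses a short sample string instead of the Moby Dick text; both substitutions are assumptions made so the example runs on its own:

```python
from collections import Counter


def letter_frequencies(text):
    """Count alphabetic characters, ignoring case."""
    return Counter(ch.lower() for ch in text if ch.isalpha())


# A short sample string stands in for the Moby Dick text.
sample_counts = letter_frequencies("Call me Ishmael.")
print(sorted(sample_counts.keys()))
print(sample_counts.most_common(1))
```

The keys alone only show which letters occur; most_common() is what ranks them by frequency.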
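4. Using in to test whether a substring occurs in another string
>>> phrase='And now for something completely different'
>>> if 'thing' in phrase:
...     print("found 'thing'")
...
found 'thing'
Use .find() to get the position of a substring within a string (it returns -1 if the substring does not occur).
>>> phrase.find('for')
8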
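Since find() reports only the first occurrence, a small loop is needed to collect them all. A minimal sketch, with find_all as a hypothetical helper name:

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub (overlaps included)."""
    positions = []
    start = text.find(sub)
    while start != -1:              # find() returns -1 when nothing is left
        positions.append(start)
        start = text.find(sub, start + 1)
    return positions


phrase = "And now for something completely different"
print("thing" in phrase)            # membership test
print(phrase.find("for"))           # index of the first match
print(phrase.find("xyz"))           # -1: not present
print(find_all(phrase, "e"))        # every position of the letter e
```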
5. Looking up the path of a specific file from Python
>>> path=nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> print(path)
D:\Anaconda\nltk_data\corpora\unicode_samples.zip\unicode_samples\polish-lat2.txt
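6. Using regular expressions
1) Build a list of lowercase English words
>>> import re
>>> wordlist=[w for w in nltk.corpus.words.words('en') if w.islower()]
Printing wordlist displays every lowercase word in the list.
2) re.search(p, s) checks whether the pattern p occurs anywhere in the string s. For example, to find the words ending in ed: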
[w for w in wordlist if re.search('ed$',w)]
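To see what the $ anchor contributes, the sketch below runs the pattern with and without it on a tiny hand-made word list. The list is an assumption that stands in for nltk.corpus.words so the example runs without corpus data:

```python
import re

# A tiny hand-made word list stands in for nltk.corpus.words.
sample_words = ["abandoned", "education", "red", "edit", "walked", "bed"]

# 'ed$' requires the match to sit at the end of the word.
ends_in_ed = [w for w in sample_words if re.search(r"ed$", w)]
# 'ed' alone matches the letters anywhere in the word.
contains_ed = [w for w in sample_words if re.search(r"ed", w)]

print(ends_in_ed)   # ['abandoned', 'red', 'walked', 'bed']
print(contains_ed)  # all six words
```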
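3) The wildcard "." matches any single character. For example, to find every eight-letter word whose third letter is j and whose sixth letter is t, put a wildcard in each of the other positions.
Here "^" marks the start of the string and "$" marks the end.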
[w for w in wordlist if re.search('^..j..t..$',w)]
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']
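The anchors are what pin the word length to exactly eight letters. A minimal sketch on a made-up word list (an assumption for illustration) shows that dropping them lets longer words through:

```python
import re

sample_words = ["abjectly", "majestic", "subjectivity", "objectively", "jet"]

# Anchored: the whole word must be exactly eight characters long.
anchored = [w for w in sample_words if re.search(r"^..j..t..$", w)]
# Unanchored: any eight-character stretch inside the word may match.
unanchored = [w for w in sample_words if re.search(r"..j..t..", w)]

print(anchored)    # ['abjectly', 'majestic']
print(unanchored)  # also includes 'subjectivity' and 'objectively'
```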
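4) For a four-letter word, the candidates for each position can be restricted with a character class, so the pattern matches only the words that fit.
For example, in the regular expression ^[ghi][mno][jlk][def]$ the first class matches a first letter of g, h, or i, the second class matches a second letter of m, n, or o, and so on; the pattern matches every word that satisfies all four constraints.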
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$',w)]
['gold', 'golf', 'hold', 'hole']
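One way to read these four classes is as the keys 4, 6, 5, 3 on a phone keypad. That reading, the KEYPAD table, and the helper keypad_pattern below are assumptions added for illustration; the sketch builds the same kind of pattern from a digit string:

```python
import re

# Letters printed on each key of a phone keypad (digits 2-9).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def keypad_pattern(digits):
    """Build an anchored regex with one character class per keypad digit."""
    return "^" + "".join("[" + KEYPAD[d] + "]" for d in digits) + "$"


# A small made-up word list stands in for the full wordlist.
sample_words = ["gold", "golf", "hold", "hole", "home", "golden"]
pattern = keypad_pattern("4653")
matches = [w for w in sample_words if re.search(pattern, w)]

print(pattern)  # ^[ghi][mno][jkl][def]$
print(matches)  # ['gold', 'golf', 'hold', 'hole']
```

The order of letters inside a character class does not matter, so [jkl] here is equivalent to the [jlk] used above.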
5) For more regular expression symbols, see this other article: https://www.cnblogs.com/kuqs/p/5727409.html
>>> wsj=sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$',w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', '1.20', '1.24', '1.25', '1.26', '1.28', '1.35', '1.39', '1.4', '1.457', '1.46', '1.49', '1.5', '1.50', '1.55', '1.56', '1.5755', '1.5805', '1.6', '1.61', '1.637', '1.64', '1.65', '1.7', '1.75', '1.76', '1.8', '1.82', '1.8415', '1.85', '1.8500', '1.9', '1.916', '1.92', '10.19', '10.2', '10.5', '107.03', '107.9', '109.73', '11.10', '11.5', '11.57', '11.6', '11.72', '11.95', '112.9', '113.2', '116.3', '116.4', '116.7', '116.9', '118.6', '12.09', '12.5', '12.52', '12.68', '12.7', '12.82', '12.97', '120.7', '1206.26', '121.6', '126.1', '126.15', '127.03', '129.91', '13.1', '13.15', '13.5', '13.50', '13.625', '13.65', '13.73', '13.8', '13.90', '130.6', '130.7', '131.01', '132.9', '133.7', '133.8', '14.00', '14.13', '14.26', '14.28', '14.43', '14.5', '14.53', '14.54', '14.6', '14.75', '14.99', '141.9', '142.84', '142.85', '143.08', '143.80', '143.93', '148.9', '149.9', '15.5', '150.00', '153.3', '154.2', '16.05', '16.09', '16.125', '16.2', '16.5', '16.68', '16.7', '16.9', '169.9', '17.3', '17.4', '17.5', '17.95', '1738.1', '176.1', '18.3', '18.6', '18.95', '185.9', '188.84', '19.3', '19.50', '19.6', '19.94', '19.95', '191.9', '2.07', '2.1', '2.15', '2.19', '2.2', '2.25', '2.29', '2.3', '2.30', '2.35', '2.375', '2.4', '2.42', '2.44', '2.46', '2.47', '2.5', '2.50', '2.6', '2.62', '2.65', '2.7', '2.75', '2.8', '2.80', '2.87', '2.875', '2.9', '2.95', '20.07', '20.5', '21.1', '21.9', '2141.7', '2160.1', '2163.2', '22.75', '220.45', '221.4', '225.6', '23.25', '23.4', '23.5', '23.72', '234.4', '236.74', '236.79', '24.95', '25.50', '25.6', '251.2', '26.2', '26.5', '26.8', '263.07', '2645.90', '2691.19', '27.1', '27.4', '273.5', '278.7', '28.25', '28.36', '28.4', '28.5', '28.53', '28.6', '29.3', '29.4', '29.9', '292.32', '3.01', '3.04', '3.1', '3.16', 
'3.18', '3.19', '3.2', '3.20', '3.23', '3.253', '3.28', '3.3', '3.35', '3.375', '3.4', '3.42', '3.43', '3.5', '3.55', '3.6', '3.61', '3.625', '3.7', '3.75', '3.8', '3.80', '3.9', '30.6', '30.9', '319.75', '32.8', '334.5', '34.625', '341.20', '3436.58', '35.2', '35.7', '352.7', '352.9', '35500.64', '35564.43', '36.9', '361.8', '3648.82', '37.3', '37.5', '372.14', '372.9', '374.19', '374.20', '377.60', '38.3', '38.375', '38.5', '38.875', '387.8', '4.1', '4.10', '4.2', '4.25', '4.3', '4.4', '4.5', '4.55', '4.6', '4.7', '4.75', '4.8', '4.875', '4.898', '4.9', '40.21', '41.60', '415.6', '415.8', '42.1', '42.5', '422.5', '43.875', '434.4', '436.01', '446.62', '449.04', '45.2', '45.3', '45.75', '456.64', '46.1', '47.1', '47.125', '47.5', '47.6', '49.9', '494.50', '497.34', '5.1', '5.2180', '5.276', '5.29', '5.3', '5.39', '5.4', '5.435', '5.5', '5.57', '5.6', '5.63', '5.7', '5.70', '5.8', '5.82', '5.9', '5.92', '50.1', '50.38', '50.45', '51.25', '51.6', '55.1', '566.54', '57.50', '57.6', '57.7', '58.64', '59.6', '59.9', '6.03', '6.1', '6.20', '6.21', '6.25', '6.4', '6.40', '6.44', '6.5', '6.50', '6.53', '6.6', '6.7', '6.70', '6.79', '6.84', '6.9', '60.36', '618.1', '62.1', '62.5', '62.625', '63.79', '630.9', '64.5', '66.5', '7.15', '7.2', '7.20', '7.272', '7.3', '7.4', '7.40', '7.422', '7.45', '7.458', '7.5', '7.50', '7.52', '7.55', '7.60', '7.62', '7.63', '7.65', '7.74', '7.78', '7.79', '7.8', '7.80', '7.84', '7.88', '7.90', '7.95', '70.2', '70.7', '705.6', '72.7', '734.9', '737.5', '77.56', '77.6', '77.70', '8.04', '8.06', '8.07', '8.1', '8.12', '8.14', '8.15', '8.19', '8.2', '8.22', '8.25', '8.30', '8.35', '8.45', '8.467', '8.47', '8.48', '8.5', '8.50', '8.53', '8.55', '8.56', '8.575', '8.60', '8.64', '8.65', '8.70', '8.75', '8.9', '80.50', '80.8', '81.8', '811.9', '83.4', '84.29', '84.9', '85.1', '85.7', '86.12', '87.5', '88.32', '89.7', '89.9', '9.3', '9.32', '9.37', '9.45', '9.5', '9.625', '9.75', '9.8', '9.82', '9.9', '92.9', '93.3', '93.9', '94.2', '94.8', '95.09', 
'96.4', '98.3', '99.1', '99.3']
This finds every token that is a decimal number: one or more digits, a literal period, then one or more digits.
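The backslash before the period matters, because an unescaped "." is the wildcard. A minimal sketch comparing the two on a few hypothetical tokens (not taken from the Treebank data); raw strings (r'...') are used so Python passes the backslash through to the regex engine unchanged:

```python
import re

# Hypothetical tokens standing in for the Treebank vocabulary.
sample_tokens = ["0.0085", "1.125", "100", "3.", ".5", "1a5"]

decimal = re.compile(r"^[0-9]+\.[0-9]+$")  # \. is a literal period
loose = re.compile(r"^[0-9]+.[0-9]+$")     # . matches any character

strict_matches = [t for t in sample_tokens if decimal.search(t)]
loose_matches = [t for t in sample_tokens if loose.search(t)]

print(strict_matches)  # ['0.0085', '1.125']
print(loose_matches)   # ['0.0085', '1.125', '100', '1a5']
```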
>>> [w for w in wsj if re.search(r'^[A-Z]+\$$',w)]
['C$', 'US$']
This finds every token made of one or more uppercase letters A-Z followed by a single literal $ at the end.
>>> [w for w in wsj if re.search('^[0-9]{4}$',w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005', '2009', '2017', '2019', '2029', '3057', '8300']
This finds every token made of exactly four digits.
>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$',w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']
This finds every token made of one or more digits, a hyphen, and then three to five lowercase letters.
>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$',w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']
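This finds every token made of at least five lowercase letters, a hyphen, two or three lowercase letters, another hyphen, and then at most six lowercase letters.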
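The counted quantifiers {n}, {m,n}, {n,} and {,n} used in the last three patterns can be tried side by side. The token list below is made up for illustration and is not drawn from the Treebank corpus:

```python
import re

sample_tokens = ["1987", "19870", "10-day", "100-share", "5-minutes",
                 "father-in-law", "black-and-white", "mother-in-law-to-be"]

# {4}: exactly four digits.
four_digits = [t for t in sample_tokens if re.search(r"^[0-9]{4}$", t)]
# {3,5}: between three and five lowercase letters after the hyphen.
num_word = [t for t in sample_tokens
            if re.search(r"^[0-9]+-[a-z]{3,5}$", t)]
# {5,}: five or more; {2,3}: two or three; {,6}: zero to six.
three_part = [t for t in sample_tokens
              if re.search(r"^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$", t)]

print(four_digits)  # ['1987']
print(num_word)     # ['10-day', '100-share']
print(three_part)   # ['father-in-law', 'black-and-white']
```

Note that {,6} also allows zero letters, so a token ending in a bare hyphen would match the last pattern as well.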
>>> [w for w in wsj if re.search('(ed|ing)$',w)]
This finds every word ending in ed or ing.
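The parentheses are needed so that $ applies to both alternatives. A minimal sketch on a made-up word list (an assumption for illustration) shows what changes without them, and how the matched suffix can be read back from the group:

```python
import re

sample_words = ["walked", "walking", "education", "ring", "red", "singer"]

# Grouped: either suffix must sit at the end of the word.
grouped = [w for w in sample_words if re.search(r"(ed|ing)$", w)]
# Ungrouped: 'ed' anywhere in the word, or 'ing' at the end.
ungrouped = [w for w in sample_words if re.search(r"ed|ing$", w)]

# group(1) returns the text matched by the parenthesized alternative.
suffix = re.search(r"(ed|ing)$", "walking").group(1)

print(grouped)    # ['walked', 'walking', 'ring', 'red']
print(ungrouped)  # also includes 'education'
print(suffix)     # ing
```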