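Write a program that implements the following features step by step:
1. Download the content of the page https://en.wikipedia.org/wiki/Machine_translation and save it as mt.html. The download must be done in code.
2. Count every word inside the <p> tags of mt.html, together with its number of occurrences, and store the result in mt_word.txt.
mt_word.txt must meet the following requirements:
a) One word per line: the word first, then its number of occurrences, separated by a Tab (\t).
b) Words are sorted by count in descending order. For example, if word a occurs 100 times and word b occurs 10 times, a comes before b.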
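
The two requirements for mt_word.txt can be illustrated with a small sketch. The function name `format_word_counts` is hypothetical, and the alphabetical tie-break is an assumption, since the task does not say how words with equal counts should be ordered.

```python
def format_word_counts(counts):
    """Return the text of mt_word.txt for a mapping of word -> count."""
    # Sort by count, largest first; break ties alphabetically (an assumption,
    # the task does not say how ties should be ordered).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # One word per line: the word, a Tab, then the count.
    return "".join(f"{word}\t{count}\n" for word, count in ordered)


print(format_word_counts({"b": 10, "a": 100}), end="")
```

Returning the text instead of writing it directly keeps the formatting rule easy to check separately from file handling.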

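3. Extract all year information from mt.html (for example, four-digit numbers on the page such as 1629 and 1951 are years) and store it in mt_year.txt.

mt_year.txt must meet the following requirements:

a) One year per line.

b) Years are sorted from past to present. For example, if the article contains 2007 and 1997, 1997 comes before 2007.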

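Requirements:

1. Use Python only, and only Python's built-in functions and standard library.

2. Submit the runnable program together with mt.html, mt_word.txt, and mt_year.txt.

3. Complete the task within one hour.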

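1. Download the content of the page https://en.wikipedia.org/wiki/Machine_translation and save it as mt.html. The download must be done in code.

# requests is a third-party package (pip install requests), so this version
# does not satisfy the standard-library-only requirement of the task.
import requests

session = requests.session()
response = session.get(url="https://en.wikipedia.org/wiki/Machine_translation")
# Write the raw bytes so the file keeps the page's original encoding
with open('mt.html', 'wb') as f:
    f.write(response.content)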
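
The task allows only the standard library, while requests is a third-party package. The following is a minimal sketch of the same download with `urllib.request`. The helper names `build_request` and `download` and the User-Agent string are hypothetical. The explicit User-Agent is an assumption: some servers reject clients that do not identify themselves.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/Machine_translation"
# Hypothetical User-Agent string; replace it with one that identifies your script.
USER_AGENT = "mt-interview-exercise/0.1 (example script)"


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download(url, path):
    """Fetch url, write the raw response bytes to path, return the byte count."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling `download(URL, "mt.html")` would then produce the file required by step 1. The later steps can read mt.html from disk instead of relying on the in-memory response.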

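2. Count every word inside the <p> tags of mt.html, together with its number of occurrences, and store the result in mt_word.txt.

Parse the page and collect the text of every <p> tag:

# BeautifulSoup (package beautifulsoup4) and the lxml parser are third-party
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, features="lxml")
tag2 = soup.find_all(name='p')
list_p = []
for i in tag2:
    list_p.append(i.get_text())

# Join all paragraph texts into a single string
str_p = ' '.join(list_p)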
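
BeautifulSoup is also outside the standard library. One possible stdlib-only way to collect the paragraph text is a small subclass of `html.parser.HTMLParser`. This is a sketch; the names `ParagraphTextParser` and `paragraph_texts` are hypothetical, and it assumes that <p> elements are not nested inside each other.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph texts
        self._buffer = None   # text pieces of the open <p>, None outside one

    def _finish(self):
        # Store the current paragraph, if one is open.
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._finish()  # a new <p> implicitly closes an unclosed one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        # Text inside inline tags such as <b> or <a> is kept as well.
        if self._buffer is not None:
            self._buffer.append(data)


def paragraph_texts(html):
    """Return a list with the text of each <p> element in html."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()    # flush any text still buffered by the parser
    parser._finish()  # store a final <p> that was never closed
    return parser.paragraphs
```

With mt.html read into a string `html`, `' '.join(paragraph_texts(html))` would play the role of `str_p`.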

from collections import Counter

# Count whole tokens rather than calling str_p.count(word): str.count counts
# substrings, so a short word such as "a" would also be found inside longer words.
word_counts = Counter()
for word in str_p.split():
    word = word.strip(',.()""/; ')
    if word == '':
        continue
    word_counts[word] += 1

# Sort the words by count in descending order, then write them to the file
blist = word_counts.most_common()
with open('mt_word.txt', 'w', encoding='utf-8') as f:
    for word, count in blist:
        line = word + '\t' + str(count) + '\n'
        f.write(line)
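
Splitting on whitespace and stripping a fixed set of punctuation leaves tokens such as "MT's" or "[1]" in the output and treats "The" and "the" as different words. A sketch of a stricter tokenizer follows. The definition of a word and the case folding are assumptions, since the task does not define either; `WORD_RE` and `count_words` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally joined by
# apostrophes or hyphens (e.g. "don't", "rule-based"); digits are skipped.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words(text):
    """Count whole-word occurrences, ignoring case."""
    return Counter(match.lower() for match in WORD_RE.findall(text))


counts = count_words("The cat saw the hat. A rule-based cat!")
for word, n in counts.most_common(3):
    print(f"{word}\t{n}")
```

Because the counter is fed whole tokens, a one-letter word is counted only when it stands alone, not when it appears inside another word.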

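3. Extract all year information from mt.html (for example, four-digit numbers on the page such as 1629 and 1951 are years) and store it in mt_year.txt.

import re

# Match runs of exactly four digits, so that parts of longer numbers are skipped
year = re.compile(r'(?<!\d)\d{4}(?!\d)')
years_list = year.findall(response.text)
# Remove duplicates; four-digit strings sort chronologically as plain strings
years_list = sorted(set(years_list))
with open('mt_year.txt', 'w') as f:
    for y in years_list:
        line = y + '\n'
        f.write(line)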
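
The task asks for the years found in mt.html, so the extraction can also work from the saved file instead of the in-memory response. The sketch below follows the task's rule that any four-digit number counts as a year, which means four-digit numbers in the markup (ids, sizes) are matched too. Dropping duplicates is an assumption carried over from the code above; `extract_years` and `write_years` are hypothetical names.

```python
import re

# Exactly four digits, not embedded in a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")


def extract_years(text):
    """Return the distinct four-digit numbers in text, oldest first."""
    return sorted(set(YEAR_RE.findall(text)), key=int)


def write_years(html_path, out_path):
    """Read html_path, extract the years, write one per line to out_path."""
    with open(html_path, encoding="utf-8") as f:
        years = extract_years(f.read())
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(year + "\n" for year in years)
    return years
```

Calling `write_years("mt.html", "mt_year.txt")` would then produce the file required by step 3.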
