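Yesterday I put together a small multi-process project that scrapes movie listings, sorts the movie information into fields, and saves it to an Excel sheet. From now on there is no need to look films up one by one on Douban when picking something to watch: just open the Excel sheet and filter it. It is quick and convenient, so let's get started.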
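First, install pandas. The code below also uses requests, beautifulsoup4, lxml and fake-useragent, and pandas needs openpyxl to write .xlsx files:
pip install pandas requests beautifulsoup4 lxml fake-useragent openpyxl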
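The key code for collecting the movie URLs from the home page is as follows:
import requests
from bs4 import BeautifulSoup

def get_home(self, url):
    response = requests.get(url)
    response.encoding = "gb2312"
    if response.status_code == 200:
        bs = BeautifulSoup(response.text, features="lxml")
        a = bs.select("td[height='26']")
        for n, b in enumerate(a):
            # Skip the even-indexed cells; only the odd-indexed ones are used
            if n % 2 == 0:
                continue
            # The second <a> in the cell carries the path of the movie page
            info_url = "https://www.ygdy8.com" + b.select("a")[1]["href"]
            print(info_url)
            self.movie_url.append(info_url)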
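A small aside on the two steps in get_home: skipping the even-indexed cells can also be written as a slice, and string concatenation only works while every href is a site-relative path. The sketch below shows a slightly more defensive variant using urllib.parse.urljoin from the standard library. The helper names odd_items and absolute_urls, the BASE constant, and the sample paths are made up for illustration and are not part of the original project.

```python
from urllib.parse import urljoin

BASE = "https://www.ygdy8.com"  # site root, as used in get_home


def odd_items(items):
    """Keep the items at odd indices (1, 3, 5, ...), like the n % 2 check."""
    return items[1::2]


def absolute_urls(hrefs, base=BASE):
    """Resolve each href against the site root (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]


# Made-up hrefs, only for illustration
cells = ["skip-0", "/html/gndy/dyzz/1.html", "skip-2",
         "https://www.ygdy8.com/html/gndy/dyzz/2.html"]
urls = absolute_urls(odd_items(cells))
```

urljoin leaves an href that is already absolute untouched, so the same call handles both kinds of link.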
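The code for getting the movie information is as follows:
from fake_useragent import UserAgent

def get_info(self):
    # requests expects headers as a dict, passed with the headers keyword
    headers = {"User-Agent": UserAgent().random}
    for url in self.movie_url:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = "gb2312"
            bs = BeautifulSoup(response.text, "html.parser")
            # The details are one block of text with fields separated by "◎"
            span = bs.select("span[style='FONT-SIZE: 12px']")[0].text
            span = span.replace("\u3000", "").split("◎")
            print(span)
            ym = span[1]      # 译名: translated title
            name = span[2]    # 片名: original title
            old = span[3]     # 年代: year
            place = span[4]   # 产地: country/region of origin
            date = span[8]    # 上映日期: release date
            douban = span[9]  # 豆瓣评分: Douban rating
            # Append one row to the table
            self.df.loc[len(self.df) + 1] = {
                "译名": ym, "片名": name, "年代": old,
                "产地": place, "上映日期": date, "豆瓣评分": douban,
            }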
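The positional indices in get_info (span[8], span[9] and so on) assume every page lists its fields in exactly the same order. A hedged alternative is to look each field up by its label instead. The sketch below assumes that each "◎"-separated chunk starts with its label once the full-width spaces are removed; parse_fields, LABELS and the sample text are hypothetical and only illustrate the idea. Unlike the original code, it also strips the label from the stored value.

```python
LABELS = ("译名", "片名", "年代", "产地", "上映日期", "豆瓣评分")


def parse_fields(text, labels=LABELS):
    """Map each known label to its value; labels not found map to ''."""
    chunks = text.replace("\u3000", "").split("◎")
    fields = {label: "" for label in labels}
    for chunk in chunks:
        chunk = chunk.strip()
        for label in labels:
            if chunk.startswith(label):
                # Keep only the text after the label
                fields[label] = chunk[len(label):].strip()
                break
    return fields


# Made-up sample in the assumed layout, only for illustration
sample = (
    "◎译\u3000\u3000名\u3000Example Title"
    "◎片\u3000\u3000名\u3000Example"
    "◎年\u3000\u3000代\u30002020"
    "◎豆瓣评分\u30007.5/10"
)
row = parse_fields(sample)
```

A page that lacks a field then yields an empty cell instead of an IndexError or a value in the wrong column.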
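Create the temporary table, an empty DataFrame that must exist before get_info runs (this needs import pandas as pd):
# Columns: translated title, original title, year, origin, release date, Douban rating
self.df = pd.DataFrame(columns=("译名", "片名", "年代", "产地", "上映日期", "豆瓣评分"))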
Finally, write the xlsx file (s is the instance of the scraper class):
s.df.to_excel("movie.xlsx")
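Appending to the DataFrame one row at a time, as get_info does, is fine for a handful of pages. For longer runs, a common pattern is to collect plain dicts in a list and build the DataFrame once at the end. This is a minimal sketch of that pattern, assuming pandas is installed; build_table, COLUMNS and the sample rows are made up for illustration and are not the original project's code.

```python
import pandas as pd

COLUMNS = ["译名", "片名", "年代", "产地", "上映日期", "豆瓣评分"]


def build_table(rows, columns=COLUMNS):
    """Build the DataFrame in one go from a list of row dicts."""
    return pd.DataFrame(rows, columns=columns)


# Made-up rows, only for illustration
rows = [
    {"译名": "Example A", "片名": "A", "年代": "2020",
     "产地": "US", "上映日期": "2020-01-01", "豆瓣评分": "7.5"},
    {"译名": "Example B", "片名": "B", "年代": "2021",
     "产地": "UK", "上映日期": "2021-02-02", "豆瓣评分": "8.0"},
]
df = build_table(rows)
```

Calling df.to_excel("movie.xlsx", index=False) on the result writes the sheet without the extra index column.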