weixin_39637661

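Python 3 scraper for Douban: [Study Notes] Python 3 Web Scraping

Case 1: A Python 3 proxy pool scraper

1. Find a public proxy IP site

For example, Xici Free Proxy IP (西刺免费代理IP)

2. Write XPath expressions to filter the data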

//tr/td[2]/text()

//tr/td[3]/text()

//tr/td[6]/text()
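To see what these three expressions select, here is a minimal sketch that runs equivalent queries against a tiny hand-written table. It uses the standard library's `xml.etree.ElementTree` rather than lxml; its limited XPath subset supports positional predicates such as `td[2]` but not `text()`, so the text is read from `.text` instead. The sample rows and the `extract_columns` helper are made up for illustration and do not come from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample rows mimicking the proxy table layout (not real data).
SAMPLE = """
<table>
  <tr><td>CN</td><td>10.0.0.1</td><td>8080</td><td>City</td><td>High</td><td>HTTP</td></tr>
  <tr><td>CN</td><td>10.0.0.2</td><td>3128</td><td>City</td><td>High</td><td>HTTPS</td></tr>
</table>
"""


def extract_columns(markup):
    """Return (ip, port, scheme) tuples from columns 2, 3 and 6 of each row."""
    root = ET.fromstring(markup)
    ips = [td.text for td in root.findall('.//tr/td[2]')]
    ports = [td.text for td in root.findall('.//tr/td[3]')]
    kinds = [td.text for td in root.findall('.//tr/td[6]')]
    return list(zip(ips, ports, kinds))


rows = extract_columns(SAMPLE)
print(rows)
```

XPath positions are 1-based, so `td[2]` is the second cell of each row, not the third.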

3. Write the code

import random
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class ProxySpider(object):
    def __init__(self):
        self.baseurl = 'https://www.xicidaili.com/nn/{}'
        self.xpathip = '//tr/td[2]/text()'
        self.xpathport = '//tr/td[3]/text()'
        self.xpathhttps = '//tr/td[6]/text()'
        self.ua = UserAgent()

    def request_html(self, url):
        # Fetch the proxy list page; return None if the request fails
        try:
            header = {'User-Agent': 'Mozilla/4.0'}
            return requests.get(url=url, headers=header, timeout=8).text
        except Exception as e:
            print(e)
            return None

    def proxy_request_html(self, url, ip, isHttps):
        # Send one request through the proxy; True means it answered in time
        time.sleep(random.randint(1, 2))
        # Requests looks up the proxy by the scheme of the target URL
        scheme = 'https' if isHttps else 'http'
        proxy = {scheme: 'http://' + ip}
        try:
            header = {'User-Agent': self.ua.random}
            requests.get(url=url, headers=header, proxies=proxy, timeout=8)
            return True
        except Exception as e:
            print(ip, e)
            return False

    def get_html(self, url):
        print(url)
        html = self.request_html(url)
        if html:
            self.parse_html(html)

    def parse_html(self, html):
        xpathobj = etree.HTML(html)
        item_ip = xpathobj.xpath(self.xpathip)
        item_port = xpathobj.xpath(self.xpathport)
        item_http = xpathobj.xpath(self.xpathhttps)
        for ip, port, kind in zip(item_ip, item_port, item_http):
            test_ip = ip + ':' + port
            print('Testing proxy', test_ip)
            if kind == 'HTTPS':
                self.test_proxy(test_ip, True)
            elif kind == 'HTTP':
                self.test_proxy(test_ip, False)

    def test_proxy(self, proxy_address, isHttps):
        # The test URL uses the same scheme as the proxy entry; with a
        # mismatched scheme Requests would skip the proxy and connect directly
        scheme = 'https' if isHttps else 'http'
        ret = self.proxy_request_html(scheme + '://www.baidu.com/', proxy_address, isHttps)
        if ret:
            with open('proxy.log', 'a+') as f:
                f.write(proxy_address + '\n')
            print('Proxy is usable', proxy_address)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)


if __name__ == '__main__':
    spider = ProxySpider()
    spider.run()

4. Result

Judging from this run, the HTTP proxies are almost all usable, while the HTTPS proxies almost all fail.
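One thing worth keeping in mind when reading this result: Requests picks the proxy entry whose key matches the scheme of the URL being requested. The sketch below is a simplified model of that lookup (the real library also understands keys such as `all` and per-host entries); `pick_proxy` is a hypothetical helper, not part of Requests, and the address is made up.

```python
from urllib.parse import urlsplit


def pick_proxy(url, proxies):
    """Simplified model: return the proxy configured for the URL's scheme, or None."""
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)


http_only = {'http': 'http://10.0.0.1:8080'}  # made-up address

# An https:// URL finds no entry in an http-only mapping, so no proxy is used.
print(pick_proxy('https://www.baidu.com/', http_only))
print(pick_proxy('http://www.baidu.com/', http_only))
```

Under this model, checking an HTTP proxy against an `https://` URL would connect directly and say nothing about the proxy, so the scheme of the test URL should match the proxy type being checked.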

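Case 2: Python 3 scraper for Baidu Tieba with XPath

Use an XPath browser plugin to work out the selectors

Right-click the text or image you want to select and an XPath option appears in the menu; then press F12 and adjust the expression until it matches what you need.

2. Write the code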

Response Content

We can read the content of the server’s response. Consider the GitHub timeline again:

import requests

r = requests.get('https://api.github.com/events')

r.text

'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

r.encoding

'utf-8'

r.encoding = 'ISO-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
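As a rough illustration of "use r.content to find the encoding", the sketch below looks for a charset declaration in the raw bytes of an HTML page and decodes with it. It needs no network access: `raw` stands in for `r.content`, and `sniff_charset` is a hypothetical helper written for this note, not a Requests function.

```python
import re

# Matches <meta charset="..."> as well as content="text/html; charset=..."
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_charset(body, default='utf-8'):
    """Look for a charset declaration near the start of an HTML body."""
    match = META_CHARSET.search(body[:2048])
    return match.group(1).decode('ascii') if match else default


# Hypothetical GBK-encoded page standing in for r.content.
raw = '<html><head><meta charset="gbk"></head><body>你好</body></html>'.encode('gbk')

encoding = sniff_charset(raw)
text = raw.decode(encoding)  # with Requests: r.encoding = encoding, then read r.text
print(encoding)
print(text)
```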

Requests will also use custom encodings in the event that you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.

Binary Response Content

You can also access the response body as bytes, for non-text requests:

r.content

b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

For example, to create an image from binary data returned by a request, you can use the following code:

from PIL import Image

from io import BytesIO

i = Image.open(BytesIO(r.content))

JSON Response Content

There’s also a builtin JSON decoder, in case you’re dealing with JSON data:

import requests

r = requests.get('https://api.github.com/events')

r.json()

[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

In case the JSON decoding fails, r.json() raises an exception. For example, if the response gets a 204 (No Content), or if the response contains invalid JSON, attempting r.json() raises ValueError: No JSON object could be decoded.

It should be noted that the success of the call to r.json() does not indicate the success of the response. Some servers may return a JSON object in a failed response (e.g. error details with HTTP 500). Such JSON will be decoded and returned. To check that a request is successful, use r.raise_for_status() or check that r.status_code is what you expect.
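The two caveats above (an empty body fails to decode, and a decodable body does not imply success) can be combined into one guard. This is only a sketch using the standard `json` module on a status code and body string; `parse_json_body` is a hypothetical helper, and with Requests you would pass `r.status_code` and `r.text`.

```python
import json


def parse_json_body(status_code, body):
    """Return decoded JSON only for a successful, decodable response; otherwise None."""
    if not 200 <= status_code < 300:
        return None  # failed response: ignore any error JSON it carries
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(parse_json_body(200, '[{"id": 1}]'))
print(parse_json_body(204, ''))
print(parse_json_body(500, '{"error": "boom"}'))
```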

Raw Response Content

In the rare case that you’d like to get the raw socket response from the server, you can access r.raw. If you want to do this, make sure you set stream=True in your initial request. Once you do, you can do this:

r = requests.get('https://api.github.com/events', stream=True)

r.raw

r.raw.read(10)

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should use a pattern like this to save what is being streamed to a file:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.

import os
import random
import time
from urllib import parse

import requests
from fake_useragent import UserAgent
from lxml import etree


class BaiduTiebaSpider(object):
    def __init__(self):
        self.baseurl = r'http://tieba.baidu.com/f?kw={}&pn={}'
        self.title_baseurl = r'https://tieba.baidu.com{}'
        self.picXpath = r'//cc//img[@class="BDE_Image"]/@src'
        self.titleurlXpath = r'//li//a[@class="j_th_tit "]/@href'
        # '//' so the video block is matched anywhere in the page, not only at the root
        self.videoXpath = r'//div[@class="video_src_wrap_main"]/video/@src'
        self.ua = UserAgent()
        self.savePath = r'/home/user/work/spider/baidu/BaiduTieba/'

    def get_html(self, url):
        # header = {'User-Agent': self.ua.random}
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
        # Return bytes so the same method can download pages, images and videos
        return requests.get(url=url, headers=header).content

    def parse_html(self, html):
        tree = etree.HTML(html)
        for href in tree.xpath(self.titleurlXpath):
            titleurl = self.title_baseurl.format(href)
            print(titleurl)
            self.save_html(titleurl)
            time.sleep(random.randint(2, 3))

    def save_html(self, url):
        tree = etree.HTML(self.get_html(url))
        # Images and videos are saved the same way, named by the last 10 characters of the URL
        for link in tree.xpath(self.picXpath) + tree.xpath(self.videoXpath):
            self.save_img(link, self.savePath + link[-10:])

    def save_img(self, imgurl, filename):
        img = self.get_html(imgurl)
        with open(filename, 'wb') as f:
            f.write(img)
        print(filename, 'Download Success')

    def run(self):
        name = input('Name of the Tieba forum to search> ')
        start = int(input('Start Page> '))
        end = int(input('End Page> '))
        os.makedirs(self.savePath, exist_ok=True)
        for page in range(start, end + 1):
            # Assumption: each listing page holds 50 threads, so pn advances by 50
            mainurl = self.baseurl.format(parse.quote(name), (page - 1) * 50)
            print(mainurl)
            self.parse_html(self.get_html(mainurl))


if __name__ == '__main__':
    spider = BaiduTiebaSpider()
    spider.run()

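Pay attention to the request header (the User-Agent).

3. Result

Case 3: Python 3 scraper for Lianjia second-hand housing with XPath

XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

An understanding of XPath is therefore the foundation of many advanced XML applications.

The XPath plugin is available in the Chrome Web Store, and also in the 360 Browser extension center.

Open a web page and press F12; it is at the very end

3. Filter with XPath

Test the expressions here, and once they work, put them into the code.

Python 3 code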

import requests
from fake_useragent import UserAgent
from lxml import etree


class LianJiaSpider(object):
    def __init__(self):
        self.baseurl = 'https://sz.lianjia.com/ershoufang/pg{}/'
        self.ua = UserAgent()

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        html = requests.get(url, headers=header, timeout=5).text
        self.parse_html(html)

    def parse_html(self, html):
        tree = etree.HTML(html)
        li_list = tree.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
        for li in li_list:
            item = {}  # a fresh dict for every listing
            item['name'] = li.xpath('.//a[@data-el="region"]/text()')[0]
            # houseInfo is a single '|'-separated string
            info_list = li.xpath('.//div[@class="houseInfo"]/text()')[0].split('|')
            item['model'] = info_list[0].strip()
            item['area'] = info_list[1].strip()
            item['direction'] = info_list[2].strip()
            item['perfect'] = info_list[3].strip()
            item['floor'] = info_list[4].strip()
            item['age'] = info_list[5].strip()
            item['address'] = li.xpath('.//div[@class="positionInfo"]/a/text()')[0].strip()
            item['total'] = li.xpath('.//div[@class="totalPrice"]/span/text()')[0].strip()
            # Drop the 2-character prefix and 4-character unit suffix around the number
            item['unit'] = li.xpath('.//div[@class="unitPrice"]/span/text()')[0].strip()[2:-4]
            print(item)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)


if __name__ == '__main__':
    spider = LianJiaSpider()
    spider.run()

5. Result

Broke, broke, broke: I can't afford any of these, not even second-hand.
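The spider indexes the split `houseInfo` string by fixed position, which raises `IndexError` when a listing omits a field. A more forgiving variant might look like the sketch below; the field names match the spider's keys, while `parse_house_info` and the sample strings are hypothetical and not taken from the site.

```python
FIELDS = ['model', 'area', 'direction', 'perfect', 'floor', 'age']


def parse_house_info(text):
    """Split a '|'-separated houseInfo string into named fields.

    Missing trailing fields become None instead of raising IndexError.
    """
    parts = [p.strip() for p in text.split('|')]
    parts += [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))


# Hypothetical listing strings, not real data.
full = parse_house_info('2室1厅 | 75.5平米 | 南 | 精装 | 中楼层(共30层) | 2010年建')
short = parse_house_info('2室1厅 | 75.5平米 | 南')
print(full)
print(short)
```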

Case 4: Python 3 scraper for Baidu Images

import os
import random
import re
import time
from urllib import parse

import requests
from fake_useragent import UserAgent


class BaiduImgSpider(object):
    def __init__(self):
        self.baseurl = 'https://image.baidu.com/search/index?tn=baiduimage&word={}'
        self.count = 1
        self.ua = UserAgent()
        self.savepath = '/home/user/work/spider/day03/'
        self.re_str = r'{"thumbURL":"(.*?)","replaceUrl":'

    def get_html(self, name, orgname):
        # name is the URL-quoted keyword; orgname is the raw keyword, used as the folder name
        header = {'User-Agent': self.ua.random}
        url = self.baseurl.format(name)
        html = requests.get(url=url, headers=header).text
        pattern = re.compile(self.re_str, re.S)
        img_list = pattern.findall(html)
        path = self.savepath + orgname
        os.makedirs(path, exist_ok=True)
        for img_link in img_list:
            print(img_link)
            self.save_img(img_link, path)
            time.sleep(random.randint(1, 2))

    def save_img(self, url, path):
        header = {'User-Agent': self.ua.random}
        data = requests.get(url=url, headers=header).content
        # Files are numbered in download order: 1.jpg, 2.jpg, ...
        filename = path + '/' + str(self.count) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(data)
        print('Downloaded', filename)
        self.count += 1

    def run(self):
        search_name = input('Keyword to search for> ')
        word = parse.quote(search_name)
        self.get_html(word, search_name)


if __name__ == '__main__':
    spider = BaiduImgSpider()
    spider.run()

The code is given directly here; it is very simple.
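The core of this spider is a single non-greedy regular expression that captures whatever sits between the `thumbURL` key and the following `replaceUrl` key. The sketch below applies the same pattern (with the brace escaped for clarity) to a hand-written fragment; the sample string and example.com URLs are made up and only imitate the JSON embedded in the results page.

```python
import re

# Same idea as re_str in the spider: capture the value between two JSON keys.
THUMB_RE = re.compile(r'\{"thumbURL":"(.*?)","replaceUrl":', re.S)

# Hypothetical fragment imitating the data embedded in the page.
sample = (
    '{"thumbURL":"https://example.com/a.jpg","replaceUrl":[]},'
    '{"thumbURL":"https://example.com/b.jpg","replaceUrl":[]}'
)

links = THUMB_RE.findall(sample)
print(links)
```

The `?` in `(.*?)` matters: a greedy `(.*)` would swallow everything up to the last `replaceUrl` and return one long, useless match.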

Case 5: Python 3 scraper for Dianying Tiantang (电影天堂, dytt8.net)

import random
import re
import time
from urllib import request

from fake_useragent import UserAgent


class DyTTSpider(object):
    def __init__(self):
        self.base_url = 'https://www.dytt8.net'
        self.url_one = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
        self.ua = UserAgent()

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        req = request.Request(url, headers=header)
        res = request.urlopen(req)
        # Decode as GB2312 and skip any bytes that fail to decode
        return res.read().decode('gb2312', 'ignore')

    def re_html(self, html, restr):
        pattern = re.compile(restr, re.S)
        return pattern.findall(html)

    def parse_html(self, one_url):
        html_ret = self.get_html(one_url)
        re_str = r''  # pattern missing in the source: its HTML tags were stripped
        ret_list = self.re_html(html_ret, re_str)
        for link in ret_list:
            print(link)
            self.parse_second(self.base_url + link)
            time.sleep(random.randint(2, 3))

    def parse_second(self, second_html):
        item = {}
        html_ret = self.get_html(second_html)
        re_str = r'(.*?).*?'  # incomplete in the source: the HTML tags around the group are missing
        two_list = self.re_html(html_ret, re_str)
        item['name'] = two_list[0].strip()
        item['dlink'] = two_list[1].strip()
        print(item)

    def run(self):
        geturl = self.url_one.format(1)
        self.parse_html(geturl)


if __name__ == '__main__':
    dy = DyTTSpider()
    dy.run()

Case 6: Python 3 scraper for Youdao Translate

import random
import time
from hashlib import md5

import requests
from fake_useragent import UserAgent

'''
Signing logic from the site's JavaScript:

var t = n.md5(navigator.appVersion)
  , r = "" + (new Date).getTime()
  , i = r + parseInt(10 * Math.random(), 10);
return {
    ts: r,
    bv: t,
    salt: i,
    sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")
}
'''


class FanyiSpider(object):
    def __init__(self):
        self.baseurl = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.ua = UserAgent()

    def make_formdata_string(self, word):
        formdata = {
            "i": "",
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": "",
            "sign": "",
            "ts": "",
            "bv": "37074a7035f34bfbf10d32bb8587564a",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        formdata['i'] = word
        # ts: current time in milliseconds; salt: ts plus one random digit
        formdata['ts'] = str(int(time.time() * 1000))
        formdata['salt'] = formdata['ts'] + str(random.randint(0, 9))
        # sign: MD5 of client + word + salt + the fixed secret string
        signstring = "fanyideskweb" + word + formdata['salt'] + "Nw(nmmbP%A-r6U3EUn]Aj"
        s = md5()
        s.update(signstring.encode())
        formdata['sign'] = s.hexdigest()
        return formdata

    def make_headerString(self):
        # Content-Length is left out: Requests computes it from the body
        headerdata = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Cookie": "DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; [email protected]; JSESSIONID=abcd4CqXd2rvOfBBNVfgx; OUTFOX_SEARCH_USER_ID_NCOO=928698907.9532578; _ntes_nnid=163fba552b6912766f975a5c9077e584,1587086791577; SESSION_FROM_COOKIE=fanyiweb; YOUDAO_FANYI_SELECTOR=OFF; ___rl__test__cookies=1587095527239",
            "Host": "fanyi.youdao.com",
            "Origin": "http://fanyi.youdao.com",
            "Referer": "http://fanyi.youdao.com/?keyfrom=fanyi-new.logo",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest",
        }
        # headerdata['User-Agent'] = self.ua.random
        return headerdata

    def request_str(self):
        pass

    def post_html(self, headerdata, formdata):
        ret = requests.post(url=self.baseurl, data=formdata, headers=headerdata)
        print(ret.text)

    def run(self):
        word = input('Text to translate> ')
        formdata = self.make_formdata_string(word)
        self.post_html(self.make_headerString(), formdata)


if __name__ == '__main__':
    spider = FanyiSpider()
    spider.run()
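The signing step is the only non-obvious part of this case, so it may help to see it in isolation. The sketch below follows the JavaScript quoted in this case: `salt` is the millisecond timestamp plus one random digit, and `sign` is the MD5 of client + word + salt + secret. The `make_sign` helper and its optional `ts` and `digit` arguments are hypothetical; they exist only so the result can be reproduced.

```python
import random
import time
from hashlib import md5

CLIENT = 'fanyideskweb'
SECRET = 'Nw(nmmbP%A-r6U3EUn]Aj'  # constant copied from the JavaScript in this case


def make_sign(word, ts=None, digit=None):
    """Return (ts, salt, sign); salt = ts + one random digit, as in the JavaScript."""
    ts = ts if ts is not None else str(int(time.time() * 1000))
    digit = digit if digit is not None else random.randint(0, 9)
    salt = ts + str(digit)
    sign = md5((CLIENT + word + salt + SECRET).encode('utf-8')).hexdigest()
    return ts, salt, sign


# Fixed inputs make the output reproducible; omit them for a live request.
ts, salt, sign = make_sign('hello', ts='1587095527239', digit=7)
print(ts, salt, sign)
```

Because the secret string is embedded in the site's JavaScript, it can change whenever the site is updated, in which case the request would need the new value.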

Case 7: Tencent careers

This site is served over HTTP/2, but some of the parameters are not validated, so the request still works.

import requests
from fake_useragent import UserAgent


class TencentSpider(object):
    def __init__(self):
        self.baseurl = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1587104037054&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.header = {
            # HTTP/2 pseudo-headers seen in the browser; not sent by Requests
            # ":authority": "careers.tencent.com",
            # ":method": "GET",
            # ":path": "/tencentcareer/api/post/Query?timestamp=1587104037054&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=cn",
            # ":scheme": "https",
            "accept": "application/json, text/plain, */*",
            # "br" is left out: decoding Brotli responses requires an extra package
            "accept-encoding": "gzip, deflate",
            "accept-language": "zh-CN,zh;q=0.9",
            "referer": "https://careers.tencent.com/search.html?index={}",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        }
        self.ua = UserAgent()

    def get_html(self, pageindex):
        # Work on a copy so the referer template keeps its {} placeholder between calls
        header = dict(self.header)
        header['user-agent'] = self.ua.random
        header['referer'] = header['referer'].format(pageindex)
        url = self.baseurl.format(pageindex)
        ret = requests.get(url=url, headers=header)
        print(ret.text)

    def run(self):
        self.get_html(3)


if __name__ == '__main__':
    spider = TencentSpider()
    spider.run()
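The query URL in this case is one long hard-coded string with a fixed timestamp. As a sketch of a more maintainable alternative, the parameters can be kept in a dict and encoded with the standard library. The parameter names are the ones in the URL above; `build_query_url` is a hypothetical helper, and whether the server checks that the timestamp is fresh is not something the original confirms.

```python
import time
from urllib.parse import urlencode

API = 'https://careers.tencent.com/tencentcareer/api/post/Query'


def build_query_url(page_index, page_size=10, timestamp=None):
    """Assemble the query URL; the timestamp defaults to the current time in ms."""
    params = {
        'timestamp': timestamp if timestamp is not None else int(time.time() * 1000),
        'countryId': '', 'cityId': '', 'bgIds': '', 'productId': '',
        'categoryId': '', 'parentCategoryId': '', 'attrId': '', 'keyword': '',
        'pageIndex': page_index,
        'pageSize': page_size,
        'language': 'zh-cn',
        'area': 'cn',
    }
    return API + '?' + urlencode(params)


# Passing the original timestamp reproduces the hard-coded URL for page 3.
url = build_query_url(3, timestamp=1587104037054)
print(url)
```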

Case 8: Douban movies

Fairly simple overall: everything needed is visible in the F12 developer tools, and the site returns JSON.

import json
import random
import time
from urllib import parse

import requests
from fake_useragent import UserAgent


class DoubanMovieSpider(object):
    def __init__(self):
        self.baseurl = 'https://movie.douban.com/j/search_tags?type=movie&source='
        self.detilurl = 'https://movie.douban.com/j/search_subjects?type=movie&tag={}&sort=recommend&page_limit=20&page_start={}'
        self.ua = UserAgent()
        self.douban_movie_types = []

    def make_header(self):
        return {'User-Agent': self.ua.random}

    def get_html(self, url):
        return requests.get(url=url, headers=self.make_header()).text

    def run(self):
        # First request: the list of available movie tags
        movie_type = json.loads(self.get_html(self.baseurl))
        for item in movie_type['tags']:
            self.douban_movie_types.append(item)
            print(item)
        want_type = input('Category you are interested in> ')
        startpage = int(input('Start page> '))
        endpage = int(input('End page> '))
        if want_type in self.douban_movie_types:
            for page in range(startpage, endpage + 1):
                # page_limit is 20, so page N (counted from 1) starts at offset (N - 1) * 20
                url = self.detilurl.format(parse.quote(want_type), (page - 1) * 20)
                movie_infos = json.loads(self.get_html(url))
                for info in movie_infos['subjects']:
                    print(info['title'], info['rate'], info['url'])
                time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    spider = DoubanMovieSpider()
    spider.run()

Case 9: Xiaomi app store

Fetch the app categories and the total page count, then download the pages with multiple threads.

import json
import math
import time
from queue import Queue, Empty
from threading import Thread

import requests
from fake_useragent import UserAgent
from lxml import etree


class AppShopSpider(object):
    def __init__(self):
        self.mainurl = 'http://app.mi.com/'
        self.baseurl = 'http://app.mi.com/categotyAllListApi?page={}&categoryId={}&pageSize=30'
        self.xpathtype = '//ul[@class="category-list"]/li/a'
        self.xpathpagenum = '//div[@class="pages"]/a[6]/text()'
        self.q = Queue()
        self.ua = UserAgent()
        self.type_code = {}

    def make_url(self, categoryid, startpage, endpage):
        # Queue one URL per page in [startpage, endpage)
        for page in range(startpage, endpage):
            self.q.put(self.baseurl.format(page, categoryid))

    def get_url(self):
        # Worker loop: keep taking URLs until the queue is drained
        while True:
            try:
                url = self.q.get_nowait()
            except Empty:
                break
            print(url)
            self.parse_html(self.get_html(url))

    def get_html(self, url):
        header = {'User-Agent': self.ua.random}
        return requests.get(url=url, headers=header).text

    def parse_html(self, html):
        data = json.loads(html)
        for item in data['data']:
            print(item['displayName'])

    def get_typecode(self):
        # Map each category name to the id at the end of its link
        html = self.get_html(self.mainurl)
        xpathobj = etree.HTML(html)
        for item in xpathobj.xpath(self.xpathtype):
            apptype = item.xpath('./text()')[0]
            appcode = item.xpath('./@href')[0].split('/')[-1]
            self.type_code[apptype] = appcode
            print(apptype)

    def get_typepage(self, keyword):
        value = self.type_code.get(keyword)
        if value is None:
            return
        html = self.get_html(self.baseurl.format(0, value))
        count = int(json.loads(html)['count'])
        pagenum = math.ceil(count / 30)  # 30 apps per page
        print('Total pages>', pagenum)

    def run(self):
        self.get_typecode()
        instert_type = input('Category to fetch> ')
        self.get_typepage(instert_type)
        strtpage = int(input('Start page> '))
        endpage = int(input('End page> '))
        self.make_url(self.type_code[instert_type], strtpage, endpage)
        thread_list = []
        for i in range(5):
            # Pass the method itself; self.get_url() would run it in the main thread
            t = Thread(target=self.get_url)
            thread_list.append(t)
            t.start()
        for t in thread_list:
            t.join()


if __name__ == '__main__':
    starttime = time.time()
    spider = AppShopSpider()
    spider.run()
    endtime = time.time()
    print('Elapsed>%.2f' % (endtime - starttime))
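The multithreaded part of this case follows a common pattern: fill a queue with work, start a few threads that each drain it, then join them. Here is a self-contained sketch of that pattern with the network call replaced by a plain function so it can run anywhere; `run_workers` and the stand-in handler are hypothetical names introduced for illustration.

```python
from queue import Queue, Empty
from threading import Thread, Lock


def run_workers(tasks, handler, num_threads=5):
    """Process every task with a pool of threads and return the collected results."""
    q = Queue()
    for task in tasks:
        q.put(task)
    results, lock = [], Lock()

    def worker():
        # Each thread keeps pulling tasks until the queue is empty.
        while True:
            try:
                task = q.get_nowait()
            except Empty:
                return
            value = handler(task)
            with lock:
                results.append(value)

    # Pass the function itself (target=worker); target=worker() would run it
    # in the main thread and hand Thread its return value instead.
    threads = [Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in handler: the real spider would download and parse each URL here.
pages = run_workers(range(1, 11), lambda page: page * 30)
print(sorted(pages))
```

Using `get_nowait()` with `Empty` avoids the race in an `if not q.empty(): q.get()` check, where another thread may take the last item between the two calls and leave `get()` blocked forever.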

Case 10: JD (Jingdong) products

Use Selenium to drive the Chrome browser and scrape the data: slow, but simple.

2020-04-19

import time

from selenium import webdriver
from selenium.webdriver.common.by import By


class JDSpider(object):
    def __init__(self):
        self.baseurl = 'https://www.jd.com/'
        self.searchinput_xpath = '//*[@id="key"]'
        self.searchButton_xpath = '//*[@id="search"]/div/div[2]/button'
        self.browser = None
        self.detail_xpath = '//*[@id="J_goodsList"]/ul/li'
        self.nextpage_xpath = '//*[@id="J_bottomPage"]/span[1]/a[9]'
        self.sum = 0

    def get_html(self, url, word):
        # Open the home page, type the keyword and submit the search
        self.browser.get(url)
        search_input = self.browser.find_element(By.XPATH, self.searchinput_xpath)
        search_input.send_keys(word)
        time.sleep(3)
        self.send_click(self.searchButton_xpath)
        self.parse_html()

    def send_click(self, xpathstr):
        button = self.browser.find_element(By.XPATH, xpathstr)
        button.click()
        time.sleep(3)

    def scrollend(self):
        # Scroll to the bottom so lazily loaded items are rendered
        js = "var q=document.documentElement.scrollTop=100000"
        self.browser.execute_script(js)
        time.sleep(3)

    def parse_html(self):
        self.scrollend()
        li_list = self.browser.find_elements(By.XPATH, self.detail_xpath)
        for li in li_list:
            item = {}  # a fresh dict for every product
            item['price'] = li.find_element(By.XPATH, './/div[@class="p-price"]').text.strip()
            item['title'] = li.find_element(By.XPATH, './/div[@class="p-name p-name-type-2"]/a/em').text.strip()
            item['commit'] = li.find_element(By.XPATH, './/div[@class="p-commit"]/strong').text.strip()
            print(item)
            self.sum += 1

    def run(self):
        name = input('Keyword to search for> ')
        self.browser = webdriver.Chrome()
        self.get_html(self.baseurl, name)
        while True:
            # Keep clicking "next" until the page source shows it as disabled
            if self.browser.page_source.find('pn-next disable') == -1:
                self.browser.find_element(By.CLASS_NAME, 'pn-next').click()
                time.sleep(3)
                self.parse_html()
            else:
                break
        self.browser.quit()


if __name__ == '__main__':
    spider = JDSpider()
    spider.run()
    print('Total:', spider.sum)
