LMY~~

python数据采集

一、采集豆瓣电影 Top 250的数据采集
- 1.进入豆瓣 Top 250的网页
- 2.进入开发者选项
- 3.进入top250中去查看相关配置
- 4.添加其第三方库
- 5.进行爬虫的编写
- - - 反反爬处理--伪装浏览器
- 6、bs4库中beautifulSoup类的使用
- 7、储存到CSV中
- 备注
二、安居客数据采集
- 1.安居客的网页
- 2.导入from lxml import etree
- 3.将采集的字符串转换为html格式：etree.html
- 4.转换后的数据调用xPath(xpath的路径)：data.xpath（路径表达式）
三、拉勾网的数据采集
- （一）requests数据采集
- - 1.导入库
  - 2.向服务器发送请求
  - 3.下载数据
- （二）post方式请求（参数不在url里）
- - 1.通过网页解析找到post参数（请求数据Fromdata）并且参数定义到字典里
  - 2.服务器发送请求调用requests.post（data=fromdata）
- （三）数据解析——json
- - 1.导入 import——json
  - 2. 将采集的json数据格式转换为python字典：json.load（json格式数据）
  - 3.通过字典中的键访问值
- （四）数据存储到mysql中
- - 1.下载并导入pymysql import pymysql
  - 2.建立链接：conn=pymysql.Connenct(host,port,user,oassword,db,charset='utf8')
  - 3.写sql语句
  - 4.定义游标：cursor=conn.cursor()
  - 5.使用定义的游标执行sql语句：cursor.execute()sql
  - 6.向数据库提交数据：conn.commit()
  - 7.关闭游标:cursor.close()
  - 8.关闭链接：conn.close()
四、疫情数据采集
五、scrapy 数据采集
- （一）创建爬虫步骤
- - 1.创建项目，在命令行输入：scrapy startproject 项目名称
  - 2.创建爬虫文件：在命令行定位到spiders文件夹下：scrapy genspider 爬虫文件名网站域名
  - 3.运行程序：scrapy crawl 爬虫文件名
- （二）scrapy文件结构
- - 1.spiders文件夹:编写解析程序
  - 2.__init__.py：包标志文件，只有存在这个python文件的文件夹才是包
  - 3.items.py:定义数据结构，用于定义爬取的字段
  - 4.middlewares.py:中间件
  - 5.pipliness.py：管道文件，用于定义数据存储方向
  - 6.settings.py:配置文件，通常配置用户代理，管道文件开启。
- （三）scrapy爬虫框架使用流程
- - 1.配置settings.py文件：
  - - （1）User-Agent:"浏览器参数"
    - (2)ROBOTS协议：Fasle
    - （3）修改ITEM_PIPLINES：打开管道
  - 2.在item.py中定义爬取的字段
  - 3.在spiders文件夹中的parse（）方法中写解析代码并将解析结果提交给item对象
  - 4.在piplines.py中定义存储路径
  - 5.分页采集：
  - - 1）查找url的规律
    - 2）在爬虫文件中先定义页面变量，使用if判断和字符串格式化方法生成新url
    - 3）yield scrapy Requests(url,callback=parse)
- 案例
- - （一）安居客数据采集
  - - 1.ajk.py中
    - 2.pipelines.py中
    - 3.items.py中
    - 4.settings.py中
    - 5.middlewares.py中
  - （二）太平洋汽车数据采集
  - - 1.tpy.py中
    - 2.pipelines.py中
    - 3.items.py中
    - 4.settings.py中
    - 5.middlewares.py中

一、采集豆瓣电影 Top 250的数据采集

1.进入豆瓣 Top 250的网页

豆瓣电影 Top 250

2.进入开发者选项

3.进入top250中去查看相关配置

4.添加其第三方库

在其中进行添加

添加bs4、requests、lxml

5.进行爬虫的编写

（1）导入：import requests
（2）向服务器发送请求 requests=requests.get(url)
(3)获取网页数据：html =request.text

反反爬处理–伪装浏览器

1.定义变量geaders={用户代理}
2.向服务器发送请求时携带上代理信息：response=requests.get(url,headrs = h)

6、bs4库中beautifulSoup类的使用

1.导入：from bs4 import BeautifulSoup
2.定义soup对象：soup = BeautifulSoup(html,lxml)
3.使用soup对象调用类方法select(css选择器)
4.提取数据：
get_text()
text
string

7、储存到CSV中

1.先将数据存放到列表里
1.创建csv文档with open(‘xxx.csv’,‘a’,newlie=‘’) as f:
3.调用CSV模块中的write()方法：w=csv.write(f
4.将列表中的数据写入文档：w.writerows(listdata)

import requests,csv
from bs4 import BeautifulSoup
def getHtml(url):
    # 数据采集
    h ={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }# 看下面的备注
    response =requests.get(url,headers = h)
    html=response.text
    # print(html)
    # 数据解析：正则，BeautifulSoup，Xpath
    soup=BeautifulSoup(html,'lxml')
    filmtitle =soup.select('div.hd > a > span:nth-child(1)')
    ct=soup.select('div.bd > p:nth-child(1)')
    score=soup.select(' div.bd > div > span.rating_num')
    evalue =soup.select('div.bd > div > span:nth-child(4)')
    # print(ct)
    flimlist=[];
    for t,c,s,e in zip(filmtitle,ct,score,evalue):
        title = t.text
        content= c.text
        filmscore=s.text
        num=e.text.strip('人评价')
        director=content.strip().split()[1]
        if "主演:" in content:
            actor=content.strip().split('主演:')[1].split()[0]
        else:
            actor=None
        year=content.strip().split('/')[-3].split()[-1]
        area=content.strip().split('/')[-2].split()
        filmtype=content.strip().split('/')[-1].split()
        listdata=[title,director,actor,year,area,filmtype,num,filmscore]
        flimlist.append(listdata)
    print(flimlist)
        # 存储数据
    with open('douban250.csv','a',encoding='utf-8',newline="") as f:
        w=csv.writer(f)
        w.writerows(flimlist)
# 函数调用
with open('douban250.csv','a',encoding='utf-8',newline="") as f:
    w=csv.writer(f)
    listtitle=['title','director','actor','year','area','type','score','evealueate']
    w.writerow(listtitle)
for i in range(0,226,25):
    getHtml('https://movie.douban.com/top250?start=%s&filter='%(i))

备注

h 是到网站的开发者页面去寻找这句话就是

二、安居客数据采集

1.安居客的网页

安居客

2.导入from lxml import etree

3.将采集的字符串转换为html格式：etree.html

4.转换后的数据调用xPath(xpath的路径)：data.xpath（路径表达式）

import requests
from lxml import etree
def getHtml(url):
    h={
        'user - agent': 'Mozilla / 5.0(Windows NT 10.0;Win64;x64)'
    }
    response = requests.get(url,headers=h)
    html=response.text
    # 数据解析
    data=etree.HTML(html)
    print(data)
    name=data.xpath('//span[@class="items-name"]/text()')
    print(name)
getHtml("https://bj.fang.anjuke.com/?from=AF_Home_switchcity")

三、拉勾网的数据采集

（一）requests数据采集

1.导入库

2.向服务器发送请求

3.下载数据

（二）post方式请求（参数不在url里）

1.通过网页解析找到post参数（请求数据Fromdata）并且参数定义到字典里

2.服务器发送请求调用requests.post（data=fromdata）

（三）数据解析——json

1.导入 import——json

2. 将采集的json数据格式转换为python字典：json.load（json格式数据）

3.通过字典中的键访问值

（四）数据存储到mysql中

1.下载并导入pymysql import pymysql

2.建立链接：conn=pymysql.Connenct(host,port,user,oassword,db,charset=‘utf8’)

3.写sql语句

4.定义游标：cursor=conn.cursor()

5.使用定义的游标执行sql语句：cursor.execute()sql

6.向数据库提交数据：conn.commit()

7.关闭游标:cursor.close()

8.关闭链接：conn.close()

这次的爬虫是需要账户的所以h是使用的Request Headers中的数据

import requests, json,csv,time,pymysql

keyword = input("请输入查询的职务")

def getHrml(url):
    # 数据采集
    h = {
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'content-length': '25',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'cookie': 'JSESSIONID=ABAAAECABIEACCA0BB433B062238BA98562B2C4DC03D33A; WEBTJ-ID=20210713%E4%B8%8A%E5%8D%889:11:14091114-17a9d6b06f45b5-0f9ac091808606-6373264-1327104-17a9d6b06f531f; RECOMMEND_TIP=true; PRE_UTM=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; user_trace_token=20210713091113-ad1a1d99-8c61-4649-86de-caa60fecdd96; LGUID=20210713091113-57f6a01d-c07d-4d7d-a226-089ac21198e3; privacyPolicyPopup=false; sajssdk_2015_cross_new_user=1; sensorsdata2015session=%7B%7D; _ga=GA1.2.1716223378.1626138675; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1626138675; LGSID=20210713091113-eef33caa-4fff-426f-a12f-d9ccdc49edab; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D1PmCn3%5F6suzwEKiV8mGzsySp2-ZqV4jK6AtxxnmDtde%26wd%3D%26eqid%3Dabee6e3e0030ff6d0000000360ece82b; _gid=GA1.2.1912526675.1626138675; index_location_city=%E5%85%A8%E5%9B%BD; hasDeliver=0; __lg_stoken__=a754529f8432345bce5a926838ce3ba208e2eb1b5e2aa7d40fdad449dd8fbc5fc68b31dd9e0579699a42fe25571f070de507bfb1ccd0d4cd334b58732e2966e98e41f591c3d3; gate_login_token=1a7455a0d75a2c4afa5e07a08311e9e93141da1f752d719873fcd5aadaffe363; LG_LOGIN_USER_ID=8ed9d2581d433f9543578454ebfed80f4ee5c5bee7be1dd4632f11fafed9de49; LG_HAS_LOGIN=1; _putrc=022DEADA81248A58123F89F2B170EADC; login=true; unick=%E7%94%A8%E6%88%B77650; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; __SAFETY_CLOSE_TIME__22228481=1; X_HTTP_TOKEN=3e22ba0cf8e3ba80701931626146c8e7a89d916f14; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2222228481%22%2C%22first_id%22%3A%2217a9d6b080a526-0c7af230c4b51d-6373264-1327104-17a9d6b080b931%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%2291.0.4472.124%22%2C%22lagou_company_id%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217a9d6b080a526-0c7af230c4b51d-6373264-1327104-17a9d6b080b931%22%7D; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1626139109; TG-TRACK-CODE=search_code; LGRID=20210713091903-7c87ddc0-90c1-4e87-8970-4dbeefbb4c24; SEARCH_ID=2061158f47e14c7b87878919bbf28317',
        'origin': 'https://www.lagou.com',
        'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'sec-ch-ua': '"Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'x-anit-forge-code': '0',
        'x-anit-forge-token': None,
        'x-requested-with': 'XMLHttpRequest'
    }
    # 定义post方式参数
    for num in range(1, 4):
        fromdata = {
            'first': 'true',
            'pn': num,
            'kd': keyword
        }
    # 向服务器发送请求
        response = requests.post(url, headers=h, data=fromdata)
        # 下载数据
        joinhtml = response.text
        # print(type(joinhtml))
        # 解析数据
        dictdata = json.loads(joinhtml)
        # print(type(dictdata))
        ct = dictdata['content']['positionResult']['result']
        positonlist = []
        for i in range(0, dictdata['content']['pageSize']):
            positionName = ct[i]['positionName']
            companyFullName = ct[i]['companyFullName']
            city = ct[i]['city']
            district = ct[i]['district']
            companySize = ct[i]['companySize']
            education = ct[i]['education']
            salary = ct[i]['salary']
            salaryMonth = ct[i]['salaryMonth']
            workYear = ct[i]['workYear']
            datalist=[positionName,companyFullName,city,district,companySize,education,salary,salaryMonth,workYear]
            positonlist.append(datalist)
            print(datalist)
            # 建立链接
            conn = pymysql.Connect(host='localhost', port=3306, user='root', passwd='lmy3.1415926', db='lagou',charset='utf8')
            # sql
            sql="insert into lg values('%s','%s','%s','%s','%s','%s','%s','%s','%s');"%(positionName,companyFullName,city,district,companySize,education,salary,salaryMonth,workYear)
            # 创建游标并执行
            cursor =conn.cursor()
            try:
                cursor.execute(sql)
                conn.commit()
            except:
                conn.rollback()
            cursor.close()
            conn.close()


        # 设置时间延迟
        time.sleep(3)
        with open('拉勾网%s职位信息.csv'%keyword,'a',encoding="utf-8",newline="") as f:
            w=csv.writer(f)
            w.writerows(positonlist)
# 调用函数
title=['职位名称','公司名称','公司所在城市','所属地区','公司人数','学历要求','薪资范围','薪资月','作年薪']
with open('拉勾网%s职位信息.csv'%keyword,'a',encoding="utf-8",newline="") as f:
    w=csv.writer(f)
    w.writerow(title)
getHrml('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false')

四、疫情数据采集

360疫情数据

import requests,json,csv,pymysql

def getHrml(url):
    h={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    response = requests.get(url, headers=h)
    joinhtml = response.text
    directro=joinhtml.split('(')[1]
    directro2=directro[0:-2]
    dictdata = json.loads(directro2)
    ct = dictdata['country']
    positonlist = []
    for i in range(0, 196):
        provinceName=ct[i]['provinceName']
        cured=ct[i]['cured']
        diagnosed=ct[i]['diagnosed']
        died=ct[i]['died']
        diffDiagnosed=ct[i]['diffDiagnosed']

        datalist=[provinceName,cured,diagnosed,died,diffDiagnosed]
        positonlist.append(datalist)
        print(diagnosed)
        conn = pymysql.Connect(host='localhost', port=3306, user='root', passwd='lmy3.1415926', db='lagou', charset='utf8')
        sql = "insert into yq values('%s','%s','%s','%s','%s');" %(provinceName,cured,diagnosed,died,diffDiagnosed)
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            conn.commit()
        except: 
            conn.rollback()
        cursor.close()
        conn.close()
    with open('疫情数据.csv','a',encoding="utf-8",newline="") as f:
        w=csv.writer(f)
        w.writerows(positonlist)
title=['provinceName','cured','diagnosed','died','diffDiagnosed']
with open('疫情数据.csv','a',encoding="utf-8",newline="") as f:
    w = csv.writer(f)
    w.writerow(title)
getHrml('https://m.look.360.cn/events/feiyan?sv=&version=&market=&device=2&net=4&stype=&scene=&sub_scene=&refer_scene=&refer_subscene=&f=jsonp&location=true&sort=2&_=1626165806369&callback=jsonp2')

五、scrapy 数据采集

（一）创建爬虫步骤

1.创建项目，在命令行输入：scrapy startproject 项目名称

2.创建爬虫文件：在命令行定位到spiders文件夹下：scrapy genspider 爬虫文件名网站域名

3.运行程序：scrapy crawl 爬虫文件名

（二）scrapy文件结构

1.spiders文件夹:编写解析程序

2.init.py：包标志文件，只有存在这个python文件的文件夹才是包

3.items.py:定义数据结构，用于定义爬取的字段

4.middlewares.py:中间件

5.pipliness.py：管道文件，用于定义数据存储方向

6.settings.py:配置文件，通常配置用户代理，管道文件开启。

（三）scrapy爬虫框架使用流程

1.配置settings.py文件：

（1）User-Agent:“浏览器参数”

(2)ROBOTS协议：Fasle

（3）修改ITEM_PIPLINES：打开管道

2.在item.py中定义爬取的字段

 字段名 = scrapy.Field()

3.在spiders文件夹中的parse（）方法中写解析代码并将解析结果提交给item对象

4.在piplines.py中定义存储路径

5.分页采集：

1）查找url的规律

2）在爬虫文件中先定义页面变量，使用if判断和字符串格式化方法生成新url

3）yield scrapy Requests(url,callback=parse)

案例

（一）安居客数据采集

其中的文件

1.ajk.py中

import scrapy

from anjuke.items import AnjukeItem
class AjkSpider(scrapy.Spider):
    name = 'ajk'
    #allowed_domains = ['bj.fang.anjuke.com/?from=navigation']
    start_urls = ['https://bj.fang.anjuke.com/loupan/all/p1/']
    pagenum = 1
    def parse(self, response):
        #print(response.text)
        #response.encoding = 'gb2312'
        item = AnjukeItem()
        # 解析程序
        name = response.xpath('//span[@class="items-name"]/text()').extract()
        temp = response.xpath('//span[@class="list-map"]/text()').extract()
        place = []
        district = []
        #print(temp)
        for i in temp:
            placetemp = "".join(i.split("]")[0].strip("[").strip().split())
            districttemp = i.split("]")[1].strip()
            #print(districttemp)
            place.append(placetemp)
            district.append(districttemp)
        #apartment = response.xpath('//a[@class="huxing"]/span[not(@class)]/text()').extract()
        area1 = response.xpath('//span[@class="building-area"]/text()').extract()
        area = []
        for j in area1:
            areatemp = j.strip("建筑面积：").strip('㎡')
            area.append(areatemp)
        price = response.xpath('//p[@class="price"]/span/text()').extract()
        # print(name)
        # 将处理后的数据传入item中
        item['name'] = name
        item['place'] = place
        item['district'] = district
        #item['apartment'] = apartment
        item['area'] = area
        item['price'] = price
        yield item
        # print(type(item['name']))
        for a,b,c,d,e in zip(item['name'],item['place'],item['district'],item['area'],item['price']):
             print(a,b,c,d,e)
        # 分页爬虫
        if self.pagenum < 5:
            self.pagenum += 1
            newurl = "https://bj.fang.anjuke.com/loupan/all/p{}".format(str(self.pagenum))
            #print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
        # print(type(dict(item)))

2.pipelines.py中

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv,pymysql
# 存放到csv中
class AnjukePipeline:
    def open_spider(self,spider):
        self.f = open("安居客北京.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'place', 'distract', 'area', 'price']
        self.w.writerow(titlelist)
    def process_item(self, item, spider):
        # writerow()  [1,2,3,4] writerows()  [[第一条记录],[第二条记录],[第三条记录]]
        # 数据处理
        k = list(dict(item).values())
        self.listtemp = []
        for a,b,c,d,e in zip(k[0],k[1],k[2],k[3],k[4]):
            self.temp = [a,b,c,d,e]
            self.listtemp.append(self.temp)
        #print(listtemp)
        self.w.writerows(self.listtemp)
        return item
    def close_spider(self,spider):
        self.f.close()
# 存储到mysql中
class MySqlPipeline:
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host="localhost",port=3306,user='root',password='lmy3.1415926',db="anjuke",charset='utf8')
    def process_item(self,item,spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            sql = 'insert into ajk values("%s","%s","%s","%s","%s");'%(a,b,c,d,e)

            self.cursor.execute(sql)
            self.conn.commit()
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

3.items.py中

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class AnjukeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    place = scrapy.Field()
    district = scrapy.Field()
    #apartment = scrapy.Field()
    area = scrapy.Field()
    price = scrapy.Field()

4.settings.py中

# Scrapy settings for anjuke project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'anjuke'

SPIDER_MODULES = ['anjuke.spiders']
NEWSPIDER_MODULE = 'anjuke.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla / 5.0(Windows NT 10.0;WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'anjuke.middlewares.AnjukeSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'anjuke.middlewares.AnjukeDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'anjuke.pipelines.AnjukePipeline': 300,
   'anjuke.pipelines.MySqlPipeline': 301
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5.middlewares.py中

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class AnjukeSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class AnjukeDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

（二）太平洋汽车数据采集

1.tpy.py中

import scrapy

from taipy.items import TaipyItem


class TpySpider(scrapy.Spider):
    name = 'tpy'
    # allowed_domains = ['price.pcauto.com.cn/top/k0-p1.html']
    start_urls = ['https://price.pcauto.com.cn/top/k0-p1.html']
    pagenum = 1
    def parse(self, response):
        item = TaipyItem()
        # response.encoding ='gb2312'
        # print(response.text)
        name = response.xpath('//p[@class="sname"]/a/text()').extract()
        temperature = response.xpath('//p[@class="col rank"]/span[@class="fl red rd-mark"]/text()').extract()
        price = response.xpath('//p/em[@class="red"]/text()').extract()
        brand2 = response.xpath('//p[@class="col col1"]/text()').extract()
        brand = []
        for j in brand2:
            areatemp = j.strip('品牌：').strip('排量：').strip('\r\n')
            brand.append(areatemp)
        brand = [i for i in brand if i != '']
        rank2 = response.xpath('//p[@class="col"]/text()').extract()
        rank = []
        for j in rank2:
            areatemp = j.strip('级别：').strip('变速箱：').strip('\r\n')
            rank.append(areatemp)
        rank = [i for i in rank if i != '']

        item['name'] = name
        item['temperature'] = temperature
        item['price'] = price
        item['brand'] = brand
        item['rank'] = rank
        yield item

        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            print(a, b, c, d, e)
        if self.pagenum < 6:
            self.pagenum += 1
            newurl = "https://price.pcauto.com.cn/top/k0-p{}.html".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
        #print(type(dict(item)))

2.pipelines.py中

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv,pymysql

class TaipyPipeline:
    def open_spider(self, spider):
        self.f=open("太平洋.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist=['name','temperature','price','brand','rank']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        print(self.listtemp)
        self.w.writerows(self.listtemp)
        return item
    def close_spider(self, spider):
        self.f.close()
class MySqlPipeline:
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host="localhost",port=3306,user='root',password='lmy3.1415926',db="taipy",charset='utf8')
    def process_item(self,item,spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            sql = 'insert into tpy values("%s","%s","%s","%s","%s");'%(a,b,c,d,e)

            self.cursor.execute(sql)
            self.conn.commit()
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

3.items.py中

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TaipyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    temperature = scrapy.Field()
    price = scrapy.Field()
    brand = scrapy.Field()
    rank = scrapy.Field()

4.settings.py中

# Scrapy settings for taipy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taipy'

SPIDER_MODULES = ['taipy.spiders']
NEWSPIDER_MODULE = 'taipy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'taipy.middlewares.TaipySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'taipy.middlewares.TaipyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'taipy.pipelines.TaipyItem':300
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'taipy.pipelines.TaipyPipeline': 300,
    'taipy.pipelines.MySqlPipeline':301
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5.middlewares.py中

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class TaipySpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TaipyDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

你可能感兴趣的:(python,爬虫,大数据)

利用Beautiful Soup和Pandas进行网页数据抓取与清洗处理实战傻啦嘿哟 pandas
目录一、准备工作二、抓取网页数据三、数据清洗四、数据处理五、保存数据六、完整代码示例七、总结在数据分析和机器学习的项目中，数据的获取、清洗和处理是非常关键的步骤。今天，我们将通过一个实战案例，演示如何利用Python中的BeautifulSoup库进行网页数据抓取，并使用Pandas库进行数据清洗和处理。这个案例不仅适合初学者，也能帮助有一定经验的朋友快速掌握这两个强大的工具。一、准备工作在开始之
python做一个注册界面_python如何做一个登录注册界面 weixin_39824033 python做一个注册界面
python做一个登录注册界面的方法：首先初始化一个window界面，并使用画布实现欢迎的logo；然后用代码实现登录和注册按钮；接着并进行登录判断代码；最后完成注册界面即可。【相关学习推荐：python视频教程】python做一个登录注册界面的方法：一、登录界面1、首先初始化一个window界面window=tk.Tk()window.title('WelcometoMofanPython')w
python读取zip包内文件_Python模块学习：zipfile zip文件操作 weixin_40001634 python读取zip包内文件
最近在写一个网络客户端下载程序，用于下载服务器上的数据。有些数据(如文本，office文档)如果直接传输的话，将会增加通信的数据量，使下载时间变长。服务器在传输这些数据之前先对其进行压缩，客户端接收到数据之后进行解压，这样可以减小网通传输数据的通信量，缩短下载的时间，从而增加客户体验。以前用C#做类似应用程序的时候，我会用SharpZipLib这个开源组件，现在用Python做类似的工作，只要使用
python制作登陆窗口_python登陆界面 weixin_39758494 python制作登陆窗口
广告关闭腾讯云11.11云上盛惠，精选热门产品助力上云，云服务器首年88元起，买的越多返的越多，最高返5000元！print(账号密码错误！请重试。)returnfalsebutton(master,text=登陆,width=10,command=test).grid(row=3,column=0,sticky=w,padx=10,pady=5)button(master,text=退出,wid
如何使用零配置的Sphinx生成Python文档？潮易 sphinx 全文检索搜索引擎
如何使用零配置的Sphinx生成Python文档？在Python编程中，编写文档是非常重要的。一个好的文档可以帮助其他开发者理解和使用你的代码。Sphinx是一个用于生成Python项目的文档的静态网页生成器，它支持多种文档格式，包括ReStructuredText和Markdown。以下是使用零配置的方式来使用Sphinx生成Python文档的详细步骤：1.首先，确保你已经安装了Sphinx。打
如何订阅&q；/扫描&q；主题、修改消息并发布到新主题？潮易 python 开发语言
如何订阅&q；/扫描&q；主题、修改消息并发布到新主题？这个问题涉及到Python编程中的MQTT（MessageQueuingTelemetryTransport）库的使用，该库允许我们创建客户端订阅和发布消息到MQTT服务器。以下是一个简单的步骤：1.安装MQTT库：可以使用pip安装`paho-mqtt`库。```pythonpipinstallpaho-mqtt```2.创建一个MQTT客
Python-tkinter自制登录界面（含注册） GCHEK python 开发语言
简单的用户登录、注册界面importtkinterastkimporttimeimportsubprocessimportsysimportosimporttkinter.messageboxwindow=tk.Tk()window.title('GCHEK')window.geometry('400x300')#设置储存用户信息的容器，这里用的txt。ifnotos.path.exists('U
Python爬虫requests(详细) dme. Python爬虫零基础入门爬虫 python
本文来学爬虫使用requests模块的常见操作。1.URL参数无论是在发送GET/POST请求时，网址URL都可能会携带参数，例如：http://www.5xclass.cn?age=19&name=dengres=requests.get(url="https://www.5xclass.cn?age=19&name=deng")res=requests.get(url="https://www
使用python计算等比数列求和的方法 HAMYHF windows
在python中，计算Sum=m+mm+mmm+mmmm+.....+mmmmm.....,输入两个数m,n。m的位数累加到n的值，列出算式并计算出结果：#为了打印出算式，并计算出结果，将m,mm这些放入到列表中#定义列表中的m初始值为0,用Ele来代表m,mm....Ele=0#定义总和为0Sum=0#定义一个空列表List=[]#输入两个值n=int(input("inputadigit：")
Python+Playwright常用元素定位方法 HAMYHF python 功能测试
CSSselector选择器在CSS中，定位元素主要通过选择器完成，以下是几种常见的CSS选择器定位方法：标签选择器(element):直接使用HTML元素名称来定位，例如p会选择所有段落元素。属性选择器(attribute):选择所有具有指定属性的元素，无论该属性的值是什么。例如，[title]会选择所有包含title属性的元素。选择具有指定属性，并且该属性值完全等于给定值的元素。例如，[typ
Python中的 redis keyspace 通知_python 操作redis psubscribe(‘__keyspace@0__ ‘) 2301_82243733 程序员 python 学习面试
最后Python崛起并且风靡，因为优点多、应用领域广、被大牛们认可。学习Python门槛很低，但它的晋级路线很多，通过它你能进入机器学习、数据挖掘、大数据，CS等更加高级的领域。Python可以做网络应用，可以做科学计算，数据分析，可以做网络爬虫，可以做机器学习、自然语言处理、可以写游戏、可以做桌面应用…Python可以做的很多，你需要学好基础，再选择明确的方向。这里给大家分享一份全套的Pytho
Python数据分析与可视化程序媛小果 python python 数据分析开发语言
Python数据分析与可视化在数据驱动的商业世界中，数据分析和可视化成为了理解复杂数据集、做出明智决策的关键工具。Python，作为一种功能强大且易于学习的编程语言，提供了丰富的库和框架，使得数据分析和可视化变得简单高效。本文将探讨Python在数据分析和可视化中的应用，包括数据预处理、分析、以及如何通过可视化工具将数据洞察转化为可操作的策略。1.数据分析的重要性数据分析是提取数据中有用信息的过程
【Python 学习 / 7】模块与文件操作卜及中 Python基础 python 学习数据库
文章目录前言一、导入模块1.导入整个模块2.导入模块中的特定函数3.给模块或函数起别名二、常用模块1.`math`模块2.`random`模块3.`os`模块4.`sys`模块三、文件处理1.打开文件2.读取文件3.写入文件4.关闭文件5.使用`with`语句管理文件四、日期时间1.`datetime`模块获取当前日期和时间创建日期和时间对象格式化日期和时间解析字符串为日期对象2.`time`模块
YashanDB访问约束数据库
本文内容来自YashanDB官网，原文内容请见https://doc.yashandb.com/yashandb/23.3/zh/%E6%A6%82%E5%BF%B5%...访问约束是YashanDB特有的一种关系数据结构，基于有界计算理论的访问约束模型（AC，AccessConstraint）实现：通过在数据源上建立AC，实现大数据变小的模型变换。在查询时，通过访问AC数据，缩小查询代价和提升查
探索天气预警API：精准预测，守护安全 api
引言在当今这个快速变化的世界中，天气的波动直接影响着人们的日常生活、农业生产、交通出行乃至公共安全。为了有效应对各种极端天气事件，天气预警API应运而生，成为连接气象数据与公众服务的重要桥梁。本文将深入探讨天气预警API的工作原理、应用场景以及其对社会的积极影响。天气预警API的工作原理天气预警API基于先进的气象监测技术和大数据分析，通过收集全球范围内的气象卫星、雷达、地面观测站等数据源，进行实
经销商管理系统架构设计方案（附 Java版本和Python版本源代码详解） AI天才研究院 DeepSeek R1 &大数据AI人工智能大模型 AI大模型企业级应用开发实战 AI大模型应用入门实战与进阶计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
经销商管理系统架构设计方案（Java实现源代码详解）关键词：经销商管理系统，Java，SpringBoot，MyBatis，MySQL，架构设计，源代码1.背景介绍随着市场竞争的日益激烈，企业对经销商的管理越来越重视。传统的经销商管理方式效率低下，信息滞后，难以适应现代企业的发展需求。为了提高经销商管理效率，降低运营成本，越来越多的企业开始采用信息化的手段来管理经销商，而经销商管理系统应运而生。经
Python:数据从Excel表格链接到Word文档更新Excel即可自动更新Word 一个花生米生花 python excel word
要使用Python来创建或更新一个Word文档，并将数据从Excel表格链接到Word文档中，你可以使用python-docx库来操作Word文档和openpyxl或pandas库来读取Excel文件。不过，需要注意的是，python-docx库并不支持将外部文件链接到Word文档的功能。你可以在Word文档中插入Excel数据的快照，但它们不会自动更新。如果你想要在Word文档中插入Excel数
使用Odoo Shell卸载模块 odoo中国 odoo odoo 开源软件 erp
使用OdooShell卸载模块我们在Odoo使用过程中，因为模块安装错误或者前端错误等导致odoo无法通过界面登录，这时候你可以使用OdooShell来卸载模块。OdooShell是一个交互式Pythonshell，允许你直接与Odoo数据库和模型进行交互。以下是使用OdooShell卸载模块的详细步骤：步骤1：启动OdooShell要启动OdooShell，你需要在终端中运行以下命令。确保你已经
NumPy的基本使用 Mo思编程学习 numpy python 开发语言 pip
在Python的数据科学与数值计算领域，NumPy无疑是一颗耀眼的明星。作为Python中用于科学计算的基础库，NumPy提供了高效的多维数组对象以及处理这些数组的各种工具。本文将带您深入了解NumPy的基本使用，感受它的强大魅力。一、安装与导入在使用NumPy之前，首先要确保它已经安装在您的Python环境中。如果您使用的是Anaconda发行版，NumPy通常已经预装。若未安装，可以使用如下命
FOKS-TROT: 一个高效、易用的全功能开源知识图谱生成工具柳旖岭
FOKS-TROT:一个高效、易用的全功能开源知识图谱生成工具项目简介FOKS-TROT是一个基于Python的全功能开源知识图谱生成工具，旨在帮助研究人员和开发者快速构建具有丰富信息的知识图谱。该项目由hkx3upper在GitCode上开发并维护。通过FOKS-TROT，您可以轻松地将各种数据源（如文本文件、数据库、API）转换为结构化的知识图谱，并对其进行可视化分析和机器学习任务。此外，该工
python实现word文档合并 v2.0 task138 python自动化 python 自动化运维开发
目录前言要求运行效果脚本下载链接前言之前发表了一个小工具，python用于合并word文档以完成特定的工作任务，现在领导给出了新需求，适当的调整了一下word文档的合并情况。同时，各位同事反馈说，环境部署太难了，脚本的使用成本比较高，难度大，所以我这次把脚本打包成一个EXE可执行文件，直接双击即可使用。要求由于脚本的具体逻辑发生了变化，因此，exe文件的同级目录下，一定要存在一个txt文件，否则无
2025年全国CTF夺旗赛-从零基础入门到竞赛，看这一篇就稳了！白帽安全-黑客4148 安全 web安全网络网络安全 CTF
目录一、CTF简介二、CTF竞赛模式三、CTF各大题型简介四、CTF学习路线4.1、初期1、html+css+js（2-3天）2、apache+php（4-5天）3、mysql（2-3天）4、python(2-3天)5、burpsuite（1-2天）4.2、中期1、SQL注入（7-8天）2、文件上传（7-8天）3、其他漏洞（14-15天）4.3、后期五、CTF学习资源5.1、CTF赛题复现平台5.
2025年全国CTF夺旗赛-从零基础入门到竞赛，看这一篇就稳了！白帽安全-黑客4148 网络安全 web安全 linux 密码学 CTF
目录一、CTF简介二、CTF竞赛模式三、CTF各大题型简介四、CTF学习路线4.1、初期1、html+css+js（2-3天）2、apache+php（4-5天）3、mysql（2-3天）4、python(2-3天)5、burpsuite（1-2天）4.2、中期1、SQL注入（7-8天）2、文件上传（7-8天）3、其他漏洞（14-15天）4.3、后期五、CTF学习资源5.1、CTF赛题复现平台5.
基于python深度学习遥感影像地物分类与目标识别、分割实践技术应用 xiao5kou4chang6kai4 深度学习遥感勘测 python 深度学习分类
专题一：深度学习发展与机器学习深度学习的历史发展过程机器学习，深度学习等任务的基本处理流程梯度下降算法讲解不同初始化，学习率对梯度下降算法的实例分析从机器学习到深度学习算法专题二深度卷积网络、卷积神经网络、卷积运算的基本原理池化操作，全连接层，以及分类器的作用BP反向传播算法的理解一个简单CNN模型代码理解特征图，卷积核可视化分析专题三TensorFlow与keras介绍与入门TensorFlow
python 快速实现链接转 word 文档嘿嘿潶黑黑 python word
python快速实现链接转word文档演示代码展示最后演示代码展示fromnewspaperimportArticlefromdocximportDocumentfromdocx.sharedimportPt,RGBColorfromdocx.enum.styleimportWD_STYLE_TYPEfromdocx.oxml.nsimportqn#tkinterGUIimporttkintera
Python入门笔记「已注销」计算机
文章目录第0周课程导学第1周Python基本语法元素保留字数据类型语句与函数输入函数第2周Python基本图形绘制turtle库绝对坐标海龟坐标turtle角度坐标体系RGB色彩体系画笔控制函数运动控制函数方向控制函数循环语句第3周基本数据类型整型浮点数科学计数法复数类型数值运算操作符二元操作符有对应的增强赋值操作符数值运算函数字符串类型的表示字符串切片字符串类型及操作字符串类型格式化time库时
pythonxml模块高级用法_Python minidom模块用法示例【DOM写入和解析XML】 Lucy-露西娅 pythonxml模块高级用法
本文实例讲述了Pythonminidom模块用法。分享给大家供大家参考，具体如下：一、DOM写XML文件#-*-coding:utf-8-*-#!python3#导入minidomfromxml.domimportminidom#1.创建DOM树对象dom=minidom.Document()#2.创建根节点。每次都要用DOM对象来创建任何节点。root_node=dom.createElemen
React 渲染 Flash 接口数据 ox0080 #北漂+滴滴出行 VIP 激励 Web react.js 前端前端框架
1.后端Python代码使用Flask创建多个接口，每个接口返回不同的数据，并使用自定义装饰器来绑定路由。代码：#app.pyfromflaskimportFlask,jsonifyapp=Flask(__name__)defapi_route(route,methods=['GET']):"""自定义装饰器，用于将函数与HTTP路由绑定"""defdecorator(func):app.rout
LQB---基础练习---十六进制转八进制「已注销」 #LQB LQB
试题基础练习十六进制转八进制资源限制内存限制：512.0MBC/C++时间限制：1.0sJava时间限制：3.0sPython时间限制：5.0s问题描述给定n个十六进制正整数，输出它们对应的八进制数。输入格式输入的第一行为一个正整数n（1<=n<=10）。接下来n行，每行一个由09、大写字母AF组成的字符串，表示要转换的十六进制正整数，每个十六进制数长度不超过100000。输出格式输出n行，每行为
【2025年】全国CTF夺旗赛-从零基础入门到竞赛，看这一篇就稳了！网安詹姆斯 web安全 CTF 网络安全大赛 python linux
【2025年】全国CTF夺旗赛-从零基础入门到竞赛，看这一篇就稳了！基于入门网络安全/黑客打造的：黑客&网络安全入门&进阶学习资源包目录一、CTF简介二、CTF竞赛模式三、CTF各大题型简介四、CTF学习路线4.1、初期1、html+css+js（2-3天）2、apache+php（4-5天）3、mysql（2-3天）4、python(2-3天)5、burpsuite（1-2天）4.2、中期1、S
ASM系列四利用Method 组件动态注入方法逻辑 lijingyao8206 字节码技术 jvm AOP 动态代理 ASM
这篇继续结合例子来深入了解下Method组件动态变更方法字节码的实现。通过前面一篇，知道ClassVisitor 的visitMethod()方法可以返回一个MethodVisitor的实例。那么我们也基本可以知道，同ClassVisitor改变类成员一样，MethodVIsistor如果需要改变方法成员，注入逻辑，也可以
java编程思想 --内部类百合不是茶 java 内部类匿名内部类
内部类;了解外部类并能与之通信内部类写出来的代码更加整洁与优雅 1,内部类的创建内部类是创建在类中的 package com.wj.InsideClass; /* * 内部类的创建 */ public class CreateInsideClass { public CreateInsideClass(
web.xml报错 crabdave web.xml
web.xml报错 The content of element type "web-app" must match "(icon?,display- name?,description?,distributable?,context-param*,filter*,filter-mapping*,listener*,servlet*,s
泛型类的自定义麦田的设计者 java android 泛型
为什么要定义泛型类，当类中要操作的引用数据类型不确定的时候。采用泛型类，完成扩展。例如有一个学生类 Student{ Student(){ System.out.println("I'm a student....."); } } 有一个老师类
CSS清除浮动的4中方法 IT独行者 JavaScript UI css
清除浮动这个问题，做前端的应该再熟悉不过了，咱是个新人，所以还是记个笔记，做个积累，努力学习向大神靠近。CSS清除浮动的方法网上一搜，大概有N多种，用过几种，说下个人感受。 1、结尾处加空div标签 clear:both 1 2 3 4 .div 1 { background : #000080 ; border : 1px s
Cygwin使用windows的jdk 配置方法 _wy_ jdk windows cygwin
1.[vim /etc/profile] JAVA_HOME="/cgydrive/d/Java/jdk1.6.0_43" (windows下jdk路径为D:\Java\jdk1.6.0_43) PATH="$JAVA_HOME/bin:${PATH}" CLAS
linux下安装maven 无量 maven linux 安装
Linux下安装maven(转) 1.首先到Maven官网下载安装文件，目前最新版本为3.0.3，下载文件为 apache-maven-3.0.3-bin.tar.gz，下载可以使用wget命令； 2.进入下载文件夹，找到下载的文件，运行如下命令解压 tar -xvf apache-maven-2.2.1-bin.tar.gz 解压后的文件夹
tomcat的https 配置,syslog-ng配置 aichenglong tomcat http跳转到https syslong-ng配置 syslog配置
1) tomcat配置https,以及http自动跳转到https的配置 1)TOMCAT_HOME目录下生成密钥(keytool是jdk中的命令) keytool -genkey -alias tomcat -keyalg RSA -keypass changeit -storepass changeit
关于领号活动总结 alafqq 活动
关于某彩票活动的总结具体需求，每个用户进活动页面，领取一个号码，1000中的一个；活动要求 1，随机性，一定要有随机性； 2，最少中奖概率，如果注数为3200注，则最多中4注 3，效率问题，（不能每个人来都产生一个随机数，这样效率不高）； 4，支持断电（仍然从下一个开始），重启服务；（存数据库有点大材小用，因此不能存放在数据库）解决方案 1，事先产生随机数1000个，并打
java数据结构冒泡排序的遍历与排序百合不是茶 java
java的冒泡排序是一种简单的排序规则冒泡排序的原理：比较两个相邻的数，首先将最大的排在第一个，第二次比较第二个，此后一样；针对所有的元素重复以上的步骤，除了最后一个例题；将int array[]
JS检查输入框输入的是否是数字的一种校验方法 bijian1013 js
如下是JS检查输入框输入的是否是数字的一种校验方法： <form method=post target="_blank"> 数字：<input type="text" name=num onkeypress="checkNum(this.form)"><br> </form>
Test注解的两个属性：expected和timeout bijian1013 java JUnit expected timeout
JUnit4：Test文档中的解释：　　The Test annotation supports two optional parameters. 　　The first, expected, declares that a test method should throw an exception. 　　If it doesn't throw an exception or if it
[Gson二]继承关系的POJO的反序列化 bit1129 POJO
父类 package inheritance.test2; import java.util.Map; public class Model { private String field1; private String field2; private Map<String, String> infoMap
【Spark八十四】Spark零碎知识点记录 bit1129 spark
1. ShuffleMapTask的shuffle数据在什么地方记录到MapOutputTracker中的 ShuffleMapTask的runTask方法负责写数据到shuffle map文件中。当任务执行完成成功，DAGScheduler会收到通知，在DAGScheduler的handleTaskCompletion方法中完成记录到MapOutputTracker中
WAS各种脚本作用大全 ronin47 WAS 脚本
　　　http://www.ibm.com/developerworks/cn/websphere/library/samples/SampleScripts.html 　　　无意中，在WAS官网上发现的各种脚本作用，感觉很有作用，先与各位分享一下　　　获取下载这些示例 jacl 和 Jython 脚本可用于在 WebSphere Application Server 的不同版本中自
java-12.求 1+2+3+..n不能使用乘除法、 for 、 while 、 if 、 else 、 switch 、 case 等关键字以及条件判断语句 bylijinnan switch
借鉴网上的思路，用java实现： public class NoIfWhile { /** * @param args * * find x=1+2+3+....n */ public static void main(String[] args) { int n=10; int re=find(n); System.o
Netty源码学习-ObjectEncoder和ObjectDecoder bylijinnan java netty
Netty中传递对象的思路很直观： Netty中数据的传递是基于ChannelBuffer（也就是byte[]）；那把对象序列化为字节流，就可以在Netty中传递对象了相应的从ChannelBuffer恢复对象，就是反序列化的过程 Netty已经封装好ObjectEncoder和ObjectDecoder 先看ObjectEncoder ObjectEncoder是往外发送
spring 定时任务中cronExpression表达式含义 chicony cronExpression
一个cron表达式有6个必选的元素和一个可选的元素，各个元素之间是以空格分隔的，从左至右，这些元素的含义如下表所示：代表含义是否必须允许的取值范围 &nb
Nutz配置Jndi ctrain JNDI
1、使用JNDI获取指定资源： var ioc = { dao : { type :"org.nutz.dao.impl.NutDao", args : [ {jndi :"jdbc/dataSource"} ] } } 以上方法,仅需要在容器中配置好数据源,注入到NutDao即可.
解决 /bin/sh^M: bad interpreter: No such file or directory daizj shell
在Linux中执行.sh脚本，异常/bin/sh^M: bad interpreter: No such file or directory。分析：这是不同系统编码格式引起的：在windows系统中编辑的.sh文件可能有不可见字符，所以在Linux系统下执行会报以上异常信息。解决： 1）在windows下转换：利用一些编辑器如UltraEdit或EditPlus等工具
[转]for 循环为何可恨？ dcj3sjt126com 程序员读书
Java的闭包(Closure)特征最近成为了一个热门话题。一些精英正在起草一份议案，要在Java将来的版本中加入闭包特征。然而，提议中的闭包语法以及语言上的这种扩充受到了众多Java程序员的猛烈抨击。不久前，出版过数十本编程书籍的大作家Elliotte Rusty Harold发表了对Java中闭包的价值的质疑。尤其是他问道“for 循环为何可恨？”[http://ju
Android实用小技巧 dcj3sjt126com android
1、去掉所有Activity界面的标题栏　　修改AndroidManifest.xml 　　在application 标签中添加android:theme="@android:style/Theme.NoTitleBar" 2、去掉所有Activity界面的TitleBar 和StatusBar 　　修改AndroidManifes
Oracle 复习笔记之序列 eksliang Oracle 序列 sequence Oracle sequence
转载请出自出处：http://eksliang.iteye.com/blog/2098859 1.序列的作用序列是用于生成唯一、连续序号的对象一般用序列来充当数据库表的主键值 2.创建序列语法如下： create sequence s_emp start with 1 --开始值 increment by 1 --増长值 maxval
有“品”的程序员 gongmeitao 工作
完美程序员的10种品质　　完美程序员的每种品质都有一个范围，这个范围取决于具体的问题和背景。没有能解决所有问题的完美程序员（至少在我们这个星球上），并且对于特定问题，完美程序员应该具有以下品质：　　1. 才智非凡- 能够理解问题、能够用清晰可读的代码翻译并表达想法、善于分析并且逻辑思维能力强（范围：用简单方式解决复杂问题）　　
使用KeleyiSQLHelper类进行分页查询 hvt sql .net C#asp.net hovertree
本文适用于sql server单主键表或者视图进行分页查询，支持多字段排序。KeleyiSQLHelper类的最新代码请到http://hovertree.codeplex.com/SourceControl/latest下载整个解决方案源代码查看。或者直接在线查看类的代码：http://hovertree.codeplex.com/SourceControl/latest#HoverTree.D
SVG 教程（三）圆形，椭圆，直线天梯梦 svg
SVG <circle> SVG 圆形 - <circle> <circle> 标签可用来创建一个圆：下面是SVG代码： <svg xmlns="http://www.w3.org/2000/svg" version="1.1"> <circle cx="100" c
链表栈 luyulong java 数据结构
public class Node { private Object object; private Node next; public Node() { this.next = null; this.object = null; } public Object getObject() { return object; } public
基础数据结构和算法十：2-3 search tree sunwinner Algorithm 2-3 search tree
Binary search tree works well for a wide variety of applications, but they have poor worst-case performance. Now we introduce a type of binary search tree where costs are guaranteed to be loga
spring配置定时任务 stunizhengjia spring timer
最近因工作的需要，用到了spring的定时任务的功能,觉得spring还是很智能化的,只需要配置一下配置文件就可以了,在此记录一下，以便以后用到： //------------------------定时任务调用的方法------------------------------ /** * 存储过程定时器 */ publi
ITeye 8月技术图书有奖试读获奖名单公布 ITeye管理员活动
ITeye携手博文视点举办的8月技术图书有奖试读活动已圆满结束，非常感谢广大用户对本次活动的关注与参与。 8月试读活动回顾： http://webmaster.iteye.com/blog/2102830 本次技术图书试读活动的优秀奖获奖名单及相应作品如下（优秀文章有很多，但名额有限，没获奖并不代表不优秀）：《跨终端Web》 gleams：http

python数据采集