Web Crawler Crash Course (4): URL Deduplication

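If the crawl volume is fairly small:
A Bloom filter. It is a bit array plus several hash functions; it can report an unseen URL as already seen (a false positive), but never the other way round.

If the crawl volume is fairly large:
Redis: convert each URL to its MD5 digest and use the digest as the key for deduplication.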
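
The MD5-as-key idea can be sketched as follows. This is a minimal illustration, not the author's code: the class name Md5UrlDeduper and the method markIfNew are made up for this example, and an in-memory set stands in for the Redis keyspace so that the sketch runs without a server. Against a real Redis instance, the check-and-mark step would typically map to a single SETNX or SADD command.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Deduplicates URLs by their MD5 digest (hypothetical helper for illustration). */
final class Md5UrlDeduper {
    // Assumption: a local concurrent set replaces the Redis keyspace.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    /** Returns the 32-character lowercase hex MD5 digest of the URL. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns true the first time a URL is offered and false for every repeat. */
    boolean markIfNew(String url) {
        return seenKeys.add(md5Hex(url));
    }

    public static void main(String[] args) {
        Md5UrlDeduper deduper = new Md5UrlDeduper();
        String url = "http://list.xiu.com/19757.html";
        System.out.println(deduper.markIfNew(url)); // first sighting
        System.out.println(deduper.markIfNew(url)); // repeat
    }
}
```

Hashing first keeps every key at a fixed 32 characters no matter how long the URL is, which is the main reason to store the digest rather than the raw URL.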

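About the Bloom filter's parameters, the short version: private static BloomFilter<String> bloomFilter = new BloomFilter<String>(2X, X); is all you need. Here X is the expected number of URLs, and the first argument is the total number of bits in the filter. Be aware that 2 bits per URL gives a false-positive rate of about 39% by the formula in getFalsePositiveProbability(); the crawler example further down passes (200000, 20000), i.e. 10 bits per URL, which brings the rate under 1%.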
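
The effect of the bits-per-URL choice can be checked with a few lines that mirror the formulas used by the two-argument constructor and by getFalsePositiveProbability() in the listing below. The class name BloomSizing and its method names are assumptions made for this sketch; only the arithmetic is taken from the listing.

```java
/** Estimates Bloom filter quality from total bits m and expected elements n (illustrative helper). */
final class BloomSizing {
    /** Hash-function count as the two-argument constructor computes it: round((m / n) * ln 2). */
    static int hashCount(int bitSetSize, int expectedElements) {
        return (int) Math.round((bitSetSize / (double) expectedElements) * Math.log(2.0));
    }

    /** Expected false-positive probability: (1 - e^(-k * n / m)) ^ k. */
    static double falsePositiveRate(int bitSetSize, int expectedElements) {
        int k = hashCount(bitSetSize, expectedElements);
        return Math.pow(1 - Math.exp(-k * (double) expectedElements / bitSetSize), k);
    }

    public static void main(String[] args) {
        int n = 20000; // expected number of URLs, as in the crawler example
        for (int bitsPerUrl : new int[] {2, 10}) {
            int m = bitsPerUrl * n;
            System.out.printf("%d bits/URL: k=%d, p=%.4f%n",
                    bitsPerUrl, hashCount(m, n), falsePositiveRate(m, n));
        }
    }
}
```

With 2 bits per URL the filter ends up with a single hash function and a false-positive rate of roughly 0.39; with 10 bits per URL it uses 7 hash functions and the rate drops below 0.01. For a crawler, a false positive means a page is silently skipped, so the larger ratio is usually worth the memory.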

Appendix: Bloom filter implementation by Magnus Skjegstad


import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

/**
 * Implementation of a Bloom-filter, as described here:
 * http://en.wikipedia.org/wiki/Bloom_filter
 *
 * Inspired by the SimpleBloomFilter-class written by Ian Clarke. This
 * implementation provides a more evenly distributed Hash-function by using a
 * proper digest instead of the Java RNG. Many of the changes were proposed in
 * comments in his blog:
 * http://blog.locut.us/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/
 *
 * @param <E>
 *            Object type that is to be inserted into the Bloom filter, e.g.
 *            String or Integer.
 * @author Magnus Skjegstad
 */
public class BloomFilter<E> implements Serializable {
    private BitSet bitset;
    private int bitSetSize;
    private double bitsPerElement;
    private int expectedNumberOfFilterElements; // expected (maximum) number of elements to be added
    private int numberOfAddedElements; // number of elements actually added to the Bloom filter
    private int k; // number of hash functions

    static final Charset charset = Charset.forName("UTF-8"); // encoding used for storing hash values as strings

    // MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed
    static final String hashName = "MD5";
    static final MessageDigest digestFunction;
    static { // The digest method is reused between instances
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

    /**
     * Constructs an empty Bloom filter. The total length of the Bloom filter
     * will be c*n.
     *
     * @param c
     *            is the number of bits used per element.
     * @param n
     *            is the expected number of elements the filter will contain.
     * @param k
     *            is the number of hash functions used.
     */
    public BloomFilter(double c, int n, int k) {
        this.expectedNumberOfFilterElements = n;
        this.k = k;
        this.bitsPerElement = c;
        this.bitSetSize = (int) Math.ceil(c * n);
        numberOfAddedElements = 0;
        this.bitset = new BitSet(bitSetSize);
    }

    /**
     * Constructs an empty Bloom filter. The optimal number of hash functions
     * (k) is estimated from the total size of the Bloom filter and the number
     * of expected elements.
     *
     * @param bitSetSize
     *            defines how many bits should be used in total for the filter.
     * @param expectedNumberOfElements
     *            defines the maximum number of elements the filter is expected
     *            to contain.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOfElements) {
        this(bitSetSize / (double) expectedNumberOfElements,
                expectedNumberOfElements,
                (int) Math.round((bitSetSize / (double) expectedNumberOfElements) * Math.log(2.0)));
    }

    /**
     * Constructs an empty Bloom filter with a given false positive probability.
     * The number of bits per element and the number of hash functions is
     * estimated to match the false positive probability.
     *
     * @param falsePositiveProbability
     *            is the desired false positive probability.
     * @param expectedNumberOfElements
     *            is the expected number of elements in the Bloom filter.
     */
    public BloomFilter(double falsePositiveProbability, int expectedNumberOfElements) {
        this(Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) / Math.log(2), // c = k / ln(2)
                expectedNumberOfElements,
                (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-log_2(false prob.))
    }

    /**
     * Construct a new Bloom filter based on existing Bloom filter data.
     *
     * @param bitSetSize
     *            defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements
     *            defines the maximum number of elements the filter is expected
     *            to contain.
     * @param actualNumberOfFilterElements
     *            specifies how many elements have been inserted into the
     *            <code>filterData</code> BitSet.
     * @param filterData
     *            a BitSet representing an existing Bloom filter.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements,
            int actualNumberOfFilterElements, BitSet filterData) {
        this(bitSetSize, expectedNumberOfFilterElements);
        this.bitset = filterData;
        this.numberOfAddedElements = actualNumberOfFilterElements;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val
     *            specifies the input data.
     * @param charset
     *            specifies the encoding of the input data.
     * @return digest as long.
     */
    public static long createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val
     *            specifies the input data. The encoding is expected to be
     *            UTF-8.
     * @return digest as long.
     */
    public static long createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data
     *            specifies input data.
     * @return digest as long.
     */
    public static long createHash(byte[] data) {
        long h = 0;
        byte[] res;

        // MessageDigest is not thread-safe, so the shared instance is locked
        synchronized (digestFunction) {
            res = digestFunction.digest(data);
        }

        // Pack the first four digest bytes into a non-negative long
        for (int i = 0; i < 4; i++) {
            h <<= 8;
            h |= ((int) res[i]) & 0xFF;
        }
        return h;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj
     *            is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;
        if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset
                && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     *
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on the
     * number of expected filter elements and the size of the Bloom filter.
     * <br /><br />
     * The value returned by this method is the <i>expected</i> rate of
     * false positives, assuming the number of inserted elements equals the
     * number of expected elements. If the number of elements in the Bloom
     * filter is less than the expected value, the true probability of false
     * positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculate the probability of a false positive given the specified number
     * of inserted elements.
     *
     * @param numberOfElements
     *            number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements
                / (double) bitSetSize)), k);
    }

    /**
     * Get the current probability of a false positive. The probability is
     * calculated from the size of the Bloom filter and the current number of
     * elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }

    /**
     * Returns the value chosen for K.<br /> <br /> K is the optimal number of
     * hash functions based on the size of the Bloom filter and the expected
     * number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        numberOfAddedElements = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element
     *            is an element to register in the Bloom filter.
     */
    public void add(E element) {
        long hash;
        String valString = element.toString();
        // Derive k hash values by appending the index 0..k-1 to the string
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            bitset.set(Math.abs((int) hash), true);
        }
        numberOfAddedElements++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     *
     * @param c
     *            Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom
     * filter. Use getFalsePositiveProbability() to calculate the probability of
     * this being correct.
     *
     * @param element
     *            element to check.
     * @return true if the element could have been inserted into the Bloom
     *         filter.
     */
    public boolean contains(E element) {
        long hash;
        String valString = element.toString();
        // A single unset bit proves the element was never added
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            if (!bitset.get(Math.abs((int) hash)))
                return false;
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     *
     * @param c
     *            elements to check.
     * @return true if all the elements in c could have been inserted into the
     *         Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     *
     * @param bit
     *            the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     *
     * @param bit
     *            is the bit to set.
     * @param value
     *            If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     *
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it was
     * constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

    /**
     * Get expected number of bits per element when the Bloom filter is full.
     * This value is set by the constructor when the Bloom filter is created.
     * See also getBitsPerElement().
     *
     * @return expected number of bits per element.
     */
    public double getExpectedBitsPerElement() {
        return this.bitsPerElement;
    }

    /**
     * Get actual number of bits per element based on the number of elements
     * that have currently been inserted and the length of the Bloom filter. See
     * also getExpectedBitsPerElement().
     *
     * @return number of bits per element.
     */
    public double getBitsPerElement() {
        return this.bitSetSize / (double) numberOfAddedElements;
    }
}

Appendix: a crawler that uses the Bloom filter for URL deduplication (a separate source file)

package com.lietu.show;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Properties;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Extracts product information
public class ExtractProduct {

    // 200000 bits for an expected 20000 URLs, i.e. 10 bits per URL
    static BloomFilter<String> urlSeen = new BloomFilter<String>(200000, 20000);

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        String url = "http://www.xiu.com/about/classifymap.shtml";
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]"); // <a> tags that have an href attribute
        for (Element link : links) { // iterate over every link
            String linkHref = link.attr("href"); // value of the href attribute, i.e. the URL
            if (linkHref.startsWith("http://list.xiu.com/")
                    || linkHref.startsWith("http://brand.xiu.com/")) {
                linkHref = getListURL(linkHref);
                if (urlSeen.contains(linkHref))
                    continue;
                // System.out.println(linkHref); // print the URL
                extractList(linkHref);
            }
        }
    }

    // Normalizes a list URL: keeps the base URL and, if present, only the currentPage parameter
    public static String getListURL(String oldURL) {
        ParseURL splitURL = new ParseURL(oldURL);
        if (splitURL.searchparms == null) {
            return splitURL.baseURL;
        }
        // System.out.println(splitURL.baseURL);
        String curPage = splitURL.searchparms.get("currentPage");
        // e.g. http://list.xiu.com/20069.html?currentPage=2
        if (curPage != null) {
            return splitURL.baseURL + "?currentPage=" + curPage;
        } else {
            return splitURL.baseURL;
        }
    }

    // Handles a list (category) page, e.g.
    // http://list.xiu.com/19757.html
    // http://brand.xiu.com/3117.html
    public static void extractList(String url) {
        System.out.println("List url:" + url);
        urlSeen.add(url);
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            try {
                doc = Jsoup.connect(url).get(); // second and last attempt
            } catch (IOException e1) {
                e1.printStackTrace();
                return;
            }
        }

        Elements links = doc.select("a[href]"); // <a> tags that have an href attribute
        for (Element link : links) { // iterate over every link
            String linkHref = link.attr("href"); // value of the href attribute, i.e. the URL
            if (linkHref.startsWith("http://list.xiu.com/")
                    || linkHref.startsWith("http://brand.xiu.com/")) {
                linkHref = getListURL(linkHref);
                if (urlSeen.contains(linkHref))
                    continue;
                extractList(linkHref); // recurse into the linked list page
            } else if (linkHref.startsWith("http://item.xiu.com/product/")) { // e.g. "http://item.xiu.com/product/0359097.shtml"
                if (urlSeen.contains(linkHref))
                    continue;
                getProduct(linkHref);
            }
        }
    }

    // Handles a product detail page
    public static void getProduct(String url) {
        System.out.println("Product url:" + url);
        urlSeen.add(url);
        // String url = "http://item.xiu.com/product/0359097.shtml";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            try {
                doc = Jsoup.connect(url).get(); // second attempt
            } catch (Exception e2) {
                try {
                    doc = Jsoup.connect(url).get(); // third and last attempt
                } catch (IOException e1) {
                    e1.printStackTrace();
                    return;
                }
            }
        }
        Elements links = doc.select("div.p_title"); // the product title block
        String name = links.get(0).childNode(1).childNode(0).toString();
        // System.out.println(name);

        Element link = doc.select("div.conlist").get(2);
        String desc = link.text();
        // System.out.println(link.text());

        Element img = doc.select("#imgPic").first();
        String imgSrc = img.attr("src");
        // System.out.println(img.attr("src"));

        Element thumb = doc.select("img[onload]").first();
        String thumbSrc = thumb.attr("src");
        // System.out.println(thumbSrc);

        insertDatabase(url, name, desc, imgSrc, thumbSrc);
    }

    // Stores one product row in the Access database good.mdb
    public static void insertDatabase(String url, String name, String desc,
            String img, String thumbSrc) {
        Connection conn = null;
        PreparedStatement stmt2 = null;
        try {
            // Note: the JDBC-ODBC bridge driver was removed in Java 8;
            // on newer JDKs a different JDBC driver is needed.
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Properties p = new Properties();
            p.put("charSet", "gb2312");
            conn = DriverManager
                    .getConnection("jdbc:odbc:driver={Microsoft Access Driver (*.mdb)};DBQ=good.mdb", p);

            stmt2 = conn
                    .prepareStatement("insert into good(url,productName,productDesc,img,thumb)values(?,?,?,?,?)");
            stmt2.setString(1, url);
            stmt2.setString(2, name);
            stmt2.setString(3, desc);
            stmt2.setString(4, img);
            stmt2.setString(5, thumbSrc);
            stmt2.executeUpdate();
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            try {
                if (stmt2 != null)
                    stmt2.close();
                if (conn != null)
                    conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

}
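
The crawler above relies on a ParseURL class that the post does not show. The sketch below is a hypothetical stand-in, not the original class: the name ListUrlNormalizer and its methods are invented for illustration. It shows the normalization that getListURL appears to perform, which is to drop every query parameter except currentPage so that equivalent list pages map to the same deduplication key.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the unseen ParseURL helper; plain string handling only. */
final class ListUrlNormalizer {
    /** Removes a trailing "#fragment", which never changes the fetched page. */
    private static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Parses "a=1&b=2" after the '?' into a map; returns null when there is no query string. */
    static Map<String, String> queryParams(String url) {
        String clean = stripFragment(url);
        int q = clean.indexOf('?');
        if (q < 0 || q == clean.length() - 1) {
            return null;
        }
        Map<String, String> params = new HashMap<>();
        for (String pair : clean.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Keeps the base URL plus currentPage only, mirroring getListURL in the listing. */
    static String normalize(String url) {
        String clean = stripFragment(url);
        int q = clean.indexOf('?');
        String base = q < 0 ? clean : clean.substring(0, q);
        Map<String, String> params = queryParams(clean);
        String page = params == null ? null : params.get("currentPage");
        return page == null ? base : base + "?currentPage=" + page;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://list.xiu.com/20069.html?currentPage=2&sort=price"));
        System.out.println(normalize("http://list.xiu.com/20069.html?sort=price"));
    }
}
```

Normalizing before the urlSeen check matters because a Bloom filter compares exact strings: two URLs that differ only in an irrelevant parameter would otherwise both be crawled.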
