23 Python Crawler Projects

Today I have compiled 23 Python crawler projects for you. Crawlers are simple and quick to get started with, which also makes them a good way for beginners to build confidence. All links point to GitHub and cannot be opened directly inside WeChat; as usual, please open them on a computer.
Follow the official account "Python column" and reply "crawler books" in the background to receive two Python crawler-related e-books.

  1. WechatSogou - WeChat official account crawler
    An official account crawler interface based on Sogou WeChat Search, which can be extended into a general official account crawler based on Sogou search. Results are returned as a list, where each item is a dictionary of details for one official account (see the sketch after this entry).
    GitHub address:
    https://github.com/Chyroc/WechatSogou
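A minimal, hand-rolled sketch of what such a Sogou-based search could look like: fetch the Sogou WeChat search page and return a list of dictionaries, one per official account. This is not the project's actual API; the request parameters and CSS selectors are assumptions made only for illustration.

```python
# Hand-rolled illustration, NOT the project's API: query Sogou WeChat search and
# return a list of dicts, one per official account. Parameters/selectors are assumed.
import requests
from bs4 import BeautifulSoup

def search_official_accounts(keyword):
    resp = requests.get(
        "https://weixin.sogou.com/weixin",
        params={"type": 1, "query": keyword},       # type=1: account search (assumed)
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for item in soup.select("ul.news-list2 > li"):  # selector is an assumption
        title = item.select_one("p.tit")
        link = item.select_one("a")
        results.append({
            "name": title.get_text(strip=True) if title else "",
            "url": link["href"] if link else "",
        })
    return results

if __name__ == "__main__":
    for account in search_official_accounts("python"):
        print(account)
```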
  2. DouBanSpider - Douban Books crawler
    Crawls all books under a Douban Books tag and stores them in Excel sorted by rating, which makes it easy to pick out highly rated books, for example those with more than 1,000 ratings. Results can be written to separate Excel sheets by topic. The crawler uses a browser-style User-Agent and adds random delays to better imitate browser behavior and avoid being blocked (a minimal sketch of this idea follows this entry).
    GitHub address:
    https://github.com/lanbing510/DouBanSpider
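A minimal sketch of the anti-blocking technique mentioned above: send requests with a browser-like User-Agent and sleep for a random interval between pages. The tag URL and the delay range are illustrative assumptions, not values taken from the project.

```python
import random
import time

import requests

# Pretend to be a regular browser instead of a script.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_tag_pages(tag, pages=3):
    # Douban book-tag listings paginate in steps of 20 items per page.
    for start in range(0, pages * 20, 20):
        url = f"https://book.douban.com/tag/{tag}?start={start}"
        resp = requests.get(url, headers=HEADERS, timeout=10)
        print(url, resp.status_code)
        time.sleep(random.uniform(1, 3))  # random delay to imitate a human reader

if __name__ == "__main__":
    fetch_tag_pages("编程")
```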
  3. zhihu_spider - Zhihu crawler
    Crawls Zhihu user information and the topology of relationships between users. The crawler is built with scrapy, and the data is stored in MongoDB.
    GitHub address:
    https://github.com/LiuRoy/zhihu_spider
  4. bilibili-user - Bilibili user crawler
    Total records: 20,119,918. Captured fields: user id, nickname, gender, avatar, level, experience points, number of followers, birthday, address, registration time, signature, etc. A Bilibili user data report is generated after crawling.
    GitHub address:
    https://github.com/airingursb/bilibili-user
  5. SinaSpider - Sina Weibo crawler
    Mainly crawls Sina Weibo users' personal information, Weibo posts, followers, and followees. The code obtains Sina Weibo cookies to log in and uses multiple accounts to sidestep Sina's anti-crawling measures (a sketch of the cookie-rotation idea follows this entry). The scrapy crawler framework is used throughout.
    GitHub address:
    https://github.com/LiuXingMing/SinaSpider
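A minimal sketch of the multi-account idea, assuming a pre-collected pool of login cookies: each request takes the next cookie from the pool so that no single account hits Sina's anti-crawling limits. The cookie values and the profile URL are placeholders.

```python
import itertools

import requests

# Placeholder cookies; in practice each dict would hold a real logged-in session.
COOKIE_POOL = [
    {"SUB": "cookie-for-account-1"},
    {"SUB": "cookie-for-account-2"},
    {"SUB": "cookie-for-account-3"},
]
_cookies = itertools.cycle(COOKIE_POOL)

def fetch(url):
    # Rotate through the account pool, one cookie per request.
    return requests.get(url, cookies=next(_cookies),
                        headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

if __name__ == "__main__":
    print(fetch("https://weibo.cn/u/1234567890").status_code)  # placeholder user URL
```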
  6. distribute_crawler - distributed crawler for novel downloads
    A distributed web crawler built with scrapy, Redis, MongoDB, and graphite, targeting a single novel site: a MongoDB cluster provides the underlying storage, Redis implements the distribution across crawler nodes (see the sketch after this entry), and graphite displays the crawler's status.
    GitHub address:
    https://github.com/gnemoug/distribute_crawler
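A minimal sketch of how Redis can provide the distribution, assuming a shared URL queue: every worker process, on any machine, pops URLs from the same Redis list, so scaling out only means starting more workers. The queue name and Redis address are assumptions; the real project builds this on scrapy with a MongoDB cluster for storage.

```python
import redis
import requests

r = redis.Redis(host="localhost", port=6379, db=0)   # assumed Redis address
QUEUE_KEY = "novel:start_urls"                       # hypothetical shared queue name

def worker():
    while True:
        url = r.lpop(QUEUE_KEY)      # every worker pulls from the same shared queue
        if url is None:
            break                    # queue drained
        resp = requests.get(url.decode(), timeout=10)
        print(url.decode(), len(resp.text))
        # the parsed chapter text would be written to MongoDB here

if __name__ == "__main__":
    r.rpush(QUEUE_KEY, "https://example.com/novel/chapter-1")  # seed the queue
    worker()
```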
  7. CnkiSpider - CNKI (China National Knowledge Infrastructure) crawler.
    After setting the search criteria, run src/CnkiSpider.py to fetch the data, which is stored under the /data directory; the first line of each data file holds the field names.
    GitHub address:
    https://github.com/yanzhou/CnkiSpider
  8. LianJiaSpider - Lianjia crawler.
    Crawls Lianjia's historical second-hand housing transaction records in Beijing. It contains the complete Lianjia crawler code, including the simulated-login code.
    GitHub address:
    https://github.com/lanbing510/LianJiaSpider
  9. scrapy_jingdong - JD.com crawler.
    A scrapy-based crawler for the JD.com site; results are saved in CSV format (a sketch of scrapy's CSV export follows this entry).
    GitHub address:
    https://github.com/taizilongxu/scrapy_jingdong
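As a sketch of how scrapy's built-in CSV export typically works (whether this project uses exactly this mechanism is not stated): declare a CSV feed in the spider settings, or pass `-o items.csv` on the command line. The spider name, start URL, and selectors below are placeholders.

```python
import scrapy

class JdDemoSpider(scrapy.Spider):
    name = "jd_demo"                                               # placeholder name
    start_urls = ["https://search.jd.com/Search?keyword=python"]   # placeholder URL
    custom_settings = {
        "FEEDS": {"items.csv": {"format": "csv"}},  # write yielded items to CSV
    }

    def parse(self, response):
        for li in response.css("li.gl-item"):        # selectors are assumptions
            yield {
                "title": li.css("div.p-name em::text").get(),
                "price": li.css("div.p-price i::text").get(),
            }
```

Saved as jd_demo.py, this can be run standalone with `scrapy runspider jd_demo.py`.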
  10. QQ-Groups-Spider - QQ group crawler.
    Captures QQ group information in batches, including group name, group number, group owner, group description, and so on, and finally generates an XLS(X)/CSV result file.
    GitHub address:
    https://github.com/caspartse/QQ-Groups-Spider
  11. wooyun_public - Wooyun crawler.
    Crawler and search for Wooyun's public vulnerabilities and knowledge base. The list of all public vulnerabilities and the text of each vulnerability are stored in MongoDB, about 2 GB in total; crawling the whole site's text and images for offline querying takes roughly 10 GB of disk space and about 2 hours on a 10 Mbit telecom connection; crawling the entire knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap for the front end (a minimal sketch follows this entry).
    https://github.com/hanc00l/wooyun_public
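A minimal sketch of the search side described above, assuming the vulnerability documents sit in a local MongoDB: Flask serves a query endpoint against the collection. The database, collection, and field names are assumptions for illustration; the project's actual front end is built with Bootstrap templates.

```python
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
# Assumed database/collection names; the real ones are defined by the project.
bugs = MongoClient("mongodb://localhost:27017")["wooyun"]["bugs"]

@app.route("/search")
def search():
    keyword = request.args.get("q", "")
    docs = bugs.find({"title": {"$regex": keyword}}, {"_id": 0, "title": 1}).limit(20)
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run(port=5000)
```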
  12. spider - hao123 website crawler.
    Starting from hao123 as the entry page, it crawls outward along external links, collecting site URLs and recording each site's number of internal and external links, its title, and other information (see the sketch after this entry). Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours.
    https://github.com/simapple/spider
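A minimal sketch of the bookkeeping described above: fetch a page, then count how many of its links stay on the same domain (internal) versus point elsewhere (external), along with the page title. The seed URL matches the entry; everything else is illustrative.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def count_links(url):
    host = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    internal = external = 0
    for a in soup.find_all("a", href=True):
        # Resolve relative links, then compare the link's host with the page's host.
        link_host = urlparse(urljoin(url, a["href"])).netloc
        if link_host == host:
            internal += 1
        else:
            external += 1
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    return title, internal, external

if __name__ == "__main__":
    print(count_links("https://www.hao123.com/"))
```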
  13. findtrip - air ticket crawler (Qunar and Ctrip).
    Findtrip is an air ticket crawler based on Scrapy that currently integrates two major domestic ticket sites (Qunar + Ctrip).
    https://github.com/fankcoder/findtrip
  14. 163spider - NetEase client content crawler based on requests, MySQLdb, and torndb
    https://github.com/leyle/163spider
  15. doubanspiders - a collection of crawlers for Douban movies, books, groups, albums, things, etc.
    ~~https://github.com/fanpei91/doubanspiders~~
    Updated March 12, 2019:
    This project now returns 404; please be aware.
  16. QQSpider - Qzone (QQ Space) crawler, covering journals, status posts ("shuoshuo"), personal information, etc.; it can capture around 4 million records per day.
    https://github.com/LiuXingMing/QQSpider
  17. baidu-music-spider - Baidu MP3 crawler, using Redis to support resuming after an interruption (a sketch of this idea follows this entry).
    https://github.com/Shu-Ji/baidu-music-spider
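A minimal sketch of the resume-from-breakpoint idea, assuming song IDs are the unit of work: record every finished ID in a Redis set and skip IDs that are already there when the crawler restarts. The key name and the download step are placeholders.

```python
import redis

r = redis.Redis()
DONE_KEY = "baidu_music:done_ids"    # hypothetical key holding finished song IDs

def download_song(song_id):
    print("downloading", song_id)    # placeholder for the real fetch/parse step

def crawl(song_ids):
    for song_id in song_ids:
        if r.sismember(DONE_KEY, song_id):
            continue                 # finished before the last interruption: skip
        download_song(song_id)
        r.sadd(DONE_KEY, song_id)    # mark as done so a restart resumes after it

if __name__ == "__main__":
    crawl(range(100, 110))
```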
  18. tbcrawler - Taobao and Tmall crawler; it can capture page information by search keyword or item ID, with data stored in MongoDB.
    https://github.com/pakoo/tbcrawler
  19. Stockholm - a stock data (Shanghai and Shenzhen markets) crawler and stock-picking strategy testing framework. According to the selected ...
