Scraping the 10X Genomics Support Documentation with Python

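10X Genomics has a fairly complete single-cell analysis technology stack, and in particular a very rich set of technical documents. I usually look things up online only when I need a specific point, so I never had a full picture of what they actually offer, and I personally find electronic documents inconvenient for taking notes. So I wondered whether I could batch-download their user guides. Naturally, the first idea was to scrape them with Python.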

user-guides

A quick search turned up their 10X library, which holds a lot of material.

To see where these files actually live, I looked at the page source.

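As long as I can collect the links ending in .pdf, I can simply download them one by one. I was sure Python could handle this, so how should I write it? Look for a library. I had used requests before, which should work here too, and I am surely not the first person to scrape PDF files, so I looked for an existing example.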

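Of course there is one: the post "python3爬虫下载网页上的pdf" (downloading the PDFs on a web page with a Python 3 crawler) covers most of what is needed.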

# -*- coding: utf-8 -*-
"""
Created on Sat Aug 24 15:44:53 2019

@author: Administrator
"""
import urllib.request
import re
import os
import requests


def getHtml(url):
    """Fetch a page and return its raw bytes."""
    # Send a browser-like User-Agent so the request is not rejected as a bot
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(request)
    html = page.read()
    page.close()
    return html

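One thing to note here: some sites have anti-scraping measures, so the request can be disguised as coming from a browser by setting a User-Agent header.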

 

def getUrl(html):
    reg = r'(
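
The body of getUrl is cut off in the source, and getFile, which the main script calls later, is not shown at all. The following is only a minimal sketch of what the two helpers might look like, not the original code. It assumes the page lists its PDFs as protocol-relative links (starting with "//") inside href attributes, which is suggested by the later 'http:' + url step; the PDF_LINK pattern is hypothetical and may need adjusting to the real markup.

```python
import re
import urllib.request

# Hypothetical pattern (an assumption, not the original regex):
# protocol-relative links ending in .pdf inside href="...".
PDF_LINK = re.compile(r'href="(//[^"]+?\.pdf)"')


def getUrl(html):
    """Return the distinct .pdf links found in the page, in page order."""
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="ignore")
    links = []
    for link in PDF_LINK.findall(html):
        if link not in links:  # skip links that appear more than once
            links.append(link)
    return links


def getFile(url):
    """Download one file into the current directory, named after the last URL segment."""
    file_name = url.split("/")[-1]
    with urllib.request.urlopen(url) as response, open(file_name, "wb") as out:
        out.write(response.read())
    print("Downloaded", file_name)
```

getUrl accepts the bytes returned by getHtml and returns a list of strings, which is the shape the download loop expects.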

 
# Scrape one listing page at a time; only the uncommented raw_url is used.
# raw_url = "https://www.10xgenomics.com/cn/resources/user-guides/"
# raw_url = "https://www.10xgenomics.com/cn/resources/demonstrated-protocols/"
raw_url = "https://www.10xgenomics.com/cn/resources/technical-notes/"
html = getHtml(raw_url)
url_lst = getUrl(html)

print("url_lst", url_lst)
os.chdir("C:\\Users\\Administrator\\Desktop\\10X print")

# Download into a subfolder named after the listing page
if not os.path.exists('technical-notes'):
    os.mkdir('technical-notes')
os.chdir(os.path.join(os.getcwd(), 'technical-notes'))

for url in url_lst:
    url = 'http:' + url  # prepend the scheme to form the full download URL
    getFile(url)
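
The 'http:' + url step only works when every link starts with "//". As a hedged alternative, the standard library's urljoin can resolve a link against the page it came from, whatever form the link takes. The helper name absolute_url and the cdn.example.com host below are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_url(page_url, link):
    """Resolve a link found on page_url into a full URL."""
    return urljoin(page_url, link)


page = "https://www.10xgenomics.com/cn/resources/technical-notes/"
# A protocol-relative link inherits the scheme of the page
print(absolute_url(page, "//cdn.example.com/docs/a.pdf"))
# A root-relative link inherits the scheme and the host
print(absolute_url(page, "/files/b.pdf"))
```

With this helper the loop body would call getFile(absolute_url(raw_url, url)) instead of concatenating strings.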
 
The downloads ran along happily.

Two hours of writing code, five minutes to finish the task.

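Next, of course, I want to use Python to stitch these individual PDF files together into a single book.
Reference: "python PDF文件合并、图片处理" (merging PDF files and processing images in Python)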
import os
import PyPDF2

pdf_dir = "C:\\Users\\Administrator\\Desktop\\10X print\\technical-notes"

# Collect the PDF files in the download folder
files = list()
for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        files.append(filename)

# No numeric sort is applied here (plain string order compares character codes,
# so "10" would come before "2"); the files are merged in the order listed.
newfiles = files
os.chdir(pdf_dir)

# PdfFileWriter and PdfFileReader are the class names from older PyPDF2 releases (before 3.0)
pdfwriter = PyPDF2.PdfFileWriter()
for item in newfiles:
    pdfreader = PyPDF2.PdfFileReader(open(item, "rb"))
    for page in range(pdfreader.numPages):
        pdfwriter.addPage(pdfreader.getPage(page))

# A plain binary open() is enough for writing the merged file
with open("technical-notes.pdf", "wb") as f:
    pdfwriter.write(f)
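
The merge script mentions numeric ordering but leaves the file list unsorted. If the merged book should follow the numbers in the file names, one common approach is a "natural" sort key that compares digit runs as integers. This is a small sketch under that assumption; natural_key and the sample file names are made up for illustration.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that 2 sorts before 10."""
    parts = re.split(r"(\d+)", name)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


files = ["note10.pdf", "note2.pdf", "note1.pdf"]
print(sorted(files))                   # ['note1.pdf', 'note10.pdf', 'note2.pdf']
print(sorted(files, key=natural_key))  # ['note1.pdf', 'note2.pdf', 'note10.pdf']
```

In the merge script this would amount to replacing newfiles = files with newfiles = sorted(files, key=natural_key).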
 
And that is how a 166-page collection of 10X Genomics technical documents came to be.
