2017.07.20

A scrapy spider that crawls questions and answers from Zhihu and stores them in a database.

Keywords: Item, ItemLoader, MySQLdb

When storing data with MySQLdb through the Twisted asynchronous framework, the SQL statement passed to
cursor.execute(insert_sql, params) must use %s as the placeholder for every parameter, whatever its Python type; any other format specifier raises an error.

For example:

     insert_sql = "insert into tb_name values(%s, %s, %s, %s)"
     params = (param1, param2, param3, param4)

No matter what types the elements of the params tuple are, insert_sql may only use %s as the placeholder; anything else produces a SQL error.
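As a rough illustration of why a single placeholder style works for every type (no database required), the conversion MySQLdb performs can be approximated as below. Note that to_sql_literal is a made-up helper for this sketch, not part of MySQLdb's API; the real driver also handles escaping, dates, and character encodings.

```python
def to_sql_literal(value):
    """Rough approximation of how MySQLdb turns a Python value
    into a SQL literal before splicing it over a %s placeholder."""
    if value is None:
        return "NULL"                                # Python None -> SQL NULL
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # quote and escape strings
    return str(value)                                # ints/floats pass through unquoted

insert_sql = "insert into tb_name values(%s, %s, %s, %s)"
params = ("spam", 5, 7.95, None)

# Every %s receives the literal form of its parameter, regardless of type.
rendered = insert_sql % tuple(to_sql_literal(p) for p in params)
# -> insert into tb_name values('spam', 5, 7.95, NULL)
```

This is why mixing an int, a float, a string, and NULL in one tuple still needs only %s: the driver, not the format string, decides how each value is rendered.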

From the MySQLdb documentation (http://mysql-python.sourceforge.net/MySQLdb.html#cursor-objects):

     c = db.cursor()
     max_price = 5
     c.execute("""SELECT spam, eggs, sausage FROM breakfast
                  WHERE price < %s""", (max_price,))

In this example, max_price=5, so why use %s in the string? Because MySQLdb will convert it to a SQL literal value, which is the string '5'. When it's finished, the query will actually say, "...WHERE price < 5".

Why the tuple? Because the DB API requires you to pass in any parameters as a sequence. Due to the design of the parser, (max_price) is interpreted as using algebraic grouping and simply as max_price and not a tuple. Adding a comma, i.e. (max_price,) forces it to make a tuple.
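The grouping-versus-tuple point above can be verified in plain Python:

```python
max_price = 5

# Parentheses without a comma are just algebraic grouping; the value stays an int.
no_comma = (max_price)
# A trailing comma is what actually builds the one-element tuple the DB API expects.
with_comma = (max_price,)

print(type(no_comma).__name__)    # int
print(type(with_comma).__name__)  # tuple
```

Passing the bare int to cursor.execute() is a common source of "not all arguments converted" errors; the trailing comma fixes it.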

The only other method you are very likely to use is when you have to do a multi-row insert:

     c.executemany(
         """INSERT INTO breakfast (name, spam, eggs, sausage, price)
         VALUES (%s, %s, %s, %s, %s)""",
         [
             ("Spam and Sausage Lover's Plate", 5, 1, 8, 7.95),
             ("Not So Much Spam Plate", 3, 2, 0, 3.95),
             ("Don't Want ANY SPAM! Plate", 0, 4, 3, 5.95),
         ])

Here we are inserting three rows of five values. Notice that there is a mix of types (strings, ints, floats) though we still only use %s. And also note that we only included format strings for one row. MySQLdb picks those out and duplicates them for each row.


Zhihu has an anti-crawling policy: it checks the User-Agent at regular intervals. When the spider runs automatically, the later HTTP requests all start coming back with status code 403, and opening one of those request URLs in a browser shows that Zhihu requires you to pass a captcha before you can continue.
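One common workaround is rotating the User-Agent per request with a scrapy downloader middleware. The sketch below assumes that approach: the class name and the agent list are illustrative (only process_request follows scrapy's middleware convention), and a captcha wall may still require cookies or manual solving on top of this.

```python
import random

# Illustrative pool of desktop User-Agent strings (samples, not exhaustive).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 "
    "(KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
    "Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
]

class RandomUserAgentMiddleware(object):
    """Hypothetical downloader middleware: pick a fresh User-Agent for
    every outgoing request (enable it via DOWNLOADER_MIDDLEWARES)."""

    def process_request(self, request, spider):
        # Overwrite the header so each request looks like a different browser.
        request.headers[b"User-Agent"] = random.choice(USER_AGENTS)
```

With a rotating pool, the periodic User-Agent check is less likely to flag a long-running crawl, though request rate and cookies matter as well.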

Handling this anti-crawling policy is left for a follow-up entry.
