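Scraping Baijiahao (2)

This post covers how to scrape the comment count and read count of Baijiahao articles.

Neither number is in the article list itself; both are returned by a separate JSON (JSONP) endpoint: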


https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp&params=%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228997120757336896754%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171319%22%2C%22feed_id%22%3A%228997120757336896754%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229442416292259854102%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171220%22%2C%22feed_id%22%3A%229442416292259854102%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228994022518148142722%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221084000014170786%22%2C%22feed_id%22%3A%228994022518148142722%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229180210467318996709%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221110000014181138%22%2C%22feed_id%22%3A%229180210467318996709%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229470100560664750777%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221119000014172446%22%2C%22feed_id%22%3A%229470100560664750777%22%7D%5D&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1
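To see what the percent-encoded part of this URL actually contains, it can help to decode it. The following is a minimal sketch using only the standard library; the URL is shortened to the first entry of the one above, and the variable names are my own.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Shortened example: only the first entry of the URL above is kept.
url = (
    "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"
    "&params=%5B%7B%22user_type%22%3A%223%22"
    "%2C%22dynamic_id%22%3A%229683117499664348209%22"
    "%2C%22dynamic_type%22%3A%222%22"
    "%2C%22dynamic_sub_type%22%3A%222001%22"
    "%2C%22thread_id%22%3A%221113000014175815%22"
    "%2C%22feed_id%22%3A%229683117499664348209%22%7D%5D"
    "&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1"
)

query = parse_qs(urlsplit(url).query)    # parse_qs percent-decodes every value
params = json.loads(query["params"][0])  # the decoded value is a JSON array of objects

for entry in params:
    print(entry)
```

Decoded, the params value is simply a JSON array with one object per article, and each object carries the same six string fields.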

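Breaking down the URL

The params value is built mainly from fields found in the JSON response of the previous post:

"user_type"

"dynamic_id"

"dynamic_type"

"dynamic_sub_type"

"thread_id"

"feed_id"

For each article, these six fields are concatenated into one URL-encoded object, and the objects are joined into an array.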
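As an alternative to concatenating the percent-encoded pieces by hand, the same string can be produced by serializing a list of dicts and encoding it once. This is only a sketch under a few assumptions: build_interact_url and its arguments are hypothetical names of mine, and the `_` parameter is assumed to be a millisecond timestamp (the value in the URL above is consistent with that, but the source does not say so).

```python
import json
import time
from urllib.parse import quote

BASE = "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"

def build_interact_url(entries, uk, timestamp_ms=None, callback="jsonp1"):
    """Build the request URL from a list of dicts holding the six fields."""
    if timestamp_ms is None:
        # Assumption: '_' is the current time in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    # Compact separators produce JSON without spaces, as in the URL in this post.
    raw = json.dumps(entries, separators=(",", ":"))
    return (f"{BASE}&params={quote(raw, safe='')}"
            f"&uk={uk}&_={timestamp_ms}&callback={callback}")

entries = [{
    "user_type": "3",
    "dynamic_id": "9683117499664348209",
    "dynamic_type": "2",
    "dynamic_sub_type": "2001",
    "thread_id": "1113000014175815",
    "feed_id": "9683117499664348209",
}]
example_url = build_interact_url(entries, "D0hHfmuMEVka02HZelKA7g", 1548119615162)
print(example_url)
```

Letting json.dumps and quote handle the brackets, quotes, and commas removes the need to special-case the last entry.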

The code:

import re

# title, url, date, cerate, publish, updated and asyncData come from the previous post.
# Assumptions: readjson already holds the URL up to and including '&params=%5B%7B%22',
# and b is the timestamp used for the '_' parameter.
for i in range(len(title)):
    user_type = re.findall(r'"user_type":"(.+?)",', asyncData[i])[0]
    dynamic_id = re.findall(r'"dynamic_id":"(.+?)",', asyncData[i])[0]
    dynamic_type = re.findall(r'"dynamic_type":"(.+?)",', asyncData[i])[0]
    dynamic_sub_type = re.findall(r'"dynamic_sub_type":"(.+?)",', asyncData[i])[0]
    thread_id = re.findall(r'"thread_id":"(.+?)",', asyncData[i])[0]
    feed_id = re.findall(r'"feed_id":"(.+?)"', asyncData[i])[0]
    print(title[i], url[i], date[i], cerate[i], publish[i], updated[i])

    # One URL-encoded object, without its closing quote and brace.
    entry = ('user_type%22%3A%22' + user_type + '%22%2C%22'
             + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22'
             + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22'
             + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22'
             + 'thread_id%22%3A%22' + thread_id + '%22%2C%22'
             + 'feed_id%22%3A%22' + feed_id)

    if i < len(title) - 1:
        # Every item except the last: close this object and open the next ("},{").
        readjson += entry + '%22%7D%2C%7B%22'
    else:
        # Last item: close the object and the array ("}]").
        readjson += entry + '%22%7D%5D'

readjson += '&uk=D0hHfmuMEVka02HZelKA7g&_=' + str(b)

Note: the last feed_id is followed by %22%7D%5D, which closes the array, and not by the '%22%7D%2C%7B%22' separator used after the earlier entries.
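Because the URL asks for format=jsonp with callback=jsonp1, the response body will presumably be JSON wrapped in a jsonp1(...) call, so the wrapper has to be removed before parsing. The sketch below shows one way to do that; unwrap_jsonp is a hypothetical helper, and the sample body and its field names (read_num, comment_num) are invented for illustration, since the real response is not shown in this post.

```python
import json
import re

def unwrap_jsonp(text):
    """Strip a callback wrapper such as jsonp1({...}) and parse the JSON inside."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))

# Hypothetical response body; the real field names may differ.
sample = 'jsonp1({"errno": 0, "data": {"items": [{"read_num": 1234, "comment_num": 5}]}})'
payload = unwrap_jsonp(sample)
print(payload["data"]["items"][0])
```

In a real run, the text passed to unwrap_jsonp would be the body returned when requesting readjson, and the keys to read would need to be checked against the actual response.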
