Python Crawler: Logging in to CSDN with urllib.request and Cookies

Recently I wanted to scrape some content for myself, so I started brushing up on Python crawlers again.

First, find the URL of the login page.

https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn

Fetching the page with plain urllib reveals the source of the form that gets submitted:

                  <form id="fm1" action="/account/verify;jsessionid=78D8B598F6A7667130715F7491D6AFDD.tomcat1" method="post">

                    <input id="username" name="username" tabindex="1" placeholder="输入用户名/邮箱/手机号" class="user-name" type="text" value=""/>
                    <div class="mobile-auth" style="display:none"><span>该手机已绑定账号,可使用  </span><a href="" id="mloginurl" class="mobile-btn" >手机验证码登录</a></div>
                    <input id="password" name="password" tabindex="2" placeholder="输入密码" class="pass-word" type="password" value="" autocomplete="off"/>




                            <div class="error-mess" style="display:none;">
                                <span class="error-icon"></span><span id="error-message"></span>
                            </div>



                    <div class="row forget-password">
                        <span class="col-xs-6 col-sm-6 col-md-6 col-lg-6">
                            <input type="checkbox" name="rememberMe" id="rememberMe" value="true" class="auto-login" tabindex="4"/>
                            <label for="rememberMe">下次自动登录</label>
                        </span>
                        <span class="col-xs-6 col-sm-6 col-md-6 col-lg-6 forget tracking-ad" data-mod="popu_26">
                            <a href="/account/fpwd?action=forgotpassword&service=http%3A%2F%2Fmy.csdn.net%2Fmy%2Fmycsdn" tabindex="5">忘记密码</a>
                        </span>
                    </div>
                    
                    <input type="hidden" name="lt" value="LT-9098-f0M45K9ONcaHCXC7e00ykfOpTxPheC" />
                    <input type="hidden" name="execution" value="e1s1" />
                    <input type="hidden" name="_eventId" value="submit" />
                    <input class="logging" accesskey="l" value="登 录" tabindex="6" type="button" />

                  </form>

Note the jsessionid embedded in the form's action attribute.
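As a minimal sketch of how the form's action and hidden fields could be read out of the page source, the standard-library html.parser can be used in place of a regular expression. The class name LoginFormParser, the helper parse_login_form, and the trimmed SAMPLE markup are assumptions made for illustration; only the attribute names and values come from the form shown above.

```python
from html.parser import HTMLParser

# Trimmed copy of the form above, used only as test input for this sketch.
SAMPLE = '''
<form id="fm1" action="/account/verify;jsessionid=78D8B598F6A7667130715F7491D6AFDD.tomcat1" method="post">
  <input id="username" name="username" type="text" value=""/>
  <input type="hidden" name="lt" value="LT-9098-f0M45K9ONcaHCXC7e00ykfOpTxPheC" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
'''


class LoginFormParser(HTMLParser):
    """Collects the action of form#fm1 and every hidden input (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here by default.
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "fm1":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")


def parse_login_form(html):
    """Return (jsessionid or None, dict of hidden field name -> value)."""
    parser = LoginFormParser()
    parser.feed(html)
    parser.close()
    jsessionid = None
    if parser.action and ";jsessionid=" in parser.action:
        jsessionid = parser.action.split(";jsessionid=", 1)[1]
    return jsessionid, parser.hidden


if __name__ == "__main__":
    print(parse_login_form(SAMPLE))
```

Parsing by tag and attribute is less sensitive to attribute order and whitespace than a single long regular expression, at the cost of a few more lines.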

In addition, the login request captured with Fiddler contains the fields username, password, lt, execution, and _eventId.


In the fetched page source, CSDN has also annotated lt and the other fields with comments.
The URL that the form data is submitted to is

https://passport.csdn.net/account/verify
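To make the shape of the login request concrete, here is a small sketch that only builds the POST request object and sends nothing. The function name build_login_request and the demo field values are assumptions; the URL and field names are the ones listed above.

```python
import urllib.parse
import urllib.request

VERIFY_URL = "https://passport.csdn.net/account/verify"


def build_login_request(jsessionid, fields, user_agent="Mozilla/5.0"):
    """Build (but do not send) the POST request for the verify URL.

    Hypothetical helper: jsessionid and fields are expected to come from
    a freshly fetched login page.
    """
    url = VERIFY_URL + ";jsessionid=" + jsessionid
    # urlencode yields "k1=v1&k2=v2"; urllib needs the body as bytes.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    demo_fields = {
        "username": "demo",      # placeholder credentials
        "password": "secret",
        "lt": "LT-1",            # placeholder token
        "execution": "e1s1",
        "_eventId": "submit",
    }
    req = build_login_request("ABC.tomcat1", demo_fields)
    print(req.get_method(), req.full_url)
    print(req.data)
```

The request would then be sent through an opener that has an HTTPCookieProcessor attached, so that the cookies set by the server are kept for later requests.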

So my guess was that these parameters are the key to logging in. How can they be obtained?

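As for jsessionid, reloading the login page several times showed that it is different every time, and lt and execution change as well, so these values presumably have to be fetched dynamically. I therefore changed the code as follows: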


import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar
import re
import sys

class CsdnCookie:
    def __init__(self):
        self.login_url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
        self.verify_url = 'https://passport.csdn.net/account/verify'
        self.my_url = 'https://my.csdn.net/'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        self.user_headers = {
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
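            # 'Accept-Encoding' is deliberately not sent: urllib does not decompress
            # gzip/deflate/br responses, so the utf-8 decode below would fail on them.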
            'Connection': "Keep-Alive",
            'User-Agent': self.user_agent
        }
        self.cookie_dir = 'C:/Users/ecaoyng/Desktop/PPT/cookie_csdn.txt'



    def get_lt_execution(self):
        """Fetch the login page, extract jsessionid/lt/execution, log in, and save the cookies."""
        cookie = http.cookiejar.MozillaCookieJar(self.cookie_dir)
        handler = urllib.request.HTTPCookieProcessor(cookie)
        opener = urllib.request.build_opener(handler)
        # request = urllib.request.Request(self.login_url, headers=self.user_headers)
        try:
            response = opener.open(self.login_url)
            page_src = response.read().decode(encoding="utf-8")
            pattern = re.compile(
                'login.css;jsessionid=(.*?)".*?name="lt" value="(.*?)" />.*?name="execution" value="(.*?)" />', re.S)
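            items = re.findall(pattern, page_src)
            if not items:
                # Without a match, items[0] below would raise IndexError.
                print('Could not find jsessionid/lt/execution in the login page')
                return
            print(items)
            print('='*80)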
            values = {
                'username' : "username",  # placeholder: replace with your own account name
                'password' : "password",  # placeholder: replace with your own password
                'lt' : items[0][1],
                'execution' : items[0][2],
                '_eventId' : "submit"
            }
            post_data = urllib.parse.urlencode(values)
            post_data = post_data.encode('utf-8')
            opener.addheaders = [('User-Agent', self.user_agent)]
            # Use a local variable so repeated calls do not keep appending jsessionid to self.verify_url.
            verify_url = self.verify_url + ';jsessionid=' + items[0][0]
            print('=' * 80)
            print(verify_url)
            print('=' * 80)
            response_login=opener.open(verify_url,post_data)
            print(response_login.read().decode(encoding="utf-8"))

            for i in cookie:
                print('Name: %s' % i.name)
                print('Value: %s' % i.value)
            print('=' * 80)
            cookie.save(ignore_discard=True, ignore_expires=True)

            my_page=opener.open(self.my_url)
            print(my_page.read().decode(encoding = 'utf-8'))
        except urllib.error.URLError as e:
            print('Error msg: %s' % e.reason)

    def access_other_page(self):
        """Load the saved cookie file and open a page that requires login."""
        try:
            cookie = http.cookiejar.MozillaCookieJar()
            cookie.load(self.cookie_dir, ignore_discard=True, ignore_expires=True)
            get_request = urllib.request.Request(self.my_url, headers=self.user_headers)
            access_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
            get_response = access_opener.open(get_request)
            print('='*80)
            print(get_response.read().decode(encoding="utf-8"))
        except Exception as e:
            # A generic Exception has no callable .reason, so print the exception itself.
            print('Error msg when accessing other pages: %s' % e)


if __name__ == '__main__':
    print(sys.getdefaultencoding())
    print('='*80)
    cookie_obj=CsdnCookie()
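    # Run get_lt_execution() once to log in and write the cookie file;
    # afterwards access_other_page() can reuse the saved cookies.
    # cookie_obj.get_lt_execution()
    cookie_obj.access_other_page()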

Below are the CSDN cookies that were obtained:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.csdn.net   TRUE    /   FALSE       AU  1FF
.csdn.net   TRUE    /   FALSE   xxx BT  xxx
.csdn.net   TRUE    /   FALSE       UD  Python%E7%88%B1%E5%A5%BD%E8%80%85
.csdn.net   TRUE    /   FALSE   1543915400  UE  "[email protected]"
.csdn.net   TRUE    /   FALSE   1543915398  UN  username
.csdn.net   TRUE    /   FALSE       UserInfo    2MufVKKubW9%2FasttTNA6s3WQr%2BaBa08G3ijawR7NBVftvqoXgXWKvKxjvv2g3YMtJINvNyXOlJM%2FMpWjlo3nxZmMLRQbY5D51X2sJgag7QtsKAGN6NBORCEWVZ1W0BzbQ%2FFZwXUiAjK7CwakS5fGJg%3D%3D
.csdn.net   TRUE    /   FALSE       UserName    username
.csdn.net   TRUE    /   FALSE       UserNick    nickname
.csdn.net   TRUE    /   FALSE       access-token    012f25c0-3341-444d-864c-3a1f497948c9
.csdn.net   TRUE    /   FALSE   1735689600  dc_session_id   10_1512379399618.695892
.csdn.net   TRUE    /   FALSE   1735689600  uuid_tt_dd  10_9929133460-1512379399943-185666
passport.csdn.net   FALSE   /   TRUE        CASTGC  TGT-151925-6sfuatdEVoSydhGYa1jjSpaMxCzxmKVOI9dfLECPmqfdMqxDuT-passport.csdn.net
passport.csdn.net   FALSE   /   FALSE       JSESSIONID  42458334234E0ED744ED275311851BE4.tomcat1
passport.csdn.net   FALSE   /   TRUE    1514971397  LSSC    LSSC-1192363-wi3f7cybBifLidkrZxeiGMpCjUyOkE-passport.csdn.net
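
Many entries in the file above have an empty expiry column, which http.cookiejar treats as session cookies. The following sketch, using a made-up two-line cookie file and a hypothetical helper named load_cookie_names, illustrates why the code above passes ignore_discard=True when loading: without it, those session cookies are skipped.

```python
import http.cookiejar
import os
import tempfile

# Made-up sample in the tab-separated Netscape layout. The second entry has
# an empty expiry column, which marks it as a session cookie.
SAMPLE_COOKIE_FILE = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t4102444800\tUN\tdemo_user\n"
    ".example.com\tTRUE\t/\tFALSE\t\tUserNick\tdemo_nick\n"
)


def load_cookie_names(path, ignore_discard):
    """Load a Netscape cookie file and return the sorted cookie names."""
    jar = http.cookiejar.MozillaCookieJar()
    jar.load(path, ignore_discard=ignore_discard, ignore_expires=True)
    return sorted(cookie.name for cookie in jar)


def demo():
    """Write the sample to a temporary file and load it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookies.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(SAMPLE_COOKIE_FILE)
        kept_all = load_cookie_names(path, ignore_discard=True)
        kept_persistent = load_cookie_names(path, ignore_discard=False)
    return kept_all, kept_persistent


if __name__ == "__main__":
    print(demo())
```

With ignore_discard=True both cookies are loaded; with ignore_discard=False only the one with an expiry time is kept.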
