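Basics:
1. URL (Uniform Resource Locator): the address of a standard resource on the Internet, commonly called a "web address".
2. Python 3.x no longer has a urllib2 library; only urllib remains.
3. URL encoding is also called percent-encoding.
4. urllib2 in Python 2.7 corresponds to urllib.request in Python 3 (with its exceptions in urllib.error),
and robotparser has become a module of the urllib package.
According to the official documentation, urllib is a package for working with URLs.
It contains four modules:
1. urllib.request opens and reads URLs.
     1.1. The urlopen function is the usual way to open a URL.
     1.2. Building an opener with the build_opener function is the more advanced way to open a page.
2. urllib.error contains the exceptions raised while urllib.request is running.
3. urllib.parse parses URLs.
4. urllib.robotparser parses robots.txt files.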
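To make the split between urllib.request and urllib.error concrete, here is a minimal sketch of catching the exceptions that opening a URL can raise. The helper name describe_fetch and the made-up scheme nosuchscheme are assumptions, chosen so that the example fails in a predictable way without touching the network.

```python
from urllib import error, request


def describe_fetch(url):
    """Try to open url and report what happened as a short string."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            return "ok %s" % resp.getcode()
    except error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return "http error %d" % exc.code
    except error.URLError as exc:
        # Failures below the HTTP level: unsupported scheme, refused connection, ...
        return "url error: %s" % exc.reason


# 'nosuchscheme' is a hypothetical scheme that no installed handler supports.
print(describe_fetch("nosuchscheme://example/"))
```

Catching HTTPError before URLError matters only when the two cases need different handling; a single except error.URLError clause covers both.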
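I. Commonly used functions in urllib.request
urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, cadefault=False, context=None)
1. The urllib.request module uses HTTP/1.1 and includes a Connection: close header in its HTTP requests.
2. The optional timeout parameter sets a timeout, in seconds, for blocking operations such as the connection attempt. If the connection has not been established after timeout seconds, a timeout exception is raised. If it is not set, the global default timeout is used.
3. For HTTP and HTTPS URLs, this function returns a slightly modified http.client.HTTPResponse object, which offers the following:
-   It is a file-like object, so the usual file methods work (read, readline, fileno, close).
-   geturl(): returns the URL of the resource retrieved.
-   getcode(): returns the HTTP status code of the response; 200 means the request succeeded, 404 means the resource was not found (for error statuses like this, urlopen raises urllib.error.HTTPError instead of returning).
-   info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server.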
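The response methods listed above can be tried without Internet access. The sketch below is only an illustration: it starts a throwaway HTTP server on the loopback interface (HelloHandler and its fixed "hello" body are assumptions, not part of urllib) and opens it through an opener built with build_opener. The ProxyHandler({}) argument is there so that a proxy configured in the environment does not intercept the local request; request.urlopen(url, timeout=5) would return the same kind of response object.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An opener without proxies, so the loopback request is sent directly.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as resp:
    status = resp.getcode()                      # HTTP status code
    final_url = resp.geturl()                    # URL actually retrieved
    content_type = resp.info()["Content-Type"]   # one of the response headers
    body = resp.read()                           # file-like read

server.shutdown()
server.server_close()
print(status, final_url, content_type, body)
```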
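II. Commonly used functions in urllib.parse:
1. urllib.parse.urlparse(url, scheme='', allow_fragments=True):
- Parses a URL and splits it into 6 components.
- Returns a 6-element tuple (scheme, netloc, path, params, query, fragment) in the form of a urllib.parse.ParseResult object;
each of the 6 elements is also available as an attribute of the same name.
eg: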
>>> from urllib import parse
>>> url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
>>> parseResult = parse.urlparse(url)
>>> parseResult  # the address split into its components
ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')
>>> parseResult.query
'q=parse&check_keywords=yes&area=default'
The output makes the meaning of each component clear.
2. urllib.parse.urlunparse(components)
- The inverse of urlparse.
- Takes a 6-element tuple and returns the complete URL.
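As a small illustration of urlunparse being the inverse of urlparse, the sketch below parses a URL, puts it back together, and then rebuilds it with two components changed. The replacement query and fragment values are made up for the example.

```python
from urllib import parse

url = 'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
parts = parse.urlparse(url)

# Unchanged components reassemble into the original URL.
round_trip = parse.urlunparse(parts)

# ParseResult is a named tuple, so _replace returns a modified copy.
new_parts = parts._replace(query='q=urljoin', fragment='results')
new_url = parse.urlunparse(new_parts)

print(round_trip == url)
print(new_url)
```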
3. urllib.parse.urljoin
urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.
- base is the base address of the URL.
- base is combined with the relative address in the second argument to form an absolute URL.
eg:
>>> scheme = 'http'
>>> netloc = 'www.python.org'
>>> path = 'lib/module-urlparse.html'
>>> modlist = ('urllib', 'urllib2', 'httplib')
>>> unparsed_url = parse.urlunparse((scheme, netloc, path, '', '', ''))
>>> unparsed_url
'http://www.python.org/lib/module-urlparse.html'
>>> # the join replaces everything after the last "/" of the base
>>> for mod in modlist:
...     url = parse.urljoin(unparsed_url, 'module-%s.html' % mod)
...     print(url)
...
http://www.python.org/lib/module-urllib.html
http://www.python.org/lib/module-urllib2.html
http://www.python.org/lib/module-httplib.html
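Replacing the last path segment is only one of the cases urljoin handles. The following sketch, using the same base address as above with a few made-up relative references, shows how the form of the second argument changes the result.

```python
from urllib.parse import urljoin

base = 'http://www.python.org/lib/module-urlparse.html'

# A plain file name replaces the last path segment of the base.
sibling = urljoin(base, 'module-urllib.html')
# '..' climbs one directory before appending the rest.
parent = urljoin(base, '../doc/index.html')
# A leading '/' replaces the whole path but keeps scheme and host.
rooted = urljoin(base, '/about/')
# An absolute URL ignores the base entirely.
absolute = urljoin(base, 'https://docs.python.org/3/')

for result in (sibling, parent, rooted, absolute):
    print(result)
```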
4. urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):
- Parses a query given as a string argument.
qs parameter: the percent-encoded query string (as sent in a GET request).
- Returns a dictionary of the query parameters, mapping each name to a list of values.
eg:
Continuing from above:
>>> param_dict = parse.parse_qs(parseResult.query)
>>> param_dict
{'area': ['default'], 'check_keywords': ['yes'], 'q': ['parse']}
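The values come back as lists because a name may appear more than once in a query string. A short sketch with a made-up query string shows this, together with the effect of keep_blank_values and the list-returning variant parse_qsl.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string with a repeated name and an empty value.
qs = 'q=parse&tag=a&tag=b&empty='

as_dict = parse_qs(qs)                               # blank values dropped
with_blanks = parse_qs(qs, keep_blank_values=True)   # blank values kept as ''
as_pairs = parse_qsl(qs)                             # list of (name, value) pairs

print(as_dict)
print(with_blanks)
print(as_pairs)
```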
5. urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=)
# Joins the items of query and URL-encodes them
>>> from urllib import parse
>>> query = {'name': 'walker', 'age': 99}
>>> parse.urlencode(query)
'name=walker&age=99'
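urlencode also accepts a sequence of two-element tuples, and with doseq=True a list value becomes one parameter per element. The sketch below uses made-up parameter names and checks that parse_qs recovers the data.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical parameters: one value with a space, one multi-valued name.
pairs = [('q', 'hello world'), ('tag', ['a', 'b'])]

encoded = urlencode(pairs, doseq=True)   # spaces become '+', the list is expanded
decoded = parse_qs(encoded)              # back to a dict of lists

print(encoded)
print(decoded)
```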
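Summary:
Items 1 and 2 process the URL as a whole, splitting it apart and putting it back together.
Items 4 and 5 process the query component of the URL.
6. urllib.parse.quote(string, safe='/', encoding=None, errors=None)
# URL-encodes a string
1. If a URL contains Chinese characters, they must be percent-encoded before the URL is used. In Python 3, quote accepts a str and encodes it as UTF-8 by default, so no manual conversion from GBK to UTF-8 is needed; pass encoding='gbk' only if the server expects GBK.
Only after urllib.parse.quote(str) can the URL be opened normally; otherwise the encoding causes problems.
2. Likewise, when a Chinese field is extracted from a URL, it has to be unquoted and then decoded as GBK or UTF-8, depending on the situation; in Python 3, unquote does both through its encoding parameter.
eg:
>>> from urllib import parse
>>> parse.quote('a&b/c')  # the slash is left unencoded
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # the slash is encoded
'a%26b%2Fc'
7. unquote(string, encoding='utf-8', errors='replace')
>>> parse.unquote('1+2')
'1+2'
>>> parse.unquote_plus('1+2')
'1 2'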
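The notes on Chinese text can be checked with a short round trip. This is a sketch assuming Python 3, where the encoding is passed to quote and unquote directly; the sample string is arbitrary.

```python
from urllib.parse import quote, unquote

text = '中文'

utf8_quoted = quote(text)                  # UTF-8 is the default encoding
gbk_quoted = quote(text, encoding='gbk')   # same text, GBK bytes

# Decoding has to use the encoding that was used for quoting.
from_utf8 = unquote(utf8_quoted)
from_gbk = unquote(gbk_quoted, encoding='gbk')
# With the wrong encoding the bytes are invalid and get replaced.
wrong = unquote(gbk_quoted)

print(utf8_quoted, gbk_quoted)
print(from_utf8, from_gbk, wrong == text)
```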
III. urllib.robotparser
Parses a robots.txt file to check whether a given crawler is allowed.
eg:
>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')  # the robots.txt file to read
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
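The same check can be tried offline by handing the rules to RobotFileParser.parse instead of downloading them. The robots.txt content below is a made-up example, not the actual file served by that site.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, given as a list of lines.
rules = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

site = 'http://example.webscraping.com'
bad_allowed = rp.can_fetch('BadCrawler', site + '/')       # blocked everywhere
good_allowed = rp.can_fetch('GoodCrawler', site + '/')     # falls under '*'
trap_allowed = rp.can_fetch('GoodCrawler', site + '/trap') # disallowed for everyone

print(bad_allowed, good_allowed, trap_allowed)
```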
For details, see the function documentation below:
FUNCTIONS
    parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.
        Arguments:
        qs: percent-encoded query string to be parsed
        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.
            A true value indicates that blanks should be retained as
            blank strings.  The default false value indicates that
            blank values are to be ignored and treated as if they were
            not included.
        strict_parsing: flag indicating what to do with parsing errors.
            If false (the default), errors are silently ignored.
            If true, errors raise a ValueError exception.
        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.
    parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.
        Arguments:
        qs: percent-encoded query string to be parsed
        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.  A
            true value indicates that blanks should be retained as blank
            strings.  The default false value indicates that blank values
            are to be ignored and treated as if they were not included.
        strict_parsing: flag indicating what to do with parsing errors. If
            false (the default), errors are silently ignored. If true,
            errors raise a ValueError exception.
        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.
        Returns a list, as G-d intended.
    quote(string, safe='/', encoding=None, errors=None)
        quote('abc def') -> 'abc%20def'
        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted.
        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
        the following reserved characters.
        reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","
        Each of these characters is reserved in some component of a URL,
        but not necessarily in all of them.
        By default, the quote function is intended for quoting the path
        section of a URL.  Thus, it will not encode '/'.  This character
        is reserved, but in typical usage the quote function is being
        called on a path where the existing slash characters are used as
        reserved characters.
        string and safe may be either str or bytes objects. encoding and errors
        must not be specified if string is a bytes object.
        The optional encoding and errors parameters specify how to deal with
        non-ASCII characters, as accepted by the str.encode method.
        By default, encoding='utf-8' (characters are encoded with UTF-8), and
        errors='strict' (unsupported characters raise a UnicodeEncodeError).
    quote_from_bytes(bs, safe='/')
        Like quote(), but accepts a bytes object rather than a str, and does
        not perform string-to-bytes encoding.  It always returns an ASCII string.
        quote_from_bytes(b'abc def?') -> 'abc%20def%3f'
    quote_plus(string, safe='', encoding=None, errors=None)
        Like quote(), but also replace ' ' with '+', as required for quoting
        HTML form values. Plus signs in the original string are escaped unless
        they are included in safe. It also does not have safe default to '/'.
    unquote(string, encoding='utf-8', errors='replace')
        Replace %xx escapes by their single-character equivalent. The optional
        encoding and errors parameters specify how to decode percent-encoded
        sequences into Unicode characters, as accepted by the bytes.decode()
        method.
        By default, percent-encoded sequences are decoded with UTF-8, and invalid
        sequences are replaced by a placeholder character.
        unquote('abc%20def') -> 'abc def'.
    unquote_plus(string, encoding='utf-8', errors='replace')
        Like unquote(), but also replace plus signs by spaces, as required for
        unquoting HTML form values.
        unquote_plus('%7e/abc+def') -> '~/abc def'
    unquote_to_bytes(string)
        unquote_to_bytes('abc%20def') -> b'abc def'.
    urldefrag(url)
        Removes any existing fragment from URL.
        Returns a tuple of the defragmented URL and the fragment.  If
        the URL contained no fragments, the second element is the
        empty string.
    urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=
        Encode a dict or sequence of two-element tuples into a URL query string.
        If any values in the query arg are sequences and doseq is true, each
        sequence element is converted to a separate parameter.
        If the query arg is a sequence of two-element tuples, the order of the
        parameters in the output will match the order of parameters in the
        input.
        The components of a query arg may each be either a string or a bytes type.
        The safe, encoding, and errors parameters are passed down to the function
        specified by quote_via (encoding and errors only if a component is a str).
    urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.
    urlparse(url, scheme='', allow_fragments=True)
        Parse a URL into 6 components:

        Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.
    urlsplit(url, scheme='', allow_fragments=True)
        Parse a URL into 5 components:

        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.
    urlunparse(components)
        Put a parsed URL back together again.  This may result in a
        slightly different, but equivalent URL, if the URL that was parsed
        originally had redundant delimiters, e.g. a ? with an empty query
        (the draft states that these are equivalent).
    urlunsplit(components)
        Combine the elements of a tuple as returned by urlsplit() into a
        complete URL as a string. The data argument can be any five-item iterable.
        This may result in a slightly different, but equivalent URL, if the URL that
        was parsed originally had unnecessary delimiters (for example, a ? with an
        empty query; the RFC states that these are equivalent).
DATA
    __all__ = ['urlparse', 'urlunparse', 'urljoin', 'urldefrag', 'urlsplit...
FILE
    d:\python3\lib\urllib\parse.py
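The documentation above lists urlsplit and urldefrag, which the notes did not demonstrate. As a rough sketch with a made-up URL that carries both path parameters and a fragment, the difference from urlparse looks like this:

```python
from urllib.parse import urldefrag, urlparse, urlsplit

# Hypothetical URL with ';type=a' path parameters and a '#section' fragment.
url = 'http://www.python.org/lib/module.html;type=a?q=1#section'

parsed = urlparse(url)      # 6 components, params split off from the path
split = urlsplit(url)       # 5 components, params stay inside the path
defragged = urldefrag(url)  # (URL without fragment, fragment)

print(parsed.path, parsed.params)
print(split.path)
print(defragged.url, defragged.fragment)
```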