鐖櫕瀛︿範绗旇锛堜竴锛�--urllib鎬荤粨

鍩虹鐭ヨ瘑锛�

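1. URL (Uniform Resource Locator): the address of a standard resource on the Internet, commonly called a "web address".

2. Python 3.x no longer has a urllib2 library; only the urllib package remains.

3. URL encoding is also called percent-encoding.

4. urllib2 from Python 2.7 became urllib.request (together with urllib.error) in Python 3.

robotparser became a module of the urllib package (urllib.robotparser).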


According to the official documentation, urllib is a package for working with URLs.

It contains four modules:

1. urllib.request opens and reads URLs.

     1.1. The urlopen function is the usual way to open a URL.

     1.2. Building an opener with the build_opener function is the more advanced way to open a page.

2. urllib.error contains the exceptions raised while urllib.request is running.

3. urllib.parse parses URLs.

4. urllib.robotparser parses robots.txt files.
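
To make the division of labour between urllib.request and urllib.error a little more concrete, here is a minimal sketch. The crawler name `MyCrawler/0.1`, the helper names `make_opener` and `try_open`, and the deliberately invalid URL are assumptions made up for this illustration. The unknown scheme is used only so that the error path can be exercised without network access.

```python
import urllib.error
import urllib.request

# Hypothetical crawler name, used only for this illustration.
USER_AGENT = "MyCrawler/0.1"


def make_opener(user_agent=USER_AGENT):
    """Build an opener that sends a custom User-Agent header."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def try_open(opener, url):
    """Open url and return (ok, detail) instead of raising."""
    try:
        with opener.open(url, timeout=5) as response:
            return True, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return False, "HTTP error %d" % err.code
    except urllib.error.URLError as err:
        # No usable answer: unknown scheme, DNS failure, refused connection...
        return False, "URL error: %s" % err.reason


opener = make_opener()
# An unknown scheme fails locally, so this line needs no network access.
ok, detail = try_open(opener, "nosuchscheme://example.invalid/")
print(ok, detail)
```

HTTPError is a subclass of URLError, which is why it has to be caught first.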



I. Commonly used functions in urllib.request

urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, cadefault=False, context=None)

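1. The urllib.request module uses HTTP/1.1 and includes a Connection: close header in its HTTP requests.

2. The optional timeout parameter specifies, in seconds, how long blocking operations such as the connection attempt may take. If the connection has not been established after timeout seconds, a timeout exception is raised. If timeout is not given, the global default timeout is used.

3. For HTTP and HTTPS URLs, this function returns an http.client.HTTPResponse object (slightly modified), which offers the following methods:

-   The object is file-like, so the usual file methods are available (read, readline, fileno, close).

-   geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect

-   getcode(): returns the HTTP status code of the response; 200 means the request succeeded, 404 means the resource was not found

-   info(): returns an http.client.HTTPMessage object holding the header information sent by the remote server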
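
The methods listed above can be tried without touching a real website. The sketch below starts a throwaway HTTP server on 127.0.0.1 and fetches one page from it. The handler class `HelloHandler`, the function `fetch_from_local_server`, the response body and the path `/index.html` are all assumptions for this illustration. An opener with an empty ProxyHandler is used instead of plain urlopen so that proxy settings in the environment cannot redirect the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello urllib"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/index.html" % server.server_address[1]
        # An opener without proxies, so the request stays on this machine.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return {
                "url": response.geturl(),
                "code": response.getcode(),
                "content_type": response.info().get_content_type(),
                "body": response.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()


result = fetch_from_local_server()
print(result)
```

Using the response as a context manager closes it automatically, which replaces an explicit close() call.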


II. Commonly used functions in urllib.parse:

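1. urllib.parse.urlparse(url, scheme='', allow_fragments=True):

- Parses a URL and splits it into 6 components.

- Returns a 6-element tuple (scheme, netloc, path, params, query, fragment) in the form of a urllib.parse.ParseResult object.

The object also exposes each of these 6 elements as an attribute of the same name.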

e.g.:

>>> from urllib import parse
>>> url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
>>> parseResult = parse.urlparse(url)
>>> parseResult  # the URL split into its components
ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')
>>> parseResult.query
'q=parse&check_keywords=yes&area=default'

The output makes the meaning of each component clear.
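
Besides the six tuple fields, the result of urlparse offers a few derived attributes such as hostname and port. The following sketch shows both styles of access. The URL, with its explicit port and fragment, is an assumption chosen only so that every field has something to show.

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port and a fragment.
url = "https://docs.python.org:443/3/library/urllib.parse.html?highlight=urlparse#url-parsing"
parts = urlparse(url)

# Access by attribute name; hostname and port are derived from netloc.
scheme, host, port = parts.scheme, parts.hostname, parts.port

# Or unpack the result like an ordinary 6-tuple.
scheme2, netloc, path, params, query, fragment = parts

print(host, port, path, query, fragment)
```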


2. urllib.parse.urlunparse(components)

- The inverse of urlparse.

- Takes a 6-element tuple (any 6-item iterable works) and returns the complete URL.
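
A common use of this inverse relationship is to parse a URL, change one component, and put it back together. The sketch below does that with the search URL used earlier in these notes. The replacement query `q=urllib` is an assumption for illustration, and `_replace` is the standard namedtuple method that ParseResult inherits.

```python
from urllib.parse import urlparse, urlunparse

url = "https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default"
parts = urlparse(url)

# Round trip: parsing and unparsing gives back an equivalent URL.
round_trip = urlunparse(parts)

# ParseResult is a namedtuple, so _replace returns a modified copy.
new_parts = parts._replace(query="q=urllib")
new_url = urlunparse(new_parts)

print(round_trip)
print(new_url)
```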


3. urllib.parse.urljoin

urljoin(base, url, allow_fragments=True)

        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.

- base is the base URL.

- base is combined with the relative URL given as the second argument to form an absolute URL.


e.g.:

>>> scheme = 'http'
>>> netloc = 'www.python.org'
>>> path = 'lib/module-urlparse.html'
>>> modlist = ('urllib', 'urllib2', 'httplib')
>>> unparsed_url = parse.urlunparse((scheme, netloc, path, '', '', ''))
>>> unparsed_url
'http://www.python.org/lib/module-urlparse.html'
>>> for mod in modlist:
...     # everything after the last "/" of the base URL is replaced
...     url = parse.urljoin(unparsed_url, 'module-%s.html' % mod)
...     print(url)
...
http://www.python.org/lib/module-urllib.html
http://www.python.org/lib/module-urllib2.html
http://www.python.org/lib/module-httplib.html
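
The rule that the replacement starts after the last "/" has a few consequences that are easy to get wrong when resolving links found on a page. The sketch below collects the typical cases. The relative paths and the example.com address are assumptions for illustration.

```python
from urllib.parse import urljoin

base = "http://www.python.org/lib/module-urlparse.html"

examples = {
    # A plain file name replaces the last path segment of the base.
    "sibling": urljoin(base, "module-urllib.html"),
    # ".." climbs one directory up from the directory of the base.
    "parent": urljoin(base, "../index.html"),
    # A leading "/" replaces the whole path of the base.
    "root_relative": urljoin(base, "/doc/index.html"),
    # An absolute URL ignores the base entirely.
    "absolute": urljoin(base, "https://example.com/a"),
    # A trailing slash on the base decides whether "lib" is kept.
    "dir_with_slash": urljoin("http://www.python.org/lib/", "x.html"),
    "dir_without_slash": urljoin("http://www.python.org/lib", "x.html"),
}

for name, joined in examples.items():
    print(name, joined)
```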


4. urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):

- Parses a query given as a string argument.

- qs: the percent-encoded query string to parse (for example the query of a GET request).

- Returns a dictionary of the query parameters; each value is a list, because a key may occur more than once.

e.g., continuing from the urlparse example above:

>>> param_dict = parse.parse_qs(parseResult.query)
>>> param_dict
{'area': ['default'], 'check_keywords': ['yes'], 'q': ['parse']}
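
The list values and the keep_blank_values flag are easier to see with a query that repeats a key and contains an empty value. The query string below is an assumption made up for this sketch. parse_qsl, documented further down in these notes, is shown alongside for comparison.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query: "q" occurs twice and "empty" has no value.
query = "q=parse&q=urllib&area=default&empty="

# Default: values are collected in lists, blank values are dropped.
as_dict = parse_qs(query)

# keep_blank_values=True keeps "empty" with an empty string as its value.
with_blank = parse_qs(query, keep_blank_values=True)

# parse_qsl returns (name, value) pairs in their original order.
as_pairs = parse_qsl(query)

# To get a single value, take the first item of the list.
first_q = as_dict["q"][0]

print(as_dict)
print(with_blank)
print(as_pairs)
```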


5. urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

- Joins the query parameters into a single string and URL-encodes them.

>>> from urllib import parse
>>> query = {'name': 'walker', 'age': 99}
>>> parse.urlencode(query)
'name=walker&age=99'
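
The example above contains only characters that need no escaping. The sketch below shows what urlencode does with spaces, with "&", and with list values. The base URL and the parameter names are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

# Hypothetical search page and parameters.
base = "https://docs.python.org/3/search.html"
params = {"q": "url parse & join", "area": "default"}

# Spaces become "+" and "&" is escaped, because quote_plus is the default.
query = urlencode(params)
full_url = base + "?" + query

# doseq=True turns each element of a sequence into its own parameter.
multi = urlencode({"tag": ["a", "b"]}, doseq=True)

# quote_via=quote encodes spaces as %20 instead of "+".
percent_spaces = urlencode({"q": "a b"}, quote_via=quote)

# For a POST request the encoded string must be converted to bytes.
post_data = query.encode("ascii")

print(full_url)
print(multi)
print(percent_spaces)
```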


Summary:

Items 1 and 2 handle the URL as a whole, splitting it apart and putting it back together.

Items 4 and 5 handle the query component of the URL.


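6. urllib.parse.quote(string, safe='/', encoding=None, errors=None)

- URL-encodes a string.

1. If a URL contains Chinese text, that text must be encoded before the URL can be used. In Python 3, parse.quote(string) encodes the text as UTF-8 by default and then percent-encodes it; pass encoding='gbk' when the server expects GBK. Without this step the URL cannot be opened correctly.

2. Likewise, to recover Chinese text from a URL, call parse.unquote. It undoes the percent-encoding and decodes the bytes with the given encoding, UTF-8 by default or 'gbk', depending on the site.

e.g.:

>>> from urllib import parse
>>> parse.quote('a&b/c')  # the slash is not encoded
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # the slash is encoded as well
'a%26b%2Fc'

7. urllib.parse.unquote(string, encoding='utf-8', errors='replace')

>>> parse.unquote('1+2')
'1+2'
>>> parse.unquote_plus('1+2')
'1 2'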
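
The notes on Chinese text in URLs can be checked with a short round trip. This is a minimal sketch for Python 3; the sample string and the variable names are assumptions for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "中文"  # sample non-ASCII text

# By default quote() encodes the text as UTF-8 before percent-encoding.
utf8_quoted = quote(text)

# Some sites expect GBK; the encoding argument selects it.
gbk_quoted = quote(text, encoding="gbk")

# Decoding must use the same encoding that was used for quoting.
utf8_back = unquote(utf8_quoted)
gbk_back = unquote(gbk_quoted, encoding="gbk")

# Spaces: quote() gives %20, quote_plus() gives "+".
space_quote = quote("a b")
space_plus = quote_plus("a b")
space_back = unquote_plus(space_plus)

# safe='' makes quote() encode the slash as well.
no_safe = quote("/a b/", safe="")

print(utf8_quoted, gbk_quoted, space_quote, space_plus, no_safe)
```

Decoding a GBK-quoted string with the default UTF-8 does not raise an error, because errors='replace' is the default; it quietly produces replacement characters, so a wrong encoding shows up as garbled text rather than as an exception.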


III. urllib.robotparser

Parses a site's robots.txt file to check whether the site allows a given crawler to fetch a URL.

e.g.:

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')  # point the parser at the robots.txt file
>>> rp.read()  # download and parse the file
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
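
The example above needs network access. To experiment offline, the rules can be fed to the parser directly with parse(). The robots.txt content below is an assumption written for this sketch, not the real file of that site, and crawl_delay() requires Python 3.6 or later.

```python
from urllib import robotparser

# Hypothetical robots.txt content, given as a list of lines.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT)  # parse in-memory lines instead of downloading

site = "http://example.webscraping.com"

# BadCrawler is banned from the whole site.
bad_allowed = rp.can_fetch("BadCrawler", site + "/")

# Any other crawler falls under the "*" rules.
good_index = rp.can_fetch("GoodCrawler", site + "/index")
good_trap = rp.can_fetch("GoodCrawler", site + "/trap")

# Delay in seconds requested for this crawler, or None if not specified.
delay = rp.crawl_delay("GoodCrawler")

print(bad_allowed, good_index, good_trap, delay)
```

A polite crawler would call can_fetch before every request and sleep for the crawl delay between requests to the same site.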


For details, see the function documentation below:

FUNCTIONS
    parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.

        Arguments:

        qs: percent-encoded query string to be parsed

        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.
            A true value indicates that blanks should be retained as
            blank strings.  The default false value indicates that
            blank values are to be ignored and treated as if they were
            not included.

        strict_parsing: flag indicating what to do with parsing errors.
            If false (the default), errors are silently ignored.
            If true, errors raise a ValueError exception.

        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.

    parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.

        Arguments:

        qs: percent-encoded query string to be parsed

        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.  A
            true value indicates that blanks should be retained as blank
            strings.  The default false value indicates that blank values
            are to be ignored and treated as if they were not included.

        strict_parsing: flag indicating what to do with parsing errors. If
            false (the default), errors are silently ignored. If true,
            errors raise a ValueError exception.

        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.

        Returns a list, as G-d intended.


    quote(string, safe='/', encoding=None, errors=None)
        quote('abc def') -> 'abc%20def'

        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted.

        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
        the following reserved characters.

        reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","

        Each of these characters is reserved in some component of a URL,
        but not necessarily in all of them.

        By default, the quote function is intended for quoting the path
        section of a URL.  Thus, it will not encode '/'.  This character
        is reserved, but in typical usage the quote function is being
        called on a path where the existing slash characters are used as
        reserved characters.

        string and safe may be either str or bytes objects. encoding and errors
        must not be specified if string is a bytes object.

        The optional encoding and errors parameters specify how to deal with
        non-ASCII characters, as accepted by the str.encode method.
        By default, encoding='utf-8' (characters are encoded with UTF-8), and
        errors='strict' (unsupported characters raise a UnicodeEncodeError).

    quote_from_bytes(bs, safe='/')
        Like quote(), but accepts a bytes object rather than a str, and does
        not perform string-to-bytes encoding.  It always returns an ASCII string.
        quote_from_bytes(b'abc def?') -> 'abc%20def%3f'

    quote_plus(string, safe='', encoding=None, errors=None)
        Like quote(), but also replace ' ' with '+', as required for quoting
        HTML form values. Plus signs in the original string are escaped unless
        they are included in safe. It also does not have safe default to '/'.


    unquote(string, encoding='utf-8', errors='replace')
        Replace %xx escapes by their single-character equivalent. The optional
        encoding and errors parameters specify how to decode percent-encoded
        sequences into Unicode characters, as accepted by the bytes.decode()
        method.
        By default, percent-encoded sequences are decoded with UTF-8, and invalid
        sequences are replaced by a placeholder character.

        unquote('abc%20def') -> 'abc def'.

    unquote_plus(string, encoding='utf-8', errors='replace')
        Like unquote(), but also replace plus signs by spaces, as required for
        unquoting HTML form values.

        unquote_plus('%7e/abc+def') -> '~/abc def'

    unquote_to_bytes(string)
        unquote_to_bytes('abc%20def') -> b'abc def'.

    urldefrag(url)
        Removes any existing fragment from URL.

        Returns a tuple of the defragmented URL and the fragment.  If
        the URL contained no fragments, the second element is the
        empty string.


    urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
        Encode a dict or sequence of two-element tuples into a URL query string.

        If any values in the query arg are sequences and doseq is true, each
        sequence element is converted to a separate parameter.

        If the query arg is a sequence of two-element tuples, the order of the
        parameters in the output will match the order of parameters in the
        input.

        The components of a query arg may each be either a string or a bytes type.

        The safe, encoding, and errors parameters are passed down to the function
        specified by quote_via (encoding and errors only if a component is a str).

    urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.


    urlparse(url, scheme='', allow_fragments=True)
        Parse a URL into 6 components:
        <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.

    urlsplit(url, scheme='', allow_fragments=True)
        Parse a URL into 5 components:
        <scheme>://<netloc>/<path>?<query>#<fragment>
        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.

    urlunparse(components)
        Put a parsed URL back together again.  This may result in a
        slightly different, but equivalent URL, if the URL that was parsed
        originally had redundant delimiters, e.g. a ? with an empty query
        (the draft states that these are equivalent).

    urlunsplit(components)
        Combine the elements of a tuple as returned by urlsplit() into a
        complete URL as a string. The data argument can be any five-item iterable.
        This may result in a slightly different, but equivalent URL, if the URL that
        was parsed originally had unnecessary delimiters (for example, a ? with an
        empty query; the RFC states that these are equivalent).


DATA
    __all__ = ['urlparse', 'urlunparse', 'urljoin', 'urldefrag', 'urlsplit...

FILE
    d:\python3\lib\urllib\parse.py
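
The function documentation above lists urlsplit, urlunsplit and urldefrag, which the notes do not demonstrate. The sketch below compares them with urlparse on a URL that carries path parameters and a fragment. The URL itself is an assumption made up for this illustration.

```python
from urllib.parse import urldefrag, urlparse, urlsplit, urlunsplit

# Hypothetical URL with ";type=a" path parameters and a fragment.
url = "http://www.python.org/lib/page.html;type=a?q=1#section"

six = urlparse(url)   # 6 components: params is split off from the path
five = urlsplit(url)  # 5 components: params stay inside the path

# urlunsplit is the inverse of urlsplit.
rebuilt = urlunsplit(five)

# urldefrag separates the fragment from the rest of the URL.
without_fragment, fragment = urldefrag(url)

print(six.path, six.params)
print(five.path)
print(without_fragment, fragment)
```

For a crawler, urldefrag is handy for deduplicating links, since URLs that differ only in their fragment point to the same page.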
