BeautifulSoup Study Notes

1. Finding a tag: dot (.) plus the tag name. This returns only the first matching tag, and the access can be chained.

soup.p
# <p><b>The Dormouse's story</b></p>

soup.p.b  # dot access can be chained on the result; each step returns only the first match
# <b>The Dormouse's story</b>
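A minimal sketch of dot access, assuming a small sample document (the notes never show the markup being parsed, so `html_doc` here is illustrative only). It also shows what happens when no tag matches, and how `find_all` differs when every match is wanted.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, not the document used in the notes.
html_doc = (
    "<html><body>"
    "<p><b>The Dormouse's story</b></p>"
    "<p>Once upon a time</p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p      # first <p> in document order
first_b = soup.p.b    # chained access: first <b> inside that <p>
missing = soup.table  # no <table> in the document, so this is None
all_p = soup.find_all("p")  # every <p>, not just the first

print(first_p)     # <p><b>The Dormouse's story</b></p>
print(first_b)     # <b>The Dormouse's story</b>
print(missing)     # None
print(len(all_p))  # 2
```

Because a missing tag yields None, a chained access such as soup.table.tr would raise AttributeError, so it may be worth checking intermediate results when the markup is not guaranteed.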

2. Comparing .contents, .children, and .descendants

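  • .contents returns a list of the tag's direct children
  • .children returns an iterator over the tag's direct children
  • .descendants returns a generator over all of the tag's descendants (children, grandchildren, and so on); its first item is the first child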
soup.body
"""
<body>
<p><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>
<p>...</p>
</body>
"""
soup.body.contents
"""
['\n',
 <p><b>The Dormouse's story</b></p>,
 '\n',
 <p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>,
 '\n',
 <p>...</p>,
 '\n']
"""
soup.body.children  # same direct children as .contents, but returned as an iterator

list(soup.body.children)  # convert to a list
"""
['\n',
 <p><b>The Dormouse's story</b></p>,
 '\n',
 <p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>,
 '\n',
 <p>...</p>,
 '\n']
"""
for i in soup.body.children:
    print(i)
"""

<p><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>

"""
# .descendants walks all of a tag's descendants recursively and returns a generator
soup.p.descendants

list(soup.p.descendants)
# [<b>The Dormouse's story</b>, "The Dormouse's story"]
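The difference between the three attributes can be sketched on a tiny document. The markup below is an assumption made for illustration; the point is only that .contents and .children cover the same direct children, while .descendants also reaches nested tags and their strings.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, kept small so the counts are easy to check by hand.
html_doc = "<body>\n<p><b>Title</b></p>\n<p>Text</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

direct_children = body.contents           # a list, so it supports len() and indexing
same_children = list(body.children)       # an iterator over the same nodes
all_descendants = list(body.descendants)  # depth-first walk over every nested node

print(len(direct_children))  # 5: '\n', <p>, '\n', <p>, '\n'
print(len(all_descendants))  # 8: the <b> tag and the two text strings are included too
print([str(node) for node in all_descendants[:4]])
# ['\n', '<p><b>Title</b></p>', '<b>Title</b>', 'Title']
```

Note that the whitespace between tags shows up as '\n' string nodes, which is why the direct-child count is 5 rather than 2.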

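3. Notes on .string

If a tag has exactly one child, and that child is either a string or a tag that itself has only one child, .string returns that single string node.
If a tag has more than one child, .string cannot tell which string is meant and returns None.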

soup.body
"""
<body>
<p><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>
<p>...</p>
</body>
"""
soup.body.string  # None, because <body> has more than one child

soup.p
# <p><b>The Dormouse's story</b></p>

soup.p.string    # "The Dormouse's story"
soup.p.b.string  # "The Dormouse's story"
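The two rules for .string can be checked with a short sketch. The markup is assumed for illustration: the first paragraph has a single nested child, the second mixes text and a tag.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, not the document used in the notes.
html_doc = "<body><p><b>Title</b></p><p>One <i>two</i> three</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")
first_p, second_p = soup.find_all("p")

b_string = soup.b.string          # <b> holds one string
first_p_string = first_p.string   # <p> holds one <b>, which holds one string
second_p_string = second_p.string # three children: 'One ', <i>, ' three'
body_string = soup.body.string    # two <p> children

print(b_string)         # Title
print(first_p_string)   # Title
print(second_p_string)  # None
print(body_string)      # None
```

When .string returns None, .strings or get_text() can be used instead to collect all the text under the tag.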

4. .strings: returns a generator over every string in the document

type(soup.strings)
# generator
for string in soup.strings:  # yields each string in the document, one at a time
    print(string)
"""
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.
"""
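To see exactly where .strings splits the text, it can help to collect the pieces into a list and print their repr. This is a sketch on assumed markup; the `<a>` tags are hypothetical stand-ins for whatever inline elements wrap the names in the real document.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: the <a> tags are illustrative, not taken from the notes.
html_doc = "<p>Names: <a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>;\nthe end.</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Each run of text between tags becomes its own string, whitespace included.
pieces = list(soup.strings)
print(pieces)
# ['Names: ', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nthe end.']
```

The newlines and spaces are preserved as they appear in the source, which explains the blank lines in the printed output above.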

5. .stripped_strings: yields each string with leading and trailing whitespace removed, and skips strings that are whitespace only

lt=""
for string in soup.stripped_strings:  # whitespace-only strings are skipped; each remaining string is stripped at both ends
    lt+=string
print(lt)
"""
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,LacieandTillie;
and they lived at the bottom of a well....
"""
soup.get_text()  # getText() is an older alias for the same method
"""
"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
"""
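Concatenating stripped strings with no separator runs words together, as the output above shows. A possible alternative, sketched here on assumed markup, is to join with a space or to pass a separator and strip=True to get_text().

```python
from bs4 import BeautifulSoup

# Assumed sample markup with deliberately messy whitespace.
html_doc = "<body>\n<p>  Once upon a time  </p>\n<p>\n  the end\n</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")

stripped = list(soup.stripped_strings)       # whitespace-only strings are dropped
joined = " ".join(soup.stripped_strings)     # explicit separator between pieces
via_get_text = soup.get_text(" ", strip=True)  # same result in one call
raw_text = soup.get_text()                   # no stripping: whitespace kept as-is

print(stripped)      # ['Once upon a time', 'the end']
print(joined)        # Once upon a time the end
print(via_get_text)  # Once upon a time the end
print(repr(raw_text))
```

Stripping applies only to the two ends of each string, so a newline in the middle of a string survives, which matches the line break in the .stripped_strings output above.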
