
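## bs4

`from bs4 import BeautifulSoup`

Read a local file:

`soup = BeautifulSoup(open('baike.html', encoding='utf8'), "html.parser")`

### get_text()

If you only need the text inside a tag, call `get_text()`. It collects all the text in the tag, including text inside descendant tags, and returns the result as a single Unicode string.

## tag

Accessing a tag by name with dot notation returns only the first tag with that name:

```python
tag_a = soup.a
```

To get all the `<a>` tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as `find_all()`:

```python
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

The most important properties of a tag are its name and its attributes. `tag.name` is the tag's name.

```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```

```python
tag['class']
# ['boldest']  (class is a multi-valued attribute, so a list is returned)
tag.get('class')
# ['boldest']
```

### Multi-valued attributes

A tag can have more than one CSS class. In Beautiful Soup, a multi-valued attribute is returned as a list:

```python
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
```

Note: if an attribute looks like it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a plain string:

```python
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
```

If the document is parsed as **XML, tags have no multi-valued attributes**:

```python
# The 'xml' parser requires lxml to be installed.
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'
```

## string

Strings usually sit inside tags. Beautiful Soup wraps the string inside a tag in the `NavigableString` class:

```python
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
```

### Converting to a plain string

A `NavigableString` can be converted directly to a Unicode string. In Python 2 this was done with `unicode()`; in Python 3 use `str()`:

```python
unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
type(unicode_string)
# <class 'str'>
```

## Child nodes

A Tag may contain several strings or other Tags. These are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over children.

### .contents and .children

A tag's `.contents` attribute returns its children as a list:

```python
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
```

The `.children` generator lets you loop over a tag's children:

```python
for child in title_tag.children:
    print(child)
# The Dormouse's story
```

### .string

If a tag has exactly one child and that child is a string, the tag's `.string` attribute returns it. If a tag has several children, it cannot tell which child `.string` should refer to, so `.string` is `None`. In other words, `None` here means "ambiguous": there is no single string to return.

If a tag contains more than one string [[2]](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id90), use `.strings` to loop over them:

```python
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
```

The strings may include a lot of extra spaces and blank lines. `.stripped_strings` skips whitespace-only strings and strips leading and trailing whitespace from the rest:

```python
for string in soup.stripped_strings:
    print(repr(string))
```

`repr()` returns the printable representation of an object. For a string it shows the quotes and escape sequences such as `\n`, which makes whitespace visible in the output.

## Sibling nodes

### .next_sibling and .previous_sibling

Be careful: the comma and newline between one `<a>` tag and the next are themselves a sibling node.

```html
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
```

```python
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```

Iterating over siblings:

```python
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
```

### .next_elements and .previous_elements

The strings between two tags also count as elements. The parser first enters the `<a>` tag, then reaches the string inside it, then closes the `<a>` tag.

The code below helps show the difference between `next_siblings` and `next_elements`:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
tag_a = soup.a

# Siblings: only nodes that share tag_a's parent (the rest of the same <p>).
# Calling get_text() on a string node relies on a recent Beautiful Soup 4 release.
ans = []
for b in tag_a.next_siblings:
    print(type(b))
    ans.append(b.get_text())
print(ans)

# Elements: everything parsed after tag_a's opening tag, in document order,
# starting with the string inside tag_a and continuing to the end of the document.
ans = []
for n_e in tag_a.next_elements:
    # print(type(n_e))
    print(n_e)
    ans.append(n_e.get_text())
print(ans)
```

## find_all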