Using BeautifulSoup in Python - 博客园 (cnblogs)


Published 2019-01-20 17:01:51, 119 views


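Basic usage of the bs4 library:

1. It works best with the lxml library as the parser. Install it with: pip install lxml

2. It works best with the Requests library for fetching pages. Install it with: pip install requests

3. Install bs4 itself: pip install bs4 (the official package name is beautifulsoup4; bs4 on PyPI is a placeholder that installs it)

4. If typing pip on the command line does nothing (Windows), add the Python Scripts directory to Path: Environment Variables -> System variables -> Path -> New: C:\Python27\Scripts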
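Before going further it can help to confirm that the three libraries are actually importable. The following is a minimal sketch for Python 3; check_installed is a hypothetical helper name, not part of any of these libraries.

```python
import importlib.util


def check_installed(modules=("bs4", "lxml", "requests")):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(name, "OK" if ok else "missing")
```

If pip itself is not on Path, running it through the interpreter (python -m pip install lxml) is a common alternative, provided python is on Path.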

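Example: fetching a site's title

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "https://www.baidu.com"
response = requests.get(url)

# Pass the raw bytes so BeautifulSoup can work out the encoding itself
soup = BeautifulSoup(response.content, 'lxml')

# print() with a single argument behaves the same on Python 2 and Python 3
print(soup.title.text)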
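The title example assumes that the request succeeds and that the page has a title tag. A slightly more defensive sketch is shown below. The helper names title_from_html and fetch_title are hypothetical, and the built-in html.parser is used here only so the snippet runs without lxml; passing "lxml" as the parser should work the same way for this purpose.

```python
from bs4 import BeautifulSoup


def title_from_html(markup, parser="html.parser"):
    """Return the stripped <title> text, or None if the page has no usable title."""
    soup = BeautifulSoup(markup, parser)
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url, timeout=10):
    """Download url and return its title (hypothetical helper)."""
    import requests  # imported lazily so title_from_html works without requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return title_from_html(response.content)
```

Usage would look like print(fetch_title("https://www.baidu.com")). Separating parsing from downloading also makes the parsing step easy to test offline.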

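Finding tags

Example 1:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup from the Beautiful Soup documentation
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')

# BeautifulSoup has a built-in method for pretty-printed output
print(soup.prettify())

# Text of the title tag
print(soup.title.string)

# Name of the title tag's parent node
print(soup.title.parent.name)

# The first p tag
print(soup.p)

# The class attribute of the first p tag (returned as a list)
print(soup.p["class"])

# The first a tag
print(soup.a)

# All a tags
print(soup.find_all('a'))

# The tag whose id is link3
print(soup.find(id='link3'))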
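A common next step after locating tags is to pull out their text and attributes. The sketch below assumes a shortened, hypothetical version of the sample markup and uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup in the style of the sample above
html = """
<p class="story">Three sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None instead of raising KeyError when an attribute is missing
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)

# Multi-valued attributes such as class come back as a list
print(soup.p["class"])

# find() returns the first match only, or None when nothing matches
print(soup.find(id="link3").get_text())
```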

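Example 2:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Sample markup from the Beautiful Soup documentation, without the title paragraph
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')

# .contents returns the direct children of the first p tag (tags and text nodes) as a list
print(soup.p.contents)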
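Note that .contents mixes tags with the text nodes between them. The following sketch, on a hypothetical one-line paragraph, shows one way to keep only the child tags and how .children and .descendants relate to .contents.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: text nodes interleaved with three child tags
html = '<p class="story">Once upon a time: <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <b>more</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children: both Tag objects and NavigableString text nodes
children = soup.p.contents

# Keep only the child tags and record their names
tag_names = [child.name for child in children if isinstance(child, Tag)]
print(tag_names)

# .children yields the same nodes lazily; .descendants also walks into nested nodes
same_nodes = list(soup.p.children) == children
descendant_count = len(list(soup.p.descendants))
print(same_nodes, descendant_count)
```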

find_all example:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
Hello
'''

soup = BeautifulSoup(html, 'lxml')

# All ul tags
print(soup.find_all('ul'))

# Call find_all again on each result to get the li tags inside every ul
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# Tags whose id is list-1
print(soup.find_all(attrs={'id': 'list-1'}))

# Tags whose class is element
print(soup.find_all(attrs={'class': 'element'}))

# Text nodes that are exactly 'Foo' (newer bs4 releases name this argument string)
print(soup.find_all(text='Foo'))
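The find_all calls above expect markup containing ul and li tags, an id of list-1, a class of element and the text Foo. The sketch below supplies hypothetical markup of that shape (the list items are assumptions, not the original page) and uses the keyword forms id=, class_= and string=. The string argument assumes Beautiful Soup 4.4.0 or later; html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the ids and classes used in the calls above
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list" id="list-2">
      <li class="element">Foo</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One list of li texts per ul
items_per_list = [[li.get_text() for li in ul.find_all("li")]
                  for ul in soup.find_all("ul")]
print(items_per_list)

# id can be passed as a keyword; class needs a trailing underscore
# because class is a reserved word in Python
first_list = soup.find_all(id="list-1")
elements = soup.find_all(class_="element")

# string= matches text nodes whose whole content equals "Foo"
foo_nodes = soup.find_all(string="Foo")
print(len(first_list), len(elements), len(foo_nodes))
```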

CSS selector example:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
Hello
'''

soup = BeautifulSoup(html, 'lxml')

# Elements with class panel-heading inside an element with class panel
print(soup.select('.panel .panel-heading'))

# li tags that are descendants of a ul tag (tag names, not class names)
print(soup.select('ul li'))

# Elements with class element inside the element whose id is list-2
print(soup.select('#list-2 .element'))

# Use get_text() to get the text content
for li in soup.select('li'):
    print(li.get_text())

# Attributes can be read with tag[name] or tag.attrs[name]
for ul in soup.select('ul'):
    print(ul['id'])
    # print(ul.attrs['id'])
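The selectors above can be tried on markup like the following. This is a sketch under assumptions: the panel markup is hypothetical, and html.parser replaces lxml so the snippet has fewer dependencies. select_one is used where only the first match is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the selectors used above
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if nothing matches
heading = soup.select_one(".panel .panel-heading")
heading_text = heading.get_text(strip=True)
print(heading_text)

# Descendant selector: every li inside any ul
all_items = soup.select("ul li")

# Combine an id selector with a class selector
second_list_items = [li.get_text() for li in soup.select("#list-2 .element")]

# Read an attribute from each match
ids = [ul["id"] for ul in soup.select("ul")]
print(len(all_items), second_list_items, ids)
```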
