Erlo

Using BeautifulSoup in Python

Published 2019-01-20 17:01:51

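Basic usage of the BS4 library:

1. Best used together with the lxml library. Install: pip install lxml

2. Best used together with the Requests library. Install: pip install requests

3. Install bs4: pip install bs4 (this pulls in the beautifulsoup4 package)

4. Typing pip on the command line does nothing? Fix: Environment Variables -> System variables -> Path -> New: C:\Python27\Scripts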
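Since lxml is an optional extra, a script can check whether it is available and otherwise fall back to the parser bundled with Python. The following is a minimal sketch, assuming Python 3; the helper name `pick_parser` is hypothetical and not part of bs4.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that module is installed, else `fallback`.

    `pick_parser` is a hypothetical helper for illustration only.
    """
    # find_spec only locates the module; it does not import it
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Pass the result as the second argument to BeautifulSoup(...)
print(pick_parser())
```

The returned name can be passed wherever the examples here hard-code 'lxml'.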

 

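Example: fetching a site's title

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "https://www.baidu.com"

# Download the page
response = requests.get(url)

# Parse the raw response bytes with the lxml parser
soup = BeautifulSoup(response.content, 'lxml')

# Print the text of the title tag
print(soup.title.text)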
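The same idea can be made a little more defensive. The sketch below is one possible variant, not the article's method: it assumes the built-in "html.parser" so it runs without lxml, and the function names `extract_title` and `fetch_title` and the 10-second timeout are illustrative choices.

```python
import requests
from bs4 import BeautifulSoup


def extract_title(markup, parser="html.parser"):
    """Return the stripped title text, or None if there is no usable title."""
    soup = BeautifulSoup(markup, parser)
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url, timeout=10):
    """Download `url` and return its title; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return extract_title(response.content)
```

Keeping the parsing step in its own function means it can be tried on a literal string without touching the network, for example `extract_title("<title>Demo</title>")`.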

 

Tag recognition

Example 1:

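# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html, 'lxml')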

 

# BeautifulSoup has a built-in method for pretty-printed output
print(soup.prettify())

# Text of the title tag
print(soup.title.string)

# Name of the title tag's parent node
print(soup.title.parent.name)

# The first p tag
print(soup.p)

# The class attribute of the first p tag (returned as a list)
print(soup.p["class"])

# The first a tag
print(soup.a)

# Find all a tags
print(soup.find_all('a'))

# Find the element with id='link3'
print(soup.find(id='link3'))
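Attribute access has a few edge cases worth seeing in isolation. The sketch below uses a small hypothetical snippet (not the full document above) and the built-in "html.parser"; it shows that `class` comes back as a list, that `tag.get()` tolerates a missing attribute, and that `find()` returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; the second link has no href
html = '''
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, "html.parser")

# class can hold several values, so bs4 returns it as a list
print(soup.p["class"])

# tag.get() returns None for a missing attribute; tag["href"] would raise KeyError
links = [(a.get("id"), a.get("href"), a.get_text()) for a in soup.find_all("a")]
print(links)

# find() returns None when nothing matches, so check before using the result
missing = soup.find(id="link3")
print(missing is None)
```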

 

Example 2:

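# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''

soup = BeautifulSoup(html, 'lxml')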

 

# .contents returns the direct children of the first p tag (text nodes and tags) as a list
print(soup.p.contents)
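Because `.contents` mixes text nodes with tags, it is often useful to filter it. The following sketch, on a shortened hypothetical paragraph and with the built-in "html.parser", contrasts `.contents`, `.children`, and `.descendants`.

```python
from bs4 import BeautifulSoup, Tag

# Shortened hypothetical paragraph for illustration
html = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children: text nodes and tags together
print(len(soup.p.contents))

# .children iterates over the same nodes; keep only the tags
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]
print([tag["id"] for tag in child_tags])

# .descendants also walks into nested nodes (here, the text inside each a tag)
print(len(list(soup.p.descendants)))
```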

 

find_all example:

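# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div>
        <ul id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')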

 

# Find all ul tags
print(soup.find_all('ul'))

# Call find_all again on each result to get all of its li tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# Find elements whose id is list-1
print(soup.find_all(attrs={'id': 'list-1'}))

# Find elements whose class is element
print(soup.find_all(attrs={'class': 'element'}))

# Find all text nodes equal to 'Foo' (newer bs4 releases prefer string='Foo')
print(soup.find_all(text='Foo'))
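`find_all` accepts a few other filters besides `attrs`. The sketch below is a small illustration on a trimmed hypothetical list, using the built-in "html.parser"; it assumes a bs4 release recent enough to support the `string` argument.

```python
from bs4 import BeautifulSoup

# Trimmed hypothetical list for illustration
html = '''
<ul id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ (trailing underscore) avoids clashing with the Python keyword `class`
items = soup.find_all("li", class_="element")
print([li.get_text() for li in items])

# limit stops the search after the first N matches
first_two = soup.find_all("li", limit=2)
print(len(first_two))

# string= matches text nodes; it is the newer spelling of text=
foo_nodes = soup.find_all(string="Foo")
print(foo_nodes)
```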

 

CSS selector example:

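# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div>
        <ul id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')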

 

# Select elements with class panel-heading nested inside an element with class panel
print(soup.select('.panel .panel-heading'))

# Select all li tags nested inside ul tags
print(soup.select('ul li'))

# Select elements with class element inside the element whose id is list-2
print(soup.select('#list-2 .element'))

# Use get_text() to get the text content
for li in soup.select('li'):
    print(li.get_text())

# Attributes can be read with [name] or with attrs[name]
for ul in soup.select('ul'):
    print(ul['id'])
    # print(ul.attrs['id'])
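When only one match is wanted, `select_one` avoids indexing into a list. The sketch below runs on a condensed hypothetical version of the panel markup with the built-in "html.parser", and also shows the `>` child combinator.

```python
from bs4 import BeautifulSoup

# Condensed hypothetical markup for illustration
html = '''
<div class="panel">
  <ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul id="list-2"><li class="element">Foo</li></ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match; ">" restricts the match to direct children
first = soup.select_one("#list-1 > li.element")
print(first.get_text())

# select_one returns None when nothing matches (select would return an empty list)
print(soup.select_one("#list-3") is None)

# .get("id") is the tolerant form of ul["id"]
ids = [ul.get("id") for ul in soup.select("ul")]
print(ids)
```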

 
