Python Web Scraping in Practice: Beautiful Soup

Hi everyone, I'm geo, the lead instructor for this Python web scraping course.

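Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML. It is commonly used in web scrapers to parse HTML content.

It turns a page's source into a data structure that you can query, traverse, and search.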

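Parse a page's HTML
Find elements by tag name, attribute, or CSS selector
Extract text, links, and images
Clean up page content
Provide structured data for a crawler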
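As a minimal sketch of these tasks, the snippet below parses a small hand-written HTML string and pulls out the heading text, the link targets, and the image sources. The markup and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = """
<html><body>
  <h1>Daily News</h1>
  <a href="/a.html">First story</a>
  <a href="/b.html">Second story</a>
  <img src="/logo.png" alt="logo">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1>, with surrounding whitespace removed.
heading = soup.h1.get_text(strip=True)
# The href of every <a> and the src of every <img>.
links = [a.get("href") for a in soup.find_all("a")]
images = [img.get("src") for img in soup.find_all("img")]

print(heading)
print(links)
print(images)
```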

pip install beautifulsoup4
pip install requests

If you would rather not use the built-in html.parser, you can install one of these alternative parsers:

pip install lxml
pip install html5lib
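Because lxml and html5lib are optional installs, a script may want to fall back to html.parser when they are missing. The sketch below shows one way to do that; make_soup and its preferred parameter are hypothetical names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.p.get_text())
```

Beautiful Soup raises FeatureNotFound when the requested parser is unavailable, so the same script runs whether or not the optional packages are installed.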

Fetching a page with requests

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")  # "lxml" or "html5lib" also work

The soup object represents the DOM tree of the entire page.
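In a real scraper it usually helps to set a timeout and to fail early on HTTP errors before parsing. The following is a sketch under assumptions: fetch_soup, the 10-second timeout, and the get parameter (a hook that lets you swap in a fake downloader for testing) are all illustrative choices, not something the course prescribes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, get=requests.get):
    """Download url and return the parsed page as a BeautifulSoup object."""
    # timeout keeps the call from hanging forever on a slow server.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("https://example.com") would then return a soup for that page, and soup.title would give its title tag.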

Getting tags

Getting the first tag

soup.title
soup.h1

Note: this form returns only the first matching tag, or None if the document contains no such tag.
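A short sketch of that behavior, using a made-up two-heading document: attribute-style access gives the first match, and a missing tag comes back as None, so it is worth guarding before calling methods on it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <h1> tags and no <h2>.
soup = BeautifulSoup("<h1>First</h1><h1>Second</h1>", "html.parser")

first_h1 = soup.h1   # only the first <h1>
missing = soup.h2    # there is no <h2>, so this is None

print(first_h1.get_text())
print(missing is None)

# Guard before using a tag that may be absent.
subtitle = soup.h2.get_text() if soup.h2 is not None else ""
```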

Getting a tag's text

soup.h1.text

Or:

soup.h1.get_text()
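The two forms return the same string when get_text() is called without arguments. The method form additionally accepts a separator and a strip flag, which the sketch below demonstrates on an invented snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs with padding around the text.
soup = BeautifulSoup("<div><p> Hello </p><p> world </p></div>", "html.parser")

raw = soup.div.text                            # same result as soup.div.get_text()
joined = soup.div.get_text(" ", strip=True)    # strip each piece, join with a space
piped = soup.div.get_text("|", strip=True)     # any separator string works

print(repr(raw))
print(joined)
print(piped)
```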

Getting tag attributes

HTML example:

<a href="https://example.com">Link</a>

a = soup.a
print(a["href"])
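Square-bracket access raises KeyError when the attribute is missing, while get() returns None. The sketch below contrasts the two on a made-up pair of links and also shows that class comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with attributes, one without an href.
html = '<a href="https://example.com" class="link external">Link</a><a>No target</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])        # dictionary-style access
print(first.get("class"))   # class is multi-valued, so it is a list
print(second.get("href"))   # missing attribute: get() returns None

try:
    second["href"]
except KeyError:
    print("second link has no href")
```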

find and find_all

find: look up a single element

soup.find("div")
soup.find("a", class_="link")

find_all: look up multiple elements

soup.find_all("a")

The result is a list (a ResultSet, which is a subclass of list), so you can loop over it:

for a in soup.find_all("a"):
    print(a.text, a.get("href"))
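The loop above can be tried offline with a small invented list of links. The sketch also uses href=True to keep only links that have a target, limit to cap the number of results, and shows that a search with no matches gives an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links, the last one without an href.
html = """
<ul>
  <li><a href="/1">One</a></li>
  <li><a href="/2">Two</a></li>
  <li><a>Three</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
with_href = soup.find_all("a", href=True)   # only tags that have an href attribute
first_two = soup.find_all("a", limit=2)     # stop after two matches
no_tables = soup.find_all("table")          # no match: empty list

print(pairs)
print(len(with_href), len(first_two), len(no_tables))
```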

Common arguments

soup.find("div", id="content")
soup.find_all("span", class_="price")

Note: class is a reserved word in Python, so the keyword argument must be written as class_.
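A brief sketch of these filters on invented markup. It also uses the attrs dictionary, which is the usual way to filter on attribute names such as data-role that are not valid Python keyword arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div id="content">
  <span class="price sale">9.99</span>
  <span class="price">19.99</span>
  <span data-role="note">tax included</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

content = soup.find("div", id="content")
# class_ matches any tag whose class list contains "price".
prices = [s.get_text() for s in soup.find_all("span", class_="price")]
# Attribute names with a hyphen go through the attrs dictionary.
notes = soup.find_all("span", attrs={"data-role": "note"})

print(content["id"])
print(prices)
print(notes[0].get_text())
```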

CSS selectors

Download SelectorGadget (zen) to find CSS selectors by clicking on page elements.

The select method

soup.select("h1")
soup.select(".price")
soup.select("#content")

Descendant selectors

soup.select("div.article h2")
soup.select("ul li a")

Getting the first match

soup.select_one("h1")
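To see these selectors in action without a network request, the sketch below runs them against a made-up page fragment. select returns a list of all matches, while select_one returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div id="content">
  <div class="article"><h2>Title A</h2><span class="price">10</span></div>
  <div class="article"><h2>Title B</h2><span class="price">20</span></div>
  <ul><li><a href="/x">X</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = [h2.get_text() for h2 in soup.select("div.article h2")]  # descendant selector
prices = [s.get_text() for s in soup.select(".price")]            # class selector
link = soup.select_one("#content ul li a")                        # first match only
missing = soup.select_one("table")                                # no match: None

print(titles)
print(prices)
print(link["href"])
print(missing)
```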

Stripping leading and trailing whitespace

text = tag.get_text(strip=True)

Removing newlines

text = text.replace("\n", "")
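Removing newlines with replace leaves any surrounding indentation in place. A common alternative, sketched here on an invented paragraph, is to split on whitespace and rejoin with single spaces.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines and extra spaces inside the text.
soup = BeautifulSoup("<p>\n  Price:\n  42  USD \n</p>", "html.parser")
tag = soup.p

text = tag.get_text(strip=True)        # trims the ends only
no_newlines = text.replace("\n", "")   # newlines gone, inner spaces remain
normalized = " ".join(text.split())    # collapse every whitespace run to one space

print(repr(no_newlines))
print(repr(normalized))
```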

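Using regular expressions:

find_all returns a list of Tag objects, so index the result (for example links[0]) to get a single tag.

import re
# Store the matches under a new name so that soup still refers to the whole page.
links = soup.find_all("a", href=re.compile("iana"))
print(links[0].get("href"))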
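The regular-expression filter can be tried offline on a small invented document. The sketch also guards against an empty result, since indexing an empty list raises IndexError.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<a href="https://www.iana.org/domains/example">More information</a>
<a href="https://example.com/about">About</a>
<a>No target</a>
"""
page = BeautifulSoup(html, "html.parser")

# Tags whose href contains "iana"; tags without an href never match.
iana_links = page.find_all("a", href=re.compile("iana"))
# Tags whose href starts with "https://".
https_links = page.find_all("a", href=re.compile(r"^https://"))

first = iana_links[0].get("href") if iana_links else None
print(first)
print(len(https_links))
```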

Combining conditions:

soup.find_all("div", class_="news", id="top")
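When several filters are passed, a tag must satisfy all of them. The sketch below checks this on invented markup and shows what should be an equivalent CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three news blocks, only one with id="top".
html = """
<div class="news" id="top">A</div>
<div class="news">B</div>
<div class="news" id="bottom">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_kwargs = soup.find_all("div", class_="news", id="top")
by_css = soup.select("div.news#top")

print([d.get_text() for d in by_kwargs])
print([d.get_text() for d in by_css])
```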

Traversing nested structures:

container = soup.find("div", class_="box")
for p in container.find_all("p"):
    print(p.text)
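Searching from a container only looks inside that container. The sketch below, on invented markup, confirms that a paragraph outside the box is skipped, and guards against the container being absent, because find returns None in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs inside the box, one outside.
html = '<div class="box"><p>one</p><p>two</p></div><p>outside</p>'
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", class_="box")
# Guard against a missing container before searching inside it.
inside = [p.get_text() for p in container.find_all("p")] if container else []

absent = soup.find("div", class_="missing")

print(inside)
print(absent)
```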

Hands-on example: scraping the Douban book rankings

import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0"}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "lxml")

books = soup.select(".pl2 a")
for book in books:
    print(book.text.strip(), book.get("href"))

红楼梦 https://book.douban.com/subject/1007305/
活着 https://book.douban.com/subject/4913064/
哈利·波特 https://book.douban.com/subject/24531956/
1984 https://book.douban.com/subject/4820710/
三体全集
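One way to make the scraper easier to test is to separate parsing from downloading. The sketch below does that with a hypothetical parse_books function and a hand-written HTML sample. The sample is only assumed to mirror the ".pl2 a" structure used above and was not fetched from the site. The sketch uses html.parser so that it runs without lxml, and it collapses whitespace inside each title as a defensive measure.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page, assumed to mirror the ".pl2 a" structure.
SAMPLE_HTML = """
<table>
  <tr><td><div class="pl2">
    <a href="https://book.douban.com/subject/1007305/">
      红楼梦
    </a>
  </div></td></tr>
  <tr><td><div class="pl2">
    <a href="https://book.douban.com/subject/4913064/">
      活着
    </a>
  </div></td></tr>
</table>
"""


def parse_books(html):
    """Return (title, url) pairs for every link matched by ".pl2 a"."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for link in soup.select(".pl2 a"):
        # Collapse newlines and repeated spaces inside the link text.
        title = " ".join(link.get_text().split())
        books.append((title, link.get("href")))
    return books


for title, url in parse_books(SAMPLE_HTML):
    print(title, url)
```

With this split, res.text from the request above can be passed straight to parse_books, and the parsing logic can be checked without touching the network.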
