Python Web Scraping in Practice: re Regular Expressions

Hi everyone, I'm geo, the lead instructor for this Python web scraping course.

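What Is the `re` Module?

`re` is a module in Python's standard library for working with regular expressions. It can search, match, replace, and split text, which makes it a core tool for pulling data out of scraped pages.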

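Basic Usage and Functions

Function	Purpose
`re.search()`	Scans the string and returns the first match
`re.match()`	Matches only at the start of the string
`re.findall()`	Finds all matches and returns them as a list
`re.finditer()`	Returns an iterator of match objects (lighter on memory when there are many matches)
`re.sub()`	Replaces matched text in the string
`re.split()`	Splits the string on a regular expression
`re.compile()`	Compiles a pattern into a reusable regex object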
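To make the differences in the table concrete, here is a small sketch that runs the same pattern through several of these functions. The sample string and the variable names are made up for illustration.

```python
import re

text = "order 66 shipped, order 77 pending"

# re.match only tries position 0, so a digit pattern fails on this string.
at_start = re.match(r"\d+", text)
print(at_start)  # None

# re.search scans forward and stops at the first match.
first = re.search(r"\d+", text)
print(first.group())  # 66

# re.findall collects every match as a list of strings.
all_numbers = re.findall(r"\d+", text)
print(all_numbers)  # ['66', '77']

# re.finditer yields match objects one by one, so positions are available too.
spans = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(spans)  # [('66', 6), ('77', 24)]
```

Note that `re.match` and `re.search` return `None` when nothing matches, so real code should check the result before calling `.group()`.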

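Basic Regular Expression Syntax

Special Characters (Metacharacters)

Character	Meaning	Example
`.`	Matches any single character except a newline	`a.b` matches `acb`
`^`	Matches the start of the string	`^hello` matches text that starts with hello
`$`	Matches the end of the string	`world$` matches text that ends with world
`*`	Matches the preceding element 0 or more times	`lo*l` matches `ll`, `lol`, `loool`
`+`	Matches the preceding element 1 or more times	`lo+l` matches `lol`, `loool`, but not `ll`
`?`	Matches the preceding element 0 or 1 times	`colou?r` matches `color` or `colour`
`{n}`	Exactly n times	`a{3}` matches `aaa`
`{n,}`	At least n times	`a{2,}` matches `aa`, `aaa`, …
`{n,m}`	Between n and m times	`a{2,4}` matches `aa`, `aaa`, `aaaa`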
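The quantifier rows above can be checked quickly with `re.fullmatch`, which succeeds only when the whole string fits the pattern. This is just a sketch; the candidate strings are chosen for illustration.

```python
import re

# Each entry pairs a pattern with candidate strings to test against it.
cases = [
    (r"lo*l", ["ll", "lol", "loool", "lool!"]),
    (r"lo+l", ["ll", "lol"]),
    (r"colou?r", ["color", "colour", "colouur"]),
    (r"a{2,4}", ["a", "aa", "aaaa", "aaaaa"]),
]

# Keep only the candidates that match the pattern from start to end.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in cases
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

Using `re.fullmatch` here avoids a common surprise: with `re.search`, `a{2,4}` would still find a match inside `aaaaa`, because four of the five characters satisfy it.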

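Character Sets (Square Brackets)

Expression	Meaning
`[abc]`	Matches a, b, or c
`[a-z]`	Matches any lowercase English letter
`[^0-9]`	Matches any character that is not a digit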
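As a quick illustration of the three forms above, the sketch below applies a literal set, a range, and a negated set to one made-up string.

```python
import re

text = "Room 42B, floor 3"

# A literal set: any single one of the listed characters.
vowels = re.findall(r"[aeiou]", text)
print(vowels)  # ['o', 'o', 'o', 'o']

# A range with "+": runs of lowercase ASCII letters.
lowercase = re.findall(r"[a-z]+", text)
print(lowercase)  # ['oom', 'floor']

# A negated set with "+": runs of anything that is not a digit.
non_digits = re.findall(r"[^0-9]+", text)
print(non_digits)  # ['Room ', 'B, floor ']
```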

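Common Escape Sequences

Expression	Meaning
`\d`	A digit (0-9; in `str` patterns also other Unicode digits unless `re.ASCII` is set)
`\D`	A non-digit
`\w`	A word character (letter, digit, or underscore)
`\W`	A non-word character
`\s`	A whitespace character (space, tab, newline, etc.)
`\S`	A non-whitespace character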
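Here is a short sketch of the shorthand classes in action. The sample text is invented, and the last two lines show how the `re.ASCII` flag narrows `\w` to ASCII characters only.

```python
import re

text = "user_01 logged in\tat 09:30\n"

digits = re.findall(r"\d+", text)
print(digits)  # ['01', '09', '30']

words = re.findall(r"\w+", text)
print(words)  # ['user_01', 'logged', 'in', 'at', '09', '30']

spaces = re.findall(r"\s", text)
print(spaces)  # [' ', ' ', '\t', ' ', '\n']

# By default \w is Unicode-aware for str patterns; re.ASCII restricts it.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(unicode_words, ascii_words)  # ['café'] ['caf']
```

This matters for scraping pages in Chinese or other non-Latin scripts, since `\w` will match those characters as well unless you opt out.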

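Matching an Email Address

import re

text = "Contact me: test123@example.com"
# Simplified pattern: word characters, "@", word characters, ".", word characters
pattern = r"\w+@\w+\.\w+"

match = re.search(pattern, text)
print(match.group())  # Output: test123@example.com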
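One thing the snippet above glosses over is that `re.search` returns `None` when nothing matches, and calling `.group()` on `None` raises `AttributeError`. A minimal sketch of a safer wrapper follows; `first_email` and `EMAIL` are hypothetical names used only for this example.

```python
import re

# Same simplified pattern as above; it is not a full email validator.
EMAIL = r"\w+@\w+\.\w+"

def first_email(text):
    """Return the first email-like substring, or None when nothing matches."""
    match = re.search(EMAIL, text)
    return match.group() if match else None

print(first_email("Contact me: test123@example.com"))  # test123@example.com
print(first_email("no address here"))                  # None
```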

Extracting All Phone Numbers

import re

text = "Zhang San: 012-3456789, Li Si: 019-1122334"
# Three digits, a hyphen, then seven or eight digits
pattern = r"\d{3}-\d{7,8}"

phones = re.findall(pattern, text)
print(phones)  # ['012-3456789', '019-1122334']

Replacing Sensitive Words

import re

text = "You are such an idiot!"
pattern = r"idiot"
clean = re.sub(pattern, "**", text)
print(clean)  # You are such an **!
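`re.sub` also accepts a function as the replacement, which is handy when the mask should depend on what was matched. The sketch below masks each hit with as many asterisks as it has characters; the word list and the `mask` helper are assumptions made for this example.

```python
import re

def mask(match):
    """Replace the matched word with asterisks of the same length."""
    return "*" * len(match.group())

# Hypothetical word list, joined with "|" and matched case-insensitively.
banned = re.compile(r"idiot|stupid", re.IGNORECASE)

clean = banned.sub(mask, "Don't be an Idiot, that was stupid.")
print(clean)  # Don't be an *****, that was ******.
```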

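Splitting a String on Multiple Delimiters

import re

text = "apple,banana;cherry orange"
# The character set matches a semicolon, a comma, or a space
parts = re.split(r"[;, ]", text)
print(parts)  # ['apple', 'banana', 'cherry', 'orange']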
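Real input often has several delimiters in a row, such as a comma followed by a space. With the single-character set above, that produces empty strings in the result. A small sketch of the difference, using a made-up string:

```python
import re

text = "apple, banana;  cherry orange"

# One delimiter per split: adjacent delimiters leave empty strings behind.
single = re.split(r"[;, ]", text)
print(single)  # ['apple', '', 'banana', '', '', 'cherry', 'orange']

# Adding "+" treats a run of delimiters as a single separator.
multi = re.split(r"[;, ]+", text)
print(multi)  # ['apple', 'banana', 'cherry', 'orange']
```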

Using `compile()` for Efficiency

import re

# Compile once, then reuse the regex object
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
text = "Today is 2025-08-02"
match = pattern.search(text)
print(match.group())  # 2025-08-02

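Another example

import re

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
text = "Today is 2025-08-02 2025-08-05"
# findall returns a list of every match, not a single match object
dates = pattern.findall(text)
print(dates)  # ['2025-08-02', '2025-08-05']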
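A compiled pattern pays off most when the same expression runs over many strings, for example every line of a scraped page. The sketch below assumes a small list of made-up lines and adds capture groups to pull out the date parts; `DATE`, `lines`, and `found` are illustrative names.

```python
import re

# Compiled once at module level, reused for every line.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

lines = ["start 2025-08-02", "no date here", "end 2025-08-05"]

found = []
for line in lines:
    m = DATE.search(line)
    if m:  # skip lines without a date
        year, month, day = map(int, m.groups())
        found.append((year, month, day))

print(found)  # [(2025, 8, 2), (2025, 8, 5)]
```

The `re` module also caches recently used patterns internally, so for a handful of calls the speed gain from `compile()` is small; the clearer naming is often the bigger benefit.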

Basic Matching

import re

text = "Hello 123 World"
match = re.search(r"\d+", text)  # \d+ matches one or more digits
print(match.group())  # Output: 123

Finding All Matches

import re

text = "cat, dog, cat, bird"
matches = re.findall(r"cat", text)
print(matches)  # Output: ['cat', 'cat']

Replacing Text

import re

text = "2025-09-04"
new_text = re.sub(r"-", "/", text)  # replace each - with /
print(new_text)  # Output: 2025/09/04

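Extracting an Email

import re

text = "My email is test123@example.com, please contact me."
# A stricter pattern than \w+@\w+\.\w+: allows dots, hyphens, and plus signs
email = re.search(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
print(email.group())  # Output: test123@example.com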

Matching the Start or End

import re

text = "Hello World"
print(bool(re.match(r"Hello", text)))  # True, matches at the start
print(bool(re.search(r"World$", text)))  # True, matches at the end
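To round out the start and end checks, the sketch below contrasts `re.match`, `re.search`, and `re.fullmatch` on the same string. The variable names are illustrative only.

```python
import re

text = "Hello World"

starts = bool(re.match(r"Hello", text))          # anchored at the start
ends = bool(re.search(r"World$", text))          # "$" anchors at the end
not_start = bool(re.match(r"World", text))       # "World" is not at position 0
partial = bool(re.fullmatch(r"Hello", text))     # the pattern must cover everything
whole = bool(re.fullmatch(r"Hello \w+", text))   # this one covers the whole string

print(starts, ends, not_start, partial, whole)  # True True False False True
```

`re.fullmatch` is usually the right choice for validating an entire field, since it behaves as if the pattern were wrapped in both a start and an end anchor.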

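Extracting with Groups

import re

text = "Name: Zhangsan, Age: 25"
# Each pair of parentheses captures one group, numbered from 1
match = re.search(r"Name:\s*(\S+),\s*Age:\s*(\d+)", text)
print(match.group(1))  # Zhangsan
print(match.group(2))  # 25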
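Numbered groups get hard to track as a pattern grows. One option is named groups, written `(?P<name>...)`, which let you read captures by name. The following is a sketch on a made-up input string; the group names `name` and `age` are choices made for this example.

```python
import re

text = "Name: Zhangsan, Age: 25"

# (?P<label>...) captures like a normal group but can be read by label.
m = re.search(r"Name:\s*(?P<name>\S+),\s*Age:\s*(?P<age>\d+)", text)

print(m.group("name"))  # Zhangsan
print(m.group("age"))   # 25

# groupdict() returns all named captures as a dictionary of strings.
record = m.groupdict()
print(record)  # {'name': 'Zhangsan', 'age': '25'}
```

Captured values are always strings, so convert with `int(record["age"])` if you need a number.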
