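The re Module

The re module provides regular expression support for complex string matching, searching, replacing, and splitting. Regular expressions are a powerful tool for processing text, but the syntax is dense and is best learned step by step.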

Basic matching

re.search: find the first match

import re

match = re.search(r"\d+", "abc123def")
if match:
    print match.group()     # 123
    print match.start()     # 3
    print match.end()       # 6
    print match.span()      # (3, 6)

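r"..." is a raw string, which keeps Python from interpreting backslashes as escape sequences before the regex engine sees them. \d+ matches one or more digits.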
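
To make the raw-string point concrete, the sketch below compares the same pattern written with and without the r prefix. The sample text and variable names are made up for illustration; the print() calls take a single argument so the block runs under Python 2.7 and Python 3.

```python
import re

# In a normal string literal, "\b" is the backspace character (ASCII 8),
# so the regex engine never sees the word-boundary escape.
plain = "\bcat\b"
raw = r"\bcat\b"

assert len(plain) == 5      # backspace + "cat" + backspace
assert len(raw) == 7        # backslash, "b", "cat", backslash, "b"

text = "concat cat category"
plain_hit = re.search(plain, text)   # looks for literal backspace characters
raw_hit = re.search(raw, text)       # looks for the standalone word "cat"

print(plain_hit)            # None
print(raw_hit.span())       # (7, 10)
```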

re.match: match at the start of the string

print re.match(r"\d+", "abc123")       # None: the string does not start with a digit
print re.match(r"\d+", "123abc")       # matches "123"

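match behaves like search with the pattern anchored at the start of the string. It is not exactly search plus ^: under re.MULTILINE, ^ also matches at the start of every line, whereas match only ever tries position 0.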
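
A small sketch of how match and an ^-anchored search relate, including the one case where they differ (re.MULTILINE). The sample string is an assumption chosen for illustration.

```python
import re

text = "abc\n123"

# Without flags the two calls agree: there are no digits at position 0.
assert re.match(r"\d+", text) is None
assert re.search(r"^\d+", text) is None

# With re.MULTILINE, ^ also matches right after a newline, so search succeeds...
multi_search = re.search(r"^\d+", text, re.MULTILINE)
# ...while match is still tied to the very start of the string.
multi_match = re.match(r"\d+", text, re.MULTILINE)

print(multi_search.group())   # 123
print(multi_match)            # None
```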

re.findall: find all matches

print re.findall(r"\d+", "abc123def456")     # ['123', '456']
print re.findall(r"[a-z]+", "abc123def456")   # ['abc', 'def']

re.finditer: return an iterator of match objects (saves memory on large inputs)

for match in re.finditer(r"\d+", "abc123def456"):
    print match.group(), match.span()
# 123 (3, 6)
# 456 (9, 12)
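
Because finditer produces matches lazily, it can be combined with itertools.islice to take only the first few. The helper name first_n_numbers and the sample string are hypothetical, used here only to sketch the idea.

```python
import re
from itertools import islice

def first_n_numbers(text, n):
    """Return up to n integers found in text, pulling matches lazily."""
    matches = re.finditer(r"\d+", text)
    # islice takes at most n match objects from the iterator
    return [int(m.group()) for m in islice(matches, n)]

sample = "id=7 width=640 height=480 depth=24"
print(first_n_numbers(sample, 2))    # [7, 640]
```

With findall, the full list of matches would be built before any of it could be discarded.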

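Common patterns

Pattern	Meaning	Example
`.`	Any character except a newline	`a.c` matches "abc"
`\d`	Digit	`\d{3}` matches 3 digits
`\w`	Word character (letter, digit, or underscore)	`\w+` matches a word
`\s`	Whitespace character	`\s+` matches a run of spaces, tabs, or newlines
`^`	Start of string	`^Hello`
`$`	End of string	`world$`
`*`	0 or more times	`a*`
`+`	1 or more times	`a+`
`?`	0 or 1 time	`a?`
`{n}`	Exactly n times	`a{3}`
`{n,m}`	Between n and m times	`a{2,4}`
`[]`	Character set	`[abc]` matches a, b, or c
`|`	Alternation (or)	`cat|dog`
`()`	Group	`(\d+)-(\d+)`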
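
The entries in the table can be spot-checked with a few assertions. The test strings below are assumptions chosen for illustration; each tuple holds a pattern, a subject string, and whether a match is expected.

```python
import re

checks = [
    (r"a.c", "abc", True),
    (r"a.c", "a\nc", False),              # . does not match a newline by default
    (r"^\d{3}$", "123", True),
    (r"^\d{3}$", "1234", False),
    (r"^a{2,4}$", "aaa", True),
    (r"^a{2,4}$", "aaaaa", False),
    (r"^[abc]+$", "cab", True),
    (r"^(cat|dog)$", "dog", True),
    (r"^colou?r$", "color", True),        # ? makes the "u" optional
    (r"^\w+\s+\w+$", "hello   world", True),
]

for pattern, subject, expected in checks:
    found = re.search(pattern, subject) is not None
    assert found == expected, (pattern, subject)

print("all %d checks passed" % len(checks))
```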

Capturing groups

import re

# Extract the parts of a date
text = "Date: 2024-01-15"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if match:
    print match.group(0)    # 2024-01-15 (the whole match)
    print match.group(1)    # 2024 (first group)
    print match.group(2)    # 01 (second group)
    print match.group(3)    # 15 (third group)
    print match.groups()    # ('2024', '01', '15')
    print match.groups()    # ('2024', '01', '15')

Named groups (supported in Python 2.7):

match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
print match.group("year")       # 2024
print match.groupdict()         # {'year': '2024', 'month': '01', 'day': '15'} (key order may vary)
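
Named groups pair naturally with groupdict() when the captured strings need to be converted into typed values. The sketch below uses a hypothetical helper, parse_date, and a hypothetical constant, DATE_RE; it assumes the matched digits form a valid calendar date, since datetime.date raises ValueError otherwise.

```python
import re
from datetime import date

DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def parse_date(text):
    """Return a datetime.date for the first YYYY-MM-DD in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    parts = match.groupdict()
    # Access by name instead of by position
    return date(int(parts["year"]), int(parts["month"]), int(parts["day"]))

print(parse_date("Date: 2024-01-15"))    # 2024-01-15
print(parse_date("no date here"))        # None
```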

Substitution

re.sub: replace matches

text = "Hello 123 world 456"
result = re.sub(r"\d+", "NUM", text)
print result            # Hello NUM world NUM

# Use a function to compute each replacement
result = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
print result            # Hello 246 world 912

re.subn: replace and return the number of substitutions

result, count = re.subn(r"\d+", "NUM", text)
print result, count     # Hello NUM world NUM 2
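
Two further features of re.sub are worth a sketch: the replacement string can refer back to captured groups with \1, \2, and so on, and the count argument limits how many matches are replaced. The sample text is an assumption for illustration.

```python
import re

text = "2024-01-15 and 2024-02-20"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \1, \2, \3 refer to the captured groups, reordered here as DD/MM/YYYY
swapped = re.sub(date_pattern, r"\3/\2/\1", text)
print(swapped)      # 15/01/2024 and 20/02/2024

# count=1 limits the substitution to the first match
first_only = re.sub(date_pattern, r"\3/\2/\1", text, count=1)
print(first_only)   # 15/01/2024 and 2024-02-20
```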

Splitting

re.split: split on a regular expression

text = "apple, banana; orange  lemon"
fruits = re.split(r"[,;\s]+", text)
print fruits            # ['apple', 'banana', 'orange', 'lemon']

More flexible than str.split(): the pattern can describe several different separators at once.
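
re.split also accepts a maxsplit argument, and a capturing group in the pattern keeps the separators in the result. A brief sketch with made-up input strings:

```python
import re

line = "key = value = with = equals"

# maxsplit=1 stops after the first separator
key, value = re.split(r"\s*=\s*", line, maxsplit=1)
print(key)      # key
print(value)    # value = with = equals

# A capturing group in the pattern keeps the separators in the result
tokens = re.split(r"([+-])", "10+20-5")
print(tokens)   # ['10', '+', '20', '-', '5']
```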

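Compiling patterns

Patterns that are used repeatedly are worth compiling once: the pattern is parsed a single time and the resulting object can be given a descriptive name. (The re module also caches recently used patterns, so the gain is mostly noticeable in tight loops.)

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Reuse the compiled pattern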
print pattern.search("Date: 2024-01-15")
print pattern.search("Date: 2024-02-20")
print pattern.findall("2024-01-15 and 2024-02-20")

A compiled pattern object offers the same methods as the module: search, match, findall, sub, and so on.
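
A typical arrangement is to compile the pattern once at module level and reuse it inside a function that is called many times. The names ISO_DATE and lines_with_dates below are hypothetical, and the sample lines are invented for illustration.

```python
import re

# Compiled once when the module is loaded, then reused by every call
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def lines_with_dates(lines):
    """Return (line_number, dates) pairs for lines containing YYYY-MM-DD dates."""
    hits = []
    for number, line in enumerate(lines, 1):
        dates = ISO_DATE.findall(line)
        if dates:
            hits.append((number, dates))
    return hits

sample = [
    "started 2024-01-15",
    "nothing to see",
    "from 2024-02-01 to 2024-02-20",
]
print(lines_with_dates(sample))
# [(1, ['2024-01-15']), (3, ['2024-02-01', '2024-02-20'])]
```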

Flags

import re

# Ignore case
re.search(r"hello", "HELLO", re.IGNORECASE)

# Multiline mode (^ and $ match at the start and end of each line)
re.search(r"^start", "line1\nstart here", re.MULTILINE)

# Let the dot match newlines too
re.search(r"a.b", "a\nb", re.DOTALL)

# Combine flags
re.search(r"pattern", text, re.IGNORECASE | re.MULTILINE)
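
The effect of each flag is easiest to see by running the same search with and without it. The sketch below does that for the three flags shown here and also demonstrates re.VERBOSE, a standard flag not covered in the text, which allows whitespace and comments inside a pattern.

```python
import re

# IGNORECASE: letters match regardless of case
assert re.search(r"hello", "HELLO") is None
assert re.search(r"hello", "HELLO", re.IGNORECASE).group() == "HELLO"

# MULTILINE: ^ matches at the start of each line, not only the string
assert re.search(r"^start", "line1\nstart here") is None
assert re.search(r"^start", "line1\nstart here", re.MULTILINE).start() == 6

# DOTALL: . also matches the newline character
assert re.search(r"a.b", "a\nb") is None
assert re.search(r"a.b", "a\nb", re.DOTALL).group() == "a\nb"

# VERBOSE: whitespace and # comments inside the pattern are ignored
date_re = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)
date_parts = date_re.search("Date: 2024-01-15").groups()
print(date_parts)     # ('2024', '01', '15')
```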

Practical examples

Validating an email address:

def is_valid_email(email):
    pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"
    return re.match(pattern, email) is not None

print is_valid_email("user@example.com")    # True
print is_valid_email("invalid.email")       # False
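
One subtlety with $ in a validation pattern: it also matches just before a trailing newline, so input ending in "\n" still passes. Where that matters, \Z matches only at the very end of the string. The sketch reuses the pattern from the example above; the names LOOSE and STRICT are hypothetical.

```python
import re

LOOSE = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")     # $ tolerates a final newline
STRICT = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+\Z")   # \Z does not

with_newline = "user@example.com\n"

loose_ok = LOOSE.match(with_newline) is not None
strict_ok = STRICT.match(with_newline) is not None
clean_ok = STRICT.match("user@example.com") is not None

print(loose_ok)     # True
print(strict_ok)    # False
print(clean_ok)     # True
```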

Extracting URLs:

text = "Visit https://example.com or http://test.org"
urls = re.findall(r"https?://[^\s]+", text)
print urls              # ['https://example.com', 'http://test.org']

Parsing a log line:

log_line = "192.168.1.1 - - [15/Jan/2024:14:30:25 +0800] \"GET /index.html HTTP/1.1\" 200 1234"

pattern = re.compile(
    r'^(\S+) \S+ \S+ \[(.+?)\] "(\S+) (\S+) \S+" (\d+) (\S+)'
)
match = pattern.match(log_line)
if match:
    ip, time, method, path, status, size = match.groups()
    print ip, method, path, status
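
The same log pattern can be written with named groups so that each field is looked up by name rather than by position. This is a sketch only: LOG_RE and parse_log_line are hypothetical names, and the handling of a "-" size field assumes the common access-log convention of writing "-" when no body was sent.

```python
import re

LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>.+?)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Assumption: size may be "-" instead of a number, so convert only digits
    entry["size"] = int(entry["size"]) if entry["size"].isdigit() else None
    return entry

log_line = '192.168.1.1 - - [15/Jan/2024:14:30:25 +0800] "GET /index.html HTTP/1.1" 200 1234'
entry = parse_log_line(log_line)
print(entry["ip"])       # 192.168.1.1
print(entry["path"])     # /index.html
print(entry["status"])   # 200
```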

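Cleaning text:

def clean_text(text):
    # Remove special characters first, so no doubled spaces are left behind
    text = re.sub(r"[^\w\s]", "", text)
    # Then collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print clean_text("  Hello!!!   World...  ")     # Hello World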

Greedy vs. non-greedy

Quantifiers are greedy by default (they match as many characters as possible):

print re.search(r"<.*>", "<div>content</div>").group()
# <div>content</div> (greedy: matches up to the last >)

The non-greedy form (a ? after the quantifier) matches as few characters as possible:

print re.search(r"<.*?>", "<div>content</div>").group()
# <div> (non-greedy: stops at the first >)
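
The difference becomes clearer with findall, which shows how many separate matches each form produces. A negated character class is a common alternative to the non-greedy quantifier and is included for comparison.

```python
import re

html = "<div>content</div>"

greedy = re.findall(r"<.*>", html)        # one match spanning both tags
lazy = re.findall(r"<.*?>", html)         # stops at the first > each time
negated = re.findall(r"<[^>]*>", html)    # cannot cross a > at all

print(greedy)    # ['<div>content</div>']
print(lazy)      # ['<div>', '</div>']
print(negated)   # ['<div>', '</div>']
```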

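Notes

Regular expressions are hard to read; comment complex patterns or split them into smaller pieces.
For simple operations such as finding a fixed string, str methods are faster: "abc" in text
Regex injection: do not build a pattern directly from user input. A crafted pattern can cause catastrophic backtracking (ReDoS, regular expression denial of service); if the input is meant to be matched literally, pass it through re.escape first.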
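
When user input has to appear inside a pattern as literal text, re.escape neutralizes its metacharacters. The sketch below uses a hypothetical helper, count_literal, and a made-up sample string; it also shows the plain str check for comparison.

```python
import re

def count_literal(needle, haystack):
    """Count case-insensitive occurrences of needle, treated as plain text."""
    # re.escape backslash-escapes metacharacters such as . * + ( ) in needle
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return len(pattern.findall(haystack))

text = "Price: 3.50 USD, was 3x50 or 3.50 usd"

escaped_count = count_literal("3.50", text)
unescaped_count = len(re.findall("3.50", text))   # the bare . also matches "x"

print(escaped_count)      # 2
print(unescaped_count)    # 3
print("3.50" in text)     # True
```

Escaping only prevents the input from being interpreted as regex syntax; it does not make an arbitrary user-supplied pattern safe to run.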