Python进阶编程：编写更高效、优雅的Python代码 (Advanced Python Programming: Writing More Efficient and Elegant Python Code)

2.3.3 String Matching and Searching

In practice, we sometimes need to search text for a particular pattern.

If the pattern is a literal string, basic string methods such as str.find(), str.endswith(), and str.startswith() are usually enough. For example:

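text_val = 'life is short, I use python, what about you'
print(text_val == 'life')            # exact equality: False
print(text_val.startswith('life'))   # True
print(text_val.endswith('what'))     # False, the text ends with 'you'
print(text_val.find('python'))       # 21, the index of the first occurrence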
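
As a small supplementary sketch (not part of the original listing), the same literal checks can be wrapped to show the "not found" case as well: str.find() returns -1 when the substring is absent, and the in operator answers the yes/no question directly. The helper name describe_literal_search is hypothetical and used only for illustration.

```python
text_val = 'life is short, I use python, what about you'


def describe_literal_search(text, needle):
    """Hypothetical helper: summarize literal-string checks for `needle`."""
    return {
        'contains': needle in text,         # plain membership test
        'index': text.find(needle),         # -1 when needle is absent
        'starts': text.startswith(needle),
        'ends': text.endswith(needle),
    }


print(describe_literal_search(text_val, 'python'))
print(describe_literal_search(text_val, 'java'))
```

When only a yes/no answer is needed, the in operator is usually the clearest choice; find() is useful when the position matters.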

For more complex matching, we need regular expressions and the re module. For example, to match a numeric date string such as 04/20/2020:

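import re

date_text_1 = '04/20/2020'
date_text_2 = 'April 20, 2020'

# date_text_1 starts with digits/digits/digits, so the pattern matches
if re.match(r'\d+/\d+/\d+', date_text_1):
    print('yes, the date format matches')
else:
    print('no, it does not match')

# date_text_2 spells out the month, so the pattern does not match
if re.match(r'\d+/\d+/\d+', date_text_2):
    print('yes, it matches')
else:
    print('no, it does not match')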

To match against the same pattern many times, precompile the pattern string into a pattern object first. For example:

# Compile once, then reuse the pattern object
date_pat = re.compile(r'\d+/\d+/\d+')
if date_pat.match(date_text_1):
    print('yes, the date format matches')
else:
    print('no, it does not match')

if date_pat.match(date_text_2):
    print('yes, it matches')
else:
    print('no, it does not match')
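
To illustrate why precompiling pays off, here is a minimal sketch that reuses one pattern object across several inputs. The list candidates and its contents are made up for this example.

```python
import re

# Compile once, reuse for every candidate string
date_pat = re.compile(r'\d+/\d+/\d+')

# Hypothetical sample inputs, not from the original text
candidates = ['04/20/2020', 'April 20, 2020', '3/13/2013', 'n/a']

# Keep only the strings that start with a digits/digits/digits date
matching = [s for s in candidates if date_pat.match(s)]
print(matching)
```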

match() always tries to match at the beginning of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead. For example:

date_text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(date_pat.findall(date_text))
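
For comparison, the sketch below contrasts match() with search(), another method of compiled patterns that scans the string and returns only the first occurrence as a match object. search() is not covered in the text above; it is shown here as a related option.

```python
import re

date_pat = re.compile(r'\d+/\d+/\d+')
date_text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() only looks at the start of the string, which is 'Today ...'
at_start = date_pat.match(date_text)
print(at_start)

# search() scans forward and stops at the first occurrence
first = date_pat.search(date_text)
print(first.group(), first.span())
```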

When defining a regular expression, parentheses are commonly used to create capture groups. For example:

date_pat_1 = re.compile(r'(\d+)/(\d+)/(\d+)')

Capture groups simplify later processing, because the content of each group can be extracted separately. Example code (str_match_search.py):

group_result = date_pat_1.match('04/20/2020')
print(f'group result is:{group_result}')
print(f'group 0 is:{group_result.group(0)}')
print(f'group 1 is:{group_result.group(1)}')
print(f'group 2 is:{group_result.group(2)}')
print(f'group 3 is:{group_result.group(3)}')

print(f'groups is:{group_result.groups()}')

month, date, year = group_result.groups()
print(f'month is {month}, date is {date}, year is {year}')

print(date_pat_1.findall(date_text))

for month, day, year in date_pat_1.findall(date_text):
    print(f'{year}-{month}-{day}')

Running the file produces output similar to the following:

group result is:<re.Match object; span=(0, 10), match='04/20/2020'>
group 0 is:04/20/2020
group 1 is:04
group 2 is:20
group 3 is:2020
groups is:('04', '20', '2020')
month is 04, date is 20, year is 2020
[('11', '27', '2012'), ('3', '13', '2013')]
2012-11-27
2013-3-13
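
As one possible illustration of how capture groups simplify later processing, the sketch below converts each captured group to an integer and builds datetime.date objects. Using the datetime module here is an assumption made for the example, not something the original listing does.

```python
import re
from datetime import date

date_pat_1 = re.compile(r'(\d+)/(\d+)/(\d+)')
date_text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each findall() item is a (month, day, year) tuple of strings
parsed = [
    date(int(year), int(month), int(day))
    for month, day, year in date_pat_1.findall(date_text)
]

# isoformat() zero-pads month and day, unlike the raw captured text
print([d.isoformat() for d in parsed])
```

Note that date() raises ValueError for out-of-range values such as month 13, so this also acts as a basic validity check on what the pattern captured.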

findall() searches the text and returns all matches as a list. To receive the matches one at a time through iteration, use finditer() instead. Example code (str_match_search.py):

for m_val in date_pat_1.finditer(date_text):
    print(m_val.groups())
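
Because finditer() yields match objects rather than plain tuples, each match also carries its position in the text. A minimal sketch that collects the matched text together with its span:

```python
import re

date_pat_1 = re.compile(r'(\d+)/(\d+)/(\d+)')
date_text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Collect (matched text, (start, end)) for every occurrence
found = [(m_val.group(0), m_val.span()) for m_val in date_pat_1.finditer(date_text)]
for matched, span in found:
    print(matched, span)
```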

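This covers the most basic ways to match and search text with the re module. The core steps are to compile the regular expression string with re.compile(), and then match with methods such as match(), findall(), or finditer().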

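When writing regular expression strings, the common practice is to use raw strings, such as r'(\d+)/(\d+)/(\d+)'. In a raw string, backslashes are not interpreted as escape sequences, which is very useful for regular expressions. Without a raw string, every backslash must be doubled, as in '(\\d+)/(\\d+)/(\\d+)'.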
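
A quick check of the equivalence described above. The variable names are chosen for this example only.

```python
# The raw string and the doubled-backslash string are the same pattern text
raw_pat = r'(\d+)/(\d+)/(\d+)'
escaped_pat = '(\\d+)/(\\d+)/(\\d+)'
print(raw_pat == escaped_pat)

# In a raw string, '\n' stays as two characters: a backslash and an 'n'
print(len(r'\n'), len('\n'))
```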

Note: match() only checks the beginning of the string, so its result may not be what you expect. For example:

group_result = date_pat_1.match('04/20/2020abcdef')
print(group_result)
print(group_result.group())
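
A related pitfall is that match() returns None when the string does not start with the pattern, and calling group() on None raises AttributeError. The sketch below guards against that; the helper name leading_date is hypothetical.

```python
import re

date_pat_1 = re.compile(r'(\d+)/(\d+)/(\d+)')


def leading_date(text):
    """Hypothetical helper: return the date at the start of `text`, or None."""
    m = date_pat_1.match(text)
    # match() returns None on failure, so check before calling group()
    return m.group() if m is not None else None


print(leading_date('04/20/2020abcdef'))   # trailing text is silently ignored
print(leading_date('abcdef04/20/2020'))   # no match at the start
```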

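For an exact match, make sure the regular expression ends with $, which anchors the match at the end of the string. For example: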

date_pat_2 = re.compile(r'(\d+)/(\d+)/(\d+)$')
print(date_pat_2.match('04/20/2020abcdef'))
print(date_pat_2.match('04/20/2020'))
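
One caveat worth knowing: by default, $ also matches just before a newline at the very end of the string. If that matters, the fullmatch() method of pattern objects is a stricter alternative. The sketch below compares the two; it is a supplementary illustration, not part of the original example file.

```python
import re

date_pat_1 = re.compile(r'(\d+)/(\d+)/(\d+)')
date_pat_2 = re.compile(r'(\d+)/(\d+)/(\d+)$')

with_newline = '04/20/2020\n'

# $ accepts a single trailing newline, so this still matches
dollar_ok = date_pat_2.match(with_newline) is not None
# fullmatch() requires the pattern to cover the entire string
full_ok = date_pat_1.fullmatch(with_newline) is not None
full_clean = date_pat_1.fullmatch('04/20/2020') is not None

print(dollar_ok, full_ok, full_clean)
```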

For a one-off match or search, you can skip the compile step and use the module-level functions of re directly. For example:

print(re.findall(r'(\d+)/(\d+)/(\d+)', date_text))

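Note: if you plan to do a lot of matching and searching, it is best to compile the regular expression first and then reuse it. The module-level functions cache recently compiled patterns, so they do not cost much performance. Using a precompiled pattern still saves the cache lookup and some extra processing.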
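
To get a rough feel for the difference, one could time both approaches with the standard timeit module. This is only a sketch: the function names and the repeat count of 20,000 are arbitrary choices, and the measured times depend on the machine and Python version, so no specific numbers are claimed here.

```python
import re
import timeit

date_text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern_str = r'(\d+)/(\d+)/(\d+)'
date_pat = re.compile(pattern_str)


def with_module_function():
    # Looks up the pattern in the re module's cache on every call
    return re.findall(pattern_str, date_text)


def with_compiled_pattern():
    # Uses the already compiled pattern object directly
    return date_pat.findall(date_text)


# Both approaches return the same matches; only the overhead differs
assert with_module_function() == with_compiled_pattern()

t_module = timeit.timeit(with_module_function, number=20_000)
t_compiled = timeit.timeit(with_compiled_pattern, number=20_000)
print(f'module-level: {t_module:.4f}s, precompiled: {t_compiled:.4f}s')
```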