I'm currently using pyteaser for summarization, and it works well. I was looking through the source code, but even with the help of the comment below, I don't understand the following code. Can anyone explain it?
def split_sentences(text):
    '''
    The regular expression matches all sentence ending punctuation and splits the string at those points.
    At this point in the code, the list looks like this ["Hello, world", "!" ... ]. The punctuation and all
    quotation marks are separated from the actual text. The first s_iter line turns each group of two items
    in the list into a tuple, excluding the last item in the list (the last item in the list does not need
    to have this performed on it). Then, the second s_iter line combines each tuple in the list into a
    single item and removes any whitespace at the beginning of the line. Now, the s_iter list is formatted
    correctly but it is missing the last item of the sentences list. The second to last line adds this item
    to the s_iter list and the last line returns the full list.
    '''
    sentences = regex_split('(?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z])', text)
    s_iter = zip(*[iter(sentences[:-1])] * 2)
    s_iter = [''.join(map(unicode, y)).lstrip() for y in s_iter]
    s_iter.append(sentences[-1])
    return s_iter
senshin's answer:
Line 1

First, we have the regular expression (?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z]). Here is the output of compiling it with the re.DEBUG flag:
assert_not -1
  in
    range (65, 90)
subpattern 1
  in
    literal 46
    literal 33
    literal 63
  max_repeat 0 1
    literal 34
assert 1
  max_repeat 1 65535
    in
      category category_space
  max_repeat 0 1
    literal 34
  in
    range (65, 90)
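If you want to reproduce this tree yourself, a minimal sketch (the output above is from Python 2; Python 3 prints the same structure in a slightly different format, e.g. uppercase opcode names in recent versions):

import re

re.compile(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', re.DEBUG)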
First, we look for something that is not a capital letter [A-Z] (with a negative lookbehind assertion (?<!, i.e. the assert_not above). Then we look for a punctuation mark (one of .!?), followed by zero or one double quote ". Finally, we check that all of this is followed by one or more whitespace characters \s+, zero or one double quote \", and a capital letter [A-Z] (this part is a lookahead assertion, i.e. the assert above).

The regex will only actually match the ([.!?]"?) part, i.e. the punctuation mark possibly followed by a quotation mark.
regex_split is an alias here for re.split. So text is split before and after each matched part: a punctuation mark and possibly a quote, not preceded by a capital letter, and followed by whitespace, possibly a quote, and a capital letter. For example:

'John was tired. So was Sally. But was Bob? I don\'t know! Huh.'

would give the following sentences:

['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!', ' Huh.']
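To watch the split happen, a quick sketch using the standard library directly (regex_split in pyteaser is just re.split imported under another name):

import re

text = "John was tired. So was Sally. But was Bob? I don't know! Huh."
# The capturing group keeps each delimiter as its own item in the result.
sentences = re.split(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', text)
print(sentences)
# ['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!', ' Huh.']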
Line 2

Next, we drop the last element of sentences (since the following operations don't need to be performed on it; we add it back on line 4) with [:-1]:

['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!']
We turn this into an iterator with iter and put it inside a list ([]):

[<list_iterator object at ...>]
Now, when we do zip(*[<list iterator ...>] * 2), what we are actually doing is zipping together two references to the same iterator object (unpacked out of the list by the unary * operator). This way, when zip advances the first reference once, it consumes a sentence, and when it then advances the second reference once (to pair up with the first), the sentence has already been consumed, so it moves on to the punctuation mark that goes with it. Thanks to @SeanVieira for the succinct explanation of this in the comments. This gives us the following:

[('John was tired', '.'), (' So was Sally', '.'), (' But was Bob', '?'), (" I don't know", '!')]

So each sentence is paired up with its ending punctuation (again, except for the last one).
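The same-iterator trick in isolation, as a minimal sketch:

it = iter(['a', '.', 'b', '!'])
pairs = list(zip(*[it] * 2))  # both arguments to zip are the SAME iterator
print(pairs)  # [('a', '.'), ('b', '!')]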
Line 3

Now, we join each sentence back onto its punctuation with ''.join(...) (I think the map is redundant here):

['John was tired.', ' So was Sally.', ' But was Bob?', " I don't know!"]

and strip the leading whitespace with .lstrip():

['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!"]
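That step as a standalone sketch (with str substituted for the original's Python 2 unicode):

pairs = [('John was tired', '.'), (' So was Sally', '.')]
s_iter = [''.join(map(str, pair)).lstrip() for pair in pairs]
print(s_iter)  # ['John was tired.', 'So was Sally.']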
Line 4

We add back the last sentence, giving us:

['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!", ' Huh.']
Line 5

Finally, we return s_iter.
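For completeness, here is the whole function as a minimal, runnable Python 3 sketch. Note the assumptions: the original pyteaser code targets Python 2 (where zip returns a list and unicode exists), so below the redundant map is dropped and the zip result is consumed by a list comprehension:

import re


def split_sentences(text):
    # Split on sentence-ending punctuation (optionally followed by a quote);
    # the capturing group keeps each delimiter as its own list item.
    sentences = re.split(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', text)
    # Pair each sentence body with its punctuation; the last element is
    # excluded here and re-attached below.
    pairs = zip(*[iter(sentences[:-1])] * 2)
    # Glue each pair back together and strip the leading whitespace.
    result = [''.join(pair).lstrip() for pair in pairs]
    # Add back the final sentence (it keeps its leading whitespace).
    result.append(sentences[-1])
    return result


print(split_sentences("John was tired. So was Sally. But was Bob? I don't know! Huh."))
# ['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!", ' Huh.']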