I'm currently using pyteaser for summarization, and it works well. I was looking through the source code, but even with the help of the comment below, I don't understand the following code. Can anyone explain it?
def split_sentences(text):
    '''
    The regular expression matches all sentence ending punctuation and splits the string at those points.
    At this point in the code, the list looks like this ["Hello, world", "!" ... ]. The punctuation and all
    quotation marks are separated from the actual text. The first s_iter line turns each group of two items
    in the list into a tuple, excluding the last item in the list (the last item in the list does not need
    to have this performed on it). Then, the second s_iter line combines each tuple in the list into a
    single item and removes any whitespace at the beginning of the line. Now, the s_iter list is formatted
    correctly but it is missing the last item of the sentences list. The second to last line adds this item
    to the s_iter list and the last line returns the full list.
    '''
    sentences = regex_split('(?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z])', text)
    s_iter = zip(*[iter(sentences[:-1])] * 2)
    s_iter = [''.join(map(unicode, y)).lstrip() for y in s_iter]
    s_iter.append(sentences[-1])
    return s_iter
senshin's answer:
Line 1

First, we have the regular expression (?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z]). Here is the output of compiling it with the re.DEBUG flag:
assert_not -1
  in
    range (65, 90)
subpattern 1
  in
    literal 46
    literal 33
    literal 63
  max_repeat 0 1
    literal 34
assert 1
  max_repeat 1 65535
    in
      category category_space
  max_repeat 0 1
    literal 34
  in
    range (65, 90)
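If you want to reproduce this tree yourself, a minimal sketch (the output above is from Python 2; Python 3 prints the same structure in a slightly different format, e.g. uppercase opcode names in recent versions):

import re

re.compile(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', re.DEBUG)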
First, we look for something that is not a capital letter [A-Z] (with a negative lookbehind assertion (?<!, i.e. the assert_not above). Then we look for a punctuation mark (one of .!?), followed by zero or one double quote ". Finally, we check that all of this is followed by one or more whitespace characters \s+, zero or one double quote \", and a capital letter [A-Z] (this part is a lookahead assertion, i.e. the assert above).

The regex will only actually match the ([.!?]"?) part, i.e. the punctuation mark possibly followed by a quotation mark.
regex_split is an alias here for re.split. So text is split before and after each matched part: a punctuation mark and possibly a quote, not preceded by a capital letter, and followed by whitespace, possibly a quote, and a capital letter. For example:

'John was tired. So was Sally. But was Bob? I don\'t know! Huh.'

would give the following sentences:

['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!', ' Huh.']
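To watch the split happen, a quick sketch using the standard library directly (regex_split in pyteaser is just re.split imported under another name):

import re

text = "John was tired. So was Sally. But was Bob? I don't know! Huh."
# The capturing group keeps each delimiter as its own item in the result.
sentences = re.split(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', text)
print(sentences)
# ['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!', ' Huh.']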
Line 2

Next, we drop the last element of sentences (since the following operations don't need to be performed on it; we add it back on line 4) with [:-1]:

['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!']
We turn this into an iterator with iter and put it inside a list ([]):

[<list_iterator object at ...>]
Now, when we do zip(*[<list iterator ...>] * 2), what we are actually doing is zipping together two references to the same iterator object (unpacked out of the list by the unary * operator). This way, when zip advances the first reference once, it consumes a sentence, and when it then advances the second reference once (to pair up with the first), the sentence has already been consumed, so it moves on to the punctuation mark that goes with it. Thanks to @SeanVieira for the succinct explanation of this in the comments. This gives us the following:

[('John was tired', '.'), (' So was Sally', '.'), (' But was Bob', '?'), (" I don't know", '!')]

So each sentence is paired up with its ending punctuation (again, except for the last one).
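The same-iterator trick in isolation, as a minimal sketch:

it = iter(['a', '.', 'b', '!'])
pairs = list(zip(*[it] * 2))  # both arguments to zip are the SAME iterator
print(pairs)  # [('a', '.'), ('b', '!')]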
Line 3

Now, we join each sentence back onto its punctuation with ''.join(...) (I think the map is redundant here):

['John was tired.', ' So was Sally.', ' But was Bob?', " I don't know!"]

and strip the leading whitespace with .lstrip():

['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!"]
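That step as a standalone sketch (with str substituted for the original's Python 2 unicode):

pairs = [('John was tired', '.'), (' So was Sally', '.')]
s_iter = [''.join(map(str, pair)).lstrip() for pair in pairs]
print(s_iter)  # ['John was tired.', 'So was Sally.']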
Line 4

We add back the last sentence, giving us:

['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!", ' Huh.']
Line 5

Finally, we return s_iter.
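For completeness, here is the whole function as a minimal, runnable Python 3 sketch. Note the assumptions: the original pyteaser code targets Python 2 (where zip returns a list and unicode exists), so below the redundant map is dropped and the zip result is consumed by a list comprehension:

import re


def split_sentences(text):
    # Split on sentence-ending punctuation (optionally followed by a quote);
    # the capturing group keeps each delimiter as its own list item.
    sentences = re.split(r'(?<![A-Z])([.!?]"?)(?=\s+"?[A-Z])', text)
    # Pair each sentence body with its punctuation; the last element is
    # excluded here and re-attached below.
    pairs = zip(*[iter(sentences[:-1])] * 2)
    # Glue each pair back together and strip the leading whitespace.
    result = [''.join(pair).lstrip() for pair in pairs]
    # Add back the final sentence (it keeps its leading whitespace).
    result.append(sentences[-1])
    return result


print(split_sentences("John was tired. So was Sally. But was Bob? I don't know! Huh."))
# ['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!", ' Huh.']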