What does the following code in pyteaser mean?

Posted by 鸟鸟212 on 2023-02-06 10:18

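I'm currently using pyteaser for summarization, and it works well. I was reading the source code, but even with the help of the comment below, I don't understand the following code. Can anyone explain it?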

def split_sentences(text):
    '''
    The regular expression matches all sentence ending punctuation and splits the string at those points.
    At this point in the code, the list looks like this ["Hello, world", "!" ... ]. The punctuation and all quotation marks
    are separated from the actual text. The first s_iter line turns each group of two items in the list into a tuple,
    excluding the last item in the list (the last item in the list does not need to have this performed on it). Then,
    the second s_iter line combines each tuple in the list into a single item and removes any whitespace at the beginning
    of the line. Now, the s_iter list is formatted correctly but it is missing the last item of the sentences list. The
    second to last line adds this item to the s_iter list and the last line returns the full list.
    '''

    sentences = regex_split('(?

senshin.. 6

1 answer

Line 1

First, we have the regular expression (?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z]). Here is the output of compiling it with the re.DEBUG flag:

    assert_not -1 
      in 
        range (65, 90)
    subpattern 1 
      in 
        literal 46
        literal 33
        literal 63
      max_repeat 0 1 
        literal 34 
    assert 1 
      max_repeat 1 65535 
        in 
          category category_space
      max_repeat 0 1 
        literal 34 
      in 
        range (65, 90)
    

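First, we look for a position that is not preceded by an uppercase letter [A-Z] (a negative lookbehind, (?<!...), shown as assert_not). Then we look for a punctuation mark (one of . ! ?), followed by zero or one double quote ". Finally, we check that what we matched is followed by one or more whitespace characters \s+, zero or one double quote \", and an uppercase letter [A-Z] (this part is a lookahead, (?=...), shown as assert).

Because both lookarounds are zero-width, the regex only actually matches the ([.!?]"?) part, that is, the punctuation mark optionally followed by a quote.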
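To make the zero-width behaviour concrete, here is a small sketch that searches with the pattern quoted above. The sample strings and the variable names (`PATTERN`, `matched_text`, `abbreviation`) are only illustrative and are not taken from pyteaser.

```python
import re

# The pattern quoted in the answer, written as a raw string.
PATTERN = re.compile(r'(?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z])')

# Only the punctuation is consumed; the lookbehind and lookahead
# check their surroundings without becoming part of the match.
m = PATTERN.search('John was tired. So was Sally.')
matched_text = m.group(0)
matched_span = m.span()

# A period directly after a capital letter is rejected by the lookbehind,
# so initials such as "U.S." do not produce a split point.
abbreviation = PATTERN.search('The U.S. Army')

# Passing re.DEBUG prints the parse tree while compiling.
# The exact formatting of that dump differs between Python versions.
re.compile(PATTERN.pattern, re.DEBUG)
```

Here `matched_text` is just the period, and `abbreviation` is `None`, which is presumably why the lookbehind is there in the first place.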

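regex_split here is an alias for re.split. So text is split before and after each matched part: a punctuation mark and optional quote that is not preceded by an uppercase letter and is followed by whitespace, an optional quote, and an uppercase letter. Because the pattern contains a capturing group, re.split also keeps each matched delimiter as its own item in the result. For example: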

    'John was tired. So was Sally. But was Bob? I don\'t know! Huh.'
    

would give the following sentences list:

    ['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!', ' Huh.']
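The split step can be reproduced directly with re.split. The sketch below also runs a variant of the pattern without the capturing group (my own modification, not something from pyteaser) to show that the group is what keeps the punctuation in the list.

```python
import re

PATTERN = r'(?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z])'
text = "John was tired. So was Sally. But was Bob? I don't know! Huh."

# With a capturing group, re.split returns the delimiters as list items too,
# so text pieces and punctuation alternate.
sentences = re.split(PATTERN, text)

# Same pattern with the parentheses removed: the punctuation is discarded.
without_group = re.split(r'(?<![A-Z])[.!?]"?(?=\s+\"?[A-Z])', text)

print(sentences)
print(without_group)
```

With n split points the list always has 2n + 1 items, which is why everything except the last item can later be grouped into pairs.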
    

Line 2

Next, we drop the last element of sentences with [:-1] (this operation doesn't need to be performed on it; we add it back in line 4):

    ['John was tired', '.', ' So was Sally', '.', ' But was Bob', '?', " I don't know", '!']
    

We convert that to an iterator with iter and put the iterator inside a list ([...]):

    [<list_iterator object at ...>]
    

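Now, when we do zip(*[<list iterator ...>] * 2), we are zipping together two references to the same iterator object (multiplying the list by 2 duplicates the reference, and the unary * operator unpacks the list into separate arguments). That way, when zip pulls one item from the first reference, it consumes a sentence; when it then pulls one item from the second reference (to pair with the first), that sentence has already been consumed, so it moves on to the corresponding punctuation mark. Thanks to @SeanVieira for briefly explaining this in the comments. This gives us the following: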

    [('John was tired', '.'), (' So was Sally', '.'), (' But was Bob', '?'), (" I don't know", '!')]
    

So we have paired each sentence with its ending punctuation (again, except for the last sentence, which was set aside).
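The pairing trick is easier to follow when the intermediate objects get names. This is a minimal sketch; `it`, `refs` and `pairs` are names I picked for illustration, and list() is used because zip returns a lazy iterator on Python 3.

```python
pieces = ['John was tired', '.', ' So was Sally', '.',
          ' But was Bob', '?', " I don't know", '!']

it = iter(pieces)
refs = [it] * 2                   # a list holding the same iterator twice
same_object = refs[0] is refs[1]  # True: no copy is made

# zip(*refs) is the same call as zip(it, it). For each tuple, zip takes one
# item from the first argument and then one from the second, and both
# advance the single shared iterator.
pairs = list(zip(*refs))

print(pairs)
```

If the two list entries were independent iterators, zip would instead pair every item with itself.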

Line 3

Now, we join each sentence back together with its punctuation using ''.join(...) (I think the map is redundant):

    ['John was tired.', ' So was Sally.', ' But was Bob?', " I don't know!"]
    

and strip the leading whitespace with .lstrip():

    ['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!"]
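As a quick check of this step, the sketch below joins and strips the pairs in two separate passes. The names `joined` and `stripped` are only for illustration. Every element is already a string, so ''.join needs no conversion first, which fits the remark that the map is redundant.

```python
pairs = [('John was tired', '.'), (' So was Sally', '.'),
         (' But was Bob', '?'), (" I don't know", '!')]

# Glue each (text, punctuation) tuple back into one string.
joined = [''.join(pair) for pair in pairs]

# Remove the space left over from the gap between sentences.
stripped = [s.lstrip() for s in joined]

print(joined)
print(stripped)
```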
    

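Line 4

Finally, we append the last sentence back on (it keeps its leading space, because .lstrip() was only applied to the joined pairs), giving us:

    ['John was tired.', 'So was Sally.', 'But was Bob?', "I don't know!", ' Huh.']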
    

Line 5

Finally, we return the list.
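Putting the five steps together, the function might look roughly like the sketch below. This is a reconstruction from the walkthrough, not a verbatim copy of pyteaser's source: the names sentences and s_iter come from the docstring in the question, re.split stands in for the regex_split alias, and the exact form of the join in line 3 is my assumption.

```python
import re


def split_sentences(text):
    # Line 1: split at sentence-ending punctuation, keeping the punctuation.
    sentences = re.split(r'(?<![A-Z])([.!?]"?)(?=\s+\"?[A-Z])', text)
    # Line 2: pair each text piece with its punctuation, leaving out the last item.
    s_iter = zip(*[iter(sentences[:-1])] * 2)
    # Line 3: rejoin each pair and drop the leading whitespace.
    s_iter = [''.join(pair).lstrip() for pair in s_iter]
    # Line 4: put the last item back, unchanged.
    s_iter.append(sentences[-1])
    # Line 5: return the finished list.
    return s_iter


result = split_sentences(
    "John was tired. So was Sally. But was Bob? I don't know! Huh.")
print(result)
```

A text with no split point at all, such as 'Huh.', goes through the same code: the zip produces nothing and the single item is appended as is.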

Answered 2023-02-06 10:21