Author: 所谓-旧 | Source: Internet | 2023-05-17 13:43
What's wrong with the following code? I traced the problem to the hyphen in the comment, but why should that cause an error?
import re
valid = re.compile(r'''[^
\uFFFE\uFFFF # non-characters
]''', re.VERBOSE)
Traceback (most recent call last):
  File "valid.py", line 5, in <module>
    ]''', re.VERBOSE)
  File "/usr/local/lib/python3.3/re.py", line 214, in compile
    return _compile(pattern, flags)
  File "/usr/local/lib/python3.3/re.py", line 281, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/lib/python3.3/sre_compile.py", line 494, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/lib/python3.3/sre_parse.py", line 748, in parse
    p = _parse_sub(source, pattern, 0)
  File "/usr/local/lib/python3.3/sre_parse.py", line 360, in _parse_sub
    itemsappend(_parse(source, state))
  File "/usr/local/lib/python3.3/sre_parse.py", line 506, in _parse
    raise error("bad character range")
sre_constants.error: bad character range
The next snippet, without the hyphen, compiles without error:
import re
valid = re.compile(r'''[^
\uFFFE\uFFFF # non characters !! no errors
]''', re.VERBOSE)
Edit:
Adding to @nhahtdh's answer, string concatenation seems to be another reasonable way to comment character classes in a verbose style:
valid = re.compile( r'[^'
r'\u0000-\u0008' # C0 block first segment
r'\u000B\u000C' # allow TAB U+0009, LF U+000A, and CR U+000D
r'\u000E-\u001F' # rest of C0
r'\u007F' # disallow DEL U+007F
r'\u0080-\u009F' # All C1 block
r']' # don't forget this!
r'''
| [0-9] # normal verbose style
| [a-z] # another term +++
''', re.VERBOSE)
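The reason this works is that Python joins adjacent string literals before the re module ever sees the pattern, so the # comments next to each fragment are ordinary Python comments and never become part of the regex. A minimal sketch with a shortened class (the names pattern_text and control_free are made up for illustration):

```python
import re

# Adjacent string literals are concatenated by Python itself, so the
# trailing comments below are Python comments, not regex text.
pattern_text = (
    r'[^'
    r'\u0000-\u0008'   # C0 block, first segment
    r'\u000B\u000C'    # VT and FF; TAB, LF and CR stay allowed
    r']'
)

# The regex engine only ever receives this single string.
print(pattern_text)

control_free = re.compile(pattern_text)
print(control_free.fullmatch('\t') is not None)    # TAB is not excluded
print(control_free.fullmatch('\x0b') is not None)  # VT is excluded
```

Since no fragment of the class passes through VERBOSE comment handling, a hyphen in one of these comments cannot turn into a character range.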
2 answers
According to the documentation (emphasis mine):
re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
Basically, you cannot put a comment inside a character class, and whitespace inside a character class is significant.
Since the # is inside a character class, it does not start a comment, and everything up to the closing bracket is parsed as part of the character class without exception (even the newline character). The error is thrown because n-c, taken from the word "non-characters", is an invalid character range: n sorts after c.
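Both points are easy to check in isolation. The sketch below uses a small helper, compiles, which is a name made up for this example and not part of the re module:

```python
import re

def compiles(pattern, flags=0):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False

# '#' and spaces inside [...] are literal members, even under re.VERBOSE.
literal_hash = re.compile(r'[a # b]', re.VERBOSE)
print(literal_hash.findall('a #b'))

# 'n-c' is a reversed range (n sorts after c), so the class is rejected.
print(compiles(r'[non-characters]', re.VERBOSE))

# Without the hyphen there is no range, so the same text compiles.
print(compiles(r'[non characters]', re.VERBOSE))
```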
A valid way to write the expression would be:
valid = re.compile(r'[^\uFFFE\uFFFF] # non-characters', re.VERBOSE)
Here is one suggestion on how to comment when you want to explain a lengthy character class:
r'''
# LOTS is for foo
# _ is a special fiz
# OF-LITERAL is for bar
[^LOTS_OF-LITERAL]
'''
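As a runnable illustration of the same layout, here is a sketch with a made-up class (identifier_char and its contents are hypothetical, chosen only for the example). The comments sit on their own lines above the brackets, where VERBOSE does strip them, so hyphens such as the one in "A-Z" are harmless there:

```python
import re

identifier_char = re.compile(r'''
    # A-Z and a-z cover ASCII letters
    # 0-9 covers digits
    # _ is the only punctuation allowed
    [A-Za-z0-9_]
''', re.VERBOSE)

# Only members of the class are returned; '-', ' ' and '!' are skipped.
print(identifier_char.findall('a-b_c 9!'))
```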