What's wrong with the following code? I traced the problem to the hyphen in the comment, but why should that cause an error?


import re

valid = re.compile(r'''[^
\uFFFE\uFFFF   # non-characters
]''', re.VERBOSE)

Traceback (most recent call last):
  File "valid.py", line 5, in <module>
    ]''', re.VERBOSE)
  File "/usr/local/lib/python3.3/re.py", line 214, in compile
    return _compile(pattern, flags)
  File "/usr/local/lib/python3.3/re.py", line 281, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/lib/python3.3/sre_compile.py", line 494, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/lib/python3.3/sre_parse.py", line 748, in parse
    p = _parse_sub(source, pattern, 0)
  File "/usr/local/lib/python3.3/sre_parse.py", line 360, in _parse_sub
    itemsappend(_parse(source, state))
  File "/usr/local/lib/python3.3/sre_parse.py", line 506, in _parse
    raise error("bad character range")
sre_constants.error: bad character range

This next segment, identical except that the hyphen is gone, compiles without error:


import re

valid = re.compile(r'''[^
\uFFFE\uFFFF   # non characters !! no errors
]''', re.VERBOSE)


Adding to the answer of @nhahtdh, string concatenation seems to be another reasonable way to comment character classes in a verbose style:


valid = re.compile(
r'[^'
r'\u0000-\u0008'    # C0 block first segment
r'\u000B\u000C'     # VT and FF; the gaps allow TAB U+0009, LF U+000A, and CR U+000D
r'\u000E-\u001F'    # rest of C0
r'\u007F'           # disallow DEL U+007F
r'\u0080-\u009F'    # All C1 block
r']'                # don't forget this!
r'''
| [0-9]    # normal verbose style
| [a-z]    # another term +++
''', re.VERBOSE)
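
The reason concatenation works is that adjacent string literals are joined by Python itself, so the `#` comments between them are Python comments and never reach the regex engine. A minimal sketch using the character class from the question (the names `pattern` and `valid` are only for illustration):

```python
import re

# Adjacent string literals are merged into one string at compile time.
# The comments below belong to Python, not to the regular expression,
# so a hyphen in them is harmless.
pattern = (
    r'[^'
    r'\uFFFE\uFFFF'   # non-characters
    r']'
)

# No re.VERBOSE needed: the resulting pattern contains no whitespace or comments.
valid = re.compile(pattern)

is_plain_pattern = (pattern == r'[^\uFFFE\uFFFF]')
accepts_letter = valid.match('a') is not None
rejects_nonchar = valid.match('\uFFFE') is None
```

Because the final pattern is just `[^\uFFFE\uFFFF]`, this style can be mixed freely with a verbose triple-quoted tail, as in the snippet above.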

2 Answers



According to the documentation (emphasis mine):



This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.


Basically, you cannot have a comment inside a character class, and whitespace inside a character class is significant.


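Since the # is inside the character class, it does not start a comment. Everything between the brackets is parsed as part of the character class without exception (even the newline characters). The hyphen in "non-characters" is therefore read as the range n-c, which is invalid because n sorts after c, and that is what raises the "bad character range" error. Without the hyphen the pattern compiles, but the words of the would-be comment are silently added to the class.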
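
This can be checked directly. The sketch below is a hedged illustration, not part of the original answer: it compiles both patterns from the question and inspects what the hyphen-free class actually excludes. The names `broken`, `no_hyphen`, `excluded` and `still_allowed` are made up for the example.

```python
import re

# Pattern from the question: the would-be comment sits inside [...]
broken = r'''[^
\uFFFE\uFFFF   # non-characters
]'''

error_message = None
try:
    re.compile(broken, re.VERBOSE)
    compiled_ok = True
except re.error as exc:
    # "n-c" is read as a range whose start sorts after its end
    compiled_ok = False
    error_message = str(exc)

# Without the hyphen the pattern compiles, but the comment text
# (letters, '#', spaces, newlines) becomes part of the negated class.
no_hyphen = re.compile(r'''[^
\uFFFE\uFFFF   # non characters
]''', re.VERBOSE)

# Characters taken from the "comment" are all rejected by the class
excluded = [ch for ch in "n#c \n" if no_hyphen.match(ch) is None]

# A character that appears nowhere between the brackets still matches
still_allowed = no_hyphen.match("z") is not None
```

So the hyphen-free version is not really correct either: it quietly rejects ordinary characters such as `n`, `#`, and the space.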


A valid way to write the expression would be:


valid = re.compile(r'[^\uFFFE\uFFFF]   # non-characters', re.VERBOSE)

Here is one suggestion on how to comment when you want to explain a lengthy character class:


# LOTS is for foo
# _ is a special fiz
# OF-LITERAL is for bar
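
One way this suggestion might look in practice, as a sketch: the explanatory comments sit on their own lines before the brackets, where re.VERBOSE does strip them. The control-character class and the name `control_chars` are illustrative assumptions, not the pattern from the original answer.

```python
import re

# The comments are outside the brackets, so re.VERBOSE treats them as
# comments and hyphens in them are harmless.
control_chars = re.compile(r'''
    # U+0000-U+0008 : first segment of the C0 block
    # U+000B U+000C : VT and FF (TAB, LF and CR stay allowed)
    # U+000E-U+001F : rest of C0
    # U+007F        : DEL
    [\x00-\x08\x0B\x0C\x0E-\x1F\x7F]
    ''', re.VERBOSE)

has_bell = control_chars.search("a\x07b") is not None       # BEL is in the class
tabs_are_fine = control_chars.search("a\tb\r\n") is None     # TAB, CR, LF are not
comment_ignored = control_chars.search("# U+ : -") is None   # comment text is not matched
```

The trade-off is that the comments are separated from the items they describe, so each comment has to name the part of the class it refers to.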



Comments don't always play nice in regular expressions, and it looks like your regex engine is parsing the hyphen as part of the regular expression. You can't rely on comments not getting parsed here. This is a good thing to find out before implementing this code.


