What are the special reserved character entities in HTML and in XML?
The information that I have says:
HTML:
& (replace with &amp;)
< (replace with &lt;)
> (replace with &gt;)
" (replace with &quot;)
' (replace with &apos;)
XML:
< (replace with &lt;)
> (replace with &gt;)
& (replace with &amp;)
' (replace with &apos;)
" (replace with &quot;)
But I cannot find documentation on either of these.
The W3C does mention, in Extensible Markup Language (XML) 1.0 (Fifth Edition), certain predefined entity references. But it says that these entities are predefined (in the same way that &copy; is predefined), not that they must be escaped:
4.6 Predefined Entities
[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]
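As a quick check of the quoted definition, here is a small sketch using Python's standard xml.etree.ElementTree parser (my choice of tool, not something the spec prescribes). It suggests that the five predefined entities and the two numeric character references all arrive as plain character data:

```python
import xml.etree.ElementTree as ET

# The five predefined entities, then the two numeric character references
# mentioned in section 4.6.
doc = "<a>&lt; &gt; &amp; &apos; &quot; &#60; &#38;</a>"

text = ET.fromstring(doc).text
print(text)  # < > & ' " < &
```

Note that this only shows the references are understood by a parser; it says nothing yet about which characters are required to be escaped.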
What characters must be escaped into entity references in HTML?
What characters must be escaped into entity references in XML?
Update:
From Extensible Markup Language (XML) 1.0 (Fifth Edition):
2.4 Character Data and Markup
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
I read the former as saying that:
< must be escaped (as &lt;)
& must be escaped (as &amp;)
> may be escaped (as &gt;), but must be when it appears as part of ]]>
And that ' and " don't have to be escaped at all, unless you want to have quotes inside quoted attributes.
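A minimal Python sketch of that reading follows. The helper names escape_xml_text and escape_xml_attr are made up for illustration, and for simplicity the sketch escapes every >, which is more than the spec demands but also covers the ]]> case. The last lines round-trip the result through the standard library parser as a sanity check.

```python
import xml.etree.ElementTree as ET


def escape_xml_text(s: str) -> str:
    """Escape character data for element content (hypothetical helper)."""
    # & must go first so the entities produced below are not re-escaped.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    # Only required inside "]]>", but escaping every ">" is always allowed.
    return s.replace(">", "&gt;")


def escape_xml_attr(s: str) -> str:
    """Escape a value that will be wrapped in double quotes (hypothetical helper)."""
    return escape_xml_text(s).replace('"', "&quot;")


raw = 'a < b && "x" ]]> \'y\''
doc = '<e v="{}">{}</e>'.format(escape_xml_attr(raw), escape_xml_text(raw))

# Round-trip through a real parser to confirm nothing was lost.
elem = ET.fromstring(doc)
assert elem.text == raw and elem.get("v") == raw
```

In real code, xml.sax.saxutils.escape and xml.sax.saxutils.quoteattr from the standard library are probably a better choice than hand-rolled helpers.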
From HTML 4.01 Specification, HTML Document Representation:
5.3.2 Character entity references
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
HTML is much more wishy-washy on the rules, but it sounds like I should:
< should be replaced with &lt;
> should be replaced with &gt;
& should be replaced with &amp;
" should be replaced with &quot;
and if " can be an entity reference, I should also replace ' with &apos;.
From HTML5 - A vocabulary and associated APIs for HTML and XHTML:
8.3 Serializing HTML fragments
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
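Which I read as saying that, in HTML:
& is replaced by &amp; always
the no-break space (U+00A0) is replaced by &nbsp; always
" is replaced by &quot; if it's inside an attribute
< is replaced by &lt; if it's not in an attribute (i.e. attributes can contain <)
> is replaced by &gt; if it's not in an attribute (i.e. attributes can contain >)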
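The four quoted serialization steps can be sketched in Python roughly as follows. The function name escape_html_fragment and its attribute_mode flag are illustrative assumptions, not names taken from the specification.

```python
import html


def escape_html_fragment(s: str, attribute_mode: bool = False) -> str:
    """Follow the four quoted serialization steps (hypothetical helper)."""
    s = s.replace("&", "&amp;")        # step 1: always, and first
    s = s.replace("\u00a0", "&nbsp;")  # step 2: U+00A0 NO-BREAK SPACE
    if attribute_mode:
        s = s.replace('"', "&quot;")   # step 3: attribute mode only
    else:
        s = s.replace("<", "&lt;")     # step 4: outside attribute mode only
        s = s.replace(">", "&gt;")
    return s


print(escape_html_fragment('1 < 2 & "ok"'))
# 1 &lt; 2 &amp; "ok"
print(escape_html_fragment('1 < 2 & "ok"', attribute_mode=True))
# 1 < 2 &amp; &quot;ok&quot;

# html.unescape from the standard library reverses the escaping.
assert html.unescape(escape_html_fragment("a\u00a0<b>")) == "a\u00a0<b>"
```

This mirrors what a serializer emits, which is the minimum. When escaping untrusted input by hand, the standard library's html.escape is the more conservative option: it escapes &, < and > everywhere and, by default, both quote characters as well.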
First, you're comparing an HTML 4.01 specification with an HTML5 one. HTML5 ties in more closely with XML than HTML 4.01 ever did (that's why we have XHTML), so this answer will stick to HTML5 and XML.
Your quoted references are all consistent on the following points:
< should always be represented with &lt; when it is not acting as a markup delimiter
> should always be represented with &gt; when it is not acting as a markup delimiter
& should always be represented with &amp;
except when within <![CDATA[ ]]> (which only applies to XML)
I agree 100% with this. You never want the parser to mistake literals for markup, so it's a solid idea to always encode these characters (the space is a separate case; see below). Good parsers know that anything contained within <![CDATA[ ]]> is not markup, so the encoding is not necessary there.
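In practice, I never encode ' or " unless the character appears inside an attribute value delimited by that same quote character (for example, an attribute whose value is: "Yoinks!", he said.). Both specifications also agree with this.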
So, the only point of contention is the no-break space (U+00A0, &nbsp;). The only mention of it in either specification is when serialization is attempted. When you are not serializing, you should always use the literal character. Unless you are writing your own parser, I don't see the need to be doing any kind of serialization, so this is beside the point.