We have a system that uses templates to create XML. Something like:
<root>
    <foo>{CUSTOMTEMPLATETHING1}</foo>
    <bar>{CUSTOMTEMPLATETHING2}</bar>
</root>
And the result might be:
<root>
    <foo>text content</foo>
    <bar></bar>
</root>
Notice that <bar> comes out empty.
So, we created what we called the Rectifier. You can feel free to ponder the root or roots of the word. The early versions of the Rectifier used an uber-regular expression to strip out these tags from the source string. This system returns a full XML Document string, not an XmlReader or IXPathNavigable.
I heard a cool quote yesterday at the Portland NerdDinner while we were planning the CodeCamp.
"So you&#39;ve got a problem, and you&#39;ve decided to solve it with Regular Expressions. Now you&#39;ve got two problems."
“因此&#xff0c;您遇到了一个问题&#xff0c;并且决定使用正则表达式解决它。现在您遇到了两个问题。”
Since the documents we passed through this system were between 10k and 100k in size, the performance of the RegEx, especially compiled and cached, was fine. We didn't give it a thought for years. It worked, and it worked well. It looked like this:
private static Regex regex = new Regex(@"\<[\w-_.: ]*\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>|\<[\w-_.: ]*\>\</[\w-_.: ]*\>|<[\w-_.: ]*/\>|\<[\w-_.: ]*[/]+\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\</[\w-_.: ]*\>|<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""[\s]*/\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>",RegexOptions.Compiled);
Stuff like this has what I call a "High Bus Factor." That means if the developer who wrote it is hit by a bus, you're screwed. It's nice to create a solution that anyone can sit down and start working on, and this isn't one of them.
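To make the bus-factor point concrete, here is a minimal sketch of how a much smaller expression of the same flavor could be written so the next developer can read it. It is not our production pattern: the class name EmptyElementRegexSketch, the Strip method, and the cut-down pattern itself are assumptions for illustration, and it only handles attribute-free elements, an optional empty CDATA section, and self-closing tags.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration only; NOT the production expression shown above.
public static class EmptyElementRegexSketch
{
    // IgnorePatternWhitespace lets the pattern span several lines and carry
    // '#' comments, so each alternative can be explained in place.
    private static readonly Regex EmptyElement = new Regex(@"
        <(?<name>[\w.:-]+)>        # start tag without attributes
        (?:<!\[CDATA\[\]\]>)?      # optionally an empty CDATA section
        </\k<name>>                # end tag with the same name
        |
        <[\w.:-]+\s*/>             # or a self-closing tag",
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

    // Removes empty elements in a single pass. A parent that only becomes
    // empty because its children were removed is left in place.
    public static string Strip(string xml)
    {
        return EmptyElement.Replace(xml, string.Empty);
    }
}
```

Calling EmptyElementRegexSketch.Strip on the sample result from the top of this post removes <bar></bar> and leaves the rest of the document untouched. It is still string processing, so it shares the scaling problems described next; the only thing it buys is readability.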
Then, lately, some folks started pushing larger amounts of data through this system, in excess of 1.5 megs, and this Regular Expression started to take 4, 8, even 12 seconds to finish on these giant XML strings. We'd hit the other side of the knee of the exponential performance curve that you see with string processing like this.
So, Patrick had the idea to use XmlReaders and create an XmlRectifyingReader or XmlPeekingReader. Basically a fake reader that had a real reader internally and would "peek" ahead to see if we should skip empty elements. It's a complicated problem when you consider nesting, CDATA sections, attributes, namespaces, etc. And because XmlReaders are forward-only, you have to hold a lot of state as you move forward, since there's no way to back up. We gave up on this idea, since we wanted to fix this in a day, but it remains, in our opinion, a cool idea we'd like to try. We wanted to do something like: xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))). But the real issue was performance, not elegance.
Then we figured we'd do an XmlReader/XmlWriter thing like:
using(StringWriter strw = new StringWriter())
{
    XmlWriter writer = new XmlTextWriter(strw);
    XmlReader reader = new XmlTextReader(new StringReader(input));
    reader.Read();
    RectifyXmlInternal(reader, writer); //This is US
    reader.Close();
    writer.Close();
    return strw.ToString();
}
private class Attribute
{
    public Attribute(string l, string n, string v, string p)
    {
        LocalName = l;
        Namespace = n;
        Value = v;
        Prefix = p;
    }
    public string LocalName = string.Empty;
    public string Namespace = string.Empty;
    public string Value = string.Empty;
    public string Prefix = string.Empty;
}
internal static void RectifyXmlInternal(XmlReader reader, XmlWriter writer)
{
    // Remember the depth we were called at so we know when to return to the caller.
    int depth = reader.Depth;
    while (true && !reader.EOF)
    {
        // Nodes that are not elements are copied straight through to the writer.
        switch ( reader.NodeType )
        {
            case XmlNodeType.Text:
                writer.WriteString( reader.Value );
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction( reader.Name, reader.Value );
                break;
            case XmlNodeType.DocumentType:
                writer.WriteDocType( reader.Name,
                    reader.GetAttribute( "PUBLIC" ), reader.GetAttribute( "SYSTEM" ),
                    reader.Value );
                break;
            case XmlNodeType.Comment:
                writer.WriteComment( reader.Value );
                break;
            case XmlNodeType.EndElement:
                if(depth > reader.Depth)
                    return;
                break;
        }
        if(reader.IsEmptyElement || reader.EOF) return;
        else if(reader.IsStartElement())
        {
            // Hold on to everything needed to write the start tag later;
            // nothing is written until we know the element has content.
            string name = reader.Name;
            string localName = reader.LocalName;
            string prefix = reader.Prefix;
            string uri = reader.NamespaceURI;
            ArrayList attributes = null;
            if(reader.HasAttributes)
            {
                attributes = new ArrayList();
                while(reader.MoveToNextAttribute() )
                    attributes.Add(new Attribute(reader.LocalName,reader.NamespaceURI,reader.Value,reader.Prefix));
            }
            bool CData = false;
            reader.Read();
            if(reader.NodeType == XmlNodeType.CDATA)
            {
                CData = true;
            }
            // An empty CDATA section counts as no content, so step past it.
            if(reader.NodeType == XmlNodeType.CDATA && reader.Value.Length == 0)
            {
                reader.Read();
            }
            // The end tag follows immediately: the element is empty, so skip it.
            if(reader.NodeType == XmlNodeType.EndElement && reader.Name.Equals(name))
            {
                reader.Read();
                if (reader.Depth < depth)
                    return;
                else
                    continue;
            }
            writer.WriteStartElement( prefix, localName, uri);
            if (attributes != null)
            {
                foreach(Attribute a in attributes)
                    writer.WriteAttributeString(a.Prefix,a.LocalName,a.Namespace,a.Value);
            }
            if(reader.IsStartElement())
            {
                // Child elements are handled by recursing one level down.
                if(reader.Depth > depth)
                    RectifyXmlInternal(reader, writer);
                else
                    continue;
            }
            else
            {
                if (CData)
                    writer.WriteCData(reader.Value);
                else
                    writer.WriteString(reader.Value);
                reader.Read();
            }
            writer.WriteFullEndElement();
            reader.Read();
        }
    }
}
The resulting "rectified" or empty-element-stripped XML is byte for byte identical to the XML created by the original Regular Expression, so we succeeded in keeping compatibility. The performance on small strings of XML, less than 100 bytes, is about 2x slower because of all the overhead. However, as the size of the XML approaches the middle part of the bell curve that represents the typical size (10k to 100k), this technique overtakes Regular Expressions in a big way. Initial tests are between 7x and 10x faster in our typical scenario. When the XML gets to 1.5 megs, this technique can process it in sub-second times. So, the Regular Expression behaves in an O(c^n) way, and this technique (scary as it is) behaves more like O(n log(n)).
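If you want to check numbers like these on your own documents, a rough timing harness along the following lines would do. This is a sketch under assumptions, not the harness we used: RectifierTimingSketch, BuildXml and Measure are hypothetical names, the synthetic document simply repeats the sample elements from the top of this post, and the rectify delegate stands in for whichever implementation (Regex or reader/writer) you want to time.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical timing helper; names and document shape are assumptions.
public static class RectifierTimingSketch
{
    // Builds a synthetic document of at least approxBytes characters by
    // repeating one non-empty and one empty element inside a single root.
    public static string BuildXml(int approxBytes)
    {
        StringBuilder sb = new StringBuilder("<root>");
        while (sb.Length < approxBytes)
            sb.Append("<foo>text content</foo><bar></bar>");
        sb.Append("</root>");
        return sb.ToString();
    }

    // Returns the average wall-clock time of one call to rectify(xml).
    public static TimeSpan Measure(Func<string, string> rectify, string xml, int iterations)
    {
        rectify(xml); // warm-up call so JIT and Regex compilation are not timed
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            rectify(xml);
        sw.Stop();
        return TimeSpan.FromTicks(sw.Elapsed.Ticks / iterations);
    }
}
```

Running Measure over documents built at 100 bytes, 10k, 100k and 1.5 megs for each implementation gives one data point per size, which is enough to see where the two curves cross. Real documents with attributes, namespaces and CDATA will behave differently from this synthetic one, so treat the output as a rough guide only.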
This lesson taught me that manipulating XML as if it were a string is often easy and quick to develop, but manipulating the infoset with really lightweight APIs like the XmlReader will almost always make life easier.
I'd be interested in hearing Oleg's or Kzu's opinions on how to make this more elegant and performant, and whether it's even worth the hassle. Our dream of an XmlPeekingReader or XmlRectifyingReader to do this all in one pass remains...
Original post: https://www.hanselman.com/blog/stripping-out-empty-xmlelements-in-a-performant-way-and-the-bus-factor