Slicing HTML based on a delimiter
Author: 呆子只爱小呆 | Source: Internet | 2023-01-21 20:42
1> Hugo Delsing..:
To solve a problem like this, first work out the steps needed to reach a solution before you start coding.
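1. Find an element that starts with [[delimiter]].
2. Check whether its parent has a next sibling.
3. No? Move up one level and repeat step 2.
4. Yes? That next sibling contains the rest of the content.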
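The steps above can be sketched roughly as follows. This is only an illustration: it uses PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM library used in the answer's own code, and the sample markup and the helper name `findMoreContent()` are assumptions made up for the example.

```php
<?php
// Illustrative sketch only. The markup below is an assumed sample,
// and findMoreContent() is a hypothetical helper name.
$html = <<<HTML
<div>
  <p><b>[[delimiter]]Start of section 1</b></p>
  <p>More content in section 1</p>
</div>
HTML;

/**
 * Steps 2-4: walk up from $node until an ancestor has a next element
 * sibling, and return that sibling (null when there is none).
 */
function findMoreContent(DOMNode $node): ?DOMElement
{
    for ($current = $node->parentNode; $current instanceof DOMElement; $current = $current->parentNode) {
        // Skip whitespace-only text nodes that sit between tags.
        $sibling = $current->nextSibling;
        while ($sibling !== null && !($sibling instanceof DOMElement)) {
            $sibling = $sibling->nextSibling;
        }
        if ($sibling !== null) {
            return $sibling;
        }
    }
    return null;
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Step 1: text nodes whose trimmed content starts with the delimiter.
$results = [];
$query = "//text()[starts-with(normalize-space(.), '[[delimiter]]')]";
foreach ($xpath->query($query) as $textNode) {
    $more = findMoreContent($textNode);
    $results[] = [
        'content' => trim($textNode->nodeValue),
        'more'    => $more ? trim($more->textContent) : null,
    ];
}

print_r($results);
```

Here the delimiter text sits inside `<b>`, which has no next sibling, so the walk moves up to the enclosing `<p>` and finds the second `<p>` as its sibling.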
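Once you have that working, you are already 90% done. All that is left is to clean up the unnecessary tags.
To get something you can extend, don't build one big pile of obfuscated code that merely works; split the data you need into pieces you can work with.
The code below uses two classes that do exactly what you need and give you a convenient way to iterate over all the elements whenever you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.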
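<?php
// str_get_html() comes from PHP Simple HTML DOM Parser (assumed to be in simple_html_dom.php next to this script)
require_once 'simple_html_dom.php';

// Sample input; only the text of the sample is preserved here, its HTML tags are not shown
$html = <<<XML
[[delimiter]]Start of content section 1.
More content in section 1
[[delimiter]]Start of section 2
More content in section 2
[[delimiter]]Start of section 3
More content in section 3
XML;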

/*
 * CALL
 */
$parser = new HtmlParser($html, '[[delimiter]]');

// Dump what was found.
// The JSON encode/decode round-trip leaves only the public properties.
print_r(json_decode(json_encode($parser)));

/*
 * ACTUAL CODE
 */
class HtmlParser
{
    private $_html;
    private $_delimiter;
    private $_dom;

    public $Elements = array();

    final public function __construct($html, $delimiter)
    {
        $this->_html = $html;
        $this->_delimiter = $delimiter;
        $this->_dom = str_get_html($this->_html);
        $this->getElements();
    }

    // "final private" triggers a warning on PHP 8, so the method is just private
    private function getElements()
    {
        // This finds all matching elements, including parent elements.
        // It also selects the bare text as an element, without surrounding tags.
        $elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");

        // Keep only the elements that actually start with the delimiter
        foreach ($elements as $element) {
            // We want the element without tags, so we check its outertext
            if (strpos($element->outertext, $this->_delimiter) === 0) {
                $this->Elements[] = new DelimiterTag($element);
            }
        }
    }
}

class DelimiterTag
{
    private $_element;

    public $Content;
    public $MoreContent;

    final public function __construct($element)
    {
        $this->_element = $element;
        $this->Content = $element->outertext;
        $this->findMore();
    }

    // "final private" triggers a warning on PHP 8, so the method is just private
    private function findMore()
    {
        // Traverse up until we find a parent that has a next sibling.
        // Keep track of the child so we can clean up the last parent.
        $child = $this->_element;
        $parent = $child->parent();
        $next = null;

        while ($parent) {
            $next = $parent->next_sibling();
            if ($next) {
                break;
            }
            $child = $parent;
            $parent = $child->parent();
        }

        if (!$next) {
            // No more content
            return;
        }

        // Go up one more element and empty its innertext, to build the new data in it
        $more = $parent->parent();
        $more->innertext = "";

        // Add the parent, because this is where the actual content lies,
        // but keep only this child in it, in case there are more delimiters
        $parent->innertext = $child->outertext;
        $more->innertext .= $parent->outertext;

        // Add the next sibling, because this is where more content lies
        $more->innertext .= $next->outertext;

        // Set the variables
        if ($more->tag == "body") {
            // Your section 3 works slightly differently: it doesn't show the parent tag, where the first two do.
            // That's why I use the innertext for the root tag and the outertext for the others.
            $this->MoreContent = $more->innertext;
        } else {
            $this->MoreContent = $more->outertext;
        }
    }
}
?>
Cleaned-up output:
stdClass Object
(
    [Elements] => Array
        (
            [0] => stdClass Object
                (
                    [Content] => [[delimiter]]Start of content section 1.
                    [MoreContent] =>
                        [[delimiter]]Start of content section 1.
                        More content in section 1
                )

            [1] => stdClass Object
                (
                    [Content] => [[delimiter]]Start of section 2
                    [MoreContent] =>
                        [[delimiter]]Start of section 2
                        More content in section 2
                )

            [2] => stdClass Object
                (
                    [Content] => [[delimiter]]Start of section 3
                    [MoreContent] =>
                        [[delimiter]]Start of section 3
                        More content in section 3
                )

        )

)
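The dump shows only `Content` and `MoreContent` because the script passes the parser through `json_decode(json_encode(...))` before printing it. A minimal sketch of that trick in isolation, using a hypothetical `Section` class that is not part of the answer's code:

```php
<?php
// Hypothetical class, used only to illustrate the round-trip.
class Section
{
    private $_element = 'internal state';
    public $Content = 'visible text';
}

$section = new Section();

// Called from outside the class, json_encode() serializes only the public
// properties, so the decoded stdClass no longer carries the private ones.
$publicOnly = json_decode(json_encode($section));

print_r($publicOnly);
```

Calling `print_r($section)` directly would also list the private `_element` property, which is why the round-trip is convenient for a readable dump.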
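I've decided this is an interesting theoretical exercise but a practical nightmare. You would probably need hundreds of examples to be sure any solution works, and then, as you said, someone turns up with example 101 and breaks the code again.
TBH, without input from the OP it is hard to validate even the basic assumptions made about the document. Even the basis of tracking back up the document with the "parent has a next sibling" logic may oversimplify the possible combinations.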