
Slicing HTML based on a delimiter

Author: 呆子只爱小呆 | Source: Internet | 2023-01-21 20:42


1> Hugo Delsing:

To solve a problem like this, first work out the steps needed to reach a solution before you start coding.

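1. Find an element that starts with [[delimiter]].
2. Check whether its parent has a next sibling.
3. No? Repeat step 2 with the parent's own parent.
4. Yes? That next sibling contains the content.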
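The steps above can be sketched with PHP's built-in DOM extension. This is only a minimal illustration under assumptions, not the answer's own code: it swaps in DOMDocument for Simple HTML DOM, the helper names nextElement() and findSections() are hypothetical, and the sample markup is invented for the example.

```php
<?php
// Sketch only: uses the built-in DOM extension (DOMDocument/DOMXPath),
// not the Simple HTML DOM library used in the answer.

// Hypothetical helper: next sibling that is an element (skips whitespace text nodes).
function nextElement(DOMNode $node): ?DOMElement
{
    for ($n = $node->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMElement) {
            return $n;
        }
    }
    return null;
}

// Hypothetical helper: applies steps 1 to 4 to every delimiter found in $html.
function findSections(string $html, string $delimiter): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sections = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $text) {
        // Step 1: keep only text that starts with the delimiter.
        if (strpos(ltrim($text->nodeValue), $delimiter) !== 0) {
            continue;
        }
        // Steps 2 and 3: climb until some ancestor has a next element sibling.
        $parent = $text->parentNode;
        $next = null;
        while ($parent instanceof DOMElement) {
            $next = nextElement($parent);
            if ($next !== null) {
                break;
            }
            $parent = $parent->parentNode;
        }
        // Step 4: that sibling (if any) holds the remaining content.
        $sections[] = [
            'content' => trim($text->nodeValue),
            'more'    => $next !== null ? trim($next->textContent) : null,
        ];
    }
    return $sections;
}

// Invented sample markup, for illustration only.
$html = '<body>'
    . '<p>[[delimiter]]First</p><p>More first</p>'
    . '<div><span><b>[[delimiter]]Second</b></span></div><p>More second</p>'
    . '</body>';

$sections = findSections($html, '[[delimiter]]');
print_r($sections);
```

In the second sample section the delimiter sits three levels deep, so the loop has to climb from the b element through span to div before it finds a next sibling. That is the "repeat step 2" case from the list.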

Once you have that working, you are 90% of the way there. All that remains is to clean up the unnecessary tags, and you are done.

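To end up with something you can extend, don't build one big pile of obfuscated code that merely works. Instead, split the data you need into pieces you can work with.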

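The code below uses two classes that do exactly what you need and give you a convenient way to loop over all the elements whenever you need to. It uses PHP Simple HTML DOM Parser rather than DOMDocument, because I like it a bit better.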

<?php
// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) sits next to this script.
require_once 'simple_html_dom.php';

// Sample input. The markup tags did not survive in this copy;
// only the text fragments below remain.
$html = <<<XML
[[delimiter]]Start of content section 1.

More content in section 3
XML;

/*
 * CALL
 */
$parser = new HtmlParser($html, '[[delimiter]]');

// dump what was found
// decode/encode so that only public values are shown
print_r(json_decode(json_encode($parser)));

/*
 * ACTUAL CODE
 */
class HtmlParser
{
    private $_html;
    private $_delimiter;
    private $_dom;

    public $Elements = array();

    final public function __construct($html, $delimiter)
    {
        $this->_html = $html;
        $this->_delimiter = $delimiter;
        $this->_dom = str_get_html($this->_html);
        $this->getElements();
    }

    private function getElements()
    {
        // this will find all elements, including parent elements
        // it will also select the actual text as an element, without surrounding tags
        $elements = $this->_dom->find("[contains(text(),'" . $this->_delimiter . "')]");

        // find the actual elements that start with the delimiter
        foreach ($elements as $element) {
            // we want the element without tags, so we search for outertext
            if (strpos($element->outertext, $this->_delimiter) === 0) {
                $this->Elements[] = new DelimiterTag($element);
            }
        }
    }
}

class DelimiterTag
{
    private $_element;

    public $Content;
    public $MoreContent;

    final public function __construct($element)
    {
        $this->_element = $element;
        $this->Content = $element->outertext;
        $this->findMore();
    }

    private function findMore()
    {
        // we need to traverse up until we find a parent that has a next sibling
        // we need to keep track of the child, to clean up the last parent
        $child = $this->_element;
        $parent = $child->parent();
        $next = null;
        while ($parent) {
            $next = $parent->next_sibling();
            if ($next) {
                break;
            }
            $child = $parent;
            $parent = $child->parent();
        }

        if (!$next) {
            // no more content
            return;
        }

        // create an empty element to build the new data:
        // go up one more element and clear its innertext
        $more = $parent->parent();
        $more->innertext = "";

        // add the parent, because this is where the actual content lies,
        // but only add the child to the parent, in case there are more delimiters
        $parent->innertext = $child->outertext;
        $more->innertext .= $parent->outertext;

        // add the next sibling, because this is where more content lies
        $more->innertext .= $next->outertext;

        // set the variables
        if ($more->tag == "body") {
            // Your section 3 works slightly differently, as it doesn't show the parent tag where the first two do.
            // That's why I show the innertext for the root tag and the outertext for others.
            $this->MoreContent = $more->innertext;
        } else {
            $this->MoreContent = $more->outertext;
        }
    }
}
?>

Cleaned-up output:

stdClass Object
(
  [Elements] => Array
  (
    [0] => stdClass Object
    (
        [Content] => [[delimiter]]Start of content section 1.
        [MoreContent] => 
                            [[delimiter]]Start of content section 1.
                            More content in section 1
                          
    )

    [1] => stdClass Object
    (
        [Content] => [[delimiter]]Start of section 2
        [MoreContent] => 
                            [[delimiter]]Start of section 2
                            More content in section 2
                         
    )

    [2] => stdClass Object
    (
        [Content] => [[delimiter]]Start of section 3
        [MoreContent] => 
                            [[delimiter]]Start of section 3
                         
                         
                            More content in section 3
                          
    )
  )
)