DOMDocument::loadHTML

(PHP 5, PHP 7, PHP 8)

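DOMDocument::loadHTML - Load HTML from a string

Description

public DOMDocument::loadHTML(string $source, int $options = 0): bool

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

Warning

This function parses the input using an HTML 4 parser. The parsing rules of HTML 5, which modern web browsers use, are different. Depending on the input, this may result in a different DOM structure. Therefore this function cannot be safely used for sanitizing HTML.

The behavior when parsing HTML can depend on the version of libxml in use, especially with regard to edge cases and error handling. For parsing that conforms to the HTML5 specification, use Dom\HTMLDocument::createFromString() or Dom\HTMLDocument::createFromFile(), added in PHP 8.4.

As an example, some HTML elements implicitly close a parent element when encountered. The rules for automatically closing parent elements differ between HTML 4 and HTML 5, so the DOM structure that DOMDocument sees may differ from the DOM structure a web browser sees, possibly allowing an attacker to break the resulting HTML.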
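As a rough sketch of the HTML5 alternative named in the warning, the helper below uses Dom\HTMLDocument::createFromString() when that class exists (PHP 8.4 and later) and otherwise falls back to DOMDocument::loadHTML(). The function name parseHtml is made up for this illustration, and LIBXML_NOERROR is passed only to keep the example quiet; whether to suppress parser errors is a decision for the caller.

```php
<?php
// Hypothetical helper: pick the HTML5 parser when available, else the HTML 4 one.
function parseHtml(string $html): object
{
    if (class_exists('Dom\HTMLDocument')) {
        // PHP 8.4+: parsing that follows the HTML5 specification
        return \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    }
    // Older PHP versions: libxml's HTML 4 parser
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOERROR);
    return $doc;
}

$doc = parseHtml('<!DOCTYPE html><html><body><section><p>Hi</p></section></body></html>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Note that the two branches return different classes (Dom\HTMLDocument versus DOMDocument), so code that type-hints the result would have to allow for both.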

Parameters

source: The HTML string.
options: Bitwise OR of the libxml option constants.
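A minimal sketch of how several libxml option constants can be combined for the options parameter. The particular combination is only an illustration, chosen here for parsing a fragment without the implied wrapper elements; it is not a recommended default.

```php
<?php
$doc = new DOMDocument();

// Combine several libxml options with the bitwise OR operator.
$options = LIBXML_HTML_NOIMPLIED   // do not add implied <html>/<body> elements
         | LIBXML_HTML_NODEFDTD    // do not add a default doctype
         | LIBXML_NOERROR;         // do not report parse errors

$doc->loadHTML('<p>Hello</p>', $options);
$out = $doc->saveHTML();
echo $out;
```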

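Return Values

Returns true on success or false on failure.

Errors/Exceptions

If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.

While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.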
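The following sketch shows one way to handle those errors with libxml's error functions instead of letting them surface as E_WARNING. Which errors are reported for a given input, if any, depends on the libxml version, so the loop may print nothing.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML('<html><body><section>Hi</section></body></html>');

// Each entry is a LibXMLError object with level, code, message and line.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // free the stored errors
libxml_use_internal_errors($previous);  // restore the earlier setting
```

Restoring the previous setting keeps the change local, which matters when other code in the same request relies on the default error behavior.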

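Changelog

Version	Description
8.3.0	This function now has a tentative bool return type.
8.0.0	Calling this function statically will now throw an Error. Previously, an `E_DEPRECATED` was raised.

Examples

Example #1 Creating a Document

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>

See Also

DOMDocument::loadHTMLFile() - Load HTML from a file
DOMDocument::saveHTML() - Dumps the internal document into a string using HTML formatting
DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting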

User Contributed Notes 19 notes

mdmitry at gmail dot com ¶

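14 years ago

You can also load HTML as UTF-8 using this simple hack:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix: remove the processing instruction added by the hack
// (iterate over a copy, because removing nodes changes the live childNodes list)
foreach (iterator_to_array($doc->childNodes) as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item);
    }
}
$doc->encoding = 'UTF-8'; // set the proper encoding

?>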

Anonymous ¶

2 years ago

If the HTML contains HTML5 tags such as "nav", "section", or "footer", loadHTML() and loadHTMLFile() may always generate warnings (observed in PHP 8.1.6).

Try running the code below.

<?php

$file_name = 'PHP Runtime Configuration - Manual.html'; // Download this file beforehand from "https://php.dev.org.tw/manual/en/session.configuration.php".

$doc = new DOMDocument();
$doc->loadHTMLFile($file_name); // No errors if "LIBXML_NOERROR" is passed as the second argument
echo $doc->saveHTML();

// Warning: DOMDocument::loadHTMLFile(): Tag nav invalid in PHP Runtime Configuration - Manual.html, line: 63 in D:\xampp\htdocs\test\xml(dom)\loadHTML\index.php on line 6

?>

BychkovVV at mail dot ru ¶

4 years ago

If you load "utf-8" encoded HTML content from any site, the parser will not recognize the encoding when the meta tag with content-type is not the first child of HEAD. You can work around this with the following fix:
<?php
function domLoadHTML($html)
{
    $testDOM = new DOMDocument('1.0', 'UTF-8');
    $testDOM->loadHTML($html);
    $charset = null;

    // Recursively walk html > head and look for <meta http-equiv="Content-Type">.
    $searchInElement = function ($item) use (&$searchInElement, &$charset) {
        if ($item->childNodes) {
            foreach ($item->childNodes as $childItem) {
                switch ($childItem->nodeName) {
                    case 'html':
                    case 'head':
                        $searchInElement($childItem);
                        break;
                    case 'meta':
                        $attributes = array();
                        foreach ($childItem->attributes as $attr) {
                            $attributes[mb_strtoupper($attr->localName)] = $attr->nodeValue;
                        }
                        if (array_key_exists('HTTP-EQUIV', $attributes) && (mb_strtoupper($attributes['HTTP-EQUIV']) == 'CONTENT-TYPE') && array_key_exists('CONTENT', $attributes) && preg_match('~[\s]*;[\s]*charset[\s]*=[\s]*([^\s]+)~', $attributes['CONTENT'], $matches)) {
                            $charset = preg_replace('~[\s\']~', '', $matches[1]);
                        }
                        break;
                }
            }
        }
    };
    $searchInElement($testDOM);

    if (isset($charset)) {
        // Reload with the detected charset, using the XML declaration hack.
        $dom = new DOMDocument('1.0', $charset);
        $dom->loadHTML('<?xml encoding="' . $charset . '">' . $html);
        // Iterate over a copy, because removing nodes changes the live childNodes list.
        foreach (iterator_to_array($dom->childNodes) as $item) {
            if ($item->nodeType == XML_PI_NODE) {
                $dom->removeChild($item);
            }
        }
        $dom->encoding = $charset;
    } else {
        $dom = $testDOM;
    }
    return $dom;
}

Shane Harter ¶

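14 years ago

DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

This is not well documented here. The solution is to set up a separate mechanism for handling just these errors.

Call libxml_use_internal_errors(true) before calling loadHTML. This prevents the errors from bubbling up to your default error handler. You can then retrieve them, if needed, using the other libxml error functions.

You can find more information here: https://php.dev.org.tw/manual/en/ref.libxml.php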

hanhvansu at yahoo dot com ¶

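17 years ago

When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest converting the page with mb_convert_encoding before loading it.
<?php
$pageDom = new DOMDocument();
// Note: the 'HTML-ENTITIES' encoding of mbstring is deprecated as of PHP 8.2.
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
@$pageDom->loadHTML($searchPage);

?>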
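Because the 'HTML-ENTITIES' conversion is deprecated in recent PHP versions, the same idea can be sketched with mb_encode_numericentity(), which turns every non-ASCII character into a numeric character reference before parsing. This is an alternative technique, not the note author's code; the sample string and the conversion map are chosen for this illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
$html = '<html><body><p>Cạnh tranh</p></body></html>';

// Map every code point from U+0080 upward to a numeric character reference.
// Format of the map: [start, end, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($encoded);

// The DOM stores text as UTF-8, so the original string comes back intact.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The input handed to the parser is pure ASCII, so the result does not depend on how the parser guesses the encoding.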

obayed dot opu at gmail dot com ¶

2 years ago

To load HTML5 markup without warnings, suppress libxml error reporting by passing the `LIBXML_NOERROR` option to the loadHTML method.

Example

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><section>I'M UNSUPPORTED</section></body></html>", LIBXML_NOERROR);
echo $doc->saveHTML();
?>

bigtree at DONTSPAM dot 29a dot nl ¶

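19 years ago

Be careful when loading HTML in an encoding other than ISO-8859-1. Because this method does not actively try to work out the encoding of the HTML you are loading (as most browsers do), you have to specify it in the HTML head. If, for instance, your HTML is in UTF-8, make sure you have a meta tag in the HTML's head section:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. It is not enough to set the DOM document you are loading the HTML into to UTF-8.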
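A small sketch of the advice above: with the Content-Type meta tag placed in the head, a UTF-8 string read back through the DOM matches the input. The sample markup is made up for this illustration.

```php
<?php
$body = '<p>Tiếng Việt</p>';

// Declare the charset in <head>, before any non-ASCII content.
$withMeta = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $body . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($withMeta);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```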

deepakrajpal dot com at gmail dot com ¶

4 years ago

If we load HTML5 tags such as <section> or <svg>, the following error occurs:

DOMDocument::loadHTML(): Tag section invalid in Entity

We can disable standard libxml errors (and enable user error handling) by calling libxml_use_internal_errors(true); before loadHTML().

This is quite useful in PHPUnit custom assertions, as in the example below (if using a PHPUnit test case):

// Create a DOMDocument
$dom = new DOMDocument();

// Work around the html5/svg errors
libxml_use_internal_errors(true);

// Load the HTML
$dom->loadHTML("<section></section>");
$htmlNodes = $dom->getElementsByTagName('section');

// Pass only if at least one <section> element was parsed
$this->assertGreaterThan(0, $htmlNodes->length);

finkenb2 at mail dot lib dot msu dot edu ¶

9 years ago

Warning: This does not work well with HTML5 elements such as SVG. Most of the advice on the Web is to turn off errors in order to make it work with HTML5.

fr at felix-riesterer dot de ¶

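8 years ago

Remember: if you use an HTML5 doctype and a meta element like this:

<meta charset="utf-8">

your HTML code will be interpreted as ISO-8859-something and non-ASCII characters will be converted to HTML entities. However, the HTML4-like version will work (as pointed out 10 years earlier by "bigtree at 29a"):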

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

cake at brothercake dot com ¶

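11 years ago

Note that this function doesn't actually understand HTML: it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup but has no idea which element contents are allowed.

For example, with input like this, where the first element isn't closed:

<span>hello <div>world</div>

loadHTML will change it to this, which is well-formed but invalid: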

<span>hello <div>world</div></span>

Errol ¶

15 years ago

It should be noted that when any text is provided within the body tag outside of a containing element, DOMDocument will wrap that text in a paragraph tag (<p>).

For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>

while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>

romain dot lalaut at laposte dot net ¶

17 years ago

Note that the elements of such a document will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">.
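A short sketch of what this means in practice: the namespaceURI of a parsed element is null, so XPath queries use plain, unprefixed element names. The markup is made up for this illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi</p></body></html>');

// The xmlns attribute does not put the elements into a namespace.
$p = $doc->getElementsByTagName('p')->item(0);
var_dump($p->namespaceURI);

// XPath therefore matches unprefixed element names directly.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//p')->length;
echo $count, "\n";
```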

kerim-yagmurcu at gmx dot de ¶

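7 years ago

For those who want to get elements by class from an external URL, I have 2 useful functions. In this example we get the '<h3 class="r">' elements (the search result headers) from a Google search.

1. Check the URL (is it reachable, does it exist)
<?php
# URL check: true only if the first response status line is a 2xx
function url_check($url) {
    $headers = @get_headers($url);
    return is_array($headers) && preg_match('/^HTTP\\/\\d+\\.\\d+\\s+2\\d\\d\\s+.*$/', $headers[0]) === 1;
}
?>

2. Clean the element you want to get (remove all tags, tabs, line breaks, etc.)
<?php
# Function to clean a string
function clean($text) {
    // Strip tags, collapse whitespace runs to single spaces, replace ';' with '-', trim, then decode entities
    $clean = html_entity_decode(trim(str_replace(';', '-', preg_replace('/\s+/S', " ", strip_tags($text)))));
    return $clean;
}
?>

After that, we can output the search result headers as follows:
<?php
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q='.$searchstring;
if (url_check($url)) {
    $doc = new DOMDocument();
    $doc->validateOnParse = true;
    $doc->loadHTML(file_get_contents($url));
    // DOMDocument has no getElementByClass() method, so select by class with XPath instead
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " r ")]') as $node) {
        echo clean($node->textContent) . '<br>';
    }
} else {
    echo 'URL not reachable!'; // Print a message when the URL cannot be reached
}
?>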

jamesedwardcooke+php at gmail dot com ¶

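16 years ago

Using loadHTML() automatically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overwritten.

I assumed it was possible to set it and then load HTML with the doctype I had defined (in order to decide the doctype at runtime), and I ran into huge trouble trying to figure out where my doctype had gone. Hopefully this helps someone else.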
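The behavior described above can be observed with a sketch like the following, which reads the doctype property after loading markup with and without a doctype. The exact default may depend on the libxml version in use.

```php
<?php
// No doctype in the input: the HTML 4.0 Transitional default is supplied.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hi</p></body></html>');
$defaultId = $doc->doctype->publicId;
echo $defaultId, "\n";

// A doctype in the input is used instead of the default.
$doc5 = new DOMDocument();
$doc5->loadHTML('<!DOCTYPE html><html><body><p>Hi</p></body></html>');
$name = $doc5->doctype->name;
$publicId = $doc5->doctype->publicId;
echo $name, "\n";
```

Passing LIBXML_HTML_NODEFDTD in the options prevents the default doctype from being added when the input has none.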

divinity76+spam at gmail dot com ¶

4 years ago

If you want to get rid of all "DOMText elements that contain only whitespace", maybe try:

<?php

function loadHTML_noemptywhitespace(string $html, int $extra_flags = 0, int $exclude_flags = 0): DOMDocument
{
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS | LIBXML_NONET;
    $flags = ($flags | $extra_flags) & ~ $exclude_flags;

    $domd = new DOMDocument();
    $domd->preserveWhiteSpace = false;
    @$domd->loadHTML('<?xml encoding="UTF-8">' . $html, $flags);
    $removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
        if ($node->hasChildNodes()) {
            // Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
            // that's why i don't use foreach() - don't change it (unless you know what you're doing, ofc)
            for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
                $removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
            }
        }
        // Compare against '' rather than using empty(), so that a text node containing only "0" is kept
        if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && trim($node->textContent) === '') {
            $node->parentNode->removeChild($node);
        }
    };
    $removeAnnoyingWhitespaceTextNodes($domd);
    return $domd;
}

Alex ¶

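14 years ago

Beware of a "gotcha" (works as designed, but not as expected): if you use loadHTML, you cannot validate the document. Validation only works for XML. Details here: http://bugs.php.net/bug.php?id=43771&edit=1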

xuanbn at yahoo dot com ¶

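17 years ago

If you use loadHTML() to process a UTF HTML string (e.g. Vietnamese), you may get garbled text for some files while others are fine, even if your HTML already has a meta charset like this:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

I found that, to help loadHTML() process UTF files correctly, the meta tag should come first, before any UTF string appears. For example, this HTML file:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title> Vietnamese - Tiếng Việt</title>
</head>
<body></body>
</html>

is processed correctly by loadHTML(), because the <meta> tag comes before the <title> tag.

But the file below will not be handled correctly by loadHTML(), because the <title> tag contains a UTF string and appears before the <meta> tag.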

<html>
<head>
<title> Vietnamese - Tiếng Việt</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body></body>
</html>

piopier ¶

15 years ago

Here is a function I wrote to sum up the previous comments about charset problems (UTF-8...) when using loadHTML and the DOM functions.
It adds a charset meta tag just after <head> to improve automatic encoding detection, and converts any special characters to HTML entities so that the PHP DOM functions/properties return the correct values.

<?php
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");
function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    if (empty($content)) {
        return FALSE; // nothing to parse (the download failed or the page is empty)
    }
    if (empty($encod)) {
        $encod = mb_detect_encoding($content);
    }
    $headpos = mb_strpos($content, '<head>');
    if (FALSE === $headpos) {
        $headpos = mb_strpos($content, '<HEAD>');
    }
    if (FALSE !== $headpos) {
        $headpos += 6; // length of "<head>"
        $content = mb_substr($content, 0, $headpos) . '<meta http-equiv="Content-Type" content="text/html; charset='.$encod.'">' . mb_substr($content, $headpos);
    }
    // Note: the 'HTML-ENTITIES' encoding of mbstring is deprecated as of PHP 8.2.
    $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
    $dom = new DOMDocument();
    $res = $dom->loadHTML($content);
    if (!$res) return FALSE;
    return $dom;
}
?>

NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seems more efficient with huge HTML pages.
