mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

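mb_detect_encoding — Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for string string from an ordered list of candidates.

Automatic detection of the intended character encoding can never be entirely reliable; without some additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.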
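As a rough illustration of preferring a declared encoding over detection, the sketch below pulls the charset parameter out of a Content-Type header value and falls back to a caller-supplied default only when none is declared. The function name charset_from_content_type() and the sample header values are hypothetical, and the parsing is deliberately simplified: it does not cover every quoting rule of the HTTP specification.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, e.g. "text/html; charset=ISO-8859-1".
function charset_from_content_type(string $headerValue, ?string $default = null): ?string
{
    // Match charset=token or charset="quoted", case-insensitively.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $headerValue, $m) === 1) {
        return $m[1];
    }
    return $default; // nothing declared; the caller decides what to do next
}

echo charset_from_content_type('text/html; charset=ISO-8859-1'), "\n";
echo charset_from_content_type('text/plain', 'UTF-8'), "\n";
```

Only when this kind of lookup yields nothing would it make sense to fall back to mb_detect_encoding().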

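This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding will be rejected, and the next encoding checked.

Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or a single string separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option, or mb_detect_order() function) will be used.

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.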
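Because the function returns false rather than an encoding name when nothing matches, callers generally need a strict comparison before using the result. The following is a minimal sketch, assuming the mbstring extension is loaded; the helper name detect_or_default() and the fallback value are illustrative choices, not part of the API.

```php
<?php
// Hypothetical wrapper: detect in strict mode and substitute a caller-chosen
// fallback when no candidate encoding accepts the string.
function detect_or_default(string $string, array $encodings, string $fallback): string
{
    $detected = mb_detect_encoding($string, $encodings, true);
    // Compare with === so that false is never mistaken for an encoding name.
    return $detected === false ? $fallback : $detected;
}

// "\xE1\xE9" is neither valid ASCII nor valid UTF-8, so the fallback is used.
echo detect_or_default("\xE1\xE9", ['ASCII', 'UTF-8'], 'ISO-8859-1'), "\n";
```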

Changelog

Version	Description
8.2.0	mb_detect_encoding() will no longer return the following non-text encodings: `"Base64"`, `"QPrint"`, `"UUencode"`, `"HTML entities"`, `"7 bit"` and `"8 bit"`.

Examples

Example #1 mb_detect_encoding() example

<?php
// Detect character encoding with current detect_order
echo mb_detect_encoding($str);

// "auto" is expanded according to mbstring.language
echo mb_detect_encoding($str, "auto");

// Specify "encodings" parameter by list separated by comma
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

// Use array to specify "encodings" parameter
$encodings = [
 "ASCII",
 "JIS",
 "EUC-JP"
];
echo mb_detect_encoding($str, $encodings);
?>

Example #2 Effect of strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

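In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For instance, among many others, the byte sequence "\xC4\xA2" could be:

"Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
"ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
"Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8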
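The ambiguity can be made visible by decoding the same two bytes under each assumption. This is a small sketch, assuming the mbstring extension is available; it converts the byte sequence to UTF-8 three times, each time claiming a different source encoding, and every conversion succeeds with a different result.

```php
<?php
$bytes = "\xC4\xA2";

// Interpret the same bytes under three different source encodings.
foreach (['ISO-8859-1', 'ISO-8859-5', 'UTF-8'] as $assumed) {
    $asUtf8 = mb_convert_encoding($bytes, 'UTF-8', $assumed);
    // bin2hex() shows that each interpretation yields different UTF-8 bytes.
    printf("%-10s => %s (%s)\n", $assumed, $asUtf8, bin2hex($asUtf8));
}
```

None of the three conversions fails, which is why validity alone cannot settle which encoding was intended.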

Example #3 Effect of order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed will be returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

See Also

mb_detect_order() - Set/Get character encoding detection order

User Contributed Notes 20 notes

Gerg Tisza ¶

13 years ago

If you try to use mb_detect_encoding to detect whether a string is valid UTF-8, use the strict mode; it is pretty much useless otherwise.

<?php
 $str = 'áéóú'; // ISO-8859-1
 mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
 mb_detect_encoding($str, 'UTF-8', true); // false
?>

mta59066 at gmail dot com ¶

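2 years ago

This documentation is no longer accurate as of PHP 8.1: mb_detect_encoding no longer honors the order of the encodings. The example outputs shown in the documentation are also no longer correct for PHP 8.1. This is explained in https://github.com/php/php-src/issues/8279.

I understand the earlier ambiguity of these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and introduced different functions instead. It now tries to find the encoding that would use the least space, regardless of the order, and I am not sure who needs that.

Below is an example function that does what mb_detect_encoding did before the 8.1 changes.

<?php

function mb_detect_encoding_in_order(string $string, array $encodings): string|false
{
    // Return the first candidate in which the string is valid.
    foreach ($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

?>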

geompse at gmail dot com ¶

2 years ago

A major, undocumented breaking change has been in effect since 8.1.7
https://3v4l.org/BLjZ3

Be sure to replace mb_detect_encoding with a loop that calls mb_check_encoding

Chrigu ¶

19 years ago

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
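The order matters in this way because ISO-8859-1 assigns a character to every possible byte value, so no byte string is ever invalid in it. A quick check, sketched here under the assumption that mbstring is loaded, is to validate a string containing all 256 byte values:

```php
<?php
// Build a string containing every byte value from 0x00 to 0xFF.
$allBytes = implode('', array_map('chr', range(0, 255)));

// Every byte maps to a character in ISO-8859-1, so validation cannot fail...
var_dump(mb_check_encoding($allBytes, 'ISO-8859-1'));
// ...whereas UTF-8 rejects many byte sequences, which is what makes it a
// useful first candidate.
var_dump(mb_check_encoding($allBytes, 'UTF-8'));
```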

chris AT w3style.co DOT uk ¶

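18 years ago

Based upon the snippet below using preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I needed something that purely checks whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}

?>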

nat3738 at gmail dot com ¶

15 years ago

A simple way to detect the UTF-8/16/32 encoding of a file from its BOM (this does not work for strings or files without a BOM):

<?php
// The Unicode BOM is U+FEFF, but once encoded it looks like this.
define ('UTF32_BIG_ENDIAN_BOM' , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM' , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM' , chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {

    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // the UTF-32 BOMs are 4 bytes long

    // UTF-32 is tested before UTF-16 because the UTF-32LE BOM
    // begins with the same two bytes as the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';

    return null; // no BOM found
}
?>

dennis at nikolaenko dot ru ¶

16 years ago

Beware of the bug in detecting Russian encodings
http://bugs.php.net/bug.php?id=38138

rl at itfigures dot nl ¶

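17 years ago

I used Chris's "detectUTF8" function to detect whether a conversion from utf8 to 8859-1 is needed, which works fine. I did have a problem with the following iconv conversion.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although \x80 is commonly used as the euro sign in the 8859-1 charset.

I could not use 8859-15, since that mangled some other characters, so I added 2 str_replace calls:

if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}

If html output is needed, the last line is not necessary (and is even unwanted).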
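For what it is worth, byte 0x80 is the euro sign in Windows-1252 rather than in ISO-8859-1 proper (where 0x80 is a control character), so naming Windows-1252 as the target may avoid the placeholder round trip. A hedged sketch using mb_convert_encoding(), assuming mbstring is loaded and that Windows-1252 is really what the consumer of the output expects:

```php
<?php
$utf8 = "Price: 5 \xE2\x82\xAC"; // "Price: 5 €" in UTF-8

// Windows-1252 has a euro sign at 0x80, so no placeholder is needed.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

echo bin2hex(substr($cp1252, -1)), "\n"; // hex value of the final byte
```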

eyecatchup at gmail dot com ¶

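11 years ago

Just a note: instead of using the often-recommended (and rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
 if (preg_match("//u", $string)) {
     // $string is valid UTF-8
 }
?>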

hmdker at gmail dot com ¶

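16 years ago

A function to detect UTF-8, which may be useful when mb_detect_encoding is not available.

<?php
function is_utf8($str) {
    $c = 0; $b = 0;
    $bits = 0; // number of bytes in the current sequence
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c >= 128) { // non-ASCII; a lone 0x80 is a stray continuation byte
            if ($c >= 254) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false;
            // The sequence must fit inside the string.
            if (($i + $bits) > $len) return false;
            // Every following byte must be a continuation byte (0x80-0xBF).
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>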

php-note-2005 at ryandesign dot com ¶

19 years ago

A much simpler UTF-8 check using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string) === 1; // preg_match() returns 1, 0 or false

} // function is_utf8

?>

garbage at iglou dot eu ¶

7 years ago

To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori

maarten ¶

19 years ago

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to verify the correctness of UTF-8. mb_detect_encoding sometimes reports ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:

//
// UTF-8 encoding validation developed based on the Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C implementation to be included in PHP core
//
function valid_1byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}

function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while( $i < $len ) {
        $char = ord(substr($string, $i++, 1));
        if(valid_1byte($char)) { // continue
            continue;
        } else if(valid_2byte($char)) { // check 1 byte
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_3byte($char)) { // check 2 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_4byte($char)) { // check 3 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } // go to next char
    }
    return true; // done
}

For diagrams of the state machine, see http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

d_maksimov ¶

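2 years ago

This was helpful for my `exec(...)` calls, when they return cp866 or cp1251.

try {
    // Note: iconv() reports failure by returning false and raising a notice
    // rather than throwing, so this catch block is normally never reached.
    $line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);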

emoebel at web dot de ¶

10 years ago

If the function "mb_detect_encoding" does not exist ...

... try:

<?php 
// ---------------------------------------------------- 
if ( !function_exists('mb_detect_encoding') ) { 

// ---------------------------------------------------------------- 
function mb_detect_encoding ($string, $enc=null, $ret=null) { 
 
 static $enclist = array( 
 'UTF-8', 'ASCII', 
 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 
 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 
 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 
 'Windows-1251', 'Windows-1252', 'Windows-1254', 
 );
 
 $result = false; 
 
 foreach ($enclist as $item) { 
 $sample = iconv($item, $item, $string); 
 if (md5($sample) == md5($string)) { 
 if ($ret === NULL) { $result = $item; } else { $result = true; } 
 break; 
 }
 }
 
 return $result; 
} 
// ---------------------------------------------------------------- 

} 
// ---------------------------------------------------- 
?>

Example / usage of `mb_detect_encoding()`:

<?php 
// ------------------------------------------------------ 
function str_to_utf8 ($str) { 
 
 if (mb_detect_encoding($str, 'UTF-8', true) === false) { 
 $str = utf8_encode($str); 
 }

 return $str;
}
// ------------------------------------------------------ 
?>

$txtstr = str_to_utf8($txtstr);

bmrkbyet at web dot de ¶

11 years ago

a) If the function `mb_detect_encoding` is not available:

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_detect_encoding')) { 
function mb_detect_encoding($string, $enc=null) { 
 
 static $list = array('utf-8', 'iso-8859-1', 'windows-1251');
 
 foreach ($list as $item) {
 $sample = iconv($item, $item, $string);
 if (md5($sample) == md5($string)) { 
 if ($enc == $item) { return true; } else { return $item; } 
 }
 }
 return null;
}
}

// -------------------------------------------
?>

b) If the function mb_convert_encoding is not available:

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_convert_encoding')) { 
function mb_convert_encoding($string, $target_encoding, $source_encoding) { 
 $string = iconv($source_encoding, $target_encoding, $string); 
 return $string; 
}
}

// -------------------------------------------
?>

telemach ¶

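19 years ago

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as chrigu suggests), with ISO-8859-1 encoded input,

mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.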

recentUser at example dot com ¶

6 years ago

In my environment (PHP 7.1.12), "mb_detect_encoding()" does not work when "mb_detect_order()" is not set appropriately.

To make "mb_detect_encoding()" work in such a case, simply call "mb_detect_order('...')" before "mb_detect_encoding()" in your script file.

Neither
"ini_set('mbstring.language', '...');"
nor
"ini_set('mbstring.detect_order', '...');"
serves this purpose when called in a script file. Setting them in the PHP.INI file may work, however.
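A minimal sketch of that approach, assuming PHP 8 with mbstring loaded; the candidate list and the sample string are only examples and should be replaced with the encodings actually expected in the data:

```php
<?php
// Remember the current order so it can be restored afterwards.
$previousOrder = mb_detect_order();

// Set the detection order explicitly before calling mb_detect_encoding().
mb_detect_order(['UTF-8', 'ISO-8859-1']);

// With no $encodings argument (null), the order set above is used.
$encoding = mb_detect_encoding("caf\xC3\xA9", null, true);
echo $encoding, "\n";

mb_detect_order($previousOrder); // restore the earlier setting
```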

lotushzy at gmail dot com ¶

6 years ago

About the function mb_detect_encoding: the page at https://php.dev.org.tw/manual/zh/function.mb-detect-encoding.php shows
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but the result is no longer false. Can anyone tell me why? Thanks!

lexonight at yahoo dot com ¶

8 years ago

<?php
$file = file_get_contents("somefile.txt");
$encodings = implode(',', mb_list_encodings());
echo mb_detect_encoding($file, $encodings, true);
?>
This seems to work.