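A few points to note about the validity of UTF-8 strings when using the /u pattern modifier:
1. If the pattern itself contains invalid UTF-8 characters, you get an error (as mentioned in the documentation above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences or code points, the preg_* functions essentially "fail silently": they match nothing, and give no indication that the string is invalid UTF-8.
3. PCRE treats five- and six-octet UTF-8 character sequences as valid (in both patterns and subject strings), but Unicode does not support them (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", available at http://www.tldp.org/ and elsewhere).
4. For an example algorithm written in PHP that tests the validity of a UTF-8 string (and rejects five- and six-octet sequences), see: http://hsivonen.iki.fi/php-utf8/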
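As a rough alternative to a hand-written validator, the subject string can be checked before it is passed to a preg_* function. The following is only a minimal sketch, assuming PHP 7 or later with the mbstring extension loaded; `is_valid_utf8()` is a hypothetical helper name used here for illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether $str is well-formed UTF-8.
// Assumption: the mbstring extension is available.
function is_valid_utf8(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(is_valid_utf8("a"));        // plain ASCII
var_dump(is_valid_utf8("\xc3\xb1")); // valid two-octet sequence
var_dump(is_valid_utf8("\xc3\x28")); // invalid two-octet sequence
```

Checking up front means a later "no match" result can be trusted to mean that the pattern really did not match, rather than that the subject was rejected.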
The following script should give you an idea of what works and what doesn't:
<?php
// Sample byte strings: label => raw bytes.
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    // Only the 4th octet is invalid here.
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

// 1) Use each example as the pattern; an invalid pattern raises a PHP warning.
echo "++Invalid UTF-8 in pattern\n";
foreach ($examples as $name => $str) {
    echo "$name\n";
    preg_match("/" . $str . "/u", 'Testing');
}

// 2) Search each example for the five-octet sequence.
echo "++ preg_match() examples\n";
foreach ($examples as $name => $str) {
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";
    if (count($ar) == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}

// 3) Count the characters matched by "." in each example.
echo "++ preg_match_all() examples\n";
foreach ($examples as $name => $str) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    $num_utf8_chars = count($ar[0]);
    if ($num_utf8_chars == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character(s)\n";
    }
}
?>
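The silent failure described in point 2 can be made visible on newer PHP versions: for an invalid subject, preg_match() returns false rather than 0, and preg_last_error() reports PREG_BAD_UTF8_ERROR. The following is a minimal sketch, assuming PHP 7 or later; `match_utf8()` is a hypothetical wrapper name chosen for illustration only.

```php
<?php
// Hypothetical wrapper (not part of PHP): tells "no match" apart from
// "the subject is not valid UTF-8".
function match_utf8(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $m);
    if ($result === false) {
        // preg_match() failed; ask PCRE why.
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'invalid UTF-8'
            : 'other PCRE error';
    }
    return $result === 1 ? "matched {$m[0]}" : 'no match';
}

echo match_utf8('/./u', "\xc3\xb1"), "\n"; // valid subject, pattern matches
echo match_utf8('/x/u', "abc"), "\n";      // valid subject, pattern does not match
echo match_utf8('/./u', "\xc3\x28"), "\n"; // invalid two-octet sequence in subject
```

Comparing the return value with `=== false` matters here, because a loose comparison would treat a failed call the same as a call that simply found no match.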