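Please note that when using UTF-8, mb_strtolower only converts uppercase characters that belong to the Unicode general category "Letter, uppercase" ("Lu") to lowercase. However, there are other characters, such as the "letter numbers" (Unicode general category "Nl"), that also have lowercase and uppercase variants. These characters are not converted by mb_strtolower!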
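Example
The Roman numerals Ⅰ, Ⅱ, Ⅲ, ..., Ⅿ (Unicode code points 8544 to 8559, i.e. U+2160 to U+216F) also exist in lowercase variants ⅰ, ⅱ, ⅲ, ..., ⅿ (Unicode code points 8560 to 8575, i.e. U+2170 to U+217F) and, in my opinion, should also be converted by mb_strtolower, but they are not!
Large web companies such as Google treat the two variants as semantically equal, since the representations differ only in case.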
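The code point ranges quoted above imply that each lowercase Roman numeral sits exactly 16 code points above its uppercase form. A minimal sketch of that relationship follows; roman_numeral_to_lower() is a hypothetical helper written only for illustration, and it assumes the mbstring extension and PHP 7.2 or later (for mb_ord() and mb_chr()).

```php
<?php
// Sketch only: roman_numeral_to_lower() is a hypothetical helper, not part of PHP.
// Assumes the mbstring extension and PHP 7.2+ (mb_ord() / mb_chr()).
function roman_numeral_to_lower(string $char): string
{
    $code = mb_ord($char, 'UTF-8');
    // U+2160..U+216F (decimal 8544..8559) are the uppercase Roman numerals;
    // each lowercase form is exactly 16 code points higher.
    if ($code >= 0x2160 && $code <= 0x216F) {
        return mb_chr($code + 16, 'UTF-8');
    }
    // Any other character is returned unchanged.
    return $char;
}

var_dump(mb_ord('Ⅰ', 'UTF-8'));                 // int(8544)
var_dump(roman_numeral_to_lower('Ⅿ') === 'ⅿ');  // bool(true)
```

This only covers the sixteen Roman numerals; the other characters discussed in this note do not share a single fixed offset, which is why a lookup table is used for them.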
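Since I did not find any proper solution on the web for mapping all UTF-8 strings to their lowercase counterparts in PHP, I offer the following hard-coded, extended mb_strtolower function for UTF-8 strings.
The function wraps the existing mb_strtolower() and additionally replaces uppercase UTF-8 characters for which a lowercase representation exists. Since I could not find any proper table of Unicode uppercase and lowercase characters on the web, I checked the first million Unicode code points against Google Search and the Keyword Tool, and identified the following 78 characters as uppercase characters that are not replaced by mb_strtolower but do have a UTF-8 lowercase counterpart.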
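<?php
// Lowercases a UTF-8 string, including characters that mb_strtolower() leaves untouched.
function strtolower_utf8_extended( $utf8_string )
{
    // Uppercase => lowercase pairs that mb_strtolower() does not convert.
    $additional_replacements = array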
( "Dž" => "dž" , "Lj" => "lj" , "Nj" => "nj" , "Dz" => "dz" , "Ϸ" => "ϸ" , "Ϲ" => "ϲ" , "Ϻ" => "ϻ" , "ᾈ" => "ᾀ" , "ᾉ" => "ᾁ" , "ᾊ" => "ᾂ" , "ᾋ" => "ᾃ" , "ᾌ" => "ᾄ" , "ᾍ" => "ᾅ" , "ᾎ" => "ᾆ" , "ᾏ" => "ᾇ" , "ᾘ" => "ᾐ" , "ᾙ" => "ᾑ" , "ᾚ" => "ᾒ" , "ᾛ" => "ᾓ" , "ᾜ" => "ᾔ" , "ᾝ" => "ᾕ" , "ᾞ" => "ᾖ" , "ᾟ" => "ᾗ" , "ᾨ" => "ᾠ" , "ᾩ" => "ᾡ" , "ᾪ" => "ᾢ" , "ᾫ" => "ᾣ" , "ᾬ" => "ᾤ" , "ᾭ" => "ᾥ" , "ᾮ" => "ᾦ" , "ᾯ" => "ᾧ" , "ᾼ" => "ᾳ" , "ῌ" => "ῃ" , "ῼ" => "ῳ" , "Ⅰ" => "ⅰ" , "Ⅱ" => "ⅱ" , "Ⅲ" => "ⅲ" , "Ⅳ" => "ⅳ" , "Ⅴ" => "ⅴ" , "Ⅵ" => "ⅵ" , "Ⅶ" => "ⅶ" , "Ⅷ" => "ⅷ" , "Ⅸ" => "ⅸ" , "Ⅹ" => "ⅹ" , "Ⅺ" => "ⅺ" , "Ⅻ" => "ⅻ" , "Ⅼ" => "ⅼ" , "Ⅽ" => "ⅽ" , "Ⅾ" => "ⅾ" , "Ⅿ" => "ⅿ" , "Ⓐ" => "ⓐ" , "Ⓑ" => "ⓑ" , "Ⓒ" => "ⓒ" , "Ⓓ" => "ⓓ" , "Ⓔ" => "ⓔ" , "Ⓕ" => "ⓕ" , "Ⓖ" => "ⓖ" , "Ⓗ" => "ⓗ" , "Ⓘ" => "ⓘ" , "Ⓙ" => "ⓙ" , "Ⓚ" => "ⓚ" , "Ⓛ" => "ⓛ" , "Ⓜ" => "ⓜ" , "Ⓝ" => "ⓝ" , "Ⓞ" => "ⓞ" , "Ⓟ" => "ⓟ" , "Ⓠ" => "ⓠ" , "Ⓡ" => "ⓡ" , "Ⓢ" => "ⓢ" , "Ⓣ" => "ⓣ" , "Ⓤ" => "ⓤ" , "Ⓥ" => "ⓥ" , "Ⓦ" => "ⓦ" , "Ⓧ" => "ⓧ" , "Ⓨ" => "ⓨ" , "Ⓩ" => "ⓩ" , "𐐦" => "𐑎" , "𐐧" => "𐑏" );
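    // First let mb_strtolower() convert everything it knows about.
    $utf8_string = mb_strtolower( $utf8_string, "UTF-8");
    // Then replace the remaining uppercase characters using the table.
    $utf8_string = strtr( $utf8_string, $additional_replacements );
    return $utf8_string;
}
?>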
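Which of these pairs mb_strtolower misses may depend on the PHP version and mbstring build in use, so it can be worth checking before relying on the hard-coded table. The following is a rough sketch under that assumption: pairs_still_needed() is a hypothetical helper, and the three-entry table is just a small sample of the 78 pairs, chosen for illustration.

```php
<?php
// Sketch only: pairs_still_needed() is a hypothetical helper, not part of PHP.
// It reports which uppercase => lowercase pairs the installed mb_strtolower()
// does not already handle. Requires the mbstring extension.
function pairs_still_needed(array $pairs): array
{
    $needed = array();
    foreach ($pairs as $upper => $lower) {
        // Keep the pair only if the built-in function fails to produce
        // the expected lowercase form.
        if (mb_strtolower($upper, 'UTF-8') !== $lower) {
            $needed[$upper] = $lower;
        }
    }
    return $needed;
}

// A small sample taken from the table of 78 pairs.
$sample = array('Ⅰ' => 'ⅰ', 'Ⓐ' => 'ⓐ', 'ᾈ' => 'ᾀ');
$needed = pairs_still_needed($sample);
printf("%d of %d sample pairs still need a manual replacement\n", count($needed), count($sample));
```

Running the same check over the full table shows how much of it a given installation actually requires; the count printed will vary between environments.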