Monday, May 31, 2010

utf8 in php5

utf8_encode() and utf8_decode() are awful, because they only convert between ISO-8859-1 (latin1) and UTF-8, so they only cover the Latin-1 portion of Unicode. Also, for reference: in U+ff01, the ff01 is called the code point.
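
To make the Latin-1 limitation concrete, here is a small sketch. It assumes a PHP version where both functions still exist (PHP 8.2 marks them deprecated, so the sketch silences deprecation notices), and the variable names are only for illustration.

```php
<?php
// Silence the PHP 8.2+ deprecation notices for utf8_encode()/utf8_decode().
error_reporting(E_ALL & ~E_DEPRECATED);

// A Latin-1 byte converts cleanly: 0xE9 is "é" in ISO-8859-1.
$latin1 = "\xE9";
$asUtf8 = utf8_encode($latin1);
echo bin2hex($asUtf8), "\n";          // c3a9

// A character outside Latin-1 cannot survive utf8_decode().
$ni = "\xE4\xBD\xA0";                 // U+4F60 encoded as UTF-8
$decoded = utf8_decode($ni);
echo $decoded, "\n";                  // ?
```

The second call loses the character entirely, which is why these two functions are no help for anything beyond Latin-1.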

Let's say I have the Unicode character U+4f60 (你), and I want to echo it in PHP.

PHP is not very UTF-8 friendly. To get it to work, you have to be using a computer with the right language packs installed, edit the source file in UTF-8, make sure every page sends header('Content-type: text/html; charset=utf-8'); and you may still need to echo the UTF-8 meta tag. Save the file as UTF-8 without a byte order mark, though: a BOM counts as output, and header() has to run before any output is sent. But how do you echo Unicode if your source file can only be in the latin1 character set?
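
The header-plus-meta-tag routine might look like the sketch below. The helper name utf8_page() is hypothetical, and the headers_sent() guard is an assumption added so the snippet also runs from the command line.

```php
<?php
// Hypothetical helper: declare UTF-8 in the HTTP header and in the markup.
function utf8_page($body)
{
    // header() only works before any output has been sent.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    return '<html><head>'
         . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
         . '</head><body>' . $body . '</body></html>';
}

// "\xE4\xBD\xA0" is U+4F60 written as raw UTF-8 bytes, so this source
// file itself can stay pure ASCII.
echo utf8_page("\xE4\xBD\xA0"), "\n";
```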

One way is to echo the code point. In Java, "\u4f60" represents the character 你. Not so in PHP. The best you can do is one of these:

<?php echo json_decode('"\u4f60"');?>
<?php echo html_entity_decode("&#x4f60;",ENT_QUOTES,"UTF-8");?>
<?php echo "\xE4\xBD\xA0"; /* already encoded in utf8 */ ?>
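
As a quick sanity check, the sketch below compares the raw bytes produced by the three approaches. The variable names are made up for the example, and it assumes the json extension is available, as it is in a default PHP build.

```php
<?php
// Single quotes keep \u4f60 literal, so json_decode() sees the JSON escape.
$fromJson   = json_decode('"\u4f60"');
$fromEntity = html_entity_decode('&#x4f60;', ENT_QUOTES, 'UTF-8');
$fromBytes  = "\xE4\xBD\xA0";

echo bin2hex($fromJson), "\n";     // e4bda0
echo bin2hex($fromEntity), "\n";   // e4bda0
echo bin2hex($fromBytes), "\n";    // e4bda0

// All three are the same three-byte string.
var_dump($fromJson === $fromEntity && $fromEntity === $fromBytes);  // bool(true)
```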

But there is another way, with 2 basic functions:

function utf8($num) {
  // Encode a code point as 1 to 4 UTF-8 bytes (valid Unicode ends at 0x10FFFF).
  if($num<=0x7F)     return chr($num);
  if($num<=0x7FF)    return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<=0xFFFF)   return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<=0x1FFFFF) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
  return '';
}

function uniord($c) {
  // Decode the first UTF-8 character of $c back into its code point.
  $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
  $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
  $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
  $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
  return false;
}

<?php echo utf8(0x4f60);?>
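
To see where the three bytes come from, here is the 3-byte branch worked through by hand for 0x4f60. This is only an illustrative sketch; the variable names $lead, $mid and $tail are made up for the example.

```php
<?php
$num = 0x4f60;                       // 0100 1111 0110 0000

$lead = ($num >> 12) + 224;          // top 4 bits behind the 1110 marker
$mid  = (($num >> 6) & 63) + 128;    // middle 6 bits behind the 10 marker
$tail = ($num & 63) + 128;           // low 6 bits behind the 10 marker

printf("%08b %08b %08b\n", $lead, $mid, $tail);  // 11100100 10111101 10100000
printf("%02X %02X %02X\n", $lead, $mid, $tail);  // E4 BD A0

// Gluing the three bytes together gives the UTF-8 string for U+4F60.
$bytes = chr($lead) . chr($mid) . chr($tail);
```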

Or you can convert back:
<?php echo dechex(uniord(utf8(0x4f60)));?>
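
The reverse direction is the same arithmetic undone. A minimal sketch of the 3-byte case, with the intermediate values spelled out (the name $codePoint is just for illustration):

```php
<?php
$c = "\xE4\xBD\xA0";                 // UTF-8 bytes for U+4F60

$ord0 = ord($c[0]);                  // 228, in the 224-239 range: 3-byte sequence
$ord1 = ord($c[1]);                  // 189
$ord2 = ord($c[2]);                  // 160

// Strip the 1110 / 10 / 10 markers and reassemble the 16 payload bits.
$codePoint = ($ord0 - 224) * 4096 + ($ord1 - 128) * 64 + ($ord2 - 128);

echo $codePoint, "\n";               // 20320
echo dechex($codePoint), "\n";       // 4f60
```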

Also helpful:
function is_utf8($string) {
  // From
  return preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]      # ASCII
    | [\xC2-\xDF][\x80-\xBF]       # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]    # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]    # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}   # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}      # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}   # plane 16
  )*$%xs', $string);
}
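
A shorter check, offered as an alternative and not as the regex above, is to lean on PCRE's /u modifier, which makes preg_match() fail on a subject that is not valid UTF-8. The helper name is_utf8_pcre() is hypothetical. One difference to keep in mind: the regex version only allows tab, LF and CR among the ASCII control characters, while this one accepts all of them.

```php
<?php
// Hypothetical helper: with /u, preg_match() returns false (not 0 or 1)
// when the subject string is not valid UTF-8.
function is_utf8_pcre($string)
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_pcre("\xE4\xBD\xA0"));  // bool(true)
var_dump(is_utf8_pcre("\xE4\xBD"));      // bool(false), truncated sequence
var_dump(is_utf8_pcre("\xC0\x80"));      // bool(false), overlong encoding
```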
