Multibyte String Functions

References

Multibyte character encoding schemes and their related issues are fairly complicated, and are beyond the scope of this documentation. Please refer to the following URLs and other resources for further information regarding these topics.

User Contributed Notes 32 notes

up
39
deceze at gmail dot com
5 years ago
Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php

$string  = '漢字はユニコード';
$needle  = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>

The usual problem is that the string is evaluated as a binary string, meaning PHP is not aware of encodings at all. Problems arise when you get a value "from outside" (a database, a POST request) and the needle and the haystack are not in the same encoding. That typically means the source code is not saved in the same encoding as the data you receive "from outside". The binary representations therefore don't match, and nothing is replaced.
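To make the mismatch concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the Shift_JIS conversion only stands in for data arriving "from outside", and the variable names are illustrative.

```php
<?php
$haystackUtf8 = '漢字はユニコード'; // assumes this file is saved as UTF-8
$needle       = 'は';

// Simulate input that arrives in another encoding (Shift_JIS here).
$haystackSjis = mb_convert_encoding($haystackUtf8, 'SJIS', 'UTF-8');

// The UTF-8 bytes of the needle do not occur in the Shift_JIS bytes,
// so str_replace() finds nothing to replace.
$unchanged = str_replace($needle, 'Foo', $haystackSjis);

// Converting the input to the source encoding first makes the bytes comparable.
$normalized = mb_convert_encoding($haystackSjis, 'UTF-8', 'SJIS');
$replaced   = str_replace($needle, 'Foo', $normalized);

var_dump($unchanged === $haystackSjis);
echo $replaced, "\n";
```

Normalizing everything to one encoding at the input boundary is usually simpler than converting needles and haystacks case by case.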
up
9
mdoocy at u dot washington dot edu
10 years ago
Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
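A short sketch of why an index cannot be used for random access: byte offsets and character offsets diverge as soon as a multibyte character appears. The sample string is written with byte escapes so it does not depend on the file encoding; the variable names are illustrative.

```php
<?php
// "a", "é" (2 bytes) and "漢" (3 bytes), written as UTF-8 byte escapes.
$s = "a\xC3\xA9\xE6\xBC\xA2";

$byteLength = strlen($s);             // counts bytes
$charLength = mb_strlen($s, 'UTF-8'); // counts characters

// Byte offset 1 lands in the middle of "é" ...
$byteSlice = substr($s, 1, 1);
// ... while character offset 1 is "é" itself, which can only be located
// by walking the string from the start.
$charSlice = mb_substr($s, 1, 1, 'UTF-8');

var_dump($byteLength, $charLength, bin2hex($byteSlice), $charSlice === "\xC3\xA9");
```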
up
9
Eugene Murai
12 years ago
PHP can input and output Unicode, but a little differently from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.

Example:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
up
6
mitgath at gmail dot com
8 years ago
According to:
http://bugs.php.net/bug.php?id=21317
here's the missing function:

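<?php
// str_pad() counts bytes, so widen the target length by the number of
// extra bytes that the multibyte characters in $input occupy.
function mb_str_pad ($input, $pad_length, $pad_string, $pad_style, $encoding="UTF-8") {
   return str_pad($input, strlen($input) - mb_strlen($input, $encoding) + $pad_length, $pad_string, $pad_style);
}
?>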
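The following sketch shows the effect of that byte compensation. The name mb_str_pad_sketch and its default arguments are hypothetical, chosen so the example does not clash with any existing function; like the version above, it is only correct when the pad string consists of single-byte characters.

```php
<?php
// Hypothetical helper for illustration: same idea as above, with defaults added.
function mb_str_pad_sketch($input, $pad_length, $pad_string = ' ', $pad_style = STR_PAD_RIGHT, $encoding = 'UTF-8')
{
    // Number of bytes that str_pad() would wrongly count as extra characters.
    $extraBytes = strlen($input) - mb_strlen($input, $encoding);

    return str_pad($input, $pad_length + $extraBytes, $pad_string, $pad_style);
}

$word = "\xC3\xA9t\xC3\xA9"; // "été": 3 characters, 5 bytes in UTF-8

$plain  = str_pad($word, 5, '-');           // already 5 bytes, so nothing is added
$padded = mb_str_pad_sketch($word, 5, '-'); // padded to 5 characters

echo $plain, "\n", $padded, "\n";
```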
up
2
php at kamiware dot org
11 months ago
str_replace is NOT multi-byte safe.

This Ukrainian word triggers a bug in the following code: відео

$rubishcharacters='[#|\[{}\]´`≠,;.:-\\_<>=*+"\'?()!§$&%';
$searchstring='відео';

$result = str_replace(str_split($rubishcharacters), ' ', $searchstring);
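One possible reading of the example above is that the damage comes from str_split(), which cuts the multibyte characters of the list into single bytes; some of those bytes also occur inside the Cyrillic letters. The sketch below compares that with a character-wise split via preg_split() and the u modifier. It assumes the file is saved as UTF-8, and the shortened character list and variable names are illustrative.

```php
<?php
$rubbish = '´≠§#'; // three multibyte characters and one ASCII character
$search  = 'відео';

// Byte-wise split: the list now contains lone bytes such as 0xB4,
// which is also the second byte of the letter "д".
$broken = str_replace(str_split($rubbish), ' ', $search);

// Character-wise split: each list entry is a complete UTF-8 character.
$chars = preg_split('//u', $rubbish, -1, PREG_SPLIT_NO_EMPTY);
$safe  = str_replace($chars, ' ', $search);

var_dump(mb_check_encoding($broken, 'UTF-8')); // is the result still valid UTF-8?
var_dump($safe === $search);
```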
up
4
Ben XO
8 years ago
PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.

<?php
/**
 * Trim characters from either (or both) ends of a string in a way that is
 * multibyte-friendly.
 *
 * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
 * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
 * course, the added bonus that you can put unicode characters in the charlist.
 *
 * We are using a PCRE character-class to do the trimming in a unicode-aware
 * way, so we must escape ^, \, - and ] which have special meanings here.
 * As you would expect, a single \ in the charlist is interpreted as
 * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
 * you can ignore this detail.
 *
 * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
 * because they can be extremely useful when dealing with UCS. '\pZ', for example,
 * matches every 'separator' character defined in Unicode, including non-breaking
 * and zero-width spaces.
 *
 * It doesn't make sense to have two or more of the same character in a character
 * class, therefore we interpret a double \ in the character list to mean a
 * single \ in the regex, allowing you to safely mix normal characters with PCRE
 * special classes.
 *
 * *Be careful* when using this bonus feature, as PHP also interprets backslashes
 * as escape characters before they are even seen by the regex. Therefore, to
 * specify '\\s' in the regex (which will be converted to the special character
 * class '\s' for trimming), you will usually have to put *4* backslashes in the
 * PHP code - as you can see from the default value of $charlist.
 *
 * @param string
 * @param charlist list of characters to remove from the ends of this string.
 * @param boolean trim the left?
 * @param boolean trim the right?
 * @return String
 */
function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
{
    $both_ends = $ltrim && $rtrim;

    $char_class_inner = preg_replace(
        array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
        array( '\\\\\\0', '\\' ),
        $charlist
    );

    $work_horse = '[' . $char_class_inner . ']+';
    $ltrim && $left_pattern = '^' . $work_horse;
    $rtrim && $right_pattern = $work_horse . '$';

    if($both_ends)
    {
        $pattern_middle = $left_pattern . '|' . $right_pattern;
    }
    elseif($ltrim)
    {
        $pattern_middle = $left_pattern;
    }
    else
    {
        $pattern_middle = $right_pattern;
    }

    // The stray closing parenthesis after this call has been removed.
    return preg_replace("/$pattern_middle/usSD", '', $string);
}
?>
up
2
treilor at gmail dot com
3 years ago
A small note for those who follow rawsrc at gmail dot com's advice: mb_split treats its first argument as a regular expression, so it may make more sense to use the built-in function mb_ereg_replace directly.
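A small sketch of what "uses regular expressions" means in practice. The helper names replace_via_mb_split and replace_via_explode are hypothetical, the example assumes mbstring was built with its regex functions (mb_split, mb_ereg_replace), and the file is assumed to be saved as UTF-8.

```php
<?php
mb_regex_encoding('UTF-8');

// Split-and-implode replacement: mb_split() treats $needle as a regular expression.
function replace_via_mb_split($needle, $replacement, $haystack)
{
    return implode($replacement, mb_split($needle, $haystack));
}

// Same idea with explode(), which treats $needle as a literal string.
function replace_via_explode($needle, $replacement, $haystack)
{
    return implode($replacement, explode($needle, $haystack));
}

$viaRegex   = replace_via_mb_split('.', '-', 'a.b'); // "." matches every character
$viaLiteral = replace_via_explode('.', '-', 'a.b');  // only the literal dot is replaced

// When a regular expression is wanted anyway, mb_ereg_replace() does it in one call.
$viaEreg = mb_ereg_replace('は', 'Foo', '漢字はユニコード');

var_dump($viaRegex === 'a-b', $viaLiteral === 'a-b');
echo $viaEreg, "\n";
```

A needle containing characters such as ".", "*" or "(" therefore needs escaping before it is passed to the mb_split-based version.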
up
2
Daniel Rhodes
4 years ago
Here's a cheap and cheeky function to remove leading and trailing *punctuation* (or more specifically "non-word characters") from a UTF-8 string in whatever language. (At least it works well enough for Japanese and English.)

/**
* Trim singlebyte and multibyte punctuation from the start and end of a string
*
* @author Daniel Rhodes
* @note we want the first non-word grabbing to be greedy but then
* @note we want the dot-star grabbing (before the last non-word grabbing)
* @note to be ungreedy
*
* @param string $string input string in UTF-8
* @return string as $string but with leading and trailing punctuation removed
*/
function mb_punctuation_trim($string)
{
    preg_match('/^[^\w]{0,}(.*?)[^\w]{0,}$/iu', $string, $matches); //case-'i'nsensitive and 'u'ngreedy
   
    if(count($matches) < 2)
    {
        //some strange error so just return the original input
        return $string;
    }
   
    return $matches[1];
}

Hope you like it!
up
2
johannesponader at dontspamme dot googlemail dot co
7 years ago
Please note that when migrating code to handle UTF-8, the functions mentioned here are not the only ones that matter: calls to htmlentities() also have to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there may be more functions that need adjustments like this.
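A minimal sketch of the difference the third argument makes. The input is written with byte escapes so it is UTF-8 regardless of the file encoding; the variable names are illustrative.

```php
<?php
$name = "caf\xC3\xA9"; // "café" encoded as UTF-8

// Treated as ISO-8859-1, the two bytes of "é" become two separate entities.
$wrong = htmlentities($name, ENT_COMPAT, 'ISO-8859-1');

// Treated as UTF-8, they are recognised as one character.
$right = htmlentities($name, ENT_COMPAT, 'UTF-8');

echo $wrong, "\n", $right, "\n";
```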
up
3
roydukkey at roydukkey dot com
7 years ago
This would be one way to create a multibyte substr_replace function

<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen) . $replace . mb_substr($output, $posClose + 1);
}
?>
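A usage sketch for this approach. It uses a hypothetical variant, mb_substr_replace_sketch, that takes the encoding explicitly instead of relying on mb_internal_encoding(); note that both positions are character offsets and that $posClose is inclusive. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical variant for illustration: replaces characters $posOpen..$posClose (inclusive).
function mb_substr_replace_sketch($text, $replace, $posOpen, $posClose, $encoding = 'UTF-8')
{
    $head = mb_substr($text, 0, $posOpen, $encoding);
    $tail = mb_substr($text, $posClose + 1, null, $encoding);

    return $head . $replace . $tail;
}

$text = '漢字はユニコード';

$one = mb_substr_replace_sketch($text, 'Foo', 2, 2); // replace only the third character
$two = mb_substr_replace_sketch($text, 'X', 0, 1);   // replace the first two characters

echo $one, "\n", $two, "\n";
```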
up
3
Anonymous
3 years ago
Yet another single-line mb_trim() function

<?php
function mb_trim($string, $trim_chars = '\s'){
    return preg_replace('/^['.$trim_chars.']*(?U)(.*)['.$trim_chars.']*$/u', '\\1', $string);
}
$string = '           "some text."      ';
echo mb_trim($string, '\s".');
//some text
?>
up
6
rawsrc at gmail dot com
6 years ago
Hi,

For those who are looking for mb_str_replace, here's a simple function:
<?php
function mb_str_replace($needle, $replacement, $haystack) {
   return implode($replacement, mb_split($needle, $haystack));
}
?>
I haven't found a simpler way to proceed :-)
up
1
daniel at softel dot jp
11 years ago
Note that although "multi-byte" hints at total internationalization, the mb_ API was designed by a Japanese person to support the Japanese language.

Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.

It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.

I don't mean any disrespect to the mb_ API because I'm using it every day and I appreciate its usefulness, but maybe a better name would be the jp_ API.
up
1
rr_news at live dot de
10 months ago
The suggestion from "mt at mediamedics dot nl" is not as bad as the down votes indicate. There is only one small bug, which is easily fixed.
The head of the "for" loop needs to be modified by replacing $i + $split_length with $i += $split_length.

Here is the full working code, with an additional check to verify that the function doesn't already exist:

<?php
if ( !function_exists('mb_str_split') )
{
    function mb_str_split($string, $split_length = 1)
    {
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        $split_length = ($split_length <= 0) ? 1 : $split_length;

        $mb_strlen = mb_strlen($string, 'utf-8');

        $array = array();

        for($i = 0; $i < $mb_strlen; $i += $split_length)
        {
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
}
?>
up
0
mattr at telebody dot com
3 years ago
A brief note on Daniel Rhodes' mb_punctuation_trim().
The regular expression modifier u does not mean ungreedy, rather it means the pattern is in UTF-8 encoding. Instead the U modifier should be used to get ungreedy behavior. (I have not otherwise tested his code.)
See http://php.net/manual/en/reference.pcre.pattern.modifiers.php
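The two modifiers can be told apart with a short sketch; the sample strings and variable names are illustrative.

```php
<?php
$e = "\xC3\xA9"; // "é" as two UTF-8 bytes

// Lowercase u: pattern and subject are treated as UTF-8.
$withoutU = preg_match('/^.$/', $e);  // "." matches a single byte, so this fails
$withU    = preg_match('/^.$/u', $e); // "." matches one UTF-8 character

// Uppercase U: quantifiers become ungreedy by default.
preg_match('/a.*b/', 'aXbXb', $greedy);
preg_match('/a.*b/U', 'aXbXb', $lazy);
// A "?" after the quantifier has the same effect for that one quantifier.
preg_match('/a.*?b/', 'aXbXb', $lazyToo);

var_dump($withoutU, $withU, $greedy[0], $lazy[0], $lazyToo[0]);
```

Since the pattern in mb_punctuation_trim() already writes .*? explicitly, it gets the ungreedy match without the U modifier.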