PHP HTML truncate and UTF-8

Question

I need to truncate a string to a specified length, ignoring HTML tags. I found an appropriate function here.

I made small changes to it, adding output buffering with ob_start().

The problem is with UTF-8. If the last character of the truncated string is one of [ą,č,ę,ė,į,š,ų,ū,ž], I get a REPLACEMENT CHARACTER (U+FFFD, �) at the end of the string.

Here is my code. You can copy and paste it to try it yourself:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>String truncate</title>
</head>
<?php
$html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';
$html = html_truncate(27, $html);
echo $html;

/* Truncate HTML, close opened tags
 *
 * @param int, maxlength of the string
 * @param string, html
 * @return $html
 */
function html_truncate($maxLength, $html){
    $printedLength = 0;
    $position = 0;
    $tags = array();

    ob_start();

    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength){
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);

        if ($tag[0] == '&'){
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else{
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/'){
                // This is a closing tag.
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/'){
                // Self-closing tag.
                print($tag);
            }
            else{
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));

    $bufferOuput = ob_get_contents();
    ob_end_clean();

    $html = $bufferOuput;
    return $html;
}
?>
<body>
</body>
</html>

The function's output looks like this:

Koks nors tekstas.
Lietuvi�

Any ideas why this function is messing up UTF-8?

possible duplicate of UTF-8 compatible truncate function

– user, Commented Mar 18, 2014 at 3:59

hakre · Accepted Answer · 2011-11-22 13:01:35Z

Any ideas why this function is messing up UTF-8?

The general problem is that the function is not written for UTF-8 strings. It assumes a single-byte charset such as US-ASCII or Latin-1.

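You want to make the function compatible with UTF-8, which is a multibyte encoding. Each of the characters you list takes two bytes in UTF-8, so a cut that counts bytes can land in the middle of a character.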
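A small sketch of the difference, using a word from the question's test string. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// "Lietuviškas" contains "š" (U+0161), which UTF-8 encodes as two bytes.
$word = 'Lietuviškas';

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

// Cutting after 8 bytes keeps only the first byte of "š".
$byteCut = substr($word, 0, 8);
// Cutting after 8 characters keeps the whole "š".
$charCut = mb_substr($word, 0, 8, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte cut still valid UTF-8?
var_dump(mb_check_encoding($charCut, 'UTF-8')); // is the character cut valid UTF-8?
```

The byte-based cut leaves a lone lead byte at the end of the string, which a browser renders as U+FFFD.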

To make it compatible, verify that each string function used inside it handles UTF-8 properly:

preg_match needs a pattern with the u modifier to work on UTF-8 strings.
substr needs to be replaced with mb_substr.
strlen needs to be replaced with mb_strlen.
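The changes listed above could be applied roughly as follows. This is a sketch, not a drop-in replacement: the name html_truncate_utf8 is made up here, it builds a string instead of using output buffering, and it assumes valid UTF-8 input with properly nested tags. One detail to watch: offsets returned with PREG_OFFSET_CAPTURE are byte offsets even with the u modifier, so the sketch keeps substr and strlen for positions in $html and uses mb_strlen and mb_substr only for counting and cutting the visible text.

```php
<?php
/**
 * Truncate HTML to $maxLength visible characters and close any open tags.
 * Sketch only: assumes valid UTF-8 input and properly nested tags.
 */
function html_truncate_utf8(int $maxLength, string $html): string
{
    $printedLength = 0; // visible characters emitted so far
    $position = 0;      // byte offset into $html
    $tags = [];         // stack of currently open tag names
    $out = '';

    $pattern = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}u';
    while ($printedLength < $maxLength
        && preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        [$tag, $tagPosition] = $match[0];

        // Text before the tag. Offsets are in bytes, so substr is correct here.
        $str = substr($html, $position, $tagPosition - $position);
        $strLength = mb_strlen($str, 'UTF-8');
        if ($printedLength + $strLength > $maxLength) {
            // Cut on a character boundary, not a byte boundary.
            $out .= mb_substr($str, 0, $maxLength - $printedLength, 'UTF-8');
            $printedLength = $maxLength;
            break;
        }
        $out .= $str;
        $printedLength += $strLength;

        if ($tag[0] === '&') {
            // An entity counts as one visible character.
            $out .= $tag;
            $printedLength++;
        } elseif ($tag[1] === '/') {
            // Closing tag.
            array_pop($tags);
            $out .= $tag;
        } elseif (substr($tag, -2, 1) === '/') {
            // Self-closing tag.
            $out .= $tag;
        } else {
            // Opening tag.
            $out .= $tag;
            $tags[] = $match[1][0];
        }

        // Continue after the tag (byte offset again).
        $position = $tagPosition + strlen($tag);
    }

    // Remaining text after the last tag, if there is room left.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $rest = substr($html, $position);
        $out .= mb_substr($rest, 0, $maxLength - $printedLength, 'UTF-8');
    }

    // Close any tags that are still open.
    while (!empty($tags)) {
        $out .= sprintf('</%s>', array_pop($tags));
    }

    return $out;
}

echo html_truncate_utf8(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
```

With the question's input and a limit of 27, the last paragraph ends in "Lietuviš" instead of a broken byte.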

As you're dealing with HTML, it's probably safer to use DOMDocument to manipulate the HTML chunk. That's just a note, but it's much more flexible and works properly.
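A rough idea of what a DOMDocument version might look like. The function names dom_truncate and truncate_node are hypothetical, and the sketch assumes UTF-8 input plus the dom and mbstring extensions. Non-ASCII characters are converted to numeric entities before parsing, because loadHTML does not assume UTF-8 by default. Depending on the libxml version, saveHTML may write non-ASCII characters back as entities, which browsers display the same way.

```php
<?php
/**
 * Truncate an HTML fragment to $maxLength characters of text via the DOM.
 * Hypothetical helper; assumes UTF-8 input.
 */
function dom_truncate(int $maxLength, string $html): string
{
    // Encode non-ASCII characters as numeric entities so the parser
    // does not have to guess the input encoding.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // Wrap the fragment in a div so it has a single known root.
    $doc->loadHTML('<div>' . $ascii . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $maxLength;
    truncate_node($root, $remaining);

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

/** Walk the tree in document order, trimming text once the budget runs out. */
function truncate_node(DOMNode $node, int &$remaining): void
{
    // Copy the child list first, because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            // Budget used up: drop everything that follows.
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                $length = $remaining;
            }
            $remaining -= $length;
        } elseif ($child instanceof DOMElement) {
            truncate_node($child, $remaining);
        }
    }
}

echo dom_truncate(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
```

Because the parser builds the tree, open tags are closed by the serializer and there is no tag stack to maintain by hand.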

Maxime Pacary · Accepted Answer · 2011-11-22 12:47:35Z

I would suggest simply using a Unicode-safe substring function such as mb_substr() to truncate the Unicode strings.

So basically, try replacing all substr() occurrences with mb_substr().

Before that, check that the mbstring PHP extension is enabled in your environment.
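One way to do that check from code is sketched below; running php -m on the command line lists the loaded modules as well. The variable name is only illustrative.

```php
<?php
// extension_loaded() reports whether a PHP extension is available at runtime.
$hasMbstring = extension_loaded('mbstring');

echo $hasMbstring ? 'mbstring is available' : 'mbstring is missing', PHP_EOL;

if ($hasMbstring) {
    // Optional: set the default encoding so mb_* calls can omit the argument.
    mb_internal_encoding('UTF-8');
    echo mb_substr('žodis', 0, 1), PHP_EOL;
}
```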

Community · Accepted Answer · 2023-11-17 20:08:53Z

You are looking for mb_strlen() and the associated mb_* functions.

– Jacco, answered Nov 22, 2011 at 12:48 (edited Nov 17, 2023 at 20:08)

user1058988 · Accepted Answer · 2011-11-23 07:43:40Z

Just use the following function:

echo utf8_encode($match[0]); // $match[0] is the variable you want to print
