2

I need to truncate a string to a specified length, ignoring HTML tags. I found an appropriate function here.

I made light changes to it and added output buffering with ob_start().

The problem is with UTF-8. If the last character of the truncated string is one of [ą, č, ę, ė, į, š, ų, ū, ž], I get the REPLACEMENT CHARACTER U+FFFD � at the end of the string.

Here is my code. You can copy-paste it and try it yourself:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>String truncate</title>
</head>
<?php
$html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';
$html = html_truncate(27, $html);
echo $html;

/* Truncate HTML, close opened tags
 *
 * @param int, maxlength of the string
 * @param string, html
 * @return $html
 */
function html_truncate($maxLength, $html){
    $printedLength = 0;
    $position = 0;
    $tags = array();

    ob_start();

    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength){
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);

        if ($tag[0] == '&'){
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else{
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/'){
                // This is a closing tag.
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/'){
                // Self-closing tag.
                print($tag);
            }
            else{
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));

    $bufferOuput = ob_get_contents();
    ob_end_clean();

    $html = $bufferOuput;
    return $html;
}
?>
<body>
</body>
</html>

The function's output looks like this:

Koks nors tekstas.
Lietuvi�

Any ideas why this function is messing up UTF-8?

4 Answers

1

Any ideas why this function is messing up UTF-8?

The general problem is that the function is not written for UTF-8 strings: it assumes US-ASCII, Latin-1 or another single-byte charset, where one character is always one byte. substr() and strlen() count bytes, so a cut can land in the middle of a multibyte character such as š, which is what produces the U+FFFD replacement character.

You want to make the function compatible with UTF-8, which is a multibyte encoding.

To do that, verify that each string function used inside it (preg_match(), substr() and strlen() here) handles multibyte UTF-8 correctly.

As you're dealing with HTML, it's probably safer to use DOMDocument to manipulate the HTML chunk. That's just a note, but it's much more flexible and handles the markup properly.
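
To illustrate the byte-versus-character point, here is a minimal sketch using the word from the question. It assumes the script file is saved as UTF-8, that "š" is the precomposed character U+0161, and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$word = 'Lietuviškas';

echo strlen($word), "\n";             // 12: bytes ("š" takes two bytes in UTF-8)
echo mb_strlen($word, 'UTF-8'), "\n"; // 11: characters

$byteCut = substr($word, 0, 8);             // 8 bytes: stops in the middle of "š"
$charCut = mb_substr($word, 0, 8, 'UTF-8'); // 8 characters: "Lietuviš"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

// The dangling byte is the lead byte of "š" (c5 a1); a browser shows it as U+FFFD.
echo bin2hex(substr($byteCut, -1)), "\n"; // c5
```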

1

I would suggest simply using a Unicode-safe substring function such as mb_substr() to truncate the Unicode strings.

So basically, try replacing the substr() calls with mb_substr().

Before that, check that the mbstring PHP module is enabled in your environment.
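
As a rough sketch of that suggestion applied to the question's function: the name html_truncate_utf8 is hypothetical, and the code assumes UTF-8 input and PHP 7.1 or later. Because PREG_OFFSET_CAPTURE reports byte offsets, the sketch keeps substr() and strlen() for moving through $html and uses the mb_* functions only for counting and cutting visible text. It also builds the result in a string instead of an output buffer, and stops before an entity that would exceed the limit.

```php
<?php
// Sketch only: a UTF-8-aware variant of the question's html_truncate().
if (!extension_loaded('mbstring')) {
    throw new RuntimeException('The mbstring extension is required.');
}

function html_truncate_utf8(int $maxLength, string $html): string
{
    $printedLength = 0; // visible characters emitted so far (not bytes)
    $position = 0;      // byte offset into $html
    $tags = [];         // stack of currently open tag names
    $out = '';
    $pattern = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength
        && preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        [$tag, $tagPosition] = $match[0];

        // Offsets from preg_match() are bytes, so slice with substr() ...
        $str = substr($html, $position, $tagPosition - $position);
        // ... but measure and cut the text in characters.
        $strLength = mb_strlen($str, 'UTF-8');
        if ($printedLength + $strLength > $maxLength) {
            $out .= mb_substr($str, 0, $maxLength - $printedLength, 'UTF-8');
            $printedLength = $maxLength;
            break;
        }
        $out .= $str;
        $printedLength += $strLength;

        if ($tag[0] === '&') {
            if ($printedLength >= $maxLength) {
                break; // no room left for the entity
            }
            $out .= $tag; // an entity counts as one character
            $printedLength++;
        } else {
            if ($tag[1] === '/') {
                array_pop($tags);          // closing tag
            } elseif ($tag[strlen($tag) - 2] !== '/') {
                $tags[] = $match[1][0];    // opening tag (not self-closing)
            }
            $out .= $tag;
        }

        // Continue after the tag (byte offset again).
        $position = $tagPosition + strlen($tag);
    }

    // Append any remaining text, cut by characters.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $out .= mb_substr(substr($html, $position), 0, $maxLength - $printedLength, 'UTF-8');
    }

    // Close any tags that are still open.
    while (!empty($tags)) {
        $out .= sprintf('</%s>', array_pop($tags));
    }

    return $out;
}

echo html_truncate_utf8(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
// <b>Koks nors tekstas</b>. <p>Lietuviš</p>
```

Replacing every substr() and strlen() call blindly would not be enough here, because the positions returned by preg_match() stay in bytes; mixing them with character counts would shift the slices.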

1

You are looking for:

mb_strlen()

and the associated mb_* functions.
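
For a sense of how a few of those functions differ from their byte-based counterparts, here is a small sketch. It assumes a UTF-8 source file and the mbstring extension; mb_internal_encoding() sets the default encoding so it does not have to be passed to each call.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
mb_internal_encoding('UTF-8'); // default encoding for the mb_* calls below

$text = 'Lietuviškas žodis';

echo strpos($text, 'k'), "\n";    // 9: byte offset ("š" occupies two bytes)
echo mb_strpos($text, 'k'), "\n"; // 8: character offset
echo mb_strlen('žodis'), "\n";    // 5: characters (strlen() would report 6 bytes)
echo mb_strtoupper($text), "\n";  // LIETUVIŠKAS ŽODIS
```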

0

Just use the following function:

echo utf8_encode($match[0]); // $match[0] is the variable you want to print
