I need to truncate a string to a specified length while ignoring HTML tags. I found a suitable function here.
I made light changes to it and added output buffering with ob_start().
The problem is with UTF-8. If the last character of the truncated string is one of [ą, č, ę, ė, į, š, ų, ū, ž], I get the REPLACEMENT CHARACTER U+FFFD � at the end of the string.
Here is my code. You can copy-paste it and try it yourself:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>String truncate</title>
    </head>
    <?php
    $html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';
    $html = html_truncate(27, $html);
    echo $html;

    /* Truncate HTML, close opened tags
     *
     * @param int, maxlength of the string
     * @param string, html
     * @return $html
     */
    function html_truncate($maxLength, $html){
        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + strlen($str) > $maxLength){
                print(substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.
                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.
                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(substr($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
            printf('</%s>', array_pop($tags));

        $bufferOuput = ob_get_contents();
        ob_end_clean();

        $html = $bufferOuput;
        return $html;
    }
    ?>
    <body>
    </body>
    </html>

The result of this function looks like this:
Koks nors tekstas.
Lietuvi�
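To narrow it down, here is a smaller check with no HTML handling at all. It is only a sketch: it assumes the mbstring extension is available and that the file is saved as UTF-8, and `$word`, `$byteCut` and `$charCut` are names made up for this test. It suggests that substr() and strlen() count bytes rather than characters, so a cut can land in the middle of a two-byte character such as š:

```php
<?php
// Hypothetical minimal check, not part of html_truncate().
// Assumes mbstring is installed and this file is saved as UTF-8.
$word = 'Lietuviškas';

// strlen() counts bytes; mb_strlen() counts characters.
var_dump(strlen($word));             // 12, because 'š' occupies two bytes
var_dump(mb_strlen($word, 'UTF-8')); // 11

// Same cut length that html_truncate(27, ...) ends up applying to this word.
$byteCut = substr($word, 0, 8);             // stops after the first byte of 'š'
$charCut = mb_substr($word, 0, 8, 'UTF-8'); // stops after the whole 'š'

// Is each result still valid UTF-8?
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // false
var_dump(mb_check_encoding($charCut, 'UTF-8')); // true
```

If that is really what is happening, the byte-cut string ends in a lone lead byte, which the browser would show as �.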
Any ideas why this function breaks UTF-8 strings?