Ah, slugification
    // This function expects the input to be UTF-8 encoded.
    function slugify($text)
    {
        // Swap out every run of non-letter, non-digit characters for a -
        $text = preg_replace('/[^\\pL\d]+/u', '-', $text);

        // Trim off leading and trailing -'s
        $text = trim($text, '-');

        // Convert the letters that are left to their closest ASCII representation
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

        // Make the text lowercase
        $text = strtolower($text);

        // Strip out anything we haven't been able to convert
        $text = preg_replace('/[^-\w]+/', '', $text);

        return $text;
    }
This works fairly well. It first uses the Unicode properties of each character to decide whether it's a letter (\pL) or a digit (\d), and replaces every run of anything else with a single -. It then trims leading and trailing -'s, transliterates what is left to ASCII, lowercases it, and strips out any characters that couldn't be converted. (Fabrik's test returns "arvizturo-tukorfurogep")
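To see what each step contributes, here is a small sketch that runs the same pipeline but keeps every intermediate value. `slugify_stages` and its array keys are hypothetical names made up for illustration. The sample input is plain ASCII on purpose, since the result of `//TRANSLIT` on accented input can vary with the iconv implementation and locale.

```php
<?php
// Hypothetical helper: the same steps as slugify(), but every
// intermediate value is kept so each stage can be inspected.
function slugify_stages(string $text): array
{
    $stages = [];
    // Runs of non-letter, non-digit characters become a single dash
    $stages['dashed'] = preg_replace('/[^\pL\d]+/u', '-', $text);
    // Leading and trailing dashes are removed
    $stages['trimmed'] = trim($stages['dashed'], '-');
    // Transliterate to ASCII (a no-op for input that is already ASCII)
    $stages['ascii'] = iconv('utf-8', 'us-ascii//TRANSLIT', $stages['trimmed']);
    // Lowercase
    $stages['lower'] = strtolower($stages['ascii']);
    // Drop anything that is not a dash or a word character
    $stages['slug'] = preg_replace('/[^-\w]+/', '', $stages['lower']);
    return $stages;
}

print_r(slugify_stages('  Hello, World! PHP 5  '));
// dashed  => -Hello-World-PHP-5-
// trimmed => Hello-World-PHP-5
// ascii   => Hello-World-PHP-5
// lower   => hello-world-php-5
// slug    => hello-world-php-5
```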
I also tend to add in a list of stop words ("the", "of", "or", "a", etc.) so that those are removed from the slug. Don't filter on word length, though, or you'll strip out stuff like "php".
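A minimal sketch of the stop word step, assuming it runs on a slug that has already been through slugify(). The function name `remove_stop_words`, the default word list, and the fallback to the original slug when nothing would be left are all my own assumptions, not part of the function above.

```php
<?php
// Hypothetical helper: drops stop words from an already-slugified string.
// The default list is only an illustration; extend it to suit your content.
function remove_stop_words(string $slug, array $stopWords = ['the', 'of', 'or', 'a', 'an', 'and']): string
{
    $words = explode('-', $slug);

    // Match against the list rather than word length, so short but
    // meaningful words such as "php" survive.
    $kept = array_filter($words, function ($word) use ($stopWords) {
        return $word !== '' && !in_array($word, $stopWords, true);
    });

    // Assumption: if every word was a stop word, keep the original slug
    // instead of returning an empty string.
    return $kept ? implode('-', $kept) : $slug;
}

echo remove_stop_words('the-history-of-php-or-a-guide'), "\n"; // history-php-guide
```

Filtering after slugification keeps the comparison simple, because by then every word is lowercase ASCII and separated by a single dash.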