290

I am facing an issue with URLs, I want to be able to convert titles that could contain anything and have them stripped of all special characters so they only have letters and numbers and of course I would like to replace spaces with hyphens.

How would this be done? I've heard a lot about regular expressions (regex) being used...

0

3 Answers 3

837

This should do what you're looking for:

function clean($string) { $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens. return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars. } 

Usage:

echo clean('a|"bc!@£de^&$f g'); 

Will output: abcdef-g

Edit:

Hey, just a quick question, how can I prevent multiple hyphens from being next to each other? and have them replaced with just 1?

function clean($string) { $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens. $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars. return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one. } 
Sign up to request clarification or add additional context in comments.

7 Comments

@all Please note this won't work with UTF-8
@metal_fan what would that mean... i mean how bad can it be?
The first line should have a space as the first parameter. Right now it is just an empty string. Corrected here: $string = str_replace(' ', '-', $string);
@TerryHarvey this wouldn;t work if ' is the last character, e-g America' is the keyword you get and you need to clean it..
Is there a good reason the clean function does a str_replace before the preg_replace as the reg_replace takes care of the whitespace as well?
|
142

improved clean-up

The solution below has a "SEO friendlier" version:

function hyphenize($string) { $dict = array( "I'm" => "I am", "thier" => "their", // Add your own replacements here ); return strtolower( preg_replace( array( '#[\\s-]+#', '#[^A-Za-z0-9. -]+#' ), array( '-', '' ), // the full cleanString() can be downloaded from http://www.unexpectedit.com/php/php-clean-string-of-utf8-chars-convert-to-similar-ascii-char cleanString( str_replace( // preg_replace can be used to support more complicated replacements array_keys($dict), array_values($dict), urldecode($string) ) ) ) ); } function cleanString($text) { $utf8 = array( '/[áàâãªä]/u' => 'a', '/[ÁÀÂÃÄ]/u' => 'A', '/[ÍÌÎÏ]/u' => 'I', '/[íìîï]/u' => 'i', '/[éèêë]/u' => 'e', '/[ÉÈÊË]/u' => 'E', '/[óòôõºö]/u' => 'o', '/[ÓÒÔÕÖ]/u' => 'O', '/[úùûü]/u' => 'u', '/[ÚÙÛÜ]/u' => 'U', '/ç/' => 'c', '/Ç/' => 'C', '/ñ/' => 'n', '/Ñ/' => 'N', '/–/' => '-', // UTF-8 hyphen to "normal" hyphen '/[’‘‹›‚]/u' => ' ', // Literally a single quote '/[“”«»„]/u' => ' ', // Double quote '/ /' => ' ', // nonbreaking space (equiv. to 0x160) ); return preg_replace(array_keys($utf8), array_values($utf8), $text); } 

The rationale for the above functions (which I find way inefficient - the one below is better) is that a service that shall not be named apparently ran spelling checks and keyword recognition on the URLs.

After losing a long time on a customer's paranoias, I found out they were not imagining things after all -- their SEO experts [I am definitely not one] reported that, say, converting "Viaggi Economy Perù" to viaggi-economy-peru "behaved better" than viaggi-economy-per (the previous "cleaning" removed UTF8 characters; Bogotà became bogot, Medellìn became medelln and so on).

There were also some common misspellings that seemed to influence the results, and the only explanation that made sense to me is that our URL were being unpacked, the words singled out, and used to drive God knows what ranking algorithms. And those algorithms apparently had been fed with UTF8-cleaned strings, so that "Perù" became "Peru" instead of "Per". "Per" did not match and sort of took it in the neck.

In order to both keep UTF8 characters and replace some misspellings, the faster function below became the more accurate (?) function above. $dict needs to be hand tailored, of course.

Previous answer

A simple approach:

// Remove all characters except A-Z, a-z, 0-9, dots, hyphens and spaces // Note that the hyphen must go last not to be confused with a range (A-Z) // and the dot, NOT being special (I know. My life was a lie), is NOT escaped $str = preg_replace('/[^A-Za-z0-9. -]/', '', $str); // Replace sequences of spaces with hyphen $str = preg_replace('/ */', '-', $str); // The above means "a space, followed by a space repeated zero or more times" // (should be equivalent to / +/) // You may also want to try this alternative: $str = preg_replace('/\\s+/', '-', $str); // where \s+ means "zero or more whitespaces" (a space is not necessarily the // same as a whitespace) just to be sure and include everything 

Note that you might have to first urldecode() the URL, since %20 and + both are actually spaces - I mean, if you have "Never%20gonna%20give%20you%20up" you want it to become Never-gonna-give-you-up, not Never20gonna20give20you20up . You might not need it, but I thought I'd mention the possibility.

So the finished function along with test cases:

function hyphenize($string) { return ## strtolower( preg_replace( array('#[\\s-]+#', '#[^A-Za-z0-9. -]+#'), array('-', ''), ## cleanString( urldecode($string) ## ) ) ## ) ; } print implode("\n", array_map( function($s) { return $s . ' becomes ' . hyphenize($s); }, array( 'Never%20gonna%20give%20you%20up', "I'm not the man I was", "'Légeresse', dit sa majesté", ))); Never%20gonna%20give%20you%20up becomes never-gonna-give-you-up I'm not the man I was becomes im-not-the-man-I-was 'Légeresse', dit sa majesté becomes legeresse-dit-sa-majeste 

To handle UTF-8 I used a cleanString implementation found online (link broken since, but a stripped down copy with all the not-too-esoteric UTF8 characters is at the beginning of the answer; it's also easy to add more characters to it if you need) that converts UTF8 characters to normal characters, thus preserving the word "look" as much as possible. It could be simplified and wrapped inside the function here for performance.

The function above also implements converting to lowercase - but that's a taste. The code to do so has been commented out.

2 Comments

Great solution, thanks. One adjustment though, swap the 2nd array in hypenize preg_replace around to avoid word1 & word 2 becoming word1--word2, array( '', '-'),
strtr() is better suited to consume your associative $dict array. array_values() is not necessary when feeding $dict to str_replace(). Why is there a space listed in the negated character class if the existence of a space is made impossible by the earlier executed pattern with \s? ('#[^A-Za-z0-9. -]+#') Why not enjoy a smaller pattern by using the case-insensitive pattern modifier i?
56

Here, check out this function:

function seo_friendly_url($string){ $string = str_replace(array('[\', \']'), '', $string); $string = preg_replace('/\[.*\]/U', '', $string); $string = preg_replace('/&(amp;)?#?[a-z0-9]+;/i', '-', $string); $string = htmlentities($string, ENT_COMPAT, 'utf-8'); $string = preg_replace('/&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);/i', '\\1', $string ); $string = preg_replace(array('/[^a-z0-9]/i', '/[-]+/') , '-', $string); return strtolower(trim($string, '-')); } 

1 Comment

A literal hyphen does not benefit from being boxed inside of square braces. -+ means the same thing as your [-]+. This answer is missing its educational explanation. Why are you doing each of these steps? How does it work? Why should other developers copy this into their project? If you are going to unconditionally convert the resultant string to lower case, then why not simplify all of the earlier processes by calling strtolower() when the string is first received? The back-to-back preg_replace() calls can be consolidated because the function can take arrays of finds and replaces.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.