25

I would like to sanitize a string for use in a URL. This is basically what I need:

  1. Everything must be removed except alphanumeric characters, spaces, and dashes.
  2. Spaces should be converted into dashes.

E.g.

This, is the URL! 

must return

this-is-the-url 
1
  • Hi jens, I am clueless about the code and that's what I need help with. The only thing I know is that it should use preg_replace(), but I don't know what the regular expression should be. Thanks. Commented Jun 11, 2010 at 11:09

10 Answers 10

52
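function slug($z) {
    $z = strtolower($z);
    $z = preg_replace('/[^a-z0-9 -]+/', '', $z); // drop everything except a-z, 0-9, space, hyphen
    $z = str_replace(' ', '-', $z);              // spaces become hyphens
    return trim($z, '-');                        // strip leading and trailing hyphens
}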

7 Comments

Great, thanks. Just one edit: I want to remove dashes from the beginning and end before returning $z, just in case they exist.
-1: I'm reading between the lines of what SilentGhost intends rather than the code he/she has written. While this appears URL-safe, it comes at the cost of losing information. The right way to encode data for a URL is to use urlencode().
(I see it does the translation shown in the example - but not what atif089 asked for)
@symcbean urlencode is not what I needed because I want to eliminate symbols rather than convert them. So this is exactly what I wanted.
@mario: 1. it doesn't do the same processing; 2. it's a maintenance nightmare.
4

First strip unwanted characters

$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string); 

Then change spaces to underscores

$url = preg_replace('/\s/', '-', $new_string); 

Finally encode it ready for use

$new_url = urlencode($url); 

1 Comment

An underscore is a different character: _ is an underscore, - is a hyphen. Also, using urlencode on such a string doesn't change anything. You're also forgetting the hyphen in the first regex, and \s is not equivalent to the space character.
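The points raised in this comment can be checked with a minimal sketch. The input string below is a made-up example, not taken from the thread; it only illustrates that `\s` matches more than the space character and that `urlencode()` leaves an already-clean slug unchanged.

```php
<?php
// Hypothetical input containing a space, a tab, and a newline.
$input = "a b\tc\nd";

// \s matches any whitespace character: space, tab, newline, and so on.
$viaRegex = preg_replace('/\s/', '-', $input);

// str_replace(' ', ...) touches only the literal space character.
$viaLiteral = str_replace(' ', '-', $input);

var_dump($viaRegex === 'a-b-c-d');     // bool(true)
var_dump($viaLiteral === "a-b\tc\nd"); // bool(true)

// urlencode() leaves letters, digits, and hyphens untouched.
$encoded = urlencode('this-is-the-url');
var_dump($encoded === 'this-is-the-url'); // bool(true)
```

If only literal spaces should become hyphens, `str_replace()` is probably the closer match to the question as asked.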
3

The OP is not explicitly describing all of the attributes of a slug, but this is what I am gathering from the intent.

My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.

I find none of the earlier posted answers to achieve this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).

  1. convert all characters to lowercase
  2. replace all sequences of one or more non-alphanumeric characters with a single hyphen.
  3. trim the leading and trailing hyphens from the string.

I recommend the following one-liner which doesn't bother declaring single-use variables:

return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-'); 

Not shown in my demo link, here is an attempt to better handle multibyte strings, though it doesn't quite accommodate as many scenarios as Casimir's answer.

return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower(iconv('utf-8', 'ascii//translit', $string))), '-'); 

I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)

'This, is - - the URL!'     input
'this-is-the-url'           expected
'this-is-----the-url'       SilentGhost
'this-is-the-url'           mario
'This-is---the-URL'         Rooneyl
'This-is-the-URL'           AbhishekGoel
'This, is - - the URL!'     HelloHack
'This, is - - the URL!'     DenisMatafonov
'This,-is-----the-URL!'     AdeelRazaAzeemi
'this-is-the-url'           mickmackusa
---
'Mork & Mindy'              input
'mork-mindy'                expected
'mork--mindy'               SilentGhost
'mork-mindy'                mario
'Mork--Mindy'               Rooneyl
'Mork-Mindy'                AbhishekGoel
'Mork & Mindy'              HelloHack
'Mork & Mindy'              DenisMatafonov
'Mork-&-Mindy'              AdeelRazaAzeemi
'mork-mindy'                mickmackusa
---
'What the_underscore ?!?'   input
'what-the-underscore'       expected
'what-theunderscore'        SilentGhost
'what-the_underscore'       mario
'What-theunderscore-'       Rooneyl
'What-theunderscore-'       AbhishekGoel
'What the_underscore ?!?'   HelloHack
'What the_underscore ?!?'   DenisMatafonov
'What-the_underscore-?!?'   AdeelRazaAzeemi
'what-the-underscore'       mickmackusa
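As a rough way to reproduce the "expected" rows for the recommended one-liner, the sketch below wraps it in a function and runs it against the three inputs from the demonstration. The name `slugify()` is a hypothetical wrapper chosen for illustration; only the function body comes from this answer.

```php
<?php
// slugify() is a hypothetical wrapper name; the body is the one-liner from this answer.
function slugify(string $string): string
{
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
}

// Inputs and expected slugs copied from the demonstration output.
$cases = [
    'This, is - - the URL!'   => 'this-is-the-url',
    'Mork & Mindy'            => 'mork-mindy',
    'What the_underscore ?!?' => 'what-the-underscore',
];

foreach ($cases as $input => $expected) {
    $actual = slugify($input);
    echo $actual, ' ', ($actual === $expected ? 'ok' : 'MISMATCH'), PHP_EOL;
}
```

Because each run of non-alphanumeric characters is replaced in a single pass, no separate step is needed to collapse repeated hyphens.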


1

This will do it in a Unix shell (I just tried it on macOS):

$ tr -cs A-Za-z '-' < infile.txt > outfile.txt 

I got the idea from a blog post on More Shell, Less Egg


1

Try This

function clean($string) {
    $string = str_replace(' ', '-', $string);                  // Replaces all spaces with hyphens.
    $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string);   // Removes special chars.
    return preg_replace('/-+/', '-', $string);                 // Replaces multiple hyphens with a single one.
}

Usage:

echo clean('a|"bc!@£de^&$f g'); 

Will output: abcdef-g

source : https://stackoverflow.com/a/14114419/2439715


1

Using the intl Transliterator is a good option because it lets you handle complicated cases with a single set of rules. I added custom rules to illustrate how flexible it can be and how you can keep as much meaningful information as possible. Feel free to remove them and add your own rules.

$strings = [
    'This, is - - the URL!',
    'Holmes & Yoyo',
    'L’Œil de démon',
    'How to win 1000€?',
    '€, $ & other currency symbols',
    'Und die Katze fraß alle mäuse.',
    'Белите рози на София',
    'പോണ്ടിച്ചേരി സൂര്യനു കീഴിൽ',
];

$rules = <<<'RULES'
# Transliteration
:: Any-Latin ;
:: Latin-Ascii ;

# examples of custom replacements
'&' > ' and ' ;
[^0-9][01]? { € > ' euro' ;
€ > ' euros' ;
[^0-9][01]? { '$' > ' dollar' ;
'$' > ' dollars' ;
:: Null ;

# slugify
[^[:alnum:]&[:ascii:]]+ > '-' ;
:: Lower ;

# trim
[$] { '-' > &Remove() ;
'-' } [$] > &Remove() ;
RULES;

$tsl = Transliterator::createFromRules($rules, Transliterator::FORWARD);
$results = array_map(fn($s) => $tsl->transliterate($s), $strings);
print_r($results);

demo

Unfortunately, the PHP manual says almost nothing about ICU transformations, but you can find information about them here.
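For readers who do not need custom rules, a smaller variant may be enough. The sketch below is not the approach shown above: it assumes the built-in compound ID `Any-Latin; Latin-ASCII; Lower()` and finishes with a plain regex. The helper name `simpleSlug()` is hypothetical, and the function falls back to the untransliterated input when the intl extension is not loaded.

```php
<?php
// simpleSlug() is a hypothetical helper, not part of the answer above.
function simpleSlug(string $s): string
{
    // Use ICU when the intl extension is loaded; otherwise keep the raw input.
    if (class_exists('Transliterator')) {
        $tsl = Transliterator::create('Any-Latin; Latin-ASCII; Lower()');
        if ($tsl !== null) {
            $converted = $tsl->transliterate($s);
            if ($converted !== false) {
                $s = $converted;
            }
        }
    }

    // Collapse every run of non-alphanumeric bytes into one hyphen, then trim.
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($s)), '-');
}

echo simpleSlug('This, is - - the URL!'), PHP_EOL;
echo simpleSlug('Und die Katze fraß alle mäuse.'), PHP_EOL;
```

The trade-off is that symbols such as `&` or `€` are simply dropped here, whereas the custom rules above turn them into words.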

1 Comment

0

All previous answers deal with URLs, but in case someone needs to sanitize a string for a login (for example) and keep it as text, here you go:

function sanitizeText($str) {
    $withSpecCharacters = htmlspecialchars($str);
    $splitted_str = str_split($str);
    $result = '';
    foreach ($splitted_str as $letter) {
        // keep the character only if it still appears in the escaped string
        if (strpos($withSpecCharacters, $letter) !== false) {
            $result .= $letter;
        }
    }
    return $result;
}

echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
// ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
// No injections possible; as much information as possible is kept


0
function isolate($data) {
    $data = trim($data);
    $data = stripslashes($data);
    $data = htmlspecialchars($data);
    return $data;
}

1 Comment

Please add more information with your code, maybe how to use or how you got to this answer. Thank you.
-1

You should use the slugify package and not reinvent the wheel ;)

https://github.com/cocur/slugify

1 Comment

A link-only answer is useless, especially once the link breaks. Can you elaborate on this a little more?
-1

The following will replace spaces with dashes.

$str = str_replace(' ', '-', $str); 

Then the following statement will remove everything except alphanumeric characters and dashes. (There are no spaces to handle because the previous step replaced them with dashes.)

// Char representation:   0 - 9     A - Z     a - z     -
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);

Which is equivalent to

$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str); 

FYI: To remove every character outside the printable ASCII range from a string, use

$str = preg_replace('/[^\x20-\x7E]/', '', $str); 

\x20 is hexadecimal for the space, which is the first printable ASCII character, and \x7E is the tilde, which is the last, according to Wikipedia: https://en.wikipedia.org/wiki/ASCII#Printable_characters

FYI: look into the Hex Column for the interval 20-7E

Printable characters Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.
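To make the difference between the two character classes concrete, here is a minimal sketch. The input is a made-up UTF-8 string written with byte escapes so the example does not depend on the file's encoding; it is an assumption for illustration, not data from this answer.

```php
<?php
// Hypothetical UTF-8 input: "Héllo, wörld!" followed by a tab.
// \xC3\xA9 is "é" and \xC3\xB6 is "ö"; all four bytes lie outside \x20-\x7E.
$input = "H\xC3\xA9llo, w\xC3\xB6rld!\t";

// Printable-ASCII filter: drops the tab and both bytes of each accented letter,
// but keeps the comma, the space, and the exclamation mark.
$printable = preg_replace('/[^\x20-\x7E]/', '', $input);
var_dump($printable === 'Hllo, wrld!'); // bool(true)

// Slug-style filter: spaces become hyphens, then only 0-9, A-Z, a-z, and - survive.
$dashed    = str_replace(' ', '-', $input);
$slugChars = preg_replace('/[^0-9A-Za-z-]+/', '', $dashed);
var_dump($slugChars === 'Hllo-wrld'); // bool(true)

// The hex-escaped class from above selects exactly the same characters.
$viaHex = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $dashed);
var_dump($viaHex === $slugChars); // bool(true)
```

So the printable-ASCII filter alone is not enough for a URL slug, since punctuation and spaces pass straight through it.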

2 Comments

I challenge anyone to prove me wrong. I don't know why I was downvoted.
Demonstrations can be found in stackoverflow.com/a/65280956/2943403
