14

I have a URL that looks like this (note the „ and “ characters):

http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-„omnitel“-1494

I receive it from the SimplePie parser, if that matters. If you open this URL in your browser and then copy it from the address bar, you get a URL with the non-ASCII characters percent-encoded:

http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-%E2%80%9Eomnitel%E2%80%9C-1494

I am trying to understand how I can mimic the same conversion in PHP. I cannot simply use urlencode() or rawurlencode(), as they encode both non-ASCII characters and reserved characters, while in my case the reserved characters (/, ?, &, etc.) should stay as they are.

So far I have only seen solutions that involve splitting the URL into pieces between reserved characters and then calling urlencode() on each piece, but that feels hackish to me and I hope there is a more elegant solution. I have tried several variations of iconv() and mb_convert_encoding(), with no success so far.

6
  • possible duplicate of How to encode URL using php like browsers do Commented Mar 23, 2012 at 8:27
  • What's so "hackish" in the solution you linked to? What's that "elegant" way of the trivial string manipulation you are looking for? Commented Mar 23, 2012 at 8:28
  • @YourCommonSense I might be wrong, but to me this does not look like an arbitrary string manipulation exercise. Rather, it is a fairly generic encoding task: the fact that browsers do this when copying URLs from the address bar suggests that there is some standard behind it. Commented Mar 23, 2012 at 8:44
  • Yes, this is a generic encoding task, which is itself trivial string manipulation. And you already have the solution. I see no point in posting another question if you already found the answer. Commented Mar 23, 2012 at 8:47
  • I am looking for a defined way to solve this task. In order to escape HTML, one can use htmlspecialchars() or just write a custom function with character codes and str_replace(). You are right, I know the custom way, but I am looking for a solution that uses built-in string manipulation functions (no matter how trivial they are). Commented Mar 23, 2012 at 8:51

5 Answers

23

I have a simple one-liner that I use to do in-place encoding of only the non-ASCII characters, using preg_replace_callback:

preg_replace_callback('/[^\x20-\x7f]/', function($match) { return urlencode($match[0]); }, $url); 

Note that the anonymous function is only supported in PHP 5.3+.
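
As a minimal sketch, the one-liner can be wrapped in a function and run against the URL from the question. The function name encode_non_ascii is hypothetical, and the source file is assumed to be saved as UTF-8 so that the quote characters are multi-byte sequences.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function encode_non_ascii(string $url): string
{
    // Without the /u flag the pattern matches byte by byte, so every byte
    // outside \x20-\x7f is percent-encoded on its own.
    return preg_replace_callback(
        '/[^\x20-\x7f]/',
        function (array $match): string {
            return urlencode($match[0]);
        },
        $url
    );
}

// Assumes this file is UTF-8 encoded.
$url = 'http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-„omnitel“-1494';
echo encode_non_ascii($url), "\n";
// http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-%E2%80%9Eomnitel%E2%80%9C-1494
```

Because "%" is itself an ASCII character, running the function on an already encoded URL leaves it unchanged.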


2 Comments

This should be the accepted answer. It handles non-ASCII chars anywhere in the URL (path and query string), and does not need to perform checks such as "avoid double encoding" as in the OP's answer.
Good solution, thanks! If you also want ordinary spaces to be encoded as +, use '/[^\x21-\x7f]/' instead. And if you want them encoded as %20 to comply with RFC 3986, call rawurlencode() instead of urlencode().
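
The two variants from the comment above could look roughly like this. The function name and the $rfc3986 switch are hypothetical and only serve to show both behaviours side by side.

```php
<?php
// Hypothetical helper: encodes non-ASCII bytes, control bytes and spaces.
function encode_non_ascii_and_spaces(string $url, bool $rfc3986 = true): string
{
    // rawurlencode() turns a space into %20, urlencode() turns it into +.
    $encode = $rfc3986 ? 'rawurlencode' : 'urlencode';

    return preg_replace_callback(
        '/[^\x21-\x7f]/',
        function (array $match) use ($encode): string {
            return $encode($match[0]);
        },
        $url
    );
}

echo encode_non_ascii_and_spaces('http://example.com/a b', true), "\n";  // http://example.com/a%20b
echo encode_non_ascii_and_spaces('http://example.com/a b', false), "\n"; // http://example.com/a+b
```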
12

After researching a bit, I came to the conclusion that there is no way to do this nicely in PHP (other languages, such as Python and Perl, do seem to have functions for exactly this use case). This is the function I came up with (it ensures that the path component of the URL is encoded):

function url_path_encode($url) {
    $path = parse_url($url, PHP_URL_PATH);
    if (strpos($path, '%') !== false) {
        return $url; // avoid double encoding
    }
    $encoded_path = array_map('urlencode', explode('/', $path));
    return str_replace($path, implode('/', $encoded_path), $url);
}
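
The "avoid double encoding" check above skips the whole path as soon as it contains a single "%". A possible alternative, sketched here under assumptions, is to decode each segment before encoding it, so that existing escapes survive and everything else is still encoded. The helper name encode_path_segments is hypothetical, it takes an already extracted path rather than a full URL, and it treats any "%" followed by two hex digits as an existing escape.

```php
<?php
// Hypothetical helper: percent-encode each path segment exactly once.
function encode_path_segments(string $path): string
{
    $segments = explode('/', $path);
    foreach ($segments as $i => $segment) {
        // Decode first so existing %XX escapes are not encoded a second time.
        $segments[$i] = rawurlencode(rawurldecode($segment));
    }
    return implode('/', $segments);
}

// Assumes this file is UTF-8 encoded.
$once  = encode_path_segments('/kokius-kanalus-siulo-„omnitel“-1494');
$twice = encode_path_segments($once);

echo $once, "\n";          // /kokius-kanalus-siulo-%E2%80%9Eomnitel%E2%80%9C-1494
var_dump($once === $twice); // bool(true)
```

Note that rawurlencode() also encodes reserved characters inside a segment (for example ":" or "@"), which may or may not be what you want.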


2

This function may help:

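function sanitizeUrl($url) {
    // Characters that are left untouched, in addition to letters and digits.
    $chars = '$-_.+!*\'(),{}|\\^~[]`<>#%";/?:@&=';
    $pattern = '~[^a-z0-9' . preg_quote($chars, '~') . ']+~iu';
    // Anonymous function instead of create_function(), which was removed in PHP 8.0.
    $callback = function ($matches) {
        return urlencode($matches[0]);
    };
    return preg_replace_callback($pattern, $callback, $url);
}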

1 Comment

Only the following characters are allowed in a URL: '-._~:/?#[]@!$&\'()*+,;='
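
A variant restricted to the character list from this comment might look like the sketch below. The function name is hypothetical. Keeping "%" untouched is an added assumption so that existing escapes survive, and letters and digits are allowed as well.

```php
<?php
// Hypothetical variant of sanitizeUrl() using the stricter character list.
function encode_disallowed_url_chars(string $url): string
{
    // Punctuation from the comment above, plus "%" (assumption: keep existing escapes).
    $allowed = '-._~:/?#[]@!$&\'()*+,;=%';
    $pattern = '~[^A-Za-z0-9' . preg_quote($allowed, '~') . ']+~';

    return preg_replace_callback(
        $pattern,
        function (array $matches): string {
            // rawurlencode() encodes a space as %20 rather than +.
            return rawurlencode($matches[0]);
        },
        $url
    );
}

echo encode_disallowed_url_chars('http://example.com/a b?q="x"'), "\n";
// http://example.com/a%20b?q=%22x%22
```

Unlike the list in sanitizeUrl(), this also encodes characters such as {, }, |, ^, `, <, > and the double quote.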
1

I think this will do what you want.

<?php
$string = 'http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-„omnitel“-1494/?foo=bar&fizz=buzz';
// Note: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1.
var_dump(filter_var($string, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH));

This will get you:

$ php test.php
string(140) "http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-&#226;&#128;&#158;omnitel&#226;&#128;&#156;-1494/?foo=bar&fizz=buzz"

1 Comment

Good proposal, but the result does not conform to RFC 3986: it produces HTML numeric entities rather than percent-encoding.
0
function cyrillicaToUrlencode($text) {
    // Encode only Cyrillic letters; everything else is left as is.
    return preg_replace_callback('/([а-яё])/ui', function ($matches) {
        return urlencode($matches[0]);
    }, $text);
}

echo cyrillicaToUrlencode("https://test.com/Москваёtext1Воронежtext2Москваёtext3yМоскваё___-Москваё");

Will return - https://test.com/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91text1%D0%92%D0%BE%D1%80%D0%BE%D0%BD%D0%B5%D0%B6text2%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91text3y%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91___-%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91
