
I have two URLs and am looking for the best way to decide if they are identical.

Example:

$url1 = 'http://example.com/page.php?tab=items&msg=3&sort=title';
$url2 = 'http://example.com/page.php?tab=items&sort=title&msg=3';

In the two URLs only the sort and msg params are swapped, so I consider them equal. However, I cannot simply do if ( $url1 == $url2 ) { … }

I have a list of URLs and need to find duplicates, so the code should be fast, as it runs inside a loop. (As a side note: the domain/page.php will always be the same; it's only about matching URLs by their params.)

  • Turning the params into an associative array, then using array_diff could be a good way to go. Commented Dec 27, 2014 at 12:24
  • You'd need to combine parse_url and parse_str. Either compare the param arrays, or sort them and reassemble the URLs. Commented Dec 27, 2014 at 12:24
  • Parse the parameters with parse_str, sort them with ksort, then put them back. Then you can compare them. Commented Dec 27, 2014 at 12:25

3 Answers

1

Maybe like this?

function compare_url($url1, $url2) {
    parse_str((string) parse_url($url1, PHP_URL_QUERY), $params1);
    parse_str((string) parse_url($url2, PHP_URL_QUERY), $params2);
    // == on arrays ignores the order of the keys
    return $params1 == $params2;
}

1 Comment

That's perfect :-) I only use parse_url() instead of preg_replace(). Very good idea to compare the param-arrays!
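
A quick sketch of why comparing the param arrays works: PHP's == on arrays checks for the same key/value pairs and does not care about their order, while === also requires the same order. The variable names are only for illustration.

```php
<?php
// Parse two query strings that differ only in parameter order.
parse_str('tab=items&msg=3&sort=title', $a);
parse_str('tab=items&sort=title&msg=3', $b);

var_dump($a == $b);  // bool(true):  same key/value pairs, order ignored
var_dump($a === $b); // bool(false): strict comparison also checks order
```

So the loose comparison is the one to use here; no explicit sorting is needed just to test two arrays for equality.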
1

It's not as easy as it might sound to decide whether two URIs are identical, especially as you take the query parameters into account here.

One common way to do this is to have a function that normalizes the URL and then compare the normalized URIs:

$url1 = 'http://example.com/page.php?tab=items&msg=3&sort=title';
$url2 = 'http://example.com/page.php?tab=items&sort=title&msg=3';

var_dump(url_nornalize($url1) == url_nornalize($url2)); # bool(true)

You put your requirements into such a normalization function. First of all, the URL should be normalized according to the specs:

function url_nornalize($url, $separator = '&')
{
    // normalize according to RFC 3986
    $url = new Net_URL2($url);
    $url->normalize();

And then you can take care of additional normalization steps, for example, sorting the sub-parts of the query:

    // normalize query if applicable
    $query = $url->getQuery();
    if (false !== $query) {
        $params = explode($separator, $query);
        sort($params);
        $query = implode($separator, $params);
        $url->setQuery($query);
    }

Additional steps can be thought of, like removing default parameters, disallowed ones, duplicate ones and so on.

Finally, the normalized URL is returned as a string:

    return (string) $url;
}

Using an array/hash-map for the parameters isn't bad either; I just wanted to show an alternative approach. Full example:

<?php
/**
 * http://stackoverflow.com/questions/27667182/are-two-urls-identical-ignore-the-param-order
 */

require_once 'Net/URL2.php';

function url_nornalize($url, $separator = '&')
{
    // normalize according to RFC 3986
    $url = new Net_URL2($url);
    $url->normalize();

    // normalize query if applicable
    $query = $url->getQuery();
    if (false !== $query) {
        $params = explode($separator, $query);
        // remove empty parameters
        $params = array_filter($params, 'strlen');
        // sort parameters
        sort($params);
        $query = implode($separator, $params);
        $url->setQuery($query);
    }

    return (string)$url;
}

$url1 = 'http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title';
$url2 = 'http://example.com:80/page.php?tab=items&sort=title&msg=3';

var_dump(url_nornalize($url1) == url_nornalize($url2)); # bool(true)
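
Since the question is about finding duplicates in a list, here is a minimal sketch of the array/hash-map variant: build one canonical key per URL and use it as an array key, so each URL is parsed only once instead of being compared pairwise. The helper name query_key() and the sample list are made up for illustration, and the sketch assumes, as stated in the question, that only the query parameters differ.

```php
<?php
// Hypothetical helper: canonical string for a URL's query parameters.
function query_key(string $url): string
{
    parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
    ksort($params);                   // make the key order-independent (top level only)
    return http_build_query($params); // rebuild the sorted params as a string
}

// Sample data, invented for this sketch.
$urls = [
    'http://example.com/page.php?tab=items&msg=3&sort=title',
    'http://example.com/page.php?tab=items&sort=title&msg=3',
    'http://example.com/page.php?tab=users',
];

$seen = [];
$duplicates = [];
foreach ($urls as $url) {
    $key = query_key($url);
    if (isset($seen[$key])) {
        $duplicates[] = $url; // same params as an earlier URL
    } else {
        $seen[$key] = $url;   // first URL with this set of params
    }
}

print_r($duplicates);
```

Here only the second URL ends up in $duplicates. Note that ksort() sorts just the top-level keys, so nested parameters such as a[x]=1&a[y]=2 would need a recursive sort.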


0

To make sure that both URLs are identical, we need to compare at least 4 elements:

  1. The scheme (e.g. http, https, ftp)
  2. The host, i.e. the domain name of the URL
  3. The path, i.e. the "file" that was requested
  4. Query parameters of the request.

Some notes:

  • (1) and (2) are case-insensitive, which means http://example.org is identical to HTTP://EXAMPLE.ORG.
  • (3) can have leading or trailing slashes that should be ignored: example.org is identical to example.org/
  • (4) could include parameters in varying order.
  • We can safely ignore anchor text, or "fragment" (#anchor after the query parameters), as they are only parsed by the browser.
  • URLs can also include a port number, a username and a password. I think we can ignore those elements, as they are used so rarely that they do not need to be checked here.
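
To make the elements above concrete, here is a small sketch of how parse_url() splits a URL into them. The URL itself is invented for illustration and contains every component mentioned in the notes.

```php
<?php
// Illustrative URL only; it contains every component mentioned above.
$parts = parse_url('HTTPS://user:secret@Example.org:8080/dir/page.php?b=2&a=1#top');

echo strtolower($parts['scheme']), "\n"; // https        (case-insensitive)
echo strtolower($parts['host']), "\n";   // example.org  (case-insensitive)
echo $parts['path'], "\n";               // /dir/page.php
echo $parts['query'], "\n";              // b=2&a=1
echo $parts['port'], "\n";               // 8080
echo $parts['fragment'], "\n";           // top
```

parse_url() does not lowercase anything by itself, which is why scheme and host have to be normalized explicitly before comparing.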

Solution:

Here's a complete function that checks all those details:

/**
 * Check if two urls match while ignoring order of params
 *
 * @param string $url1
 * @param string $url2
 * @return bool
 */
function do_urls_match( $url1, $url2 ) {
    // Parse urls
    $parts1 = parse_url( $url1 );
    $parts2 = parse_url( $url2 );

    // Scheme and host are case-insensitive.
    $scheme1 = strtolower( $parts1[ 'scheme' ] ?? '' );
    $scheme2 = strtolower( $parts2[ 'scheme' ] ?? '' );
    $host1 = strtolower( $parts1[ 'host' ] ?? '' );
    $host2 = strtolower( $parts2[ 'host' ] ?? '' );

    if ( $scheme1 !== $scheme2 ) {
        // URL scheme mismatch (http <-> https): URLs are not identical.
        return false;
    }

    if ( $host1 !== $host2 ) {
        // Different host (domain name): Not identical.
        return false;
    }

    // Remove leading/trailing slashes, url-decode special characters.
    $path1 = trim( urldecode( $parts1[ 'path' ] ?? '' ), '/' );
    $path2 = trim( urldecode( $parts2[ 'path' ] ?? '' ), '/' );

    if ( $path1 !== $path2 ) {
        // The request-path is different: Different URLs.
        return false;
    }

    // Convert the query-params into arrays.
    parse_str( $parts1['query'] ?? '', $query1 );
    parse_str( $parts2['query'] ?? '', $query2 );

    // Loose array comparison: same keys with the same values, in any order.
    // (array_diff() would only compare values, and only in one direction.)
    if ( $query1 != $query2 ) {
        // Query arrays have differences: URLs do not match.
        return false;
    }

    // All checks passed, URLs are identical.
    return true;
} // End do_urls_match()

Test cases:

$base_urls = [
    'https://example.org/',
    'https://example.org/index.php?sort=asc&field=id&filter=foo',
    'http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title',
];

$compare_urls = [
    'https://example.org/',
    'https://Example.Org',
    'https://example.org/index.php?sort=asc&&field=id&filter=foo',
    'http://example.org/index.php?sort=asc&field=id&filter=foo',
    'https://company.net/page.php?sort=asc&field=id&filter=foo',
    'https://example.org/index.php?sort=asc&&&field=id&filter=foo#anchor',
    'https://example.org/index.php?field=id&filter=foo&sort=asc',
    'http://example.com:80/page.php?tab=items&sort=title&msg=3',
];

foreach ( $base_urls as $url1 ) {
    printf( "\n\n%s", $url1 );
    foreach ( $compare_urls as $url2 ) {
        if (do_urls_match( $url1, $url2 )) {
            printf( "\n [MATCHES] %s", $url2 );
        }
    }
}

/* Output:

https://example.org/
 [MATCHES] https://example.org/
 [MATCHES] https://Example.Org

https://example.org/index.php?sort=asc&field=id&filter=foo
 [MATCHES] https://example.org/index.php?sort=asc&&field=id&filter=foo
 [MATCHES] https://example.org/index.php?sort=asc&&&field=id&filter=foo#anchor
 [MATCHES] https://example.org/index.php?field=id&filter=foo&sort=asc

http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title
 [MATCHES] http://example.com:80/page.php?tab=items&sort=title&msg=3
*/
