0

When I spider a website, I get a lot of bad URLs like these:

http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm

All of these should be http://example.com/1.htm. How can I do this in PHP? Thanks.

PS: I use Snoopy (http://snoopy.sourceforge.net/). I get a lot of duplicate links in my database; for example, 'http://example.com/../test/..//1.htm' should be stored as 'http://example.com/1.htm'.

5
  • What have you tried? Commented Jul 6, 2012 at 2:08
  • He told you--he spide it. (What is "spide"? I'm not being facetious, I don't understand your question.) Commented Jul 6, 2012 at 2:14
  • How are these URLs being generated? Commented Jul 6, 2012 at 2:23
  • I think he means "spider": he's crawling a domain. Commented Jul 6, 2012 at 2:23
  • Why should /.\./ and // both become /? They are semantically different. Commented Jul 6, 2012 at 2:27

3 Answers

1

You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm:

    $test = array(
        'http://example.com/../../../././.\./1.htm',
        'http://example.com/test/../test/../././.\./1.htm',
        'http://example.com/.//1.htm',
        'http://example.com/../test/..//1.htm'
    );

    foreach ($test as $url){
        $u = parse_url($url);
        // keep only the file name of the path
        $path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
        echo $path.'<br />'.PHP_EOL;
    }
    /* result
    http://example.com/1.htm<br />
    http://example.com/1.htm<br />
    http://example.com/1.htm<br />
    http://example.com/1.htm<br />
    */

    //or as a function @lpc2138
    function getRealUrl($url){
        $u = parse_url($url);
        $path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
        $path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
        return $path;
    }

4 Comments

Thanks. function getRealUrl($url){ $u = parse_url($url); $path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']); if($u['query']){ $path .='?'.$u['query']; } return $path; }
You didn't have that in your examples.
Yes, those were just examples, not all of my URLs. http://www.example.com/datasheet/ds11015.html should be http://www.example.com/datasheet/ds11015.html
@lpc2138 What is the difference between the two URLs in your comment above? "example.com/datasheet/ds11015.html" should be "example.com/datasheet/ds11015.html". As far as I can see, they are the same.
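
A possible reading of that comment, shown as a small sketch: the two URLs are meant to be identical, i.e. the URL should pass through unchanged, but `basename()` keeps only the last path segment, so the `/datasheet/` directory is lost. The function body below is the `getRealUrl` from the answer; only the test URL comes from the comment.

```php
<?php
// getRealUrl as posted in the answer above
function getRealUrl($url){
    $u = parse_url($url);
    // basename() returns only the last path segment ("ds11015.html")
    $path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
    $path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
    return $path;
}

echo getRealUrl('http://www.example.com/datasheet/ds11015.html'), PHP_EOL;
// prints http://www.example.com/ds11015.html (the directory is dropped)
```

So this approach is only safe when every page is known to live directly under the site root.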
0

You seem to be looking for an algorithm to remove the dot segments:

    function remove_dot_segments($abspath) {
        $ib = $abspath; // input buffer
        $ob = '';       // output buffer
        while ($ib !== '') {
            if (substr($ib, 0, 3) === '../') {
                $ib = substr($ib, 3);
            } else if (substr($ib, 0, 2) === './') {
                $ib = substr($ib, 2);
            } else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) {
                // "/./" or a trailing "/.": replace with "/"
                // (length is checked first so $ib[2] is never read past the end)
                $ib = '/'.substr($ib, 3);
            } else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
                // "/../" or a trailing "/..": replace with "/" and drop the last output segment
                $ib = '/'.substr($ib, 4);
                $ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
            } else if ($ib === '.' || $ib === '..') {
                $ib = '';
            } else {
                // move the first path segment from the input to the output buffer
                $pos = strpos($ib, '/', 1);
                if ($pos === false) {
                    $ob .= $ib;
                    $ib = '';
                } else {
                    $ob .= substr($ib, 0, $pos);
                    $ib = substr($ib, $pos);
                }
            }
        }
        return $ob;
    }

This removes the . and .. segments. Removing any other segment, such as an empty one (//) or .\., is not part of the standard algorithm, because it changes the semantics of the path.
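
If you do want the result from the question (every example collapsing to http://example.com/1.htm), here is a minimal sketch of one way to do it for whole URLs. It is not the standard algorithm: it uses a simple segment stack and also drops empty segments (//), which is only safe if your target server treats them as equivalent. The names `normalize_path` and `normalize_url` are made up for this example, and the sketch deliberately drops fragments and user info, on the assumption that they do not matter for deduplicating crawled links.

```php
<?php
// Hypothetical helper: resolve "." and ".." and drop empty segments.
function normalize_path(string $path): string
{
    $segments = explode('/', $path);
    $stack = [];
    foreach ($segments as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;            // "//" and "/./" add nothing
        }
        if ($segment === '..') {
            array_pop($stack);   // go up one level; no-op at the root
            continue;
        }
        $stack[] = $segment;
    }
    $result = '/' . implode('/', $stack);
    // keep a trailing slash when the input pointed at a directory
    $last = end($segments);
    if ($stack !== [] && ($last === '' || $last === '.' || $last === '..')) {
        $result .= '/';
    }
    return $result;
}

// Hypothetical helper: rebuild scheme, host, port, path and query.
function normalize_url(string $url): string
{
    $u = parse_url($url);
    if ($u === false || !isset($u['scheme'], $u['host'])) {
        return $url;             // leave anything we cannot parse untouched
    }
    $out = $u['scheme'] . '://' . $u['host'];
    if (isset($u['port'])) {
        $out .= ':' . $u['port'];
    }
    $out .= normalize_path($u['path'] ?? '/');
    if (isset($u['query'])) {
        $out .= '?' . $u['query'];
    }
    return $out;
}

echo normalize_url('http://example.com/../test/..//1.htm'), PHP_EOL;
// prints http://example.com/1.htm
```

Unlike the `basename()` approach, directories survive: `http://www.example.com/datasheet/ds11015.html` comes back unchanged.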

1 Comment

github.com/glenscott/url-normalizer/blob/master/… Thanks, Gumbo. Your answer showed me how to find this code.
0

You could do some fancy regex but this works just fine.

    fixUrl('http://example.com/../../../././.\./1.htm');

    function fixUrl($str) {
        $str = str_replace('../', '', $str);
        $str = str_replace('./', '', $str);
        $str = str_replace('\.', '', $str);
        return $str;
    }

3 Comments

You could just use $str = str_replace(array('../','./','\.'), '', $str);
Your code doesn't work; I only gave you an example URL. My URL is http://example.com/../test/..//1.htm
This doesn't do the actual traversal/collapsing of paths (like /test/../ into /)
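
To illustrate the last two comments, here is a quick check of `fixUrl` (copied from the answer) against the URL from the comment above. Plain string replacement deletes the `../` markers but never removes the directory they were supposed to cancel out, so `test` and the empty segment both survive.

```php
<?php
// fixUrl as posted in the answer above
function fixUrl($str) {
    $str = str_replace('../', '', $str);
    $str = str_replace('./', '', $str);
    $str = str_replace('\.', '', $str);
    return $str;
}

echo fixUrl('http://example.com/../test/..//1.htm'), PHP_EOL;
// prints http://example.com/test//1.htm, not http://example.com/1.htm
```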
