
I'm writing a page scraper, and one of the things I need to do is combine the current URL with a URL fragment extracted from the current page.

Like this:

    if (WebPath.IsAbsolute(urlFragment))
        links.Add(new Uri(urlFragment));
    else
        links.Add(new Uri(currentUrl, urlFragment));

Easy peasy - this approach works most of the time, for both relative and absolute Uris.

However, some pages have a URL like http://example.com/couple/of/folders/ and contain the URL fragment couple/of/otherfolders/. Every single browser out there resolves that to http://example.com/couple/of/otherfolders.

Of course, my code yields http://example.com/couple/of/folders/couple/of/otherfolders. That looks correct from the Uri class's point of view, and I don't see how a browser could interpret it any other way.

Now, I've searched for a solution to this problem, but I only found people who didn't know how to combine two URLs, so that didn't get me very far. The closest thing I found was this question: How do you combine URL fragments in Java the same way browsers do?, but the answer doesn't tackle my particular problem.

Does anybody know what I'm missing?


Edit - this is the IsAbsolute method (I know I should replace it with new Uri(link).IsAbsoluteUri):

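    public static bool IsAbsolute(string path)
    {
        // Ordinal, case-insensitive comparison avoids the culture-specific casing of ToUpper().
        return path.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || path.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }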
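One caveat about the replacement mentioned above: new Uri(link) with a single string argument throws a UriFormatException for relative input, so it can't be used directly as a test. A minimal sketch of a non-throwing check, assuming only http and https links should count as absolute (the class name UrlChecks and method name IsAbsoluteHttpUrl are made up for illustration):

```csharp
using System;

public static class UrlChecks
{
    // Hypothetical helper: true only for absolute http/https URLs.
    // Uri.TryCreate returns false instead of throwing on relative input.
    public static bool IsAbsoluteHttpUrl(string path)
    {
        return Uri.TryCreate(path, UriKind.Absolute, out var uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

The scheme check is there because other absolute URIs (mailto: links, for instance) also parse successfully with UriKind.Absolute.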
  • Browsers would never interpret it like that, as there’s no leading slash. Assuming there is a leading slash, Uri works properly. Commented Jun 29, 2013 at 14:14
  • @minitech See goominet.com/unspeakable-vault/vault/1 for an example - check out the Next link Commented Jun 29, 2013 at 14:14
  • @Vlad: Ah… sorry. The document has a <base> element, which is a horrible thing that you need to handle separately. So check for <base>, get its href, use that instead of currentUrl if it exists. Commented Jun 29, 2013 at 14:17
  • Dude, you rule, please post this as an answer so I can accept it. Commented Jun 29, 2013 at 14:18

1 Answer


Normally, browsers wouldn’t do that. But when there’s a <base> element, its href replaces the current page’s URL for the page’s URL-resolving purposes.

Check for a <base> and use it in place of currentUrl if it exists.
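A rough sketch of that check, assuming the <base> element's href has already been pulled out of the page with whatever HTML parser the scraper uses (the names LinkResolver, Resolve and baseHref are hypothetical, not part of the question's code):

```csharp
using System;

public static class LinkResolver
{
    // baseHref: the href of the page's <base> element, or null if the page has none.
    public static Uri Resolve(Uri currentUrl, string baseHref, string urlFragment)
    {
        var baseUri = currentUrl;

        // The <base> href may itself be relative, so resolve it against the page URL first.
        if (!string.IsNullOrWhiteSpace(baseHref) &&
            Uri.TryCreate(currentUrl, baseHref, out var fromBaseElement))
        {
            baseUri = fromBaseElement;
        }

        // new Uri(Uri, string) ignores baseUri when urlFragment is already absolute,
        // so no separate IsAbsolute branch is needed here.
        return new Uri(baseUri, urlFragment);
    }
}
```

With currentUrl set to http://example.com/couple/of/folders/, a baseHref of http://example.com/ and the fragment couple/of/otherfolders/, this should give http://example.com/couple/of/otherfolders/. With a null baseHref it falls back to the behaviour described in the question.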

Also, thanks for reminding me to fix all my scrapers!


1 Comment

Thanks again, I had no idea this element exists.
