How can I combine two urls the same way a browser does?

Question

I'm writing a page scraper, and one of the things I need to do is combine the current URL with a URL fragment extracted from the current page.

Like this:

    if (WebPath.IsAbsolute(urlFragment))
        links.Add(new Uri(urlFragment));
    else
        links.Add(new Uri(currentUrl, urlFragment));

Easy peasy - this approach works most of the time, for both relative and absolute Uris.

However, some pages look like http://example.com/couple/of/folders/, with the url fragment couple/of/otherfolders/. And every single browser out there interprets that as http://example.com/couple/of/otherfolders.

Of course, my code yields http://example.com/couple/of/folders/couple/of/otherfolders. Which totally looks correct from the Uri's point of view - but I don't get how a browser can interpret this otherwise.

Now, I've searched for a solution to this problem, but I only found people who didn't know how to combine two URLs, so that didn't get me very far. The closest thing I found was this question: "How do you combine URL fragments in Java the same way browsers do?", but the answer doesn't tackle my particular problem.

Does anybody know what I'm missing?

Edit - this is the IsAbsolute method (I know I should replace it with new Uri(link).IsAbsoluteUri):

    public static bool IsAbsolute(string path)
    {
        var uppercasePath = path.ToUpper();
        return uppercasePath.StartsWith("HTTP://") || uppercasePath.StartsWith("HTTPS://");
    }
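One caveat about the replacement mentioned above: `new Uri(link)` parses its argument as an absolute URI, so it throws a `UriFormatException` for a relative link before `IsAbsoluteUri` can be read. A minimal sketch of a non-throwing variant, assuming only `http` and `https` links should count as absolute (the class and method names simply mirror the ones in the question):

```csharp
using System;

public static class WebPath
{
    // Returns true only for well-formed absolute http/https URLs.
    // Uri.TryCreate returns false instead of throwing on relative input.
    public static bool IsAbsolute(string path)
    {
        return Uri.TryCreate(path, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

The explicit scheme check keeps the original behaviour of rejecting things like `mailto:` links, which would otherwise parse as absolute URIs.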

Browsers would never interpret it like that, as there’s no leading slash. Assuming there is a leading slash, Uri works properly.
– Ry- ♦, Commented Jun 29, 2013 at 14:14
@minitech See goominet.com/unspeakable-vault/vault/1 for an example - check out the Next link
– Vlad Iliescu, Commented Jun 29, 2013 at 14:14
@Vlad: Ah… sorry. The document has a <base> element, which is a horrible thing that you need to handle separately. So check for <base>, get its href, use that instead of currentUrl if it exists.
– Ry- ♦, Commented Jun 29, 2013 at 14:17
Dude, you rule, please post this as an answer so I can accept it.
– Vlad Iliescu, Commented Jun 29, 2013 at 14:18

Ry- · Accepted Answer · 2013-06-29 14:21:51Z

Normally, browsers wouldn’t do that. But when there’s a <base> element, its href replaces the current page’s URL for the page’s URL-resolving purposes.

Check for a <base> and use it in place of currentUrl if it exists.
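A rough sketch of that check, under some loud assumptions: `LinkResolver`, `GetBaseUri` and `Resolve` are hypothetical names, and the regular expression is only a stand-in for whatever HTML parser the scraper already uses (it will miss unquoted attributes and can match a `<base>` inside a comment). The one detail worth noting is that the `href` of `<base>` may itself be relative, so it is resolved against the page URL first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkResolver
{
    // Simplified pattern: captures the quoted href of the first <base> tag.
    // Assumption: a real scraper would read this from its HTML parser instead.
    private static readonly Regex BaseHrefPattern = new Regex(
        @"<base\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Returns the URL that relative links on the page should be resolved against.
    public static Uri GetBaseUri(Uri currentUrl, string html)
    {
        Match match = BaseHrefPattern.Match(html);
        if (!match.Success)
            return currentUrl; // no <base href>: fall back to the page URL

        // The <base> href may be relative, so resolve it against the page URL.
        return Uri.TryCreate(currentUrl, match.Groups[1].Value, out Uri baseUri)
            ? baseUri
            : currentUrl;
    }

    // Resolves a link found on the page, honouring <base> if present.
    public static Uri Resolve(Uri currentUrl, string html, string link)
    {
        return new Uri(GetBaseUri(currentUrl, html), link);
    }
}
```

For a page at http://example.com/couple/of/folders/ whose markup contains `<base href="http://example.com/">`, `Resolve` turns the link couple/of/otherfolders/ into http://example.com/couple/of/otherfolders/, while a page without a `<base>` still resolves against its own URL as before.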

Also, thanks for reminding me to fix all my scrapers!
