I've got some weird usability warnings popping up in Search Console for pages that don't actually exist on my site. It looks like someone is creating hundreds of malformed links to my pages.

Here's the problem.

Say I have https://example.com/pagename.html

Someone is creating hundreds of links with random text after the filetype like this: https://example.com/pagename.html/randomtext

Strangely, with that malformed URL, part of the page renders, but renders with broken styling for some reason, triggering Search Console warnings for every one of those "pages". I'm not really sure how or why these broken URLs are rendering at all. My understanding is that this should return a 404 error, but it doesn't.

So...

I'm trying to solve the problem with .htaccess like this:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [L,R=301]

That sort of works. It removes the trailing slash from https://example.com/pagename.html/ but it doesn't fix https://example.com/pagename.html/randomchars

How do I get it to ignore all characters following the file type? (except standard GET strings starting with a ?)

Thanks in advance.

  • "/pagename.html" - is pagename.html a real file? Does it really have a .html file extension? Commented Nov 14, 2020 at 23:19
  • @MrWhite Yes, the extension .html is real. Your answer below was awesome btw. Thank you! Commented Nov 15, 2020 at 10:10

1 Answer

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [L,R=301]

The regex ^(.+)/$ only matches URLs that end with a slash, so it matches /pagename.html/, but not /pagename.html/randomchars.

Simply removing the $ would make the regex match a slash anywhere in the URL-path (you probably have multiple path segments in some of your URLs). To avoid that, you can specifically match the file extension .html. This also removes the need for the directory check.

For example, try the following instead:

RewriteRule ^(.+?\.html)/ /$1 [R=302,L] 

Note the absence of the trailing end-of-string anchor ($) on the regex.

I made the regex non-greedy (i.e. the ? in .+?) so that it matches the first instance of .html in the URL-path and not the last. Otherwise, if the "randomchars" also contained an instance of .html/ then you'd potentially get just another broken request/redirect.

Test first with a 302 (temporary) redirect to avoid potential caching issues. Only use a 301 (permanent) redirect when you are sure everything is working OK.
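Putting the pieces above together, a minimal sketch of the relevant part of the .htaccess might look like the following. It assumes mod_rewrite is enabled, that the file sits in the document root, and that this rule takes the place of the original trailing-slash rule; the request paths in the comments are made-up examples, not URLs from the question.

```apache
# Assumption: this is the .htaccess file in the site's document root.
# RewriteEngine On is only needed once per file.
RewriteEngine On

# Redirect anything that follows the first ".html/" back to the file itself:
#   /pagename.html/randomtext    -> /pagename.html
#   /pagename.html/              -> /pagename.html
#   /pagename.html/foo.html/bar  -> /pagename.html  (non-greedy: first ".html" wins)
# RewriteRule matches against the URL-path only, so a query string such as
# ?id=1 is carried over to the redirect target unchanged.
# R=302 while testing; switch to R=301 once the behaviour is confirmed.
RewriteRule ^(.+?\.html)/ /$1 [R=302,L]
```

A plain request for /pagename.html contains no slash after the extension, so it does not match the pattern and is not redirected again.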

"I'm not really sure how or why these broken URLs are rendering at all."

If pagename.html is a real file then the trailing /randomchars is additional pathname information (path-info) on the URL. Whether Apache accepts path-info on the URL or responds with a 404 depends (by default) on the handler that manages the file type (.html, .php, etc.). The handler for .html files usually rejects path-info by default, so it is probably being accepted here because you've explicitly enabled it or are parsing .html files as PHP (or something similar).

If this is indeed path-info (as described above) then you could instead disable path-info for all file types:

AcceptPathInfo Off 

If you then remove the redirect (RewriteRule directive) above then such URLs will generate a 404 as intended. If you keep the redirect then the redirect will still occur.
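If other parts of the site depend on path-info (for instance a script reached through URLs like /index.php/some/route, which is purely a hypothetical here and not something mentioned in the question), switching it off for every file type may be too broad. One narrower option, sketched below on the assumption that your host permits <FilesMatch> sections and the FileInfo override in .htaccess, is to reject path-info for .html files only:

```apache
# Reject trailing path-info only for files whose name ends in ".html";
# other file types keep their handler's default behaviour.
<FilesMatch "\.html$">
    AcceptPathInfo Off
</FilesMatch>
```

With the redirect removed, a request such as /pagename.html/randomtext should then return a 404, while /pagename.html itself is served as usual.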

"with that malformed URL, part of the page renders, but renders with broken styling for some reason."

The broken styling (failure to load CSS) is most probably due to using relative URL-paths to your CSS files. The additional path segment(s) on the URL (i.e. the path-info) change the base URL that relative URLs are resolved against. For example, a relative URL of the form css/mystyles.css will resolve to /pagename.html/css/mystyles.css instead of /css/mystyles.css as is probably intended, so the stylesheet itself is never returned and no styles are rendered. See my answer to the following question for more information on this: .htaccess rewrite URL leads to missing CSS

  • If I could vote you up 20 times I would. Thank you for this awesome answer. And yes, the path to CSS was relative. That makes perfect sense. Also fixed now. Thank you! Commented Nov 15, 2020 at 10:11
  • Also: Yes, as you guessed, I have .php parsing enabled in .html pages. Commented Nov 15, 2020 at 10:25
  • "If I could vote you up 20 times I would", wasn't enough praise already? ;) Everything worked perfectly, thanks again! EDIT: Ah.. sorry didn't accept the answer. My bad. Just did. Commented Nov 22, 2020 at 19:06
