11

The wget man page states this, under the section for the --random-wait parameter:

 Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. [...] A 2001 article in a publication devoted to development on a popular consumer platform provided code to perform this analysis on the fly. Its author suggested blocking at the class C address level to ensure automated retrieval programs were blocked despite changing DHCP-supplied addresses. 

I want to obtain a copy of this article for reading, and have tried many searches on the Internet to determine the article. However, all I can find with these searches is the man page for wget hosted on different websites; and some other research papers having no relation at all with this topic.

Does anyone know which article is being referred to and where I can obtain a copy?

1

2 Answers 2

14

Even though not a direct answer, git blame and git log reveal that this section was introduced in commit 2c41d783 by a committer called hniksic, who turns out to be Hrvoje Niksic. His email address can be found in wget's ChangeLog file (I won't publish it here for the obvious reasons). I'd suggest asking him directly, as he might be the best to give a more adequate answer. While at it, you might consider asking him whether he would mind updating the manpage accordingly. ;)

3

I think it might be this article:

Creating meaningful data from web logs using base SAS

There's a paragraph discussing blocking of class C ranges:

Once the IP address is separated into its components the filtering of ranges of IP addresses is simple. A class B filter is done against the first two octets, e.g. 168.126.xx.xx. This is variable Onetwo in the code example above. Class C ranges are more commonly used as they target entire servers and use three of the four octets, e.g. 168.126.56.xx. In the code sample above, this the field Three given that Usrhost is the web log’s TCP/IP address value.

And one mentioning wget in user agent string-based blocking:

Our preferred method for user agent string identification utilizes the index pattern matching function. For example:

if index(lowcase(agentstr), 'keynote') or index(lowcase(agentstr), 'sureseeker') or index(lowcase(agentstr), 'wget') or 

It was the fifth result in Googling for "log analysis wget" for the year 2001.

You must log in to answer this question.