3

It would certainly be possible to whip together something in Python to query a URL to see when it was last modified, using the HTTP headers, but I wondered if there is an existing tool that can do that for me? I'd imagine something like:

% checkurl http://unix.stackexchange.com/questions/247445/
Fri Dec 4 16:59:28 EST 2015

or maybe:

% checkurl "+%Y%m%d" http://unix.stackexchange.com/questions/247445/
20151204

as a bell and/or whistle. I don't think that wget or curl have what I need, but I wouldn't be surprised to be proven wrong. Is there anything like this out there?

3
  • curl --head url seems to report the headers to me. Assuming a Last-Modified header does come through, curl --silent --head url | awk '/Last-Modified/{print $2}' should be able to extract the first field of the value (adjust the awk action to print the whole date). Commented Dec 4, 2015 at 22:26
  • Note, though, that Last-Modified headers are completely useless on many sites because the pages are generated dynamically from a database and will always return a LM header of approximately now. This is often done deliberately as a cache-busting technique and to force re-fetches (and thus marketable page views) when the client requests the page with an If-Modified-Since request header. Commented Dec 5, 2015 at 0:49
  • That's true enough. My particular use case is to monitor Web-based downloads of databases, so that's less of an issue. Still something to be mindful of, though. Commented Dec 5, 2015 at 1:10
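
One rough way to spot the dynamically generated case described in these comments is to compare Last-Modified against the response's own Date header: if the two are within a few seconds of each other, the value probably says nothing about when the content actually changed. The sketch below is only a heuristic; the function name `looks_dynamically_stamped` and the five-second tolerance are assumptions made for illustration, not anything proposed in the thread.

```python
from email.utils import parsedate_to_datetime


def looks_dynamically_stamped(last_modified, date_header, tolerance_seconds=5):
    """Return True if Last-Modified is within tolerance of the Date header.

    Both arguments are raw HTTP header values, for example
    "Fri, 04 Dec 2015 21:59:28 GMT". The tolerance is an arbitrary guess.
    """
    modified = parsedate_to_datetime(last_modified)
    served = parsedate_to_datetime(date_header)
    return abs((served - modified).total_seconds()) <= tolerance_seconds


if __name__ == "__main__":
    # Stamped two seconds before it was served: probably generated on the fly.
    print(looks_dynamically_stamped("Sat, 05 Dec 2015 00:49:00 GMT",
                                    "Sat, 05 Dec 2015 00:49:02 GMT"))
    # Last changed a day before it was served: probably a real timestamp.
    print(looks_dynamically_stamped("Fri, 04 Dec 2015 21:59:28 GMT",
                                    "Sat, 05 Dec 2015 22:00:00 GMT"))
```

For static downloads such as database dumps, the check should normally come back False, which matches the use case mentioned above.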

4 Answers

4

This seems to fit your requirements (updated to use '\r\n' as record separator for response data):

#!/bin/sh

get_url_date()
{
    curl --silent --head "${1:?URL ARG REQUIRED}" |
        awk -v RS='\r\n' '
            /Last-Modified:/ {
                gsub("^[^ ]*: *", "")
                print
                exit
            }
        '
}

unset date_format
case $1 in
    (+*)
        date_format="$1"
        shift
        ;;
esac

url_date="$(get_url_date "${1:?URL ARG REQUIRED}")"

if [ -z "$url_date" ]
then
    exit 1
fi

if [ "$date_format" != "" ]
then
    date "$date_format" -d"$url_date"
else
    echo "$url_date"
fi
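
For comparison, the extraction and formatting steps of this script can be sketched in Python. This is not part of the answer: the function names are made up for illustration, and the sample header text stands in for what `curl --silent --head` would print. The header name is matched case-insensitively here, since HTTP header names are case-insensitive and some servers send them in lowercase.

```python
from email.utils import parsedate_to_datetime


def extract_last_modified(raw_headers):
    """Return the Last-Modified value from raw header text, or None."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "last-modified":
            return value.strip()
    return None


def format_last_modified(raw_headers, date_format=None):
    """Return the header value, optionally reformatted with strftime.

    date_format is a strftime pattern without the leading '+',
    for example "%Y%m%d". Returns None when the header is missing.
    """
    value = extract_last_modified(raw_headers)
    if value is None:
        return None
    if date_format is None:
        return value
    return parsedate_to_datetime(value).strftime(date_format)


if __name__ == "__main__":
    sample = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "last-modified: Fri, 04 Dec 2015 21:59:28 GMT\r\n"
        "\r\n"
    )
    print(format_last_modified(sample))
    print(format_last_modified(sample, "%Y%m%d"))
```

One difference to keep in mind: `date -d` in the shell script converts to the local time zone, while this sketch formats the date in the zone given by the header (normally GMT).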
1
  • See also tail -n+2 | formail -zx last-modified to extract header values (tail skipping the response line) Commented Mar 4, 2022 at 17:00
3

A Perl one-liner:

% perl -MLWP::Simple -MDate::Format -e 'print time2str "%C\n", (head $ARGV[0])[2]' http://example.com
Sat Aug 10 02:54:35 EEST 2013

On a modern Linux or FreeBSD system the modules it requires are likely to be already installed.

1

It turns out that curl and wget can both do this, but it's probably worth doing in Python after all. Here's what I ended up writing:

#!/usr/bin/env python3
import subprocess
import sys
from getopt import getopt

import dateutil.parser
import requests

errflag = 0
gTouch = None
gUsage = """Usage: lastmod [-t file] url
where:
    -t file   Touches the given file to make its modification date
              the same as the URL modification date.
    url       A URL to be retrieved
"""

opts, args = getopt(sys.argv[1:], "t:v?")
for k, v in opts:
    if k == "-t":    # File to touch
        gTouch = v
    elif k == "-?":  # Write out usage and exit
        errflag += 1
if len(args) != 1:
    errflag += 1
if errflag:
    sys.stderr.write(gUsage)
    sys.exit(1)

# HEAD fetches only the headers, not the document body
res = requests.head(args[0])
if res.status_code != 200:
    sys.stderr.write("Failed to retrieve URL\n")
    sys.exit(1)
if 'Last-Modified' not in res.headers:
    sys.stderr.write("Headers have no last-modified date\n")
    sys.exit(1)

dt = dateutil.parser.parse(res.headers['Last-Modified'])
if gTouch:
    # touch -t expects local time, so convert from the header's GMT;
    # %M is minutes (%m would be the month)
    subprocess.call(["touch", "-t", dt.astimezone().strftime("%Y%m%d%H%M"), gTouch])
else:
    sys.stdout.write("%s\n" % dt.ctime())
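
An alternative to shelling out to touch is to set the timestamp directly with os.utime, which takes seconds since the epoch and so sidesteps any time zone conversion. This is a minimal sketch that assumes the header value has already been fetched; the helper name `set_mtime_from_http_date` is hypothetical, and unlike touch it expects the file to exist already.

```python
import os
import tempfile
from email.utils import parsedate_to_datetime


def set_mtime_from_http_date(path, http_date):
    """Set path's access and modification times from an HTTP date string.

    Returns the timestamp that was applied, in seconds since the epoch.
    """
    timestamp = parsedate_to_datetime(http_date).timestamp()
    os.utime(path, (timestamp, timestamp))
    return timestamp


if __name__ == "__main__":
    # Demonstrate on a throwaway file so nothing real is modified.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        target = handle.name
    try:
        set_mtime_from_http_date(target, "Fri, 04 Dec 2015 21:59:28 GMT")
        print(os.path.getmtime(target))
    finally:
        os.remove(target)
```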
6
  • You really want the requests module (docs.python-requests.org/en/latest). Commented Dec 4, 2015 at 23:48
  • I agree in principle, but the above only grabs the header, which is what I really wanted, and it looks like requests.get() grabs the whole document, which I really don't want. In fact, using requests does seem to take a LOT longer. If I've got it wrong, would you mind posting a snippet that illustrates how to check just the header? Commented Dec 5, 2015 at 0:04
  • Sure. requests.get('http://unix.stackexchange.com/questions/247445/').headers['Last-Modified'] Commented Dec 5, 2015 at 1:03
  • But does the call to get() retrieve the whole document as well as the headers, or is the retrieval of the document itself postponed until it's required? Commented Dec 5, 2015 at 1:08
  • Good point. Although I'm approaching the limit of my HTTP knowledge, this should do the job: requests.head('http://unix.stackexchange.com/questions/247445/').headers['Last-Modified'] Commented Dec 5, 2015 at 1:14
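
To make the distinction in these comments concrete: a HEAD request asks the server for the headers only, so no document body is transferred. The sketch below uses the standard library's urllib.request rather than requests so that it has no third-party dependency; the function names are illustrative, and it assumes the server answers HEAD requests, which not every server does.

```python
import sys
import urllib.request


def build_head_request(url):
    """Build a request that asks for headers only (HTTP HEAD)."""
    return urllib.request.Request(url, method="HEAD")


def fetch_last_modified(url, timeout=10):
    """Return the Last-Modified header value for url, or None if absent."""
    request = build_head_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Header lookup on the response is case-insensitive.
        return response.headers.get("Last-Modified")


if __name__ == "__main__":
    # Only touches the network when a URL is given on the command line.
    if len(sys.argv) == 2:
        print(fetch_last_modified(sys.argv[1]))
```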
0

Check out Carbon14, a command-line Python tool for detecting a web page's history from its images. It works well as long as the page you are inspecting contains some images. Install it from the Carbon14 GitHub repository, then run it with:

python carbon14.py <url> 
