Command line tool to check when a URL was updated?

Question

It would certainly be possible to whip together something in Python to query a URL to see when it was last modified, using the HTTP headers, but I wondered if there is an existing tool that can do that for me? I'd imagine something like:

    % checkurl http://unix.stackexchange.com/questions/247445/
    Fri Dec 4 16:59:28 EST 2015

or maybe:

    % checkurl "+%Y%m%d" http://unix.stackexchange.com/questions/247445/
    20151204

as a bell and/or whistle. I don't think that wget or curl have what I need, but I wouldn't be surprised to be proven wrong. Is there anything like this out there?

curl --head url seems to report the headers to me. Assuming a Last-Modified header does come through, curl --silent --head url | awk -F': ' '/^Last-Modified/{print $2}' should be able to extract the value
– iruvar, Commented Dec 4, 2015 at 22:26
Note, though, that Last-Modified headers are completely useless on many sites because the pages are generated dynamically from a database and will always return a Last-Modified header of approximately now. This is often done deliberately as a cache-busting technique and to force re-fetches (and thus marketable page views) when the client requests the page with an If-Modified-Since request header.
– cas, Commented Dec 5, 2015 at 0:49
That's true enough. My particular use case is to monitor Web-based downloads of databases, so that's less of an issue. Still something to be mindful of, though.
– Scott Deerwester, Commented Dec 5, 2015 at 1:10
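For readers unfamiliar with the If-Modified-Since header cas mentions, here is a minimal Python sketch of how a client might build that header and read the reply. It assumes the server honours conditional requests, which, as noted above, many dynamically generated sites do not. The helper names (http_date, conditional_headers, is_unchanged) are made up for illustration and come from no library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date string."""
    # format_datetime(usegmt=True) insists on a UTC datetime, so convert first.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def conditional_headers(last_seen: datetime) -> dict:
    """Request headers asking the server to answer 304 if nothing changed."""
    return {"If-Modified-Since": http_date(last_seen)}


def is_unchanged(status_code: int) -> bool:
    """304 Not Modified means the resource has not changed since last_seen."""
    return status_code == 304
```

The dictionary returned by conditional_headers can be passed as the request headers of whichever HTTP client is in use; a 200 reply then means the resource changed (or the server ignores the header).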

RobertL · Accepted Answer · 2015-12-05 18:10:51Z

This seems to fit your requirements (updated to use '\r\n' as record separator for response data):

    #!/bin/sh

    get_url_date()
    {
        curl --silent --head "${1:?URL ARG REQUIRED}" |
            awk -v RS='\r\n' '
                /Last-Modified:/ {
                    gsub("^[^ ]*: *", "")
                    print
                    exit
                }
            '
    }

    unset date_format
    case $1 in
        (+*)
            date_format="$1"
            shift
            ;;
    esac

    url_date="$(get_url_date "${1:?URL ARG REQUIRED}")"

    if [ -z "$url_date" ]
    then
        exit 1
    fi

    if [ "$date_format" != "" ]
    then
        date "$date_format" -d"$url_date"
    else
        echo "$url_date"
    fi

See also tail -n+2 | formail -zx last-modified to extract header values (tail skipping the response line)
– Stéphane Chazelas, Commented Mar 4, 2022 at 17:00
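The '\r\n' record separator matters because HTTP header lines end in CRLF, and header names are case-insensitive (some servers send last-modified in lower case). The following Python sketch shows the same extraction on a raw header block such as the output of curl --head. The function names and the SAMPLE_RESPONSE text are invented for illustration; this is not a translation of the script above, only the parsing step under those assumptions.

```python
from email.utils import parsedate_to_datetime
from typing import Optional

# Made-up response text, only to exercise the functions below.
SAMPLE_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "last-modified: Fri, 04 Dec 2015 21:59:28 GMT\r\n"
    "\r\n"
)


def extract_header(raw_headers: str, name: str) -> Optional[str]:
    """Return the value of header `name` from a raw HTTP header block."""
    wanted = name.lower()
    # splitlines() treats CRLF as one line break, so no stray carriage
    # return is left on the value. [1:] skips the status line.
    for line in raw_headers.splitlines()[1:]:
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == wanted:
            return value.strip()
    return None


def format_last_modified(raw_headers: str,
                         date_format: Optional[str] = None) -> Optional[str]:
    """Return Last-Modified as sent, or reformatted with a '+%Y%m%d'-style format."""
    value = extract_header(raw_headers, "Last-Modified")
    if value is None or date_format is None:
        return value
    if date_format.startswith("+"):
        date_format = date_format[1:]   # drop the date(1)-style leading '+'
    return parsedate_to_datetime(value).strftime(date_format)
```

Note that the reformatted date is in the timezone the server sent (GMT for a conforming Last-Modified header), whereas date -d in the shell script converts to local time.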

nwk · Accepted Answer · 2015-12-05 18:23:53Z

A Perl one-liner:

    % perl -MLWP::Simple -MDate::Format -e 'print time2str "%C\n", (head $ARGV[0])[2]' http://example.com
    Sat Aug 10 02:54:35 EEST 2013

On a modern Linux or FreeBSD system the modules it requires are likely to be already installed.

Scott Deerwester · Accepted Answer · 2015-12-05 01:48:48Z

It turns out that curl and wget can both do this, but it's probably worth doing in Python after all. Here's what I ended up writing:

    #!/usr/bin/env python3
    import sys, dateutil.parser, subprocess, requests
    from getopt import getopt

    errflag = 0
    gTouch = None
    gUsage = """Usage: lastmod [-t file] url
    where:
      -t file   Touches the given file to make its modification
                date the same as the URL modification date.
      url       A URL to be retrieved
    """

    opts, args = getopt(sys.argv[1:], "t:v?")
    for k, v in opts:
        if k == "-t":       # File to touch
            gTouch = v
        elif k == "-?":     # Write out usage and exit
            errflag += 1

    if len(args) != 1:
        errflag += 1

    if errflag:
        sys.stderr.write(gUsage)
        sys.exit(1)

    # HEAD fetches only the headers, not the document body
    res = requests.head(args[0])
    if res.status_code != 200:
        sys.stderr.write("Failed to retrieve URL\n")
        sys.exit(1)

    if 'Last-Modified' not in res.headers:
        sys.stderr.write("Headers has no last-modified date\n")
        sys.exit(1)

    dt = dateutil.parser.parse(res.headers['Last-Modified'])

    if gTouch:
        # touch -t reads its argument as local time, so convert from GMT;
        # note %M (minute), not %m (month), after the hour
        subprocess.call(["touch", "-t", dt.astimezone().strftime("%Y%m%d%H%M"), gTouch])
    else:
        sys.stdout.write("%s\n" % dt.ctime())
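As a possible alternative to shelling out to touch for the -t option, the sketch below sets the file times directly with os.utime, which takes epoch timestamps and so avoids formatting the date as a local-time string. It uses the standard library's email.utils.parsedate_to_datetime in place of dateutil; that swap, and the function name touch_to_http_date, are choices made for this example and are not part of the script above.

```python
import os
from email.utils import parsedate_to_datetime


def touch_to_http_date(path: str, last_modified: str) -> float:
    """Set path's access and modification times to the given HTTP date.

    Creates the file if it does not exist (as touch would) and returns
    the epoch timestamp that was applied.
    """
    timestamp = parsedate_to_datetime(last_modified).timestamp()
    with open(path, "a"):
        pass  # opening in append mode creates the file without truncating it
    os.utime(path, (timestamp, timestamp))
    return timestamp
```

It would be called with the header value from the response, for example touch_to_http_date(gTouch, res.headers['Last-Modified']).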

You really want the requests module (docs.python-requests.org/en/latest)
– iruvar, Commented Dec 4, 2015 at 23:48
I agree in principle, but the above only grabs the header, which is what I really wanted, and it looks like requests.get() grabs the whole document, which I really don't want. In fact, using requests does seem to take a LOT longer. If I've got it wrong, would you mind posting a snippet that illustrates how just to check the header?
– Scott Deerwester, Commented Dec 5, 2015 at 0:04
Sure. requests.get('http://unix.stackexchange.com/questions/247445/').headers['Last-Modified']
– iruvar, Commented Dec 5, 2015 at 1:03
But does the call to get() retrieve the whole document as well as the headers, or is the retrieval of the document itself postponed until it's required?
– Scott Deerwester, Commented Dec 5, 2015 at 1:08
Good point. Although I'm approaching the limit of my HTTP knowledge, this should do the job: requests.head('http://unix.stackexchange.com/questions/247445/').headers['Last-Modified']
– iruvar, Commented Dec 5, 2015 at 1:14
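The distinction discussed in these comments is the HTTP method: a HEAD request asks the server for the headers only, with no body. If installing requests is not an option, roughly the same thing can be sketched with the standard library's urllib, as below. The function names are invented for this example, and the 10-second timeout is an arbitrary assumption.

```python
import urllib.request
from typing import Optional


def build_head_request(url: str) -> urllib.request.Request:
    """Build a request that asks for headers only (HTTP HEAD)."""
    return urllib.request.Request(url, method="HEAD")


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the Last-Modified header for url, or None if the server sent none.

    urlopen raises urllib.error.HTTPError for 4xx/5xx replies.
    """
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
        # Header lookup on the response is case-insensitive.
        return response.headers.get("Last-Modified")
```

Calling fetch_last_modified('http://unix.stackexchange.com/questions/247445/') performs a network request, so the result depends on what the server returns at that moment.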

ahmetpergamum · Accepted Answer · 2022-03-04 16:51:43Z

Check out Carbon14, a command-line Python tool for detecting a web page's history from its images. It works fine as long as the page you are inspecting contains some images. Install it from the Carbon14 GitHub repository, then run it with:

    python carbon14.py <url>
