Pavuk - Last Changes

Pavuk

Last Update: April 19 2005

Last Changes :

* ---------- released version 0.9.33 (2005-09-27)
* fixed 64bit problems (BUG #1226863)
* updated German locale, fixes done by Debian developers (Hey, please inform us about errors. Scanning the net and all distributions for possible fixes is not very helpful.)
* ---------- released version 0.9.34 (2006-01-09)
* security fixes
* some minor bug fixes
* reworked build system a lot, fixed RPM spec file
* now builds fine using most of the possibilities pavuk provides
* RPM builds on openSUSE build service for SUSE since version 9.3, Fedora since version 4 and Mandriva since version 2006
* RPM packages can be found here: http://software.opensuse.org/download/home:/dstoecker/
* ---------- released version 0.9.35 (2007-02-21)
* added -persistent/-nopersistent option

2007-april-30 [notes taken from old work back in 2005/2006 merged into pavuk mainstream source tree]

* bufio has seen a MAJOR overhaul. It is now capable of pushing text & binary data to the file system at unprecedented rates. This is done by adding a variable sized (and possibly large) memory cache, resulting in large size I/O operations. These perform much faster than the regular RTL I/O calls (tested on quad CPU UNIX Dell servers). The new bufio was required as I needed to log/track a huge amount of data in the shortest possible time at the lowest possible CPU load.
* cookie handling has been fixed/augmented. pavuk can now have the initial cookie values that go with a certain web request preconfigured on the commandline. Also, several bugs in handling the cookies have been fixed. (Tested on a wicked ASP.NET intranet site which 'assumed' the use of a special web client (a TV set top box) which would transmit its serial # as a client-side created(!) cookie to the web server. This site/client combo thus actually transmitted cookies which would first show up in a web _request_ instead of the usual: a server-side _response_.)
* several portability items have been changed (h_errno, ...)
to make the code compile and work on the odd-flavored UNIX box. A native Win32 port is under way: it now works, including zlib and OpenSSL, though the latter has not been tested recently. Note that the changes may have broken GTK support, as I was not able to build the code with GTK on my UNIX boxes.
* socket I/O (IP traffic) has been fixed to properly cope with user breaks (a user hitting Ctrl+C). Several locations in the software where the unexpected signal would cause an infinite loop have been identified and fixed.
* added several lines of DEBUG_xxx to aid both developer and user in tracking down hard to diagnose issues inside pavuk while scanning a site.
* Accept-Encoding (more specifically: the handling of x-gzip/gzip/x-compress/compress encoding) has been changed to allow for better portability: data is expanded in-memory, without the need for an external 'gzip' tool and/or OS-specific forks & pipes. (Win32 wouldn't know a fork if ever it saw one.)
* ALL stdio is now handled through the new bufio system. This not only improves performance when you've got -debug and -debuglevel dialed all the way up, but also corrects several spots where, depending on your C RTL, stdout/stderr traffic would arrive at different moments on your console (some of it was written through the FILE I/O, some through direct I/O, causing blurbs of output to pass one another along the way to the actual console).
* buffer overrun protection has been improved. Note also that every snprintf() and derivative thereof is now 'augmented' by an additional line of code which ensures that the last character in the buffer is guaranteed to be a NUL sentinel, so the buffer will always present data in correct C string format (NUL-terminated).
(This is an old habit of mine, as some C RTLs have proven to be kinda flaky on the subject of NUL sentinels when snprintf() et al are writing data up to the edge of their output buffers: some C RTLs 'forget' to put a NUL there under particular circumstances; some commercial Watcom compiler releases come to mind.)
* multithreading: pavuk has been tested on a high perf MP UNIX box and it was like the documentation/notes state somewhere: unstable. The thread interlocking has now been fixed; one of the hardest to fix proved to be the lockup at the end of a pavuk run. The fix also includes the use of semaphores and some additional code changes to make the code thread safe; critical sections are now handled as such. This includes placing several non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!
* auto-form-filling (the feature which led me to select pavuk over wget et al when I started the hammer/chunky project) has been fixed for those special pages where you have an empty form to submit: the site I had to test included such a form, which was submitted using javascript, but did not contain _any_ input fields (but cookies were expected to come with that request, thank you). Before, pavuk crashed on such a page. This has now been fixed.
* added a 'reindent' target to the makefile, using GNU indent to reformat the code. (When you're working several weeks on end in crunch time, you want to see some proper and consistent looking source code, even when you just made it a mess yourself...) Also extended the cleanup makefile target to help me in cleaning up any backup and/or temporary files created by vi and some log diagnostic scripts. [edit may/2007: wasn't this already in the makefiles before - see ChangeLog entry in 2003?]
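For readers unfamiliar with the snprintf() NUL-sentinel habit mentioned a few entries up, here is a minimal sketch of the idea. The helper name safe_snprintf() is hypothetical and is not taken from the pavuk sources, where the extra line is described as sitting next to each call.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not a pavuk function): format into buf, then force
 * the final byte to NUL regardless of what the C runtime's vsnprintf() did. */
static int safe_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    buf[size - 1] = '\0';   /* the additional "sentinel" line */
    return n;               /* C99: length the full output would have had */
}
```

On a conforming C99 runtime the extra assignment is redundant; it only matters on runtimes that leave the buffer unterminated when the output is truncated.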
* added several commandline parameter types, which allow you to instruct pavuk to use OS file handles or file names for logging activity, while you can now also specify whether a log file should be overwritten (default) or appended to (new feature) by adding another '@' prefix to the file path. TODO: document this properly.
* added hammer/chunky modes: several ways to scan a web site and then rescan it. The higher (later) hammer mode has been specifically written to use pavuk as a 'replay attack' based DoS tool for testing high performance web servers. (bufio was overhauled to allow us to log all I/O data + diagnostics to disc while hammering the server, while the pavuk system _must_ perform better (= faster) than the web server when running both on equivalent hardware.)
* The native Win32 port has been overhauled (previous code was never released to the public) to make sure I did not have to look for OS-specific path elements _everywhere_ in the code (it was becoming a code-wise maintenance nightmare while fixing up/down all those 'absolute path' and 'path expansion' code sections to handle Win32 drive letters: root is '[A-Z]:[\\/]' instead of simply '/'). This has been fixed by using the cygwin 'path hack' for the native Win32 port too: root is '/cygdrive/[a-z]/' so it looks exactly like a UNIX path. Any places in the code which need to address the OS while passing an OS-specific path are now handled almost invisibly: all relevant C RTL calls (fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are now encapsulated in tl_[sysname] wrapper functions where these /cygdrive/[x]/ paths are converted back to native Win32 paths before the actual C RTL function is called. Also any debug/print statement which is used to report a file path is fixed to convert file paths to the native representation with a minimum of fuss: see the new tl_native() call for a description of how this was done.
This code has not been tested in a UNIX/MP environment, but the design is such that this should not cause any trouble (pthread port for Win32 is in progress ATM).
* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added a feature where you can now specify a set of debug levels and have some of those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev' level debug output: note the new '!' prefix.
* -debug_level output is now prefixed with its level in caps and square brackets, e.g. '[PROCE]', to aid in filtering the debug output (for instance by piping it through sed/grep).
* unified debug output handling in the code: -debug_levels are now only active when you specify -debug too.
* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded environment. gzip_decode() now has an in-memory implementation, using the zlib library, for those systems which do not support UNIX pipes/forks.
* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has been removed and the request header extended. (tested on a Wikipedia HTTP/1.1 compliant server) You may wish to permanently disable the code within decode.c if you do not wish to depend on the external gzip tool any more.
* _all_ system header file #include's have been removed from the sources and integrated into config.h to allow for better portable source code. config.h.in and autoconf.am have been extended to include several more OS-dependent system call and header file checks. A separate native Win32 version of the header file is also provided (used by the MSVC2005 native Win32 build).
* several hardcoded buffer sizes in the software have been made configurable (but remain hardcoded). See for instance dinfo.c: 12 --> PAVUK_INFO_DIRNAME and 1024-and-other-fixed-buf-sizes --> BUFIO_ADVISED_READLN_BUFSIZE
* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers caused havoc.
Code has been quickly reviewed to locate and fix additional spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red October, anyone? ;-) )
* hardcoded lock filenames have been converted to #define's to allow these to be changed in a single spot (config.h), improving portability. e.g.: '._lock' --> PAVUK_LOCK_FILENAME
* UNIX-specific octal privs have been changed to their proper #define's to allow for maximum portability (Win32 doesn't know '0644' but can cope with S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH, though maybe in an odd way).
* fixed quite a few spots where an unidentified form encoding method would lead to _very_ unstable behaviour, including crashes/core dumps. Look for fi->method = FORM_M_UNKNOWN assignments and additional FORM_M_UNKNOWN checks.
* added -no_dns support for those who have to work in an environment with flaky or no DNS support (I had to, as I was working on a box in a specially configured, partially walled-off DMZ zone while developing and testing pavuk against a web server.)
* fixed typos in the text as I came across them.
* the bufio overhaul also led to an overhaul of the -dumpxxx code, removing/fixing several spots in the code which caused incorrect/unstable behaviour. (e.g. code in doc.c)
* Fixed handling of compressed data for any text-based server response; pavuk now correctly handles any gzipped/deflated text, including, for instance, any 'text/javascript' content sent over the wire in compressed form (tested on a Wikipedia-based HTTP/1.1 compliant server).
* added -progress_mode: several choices in progress verbosity.
* added -no_disc_io: test a grab/scan without writing anything to disc. Mostly useful in combination with the earlier -hammer modes.
* fixed/updated HTTP error response handling in accordance with RFC2616 so I can better see what an HTTP/1.1 compliant target is reporting back to pavuk. (errcode.c et al)
* unified timing units to fix a few timing oddities: instead of minutes, etc.
the code uses seconds everywhere (apart, of course, from the few locations where we use milliseconds ;-) ). -timeout is now in milliseconds!
* Added -rtimeout and -wtimeout command line parameters. (unit: milliseconds)
* added -allow_persistent / -noallow_persistent commandline arguments to allow/disallow the use of HTTP/1.1 persistent connections.
* added -dumpcmd and -dumpdir commandline arguments.
* added -bad_content commandline argument for use with the hammer/chunky modes.
* added -report_url_on_err commandline argument: report the URL which was being processed when the error occurred.
* added -test_id commandline argument: this is included in the timing report so reports can be more easily processed / combined automatically.
* added -page_sfx commandline argument to help pavuk identify what suffixes are to be considered web pages (useful for scanning ASP and ASP.NET sites which present unusual mime types with their pages).
* added -tlogfile4sum commandline argument: specify a log file where timing info is stored. Handy when pavuk is not only used to grab the info off a site but also to scan & report site performance.
* added -encode commandline parameter as the counterpart of -noencode.
* added -nohtDig, -noquiet and -noverbose commandline parameters as counterparts of -htDig, -quiet and -verbose respectively.
* added filepath support to -dumpfd and -dump_urlfd: by specifying the option prefixed with a '@' character, pavuk will treat the option value as a filepath specification instead of an OS file handle and subsequently open the specified file internally. Note that adding yet another '@' character as a prefix signals pavuk to _append_ to the specified file, instead of _overwriting_ it. This is useful when you wish to have those dumps but are working in an environment where you cannot pass valid file handles through the commandline.
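The '@' / '@@' prefix convention for -dumpfd and -dump_urlfd described above could be decoded roughly as follows. This is only a sketch under stated assumptions: the type and function names (dump_target, parse_dump_target) are made up for illustration and pavuk's real option parser is not shown in these notes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types and names, for illustration only. */
enum dump_target_kind { DUMP_FD, DUMP_FILE_OVERWRITE, DUMP_FILE_APPEND };

struct dump_target {
    enum dump_target_kind kind;
    int fd;             /* meaningful when kind == DUMP_FD */
    const char *path;   /* meaningful otherwise; points into the option value */
};

/* Classify an option value: "@@path" appends, "@path" overwrites,
 * anything else is taken as a numeric OS file handle. */
static struct dump_target parse_dump_target(const char *value)
{
    struct dump_target t = { DUMP_FD, -1, NULL };

    assert(value != NULL);

    if (value[0] == '@' && value[1] == '@') {
        t.kind = DUMP_FILE_APPEND;
        t.path = value + 2;
    } else if (value[0] == '@') {
        t.kind = DUMP_FILE_OVERWRITE;
        t.path = value + 1;
    } else {
        t.fd = atoi(value);   /* simplistic: no error checking in this sketch */
    }
    return t;
}
```

A caller would then open the path with mode "a" for DUMP_FILE_APPEND and "w" for DUMP_FILE_OVERWRITE, or use the file handle directly for DUMP_FD.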
* added -dump_request and -nodump_request commandline arguments for use with -dumpfd: when -dump_request is specified, the log file will include a complete dump of each request sent to the server by pavuk. Thus you can produce a complete audit trail of the exchange.
* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a bit cleaner that way.
* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap around for time_t).
* added assert() checks at several locations in the code to help track down unexpected behaviour which could lead to crashes (like it did till now).
* unified the proliferation of HEX2ASC-alike macros with and without off-by-one offsets inside. Now there's one macro for each of 'em in tools.h.
* changed the configure.in option to --disable-threads to keep the pattern consistent (--disable-xxx series of options in configure), but the default behaviour remains the same.
* configure.in: as --disable-debug removes any debug-_related_ features from the pavuk build, these options have been added: --disable-debugging will create a default build with all debugging removed from the compiled binaries. --disable-prof and --disable-gprof have been added to remove any profile info from the default compiled binaries.
* added checks in configure.in for socklen_t, pid_t and a bunch of system calls and header files that do not live in each environment.

2007-may-6

* included pthreads-Win32 based multithreading support in the native Win32 build.
* included EXPERIMENTAL tre (regex) support in the native Win32 build.
* fixed several lurking bugs (buffer overruns, etc.) which only showed in a multithreaded environment.
* fixed locking bugs in the new bufio implementation.
* added Win32 memory leak + heap checking for the DEBUG build: many memory leaks have been tracked and fixed. (MSVC <crtdbg.h> based)
* fixed memory leak due to wrong scope in report_error() code.
* added DBGxxx macros to aid heap tracking for the debug build.
See DBGdecl/DBGpass/DBGvars usage.
* removed a very nasty memleak in html_parser_get_url() which would leak at least 3 blocks for each rejected local anchor URL - and there are quite a few of those! Took me a day to track it down. :-(
* added filtering so gzipped/compressed files on the server are not decompressed unintentionally while the server supports Accept-Encoding: gzip or compress. ( doc_download_helper() in doc.c )

2007-may-11

* renamed function should_leave_persistent() to the more appropriately named should_keep_persistent()
* Updated 'chunky' source to the state of the latest pavuk CVS contents (as of today) as this code has not yet been merged into CVS itself.
* fixed bugs in -scenario handling, when scenario files produced by pavuk are re-used in the Win32 environment
* fixed bugs in path & file type commandline arguments for the native Win32 port.
* fixed bug in retrying/resuming download for RFC2616 (HTTP/1.1) 'chunked' content download handling.
* merged -allow_persistent / -noallow_persistent commandline arguments with the equivalent -persistent/-nopersistent feature from the official pavuk CVS sources. Also improved the code a bit: added the 'Connection: close' header for requests over -nopersistent connections, so the server will close the connection for us.
* added the -ignore_chunk_bug commandline argument to allow pavuk to handle RFC2616 'chunked' downloads from buggy (IIS) web servers. See also:
    http://www.subbu.org/weblogs/main/2004/11/persistent_conn.html
    http://skrb.org/ietf/http_errata.html#chunk-size
    http://www.apps.ietf.org/rfc/rfc2616.html#sec-3.6.1
    http://www.jmarshall.com/easy/http/

2007-may/june

* recompiled in 64-bit Linux (SuSe 10.2) and fixed a few items in the Makefile.am, configure.in and ac-config.h.in files. Also added the tests\ and www\ directories to the distro.
* fixed a few 64-bit compile warnings; at least the test cases in tests\ perform OK now on a 64-bit Linux system.
* updated the man page a bit; still a lot more to do.
Where is that 'nroff for dummies' cheatsheet when you need it? ;-(
* listed -use_http11 as 'on' by default now.
* moved MODE_MIRROR unescape code section up in url.c to line 1682 in url_get_local_name_real() as this code would otherwise have no effect at all in any environment where the '%' percent character is included in the FS_UNSAFE_CHARACTERS charset (for example: Win32).
* PARAM_DOUBLE default values are now fixed point values in 'long' integer format; the current values in the program (all 0.0) are clearly within range _and_ it 'saves' on compiler warnings quite a bit. (We've still some way to go before we get anywhere near an '[almost-]zero-warning' cross platform portable build: a few int to pointer and vice versa casts remain.)
* fixed bug in cfg_get_num_params() which would access uninitialized memory out there in NirvanaLand when a PARAM_UNSUPPORTED option was passed to pavuk.
* Fixed configure.in to include 'debug' build handling for KDevelop (which would pass '--enable-debug=full' to ./configure).
* updated the configure.in script to increase portability (opendir/closedir: dirent.h et al)
* included a few autoconf macros in the m4 directory for easier/proper portability support using autoconf et al.
* bugs fixed from BUGS list: multithreaded mode is not as stable as single threaded (fixed at least for the CLI version of pavuk; the GTK GUI version is in rather bad shape)
* bugs fixed from BUGS list: signal handling / timeout does not really work (at least not in multi threaded downloads); after a SIGINT pavuk just hangs. This has also been fixed for the CLI version of pavuk at least.
* Win32 port now includes JavaScript support (using the statically linked Mozilla js library).
* fixed short option definitions in options.h: -tp / -tsp et al
* 'fixed' GUI for Javascript enabled builds (GTK2) - WARNING: it compiles now, but has NOT been tested, so expect bugs here!
* merged the 'chunky' code with the pavuk main source tree.
Now 'chunky' is equivalent to building pavuk with './configure --enable-hammer'.
* set default from -leave_site to -dont_leave_site to prevent 'blown up' web crawls when this filter parameter has not been specified. This change includes a fix for the cfg/command line handling of pavuk for the conditions section (see condition.h + config.c) as pavuk assumed sizeof(long)==sizeof(int) in these code sections.
* Now the proper GPL license (GPL, not LGPL) is included in the file ./COPYING.

2007-sep

* fixed processing of zero byte length files (robots.txt at figleaf.com, etc.): no more crash/assertion failure due to NULLed docu->contents.
* fixed a few memleaks.
* added extra error checking for file rename operations as some issues were found with the Win32 build when using a SAMBA-shared filesystem for storing the spidered data/files. (It turned out that the same issues existed when using native (NTFS, FAT32) filesystems.)
* dialed down the number of default threads from 3 to 1 (see BUGS) to prevent a hail of (legitimate) rename error reports.
* added flock() implementation for Win32: when built with multithreading support, having no valid flock() implementation is very dangerous!
* changed configure.in to detect both flock() and fcntl() file locking mechanisms so pavuk will be able to support writing spidered content to network shares on both Win32 and UNIX systems: flock() does not support locks on network shares; fcntl() does, at least on the latest Linux kernels (see flock(2)).
* added error reporting/checking for undesirable use of an invalid flock() implementation. (Useful when porting pavuk to other non-Unix platforms.)
* Fixed content/file size treatment code for items which are already available locally (i.e. pavuk finds that the item at the remote has not changed since the last time it fetched the item into local cache).
* Fixed the conditions for when to display certain informational messages: less screen clutter when not running in '-verbose' mode OR when running in '-progress' modes.
* Fixed several error/info messages in the code section for decompressing gzip/compress transmitted HTTP content.
* Fixed handling of gzip/compress transmitted content when retrieved from local store instead (when pavuk discovers that the file at the remote site has not changed since the last time it was fetched and stored on your local disc).
* Fixed a few memleaks.
* Changed the DBGvars/DBGpass/DBGargs macros used for tracing memory allocations in debug mode to make these macros look more like regular 'C' functions to 'demented' code formatters and analysis tools. The drawback is that these still look 'weird' in function prototypes, but that causes quite a few less errors/warnings than the old style.
* Fixed bugs in get_abs_file_path() directory detection and Win32 abs path processing. Also fixed code which produced double slashes in file paths on occasion, causing trouble on Win32 platforms. (Fix applied generally.)
* Fixed mk_native() allocated string management pool to support printf() et al where up to 3 mk_native() calls are made in the argument list. This is important to prevent spurious crashes in multithreaded mode when the worst case scenario for mk_native() applies: all threads are executing a printf()-style statement which has multiple calls to mk_native() in the argument list. Currently overdimensioned a bit, as the actual code only has two simultaneous calls while the pool is now dimensioned to tolerate 3 simultaneous calls per thread.
* No more _strfindnchr() and strfindnchr(): strfindnchr() - and its use - has now been fixed to match the (properly working) _strfindnchr(). [fnmatch.c/tools.c et al]
* Fixed const-correctness of several functions.
* Added '-mime_type_file' commandline option to help pavuk support an up-to-date list of mime types and their filename extensions, using, for example, the UNIX mime.types(5) config file as a source of MIME type information. If the user does not specify the '-mime_type_file' option, the original built-in defaults will be used instead. This feature has been added to provide better support for the pavuk -fnrules %M macro: this macro will now use this configuration to produce a suitable filename extension for each MIME type: the first extension listed in the '-mime_type_file' config file for the given MIME type will be used as extension for the %M macro.
* Changed the GTK GUI macros to become functions for ease of debugging. The added (tiny) call overhead won't be a performance hit anyway.
* Fixed -fnrules handling: the generated path is cleaned up before it is returned to pavuk for use. Cleanup actions:
    - duplicate '/' slashes are removed
    - filenames and directory names which end in a '.' dot get the dot removed
* Added '%X' to the -fnrules formatted processing to allow reformatting of filenames using an optional mimetype-derived extension. This is useful when grabbing Wiki (MediaWiki et al) sites when you'd like to store the grabbed content using default mimetype-related filename extensions, so a file like wiki/page/AboutThisSite would be stored as wiki/page/AboutThisSite.html, while pages like wiki/static_page/contact.htm would remain as is. (Note: this might be considered shorthand for a -fnrules (...) expression which compares both %e and %E. The intent of %X, however, is to only allow %e extensions to pass which are 'valid' for the given MIME type and force the %E mimetype based extension for all other cases.) CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in both simple mode and extended LISP mode.
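The selection rule that the %X entry above describes (keep the URL's own extension when it is listed for the MIME type, otherwise force the first listed one) can be sketched as below. Everything here is an assumption for illustration: the table layout, the names mime_entry and pick_extension(), and the choice to return the original extension for unknown types are not taken from the pavuk sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, tiny stand-in for a parsed mime.types table:
 * one MIME type plus its NULL-terminated list of extensions. */
struct mime_entry {
    const char *type;
    const char *exts[4];
};

static const struct mime_entry mime_table[] = {
    { "text/html",  { "html", "htm", NULL } },
    { "image/jpeg", { "jpeg", "jpg", "jpe", NULL } },
};

/* Return the extension (without leading dot) a %X-style rule would pick.
 * url_ext may be NULL when the URL has no extension at all. */
static const char *pick_extension(const char *mime_type, const char *url_ext)
{
    size_t i, j;

    assert(mime_type != NULL);

    for (i = 0; i < sizeof mime_table / sizeof mime_table[0]; i++) {
        if (strcmp(mime_table[i].type, mime_type) != 0)
            continue;
        for (j = 0; url_ext != NULL && mime_table[i].exts[j] != NULL; j++) {
            if (strcmp(mime_table[i].exts[j], url_ext) == 0)
                return url_ext;           /* already valid for this type */
        }
        return mime_table[i].exts[0];     /* force the MIME-based extension */
    }
    return url_ext;                       /* unknown type: left alone (assumption) */
}
```

With this table, a text/html page without an extension gets 'html', 'contact.htm' keeps 'htm', and a text/html page ending in 'php' is switched to 'html'.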
* Added '%Y', '%A' and '%B' to the -fnrules macros: '%Y' uses the MIME type preferred filename extension if the URL/filename doesn't have an extension yet (while the rather similar '%X' will OVERRIDE the existing extension if it is not listed with the specified MIME type). '%B' prints the 'basic MIME type', i.e. the MIME type without the ';' semicolon separated MIME attributes such as language, etc., while '%A' will print these attributes (if they were passed to us by the server). CAVEAT: %e/%E/%X/%Y will print the extension WITHOUT the leading '.' dot in both simple mode and extended LISP mode. All this allows for pavuk -fnrules commandline arguments like this:
    -fnrules F '*' '%h:%r/%d/%b%s.%Y' -mime_types_file ./mime.types -tr_chr_chr ':\\!&=?' '_'
  so we'll be able to grab a [Media]Wiki site while storing those pages as regular 'abc_php_xyz.html', instead of 'abc.php?xyz' page/filenames.
* Added -fnrules 'fnseq' operator to the extended rules: compares a wildcard pattern and a string a la fnmatch(3).
* Checked and updated manpage for the -fnrules operators (added 'ud' and 'sp' operators to the manpage).
* Added -fnrules 'sn' operator to the extended rules as counterpart of 'ns'. 'sn' uses strtol() to convert a string to a number, while 'ns' uses printf() to format a number to a string. (See the man page.)
* Updated the man page a bit regarding '-fnrules'.
* sanitized escape_str(); a quick code review led us to a lurking bug in uconfig.c@309, which has been fixed implicitly.
* Added/updated source code documentation: tools.c/tr.c source code comments.
* Added some sanity checks in the code (tools.c/tr.c/lfname.c)
* Added debug_level 'rules' to allow debugging of both simple and 'extended' -fnrules expressions and '-fnrules' URL F/R matching.
* Different boxes exhibit different mktime() behaviour, especially when handling out of range tm value sets.
Besides, mktime() works in 'local time' while some parts of the code require a robust UTC mkgmtime() (not available on many boxes) --> ripped & introduced as tl_mkgmtime(). A local time-aware equivalent with excellent out-of-range handling is available as tl_mktime().
* Added additional error handling around calls which try to parse time stamps using tl_mkgmtime() and tl_mktime() (times.c). Basically, both HTTP and FTP now benefit from the new code, which should process timestamps like the UTC timestamps they are, while 'out of UNIX time_t bounds' timestamps (beyond the range 1970..2038 A.D.) are handled in a more sane manner:
    - out of bounds timestamps are reported by pavuk
    - out of bounds timestamps are then 'sanitized', i.e. restricted to the 1/1/1970..31/12/2037 date range, i.e. a timestamp beyond the horizon, like '1/4/2051', will be 'sanitized' (= restricted) to the upper bound: 31/12/2037. The same goes for timestamps from antiquity like '11/3/1969' (the birthday of a certain person), which will be 'sanitized' towards 1/1/1970.
* Split up DEBUG into developer related stuff, such as memory/heap checking, ASSERT/VERIFY, etc. and user related stuff (the -debug and -debug_level command line arguments): ./configure is now fitted with an extra parameter: --enable/disable-debug-features, which will turn on/off -debug/-debug_level user level debugging support in pavuk, while the existing --enable/disable-debug adds/removes additional developer checks, such as heap allocation checks and ASSERT and VERIFY macros. In the code, -debug/-debug_level related code is located within the 'HAVE_DEBUG_FEATURES' sections, while the developer debug/release builds are still related to the standard 'DEBUG' #define.
This now results in three ./configure options that determine the (debug) feature set of your binary:
    --enable/disable-debugging --> compile a binary with source level debug info included and all optimizations DISabled for improved debugging (by using gdb or another debugger of your choice)
    --enable/disable-debug --> include/exclude additional run time checks in your binary. Most important are the ASSERT and VERIFY pre/post-condition validation methods located throughout the code. The use of these is advised, though these may cause a performance hit.
    --enable/disable-debug-features --> include/exclude user level -debug/-debug_level command line features, which help you as a pavuk user to 'debug' pavuk during the run. Using -debug, pavuk will be EXTREMELY verbose, which can be toned down by applying a -debug_level restriction filter. For example: -debug -debug_level all,!devel will be VERY verbose, but will NOT log any DEVEL level debug info, while: -debug -debug_level !all,rules will ONLY produce additional output for the RULES level, i.e. when pavuk processes -fnrules and/or JavaScript macros.
* Fixed crash when non-RFC compliant website was grabbed: see testcase 7a.
* Added targeted help: when options cannot be parsed correctly, short_usage() will try to help the user by printing the full help for the offending commandline option only. (Of course, I screwed up while using debug_level flag sets _again_ :-( [Ger])
* Some improvements for network connectivity error handling and reporting. (xvherror() added.) This is the result of some FTP tests with pavuk (tests 8b).
* Don't yak about 'Checking "robots.txt"' anymore when doing an FTP grab when robots.txt is NOT applicable anyway.
* FTP: added crude 'autodetect/retry' mechanism for FTP servers which do not like NLST (==> response code 550) but report correct directory content for LIST (or vice versa).
(ftp.c)
* FTP/HTTP: at debug level 'protoD' pavuk will now dump RAW data/content received from the server before preprocessing (i.e. converting to HTML or decompressing).
* Added command line option integer sizing support: byte sizes can now be specified in K, M or G. Other integer values can also be postfixed with K, M or G, but then these will be treated like the ISO values 1000, 1E6 and 1E9.
* Additional memory leak fixes in case pavuk is fed an invalid commandline.
* NTLM support code: fixed a few glaring bugs.
* Added O_SHORT_LIVED to lock file open() flags for better Win32 behaviour.
* Fixed code to load the pavuk configuration settings from, in order of appearance:
    env:PAVUKRC_FILE
    ~/.pavukrc
    SYSCONFDIR/pavukrc
  which matches the description in the manual. (see also man page)

2008-jan

* Added 'js' flag to '-debug_level', which is used to dump a lot of detail about the pattern matching and transformation applied to JavaScript code using the '-js_pattern' and '-js_transform / -js_transform2' commandline options.
* Added sanity check for '-js_pattern' and '-js_transform[2]' regexes, which MUST contain a subexpression for them to 'work' as expected.
* removed re_pmatch_sub() and changed the code where it was used to work with the available re_pmatch_subs() call, which allows for more elaborate validation anyway. See htmlparser.c.
* Removed a regex handling bug in the -js_transform[2] code, which would crash pavuk when using regexes where the first subexpression might be empty. The crash is due to the fact that the regex parser would return indexes '-1' for these empty subexpression(s), resulting in out-of-bounds memory writes in the rewrite code. This in turn would nuke the heap, so after that it was only a matter of time for pavuk to fail dramatically.
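The '-1' index problem from the last entry above is easy to reproduce with plain POSIX regexec(), where an optional group that did not take part in the match gets rm_so == rm_eo == -1. The sketch below shows the kind of guard that avoids using such offsets as array indexes. It assumes the POSIX <regex.h> API; pavuk can be built against other regex backends, and copy_submatch() is a hypothetical helper, not the code in htmlparser.c.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Copy subexpression idx of a successful regexec() match into out.
 * A group that did not participate in the match is treated as an empty
 * string instead of letting its -1 offsets reach memcpy().
 * Returns the number of bytes copied (excluding the terminating NUL). */
static size_t copy_submatch(const char *subject, const regmatch_t *pm,
                            size_t idx, char *out, size_t outsize)
{
    size_t len;

    assert(subject != NULL && pm != NULL && out != NULL && outsize > 0);

    if (pm[idx].rm_so < 0 || pm[idx].rm_eo < pm[idx].rm_so) {
        out[0] = '\0';
        return 0;
    }
    len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= outsize)
        len = outsize - 1;      /* truncate rather than overrun */
    memcpy(out, subject + pm[idx].rm_so, len);
    out[len] = '\0';
    return len;
}
```

For example, matching the pattern "(a)?b" against "b" succeeds but leaves group 1 unset, which is exactly the case the guard catches.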
2008 feb 04 * Added DEBUG_MISC() lines to solve sourceforge.net issue: [ 1852885 ] to improve manipulation by locally stored files * Included provisional fix (I don't have a working sample run to reproduce the issue (yet)) for sourceforge.net issue: [ 1852884 ] infinite loop on unexpected responses * Cleaned up the mess that was -progress_mode. * Cleaned up several DEBUG_xxx macro mistakes * Added a little description to the 'hidden' -htDig commandline option, which can be used to dump the server-transmitted MIME headers for each URL, similar to the htdig tool. * Added a bit of documentation for the -rollback option (which was undocumented) 2008 mar 20 * GNU gettext tools don't like '\r' in i18n strings --> fixed by changing the related printf() statements in src/doc.c * started update of configure scripts to the latest autoconf/automake. Also reordered the NEWS file so it will work with the new, stricter ./bootstrap && ./configure && make distcheck distro test cycle. 2008 jul 10 * fixed ';' semicolon bug in http.c near line 2074 which caused incorrect decoding of the HTTP/1.x response code header. * fixed gzip/compress/... content compression support (HTTP/1.1 Accept-Encoding); the previous code was a valiant attempt to 'fix' the client side (pavuk) to cope with buggy web servers which send the wrong encoding type for already compressed files, but this would screw up particular responses by *well-behaving* web servers. Of course this would only happen in rare circumstances so it was kinda hard to track down. Documentation for -Enc/-noEnc has been updated to reflect this situation and the code now (hopefully properly) finally supports compressed data transmission for RFC2616-compliant web servers.
If you find that your 'downloaded' compressed files are already /incorrectly/ DEcompressed by pavuk, this is NOT the fault of the client (pavuk) but evidence that your server is behaving inappropriately and the proper remedy for this is the use of the option '-noEnc' which turns this feature off so the server is not allowed to screw up in this way any more. Also made sure one can check if pavuk has been built with compression support by calling 'pavuk --version' and looking at the feature list. * autoconf/configure script: using the highly undocumented v_cflags or other x_* variables as environment variables to hack the configure script (you could do that, especially with v_cflags) has been obsoleted while the configure and m4/* scripts have been upgraded to support autoconf 2.62/automake 1.10 and use ONLY *documented* AC.*/etc. macros from now on. Note: thanks to the JavaScript library issues on SuSe10.2/AMD64 (older JS lib version and seemingly partial header install), I may have failed to eradicate all undocumented macros. * Extra note about configure.in: bash, at least on SuSe10.2/64-bit, handles 'if eval test ...' just ever so slightly differently from 'if test ...', especially where it comes to 'test -n'. As these styles were mixed rather arbitrarily before, the 'if eval test ...' style has been completely removed from the configure script, as this would sometimes render quite unexpected (and incorrect!) results. * fix_crlf.sh has been updated to ensure important Microsoft Visual Studio files are not damaged by having their CRLF sequences converted to UNIX LF line endings: this kind of thing will make MSVC spit you in the face and reject everything you try until you give it back those CRLF line endings in there. So much for XML as project file format and MSVC... * extra fixes to ensure 'make distcheck' does not barf up a hairball. This includes enforcing the permanent inclusion of the 'po' subdirectory in the Makefile set for multilingual support.
* configure/Makefile(s): if you don't have one or more of the archiving/compression tools compress/lzma/gzip/tar/7z(7zip) installed on your system, we don't go belly up at configure time nor at 'make dist' time anymore. This, of course, includes correct behaviour at 'make distcheck' time: only use/test those 'GNU standard' formats, which can be created on your box. * Added the 'bootstrap' shell script, next to 'autogen.sh'. I know they serve the (almost) same purpose, but 'bootstrap' is far more sophisticated than autogen.sh and I didn't wish to overwrite 'autogen.sh'. Besides, IDEs on UNIX boxen expect either the one or the other (there's no single 'standard' for this), so we might as well provide both. At a later time, we will probably point autogen.sh to bootstrap. * Updated the mime.types MIME 'hint' file: currently, it's a mix of 1) all properly registered MIME types ( http://www.iana.org/assignments/media-types/ ) 2) the mime.types file provided with the latest Apache/XAMPP 3) my (Ger Hobbelt) additional file extension hints as used on my own servers. This is mostly about professional graphics and modern 'scene' audio/video container formats, such as Matroska. This only adds extensions for otherwise already existing MIME types. * Updated the DocBook-based documentation for several options (-Enc/-noEnc, ...) * 'pavuk --version' now also reports if ZLIB support is included in the binary. This is important for '-Enc'. * Fixed the '-Enc' compressed transmission and HTTP header processing code to act properly with fully RFC2616-compliant web servers, discarding the old 'hack/fix' attempt to solve a non-compliant server issue at the client, as this would break things for fully compliant servers in the rare (but extremely annoying) use case: - pavuk with '-Enc' option - webserver is fully RFC2616 compliant - pavuk issues request for file in a .tar.Z or other gzip/compress compressed format, where the file on the server is only slightly compressed (fastest compression).
- webserver will transmit file to pavuk, but due to pavuk reporting it is able to handle compressed transmission AND the server discovering that the content can be compressed quite some more than it already was, the file will be transmitted after a server-side just-in-time compression round. - pavuk receives the data. The old hacked code would NOT decompress the data. However it SHOULD because the server PROPERLY reported 'Content-Encoding: gzip' to pavuk. End result: grabbed data which you cannot process nor trust to be in the same format as stored on the server as it all 'depends' on arbitrary conditions which you cannot control: is the web server able to compress the data before transmission? Is the web server configured to allow compression? Etc. This use case has now been fixed. The effect of BADLY behaving web servers (which send 'Content-Encoding: gzip' for any .Z, .z or .gz files (IIS x.x and other servers which are not configured to /properly/ handle files and MIME types)) is described in the DocBook manual page now, including the fix for this (specify the '-noEnc' commandline option with pavuk). * active FTP: timeout and stop/break handling slightly improved: now pavuk should always terminate under all circumstances while a break or stop has been signalled. * Changed the default for '-url_strategy' from 'level' to 'leveli' to make pavuk behave more like your regular web browser (with a user clicking through web pages). * Initial fix for NTLM support for 64-bit Windows. (Only lightly tested.) This includes converting that bit of code to support the C99 intNN_t types (where NN is one of {8,16,32}), while the configure script takes care of providing the proper types for not-fully-C99-compliant environments. * The TRE regex package would barf up a hairball due to the incorrect header file being loaded. ./configure now recognizes TRE specifics a bit better and the code now loads the proper header file (<tre/regex.h> instead of <regex.h>).
This is important on systems which have multiple, ever so slightly incompatible regex processing libraries installed. * Improved diagnostics a little bit by adding reporting support for URL_PARENT_REWRITING, i.e. the situation where a parent page of a grabbed page is loaded for the sake of adjusting (rewriting) the URLs in its content. * Fixed code so it would compile in full (-DDEBUG) debug mode on UNIX. * autoconf/configure: ran into some weird issues due to inconsistent M4 [] quoting: quite a few lines did without it. Turns out that this is a BIG No!No! as adding the AX_ADD_OPTION() macro turned this lurking mess into a true disaster. Fixed by applying [] quoting throughout. The only place where I didn't do it, is in the first and second args of AC_DEFINE() -- which should be used instead of AC_DEFINE_UNQUOTED when you don't need the latter's extra functionality anyway -- and the first arg of AC_DEFINE_UNQUOTED(). Any other spot where [] quotes are missing in the M4 macros and/or configure.in? Consider that a bug and please report so I can fix it. * Finally got the configure system to recognize my JavaScript libraries and all. Tugged and tweaked a few items in the bindings to allow maximum flexibility for the JS code when it is used to filter URLs (e.g. JavaScript pavuk_url_cond_check() function). * Updated jsbind.c to use latest SpiderMonkey 1.8.x (tested on Win32) * Changed man/Makefile to ensure HTML is not recreated every 'make' run, but only when manpage changes. This should really copy the results from ./doc/, but that's for later... * DocBook documentation: tweaked man page generation to mimic original manpage title exactly. * DocBook documentation: updated '-version' info (important to see at run-time what abilities you've got with /your/ pavuk). * Win32/MSVC: all project files have been updated to produce next to Win32/x86: Win64/AMD64 and Win64/Itanium binaries.
These project files assume the existence of all optional libraries: OpenSSL, SpiderMonkey (JavaScript), zlib. Where to get those, preferred directory layout, etc. to be published, so others can build from source on Win32/64 too and get the same results. 2008 jul 20 * tweaked configure+makefiles so that a 'make dist' from CVS becomes possible: there were quite a few references to yet unpublishable files in my makefiles (Ger Hobbelt). * config section: improved adherence to C standards: no more potentially dangerous mixed use of function and data pointers by typecasting function pointers into data pointers and vice versa. This has been resolved by an added layer of indirection, which makes it all very legal C again. It goes somewhat like this: function_pointer_type ptr = &function; data_pointer_type d = &ptr; then use (d[0])(...) to call the function. This contrasts with the old code: data_pointer_type d = (data_pointer_type)&function; and function invocation using: ((function_pointer_type)d)(...) * Added support for parsing 'hidden' CSS and JavaScript in HTML. The support is also extended to generally parse inside HTML comments PLUS Microsoft IE CC's (Conditional Comments): -read_css -read_cdata -read_msie_cc -read_comments These are all enabled by default; documentation has been updated for these as well. * Fixed CSS and [Java]Script handling in the HTML tokenizer/parser, which was feeding the filters and URL extractors (htmlparser.c). Now the code can cope better with incorrectly formatted pages / files. * Reordered the HTML tags in htmltags.c in a preparatory move to check the list for missing attributes (onXXX JavaScript items for one! several are missing) and HTML 3/4 tags. (htmltags.c) 2008 aug 13 * updated the -debug_level related code; DEBUG_DEVEL() and a few others now 'automagically' report the sourcefile+lineno without the need to specify these explicitly + some DEVEL_*() calls have been shifted to other '-debug_level' levels (net, mtthr, htmlform, ...)
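The function pointer indirection from the 2008 jul 20 config section note can be written out as a compilable sketch. All names here (handler_fn, data_ptr, twice, call_entry) are made up for illustration and do not appear in the pavuk sources; the sketch uses a plain void * as the data pointer type, which is an assumption about how such entries might be stored.

```c
#include <assert.h>

typedef int (*handler_fn)(int);   /* hypothetical function pointer type */
typedef void *data_ptr;           /* hypothetical generic data pointer  */

/* Example handler; stands in for any function stored in a table. */
static int twice(int x)
{
    return 2 * x;
}

/* The function pointer lives in an ordinary object ... */
static handler_fn twice_slot = twice;

/* ... so a data pointer can legally point at that object. */
static data_ptr twice_entry = &twice_slot;

/* Recover the object pointer first, then call through the stored pointer. */
static int call_entry(data_ptr d, int arg)
{
    handler_fn *slot = (handler_fn *)d;
    return (*slot)(arg);
}
```

Here call_entry(twice_entry, 21) calls twice() without ever converting a function pointer to a data pointer or back, which is the conversion that ISO C does not define.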
* completed the -debug_level tracing for multithreaded runs: now all semaphore accesses can be traced using the -debug_level mtthr * Major fix for bufio+socket code: no more lockup for pavuk due to delayed reception of response data (tl_selectr() would incorrectly lock indefinitely -- which proved to be a generic coding mistake in both tl_selectr() and tl_selectw()) PLUS better error condition handling in an attempt to improve handling of all sorts of 'spurious error conditions' which may occur when your network suffers from packet loss or other undesirable effects. * -mode remind code fix for multithreaded use to make it match recurse and other modes better; not severely tested so YMMV! (The old code wouldn't work anyway, so it's an improvement anyhow). * few code cleanups (#if 0 ... #endif) * DocBook manual updated: now all return codes from pavuk are documented. * minor code fixes for SSL/SFTP. * updated configure and code to assist in compiling with both latest SpiderMonkey and older Mozilla JavaScript libraries (Win32/64 and UNIX respectively). * Some unused error checks replaced by ASSERT() and some ASSERT()s replaced by error reports as those errors /can/ happen in actual use (though seldom). * Fix for parsing malformed URLs (with multiple '#' and/or '?'): bookmarks and query string parts would not be stripped/detached correctly as the last '#'/'?' instead of the FIRST occurrence of '#'/'?' would be picked as a separation point. * Ran the gettext files through pot/pox/po again. Lots of 'fuzzies'... These need to be fixed. * EXPERIMENTAL: added preliminary code for extended JavaScript support: hooks to process HTML and CSS just like you can process embedded <SCRIPT>s now. The new hooks are still 'nulls', i.e. do not have any effect. This is a work in progress; it compiles & runs (tested on UNIX and Win32 in multithreaded mode) but the new hooks still need to be implemented.
The goal here is that all grabbed (parsable) content should be processable by custom JavaScript script functions AND when more than one URL is found, the JavaScript code should be allowed to add those extra URLs to the pavuk queue (using the new url.queue() JavaScript PavukUrl object method -- currently a 'nil' member function as it still must be fully implemented). * isatty() fixes which check for error conditions and do /not/ provide special 'console oriented' features when isatty(0) produces an error (may happen on Win32/UNIX). * Checked and updated all header files (after I ran into a cyclic dependency when changing a bit of code): no .h files will #include "config.h"; all .c files /do/ #include "config.h" as the first header. System-dependent stuff (TRUE/FALSE definitions and a few other bits) have been moved to config.h (where they belong IMO) and removed from tools.h. This is a change required for the gzip fix [SF bug #2050527]. * Preliminary fix for CSS url grabbing and rewriting bug [SF bug #2050537]. The new code will now try to keep these styles of <url> formatting in CSS intact -- this is done so as to keep particular CSS browser hacks intact as much as possible: @import "<url>" @import url(<url>) @import url('<url>') @import url("<url>") and of course the use of 'url()' elsewhere in any CSS is treated like the examples above, i.e. NONE of these should be changed regarding <url> delimiters (quotes or braces) when rewritten by pavuk. The ONLY situation where pavuk will CHANGE the quotes is when a <url> is found to contain the delimiter quote itself: in that case the quotes are changed from ' to " and vice versa. 2008 aug 18 * minor fixes to the included mime.types file * configure: added support/auto-detection for the GNU GDB extended debug output (-ggdb -g3) for when building a debug build. * NTLM: fixed code for Win64 and other 64-bit platforms which do or do not support structure packing.
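The CSS <url> delimiter rule from the [SF bug #2050537] note above (keep the original quoting, and swap ' and " only when the rewritten <url> contains the delimiter quote itself) can be sketched as a tiny decision function. css_url_quote() is a hypothetical name used only for this illustration; it says nothing about how pavuk's CSS rewriter is actually structured.

```c
#include <assert.h>
#include <string.h>

/*
 * Pick the delimiter to emit around a rewritten CSS <url>.
 * `orig` is the delimiter found in the source: '\'', '"', or 0 when the
 * <url> was unquoted, as in url(<url>).
 * css_url_quote() is a hypothetical helper, not a pavuk function.
 */
static char css_url_quote(const char *url, char orig)
{
    if (orig == '\'' && strchr(url, '\'') != NULL)
        return '"';             /* URL contains ': switch to " */
    if (orig == '"' && strchr(url, '"') != NULL)
        return '\'';            /* URL contains ": switch to ' */
    return orig;                /* otherwise leave the delimiter alone */
}
```

A URL that contains both quote characters is not covered by the rule as stated in the note, and the sketch does not try to handle it either.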
* documentation update: -[no]chunk_bug commandline argument finally documented (was in there already for a longer time; is a special fix for badly behaving IIS web servers which transmit data in 'chunked mode'). Also upgraded the documentation for the -tr_str_str/-tr_chr_chr options so one can finally read how to use [:print:] and other definitions in there for -tr_chr_chr and be able to determine up front what the bugger will do for you. For example: Why does -tr_chr_chr '[:hexnum:]' '0123456789abcdef' *not* do what you expect when the filename has any of the a..f characters? (Answer: they all become 'f' as [:hexnum:] actually expands to '0123456789ABCDEFabcdef' itself, so it is longer than the destination set and by definition any 'overflow' will be replaced by the last character in the target set.) * HTML/CSS/JavaScript parent rewriting was sometimes flaky; this has been fixed by fixing several bits of antiquated code in pavuk: now all code sections are equally aware of URL_ISHTML, URL_ISSTYLE and/or URL_ISSCRIPT. Several functions have been adapted to mirror the new awareness: ext_is_html() has been enhanced and has been renamed to actually show its intended function: ext_is_parsable() -- which can be an HTML, CSS *or* JavaScript file! (not only HTML can be parent of other URLs and need updating ('URL parent rewriting')). [ SF bug #2050537 ] CSS @import bad / HTML corrupted --> fixed * On SuSe10.2/AMD64 glibc6 dumped core when running pavuk in full-out '-debug -debug_level all' (the latter is implicit when you use '-debug') mode. This was caused by glibc's printf() functions *sensibly* executing a strlen() operation on the data fed to one of several '%.*s' printf() formatting parameters, while those data series had NOT been NUL terminated. This would happen when debugging pavuk while fetching data from a gzip-enabled web server: the gzip/inflate code would NOT append a new NUL sentinel.
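The class of bug described above (debug output reading past a buffer that carries a length but no NUL sentinel) can be avoided by copying a bounded, terminated snippet before it reaches printf(). The following is a minimal sketch only; debug_snippet() is a hypothetical helper and is not the enhanced asciidump function that pavuk uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most outsz-1 bytes of a length-delimited buffer into `out`,
 * replacing non-printable bytes with '.', and always NUL-terminate.
 * Returns the number of bytes written, not counting the terminator.
 * debug_snippet() is a hypothetical helper, not a pavuk function.
 */
static size_t debug_snippet(const char *data, size_t len,
                            char *out, size_t outsz)
{
    size_t i, n;

    assert(out != NULL && outsz > 0);
    n = (len < outsz - 1) ? len : outsz - 1;   /* cap the amount shown */

    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)data[i];
        out[i] = isprint(c) ? (char)c : '.';
    }
    out[n] = '\0';                             /* safe for '%s' now */
    return n;
}
```

Because the output buffer has a fixed size, the same helper also limits how much downloaded content a single debug line can contain.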
* Several other '%.*s' and '%s' related core dump spots in the DEBUG_XYZ() code which would dump downloaded content have been fixed by feeding the data through an enhanced asciidump function -- which will switch to HEX dumping when the content to be shown for scrutiny contains a large amount of non-ASCII data (> 10% is the current heuristic to switch over). * glibc6 on SuSe10.2/AMD64 would also dump core when being fed a 110K string to a printf '%s' statement. This has been fixed by always limiting the amount of content to be displayed when debug-printing downloaded data (various '-debug_level's) * gzip/inflate would fail to perform on 'non-parsable' content, i.e. plain text files downloaded from a gzip-enabled web server. This has been fixed. CAVEAT: The current gzip/inflate code does not deliver when it is fed very large files. Hence, when downloading VMware images and/or multi-GB ISO files, a workaround is to specify -noEnc. This will be fixed at a later date. [SF bug #2050527] nonparsed files saved in (wrong) compressed when using HTTP --> fixed * Parent rewriting would try to treat all parents as HTML, which is VERY wrong when the actual parent is a CSS stylesheet or a JavaScript script file. Fixed. * unified variable names for 'struct doc' variables: it is *QUITE* irritating to lose your display of 'docu' contents just because this call uses 'docp' for the same (or 'html_doc') while trying to track down lurking parent rewriting and file URL parsing bugs. Updated all sourcefiles to the use of varname 'docu' for the current document. 'docp' and 'html_doc' have been renamed. * two bugfixes for the tr() code: (1) when using X-Y character ranges, the size estimator would allocate far too little space. This has been fixed. (2) the documentation says it well: you cannot include a NUL in a tr() character set. In one case (a range at the start of the spec like this: '-z') the code would actually attempt to insert such a NUL anyhow, causing a subtle bug. Fixed.
And a minor code cleanup. * fixed argument quoting for external app invocation, which is particularly important for Windows machines: they treat '-quoting quite differently from "-quoting. Fixed by using "-quotes instead of the original '-quotes. * -enable_js is now turned ON by default - just like the documentation already said. KNOWN ISSUE: empty lines in JavaScript code and files get stripped by pavuk on rewriting; this will be fixed at a later date. * fix in mime.types file for CVS file extension + added mime types for Microsoft Office 2007 * fixed heap corruption in ainterface.c when calling append_starting_url() when url has been specified in the extended '-request' format, including a predefined local filename. (Would dump core on some systems.) * moved the url2diag and info2diag functions from recurse.c to where they should have been: url.c -- to resolve a cyclic dependency. * fixed up the '-request' format url parser/decoder url_parse() call: several types of input specification error would be silently rejected (now pavuk prints a suitable error message to tell the user what [s]he did wrong and what was expected) + a few tugs & tweaks to fix behavior for parsing extended URL specifications (including cookies, predefined local filenames, etc.) and an extra '-debug' (level: URL) line to help you diagnose how the '-request's have been parsed/decoded. * now you can use the extended '-request' URL format anywhere on the commandline and/or your pavuk configuration files -- as long as you keep it within quotes on the commandline of course, e.g. pavuk "URL:http://example.com/ LFNAME:example.html" * fix: config files generated by pavuk now properly select the 'short format' (URL:....) instead of the 'long url spec format' (Request:....): previously pavuk would lose information about web forms, cookies, local filenames, etc. for some types of requested url.
* quickfix for issue reported on the mailing list regarding JavaScript interface functions causing the build to fail - which happened when no JavaScript library could be found. NOTE: on Linux, the JS libraries and headerfiles seem to get installed in various places. The current ./configure script looks for the jsapi.h header file in the directory /usr/include/js unless you specify the '--with-js-includes=<dir>' option when running ./configure. The same goes for the js library itself: the current configure script looks for either libjs or libmozjs in any of these directories: /usr/lib64/thunderbird /usr/lib64/firefox /usr/lib64 /usr/lib/thunderbird /usr/lib/firefox /usr/lib unless you specify the ./configure --with-js-libraries=<dir> option to point to your specific libjs.a / libmozjs.a * added an advanced example of use to the pavuk DocBook documentation which will end up in the manpage (where it's a bit too much, but then at least the users have an extended example of actual use) -- example shows how to grab the up-to-date content from a MediaWiki-based web site. * added S/M/H/D unit support for the time argument decoder function * Updated the manual regarding: - all missing 'hammer mode' options - the missing -rtimeout and -wtimeout options - checked first few options in options.h and made sure those were all documented. (This is a work in progress...) * All timeouts are now in milliseconds, except the -max_time one, which is in minutes. All timeout arguments (except -max_time) now recognize the alternative units for specifying time: s/m/h/d/S/M/H/D: second, minute, hour, day. When no unit has been specified, the unit 'milliseconds' is assumed. * Fix for bug report #2158794: now all DEBUG_*() functions are called using the proper number of arguments. The code has been further enhanced for all printf()-like functions (such as the DEBUG_*() and x*printf() functions) to enable GCC and MSVC to check the format specification strings and parameter count and type (GCC).
This led to the discovery of a multitude of errors, which have been fixed (wrong integer sizes, etc.). * Preliminary code move to allow downloading extremely large entities (larger than 2GB) such as DVD ISO images: this has been done by more judicious use of the size_t and ssize_t types instead of simply 'int'. On 64-bit platforms, size_t/ssize_t can handle 64-bit sizes, while 'int' cannot (as GCC still uses 32-bit ints on most common hardware 64-bit architectures (Intel, ...)). Further effort will need to be spent to adapt the system (and OpenSSL) calls to enable the complete datapath for >2GB entity sizes (at least when compiled on 64-bit). * Small documentation fix: regex overview of characterset changed in DocBook source so it appears as a simple list, instead of just one long paragraph full of concatenated items --> improved readability. * const-ified the source code and fixed a few comment typos and a lurking bug in FTP (found thanks to constification): filename for directory index urls could be damaged in particular circumstances. * fixed makefiles for environments without any DocBook tools. Also fixed configure script to help detect the absence of mandatory DocBook template files. Plus added DocBook output to the distro as we cannot expect everyone to have the DocBook tools; nevertheless, everybody /should/ receive a full set of documentation. * Bugfix in GET_NUMLIST(): now original numlist is properly removed (would only be noticeable before when specifying multiple port numbers). * memleak fix for _free_httphdr(): now also the httphdr struct itself gets free()d. * Fixed lockups in debug logging code when running in '-x' GUI mode; overhauled the 'recursive invocation' detection code within, which is mandatory to prevent recursive calls to debug/log functions to blow up the stack and dump core while running in ultra verbose debug/diag mode (-debug -debug_level all). This is the second part of the fix for bug #2184196.
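The timeout unit rule from the notes above (milliseconds when no unit is given, otherwise s/m/h/d in either case) can be sketched as a small parser. parse_timeout_ms() is a hypothetical function written for this note; pavuk's own time argument decoder may differ in naming, error handling and accepted syntax, and -max_time (which is in minutes) is outside its scope.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Convert a timeout argument such as "250", "2s" or "3M" to milliseconds.
 * Returns -1 for empty, negative, malformed or overflowing input.
 * parse_timeout_ms() is a hypothetical helper, not a pavuk function.
 */
static long long parse_timeout_ms(const char *arg)
{
    char *end;
    long long value, scale;

    errno = 0;
    value = strtoll(arg, &end, 10);
    if (end == arg || errno == ERANGE || value < 0)
        return -1;

    switch (*end) {
    case '\0':          return value;      /* no unit: milliseconds */
    case 's': case 'S': scale = 1000LL; break;
    case 'm': case 'M': scale = 60LL * 1000; break;
    case 'h': case 'H': scale = 60LL * 60 * 1000; break;
    case 'd': case 'D': scale = 24LL * 60 * 60 * 1000; break;
    default:            return -1;         /* unknown unit */
    }

    /* Reject trailing garbage and results that would overflow. */
    if (end[1] != '\0' || value > LLONG_MAX / scale)
        return -1;
    return value * scale;
}
```

With this sketch "2s" yields 2000 and "3M" yields 180000, since upper and lower case units are treated alike, as the note describes.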
* Bugfix for #2023089: new code is introduced for '-lmax' depth level checks: the 'depth' (a.k.a. 'level') will always be taken from the non-inline parent URL which has the lowest level. This should fix situations where 'inline' URLs have 'inline' *parent* URLs, such as style sheets, which are referenced by non-inline URLs (HTML files). Seeking out the lowest level non-inline parent should also take care of situations where multiple HTML files, themselves at different levels, all (directly!) reference the same stylesheet/inline URL. * Attempt at fixing a GUI semaphore lockup, caused by LOCK_CFG_URLSTACK being used for different purposes (was a quick hack once to create a 'critical section' there) in recurse.c @ 1129. Same hack, but now we use LOCK_GHBN which should cause much less trouble there. * Bit of code cleanup. * Code review checks to see if URLT_FTPS and URLT_GOPHER are used consistently where you'd expect them. As you would URLT_HTTPS, next to URLT_HTTP. * Code review checks and fixes to prevent spurious damage to url->parent structures: now the access to this element is critical-sectioned /everywhere/ using LOCK_URL(u); existed in 95% of the places already, now all code has been checked. * Several fixes for multithreaded GTK GUI use. Most important thing which was missing: a call to gtk_threads_init(). * JavaScript: updated HTML tag/attribute tables to recognize all onXYZ=... JavaScript event attributes in HTML + added the full set of attributes to the url pattern class/object which is available in pavuk's own JavaScript extension.
