Compare two URL lists and print newly added URLs to a new file

Question

I am initially producing two files which contain lists of URLs—I will refer to them as old and new. I would like to compare the two files and if there are any URLs in the new file which are not in the old file, I would like these to be displayed in an extra_urls file.

Now, I've read some stuff about using the diff command, but from what I can tell, this also analyses the order of the information. I don't want the order to have any effect on the output. I just want the extra URLs in new printed to the extra_urls file, no matter what order they are placed in either of the other two files.

How can I do this?

Barmar · Accepted Answer · 2015-12-17 18:03:01Z

14

You can use the comm command to compare two files, and selectively show lines unique to one or the other, or the lines in common. It requires the inputs to be sorted, but you can sort them on the fly, by using process substitution.

comm -13 <(sort old.txt) <(sort new.txt)

If you're using a shell that doesn't support process substitution (it is not part of POSIX sh), it can be emulated using named pipes. An example is shown in Wikipedia.
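As a rough sketch of that named-pipe emulation (not taken from the answer itself), the following runs the same comm -13 comparison in plain sh. The sample URLs, the temporary directory, and the FIFO names old.sorted and new.sorted are assumptions made for illustration, and mktemp -d is assumed to be available on the system.

```shell
#!/bin/sh
# Sketch: comm -13 without process substitution, using named pipes (FIFOs).
# All file names and sample URLs below are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input, deliberately unsorted
printf '%s\n' 'http://example.com/b' 'http://example.com/a' > old.txt
printf '%s\n' 'http://example.com/c' 'http://example.com/a' 'http://example.com/b' > new.txt

# Each FIFO plays the role of one <(sort ...) substitution
mkfifo old.sorted new.sorted
sort old.txt > old.sorted &
sort new.txt > new.sorted &

# -1 suppresses lines unique to old, -3 suppresses lines common to both,
# leaving only the lines unique to new
comm -13 old.sorted new.sorted > extra_urls
wait
rm -f old.sorted new.sorted

cat extra_urls
```

If FIFOs are awkward on a given system, sorting each list into an ordinary temporary file and passing those two files to comm should behave the same way.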

edited Dec 17, 2015 at 18:03

answered Nov 23, 2015 at 15:53

Barmar


Concise but effective; exactly what was needed, an excellent bit of code for what I required.
– neilH, Commented Nov 23, 2015 at 16:15
Hmm, but if the input is sorted, then diff will do the same thing, right?
– justhalf, Commented Nov 24, 2015 at 6:52
diff will show all the differences. comm allows you to select whether you want to see the lines from file 1, file 2, or the ones they have in common.
– Barmar, Commented Nov 24, 2015 at 7:02
Hi Barmar, not sure you will check this, but just in case: I've moved this script onto my Synology NAS to run from there. Since running my script from the Synology, I'm now getting this error: line 60: syntax error: unexpected "("
– neilH, Commented Dec 17, 2015 at 17:58
What version of bash is it running? It may not support process substitution.
– Barmar, Commented Dec 17, 2015 at 17:59


terdon · Answer · 2015-11-24 10:13:42Z

I would just use grep:

grep -vFf old new > extra_urls

Explanation

-f : tells grep to read its search patterns from a file. In this case, old.
-v : tells grep to invert the match, to only print non-matching lines.
-F : tells grep to interpret its search patterns as strings, not regular expressions. That way, the . of the URL will be matched literally.

Combined, these make grep print any lines in new that were not in old. The order of the URLs in the file is irrelevant.
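One caveat worth illustrating: with -F alone, each pattern matches as a substring, so a new URL that merely contains an old URL is also filtered out. The sketch below, which is an addition and not part of the answer, adds the standard -x option so that a pattern has to match the whole line. The sample URLs and the extra_substring file name are assumptions for illustration. A blank line in old is also one possible cause of an empty result without -x, since an empty pattern matches every line.

```shell
#!/bin/sh
# Sketch: substring matching (-F) versus whole-line matching (-x -F).
# File contents and the temporary directory are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'http://example.com/a' 'http://example.com/b' > old
printf '%s\n' 'http://example.com/b' 'http://example.com/a/new' 'http://example.com/c' > new

# Without -x: "http://example.com/a/new" is dropped, because it contains
# the old URL "http://example.com/a" as a substring
grep -vFf old new > extra_substring

# With -x: a line is dropped only if it is identical to a line in old
grep -vxFf old new > extra_urls

cat extra_urls
```

Here extra_substring ends up with only http://example.com/c, while extra_urls also keeps http://example.com/a/new.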

Hi terdon, thanks for your input. I've just tested this and it produced a blank "extra_urls" file despite there being new URLs in the "new" file.
– neilH, Commented Nov 23, 2015 at 16:14
@bms9nmh hmm, that's odd. Please edit your question to give an example of your input files. You might also want to come into the site's chat room where we can discuss this further.
– terdon ♦, Commented Nov 23, 2015 at 16:16

glenn jackman · Answer · 2015-11-23 15:31:27Z

1

Since order is important to you, use awk

awk ' NR == FNR {old[$1]=1; next} !($1 in old) ' old new > extra
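For readers unfamiliar with the NR == FNR idiom, here is a commented variant of the same approach, offered as a sketch and not as the answer's exact code. It keys on the whole line ($0) instead of the first field and also suppresses repeated new URLs. The array name seen and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: the two-file awk idiom, spelled out with comments.
# Sample data and the array name "seen" are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'http://example.com/a' 'http://example.com/b' > old
printf '%s\n' 'http://example.com/c' 'http://example.com/a' 'http://example.com/c' 'http://example.com/d' > new

awk '
    # NR counts lines across all files, FNR restarts for each file, so
    # NR == FNR holds only while the first file (old) is being read.
    # Caveat: if old is empty, this also holds for new and nothing prints.
    NR == FNR { seen[$0] = 1; next }

    # Second file (new): print a line the first time it appears, provided
    # it was not in old; recording it avoids printing duplicates
    !($0 in seen) { print; seen[$0] = 1 }
' old new > extra

cat extra
```

The output keeps the order in which the URLs first appear in new; neither input has to be sorted.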

answered Nov 23, 2015 at 15:31

glenn jackman


Hi glenn, just to clarify, order isn't important. The order of the URLs isn't an issue, just the difference between the two files, i.e. the additional URLs. I don't want the difference in order to affect the output in any way.
– neilH, Commented Nov 23, 2015 at 15:37
@bms9nmh: you could just change > extra to | sort > extra, or to | sort -u > extra if you only want a new URL to appear in the output once, regardless of how many times it's in the input. The input order is liable to affect the output order unless you do extra work somewhere along the way to prevent it.
– Steve Jessop, Commented Nov 23, 2015 at 22:07
@steve, meh, comm is the best answer for this question, although grep -Fvf is good too
– glenn jackman, Commented Nov 24, 2015 at 0:37


Volker Siegel · Answer · 2015-11-27 19:23:20Z

I have an application called meld. It displays two (or three) files side by side, shows the differences, and allows selectively copying text from one to the other or deleting characters.

Meld can be installed from a terminal with

sudo apt-get install meld

score 0 · Answer · 2021-03-11 18:07:45Z

Here's a more general solution that can find and compare URLs in text files containing not just URLs:

#!/bin/sh

# diffl.sh
# DIFF with Links - a "diff utility"-like .sh script
# (dash, bash, zsh compatible) that can find missing
# web links in one file compared to a group of files

# Please note that: for simplicity, in this script, only
# URLs containing "://" are taken into consideration,
# although there can be URLs that do not contain it
# (such as mailto:[email protected])

GetOS () {
    OS_kernel_name=$(uname -s)

    case "$OS_kernel_name" in
        "Linux")
            eval $1="Linux"
        ;;
        "Darwin")
            eval $1="Mac"
        ;;
        "CYGWIN"*|"MSYS"*|"MINGW"*)
            eval $1="Windows"
        ;;
        "")
            eval $1="unknown"
        ;;
        *)
            eval $1="other"
        ;;
    esac
}

DetectShell () {
    eval $1=\"\";
    if [ -n "$BASH_VERSION" ]; then
        eval $1=\"bash\";
    elif [ -n "$ZSH_VERSION" ]; then
        eval $1=\"zsh\";
    elif [ "$PS1" = '$ ' ]; then
        eval $1=\"dash\";
    else
        eval $1=\"undetermined\";
    fi
}

PrintInTitle () {
    printf "\033]0;%s\007" "$1"
}

PrintJustInTitle () {
    PrintInTitle "$1">/dev/tty
}

trap1 () {
    CleanUp
    printf "\nAborted.\n">/dev/tty
}

CleanUp () {
    #Restore "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
    trap - INT
    trap - TSTP

    #Clear the title:
    PrintJustInTitle ""

    #Restore initial IFS:
    #IFS=$old_IFS
    unset IFS
}

DisplayHelp () {
    printf "\n"
    printf "diffl - DIFF by URL web Links\n"
    printf "\n"
    printf " What it does:\n"
    printf " - compares the URL web links in the two provided files (<file1> and <file2>) and shows the missing web links that are found in one but not in the other\n"
    printf " Syntax:\n"
    printf " <caller_shell> '/path/to/diffl.sh' <file1> <file2> ... <fileN> [flags]\n"
    printf " - where:\n"
    printf " - <caller_shell> can be any of the shells: dash, bash, zsh, or any other shell compatible with the \"dash\" shell syntax\n"
    printf " - '/path/to/diffl.sh' represents the path of this script\n"
    printf " - <file1> and <file2> represent the files to be compared\n"
    printf " - if more than two files are provided as parameters (<file1>, <file2>, ..., <fileN>): the web links in <file1> are compared with all the web links in <file2>, ... <fileN>\n"
    printf " - [flags] can be:\n"
    printf " --help or -h\n"
    printf " Displays this help information\n"
    printf " Output:\n"
    printf " - lines starting with '<' signify web links from <file1>\n"
    printf " - lines starting with '>' signify web links from <file2>, ..., <fileN>\n"
    printf " Notes:\n"
    printf " - for simplicity, in this script, only URLs containing \"://\" are taken into consideration, although there can be URLs that do not contain it (such as mailto:[email protected])\n"
    printf "\n"
}

GetOS OS

#################################################################################
## Uncomment the next line if your OS is not Linux or Mac (and eventually ##
## modify the commands used (sed, sort, uniq) according to your system): ##
#################################################################################
#OS="userdefined"

DetectShell current_shell

if [ "$current_shell" = "undetermined" ]; then
    printf "\nWarning: This script was designed to work with dash, bash and zsh shells.\n\n">/dev/tty
fi

#Get the program parameters into the array "params":
params_count=0
for i; do
    params_count=$((params_count+1))
    eval params_$params_count=\"\$i\"
done
params_0=$((params_count))

if [ "$params_0" = "0" ]; then #if no parameters are provided: display help
    DisplayHelp
    CleanUp && exit 0
fi

#Create a flags array. A flag denotes special parameters:
help_flag="0"
i=1; j=0;
while [ "$i" -le "$((params_0))" ]; do
    eval params_i=\"\$\{params_$i\}\"
    case "${params_i}" in
        "--help" | "-h" )
            help_flag="1"
        ;;
        * )
            j=$((j+1))
            eval selected_params_$j=\"\$params_i\"
        ;;
    esac

    i=$((i+1))
done

selected_params_0=$j
#Rebuild params array:
for i in $(seq 1 $selected_params_0); do
    eval params_$i=\"\$\{selected_params_$i\}\"
done
params_0=$selected_params_0

if [ "$help_flag" = "1" ]; then
    DisplayHelp
else
    #Run program:
    NL=$(printf '%s' "\n\n"); #final NewLine is deleted
    #or use:
    #NL=$'\n'

    error1="false"
    error2="false"
    error3="false"

    { sed --help >/dev/null 2>/dev/null; } || { error1="true"; }
    { sort --help >/dev/null 2>/dev/null; } || { error2="true"; }
    { uniq --help >/dev/null 2>/dev/null; } || { error3="true"; }
    if [ "$error1" = "true" -o "$error2" = "true" -o "$error3" = "true" ]; then
        {
            printf "\n"
            if [ "$error1" = "true" ]; then printf '%s' "ERROR: Could not run \"sed\" (necessary in order for this script to function correctly)!"; fi
            if [ "$error2" = "true" ]; then printf '%s' "ERROR: Could not run \"sort\" (necessary in order for this script to function correctly)"; fi
            if [ "$error3" = "true" ]; then printf '%s' "ERROR: Could not run \"uniq\" (necessary in order for this script to function correctly)"; fi
            printf "\n"
        }>/dev/stderr
        exit
    fi

    if [ "$OS" = "Linux" -o "$OS" = "Mac" -o "$OS" = "userdefined" ]; then
        # command1: sed -E 's/([a-zA-Z]*\:\/\/)/\\${NL}\1/g'
        sed_command1='sed -E '"'"'s/([a-zA-Z]*\:\/\/)/'"\\${NL}"'\1/g'"'";
        # command2: sed -n 's/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ \t]*\).*/\4\5\7/p'
        sed_command2='sed -n '"'"'s/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ \t]*\).*/\4\5\7/p'"'"
        # command3: sed -E 's/(.) [0-9]* (.*)/\1 \2/g'
        sed_command3='sed -E '"'"'s/(.) [0-9]* (.*)/\1 \2/g'"'";
        # command4: sed -E 's/^1/>/g;s/^0/</g'
        sed_command4='sed -E '"'"'s/^1/>/g;s/^0/</g'"'"
    else
        printf '\n%s\n\n' "Error: Unsupported OS!">/dev/stderr
        exit 1
    fi

    #Get the program parameters into the array "files":
    count=0
    for i; do
        count=$((count+1))
        eval files_$count=\"\$i\"
    done
    files_0=$((count))

    error="false"
    if [ "$files_0" -lt "2" ]; then
        printf '\n%s\n' "ERROR: Please provide at least two parameters!">/dev/stderr
        error="true"
    fi

    if [ "$error" = "true" ]; then
        printf "\n"
        exit 1
    fi

    error="false"
    for i in $(seq 1 $files_0); do
        eval current_file=\"\$files_$i\"
        if [ ! \( -e "$current_file" -a -f "$current_file" \) ]; then
            printf '\n%s\n' "ERROR: File \"$current_file\" does not exist or is not a regular file!">/dev/stderr
            error="true"
        fi
    done

    if [ "$error" = "true" ]; then
        printf "\n"
        exit 1
    fi

    #Proceed to finding and comparing links:

    #Trap "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
    trap 'trap1' INT
    trap 'trap1' TSTP

    old_IFS="$IFS" #Store initial IFS value
    #IFS is set to a single newline so that the loops below split on lines:
    IFS="
"

    {
        PrintJustInTitle "Searching for links [1]..."
        mask="00000000000000000000"
        {
            count=0
            for link in $(\
                cat "$files_1" |\
                eval $sed_command1 |\
                eval $sed_command2\
            ); do
                count_prev=$count
                count=$((count+1))
                if [ "${#count_prev}" -lt "${#count}" ]; then
                    mask="${mask%?}"
                fi
                number="$mask$count"
                printf '%s\n' "0 $number $link"

                PrintJustInTitle "Links found [1]: $((count))..."
            done;
            PrintJustInTitle "Sorting results [1]..."
        }|sort -u -k 3

        PrintJustInTitle "Searching for links [2]..."
        mask="00000000000000000000"
        {
            count=0
            for i in $(seq 2 $files_0); do
                eval current_file=\"\$files_$i\"
                for link in $(\
                    cat "$current_file" |\
                    eval $sed_command1 |\
                    eval $sed_command2\
                ); do
                    count_prev=$count
                    count=$((count+1))
                    if [ "${#count_prev}" -lt "${#count}" ]; then
                        mask="${mask%?}"
                    fi
                    number="$mask$count"
                    printf '%s\n' "1 $number $link"

                    PrintJustInTitle "Links found [2]: $((count))..."
                done
            done
            PrintJustInTitle "Sorting results [2]..."
        }|sort -u -k 3

        PrintJustInTitle "Searching for unique links [3]..."
    }|{\
        sort -k 3|uniq -u -f 2|sort|eval $sed_command3|eval $sed_command4
        PrintJustInTitle "Done";
    }

    CleanUp
fi

Syntax:
- <caller_shell> '/path/to/diffl.sh' <file1> <file2> ... <fileN>
What it does:
- this will show the URL web links that <file1> and the group of files <file2>, ..., <fileN> don't have in common
Notes:
- for simplicity, in this script, only URLs containing "://" are taken into consideration

(1) This appears to be similar enough to your other answer that many of my comments there probably apply here, as well. (2) When you post an answer that’s derived from somebody else’s work, you should say so.
– G-Man Says 'Reinstate Monica', Commented Mar 13, 2021 at 2:06
This script compares only the web links inside files, whereas the other answer compares date modified, size, and path of files; these are different things (and by the way, I am the author of unix.stackexchange.com/questions/59336/… )
– user456983, Commented Mar 15, 2021 at 13:16
