How to remove spaces between tags in a delimited file?

Question

I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted spaces in columns where HTML text is stored. For example:

 "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:

awk '{$1=$1;printf $0}'

And this kind of works, but it mashes everything into one line, which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.

Thanks @KarlT. I tried print, but unfortunately I get the same content as in my original CSV. Only printf appears to remove the unwanted space.
– user3723380, Commented Feb 16, 2022 at 12:25
Can you define which spaces should be removed? I suppose you need to keep spaces between speech marks. Are there spaces between words to keep? Is it only spaces before and after tags that should be removed, or should " <" or "< " be replaced with "<", and "> " or " >" with ">"? I'm imagining a regex replace.
– user18098820, Commented Feb 16, 2022 at 12:26
From my example, it would only be the space between " and <div that would need to go. Anything else that's encapsulated between double quotes should stay as is so as to keep the column structure. So if it's something that fits within the double quotes, it stays as is, but just the space before the content starts needs to go. I hope this makes sense.
– user3723380, Commented Feb 16, 2022 at 12:30
Something like sed -Ez 's/"\s+<div/"<div/g' file (with GNU sed)? Or, if you can use perl, perl -0777pe 's/"\K\s+(?=<div)//g' file?
– Wiktor Stribiżew, Commented Feb 16, 2022 at 12:31
I just noticed there are worse things than some extra white space in your input if it's supposed to be CSV. The last field (i.e. the part after the final ,) in your CSV is "<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">" - note that some of the "s in that field aren't doubled/escaped, e.g. en"dir, so that's not valid CSV by any standard. So removing the white space you're asking about in this question may not be all you need to do to get valid output, and fixing whatever is generating this input is a better (maybe the only) choice.
– Ed Morton, Commented Feb 16, 2022 at 14:00

Ed Morton · Accepted Answer · 2022-02-16 13:50:56Z

The following uses GNU awk for a multi-char RS, RT, and gensub(). It will work even if your input file is huge because it doesn't read the whole file into memory; it reads one string at a time, where the strings are separated by either "<spaces>< or a newline:

$ awk -v RS='"\\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

I'm assuming that when you say "and possibly others" in your question you mean other cases like "<spaces><div>, where there's a " then spaces then a tag starting with <, but that's obviously just a guess.

RavinderSingh13 · Accepted Answer · 2022-02-16 13:51:49Z

With your shown samples only, please try the following awk code, written and tested in GNU awk. A simple explanation: RS (the record separator) is set to the empty string, which puts awk in paragraph mode so that a record can span several lines. The main program then globally replaces one or more newlines followed by whitespace followed by <div with just <div, and the trailing 1 is the idiomatic awk way to print each record.

awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file

sseLtaH · Accepted Answer · 2022-02-16 13:54:16Z

Assuming your request is to remove the space before the start of the <div tag, you can try this GNU sed command:

$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

Wiktor Stribiżew · Accepted Answer · 2022-02-16 12:46:26Z

You can do this with perl:

perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file

Details

-0777 - slurps the whole file into a single string so that the pattern can match across line breaks
-i - edits the file in place
"\K\s+(?=<div) - matches a " char that is dropped from the match value with \K, then one or more whitespace chars are consumed (with \s+), and <div must follow immediately; the match is replaced with an empty string
g - replaces all occurrences.

You can achieve the same with a GNU sed:

sed -i -Ez 's/"\s+<div/"<div/g' file

where -i enables in-place file replacement, -E enables the POSIX ERE regex syntax, and -z makes sed treat the input as NUL-separated, so a text file without NUL bytes is pulled into the pattern space as a whole, where line breaks are "visible" to the regex pattern.
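
As a rough illustration of why -z matters here, the sketch below runs the same substitution with and without it on a made-up sample where the unwanted whitespace spans a line break. It assumes GNU sed; the sample text and the variable names are hypothetical:

```shell
# Hypothetical sample: the whitespace to remove includes a newline.
sample='"x","<div>"
  <div lang=""en"">"'

# Without -z, sed works line by line: the quote and the <div sit on
# different lines, so the pattern never matches and nothing changes.
without_z=$(printf '%s\n' "$sample" | sed -E 's/"\s+<div/"<div/g')

# With -z, the whole input is one record, so \s+ can span the newline.
with_z=$(printf '%s\n' "$sample" | sed -Ez 's/"\s+<div/"<div/g')

printf 'without -z:\n%s\n\nwith -z:\n%s\n' "$without_z" "$with_z"
```

The first result is identical to the input, while the second has the two lines joined at the quote.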
