2

I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text is stored. For example:

 "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">" 

This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:

awk '{$1=$1;printf $0}' 

And this kind of works, but it mashes everything into one line which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.

6
  • Thanks @KarlT. I tried print, but unfortunately I get the same content as in my original CSV. Only printf appears to remove the unwanted space. Commented Feb 16, 2022 at 12:25
  • Can you define which spaces should be removed? I suppose you need to keep spaces between speech marks. Are there spaces between words to keep? Is it only spaces before and after tags that should be removed, or should " <" or "< " be replaced with "<" and "> " or " >" with ">"? I'm imagining a regex replace. Commented Feb 16, 2022 at 12:26
  • From my example, it would only be the space between " and <div that would need to go. Anything else that's encapsulated between double quotes should stay as is so as to keep the column structure. So if it's something that fits within the double quotes, it stays as is, but just the space before the content starts needs to go. I hope this makes sense. Commented Feb 16, 2022 at 12:30
  • Something like sed -Ez 's/"\s+<div/"<div/g' file (with GNU sed)? Or if you can use perl, perl -0777pe 's/"\K\s+(?=<div)//g' file? Commented Feb 16, 2022 at 12:31
  • I just noticed there are worse problems than some extra white space in your input if it's supposed to be CSV. The last field (i.e. the part after the final ,) in your CSV is "<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">" - note that some of the "s in that field aren't doubled/escaped, e.g. en"dir, so that's not valid CSV by any standard. So removing the white space you're asking about in this question may not be all you need to do to get valid output, and fixing whatever is generating this input is a better (maybe the only) choice. Commented Feb 16, 2022 at 14:00

4 Answers

2

The following uses GNU awk for a multi-character RS, RT, and gensub(). It works even if your input file is huge because it doesn't read the whole file into memory; it reads one string at a time, each terminated by either "<spaces>< or a newline:

$ awk -v RS='"\\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

I'm assuming that by "and possibly others" in your question you mean other cases like "<spaces><div>, where there's a " then spaces then a tag starting with <, but that's obviously just a guess.


2

Given only the samples shown, please try the following awk code, written and tested in GNU awk. It sets RS (the record separator) to the empty string, which puts awk in paragraph mode, then in the main block globally replaces one or more newlines followed by whitespace and <div with just <div, and finally prints each record using the idiomatic 1 pattern.

awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file 
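
As a quick check of this approach, here is a minimal sketch that runs it on a made-up three-line sample (the sample.csv name and its contents are assumptions for illustration, not the real dump). One thing to keep in mind: in paragraph mode awk treats blank lines as record separators, so blank lines in the input would not be reproduced exactly in the output.

```shell
# Hypothetical sample: the HTML field of row 1 continues on an indented second line.
tmp=$(mktemp -d)
printf '%s\n' '"1","<div id=""a"">"' '   <div id=""b"">"' '"2","plain"' > "$tmp/sample.csv"

# Paragraph mode (RS="") makes the newline part of the record, so gsub can match it.
joined=$(awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' "$tmp/sample.csv")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

The line break between the two rows is preserved; only the newline and indentation directly in front of <div are removed.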


1

Assuming your request is to remove the space before the start of the <div tag, you can try this GNU sed command:

$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"


0

You can do this with perl:

perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file 

Details

  • -0777 slurps the file into a single string so that the pattern can match across line breaks
  • -i turns on in-place editing of the file
  • "\K\s+(?=<div) matches a " char, which \K then drops from the match value; \s+ consumes one or more whitespace chars, and the lookahead (?=<div) requires <div to follow immediately. The match is replaced with an empty string
  • g replaces all occurrences.

You can achieve the same with a GNU sed:

sed -i -Ez 's/"\s+<div/"<div/g' file 

where -i enables in-place file replacement, -E enables the POSIX ERE regex syntax, and -z makes sed split the input on NUL characters instead of newlines, so a text file without NUL bytes is pulled into the pattern space whole, where line breaks are "visible" to the regex pattern.
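
Because -i overwrites the file, a cautious variant is to give -i a suffix so that GNU sed keeps a copy of the original. The sketch below assumes GNU sed and uses a made-up sample.csv; the .bak suffix is an arbitrary choice.

```shell
# Hypothetical sample: one row whose HTML field is split across two lines.
tmp=$(mktemp -d)
printf '%s\n' '"1","<div id=""a"">"' '  <div id=""b"">"' > "$tmp/sample.csv"

# -i.bak edits sample.csv in place and saves the untouched original as sample.csv.bak.
sed -i.bak -Ez 's/"\s+<div/"<div/g' "$tmp/sample.csv"

fixed=$(cat "$tmp/sample.csv")
backup=$(cat "$tmp/sample.csv.bak")
printf 'fixed:\n%s\nbackup:\n%s\n' "$fixed" "$backup"

rm -rf "$tmp"
```

If the result is not what you expected, the .bak file can simply be moved back over the edited one.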
