I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text is stored. For example:

"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

This is one of about 30K rows, so I am trying to find a smart way to remove the space between the " and <div (and possibly other tags) here. I tried:
awk '{$1=$1;printf $0}'

This kind of works, but it mashes everything into one line, which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.
sed -Ez 's/"\s+<div/"<div/g' file (with GNU sed)? Or, if you can use perl, perl -0777pe 's/"\K\s+(?=<div)//g' file?

The last field in your CSV is "<div style=""margin:1em;"">" <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">". Note that some of the "s in that field aren't doubled/escaped, e.g. en"dir, so that's not valid CSV by any standard. So removing the whitespace you're asking about in this question may not be all you need to do to get valid output, and fixing whatever is generating this input is a better (maybe the only) choice.