Grep finds the string although the string is not visible in file

Question

I am stuck since 1day with a weird problem. I have a CSV file which I need to import into my hive table. The CSV file, however, has newline characters embedded in between the strings. As the files are huge I am not able to use a text editor to replace the '\n' character.

I wrote a python program to help me clean the file. I read each row from the CSV file and if I encounter any newline character I replace it with space. Below is my program.

# -*- coding: utf-8 -*- import csv import sys file = open("team_contacts_cleaned.csv","w") with open('team_contacts.csv') as csvfile: reader = csv.reader(csvfile) for row in reader: stripped = [col.replace('\n', '') for col in row] file.write(','.join(stripped)) file.write('\n') file.close() print 'Done'

Once I have this cleaned file I see that the line counts match as expected. and when I grep the file on the strings which I know is breaking the record the exact line is printed in the console, however, I don't see that line in the output.

Eg.

Original File

cat team_contacts.csv | grep -A4 'Yennai Nambi' ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,"Yennai Nambi Vandhavarai Yaemaatra Maattaen ; Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; Naan Uyir Vaazhndhaal Ingaedhaan ; Ooadivida Maattaen .",0,

Cleaned File

cat team_contacts_cleaned.csv | grep 'Naan Uyir Vaazhndhaal Ingaedhaan' ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,Yennai Nambi Vandhavarai Yaemaatra MaOoadivida Maattaen .,0,

it looks like the data got erased when I cat the file however the grep is able to exactly locate the string which means the string is still there but why isn't it showing up?

Now when I move this cleaned file to hive it again breaks and data shows up like this

Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; NULL NULL NULL NULL NULL NULLNULL Naan Uyir Vaazhndhaal Ingaedhaan ; NULL NULL NULL NULL NULL NULL NULL NULLNULL

What am I missing here ?

I even tried a gawk program before writing a python code I faced the same issue.

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' team_contacts.csv > team.csv

This probably isn't your issue, but… is there a reason you're using csv to read the input CSV, but not using csv to write the output CSV? — abarnert
– abarnert, Commented Apr 6, 2018 at 18:34
Meanwhile, I'm having a hard time understanding what problem you're describing. It would really help if you give us a minimal reproducible example—with sample input, expected output, and actual output—so we can reproduce it ourselves and debug it. (Follow that link above to the help page; it has a lot of useful information.) — abarnert
– abarnert, Commented Apr 6, 2018 at 18:36
No need for Python to change newlines to spaces, just do tr '\n' ' ' < team_contacts.csv > team_contacts_cleaned.csv. — agc
– agc, Commented Apr 6, 2018 at 22:48

Sridhar Iyer · Accepted Answer · 2018-04-08 21:43:45Z

I resolved this problem. when I broke the file into parts, I got this error

_csv.Error: new-line character seen in unquoted field

The following link helped me resolve it. https://stackoverflow.com/a/32827068/3538426

This can be closed. Updated the code

import csv import sys csv.field_size_limit(sys.maxsize) file = open("team_contacts_cleaned.csv","w") with open('team_contacts.csv','rU') as csvfile: reader = csv.reader(csvfile) writer = csv.writer(file,quoting=csv.QUOTE_ALL) for row in reader: stripped = [col.replace('\n', ' ') for col in row] writer.writerow(stripped) file.close() print 'Done'

@agc your solution will not work as it will remove all newlines.

Collectives™ on Stack Overflow

Grep finds the string although the string is not visible in file

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related