1

I am stuck since 1day with a weird problem. I have a CSV file which I need to import into my hive table. The CSV file, however, has newline characters embedded in between the strings. As the files are huge I am not able to use a text editor to replace the '\n' character.

I wrote a python program to help me clean the file. I read each row from the CSV file and if I encounter any newline character I replace it with space. Below is my program.

# -*- coding: utf-8 -*- import csv import sys file = open("team_contacts_cleaned.csv","w") with open('team_contacts.csv') as csvfile: reader = csv.reader(csvfile) for row in reader: stripped = [col.replace('\n', '') for col in row] file.write(','.join(stripped)) file.write('\n') file.close() print 'Done' 

Once I have this cleaned file I see that the line counts match as expected. and when I grep the file on the strings which I know is breaking the record the exact line is printed in the console, however, I don't see that line in the output.

Eg.

Original File

cat team_contacts.csv | grep -A4 'Yennai Nambi' ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,"Yennai Nambi Vandhavarai Yaemaatra Maattaen ; Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; Naan Uyir Vaazhndhaal Ingaedhaan ; Ooadivida Maattaen .",0, 

Cleaned File

cat team_contacts_cleaned.csv | grep 'Naan Uyir Vaazhndhaal Ingaedhaan' ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,Yennai Nambi Vandhavarai Yaemaatra MaOoadivida Maattaen .,0, 

it looks like the data got erased when I cat the file however the grep is able to exactly locate the string which means the string is still there but why isn't it showing up?

Now when I move this cleaned file to hive it again breaks and data shows up like this

Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; NULL NULL NULL NULL NULL NULLNULL Naan Uyir Vaazhndhaal Ingaedhaan ; NULL NULL NULL NULL NULL NULL NULL NULLNULL 

What am I missing here ?

I even tried a gawk program before writing a python code I faced the same issue.

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' team_contacts.csv > team.csv 
3
  • This probably isn't your issue, but… is there a reason you're using csv to read the input CSV, but not using csv to write the output CSV? Commented Apr 6, 2018 at 18:34
  • 1
    Meanwhile, I'm having a hard time understanding what problem you're describing. It would really help if you give us a minimal reproducible example—with sample input, expected output, and actual output—so we can reproduce it ourselves and debug it. (Follow that link above to the help page; it has a lot of useful information.) Commented Apr 6, 2018 at 18:36
  • No need for Python to change newlines to spaces, just do tr '\n' ' ' < team_contacts.csv > team_contacts_cleaned.csv. Commented Apr 6, 2018 at 22:48

1 Answer 1

0

I resolved this problem. when I broke the file into parts, I got this error

_csv.Error: new-line character seen in unquoted field

The following link helped me resolve it. https://stackoverflow.com/a/32827068/3538426

This can be closed. Updated the code

import csv import sys csv.field_size_limit(sys.maxsize) file = open("team_contacts_cleaned.csv","w") with open('team_contacts.csv','rU') as csvfile: reader = csv.reader(csvfile) writer = csv.writer(file,quoting=csv.QUOTE_ALL) for row in reader: stripped = [col.replace('\n', ' ') for col in row] writer.writerow(stripped) file.close() print 'Done' 

@agc your solution will not work as it will remove all newlines.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.