I have been stuck for a day on a weird problem. I have a CSV file that I need to import into my Hive table. The CSV file, however, has newline characters embedded inside the strings. As the files are huge, I am not able to use a text editor to replace the '\n' characters.
I wrote a Python program to help me clean the file. I read each row from the CSV file and, if I encounter any newline character, I replace it with a space. Below is my program.
    # -*- coding: utf-8 -*-
    import csv
    import sys

    file = open("team_contacts_cleaned.csv","w")
    with open('team_contacts.csv') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            stripped = [col.replace('\n', '') for col in row]
            file.write(','.join(stripped))
            file.write('\n')
    file.close()
    print 'Done'

Once I have this cleaned file, I see that the line counts match as expected. When I grep the file for the strings that I know are breaking the record, the exact line is printed in the console; however, I don't see the matched string in the output.
E.g.

Original file:

    cat team_contacts.csv | grep -A4 'Yennai Nambi'
    ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,"Yennai Nambi Vandhavarai Yaemaatra Maattaen ;
    Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ;
    Naan Uyir Vaazhndhaal Ingaedhaan ;
    Ooadivida Maattaen .",0,

Cleaned file:

    cat team_contacts_cleaned.csv | grep 'Naan Uyir Vaazhndhaal Ingaedhaan'
    ,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,Yennai Nambi Vandhavarai Yaemaatra MaOoadivida Maattaen .,0,

It looks like the data got erased when I cat the file; however, grep is able to locate the string exactly, which means the string is still there. Why isn't it showing up?
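One possible explanation for text that grep finds but the terminal does not show is a leftover carriage return ('\r'): the terminal moves the cursor back and later text overwrites earlier text. This is a guess based on the output above, not something the post confirms. A minimal sketch for checking it, assuming Python 3 and a hypothetical helper name `fields_containing`, prints matching fields with `repr()` so control characters become visible:

```python
import csv
import io


def fields_containing(lines, needle):
    # Return repr() of each CSV field that contains `needle`.
    # repr() shows control characters as escapes instead of letting
    # the terminal interpret them.
    reader = csv.reader(lines)
    return [repr(col) for row in reader for col in row if needle in col]


# Small in-memory stand-in for the real file (assumed CRLF line breaks).
sample = io.StringIO('x,"Yennai Nambi ;\r\nOoadivida Maattaen .",0\r\n', newline="")
hits = fields_containing(sample, "Yennai Nambi")
print(hits)
```

For the real file, the same function could be called with `open('team_contacts.csv', newline='')`, which keeps line endings inside quoted fields untranslated.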
Now when I move this cleaned file to Hive, it breaks again and the data shows up like this:

    Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; NULL NULL NULL NULL NULL NULLNULL
    Naan Uyir Vaazhndhaal Ingaedhaan ; NULL NULL NULL NULL NULL NULL NULL NULLNULL

What am I missing here?
I even tried a gawk program before writing the Python code, and I faced the same issue.

    gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' team_contacts.csv > team.csv
Why are you using csv to read the input CSV, but not using csv to write the output CSV?

tr '\n' ' ' < team_contacts.csv > team_contacts_cleaned.csv
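Writing the output through the csv module as well, and removing '\r' along with '\n', could look roughly like the sketch below. It assumes Python 3 (the program above is Python 2), and `clean_csv` and `LINE_BREAKS` are illustrative names. Note that `csv.writer` re-quotes fields that contain commas; whether Hive honours those quotes depends on the table's SerDe, which the post does not state.

```python
import csv
import io
import re

# Matches any run of carriage returns and/or newlines inside a field.
LINE_BREAKS = re.compile(r"[\r\n]+")


def clean_csv(src, dst):
    # Copy rows from src to dst, replacing embedded line breaks in each
    # field with a single space. csv.writer handles quoting of commas.
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow([LINE_BREAKS.sub(" ", col) for col in row])


# In-memory demonstration with an assumed CRLF break inside a quoted field.
src = io.StringIO('a,"x, y ;\r\nz",0\r\n', newline="")
dst = io.StringIO(newline="")
clean_csv(src, dst)
cleaned = dst.getvalue()
print(repr(cleaned))
```

With real files, both would be opened with `newline=''`, for example `open('team_contacts.csv', newline='')` for reading and `open('team_contacts_cleaned.csv', 'w', newline='')` for writing, so the csv module sees and controls the line endings itself.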