Convert CSV file from any type to UTF-8

Question

Hello, I am creating a simple console application in VB.NET to convert a file from any type to UTF-8, but I can't figure out how encoding works. I know that the source file is in Unicode, but when I convert it to a new format I get junk. Any suggestions? I am not sure if my code is correct.

This is my code:

    Imports System.IO
    Imports System.Text

    Module Module1

        Sub Main()
            Console.Write("Please give the filepath (example:c:/tesfile.csv):")
            Dim filepath As String = Console.ReadLine()
            Dim sEncoding As String = DetermineFileType(filepath)
            Dim strContents As String
            Dim strEncodedContents As String
            Dim objReader As StreamReader
            Dim ErrInfo As String
            Dim bString As Byte()
            Try
                'Read the file
                objReader = New StreamReader(filepath)
                'Read untill the end
                strContents = objReader.ReadToEnd()
                'Close The file
                objReader.Close()
                'Write Contents on DOS
                Console.WriteLine(strContents)
                Console.WriteLine("")
                bString = EncodeString(strContents, "UTF-8")
                strEncodedContents = System.Text.Encoding.UTF8.GetString(bString)
                Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
                objWriter.WriteLine(strEncodedContents)
                objWriter.Close()
                Console.WriteLine("Encoding Finished")
            Catch Ex As Exception
                ErrInfo = Ex.Message
                Console.WriteLine(ErrInfo)
            End Try
            Console.ReadKey()
        End Sub

        Public Function DetermineFileType(ByVal aFileName As String) As String
            Dim sEncoding As String = String.Empty
            Dim oSR As New StreamReader(aFileName, True)
            oSR.ReadToEnd() ' Add this line to read the file.
            sEncoding = oSR.CurrentEncoding.EncodingName
            Return sEncoding
        End Function

        Function EncodeString(ByRef SourceData As String, ByRef CharSet As String) As Byte()
            'get a byte pointer To the source data
            Dim bSourceData As Byte() = System.Text.Encoding.Unicode.GetBytes(SourceData)
            'get destination encoding
            Dim OutEncoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(CharSet)
            'Encode the data To destination code page/charset
            Return System.Text.Encoding.Convert(OutEncoding, System.Text.Encoding.UTF8, bSourceData)
        End Function

    End Module

Unicode is a specification, not an encoding. What encoding do your source files use? UTF-8? UTF-16? UCS-2? ...
– fge, Commented Dec 20, 2011 at 13:59
UTF-8 is Unicode as well :-) I take it then the input file is UTF-16?
– Callie J, Commented Dec 20, 2011 at 14:02
I am confused now :S Unicode is a specification. UTF-8 is an encoding, but UTF-8 is also Unicode :S I mixed everything up now.
– themhz, Commented Dec 20, 2011 at 14:09
I have a CSV file. When I open it in Notepad++ and check the encoding, "Encode in UCS-2 Little Endian" is checked, and I need to convert this file to UTF-8. The thing is that when I print to the console everything seems OK, but when I write to the file it mixes up the characters.
– themhz, Commented Dec 20, 2011 at 14:13

David Waters · Accepted Answer · 2011-12-21 12:06:56Z

StreamReader has a constructor that takes an Encoding. If you know the encoding of the file, you should pass it into that constructor:

objReader = New StreamReader(filepath, Encoding.UTF32)

EDIT

You say in a comment that the file is encoded as UCS-2. From Wikipedia:

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.

In which case you can try to decode using UTF-16, which is called Unicode within System.Text.Encoding, so try:

objReader = New StreamReader(filepath, Encoding.Unicode)
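As a rough sketch of how that reader could fit into the whole conversion, the following decodes the file as UTF-16 little-endian and writes the same text back out as UTF-8. The `ConvertToUtf8` helper and the module name are made up for illustration, and the sketch assumes the source really is UTF-16 LE and that its name ends in `.csv` (as in the question's code).

```vbnet
Imports System.IO
Imports System.Text

Module ConvertSketch

    ' Hypothetical helper: re-encodes a UTF-16 LE text file as UTF-8.
    Sub ConvertToUtf8(ByVal sourcePath As String, ByVal targetPath As String)
        Dim contents As String

        ' Decode the bytes as UTF-16 LE (named "Unicode" in System.Text.Encoding).
        Using reader As New StreamReader(sourcePath, Encoding.Unicode)
            contents = reader.ReadToEnd()
        End Using

        ' Encode the same text as UTF-8. False means overwrite rather than append.
        Using writer As New StreamWriter(targetPath, False, Encoding.UTF8)
            writer.Write(contents)
        End Using
    End Sub

    Sub Main()
        Console.Write("Please give the filepath: ")
        Dim filepath As String = Console.ReadLine()

        ' Assumes the name ends in ".csv"; otherwise source and target would be the same file.
        ConvertToUtf8(filepath, filepath.Replace(".csv", "_encoded.csv"))
        Console.WriteLine("Encoding Finished")
    End Sub

End Module
```

Once the text has been decoded into a String, no separate byte-level conversion step is needed, because the writer's encoding decides the output bytes. A StreamWriter given `Encoding.UTF8` starts the file with a byte order mark; passing `New UTF8Encoding(False)` instead should omit it if the consumer of the CSV does not expect one.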

FYI, Unicode is a standard which has a variety of encodings, including:

UTF-8
UTF-16 (BigEndian)
UTF-16 (LittleEndian)
UTF-32 (BigEndian)
UTF-32 (LittleEndian)

For Microsoft to call UTF-16 "Unicode" is a little misleading but not inaccurate: UTF-16 is one possible encoding for Unicode.
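To make the difference between these encodings concrete, here is a small sketch that encodes the same two-character string with several of them and prints the resulting bytes. The sample string and the module name are arbitrary choices for illustration.

```vbnet
Imports System.Text

Module EncodingBytesSketch

    Sub Main()
        ' "A" followed by U+00E9 (e with acute accent), built with ChrW so the
        ' source file's own encoding does not matter.
        Dim sample As String = "A" & ChrW(&HE9)

        ' GetBytes returns the encoded characters only, without a byte order mark.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(sample)))
        ' prints 41-C3-A9
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(sample)))
        ' prints 41-00-E9-00 (UTF-16 little-endian)
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(sample)))
        ' prints 00-41-00-E9 (UTF-16 big-endian)
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(sample)))
        ' prints 41-00-00-00-E9-00-00-00 (UTF-32 little-endian)
    End Sub

End Module
```

The characters are the same in every case; only the byte representation changes, which is why the reader has to be told which encoding produced the bytes.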

Hans Passant · Accepted Answer · 2011-12-20 14:12:50Z

StreamReader already assumes UTF-8 encoding if you don't specify one in the constructor call, so re-encoding to UTF-8 cannot solve your problem. Use the StreamReader(String, Encoding) overload and specify the encoding that was used when the file was created. If you have no clue what it might be, then Encoding.Default is usually the best guess. Talk to the programmer who wrote the code for the .csv file creator to be sure. When you get it right, you don't need this code anymore either.
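For the case where the encoding is not known up front, one possible approach is sketched below: let StreamReader honor a byte order mark if the file has one, and fall back to Encoding.Default otherwise. The `ReadWithFallback` helper is hypothetical, and this only detects encodings that announce themselves with a byte order mark. It is not a general encoding detector.

```vbnet
Imports System.IO
Imports System.Text

Module DetectSketch

    ' Hypothetical helper: decodes a file using its byte order mark when present,
    ' otherwise using Encoding.Default. Reports the encoding actually used.
    Function ReadWithFallback(ByVal fileName As String, ByRef usedEncoding As Encoding) As String
        ' Third argument True = detect the encoding from byte order marks.
        Using reader As New StreamReader(fileName, Encoding.Default, True)
            Dim contents As String = reader.ReadToEnd()
            ' CurrentEncoding is only meaningful after the first read.
            usedEncoding = reader.CurrentEncoding
            Return contents
        End Using
    End Function

    Sub Main()
        Console.Write("Please give the filepath: ")
        Dim fileName As String = Console.ReadLine()

        Dim usedEncoding As Encoding = Nothing
        Dim contents As String = ReadWithFallback(fileName, usedEncoding)
        Console.WriteLine("Decoded as: " & usedEncoding.EncodingName)
        Console.WriteLine("Characters read: " & contents.Length)
    End Sub

End Module
```

If the file has no byte order mark and was not written in the system default code page, the result will still be wrong, which is why asking whoever produced the file remains the reliable option.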

This is what I am doing now:

    objReader = New StreamReader(filepath, Encoding.UTF8)
    strContents = objReader.ReadToEnd()
    'Close The file
    objReader.Close()
    'Write Contents on DOS
    Console.WriteLine(strContents)
    Console.WriteLine("")
    Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
    objWriter.WriteLine(strContents)
    objWriter.Close()
    Console.WriteLine("Encoding Finished")

but I still get junk.

You already know that the file wasn't encoded in UTF-8. So don't use Encoding.UTF8 in the constructor call.
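The effect of picking the wrong decoder can be reproduced in memory without any files. This is only an illustrative sketch with made-up names: it encodes a short string as UTF-16 LE, then decodes those bytes once as UTF-8 and once as UTF-16 LE.

```vbnet
Imports System.Text

Module WrongDecoderSketch

    Sub Main()
        ' "A" followed by U+00E9, built with ChrW so the source encoding is irrelevant.
        Dim original As String = "A" & ChrW(&HE9)

        ' These are the bytes a UTF-16 LE file would contain for that text.
        Dim utf16Bytes As Byte() = Encoding.Unicode.GetBytes(original)

        ' Decoding with the wrong encoding gives different characters (junk).
        Dim decodedAsUtf8 As String = Encoding.UTF8.GetString(utf16Bytes)
        ' Decoding with the encoding that produced the bytes restores the text.
        Dim decodedAsUtf16 As String = Encoding.Unicode.GetString(utf16Bytes)

        Console.WriteLine(String.Equals(decodedAsUtf8, original))   ' prints False
        Console.WriteLine(String.Equals(decodedAsUtf16, original))  ' prints True
    End Sub

End Module
```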
