added note to the Merge function

edited May 4, 2018 at 12:15

I am dealing with similar kinds of data, although in another format. Efficient code and parallelization really are key to acceptable performance. I will try to share my parallelized generic low-level importer when it is done. For the time being, some suggestions:

Importing

Do ReadList in chunks of lines and process those chunks. Use Sow and Reap for building the main list of data.

    Reap[
      While[
        (* read the next chunk of up to 1000 lines; an empty list signals end of file *)
        (lines = ReadList[stream, "String", 1000, NullRecords -> True]) =!= {},
        (* line splitting and processing goes here *)
        StringCases[lines, {___ ~~ "Id=\"" ~~ ___, ...}];
        (* Sow the result for each line *)
        Sow[...]
      ]
    ]

Chunk processing makes parallelization easier to implement later. While it is certainly possible, parallelization is not as straightforward as it might seem.

Sow and Reap are generally very efficient for constructing lists.
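
To make the chunked loop concrete, here is a minimal sketch under some assumptions of mine: the sample text, the chunk size of 2, and the digit-only Id pattern are invented for illustration, and StringToStream stands in for OpenRead on a real file. It uses the symbol String, the documented type name for ReadList.

```wolfram
(* Hypothetical sample data; a real import would use OpenRead["file"] instead *)
stream = StringToStream[
   "row Id=\"1\"\nrow Id=\"22\"\nno id here\nrow Id=\"333\""];
chunkSize = 2; (* small on purpose, so that several chunks are read *)

reaped = Reap[
    While[
     (lines = ReadList[stream, String, chunkSize, NullRecords -> True]) =!= {},
     (* one list of Id values per line; lines without an Id give {} *)
     ids = StringCases[lines,
       "Id=\"" ~~ id : (DigitCharacter ..) ~~ "\"" :> id];
     (* Sow every extracted value individually *)
     Scan[Sow, Flatten[ids]]
     ]
    ][[2]];
Close[stream];

(* Reap wraps the sown values in one extra list; it returns {} if nothing was sown *)
data = Flatten[reaped, 1]
(* {"1", "22", "333"} *)
```

Each pass through While handles one chunk independently, which is what makes it possible to hand chunks to parallel kernels later.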

A note on your code: when using WordSeparators -> {"\t", " "}, it is usually necessary to set NullWords -> True as well, because otherwise empty fields between adjacent separators are dropped and the remaining columns shift. (It is very unintuitive that NullWords is False by default, as this leads to unexpected behaviour.)
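
A small sketch of the difference, assuming a made-up line with one empty field between two tabs (the helper name readWords is hypothetical):

```wolfram
(* Hypothetical input: the second field is empty *)
line = "a\t\tb c";
separators = {"\t", " "};

(* Read all words of the line with the given NullWords setting *)
readWords[nullWords_] := Module[{s = StringToStream[line], words},
   words = ReadList[s, Word,
     WordSeparators -> separators, NullWords -> nullWords];
   Close[s];
   words
   ];

withoutNulls = readWords[False]; (* default: the empty field is dropped *)
withNulls = readWords[True];     (* the empty field is kept as "" *)

{Length[withoutNulls], Length[withNulls]}
```

With the default setting the row comes back one field short, so a later column lookup by position would silently pick up the wrong value.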

Joining two datasets (the VLOOKUP)

Use Dispatch. It is very fast while everything else is very slow. A dispatch table is a list of replacement rules optimized for fast operation. When I need to "left join" table1 and table2 on columns col1 and col2, respectively, I use these little helper functions to create a dispatch table and join them:

    (* Default: a final catch-all rule maps unmatched keys to Missing[] *)
    Options[MakeDispatchTable] = {"AppendRules" -> {_ -> Missing[]}};
    MakeDispatchTable[list_, column_, opts : OptionsPattern[]] :=
      Dispatch@MakeRuleTable[list, column, opts];

    Options[MakeRuleTable] = Options[MakeDispatchTable];
    (* One rule per row: key taken from the given column -> whole row *)
    MakeRuleTable[list_, column_, opts : OptionsPattern[]] :=
      Join[
        Rule[#[[Sequence @@ column]], #] & /@ list,
        OptionValue@"AppendRules"
      ];

    (* Left join: append the matching table2 row (or Missing[]) to each table1 row *)
    table12joined = MapThread[Append,
      {table1, Replace[table1[[All, col1]], MakeDispatchTable[table2, col2], {1}]}
    ]

Note that the rule _ -> Missing[] is appended so that any row without a match is replaced by Missing[]. This is also why it is important to use Replace[..., ..., {1}], which replaces only at level one. With /. instead, _ would match the whole expression and replace it with Missing[] altogether.
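
The following sketch replays the join on two tiny tables and shows the level-one point. The tables and variable names are invented for illustration, and the rule list is built inline instead of through the helper functions so that the snippet runs on its own.

```wolfram
(* Hypothetical tables: table1 is joined on its column 2, table2 on its column 1 *)
table1 = {{"a", 1}, {"b", 2}, {"c", 3}};
table2 = {{1, "x"}, {2, "y"}};

(* key -> whole row, followed by the catch-all rule for unmatched keys *)
rules = Dispatch[Append[(#[[1]] -> #) & /@ table2, _ -> Missing[]]];

keys = table1[[All, 2]];

(* Replace at level 1 looks up each key separately *)
matched = Replace[keys, rules, {1}]
(* {{1, "x"}, {2, "y"}, Missing[]} *)

joined = MapThread[Append, {table1, matched}]
(* {{"a", 1, {1, "x"}}, {"b", 2, {2, "y"}}, {"c", 3, Missing[]}} *)

(* /. tries the whole list first, and the catch-all rule matches it *)
whole = keys /. rules
(* Missing[] *)
```

The matching row is appended as a single nested element; apply Join instead of Append if a flat row is preferred, but then unmatched rows need separate handling because Missing[] is not a list.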

EDIT

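As of version 10.0, there is the built-in function Merge, which combines associations by key. When both tables are stored as associations keyed by the join column, it provides the functionality described with the dispatch table above and features good performance.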
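
A minimal sketch of how Merge could be used this way, assuming the two tables have already been converted to associations keyed by the join column (the data and names are made up):

```wolfram
(* Hypothetical tables stored as associations: join key -> row *)
assoc1 = <|1 -> {"a", 1}, 2 -> {"b", 2}, 3 -> {"c", 3}|>;
assoc2 = <|1 -> {1, "x"}, 2 -> {2, "y"}, 4 -> {4, "z"}|>;

(* Merge collects the values of each key into a list; Identity keeps that list as is *)
outer = Merge[{assoc1, assoc2}, Identity];

(* Restricting assoc2 to the keys of assoc1 first gives left-join behaviour *)
left = Merge[{assoc1, KeyTake[assoc2, Keys[assoc1]]}, Identity]
(* <|1 -> {{"a", 1}, {1, "x"}}, 2 -> {{"b", 2}, {2, "y"}}, 3 -> {{"c", 3}}|> *)
```

On its own, Merge keeps the keys of both associations, so outer also contains key 4. Unmatched rows show up as one-element lists, not as Missing[].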

answered May 3, 2018 at 18:00

Theo Tiger
