Henrik Schumacher

Edit

Regarding TFs of bigrams (as you mentioned in the comments), this might be of interest to you:

    ClearAll[getBigramFrequencies];
    getBigramFrequencies[text_String] := Module[{words, wordcounts, bigramcounts},
       words = StringSplit[
         ToLowerCase[
          StringDelete[text, PunctuationCharacter | DigitCharacter]]];
       If[Length[words] > 0,
        wordcounts = AssociationThread @@ Transpose[Tally[words]];
        bigramcounts = Merge[
          KeyValueMap[
           {word, count} \[Function] AssociationThread[StringPartition[word, 2, 1], count],
           wordcounts
           ],
          Total];
        If[Length[bigramcounts] > 0,
         bigramcounts/N[Total[bigramcounts]],
         Association[]
         ]
        ,
        Association[]
        ]
       ];

    texts = ExampleData /@ ExampleData["Text"];
    BigramFrequencies = ParallelMap[getBigramFrequencies, texts]; // AbsoluteTiming
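One detail worth keeping in mind: an association holds each key only once, so a bigram that occurs twice inside the same word (such as "an" in "banana") is, as far as I can tell, not counted twice for that word by the function above. If you want every occurrence counted, a minimal sketch could look like the following. The name getBigramFrequenciesCounted is hypothetical and not part of the code above; it also skips the per-word tally, so it may be slower on long texts.

```mathematica
ClearAll[getBigramFrequenciesCounted];
(* Hypothetical variant: counts every bigram occurrence, including repeats within a word *)
getBigramFrequenciesCounted[text_String] := Module[{words, bigramcounts},
  words = StringSplit[
    ToLowerCase[
     StringDelete[text, PunctuationCharacter | DigitCharacter]]];
  (* Collect the bigrams of all words in one flat list and count them *)
  bigramcounts = Counts[Flatten[StringPartition[#, 2, 1] & /@ words]];
  If[Length[bigramcounts] > 0,
   bigramcounts/N[Total[bigramcounts]],
   Association[]
   ]
  ];

getBigramFrequenciesCounted["banana"]
(* <|"ba" -> 0.2, "an" -> 0.4, "na" -> 0.4|> *)
```

Whether repeats within a word should count separately depends on how you define the term frequency, so treat this as one possible reading rather than a correction.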

Map (/@) returns the results of the iterations of AssociateTo in a list, and that is what confuses you. You can suppress the output by using Scan instead of Map. In fact, the output of the mapped function is not of interest here. What matters is the value of test afterwards, because AssociateTo modifies test in place (its first argument is passed by reference).

    test = Association[{a -> 1, b -> 2, c -> 3}]
    Scan[
      (If[MissingQ[test[#]],
         AssociateTo[test, # -> 1],
         AssociateTo[test, # -> (test[#] + 1)]
         ]) &,
      {a, b, c, d, e}
      ];
    test

<|a -> 1, b -> 2, c -> 3|>

<|a -> 2, b -> 3, c -> 4, d -> 1, e -> 1|>
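The two branches of the If can also be folded into one by letting Lookup supply a default of 0 for missing keys. This is only a sketch of the same update written more compactly. It uses string keys and the name counts (both my choice, not from the question) so that it runs independently of any values assigned to a, b and so on.

```mathematica
counts = <|"a" -> 1, "b" -> 2, "c" -> 3|>;

(* Lookup returns 0 for a missing key, so a single AssociateTo covers both cases *)
Scan[
  AssociateTo[counts, # -> Lookup[counts, #, 0] + 1] &,
  {"a", "b", "c", "d", "e"}
  ];

counts
(* <|"a" -> 2, "b" -> 3, "c" -> 4, "d" -> 1, "e" -> 1|> *)
```

As before, Scan discards the return value of each AssociateTo call, so only the final value of counts is shown.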

Using Scan instead of Map also improves performance, because returning the intermediate results of AssociateTo requires copying them. But copying is exactly what you try to avoid by using AssociateTo instead of building a new association in every step. Here is an illustration of the performance difference:

    n = 100000;
    a = b = AssociationThread[Range[n], RandomInteger[10, n]];
    rand = RandomInteger[n, n];
    a == b
    Scan[
       (If[MissingQ[a[#]],
          AssociateTo[a, # -> 1],
          AssociateTo[a, # -> (a[#] + 1)]
          ]) &,
       rand
       ]; // AbsoluteTiming // First
    Map[
       (If[MissingQ[b[#]],
          AssociateTo[b, # -> 1],
          AssociateTo[b, # -> (b[#] + 1)]
          ]) &,
       rand
       ]; // AbsoluteTiming // First
    a == b

0.30575

0.517976

True
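If the whole list of keys is known up front, the loop can be avoided altogether: Counts tallies the keys and Merge with Total adds those tallies to the existing values. This is a different technique from the AssociateTo loop above, shown only as a sketch. The names start and keys are illustrative, and I have not timed it against the Scan version.

```mathematica
start = <|"a" -> 1, "b" -> 2, "c" -> 3|>;
keys = {"a", "b", "c", "d", "e"};

(* Counts[keys] gives <|"a" -> 1, ..., "e" -> 1|>; Merge sums values that share a key *)
Merge[{start, Counts[keys]}, Total]
(* <|"a" -> 2, "b" -> 3, "c" -> 4, "d" -> 1, "e" -> 1|> *)
```

Unlike AssociateTo, this returns a new association and leaves start unchanged, so assign the result if you need to keep it.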
