answered May 12, 2023 at 18:32

Using Raku (formerly known as Perl_6)

~$ curl https://www.gutenberg.org/cache/epub/5/pg5.txt > US_Constitution.txt

THEN:

Below, grep followed by elems counts the "examined units" of text that contain at least one match. With slurp the unit is the entire file, with lines it is a single line, and with words it is a single whitespace-separated word:

~$ raku -e 'slurp.grep(/ :i the /).elems.put;' US_Constitution.txt
1
~$ raku -e 'lines.grep(/ :i the /).elems.put;' US_Constitution.txt
443
~$ raku -e 'words.grep(/ :i the /).elems.put;' US_Constitution.txt
681

Below, match with :global followed by elems gives the total number of matches. The "examined unit" doesn't matter here, so slurp, lines, and words all return the same count:

~$ raku -e 'slurp.match(:global, / :i the /).elems.put;' US_Constitution.txt
681
~$ raku -e 'lines.match(:global, / :i the /).elems.put;' US_Constitution.txt
681
~$ raku -e 'words.match(:global, / :i the /).elems.put;' US_Constitution.txt
681
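
To see the difference between the two approaches without downloading anything, here is a minimal sketch on a small made-up string ($text and the variable names are my own, not part of the Constitution file). grep counts the units that contain a match, while match with :global counts every occurrence:

```raku
# Hypothetical sample text, made up for illustration only.
my $text = "The cat and the other cat\nsat there then left\nno match here";

# grep + elems: number of units containing at least one match
my $line-count = $text.lines.grep(/ :i the /).elems;   # lines with "the"
my $word-count = $text.words.grep(/ :i the /).elems;   # words with "the"

# match(:global) + elems: number of individual matches
my $match-count = $text.match(:global, / :i the /).elems;

say $line-count;    # 2
say $word-count;    # 5
say $match-count;   # 5
```

The first line holds three occurrences ("The", "the", "other") but is counted only once by lines.grep, which is why the line count is lower than the match count.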

The regex can be improved to match only the free-standing word "the". A general word boundary is denoted with <|w> or <?wb>. Alternatively, you can be more specific and use << for a left word boundary and/or >> for a right word boundary:

~$ raku -e 'words.match(:global, / :i <|w> the <|w> /).elems.put;' US_Constitution.txt
519
~$ raku -e 'words.match(:global, / :i <?wb> the <?wb> /).elems.put;' US_Constitution.txt
519
~$ raku -e 'words.match(:global, / :i << the >> /).elems.put;' US_Constitution.txt
519
#below, remove `:i` case-insensitive flag (adverb):
~$ raku -e 'words.match(:global, / << the >> /).elems.put;' US_Constitution.txt
458
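
The effect of the word boundaries and of the :i adverb can be checked on a short made-up sentence (again, $text and the variable names are only assumptions for illustration):

```raku
# Hypothetical sample sentence, made up for illustration only.
my $text = "The cat and the other cat sat there then left";

# Substring match: also hits "other", "there", "then"
my $any      = $text.match(:global, / :i the /).elems;

# Whole word, case-insensitive: "The" and "the"
my $whole-ci = $text.match(:global, / :i << the >> /).elems;

# Whole word, case-sensitive: only "the"
my $whole-cs = $text.match(:global, / << the >> /).elems;

say $any;        # 5
say $whole-ci;   # 2
say $whole-cs;   # 1
```

Replacing << and >> with <|w> in the second regex should give the same count on this sentence.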

https://raku.org

jubilatious1
