1

Using sed and other basic commands, I'm trying to count the number of words in each passage of a file that contains many separate passages. Each passage begins with a specific number, and the numbers increase. Example:

0:1.1 This is the first passage...

0:1.2 This is the second passage...

The difficult part is that each passage is a paragraph that is word-wrapped across several lines rather than a single line. I could count the words in each passage if they were on single lines. How can I do this? Thanks for the help.

I did figure out how to count the passages with:

grep '[0-9]:[0-9]' file | wc -l

2
  • "Word-wrap" means displaying a long line as multiple lines that fit into the window width, i.e. the line is not actually broken but just displayed so. Do you mean to say each passage is actually broken into multiple lines? Commented Nov 4, 2012 at 8:14
  • Is there a blank line between each passage? Commented Nov 4, 2012 at 12:37

4 Answers

1

This awk solution might work for you:

awk '/^[0-9]:[0-9]\.[0-9]/{
    if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
    pass_num=$1
    word_count=-1
}
{ word_count+=NF }
END { printf "%s, word count: %i\n", pass_num, word_count }
' file

Test input:

# cat file
0:1.1 I am le passage one. There are many words in me.
0:1.2 I am le passage two. One two three four five six Seven
0:1.3 I am "Hello world"

Test output:

0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4


How it works:

Words are separated by whitespace, so each word is one awk field and the number of words on a line equals NF. The count is accumulated line by line until the next passage begins.

When it encounters a new passage (indicated by a passage number at the start of the line), it

  • prints the previous passage's number and word count,
  • sets the passage number to the new passage's number,
  • resets the word count to -1 (so that the passage number itself is not counted).

The END{..} block is needed because the final passage doesn't have a trigger that causes it to print out the passage number and word count.

The if (pass_num) check suppresses the printf when awk encounters the first passage, since there is no previous passage to report yet.
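To watch the running total build up, here is a minimal sketch that prints the count after every input line. The function name trace_counts and the two-line sample are made up for illustration; the awk logic mirrors the script above, minus the final report.

```shell
# trace_counts: hypothetical helper (not part of the answer) that shows
# how word_count grows line by line.
trace_counts() {
    awk '
    /^[0-9]:[0-9]\.[0-9]/ {      # a passage number starts a new passage
        pass_num = $1
        word_count = -1          # the passage number itself is not a word
    }
    {
        word_count += NF         # NF = whitespace-separated fields on this line
        printf "%s line %d: NF=%d, running total=%d\n", pass_num, NR, NF, word_count
    }'
}

# Illustrative input: one passage wrapped over two lines.
printf '%s\n' '0:1.1 I am le passage one.' 'There are many words in me.' | trace_counts
# 0:1.1 line 1: NF=6, running total=5
# 0:1.1 line 2: NF=6, running total=11
```

The first line has six fields including the passage number, which is why the total starts at 5 rather than 6.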


1

This might work for you (GNU sed):

sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file 

It forms each section into a single line then counts the words in the section less the section number (newlines are replaced by spaces).
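The same two stages (join each section onto one line, then count) can also be sketched separately, which may be easier to follow than the one-liner. This is a rough equivalent in POSIX awk rather than the sed command itself; the helper names join_passages and count_joined and the sample text are assumptions for illustration.

```shell
# join_passages: hypothetical helper; puts each passage on a single line.
join_passages() {
    awk '
    /^[0-9]+:[0-9]+\.[0-9]+/ {       # a new passage starts here
        if (seen) print ""           # terminate the previous passage
        seen = 1
        printf "%s", $0
        next
    }
    seen && NF { printf " %s", $0 }  # continuation line: append after a space
    END { if (seen) print "" }'
}

# count_joined: hypothetical helper; expects one passage per line with the
# passage number as the first field, which is left out of the count.
count_joined() {
    awk '{ print $1 " = " (NF - 1) }'
}

# Illustrative input: a wrapped passage, a blank line, then a one-line passage.
printf '%s\n' '0:1.1 This is the' 'first passage...' '' '0:1.2 This is the second passage...' |
    join_passages | count_joined
# 0:1.1 = 5
# 0:1.2 = 5
```

Running join_passages on its own shows the intermediate form, which is the single-line layout the question asked for.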


0
$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.
0:1.2 This is the second passage...
wer qwerqrq ewqr e
0:1.3 This is the second passage...

Using sed and GNU grep:

$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
11

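Replace 0:1.1 with the number of the passage whose words you want to count. Note that //!p skips the lines matching either end of the range, so words on the same line as the passage number are not counted.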


0

Here's one way with GNU awk:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }' 

If it is run on the file listed by doubledown, the output is:

0:1.1: 11
0:1.2: 12
0:1.3: 4

Explanation

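This works by splitting the input into records at each match of [0-9]+:[0-9]+\\.[0-9]+ and splitting each record into fields at whitespace. RT holds the separator that ended the current record, which is the number of the next passage, so it is saved in R and printed with the following record; hence the { R = RT }. The field count is off by two because each record starts and ends with an FS, which produces an empty first and last field; hence the NF - 2.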

Edit - only count fields containing [:alnum:]

The above also counts tokens such as an ellipsis (...) as words. To avoid this, do something like:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
NF > 0 {
    wc = NF-2
    for(i=2; i<NF; i++)
        if($i !~ /[[:alnum:]]+/) wc--
    print R ": " wc
}
{ R = RT }'
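A regex RS and the RT variable are GNU awk features. For other awks, a rough portable sketch of the same filter is shown here. It swaps in line-by-line accumulation instead of record splitting, and it tests for [A-Za-z0-9] rather than [[:alnum:]] on the assumption that the text is ASCII and that some older awks lack POSIX character classes. The function name count_alnum_words and the sample input are hypothetical.

```shell
# count_alnum_words: hypothetical helper; counts only fields that contain
# at least one ASCII letter or digit.
count_alnum_words() {
    awk '
    function report() { if (id != "") print id ": " wc }
    /^[0-9]+:[0-9]+\.[0-9]+/ {      # new passage: flush the previous one
        report()
        id = $1; wc = 0
        $1 = ""                     # blank the passage number so it is not counted
    }
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ /[A-Za-z0-9]/) wc++
    }
    END { report() }'
}

# Illustrative input: "..." and "-" contain no letters or digits, so they are skipped.
printf '%s\n' '0:1.1 Wait ... what?' 'Yes - really.' '0:1.2 Done' | count_alnum_words
# 0:1.1: 4
# 0:1.2: 1
```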

