using sed to turn paragraph to lines

Question

Using sed and any basic commands, I'm trying to count the number of words in each passage of a file that contains many separate passages. Each passage begins with a specific number, and the numbers increase. Example:

0:1.1 This is the first passage...

0:1.2 This is the second passage...

The difficulty is that each passage is a paragraph that is word-wrapped rather than a single line. I could count the words in each passage if they were on single lines. How can I do this? Thanks for the help.

I did figure out how to count the passages with:

grep '[0-9]:[0-9]' file | wc -l

"Word-wrap" means displaying a long line as multiple lines that fit into the window width, i.e. the line is not actually broken but just displayed so. Do you mean to say each passage is actually broken into multiple lines?
– doubleDown, Commented Nov 4, 2012 at 8:14

doubleDown · Accepted Answer · 2012-11-04 09:13:01Z

This awk solution might work for you:

awk '/^[0-9]:[0-9]\.[0-9]/{
  if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
  pass_num=$1
  word_count=-1
}
{ word_count+=NF }
END { printf "%s, word count: %i\n", pass_num, word_count }
' file

Test input:

# cat file
0:1.1 I am le passage one.
There are many words in me.
0:1.2 I am le passage two.
One two three four five six
Seven
0:1.3 I am "Hello world"

Test output:

0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4

How it works:

Words are separated by whitespace, so each word corresponds to one awk field, i.e. the word count of a line equals NF. The word count is accumulated line by line until the next passage begins.
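To see the per-line field counts that get summed, NF can be printed in front of each line. This is only a minimal sketch: the two sample lines assume the first test passage wraps after its first sentence, and `show_nf` is a name made up for illustration.

```shell
# Print the number of whitespace-separated fields (NF) in front of each line.
show_nf() {
    awk '{ print NF ": " $0 }'
}

printf '%s\n' '0:1.1 I am le passage one.' 'There are many words in me.' | show_nf
# 6: 0:1.1 I am le passage one.
# 6: There are many words in me.
```

The two lines contribute 6 + 6 = 12 fields; one of them is the passage number, which leaves 11 words.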

When it encounters a new passage (indicated by the presence of a passage number), it:

- prints the previous passage's number and word count,
- sets the passage number to this new passage number,
- resets the passage word count (to -1, because we don't want the passage number to be counted).

The END{..} block is needed because the final passage doesn't have a trigger that causes it to print out the passage number and word count.

The if (pass_num) is to suppress printf when awk encounters the first passage.
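One edge case follows from this: END prints unconditionally, so input that contains no passage number at all would still produce a line with an empty passage name. A possible variant, sketched here with a hypothetical `count_passages` shell wrapper, applies the same `if (pass_num)` guard in END as well:

```shell
# Same logic as the awk script above, wrapped in a shell function (the name
# count_passages is an assumption for illustration). The END block is
# guarded so that input without any passage number prints nothing.
count_passages() {
    awk '
    /^[0-9]:[0-9]\.[0-9]/ {
        # New passage: report the previous one, if any.
        if (pass_num) printf "%s, word count: %d\n", pass_num, word_count
        pass_num = $1
        word_count = -1          # the passage number itself is not a word
    }
    { word_count += NF }
    END {
        if (pass_num) printf "%s, word count: %d\n", pass_num, word_count
    }' "$@"
}

printf '%s\n' '0:1.1 one two' 'three' '0:1.2 four' | count_passages
# 0:1.1, word count: 3
# 0:1.2, word count: 1
```

Called with a file name (`count_passages file`) it reads that file; without arguments it reads standard input.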

potong · Accepted Answer · 2012-11-04 09:47:13Z

This might work for you (GNU sed):

sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file

It forms each section into a single line then counts the words in the section less the section number (newlines are replaced by spaces).
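The `-r` option and the `e` flag are GNU sed extensions, and `<<<` needs a shell with here-strings. As a rough sketch of the same two steps (join each section into one line, then count the words after the section number) with only POSIX sed and sh assumed; the function names `join_sections` and `count_joined` are made up for illustration:

```shell
# Step 1: join the lines of each section into one line.
# N appends the next input line; unless that line starts with a section
# number, the embedded newline becomes a space and the loop repeats.
# P prints the finished section, D drops it and restarts with the rest.
join_sections() {
    sed '
:a
$!N
/\n[0-9][0-9]*:[0-9][0-9]*\.[0-9]/!{
s/\n/ /
$!ba
}
P
D
' "$@"
}

# Step 2: print "number = word count" for every joined line.
count_joined() {
    while read -r num text; do
        printf '%s = %s\n' "$num" "$(printf '%s\n' "$text" | wc -w | tr -d ' ')"
    done
}

printf '%s\n' '0:1.1 one two' 'three' '' '0:1.2 four' | join_sections | count_joined
# 0:1.1 = 3
# 0:1.2 = 1
```

Running `join_sections file` on its own gives one line per passage, which is the form the question asks for.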

Guru · Accepted Answer · 2012-11-04 06:55:18Z

$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.
0:1.2 This is the second passage...
wer qwerqrq ewqr e
0:1.3 This is the second passage...

Using sed and GNU grep:

$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
11

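Replace 0:1.1 with the number of the passage whose words you want to count. The //!p prints only the lines between the two matching lines, so the words on the line that carries the passage number are not included in the count.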
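It helps to know what the grep stage counts: `grep -Eo` prints every run of letters on its own line, so digits and punctuation are skipped, but a contraction such as "don't" is split in two. A small sketch to check this; `alpha_words` is a hypothetical helper name, `+` is used instead of `*` so that only non-empty runs can match, and `-o` is assumed to be available (it is in GNU and BSD grep):

```shell
# Count runs of letters on standard input, one match per line.
alpha_words() {
    grep -Eo '[[:alpha:]]+' | wc -l | tr -d ' '
}

printf '%s\n' 'welcome to the SO, you leart a lot of things here.' | alpha_words
# 11
printf '%s\n' "I don't know 42 things..." | alpha_words
# 5
```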

Thor · Accepted Answer · 2012-11-04 10:20:13Z

Here's one way with GNU awk:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }'

If it is run on the file listed by doubleDown, the output is:

0:1.1: 11
0:1.2: 12
0:1.3: 4

Explanation

This works by splitting the input into records at [0-9]+:[0-9]+\\.[0-9]+ and splitting each record into fields at whitespace. RT holds the separator that ended the current record, which is the number of the next passage, so it is saved with { R = RT } and printed one record later. The field count is off by two because each record starts and ends with an FS, hence the NF - 2.
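The off-by-two can be reproduced without RS or RT. In this sketch a hand-made record with one separator before and one after its three words is split with the same regex FS, which yields an empty field at each end; `count_fields` is a name invented for the example:

```shell
# Report NF when fields are split on runs of blanks, tabs and newlines.
count_fields() {
    awk -F'[ \t\n]+' '{ print NF }'
}

printf '%s\n' ' a b c ' | count_fields
# 5
```

Three words plus the two empty edge fields give 5, so the word count is NF - 2. With awk's default FS (a single space) leading and trailing whitespace is ignored and the same record would have NF equal to 3.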

Edit - only count fields containing `[:alnum:]`

The above also counts things like an ellipsis (...) as words. To avoid this, do something like:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
NF > 0 {
  wc = NF-2
  for(i=2; i<NF; i++)
    if($i !~ /[[:alnum:]]+/) wc--
  print R ": " wc
}
{ R = RT }'
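RT is specific to GNU awk. Where only a POSIX awk is available, the same filter could be combined with a line-by-line count instead. This is a sketch under that assumption, not the code above: `count_alnum_words` is a made-up name, and the ASCII class `[A-Za-z0-9]` stands in for `[[:alnum:]]`.

```shell
# Count, per passage, only the fields that contain a letter or digit.
count_alnum_words() {
    awk '
    { start = 1 }
    /^[0-9]+:[0-9]+\.[0-9]+/ {
        if (num != "") print num ": " wc
        num = $1
        wc = 0
        start = 2                # skip the passage number in field 1
    }
    {
        for (i = start; i <= NF; i++)
            if ($i ~ /[A-Za-z0-9]/) wc++
    }
    END { if (num != "") print num ": " wc }' "$@"
}

printf '%s\n' '0:1.1 This is the first passage...' 'wait ... done' '0:1.2 ...' | count_alnum_words
# 0:1.1: 7
# 0:1.2: 0
```

The free-standing "..." fields are skipped, while "passage..." still counts because it contains letters.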
