1

Say I have a bunch of (markdown) text with each sentence on a separate line (for easier version control in case of typos). Example file.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Dictum sit amet justo donec enim diam vulputate.
Nunc faucibus a pellentesque sit amet.
Quis enim lobortis scelerisque fermentum dui faucibus in.
Leo duis ut diam quam nulla porttitor massa id neque.
Vitae tortor condimentum lacinia quis vel eros.

How can I turn each paragraph into a single line so that it looks like:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Dictum sit amet justo donec enim diam vulputate. Nunc faucibus a pellentesque sit amet. Quis enim lobortis scelerisque fermentum dui faucibus in. Leo duis ut diam quam nulla porttitor massa id neque. Vitae tortor condimentum lacinia quis vel eros. Velit euismod in pellentesque massa placerat duis ultricies lacus. 

My idea is to find and replace the newline character \n between a full stop . and any non-whitespace character \S. I've figured out how to do it in regex101 here but was wondering if there's a shorter tr/sed/awk equivalent I can use in my bash shell. Something like cat file.txt | ???

3 Answers

3

perl has a paragraph mode via the -00 flag (see perlrun), so we can replace all the internal newlines of each paragraph of your input with a space:

$ wc -l input
7 input
$ perl -00 -pe 's/\n(?!\Z)/ /g' input | wc -l
3
$

The (?!\Z) bit is to not replace the newlines at the end of each paragraph, thus preserving the paragraph boundaries.

Another option is lex. This reveals a few tricky points, notably how to handle EOF and whether to always include an ultimate newline (as POSIX demands), and what you define as a paragraph: exactly two newlines, or any number whatsoever?

%%
[\n][\n]+ { printf("%s", yytext); }
\n {
    int c = input();
    /* TODO book docs say this should return EOF on EOF ?? */
    if (c == 0) {
        putchar('\n');
        yyterminate();
    } else {
        printf(" %c", c);
    }
}
<<EOF>> { putchar('\n'); yyterminate(); }
%%
int main(int argc, char *argv[])
{
    return yylex();
}

It probably needs more tests than

$ make paranlneg
lex -o lex.paranlneg.c paranlneg.l
egcc -O2 -pipe -o paranlneg lex.paranlneg.c -ll
rm -f lex.paranlneg.c
$ perl -E 'say "a\nb\n\nc\nd"' | ./paranlneg
a b

c d
$
1
  • Thanks! I've had to modify your script a little bit to handle some edge cases but I'll figure those out myself. Good work :) Commented Nov 1, 2018 at 21:41
2

Similar to @thrig's Perl-based answer but using GNU Awk:

$ gawk -vRS= '{$1=$1; printf "%s%s", $0, RT}' file.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Dictum sit amet justo donec enim diam vulputate. Nunc faucibus a pellentesque sit amet. Quis enim lobortis scelerisque fermentum dui faucibus in. Leo duis ut diam quam nulla porttitor massa id neque. Vitae tortor condimentum lacinia quis vel eros.
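RT is a gawk extension. If only a POSIX awk is available, a rough equivalent might look like the sketch below. It is not part of the answer above, and it assumes it is acceptable to collapse any run of blank lines into exactly one and to end the output with an extra blank line. The function name join_paragraphs_awk is hypothetical.

```shell
# Hypothetical helper: POSIX awk variant without RT.
join_paragraphs_awk() {
    # RS="" selects paragraph mode, in which newline also separates fields;
    # $1=$1 rebuilds the record with single spaces between fields;
    # ORS puts one blank line after every paragraph.
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }'
}

# Two paragraphs separated by two blank lines.
printf 'a\nb\n\n\nc\nd\n' | join_paragraphs_awk
```

Note that rebuilding the record this way also squeezes runs of spaces or tabs inside a line down to a single space, which may or may not matter for markdown.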

For a quick'n'dirty solution you could use the Coreutils fmt utility with a suitably large width value:

fmt -w1000 file.txt 

(although by default this will add a double space after each period).

2

GNU sed based approach:

You can use tr to replace <newline> characters with <NUL> characters, then use sed to change sequences of two or more consecutive <NUL> characters into a double <newline> character, then use tr to replace remaining <NUL> characters with white spaces:

$ tr '\n' '\0' <file.txt | sed 's/\o000\{2,\}/\n\n/g' | tr '\0' ' ' | sed --null-data 's/ $/\n/' 

Here, the last sed is only needed to substitute the final remaining space with a new line.

Alternatively (and more concisely), you can instruct sed to treat your file as a sequence of NUL-terminated lines (so that, with no NUL characters in the file, sed sees it as a single line) and replace every single newline that is both preceded and followed by a non-space character with a single space:

$ sed --null-data 's/\([^[:space:]]\)\n\([^[:space:]]\)/\1 \2/g' file.txt 

This will also preserve vertical spacing between paragraphs, i.e. the number of consecutive new lines. I preferred to search for a non-space character (instead of a dot) followed by a new line just to handle the case of a sentence not ending in a full stop.
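As a quick check of the claim about vertical spacing, the sketch below runs the second command on a sample with two blank lines between paragraphs. It assumes GNU sed, since --null-data is a GNU extension; the function name join_lines_sed and the sample text are hypothetical.

```shell
# Hypothetical helper: the second sed command from this answer, reading stdin.
join_lines_sed() {
    # With no NUL bytes in the input, --null-data makes sed see one big "line",
    # so \n can be matched directly inside the pattern.
    sed --null-data 's/\([^[:space:]]\)\n\([^[:space:]]\)/\1 \2/g'
}

# Two paragraphs separated by TWO blank lines; both blank lines should survive.
printf 'one.\ntwo.\n\n\nthree.\nfour.\n' | join_lines_sed
```

One caveat worth testing on your own data: because the pattern consumes a character on each side of the newline, a line that is only one character long can keep its newline when its neighbours are joined.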

1
  • Not as short but I like this approach! Looks a lot more adaptable for handling other use cases :D Commented Nov 1, 2018 at 22:16
