I need to produce clean txt documents, and my first approach is to use aspell. The catch is that it has to run in batch, not in interactive mode: every txt file is piped to aspell, and a new document must be returned with the non-dictionary words deleted.

So far I've only found the inverse behaviour, listing the non-dictionary words with

aspell list < "$file" | sort -u -f

Is aspell the correct tool to produce that folder of cleaned documents? What about automatic substitution of misspelled words (using a predefined list file)?

1 Answer

sed -E -e "s/$(aspell list <file | sort -u | paste -s -d'|' | sed -e 's/^/\\b(/; s/$/)\\b/')//g" file > newfile

This uses command substitution, $(...), to insert the output of aspell list <file into a sed search-and-replace operation.

aspell's output is sorted with duplicates removed, and paste joins the resulting lines with |. That is then piped through sed to add \b word-boundary anchors plus the opening and closing parentheses. Together this builds a valid extended regular expression of the form \b(word1|word2|word3|...)\b, which is used as the search pattern in the outer sed search-and-replace command. Note that \b is a GNU sed extension.
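To see what the pipeline produces on its own, here is a minimal sketch of just the regex-building step. The function name build_regex is made up for illustration, and a fixed word list stands in for the output of aspell list, so it runs without aspell installed.

```shell
# build_regex: hypothetical helper, not part of aspell or sed.
# Reads one word per line on stdin and prints an alternation
# wrapped in \b word-boundary anchors.
build_regex() {
    sort -u | paste -s -d'|' - | sed -e 's/^/\\b(/; s/$/)\\b/'
}

# A fixed list replaces `aspell list <file` here; note the duplicate.
printf 'wrod\nteh\nteh\n' | build_regex
# prints: \b(teh|wrod)\b
```

If the word list is empty, the resulting pattern is not useful, so it is worth checking for that case before handing the expression to sed.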

You can test the result of the entire command with, e.g., diff -u file newfile
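Since the question is about a whole folder, the same idea can be wrapped in a loop. The following is only a sketch under some assumptions: GNU sed (for \b), input files named in/*.txt, an output directory out/, and a helper called clean_file. All of those names are placeholders. It also assumes the listed words contain no regex metacharacters or slashes.

```shell
#!/bin/sh
# clean_file: hypothetical helper. Prints file $1 with every word that is
# listed (one per line) in file $2 removed. Assumes GNU sed for \b.
clean_file() {
    words=$(sort -u "$2" | paste -s -d'|' -)
    if [ -z "$words" ]; then
        cat "$1"                      # nothing to remove: copy unchanged
    else
        sed -E -e "s/\\b($words)\\b//g" "$1"
    fi
}

# Batch driver; in/ and out/ are placeholder directory names.
# It is skipped when aspell or the input directory is missing.
if command -v aspell >/dev/null 2>&1 && [ -d in ]; then
    mkdir -p out
    list=$(mktemp) || exit 1
    for f in in/*.txt; do
        [ -e "$f" ] || continue       # the glob matched nothing
        aspell list <"$f" >"$list"
        clean_file "$f" "$list" >"out/${f##*/}"
    done
    rm -f "$list"
fi
```

Writing to a separate output directory keeps the originals intact, so each result can still be compared against its source with diff -u.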

AFAIK, aspell doesn't have an auto-correct mode. This is probably a Good Thing.
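For the predefined-list substitution mentioned in the question, one option that does not involve aspell at all is to turn the list into a sed script. This is a sketch, not a feature of aspell: the helper name apply_fixes and the file format (one misspelling, a tab, then its replacement per line) are assumptions, and it again relies on GNU sed for \b and on entries free of regex metacharacters and slashes.

```shell
# apply_fixes: hypothetical helper. $1 is the text file, $2 is a list with
# one "misspelling<TAB>replacement" pair per line. Prints the corrected text.
apply_fixes() {
    script=$(mktemp) || return 1
    # Turn each pair into a sed command such as: s/\bteh\b/the/g
    awk -F '\t' 'NF == 2 { printf "s/\\b%s\\b/%s/g\n", $1, $2 }' "$2" >"$script"
    sed -E -f "$script" "$1"
    rm -f "$script"
}
```

Called as apply_fixes file fixes.tsv > newfile, this replaces only whole words, so a listed misspelling that appears inside a longer word is left alone.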

  • Hi cas, tested your code but the file comes out untouched Commented May 12, 2016 at 15:26
  • Try the updated version. The first had two problems - 1. aspell reads from stdin, not a file 2. grep -v would never have done what you want, it would have removed the entire line on any match, not just the matching word. Commented May 13, 2016 at 0:01
  • The updated version strips the words, but it also rips apart longer words that contain them: e.g., citizenship would be converted to citizen if ship is not in the dictionary. That is too bad. Commented May 13, 2016 at 11:48
  • ok, that just means the regexp needs to be further modified to have word boundary anchors....i really should have thought of that earlier. i'll update my answer. Commented May 13, 2016 at 14:24