  • What OS and language settings are you using? Keep in mind that many documents use non-breaking spaces (NBSP), which UTF-8 encodes as a multi-byte sequence rather than as a single ASCII space. Try fiddling with your OS/terminal language or locale settings. Commented May 2, 2021 at 10:21
  • The book is in English. I'm using Ubuntu 20. $ locale LANG=en_US.UTF-8 Commented May 2, 2021 at 10:24
  • @not2qubit Are you sure the utility shouldn't be responsible for this? For example, the utility pdftotext has a -layout option to keep the original formatting of a PDF in the TXT output. Commented May 2, 2021 at 10:27
  • I have no idea. I just had a similar issue with OCR reading a PDF: the program insisted on extracting NBSPs because the doc was encoded in a foreign language. Commented May 2, 2021 at 10:30
  • You could convert to HTML (or just unzip the EPUB and use the HTML inside it directly) and try your luck with links -dump or similar. If that doesn't work either, you might have to look at the HTML directly and write your own helper script for converting the code snippets. Commented May 2, 2021 at 10:35
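
For anyone who ends up going the helper-script route from the last comment, here is a minimal sketch in Python using only the standard library. It rests on two assumptions that may not hold for your book: that the code snippets sit inside `<pre>` elements, and that the EPUB's content files end in `.html`, `.xhtml`, or `.htm`. The names `PreExtractor`, `snippets_from_html`, and `snippets_from_epub` are made up for this example. It also replaces non-breaking spaces (U+00A0) with plain spaces, which addresses the NBSP concern raised in the comments.

```python
import zipfile
from html.parser import HTMLParser

NBSP = "\u00a0"  # non-breaking space, often written as &nbsp; or &#160; in HTML


class PreExtractor(HTMLParser):
    """Collect the text inside <pre> elements, keeping whitespace intact.

    Assumption: the book marks up code snippets with <pre>. Adjust the tag
    check if your EPUB uses something else (e.g. a <p> with a special class).
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # nesting depth of currently open <pre> tags
        self.current = []   # text pieces of the snippet being read
        self.snippets = []  # finished snippets, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1
        elif tag == "br" and self.depth:
            # Some books break code lines with <br/> instead of newlines.
            self.current.append("\n")

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.snippets.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def snippets_from_html(html):
    """Return the <pre> snippets of one HTML document, NBSPs made plain."""
    parser = PreExtractor()
    parser.feed(html)
    parser.close()
    return [s.replace(NBSP, " ") for s in parser.snippets]


def snippets_from_epub(path):
    """Read every HTML-like member of the EPUB (a zip archive) in turn."""
    found = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                html = book.read(name).decode("utf-8", errors="replace")
                found.extend(snippets_from_html(html))
    return found
```

Calling `snippets_from_epub("book.epub")` (with a hypothetical file name) returns a list of strings, one per snippet, which you can then write out however you like. Note that members are visited in archive order, which is not necessarily the book's reading order.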