
I'm working with HTML files created by Acrobat, which doesn't use proper HTML entities to escape Unicode characters. I need to include single and double curly quotation marks in a find/replace pattern, but every attempt I've made at escaping these characters has failed in my script, even though the same code works from a regular PowerShell session.

For example, this find/replace does not work:

    $html = $html.Replace("`“", '&ldquo;')
    $html = $html.Replace("`”", '&rdquo;')
    $html = $html.Replace("`‘", '&lsquo;')
    $html = $html.Replace("`’", '&rsquo;')

...but it does work if I break into my script and run one of those replace lines from the debug prompt.

Edit: Here's a snippet of the markup I'm testing with right now:

<p style="padding-left: 5pt;text-indent: 17pt;line-height: 119%;text-align: justify;">To guide its readers the Hermetica makes use of the mystical astrological world-view that we have been discussing. It describes the creation of the world as a series of emanations, starting with the Light, who gave birth to a son called Logos. In the words of Hermes’s guide, Poimandres:</p><p style="padding-left: 24pt;text-indent: 0pt;line-height: 119%;text-align: justify;">“That Light,” he said, “is I, even Mind, the first God, who was before the watery substance which appeared out of the darkness; and the Logos which came forth the Light is son of God.”</p><p style="padding-left: 21pt;text-indent: 1pt;line-height: 119%;text-align: justify;">(Scott, Walter, translator, Hermetica: The Ancient Greek and Latin Writings Which Contain Religious or Philosophical Teachings Ascribed to Hermes Trismegistus, Boston: Shambhala: 1985, p. 117)</p> 

If $html equals that string, my attempts to find and replace the characters appear to be futile.

  • The snippet you posted seems to work, which makes me think there's something else going on in a part of your script that you didn't include. Commented Dec 2, 2021 at 16:54
  • Hmmm. $html is just the contents of an Acrobat-generated HTML file, and this is early into the parsing of the file, so I haven't done much to it. If I output the contents of $html at a debug prompt (in VSCode), I still see the Unicode quote characters, not the HTML entities - same goes for outputting $html to a text file so I can see the result. Commented Dec 2, 2021 at 16:57
  • I'm getting correct results from a non-VSCode PowerShell window, so this might be something specific to VSCode's PowerShell extension. Commented Dec 2, 2021 at 17:11
  • It's very possible. I've only ever used PowerShell ISE so I couldn't say one way or the other. Commented Dec 2, 2021 at 17:13
  • Ok, this seems to be an issue with the encoding of my .ps1 file. A test script I created in the ISE was saved as UTF-8 with BOM, whereas VSCode creates UTF-8 files without the BOM. The ISE-created test with BOM encoding works; the exact same code saved as UTF-8 without BOM in VSCode does not. Commented Dec 2, 2021 at 19:49

2 Answers

1

Try using the Unicode values instead of backquoting the literal:

    $html = $html.Replace("`u{201C}", '&ldquo;')
    $html = $html.Replace("`u{201D}", '&rdquo;')
    $html = $html.Replace("`u{2018}", '&lsquo;')
    $html = $html.Replace("`u{2019}", '&rsquo;')


If you're having problems with encoding (UTF-8, for example, as you suggested), take a look at https://unicode-table.com - you can get the code values for any encoding.
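One caveat on the escape syntax: the `` `u{...} `` sequence is only recognized by PowerShell 6 and later, so it does nothing useful in Windows PowerShell 5.1 or the ISE. A variant that should work in both is to build each character from its code point with `[char]`. The sketch below is only an illustration: the variable names and the sample `$html` string are made up for the example, and the script contains nothing but ASCII, so the encoding of the .ps1 file stops mattering.

```powershell
# Build the four curly quotes from their Unicode code points.
# Nothing in this script is non-ASCII, so the file encoding is irrelevant.
$ldquo = [string][char]0x201C   # left double quotation mark
$rdquo = [string][char]0x201D   # right double quotation mark
$lsquo = [string][char]0x2018   # left single quotation mark
$rsquo = [string][char]0x2019   # right single quotation mark

# Hypothetical sample input, shortened from the markup in the question.
$html = "<p>${ldquo}That Light,${rdquo} he said. Hermes${rsquo}s guide.</p>"

# String.Replace(string, string) does an ordinal, case-sensitive replacement.
$html = $html.Replace($ldquo, '&ldquo;').
              Replace($rdquo, '&rdquo;').
              Replace($lsquo, '&lsquo;').
              Replace($rsquo, '&rsquo;')

$html
# <p>&ldquo;That Light,&rdquo; he said. Hermes&rsquo;s guide.</p>
```

The explicit `[string]` cast is there so that PowerShell picks the `Replace(string, string)` overload rather than trying the `Replace(char, char)` one.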


2 Comments

First one worked for me :) nice answer
That format for Unicode entities isn't working at all for me.
0

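Evidently, the problem is how PowerShell reads script files saved as UTF-8 without a BOM: Windows PowerShell assumes the system's legacy ANSI code page for such files, so the multi-byte quote characters typed into the script are misread and never match anything in $html. Setting VSCode to save PowerShell scripts as UTF-8 with BOM (`"files.encoding": "utf8bom"`, which can be scoped to `[powershell]` files in settings.json) allows the String.Replace function to operate as expected.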
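For anyone curious what the misreading looks like, here is a rough sketch that simulates it. It assumes the script is being read with the Windows-1252 code page, which is a common ANSI default but not the only possible one, and the sample `$html` string is made up for the example.

```powershell
# U+201C built from its code point, so this demo itself is pure ASCII.
$leftQuote = [string][char]0x201C

# The bytes a UTF-8 file (with or without BOM) stores for that character.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($leftQuote)
($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '   # E2 80 9C

# Decode the same bytes as Windows-1252 (assumed ANSI code page) to simulate
# how a BOM-less script file is read: one character becomes three.
$ansi    = [System.Text.Encoding]::GetEncoding(1252)
$misread = $ansi.GetString($utf8Bytes)
$misread.Length                                                 # 3

# The file content was read correctly, so it holds the real character.
$html = '<p>' + $leftQuote + 'That Light</p>'
$html.Contains($leftQuote)   # True
$html.Contains($misread)     # False, so Replace() silently changes nothing
```

This also fits the observation that the same line works when typed at the debug prompt: there the character comes from the console input rather than from the misread script file.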
