PowerShell - escaping fancy single and double quotes for regex and string replace

Question

I'm working with HTML files created by Acrobat, which doesn't use proper HTML entities to escape Unicode characters. I need to include single and double right quotation marks in a regex pattern, but every attempt I've made at escaping these characters has failed in my script...even if it works from a regular PowerShell session.

For example, this find/replace does not work:

 $html = $html.Replace("`“", '&ldquo;')
 $html = $html.Replace("`”", '&rdquo;')
 $html = $html.Replace("`‘", '&lsquo;')
 $html = $html.Replace("`’", '&rsquo;')

...but it does work if I break into my script and run one of those replace lines from the debug prompt.

Edit: Here's a snippet of the markup I'm testing with right now:

<p style="padding-left: 5pt;text-indent: 17pt;line-height: 119%;text-align: justify;">To guide its readers the Hermetica makes use of the mystical astrological world-view that we have been discussing. It describes the creation of the world as a series of emanations, starting with the Light, who gave birth to a son called Logos. In the words of Hermes’s guide, Poimandres:</p><p style="padding-left: 24pt;text-indent: 0pt;line-height: 119%;text-align: justify;">“That Light,” he said, “is I, even Mind, the first God, who was before the watery substance which appeared out of the darkness; and the Logos which came forth the Light is son of God.”</p><p style="padding-left: 21pt;text-indent: 1pt;line-height: 119%;text-align: justify;">(Scott, Walter, translator, Hermetica: The Ancient Greek and Latin Writings Which Contain Religious or Philosophical Teachings Ascribed to Hermes Trismegistus, Boston: Shambhala: 1985, p. 117)</p>

If $html equals that string, my attempts to find and replace the characters appear to be futile.

The snippet you posted seems to work, which makes me think there's something else going on in a part of your script that you didn't include.
– Jesse, Commented Dec 2, 2021 at 16:54
Hmmm. $html is just the contents of an Acrobat-generated HTML file, and this is early in the parsing of the file, so I haven't done much to it. If I output the contents of $html at a debug prompt (in VSCode), I still see the Unicode quote characters, not the HTML entities - same goes for outputting $html to a text file so I can see the result.
– CXL, Commented Dec 2, 2021 at 16:57
I'm getting correct results from a non-VSCode PowerShell window, so this might be something specific to VSCode's PowerShell extension.
– CXL, Commented Dec 2, 2021 at 17:11
It's very possible. I've only ever used PowerShell ISE so I couldn't say one way or the other.
– Jesse, Commented Dec 2, 2021 at 17:13
OK, this seems to be an issue with the encoding of my .ps1 file. A test script I created in the ISE was saved as UTF-8 with BOM, whereas VSCode creates UTF-8 files without the BOM. The ISE-created test with BOM encoding works; the exact same code saved as UTF-8 without BOM in VSCode does not.
– CXL, Commented Dec 2, 2021 at 19:49
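
One way to confirm an encoding problem like the one described in the last comment is to print the code points of a quote literal as the script actually sees it. This is only a rough diagnostic sketch: `Get-CodePoint` is a hypothetical helper name, not a built-in cmdlet, and the Windows-1252 result assumes a system whose legacy ANSI code page is 1252.

```powershell
# Hypothetical helper: lists each UTF-16 code unit of a string as U+XXXX.
function Get-CodePoint {
    param([Parameter(Mandatory)][string]$Text)
    foreach ($ch in $Text.ToCharArray()) {
        'U+{0:X4}' -f [int]$ch
    }
}

# Run this from inside the .ps1 file that misbehaves.
# A correctly decoded file yields a single value: U+201C.
# If the UTF-8 bytes E2 80 9C were read as Windows-1252 (assumed code page),
# the literal becomes three characters: U+00E2, U+20AC, U+0153.
Get-CodePoint -Text '“'
```

If the literal comes back as three code points, the pattern passed to Replace can never match the real quote character in $html.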

WaitingForGuacamole · Accepted Answer · 2021-12-02 21:35:10Z

Try using the Unicode values instead of backquoting the literal:

 $html = $html.Replace("`u{201C}", '&ldquo;')
 $html = $html.Replace("`u{201D}", '&rdquo;')
 $html = $html.Replace("`u{2018}", '&lsquo;')
 $html = $html.Replace("`u{2019}", '&rsquo;')

If you're having problems with encoding (UTF-8, for example, as you suggested), take a look at https://unicode-table.com - you can get the code values for any encoding.

That format for Unicode entities isn't working at all for me.
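
A likely reason for that: the `` `u{...} `` escape is only recognized by PowerShell 6 and later, so Windows PowerShell 5.1 treats it as plain text. A version-independent sketch is to build each character from its code point with a `[char]` cast, which also keeps the script file pure ASCII and so sidesteps the file-encoding question entirely. `Convert-CurlyQuoteToEntity` is a hypothetical function name chosen for illustration.

```powershell
# Hypothetical helper: replaces the four curly quotes with HTML entities.
# The characters are built from code points, so no non-ASCII text appears
# in the script file itself.
function Convert-CurlyQuoteToEntity {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Html)

    $Html = $Html.Replace([string][char]0x201C, '&ldquo;')  # left double quote
    $Html = $Html.Replace([string][char]0x201D, '&rdquo;')  # right double quote
    $Html = $Html.Replace([string][char]0x2018, '&lsquo;')  # left single quote
    $Html = $Html.Replace([string][char]0x2019, '&rsquo;')  # right single quote
    return $Html
}

# Usage: the input string is also assembled from code points here.
$sample = 'Hermes' + [char]0x2019 + 's guide'
Convert-CurlyQuoteToEntity -Html $sample   # Hermes&rsquo;s guide
```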

CXL · Accepted Answer · 2021-12-02 19:54:10Z

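Evidently, Windows PowerShell misreads script files saved as UTF-8 without a BOM: lacking the BOM, it decodes the file using the system's legacy ANSI code page, so the multi-byte curly quotes in the script source no longer match the characters in $html. Setting VSCode to save PowerShell scripts as UTF-8 with BOM allows the String.Replace method to operate as expected.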
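
For scripts that already exist, the BOM can also be checked and added from PowerShell itself. The following is a minimal sketch, assuming the file really is UTF-8; `Test-Utf8Bom` and `Add-Utf8Bom` are hypothetical helper names, not built-in cmdlets.

```powershell
# Hypothetical helper: $true when the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Hypothetical helper: re-saves a BOM-less file as UTF-8 with BOM.
# Assumes the existing content is valid UTF-8.
function Add-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    if (Test-Utf8Bom -Path $fullPath) { return }   # already has a BOM

    $text = [System.IO.File]::ReadAllText($fullPath, [System.Text.UTF8Encoding]::new($false))
    # UTF8Encoding($true) makes WriteAllText emit the BOM before the text.
    [System.IO.File]::WriteAllText($fullPath, $text, [System.Text.UTF8Encoding]::new($true))
}
```

Running `Test-Utf8Bom -Path .\script.ps1` (the path is a placeholder) before and after `Add-Utf8Bom` shows whether the file was changed.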
