fix(imap): handle korean Windows encoding ks_c_5601-1987/ks_c_5601-1989#12654
fix(imap): handle korean Windows encoding ks_c_5601-1987/ks_c_5601-1989#12654ChristophWurst wants to merge 8 commits intomainfrom
Conversation
AI-assisted: OpenCode (Claude Haiku 4.5) Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
| /backport to stable5.7 |
There was a problem hiding this comment.
Pull request overview
This PR addresses Nextcloud Mail issue #11416 by improving IMAP charset handling for Outlook messages that label Korean content using non-mbstring charset names (ks_c_5601-1987 / ks_c_5601-1989), preventing conversion failures during message processing.
Changes:
- Add a charset normalization step mapping
ks_c_5601-1987/ks_c_5601-1989toUHC(CP949/Windows-949). - Apply the normalized charset during
mb_convert_encoding()/iconv()conversion in the IMAP charset converter. - Add unit test coverage for both Outlook Korean charset labels.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
lib/IMAP/Charset/Converter.php | Adds charset normalization and uses it during conversion to avoid failures on Outlook Korean charset aliases. |
tests/Unit/IMAP/Charset/ConverterTest.php | Adds unit tests validating conversion of ks_c_5601-1987 and ks_c_5601-1989 content to UTF-8. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Thanks for taking over; looks good! When I looked at it a while back, I'd bumped into some blockers and didn't find the time to continue.
The list also contains "BASE64", "UUENCODE", "HTML-ENTITIES", and "Quoted-Printable". Request the aliases for those; does trigger a deprecation warning 🙈 What I had in mind was similar to what also Josh started with #12221. |
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/IMAP/Charset/Converter.php Outdated
| if (in_array($normalizedCharset, mb_list_encodings(), true)) { | ||
| $converted = mb_convert_encoding($data, 'UTF-8', $normalizedCharset); | ||
| } else { | ||
| $converted = iconv($charset, 'UTF-8', $data); | ||
| $converted = iconv($normalizedCharset, 'UTF-8', $data); | ||
| } |
There was a problem hiding this comment.
iconv() can throw a ValueError for unknown/unsupported encodings, and this call isn’t guarded. A bad charset token can still make mailbox loading fail hard. Consider wrapping the conversion attempt in try/catch (\ValueError) and falling back to the auto-detection path (or throwing ServiceException) instead of letting the error bubble up.
lib/IMAP/Charset/Converter.php Outdated
| | ||
| $converted = @mb_convert_encoding($data, 'UTF-8', $charset); | ||
| $normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null; | ||
| $converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset); |
There was a problem hiding this comment.
mb_convert_encoding() may throw ValueError (not return false) when the provided charset name is invalid. Since the code only checks for false, this path can still fatally error and bypass the intended iconv/ServiceException fallback. Please catch \ValueError around this call and handle it like a failed conversion.
| $converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset); | |
| try { | |
| $converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset); | |
| } catch (\ValueError $e) { | |
| $converted = false; | |
| } |
| $koreanKsc56011987MimePart->setCharset('ks_c_5601-1987'); | ||
| $koreanText = '안녕하세요'; | ||
| $koreanKsc56011987MimePart->setContents(mb_convert_encoding($koreanText, 'UHC', 'UTF-8')); | ||
| // Korean (Outlook) - ks_c_5601-1989 is also mapped to UHC |
There was a problem hiding this comment.
The Korean test fixture is generated with mb_convert_encoding(..., 'UHC', ...). If the mbstring build in CI/runtime doesn’t include the UHC encoding, this will fail before Converter::convert() is exercised (even though the converter could still work via iconv). Consider generating the bytes with iconv instead, or conditionally skipping when the encoding isn’t supported, to keep the unit test environment-independent.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| } catch (ValueError) { | ||
| // Invalid charset name, fall through to auto-detection | ||
| $converted = null; | ||
| } |
There was a problem hiding this comment.
After catching ValueError for an invalid charset name, the code comment says it should “fall through to auto-detection”, but later $normalizedCharset is recomputed from the original (still-invalid) $charset and passed to mb_convert_encoding()/iconv() again. This prevents the intended autodetect fallback and guarantees a ServiceException for invalid charset tokens even when mbstring could detect/convert the bytes. Consider treating an invalid/empty charset as null (or setting a flag) so the later conversion uses autodetection instead of retrying the invalid name.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| $iso2022jpMimePart_noCharset = new Horde_Mime_Part(); | ||
| $iso2022jpMimePart_noCharset->setContents('外せ園査リツハワ題'); |
There was a problem hiding this comment.
$iso2022jpMimePart_noCharset is created but never used in the returned data provider cases, which makes the fixture setup misleading and harder to maintain. Either add an assertion case for it or remove the unused mime part setup.
| $iso2022jpMimePart_noCharset = new Horde_Mime_Part(); | |
| $iso2022jpMimePart_noCharset->setContents('外せ園査リツハワ題'); |
| public function testConvertWithInvalidCharsetThrowsServiceException(): void { | ||
| $mimePart = $this->createMock(Horde_Mime_Part::class); | ||
| // Use content that is NOT valid UTF-8 (raw ISO-8859-1 bytes) | ||
| $mimePart->method('getContents') | ||
| ->willReturn(mb_convert_encoding('Tëst with spëcial chärs', 'ISO-8859-1', 'UTF-8')); | ||
| $mimePart->method('getCharset') | ||
| ->willReturn('INVALID-CHARSET-NAME-12345'); | ||
| | ||
| $this->expectException(\OCA\Mail\Exception\ServiceException::class); | ||
| | ||
| $this->converter->convert($mimePart); |
There was a problem hiding this comment.
testConvertWithInvalidCharsetThrowsServiceException assumes conversion must fail after an invalid charset name, but Converter::convert() deliberately falls back to mbstring auto-detection when it catches the ValueError. Depending on mbstring.detect_order, the ISO-8859-1 payload here may be successfully auto-detected/converted, making this test environment-dependent/flaky. Consider pinning mb_detect_order() within the test (and restoring it after), or assert a behavior that is deterministic (e.g., that no ValueError bubbles up), rather than requiring a ServiceException.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/IMAP/Charset/Converter.php Outdated
| } | ||
| | ||
| $converted = @mb_convert_encoding($data, 'UTF-8', $charset); | ||
| $normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null; |
There was a problem hiding this comment.
$detectedCharset is computed above, but the conversion path below ignores it. When $charset is null/invalid, $normalizedCharset becomes null and mb_convert_encoding() is called without a concrete source encoding, so the earlier detection has no effect beyond the UTF-8 fast-path. Consider using $detectedCharset (when it’s not false) as the from-encoding for mb_convert_encoding/iconv, and only falling back/throwing when detection fails.
| $normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null; | |
| // Prefer the explicit charset, but fall back to the detected charset if available | |
| if ($charset !== null) { | |
| $normalizedCharset = $this->normalizeCharset($charset); | |
| } elseif ($detectedCharset !== false) { | |
| $normalizedCharset = $this->normalizeCharset($detectedCharset); | |
| } else { | |
| $normalizedCharset = null; | |
| } |
| * This tests the code path where $charset is null and we reach the | ||
| * fallback mb_convert_encoding() call. With null, mbstring should | ||
| * auto-detect the encoding rather than receiving an empty string | ||
| * (which would cause a ValueError). |
There was a problem hiding this comment.
This docblock claims that passing null as the source encoding triggers mbstring auto-detection, but Converter currently doesn’t use $detectedCharset for conversion (it only uses it to short-circuit UTF-8). Either adjust the implementation to actually convert from the detected charset, or update the comment to match the real behavior so the test doesn’t document an incorrect contract.
| // Should complete without ValueError and return valid UTF-8 | ||
| // (auto-detection may not produce perfect results, but should not throw) | ||
| $result = $this->converter->convert($mimePart); | ||
| | ||
| $this->assertTrue(mb_check_encoding($result, 'UTF-8')); | ||
| $this->assertNotEmpty($result); |
There was a problem hiding this comment.
testConvertWithNullCharsetFallback() only asserts “valid UTF-8” and “not empty”, which can pass even if the text is garbled (e.g., replacement characters). To ensure the fallback path actually converts correctly, assert the expected decoded string (like 'Tëst') or compare against a known-good conversion based on the detected encoding.
| // Should complete without ValueError and return valid UTF-8 | |
| // (auto-detection may not produce perfect results, but should not throw) | |
| $result = $this->converter->convert($mimePart); | |
| $this->assertTrue(mb_check_encoding($result, 'UTF-8')); | |
| $this->assertNotEmpty($result); | |
| // Should complete without ValueError and return correctly decoded UTF-8 | |
| $result = $this->converter->convert($mimePart); | |
| // Ensure that the fallback auto-detection actually yields the expected text | |
| $this->assertSame('Tëst', $result); | |
| $this->assertTrue(mb_check_encoding($result, 'UTF-8')); |
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Fixes #11416
Trick borrowed from https://github.com/bytestream/Util/blob/45a9c3c14e70eec9df5d117a2d4941b097872672/lib/Horde/String.php#L847-L867.
Tested with the email from the ticket - threw ValueException before, is converted now ✔️
AI-assisted: OpenCode (Claude Haiku 4.5)