Skip to content

fix(imap): handle korean Windows encoding ks_c_5601-1987/ks_c_5601-1989#12654

Open
ChristophWurst wants to merge 8 commits intomainfrom
fix/imap/korean-windows-encoding
Open

fix(imap): handle korean Windows encoding ks_c_5601-1987/ks_c_5601-1989#12654
ChristophWurst wants to merge 8 commits intomainfrom
fix/imap/korean-windows-encoding

Conversation

@ChristophWurst
Copy link
Member

Fixes #11416

Trick borrowed from https://github.com/bytestream/Util/blob/45a9c3c14e70eec9df5d117a2d4941b097872672/lib/Horde/String.php#L847-L867.

Tested with the email from the ticket - threw ValueException before, is converted now ✔️

AI-assisted: OpenCode (Claude Haiku 4.5)

AI-assisted: OpenCode (Claude Haiku 4.5) Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
@ChristophWurst
Copy link
Member Author

/backport to stable5.7

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Nextcloud Mail issue #11416 by improving IMAP charset handling for Outlook messages that label Korean content using non-mbstring charset names (ks_c_5601-1987 / ks_c_5601-1989), preventing conversion failures during message processing.

Changes:

  • Add a charset normalization step mapping ks_c_5601-1987 / ks_c_5601-1989 to UHC (CP949/Windows-949).
  • Apply the normalized charset during mb_convert_encoding() / iconv() conversion in the IMAP charset converter.
  • Add unit test coverage for both Outlook Korean charset labels.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
lib/IMAP/Charset/Converter.php Adds charset normalization and uses it during conversion to avoid failures on Outlook Korean charset aliases.
tests/Unit/IMAP/Charset/ConverterTest.php Adds unit tests validating conversion of ks_c_5601-1987 and ks_c_5601-1989 content to UTF-8.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@kesselb
Copy link
Contributor

kesselb commented Mar 25, 2026

Thanks for taking over; looks good!

When I looked at it a while back, I'd bumped into some blockers and didn't find the time to continue.

mb_list_encodings is "incomplete". To get a full list, one needs to also run mb_encoding_aliases for each encoding from the list.

The list also contains "BASE64", "UUENCODE", "HTML-ENTITIES", and "Quoted-Printable". Request the aliases for those; does trigger a deprecation warning 🙈

What I had in mind was similar to what also Josh started with #12221.

https://github.com/zbateson/mb-wrapper/blob/a282176d795b81aac38cadb31c9b29315a94011b/src/MbWrapper.php#L37

…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 69 to 73
if (in_array($normalizedCharset, mb_list_encodings(), true)) {
$converted = mb_convert_encoding($data, 'UTF-8', $normalizedCharset);
} else {
$converted = iconv($charset, 'UTF-8', $data);
$converted = iconv($normalizedCharset, 'UTF-8', $data);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iconv() can throw a ValueError for unknown/unsupported encodings, and this call isn’t guarded. A bad charset token can still make mailbox loading fail hard. Consider wrapping the conversion attempt in try/catch (\ValueError) and falling back to the auto-detection path (or throwing ServiceException) instead of letting the error bubble up.

Copilot uses AI. Check for mistakes.

$converted = @mb_convert_encoding($data, 'UTF-8', $charset);
$normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null;
$converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mb_convert_encoding() may throw ValueError (not return false) when the provided charset name is invalid. Since the code only checks for false, this path can still fatally error and bypass the intended iconv/ServiceException fallback. Please catch \ValueError around this call and handle it like a failed conversion.

Suggested change
$converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset);
try {
$converted = @mb_convert_encoding($data, 'UTF-8', $normalizedCharset);
} catch (\ValueError $e) {
$converted = false;
}
Copilot uses AI. Check for mistakes.
Comment on lines +68 to +71
$koreanKsc56011987MimePart->setCharset('ks_c_5601-1987');
$koreanText = '안녕하세요';
$koreanKsc56011987MimePart->setContents(mb_convert_encoding($koreanText, 'UHC', 'UTF-8'));
// Korean (Outlook) - ks_c_5601-1989 is also mapped to UHC
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Korean test fixture is generated with mb_convert_encoding(..., 'UHC', ...). If the mbstring build in CI/runtime doesn’t include the UHC encoding, this will fail before Converter::convert() is exercised (even though the converter could still work via iconv). Consider generating the bytes with iconv instead, or conditionally skipping when the encoding isn’t supported, to keep the unit test environment-independent.

Copilot uses AI. Check for mistakes.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +75 to 79
}
} catch (ValueError) {
// Invalid charset name, fall through to auto-detection
$converted = null;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After catching ValueError for an invalid charset name, the code comment says it should “fall through to auto-detection”, but later $normalizedCharset is recomputed from the original (still-invalid) $charset and passed to mb_convert_encoding()/iconv() again. This prevents the intended autodetect fallback and guarantees a ServiceException for invalid charset tokens even when mbstring could detect/convert the bytes. Consider treating an invalid/empty charset as null (or setting a flag) so the later conversion uses autodetection instead of retrying the invalid name.

Copilot uses AI. Check for mistakes.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
@ChristophWurst ChristophWurst requested a review from Copilot March 26, 2026 07:31
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 63 to 64
$iso2022jpMimePart_noCharset = new Horde_Mime_Part();
$iso2022jpMimePart_noCharset->setContents('外せ園査リツハワ題');
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$iso2022jpMimePart_noCharset is created but never used in the returned data provider cases, which makes the fixture setup misleading and harder to maintain. Either add an assertion case for it or remove the unused mime part setup.

Suggested change
$iso2022jpMimePart_noCharset = new Horde_Mime_Part();
$iso2022jpMimePart_noCharset->setContents('外せ園査リツハワ題');
Copilot uses AI. Check for mistakes.
Comment on lines +137 to +147
public function testConvertWithInvalidCharsetThrowsServiceException(): void {
$mimePart = $this->createMock(Horde_Mime_Part::class);
// Use content that is NOT valid UTF-8 (raw ISO-8859-1 bytes)
$mimePart->method('getContents')
->willReturn(mb_convert_encoding('Tëst with spëcial chärs', 'ISO-8859-1', 'UTF-8'));
$mimePart->method('getCharset')
->willReturn('INVALID-CHARSET-NAME-12345');

$this->expectException(\OCA\Mail\Exception\ServiceException::class);

$this->converter->convert($mimePart);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testConvertWithInvalidCharsetThrowsServiceException assumes conversion must fail after an invalid charset name, but Converter::convert() deliberately falls back to mbstring auto-detection when it catches the ValueError. Depending on mbstring.detect_order, the ISO-8859-1 payload here may be successfully auto-detected/converted, making this test environment-dependent/flaky. Consider pinning mb_detect_order() within the test (and restoring it after), or assert a behavior that is deterministic (e.g., that no ValueError bubbles up), rather than requiring a ServiceException.

Copilot uses AI. Check for mistakes.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

$converted = @mb_convert_encoding($data, 'UTF-8', $charset);
$normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null;
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$detectedCharset is computed above, but the conversion path below ignores it. When $charset is null/invalid, $normalizedCharset becomes null and mb_convert_encoding() is called without a concrete source encoding, so the earlier detection has no effect beyond the UTF-8 fast-path. Consider using $detectedCharset (when it’s not false) as the from-encoding for mb_convert_encoding/iconv, and only falling back/throwing when detection fails.

Suggested change
$normalizedCharset = $charset !== null ? $this->normalizeCharset($charset) : null;
// Prefer the explicit charset, but fall back to the detected charset if available
if ($charset !== null) {
$normalizedCharset = $this->normalizeCharset($charset);
} elseif ($detectedCharset !== false) {
$normalizedCharset = $this->normalizeCharset($detectedCharset);
} else {
$normalizedCharset = null;
}
Copilot uses AI. Check for mistakes.
Comment on lines +109 to +112
* This tests the code path where $charset is null and we reach the
* fallback mb_convert_encoding() call. With null, mbstring should
* auto-detect the encoding rather than receiving an empty string
* (which would cause a ValueError).
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docblock claims that passing null as the source encoding triggers mbstring auto-detection, but Converter currently doesn’t use $detectedCharset for conversion (it only uses it to short-circuit UTF-8). Either adjust the implementation to actually convert from the detected charset, or update the comment to match the real behavior so the test doesn’t document an incorrect contract.

Copilot uses AI. Check for mistakes.
Comment on lines +122 to +127
// Should complete without ValueError and return valid UTF-8
// (auto-detection may not produce perfect results, but should not throw)
$result = $this->converter->convert($mimePart);

$this->assertTrue(mb_check_encoding($result, 'UTF-8'));
$this->assertNotEmpty($result);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testConvertWithNullCharsetFallback() only asserts “valid UTF-8” and “not empty”, which can pass even if the text is garbled (e.g., replacement characters). To ensure the fallback path actually converts correctly, assert the expected decoded string (like 'Tëst') or compare against a known-good conversion based on the detected encoding.

Suggested change
// Should complete without ValueError and return valid UTF-8
// (auto-detection may not produce perfect results, but should not throw)
$result = $this->converter->convert($mimePart);
$this->assertTrue(mb_check_encoding($result, 'UTF-8'));
$this->assertNotEmpty($result);
// Should complete without ValueError and return correctly decoded UTF-8
$result = $this->converter->convert($mimePart);
// Ensure that the fallback auto-detection actually yields the expected text
$this->assertSame('Tëst', $result);
$this->assertTrue(mb_check_encoding($result, 'UTF-8'));
Copilot uses AI. Check for mistakes.
…5601-1989 Signed-off-by: Christoph Wurst <1374172+ChristophWurst@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment