Skip to content

Conversation

@vonosmas
Copy link
Contributor

  • Removes the copy-pasta implementation of wcstointeger,
    and migrate the wcsto* family of functions to use a template
    version of strtointeger.
  • Fixes the out-of-bound read in the original implementation(s)
    when the entire input string consists of whitespaces
    (then the sign check can access OOB memory)

The code is currently slightly peppered with "if constexpr" statements
to distinguish between char and wchar_t. We can probably
simplify it in subsequent changes by:

  • using overrides, so that internal::isalnum() is overriden for
    both char and wchar_t (since C++ luckily allows us to reuse names).
  • this wouldn't help for direct comparison with literals -
    for this as a somewhat ugly workaround like is_char_literal(c, '0', L'0')
* Removes the copy-pasta implementation of wcstointeger, and migrate the wcsto* family of functions to use a template version of strtointeger. * Fixes the out-of-bound read in the original implementation(s) when the entire input string consists of whitespaces (then the sign check can access OOB memory) The code is currently slightly peppered with "if constexpr" statements to distinguish between char and wchar_t. We can probably simplify it in subsequent changes by: * using overrides, so that internal::isalnum() is overriden for both char and wchar_t (since C++ luckily allows us to reuse names). * this wouldn't help for direct comparison with literals - for this as a somewhat ugly workaround like is_char_literal(c, '0', L'0')
@llvmbot llvmbot added libc bazel "Peripheral" support tier build system: utils/bazel labels Oct 31, 2025
@llvmbot
Copy link
Member

llvmbot commented Oct 31, 2025

@llvm/pr-subscribers-libc

Author: Alexey Samsonov (vonosmas)

Changes
  • Removes the copy-pasta implementation of wcstointeger,
    and migrate the wcsto* family of functions to use a template
    version of strtointeger.
  • Fixes the out-of-bound read in the original implementation(s)
    when the entire input string consists of whitespaces
    (then the sign check can access OOB memory)

The code is currently slightly peppered with "if constexpr" statements
to distinguish between char and wchar_t. We can probably
simplify it in subsequent changes by:

  • using overrides, so that internal::isalnum() is overriden for
    both char and wchar_t (since C++ luckily allows us to reuse names).
  • this wouldn't help for direct comparison with literals -
    for this as a somewhat ugly workaround like is_char_literal(c, '0', L'0')

Patch is 30.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165884.diff

12 Files Affected:

  • (modified) libc/src/__support/CMakeLists.txt (-12)
  • (modified) libc/src/__support/str_to_integer.h (+71-29)
  • (removed) libc/src/__support/wcs_to_integer.h (-155)
  • (modified) libc/src/wchar/CMakeLists.txt (+4-4)
  • (modified) libc/src/wchar/wcstol.cpp (+2-2)
  • (modified) libc/src/wchar/wcstoll.cpp (+2-2)
  • (modified) libc/src/wchar/wcstoul.cpp (+2-2)
  • (modified) libc/src/wchar/wcstoull.cpp (+2-2)
  • (modified) libc/test/src/__support/CMakeLists.txt (+1-1)
  • (modified) libc/test/src/__support/str_to_integer_test.cpp (+4-2)
  • (modified) libc/test/src/__support/wcs_to_integer_test.cpp (+52-50)
  • (modified) utils/bazel/llvm-project-overlay/libc/BUILD.bazel (+1)
diff --git a/libc/src/__support/CMakeLists.txt b/libc/src/__support/CMakeLists.txt index 0ef09a9b8c9d0..b7af751ec3f27 100644 --- a/libc/src/__support/CMakeLists.txt +++ b/libc/src/__support/CMakeLists.txt @@ -179,19 +179,7 @@ add_header_library( DEPENDS .ctype_utils .str_to_num_result - libc.hdr.errno_macros - libc.src.__support.CPP.limits - libc.src.__support.CPP.type_traits - libc.src.__support.common -) - -add_header_library( - wcs_to_integer - HDRS - wcs_to_integer.h - DEPENDS .wctype_utils - .str_to_num_result libc.hdr.errno_macros libc.src.__support.CPP.limits libc.src.__support.CPP.type_traits diff --git a/libc/src/__support/str_to_integer.h b/libc/src/__support/str_to_integer.h index d332c929f2c31..0f87d7a8c8fbf 100644 --- a/libc/src/__support/str_to_integer.h +++ b/libc/src/__support/str_to_integer.h @@ -25,36 +25,70 @@ #include "src/__support/macros/config.h" #include "src/__support/str_to_num_result.h" #include "src/__support/uint128.h" +#include "src/__support/wctype_utils.h" namespace LIBC_NAMESPACE_DECL { namespace internal { // Returns the idx to the first character in src that is not a whitespace -// character (as determined by isspace()) +// character (as determined by isspace() / iswspace()) +template <typename CharType> LIBC_INLINE size_t -first_non_whitespace(const char *__restrict src, +first_non_whitespace(const CharType *__restrict src, size_t src_len = cpp::numeric_limits<size_t>::max()) { size_t src_cur = 0; - while (src_cur < src_len && internal::isspace(src[src_cur])) { + while (src_cur < src_len) { + if constexpr (cpp::is_same_v<CharType, char>) { + if (!internal::isspace(src[src_cur])) break; + } else { + if (!internal::iswspace(src[src_cur])) break; + } ++src_cur; } return src_cur; } +// Returns 1 if 'src' starts with + or - sign, and returns 0 otherwise. +// Writes the sign value to |is_positive|. +template <typename CharType> +LIBC_INLINE static size_t +consume_sign(const CharType *__restrict src, bool *is_positive) { + *is_positive = true; + if constexpr (cpp::is_same_v<CharType, char>) { + if (*src == '+' || *src == '-') { + *is_positive = (*src == '+'); + return 1; + } + } else { + if (*src == L'+' || *src == L'-') { + *is_positive = (*src == L'+'); + return 1; + } + } + return 0; +} + // checks if the next 3 characters of the string pointer are the start of a // hexadecimal number. Does not advance the string pointer. -LIBC_INLINE bool -is_hex_start(const char *__restrict src, - size_t src_len = cpp::numeric_limits<size_t>::max()) { +template<typename CharType> +LIBC_INLINE static bool +is_hex_start(const CharType *__restrict src, size_t src_len) { if (src_len < 3) return false; - return *src == '0' && tolower(*(src + 1)) == 'x' && isalnum(*(src + 2)) && - b36_char_to_int(*(src + 2)) < 16; + if constexpr (cpp::is_same_v<CharType, char>) { + return src[0] == '0' && tolower(src[1]) == 'x' && isalnum(src[2]) && + b36_char_to_int(src[2]) < 16; + } else { + return src[0] == L'0' && towlower(src[1]) == L'x' && iswalnum(src[2]) && + b36_wchar_to_int(src[2]) < 16; + } } // Takes the address of the string pointer and parses the base from the start of // it. -LIBC_INLINE int infer_base(const char *__restrict src, size_t src_len) { +template<typename CharType> +LIBC_INLINE static int infer_base(const CharType *__restrict src, + size_t src_len) { // A hexadecimal number is defined as "the prefix 0x or 0X followed by a // sequence of the decimal digits and the letters a (or A) through f (or F) // with values 10 through 15 respectively." (C standard 6.4.4.1) @@ -63,8 +97,13 @@ LIBC_INLINE int infer_base(const char *__restrict src, size_t src_len) { // An octal number is defined as "the prefix 0 optionally followed by a // sequence of the digits 0 through 7 only" (C standard 6.4.4.1) and so any // number that starts with 0, including just 0, is an octal number. - if (src_len > 0 && src[0] == '0') - return 8; + if (src_len > 0) { + if constexpr (cpp::is_same_v<CharType, char>) { + if (src[0] == '0') return 8; + } else { + if (src[0] == L'0') return 8; + } + } // A decimal number is defined as beginning "with a nonzero digit and // consist[ing] of a sequence of decimal digits." (C standard 6.4.4.1) return 10; @@ -77,32 +116,26 @@ LIBC_INLINE int infer_base(const char *__restrict src, size_t src_len) { // ----------------------------------------------------------------------------- // Takes a pointer to a string and the base to convert to. This function is used // as the backend for all of the string to int functions. -template <class T> +template <typename T, typename CharType> LIBC_INLINE StrToNumResult<T> -strtointeger(const char *__restrict src, int base, +strtointeger(const CharType *__restrict src, int base, const size_t src_len = cpp::numeric_limits<size_t>::max()) { using ResultType = make_integral_or_big_int_unsigned_t<T>; - ResultType result = 0; - - bool is_number = false; - size_t src_cur = 0; - int error_val = 0; - if (src_len == 0) return {0, 0, 0}; if (base < 0 || base == 1 || base > 36) return {0, 0, EINVAL}; - src_cur = first_non_whitespace(src, src_len); - - char result_sign = '+'; - if (src[src_cur] == '+' || src[src_cur] == '-') { - result_sign = src[src_cur]; - ++src_cur; + size_t src_cur = first_non_whitespace(src, src_len); + if (src_cur == src_len) { + return {0, 0, 0}; } + bool is_positive; + src_cur += consume_sign(src + src_cur, &is_positive); + if (base == 0) base = infer_base(src + src_cur, src_len - src_cur); @@ -110,8 +143,6 @@ strtointeger(const char *__restrict src, int base, src_cur = src_cur + 2; constexpr bool IS_UNSIGNED = cpp::is_unsigned_v<T>; - const bool is_positive = (result_sign == '+'); - ResultType constexpr NEGATIVE_MAX = !IS_UNSIGNED ? static_cast<ResultType>(cpp::numeric_limits<T>::max()) + 1 : cpp::numeric_limits<T>::max(); @@ -120,8 +151,19 @@ strtointeger(const char *__restrict src, int base, ResultType const abs_max_div_by_base = abs_max / static_cast<ResultType>(base); - while (src_cur < src_len && isalnum(src[src_cur])) { - int cur_digit = b36_char_to_int(src[src_cur]); + bool is_number = false; + int error_val = 0; + ResultType result = 0;  + while (src_cur < src_len) { + int cur_digit; + if constexpr (cpp::is_same_v<CharType, char>) { + if (!isalnum(src[src_cur])) break; + cur_digit = b36_char_to_int(src[src_cur]); + } else { + if (!iswalnum(src[src_cur])) break; + cur_digit = b36_wchar_to_int(src[src_cur]); + } + if (cur_digit >= base) break; diff --git a/libc/src/__support/wcs_to_integer.h b/libc/src/__support/wcs_to_integer.h deleted file mode 100644 index 4254bd860f77a..0000000000000 --- a/libc/src/__support/wcs_to_integer.h +++ /dev/null @@ -1,155 +0,0 @@ -//===-- Widechar string to integer conversion utils -------------*- C++ -*-===// -// -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -//===----------------------------------------------------------------------===// - -#ifndef LLVM_LIBC_SRC___SUPPORT_WCS_TO_INTEGER_H -#define LLVM_LIBC_SRC___SUPPORT_WCS_TO_INTEGER_H - -#include "hdr/errno_macros.h" // For ERANGE -#include "src/__support/CPP/limits.h" -#include "src/__support/CPP/type_traits.h" -#include "src/__support/CPP/type_traits/make_unsigned.h" -#include "src/__support/big_int.h" -#include "src/__support/common.h" -#include "src/__support/macros/config.h" -#include "src/__support/str_to_num_result.h" -#include "src/__support/uint128.h" -#include "src/__support/wctype_utils.h" - -namespace LIBC_NAMESPACE_DECL { -namespace internal { - -// Returns the idx of the first character in src that is not a whitespace -// character (as determined by iswspace()) -LIBC_INLINE size_t -first_non_whitespace(const wchar_t *__restrict src, - size_t src_len = cpp::numeric_limits<size_t>::max()) { - size_t src_cur = 0; - while (src_cur < src_len && internal::iswspace(src[src_cur])) { - ++src_cur; - } - return src_cur; -} - -// checks if the next 3 characters of the string pointer are the start of a -// hexadecimal number. Does not advance the string pointer. -LIBC_INLINE bool -is_hex_start(const wchar_t *__restrict src, - size_t src_len = cpp::numeric_limits<size_t>::max()) { - if (src_len < 3) - return false; - return *src == L'0' && towlower(*(src + 1)) == L'x' && iswalnum(*(src + 2)) && - b36_wchar_to_int(*(src + 2)) < 16; -} - -// Takes the address of the string pointer and parses the base from the start of -// it. -LIBC_INLINE int infer_base(const wchar_t *__restrict src, size_t src_len) { - // A hexadecimal number is defined as "the prefix 0x or 0X followed by a - // sequence of the decimal digits and the letters a (or A) through f (or F) - // with values 10 through 15 respectively." (C standard 6.4.4.1) - if (is_hex_start(src, src_len)) - return 16; - // An octal number is defined as "the prefix 0 optionally followed by a - // sequence of the digits 0 through 7 only" (C standard 6.4.4.1) and so any - // number that starts with 0, including just 0, is an octal number. - if (src_len > 0 && src[0] == L'0') - return 8; - // A decimal number is defined as beginning "with a nonzero digit and - // consist[ing] of a sequence of decimal digits." (C standard 6.4.4.1) - return 10; -} - -template <class T> -LIBC_INLINE StrToNumResult<T> -wcstointeger(const wchar_t *__restrict src, int base, - const size_t src_len = cpp::numeric_limits<size_t>::max()) { - using ResultType = make_integral_or_big_int_unsigned_t<T>; - - ResultType result = 0; - - bool is_number = false; - size_t src_cur = 0; - int error_val = 0; - - if (src_len == 0) - return {0, 0, 0}; - - if (base < 0 || base == 1 || base > 36) - return {0, 0, EINVAL}; - - src_cur = first_non_whitespace(src, src_len); - - wchar_t result_sign = L'+'; - if (src[src_cur] == L'+' || src[src_cur] == L'-') { - result_sign = src[src_cur]; - ++src_cur; - } - - if (base == 0) - base = infer_base(src + src_cur, src_len - src_cur); - - if (base == 16 && is_hex_start(src + src_cur, src_len - src_cur)) - src_cur = src_cur + 2; - - constexpr bool IS_UNSIGNED = cpp::is_unsigned_v<T>; - const bool is_positive = (result_sign == L'+'); - - ResultType constexpr NEGATIVE_MAX = - !IS_UNSIGNED ? static_cast<ResultType>(cpp::numeric_limits<T>::max()) + 1 - : cpp::numeric_limits<T>::max(); - ResultType const abs_max = - (is_positive ? cpp::numeric_limits<T>::max() : NEGATIVE_MAX); - ResultType const abs_max_div_by_base = - abs_max / static_cast<ResultType>(base); - - while (src_cur < src_len && iswalnum(src[src_cur])) { - int cur_digit = b36_wchar_to_int(src[src_cur]); - if (cur_digit >= base) - break; - - is_number = true; - ++src_cur; - - // If the number has already hit the maximum value for the current type then - // the result cannot change, but we still need to advance src to the end of - // the number. - if (result == abs_max) { - error_val = ERANGE; - continue; - } - - if (result > abs_max_div_by_base) { - result = abs_max; - error_val = ERANGE; - } else { - result = result * static_cast<ResultType>(base); - } - if (result > abs_max - static_cast<ResultType>(cur_digit)) { - result = abs_max; - error_val = ERANGE; - } else { - result = result + static_cast<ResultType>(cur_digit); - } - } - - ptrdiff_t str_len = is_number ? static_cast<ptrdiff_t>(src_cur) : 0; - - if (error_val == ERANGE) { - if (is_positive || IS_UNSIGNED) - return {cpp::numeric_limits<T>::max(), str_len, error_val}; - else // T is signed and there is a negative overflow - return {cpp::numeric_limits<T>::min(), str_len, error_val}; - } - - return {static_cast<T>(is_positive ? result : -result), str_len, error_val}; -} - -} // namespace internal -} // namespace LIBC_NAMESPACE_DECL - -#endif // LLVM_LIBC_SRC___SUPPORT_WCS_TO_INTEGER_H diff --git a/libc/src/wchar/CMakeLists.txt b/libc/src/wchar/CMakeLists.txt index adde382bf0950..ba27cd77f6bac 100644 --- a/libc/src/wchar/CMakeLists.txt +++ b/libc/src/wchar/CMakeLists.txt @@ -63,7 +63,7 @@ add_entrypoint_object( wcstol.h DEPENDS libc.src.errno.errno - libc.src.__support.wcs_to_integer + libc.src.__support.str_to_integer ) add_entrypoint_object( @@ -74,7 +74,7 @@ add_entrypoint_object( wcstoll.h DEPENDS libc.src.errno.errno - libc.src.__support.wcs_to_integer + libc.src.__support.str_to_integer ) add_entrypoint_object( @@ -85,7 +85,7 @@ add_entrypoint_object( wcstoul.h DEPENDS libc.src.errno.errno - libc.src.__support.wcs_to_integer + libc.src.__support.str_to_integer ) add_entrypoint_object( @@ -96,7 +96,7 @@ add_entrypoint_object( wcstoull.h DEPENDS libc.src.errno.errno - libc.src.__support.wcs_to_integer + libc.src.__support.str_to_integer ) add_entrypoint_object( diff --git a/libc/src/wchar/wcstol.cpp b/libc/src/wchar/wcstol.cpp index a05718f706dfd..a56b5f91272cd 100644 --- a/libc/src/wchar/wcstol.cpp +++ b/libc/src/wchar/wcstol.cpp @@ -10,14 +10,14 @@ #include "src/__support/common.h" #include "src/__support/libc_errno.h" #include "src/__support/macros/config.h" -#include "src/__support/wcs_to_integer.h" +#include "src/__support/str_to_integer.h" namespace LIBC_NAMESPACE_DECL { LLVM_LIBC_FUNCTION(long, wcstol, (const wchar_t *__restrict str, wchar_t **__restrict str_end, int base)) { - auto result = internal::wcstointeger<long>(str, base); + auto result = internal::strtointeger<long>(str, base); if (result.has_error()) libc_errno = result.error; diff --git a/libc/src/wchar/wcstoll.cpp b/libc/src/wchar/wcstoll.cpp index de1299d681cdb..6229d24172b51 100644 --- a/libc/src/wchar/wcstoll.cpp +++ b/libc/src/wchar/wcstoll.cpp @@ -10,14 +10,14 @@ #include "src/__support/common.h" #include "src/__support/libc_errno.h" #include "src/__support/macros/config.h" -#include "src/__support/wcs_to_integer.h" +#include "src/__support/str_to_integer.h" namespace LIBC_NAMESPACE_DECL { LLVM_LIBC_FUNCTION(long long, wcstoll, (const wchar_t *__restrict str, wchar_t **__restrict str_end, int base)) { - auto result = internal::wcstointeger<long long>(str, base); + auto result = internal::strtointeger<long long>(str, base); if (result.has_error()) libc_errno = result.error; diff --git a/libc/src/wchar/wcstoul.cpp b/libc/src/wchar/wcstoul.cpp index 79b8c9b5c9fa3..c5639bee1d649 100644 --- a/libc/src/wchar/wcstoul.cpp +++ b/libc/src/wchar/wcstoul.cpp @@ -10,14 +10,14 @@ #include "src/__support/common.h" #include "src/__support/libc_errno.h" #include "src/__support/macros/config.h" -#include "src/__support/wcs_to_integer.h" +#include "src/__support/str_to_integer.h" namespace LIBC_NAMESPACE_DECL { LLVM_LIBC_FUNCTION(unsigned long, wcstoul, (const wchar_t *__restrict str, wchar_t **__restrict str_end, int base)) { - auto result = internal::wcstointeger<unsigned long>(str, base); + auto result = internal::strtointeger<unsigned long>(str, base); if (result.has_error()) libc_errno = result.error; diff --git a/libc/src/wchar/wcstoull.cpp b/libc/src/wchar/wcstoull.cpp index 768e03c4bd189..2ab24e9b2b2a1 100644 --- a/libc/src/wchar/wcstoull.cpp +++ b/libc/src/wchar/wcstoull.cpp @@ -10,14 +10,14 @@ #include "src/__support/common.h" #include "src/__support/libc_errno.h" #include "src/__support/macros/config.h" -#include "src/__support/wcs_to_integer.h" +#include "src/__support/str_to_integer.h" namespace LIBC_NAMESPACE_DECL { LLVM_LIBC_FUNCTION(unsigned long long, wcstoull, (const wchar_t *__restrict str, wchar_t **__restrict str_end, int base)) { - auto result = internal::wcstointeger<unsigned long long>(str, base); + auto result = internal::strtointeger<unsigned long long>(str, base); if (result.has_error()) libc_errno = result.error; diff --git a/libc/test/src/__support/CMakeLists.txt b/libc/test/src/__support/CMakeLists.txt index a02514106a307..138866b4cc869 100644 --- a/libc/test/src/__support/CMakeLists.txt +++ b/libc/test/src/__support/CMakeLists.txt @@ -151,7 +151,7 @@ add_libc_test( wcs_to_integer_test.cpp DEPENDS libc.src.__support.integer_literals - libc.src.__support.wcs_to_integer + libc.src.__support.str_to_integer ) add_libc_test( diff --git a/libc/test/src/__support/str_to_integer_test.cpp b/libc/test/src/__support/str_to_integer_test.cpp index 1ec882b212b8a..e5ac1d6cbb7b3 100644 --- a/libc/test/src/__support/str_to_integer_test.cpp +++ b/libc/test/src/__support/str_to_integer_test.cpp @@ -49,12 +49,14 @@ TEST(LlvmLibcStrToIntegerTest, LeadingSpaces) { EXPECT_EQ(result.parsed_len, ptrdiff_t(7)); ASSERT_EQ(result.value, 12); - result = LIBC_NAMESPACE::internal::strtointeger<int>(" 12345", 10, 5); + // Use a non-null-terminated buffer to test for possible OOB access. + char buf[5] = {' ', ' ', ' ', ' ', ' '}; + result = LIBC_NAMESPACE::internal::strtointeger<int>(buf, 10, 5); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(0)); ASSERT_EQ(result.value, 0); - result = LIBC_NAMESPACE::internal::strtointeger<int>(" 12345", 10, 0); + result = LIBC_NAMESPACE::internal::strtointeger<int>(buf, 10, 0); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(0)); ASSERT_EQ(result.value, 0); diff --git a/libc/test/src/__support/wcs_to_integer_test.cpp b/libc/test/src/__support/wcs_to_integer_test.cpp index 4554968be67ce..38af778ca2440 100644 --- a/libc/test/src/__support/wcs_to_integer_test.cpp +++ b/libc/test/src/__support/wcs_to_integer_test.cpp @@ -6,7 +6,7 @@ // //===----------------------------------------------------------------------===// -#include "src/__support/wcs_to_integer.h" +#include "src/__support/str_to_integer.h" #include <stddef.h> #include "test/UnitTest/Test.h" @@ -14,224 +14,226 @@ // This file is for testing the src_len argument and other internal interface // features. Primary testing is done through the public interface. -TEST(LlvmLibcStrToIntegerTest, SimpleLength) { - auto result = LIBC_NAMESPACE::internal::wcstointeger<int>(L"12345", 10, 10); +TEST(LlvmLibcWcsToIntegerTest, SimpleLength) { + auto result = LIBC_NAMESPACE::internal::strtointeger<int>(L"12345", 10, 10); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(5)); ASSERT_EQ(result.value, 12345); - result = LIBC_NAMESPACE::internal::wcstointeger<int>(L"12345", 10, 2); + result = LIBC_NAMESPACE::internal::strtointeger<int>(L"12345", 10, 2); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(2)); ASSERT_EQ(result.value, 12); - result = LIBC_NAMESPACE::internal::wcstointeger<int>(L"12345", 10, 0); + result = LIBC_NAMESPACE::internal::strtointeger<int>(L"12345", 10, 0); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(0)); ASSERT_EQ(result.value, 0); } -TEST(LlvmLibcStrToIntegerTest, LeadingSpaces) { +TEST(LlvmLibcWcsToIntegerTest, LeadingSpaces) { auto result = - LIBC_NAMESPACE::internal::wcstointeger<int>(L" 12345", 10, 15); + LIBC_NAMESPACE::internal::strtointeger<int>(L" 12345", 10, 15); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(10)); ASSERT_EQ(result.value, 12345); - result = LIBC_NAMESPACE::internal::wcstointeger<int>(L" 12345", 10, 10); + result = LIBC_NAMESPACE::internal::strtointeger<int>(L" 12345", 10, 10); EXPECT_FALSE(result.has_error()); EXPECT_EQ(result.parsed_len, ptrdiff_t(10)); ASSERT_EQ(result.value, 12345); - result = LIBC_NAMESPACE::internal::wcstointeger<int>(L" 1234... [truncated] 
@github-actions
Copy link

github-actions bot commented Oct 31, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@michaelrj-google michaelrj-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

@vonosmas vonosmas enabled auto-merge (squash) October 31, 2025 18:53
@vonosmas vonosmas merged commit 315dfe5 into llvm:main Oct 31, 2025
19 of 20 checks passed
@vonosmas vonosmas deleted the libc-str-to-integer branch October 31, 2025 18:59
DEBADRIBASAK pushed a commit to DEBADRIBASAK/llvm-project that referenced this pull request Nov 3, 2025
* Removes the copy-pasta implementation of wcstointeger, and migrate the wcsto* family of functions to use a template version of strtointeger. * Fixes the out-of-bound read in the original implementation(s) when the entire input string consists of whitespaces (then the sign check can access OOB memory) The code is currently slightly peppered with "if constexpr" statements to distinguish between char and wchar_t. We can probably simplify it in subsequent changes by: * using overrides, so that internal::isalnum() is overriden for both char and wchar_t (since C++ luckily allows us to reuse names). * this wouldn't help for direct comparison with literals - for this as a somewhat ugly workaround like is_char_literal(c, '0', L'0')
ckoparkar pushed a commit to ckoparkar/llvm-project that referenced this pull request Nov 6, 2025
* Removes the copy-pasta implementation of wcstointeger, and migrate the wcsto* family of functions to use a template version of strtointeger. * Fixes the out-of-bound read in the original implementation(s) when the entire input string consists of whitespaces (then the sign check can access OOB memory) The code is currently slightly peppered with "if constexpr" statements to distinguish between char and wchar_t. We can probably simplify it in subsequent changes by: * using overrides, so that internal::isalnum() is overriden for both char and wchar_t (since C++ luckily allows us to reuse names). * this wouldn't help for direct comparison with literals - for this as a somewhat ugly workaround like is_char_literal(c, '0', L'0')
@ldionne
Copy link
Member

ldionne commented Dec 1, 2025

We started seeing failures in some of our downstream builds since this patch landed. The cause seems to be that libc/src/__support/str_to_integer.h is now including the src/__support/wctype_utils.h header, which does:

libc/src/__support/wctype_utils.h:43:44: error: unknown type name 'wint_t' 43 | LIBC_INLINE static constexpr bool iswlower(wint_t wch) { | ^ 

This is happening in a build where the underlying C library does not provide support for wide characters (which is represented in libc++ by #if _LIBCPP_HAS_WIDE_CHARACTERS). This was not noticed by our upstream CI since it does not actually build against a C library that doesn't provide the wchar-related declarations: it merely turns off the associated libc++ interfaces.

This actually brings up an interesting and important question. Since libc++ supports various "carve outs" like _LIBCPP_HAS_WIDE_CHARACTERS, _LIBCPP_HAS_LOCALIZATION, etc, and since libc++ now uses libc as an implementation detail for some of its functionality, there's kind of a requirement for libc to support the same carve outs. I don't know whether that's desirable and I think we should think about the overall problem to avoid issues like this in the future, since I don't really think it makes sense to ask libc to start supporting all the same carve outs that libc++ has (and in exactly the same way), and to add this coupling between both projects.

TBH, I don't really have a good suggestion to fix this breakage at the moment. Conditionalizing the include of src/__support/wctype_utils.h and using #if _LIBCPP_HAS_WIDE_CHARACTERS around the if constexpr (to avoid mentioning e.g. internal::iswspace which might not exist) would technically solve the problem, but I don't want to ask libc to do that since it's a poor way to address the issue.

@vonosmas
Copy link
Contributor Author

vonosmas commented Dec 1, 2025

We started seeing failures in some of our downstream builds since this patch landed. The cause seems to be that libc/src/__support/str_to_integer.h is now including the src/__support/wctype_utils.h header, which does:

libc/src/__support/wctype_utils.h:43:44: error: unknown type name 'wint_t' 43 | LIBC_INLINE static constexpr bool iswlower(wint_t wch) { | ^ 

First, about the specific issue: there are patches on top of it, namely e7f7973, which replaces majority of wint_t uses with wchar_t. The only two remaining uses are btowc/wctob, which can be removed from wctype_utils.h with relative ease. In any case, would you mind sharing the failures when built against a more recent version of llvm-libc, or against trunk? I think it might be different, because wchar_t is a fundamental type in C++?

This is happening in a build where the underlying C library does not provide support for wide characters (which is represented in libc++ by #if _LIBCPP_HAS_WIDE_CHARACTERS). This was not noticed by our upstream CI since it does not actually build against a C library that doesn't provide the wchar-related declarations: it merely turns off the associated libc++ interfaces.

This actually brings up an interesting and important question. Since libc++ supports various "carve outs" like _LIBCPP_HAS_WIDE_CHARACTERS, _LIBCPP_HAS_LOCALIZATION, etc, and since libc++ now uses libc as an implementation detail for some of its functionality, there's kind of a requirement for libc to support the same carve outs. I don't know whether that's desirable and I think we should think about the overall problem to avoid issues like this in the future, since I don't really think it makes sense to ask libc to start supporting all the same carve outs that libc++ has (and in exactly the same way), and to add this coupling between both projects.

TBH, I don't really have a good suggestion to fix this breakage at the moment. Conditionalizing the include of src/__support/wctype_utils.h and using #if _LIBCPP_HAS_WIDE_CHARACTERS around the if constexpr (to avoid mentioning e.g. internal::iswspace which might not exist) would technically solve the problem, but I don't want to ask libc to do that since it's a poor way to address the issue.

  • Is there any interest in using llvm-libc str-to-float parsing code for wide strings in libc++?
  • We can probably add a conversion layer that would transform libc++ - specific defines into llvm-libc specific defines, and assume that the users of libc/shared/ interface (libc/shared/str_to_integer.h) may configure what functionality they'll be getting (similar to how they can #define LIBC_NAMESPACE prior to including these headers).
@ldionne
Copy link
Member

ldionne commented Dec 1, 2025

We started seeing failures in some of our downstream builds since this patch landed. The cause seems to be that libc/src/__support/str_to_integer.h is now including the src/__support/wctype_utils.h header, which does:

libc/src/__support/wctype_utils.h:43:44: error: unknown type name 'wint_t' 43 | LIBC_INLINE static constexpr bool iswlower(wint_t wch) { | ^ 

First, about the specific issue: there are patches on top of it, namely e7f7973, which replaces majority of wint_t uses with wchar_t. The only two remaining uses are btowc/wctob, which can be removed from wctype_utils.h with relative ease. In any case, would you mind sharing the failures when built against a more recent version of llvm-libc, or against trunk? I think it might be different, because wchar_t is a fundamental type in C++?

I am currently building against a tree where the latest LLVM-libc commit is 80e4a3f, which includes e7f7973. This is the error:

In file included from /Users/ldionne/llvm/libcxx/src/include/from_chars_floating_point.h:25: In file included from /Users/ldionne/llvm/cmake/Modules/../../libc/shared/str_to_float.h:13: In file included from /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/str_to_float.h:29: In file included from /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/high_precision_decimal.h:22: In file included from /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/str_to_integer.h:28: /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/wctype_utils.h:593:38: error: unknown type name 'wint_t' 593 | LIBC_INLINE cpp::optional<int> wctob(wint_t c) { | ^ /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/wctype_utils.h:605:27: error: use of undeclared identifier 'wint_t' 605 | LIBC_INLINE cpp::optional<wint_t> btowc(int c) { | ^~~~~~ /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/wctype_utils.h:607:12: error: no viable conversion from returned value of type 'const nullopt_t' to function return type 'int' 607 | return cpp::nullopt; | ^~~~~~~~~~~~ /Users/ldionne/llvm/cmake/Modules/../../libc/src/__support/wctype_utils.h:608:22: error: unknown type name 'wint_t' 608 | return static_cast<wint_t>(c); | ^ 

This seems to validate that only btowc and wctob are the remaining issues. However, as I said, I'm don't really want to ask libc to work around this in a ad-hoc manner, instead I'm more interested in figuring out how we should handle this general problem in a way that work for both projects.

  • Is there any interest in using llvm-libc str-to-float parsing code for wide strings in libc++?

If I understand your question correctly, you're asking whether we are interested in using LLVM-libc for parsing wide character strings into floating points in libc++? The answer to that would be "no", simply because there is no standard C++ API that does that. std::from_chars (which we implement using LLVM libc) only parses from char const*, and there is nothing equivalent for parsing from wchar_t const* (unless you are thinking about something else that's not std::from_chars).

  • We can probably add a conversion layer that would transform libc++ - specific defines into llvm-libc specific defines, and assume that the users of libc/shared/ interface (libc/shared/str_to_integer.h) may configure what functionality they'll be getting (similar to how they can #define LIBC_NAMESPACE prior to including these headers).

Yes, that would definitely be an option, however that means both projects would have to agree when a new carve-out is added, when we remove a carve-out (which doesn't happen often due to backwards compatibility), etc. If llvm-libc is happy to do that, then I think that might be the easiest path forward from the libc++ side. But IDK if that's in the best interest of llvm-libc to do that?

Another alternative we could explore is build llvm-libc in standalone mode instead of overlay mode. I don't know the details of that, but if we can just let llvm-libc provide the definitions of e.g. wint_t that it needs for its own use, that could solve this problem.

vonosmas added a commit to vonosmas/llvm-project that referenced this pull request Dec 1, 2025
They are not used anywhere except for the btowc/wctob entrypoints, so just move the implementation there. Internal code should probably be using a safer mbrtowc variants anyway, if applicable. This allows us to remove the use of wint_t, which is problematic for some uses through `libc/shared/` when a host system doesn't have wide-character support (see PR llvm#165884 comments). There's no such problems with `wchar_t`, since it's a fundamental type in C++.
@ldionne
Copy link
Member

ldionne commented Dec 1, 2025

So, actually, it seems like the following solves my issues:

diff --git a/libc/src/__support/high_precision_decimal.h b/libc/src/__support/high_precision_decimal.h index 75f2a7607b42..ddeffb4198bd 100644 --- a/libc/src/__support/high_precision_decimal.h +++ b/libc/src/__support/high_precision_decimal.h @@ -20,7 +20,10 @@ #include "src/__support/ctype_utils.h" #include "src/__support/macros/config.h" #include "src/__support/str_to_integer.h" -#include "src/__support/wctype_utils.h" + +#ifndef _LIBC_HAS_NO_WIDE_CHARACTERS +# include "src/__support/wctype_utils.h" +#endif namespace LIBC_NAMESPACE_DECL { namespace internal { diff --git a/libc/src/__support/str_to_float.h b/libc/src/__support/str_to_float.h index 873a11378065..26aa4fa76b3a 100644 --- a/libc/src/__support/str_to_float.h +++ b/libc/src/__support/str_to_float.h @@ -33,7 +33,10 @@ #include "src/__support/str_to_integer.h" #include "src/__support/str_to_num_result.h" #include "src/__support/uint128.h" -#include "src/__support/wctype_utils.h" + +#ifndef _LIBC_HAS_NO_WIDE_CHARACTERS +# include "src/__support/wctype_utils.h" +#endif namespace LIBC_NAMESPACE_DECL { namespace internal { diff --git a/libc/src/__support/str_to_integer.h b/libc/src/__support/str_to_integer.h index 2df1ea894e53..6f7fc610bf09 100644 --- a/libc/src/__support/str_to_integer.h +++ b/libc/src/__support/str_to_integer.h @@ -25,7 +25,10 @@ #include "src/__support/macros/config.h" #include "src/__support/str_to_num_result.h" #include "src/__support/uint128.h" -#include "src/__support/wctype_utils.h" + +#ifndef _LIBC_HAS_NO_WIDE_CHARACTERS +# include "src/__support/wctype_utils.h" +#endif namespace LIBC_NAMESPACE_DECL { namespace internal { diff --git a/libcxx/src/include/from_chars_floating_point.h b/libcxx/src/include/from_chars_floating_point.h index 20f23d2bc267..7e962c352e7f 100644 --- a/libcxx/src/include/from_chars_floating_point.h +++ b/libcxx/src/include/from_chars_floating_point.h @@ -16,6 +16,10 @@ #include <concepts> #include <limits> +#if !_LIBCPP_HAS_WIDE_CHARACTERS +# define _LIBC_HAS_NO_WIDE_CHARACTERS +#endif + // Make sure we use libc++'s assertion machinery within the shared code we use // from LLVM libc. #define LIBC_ASSERT(cond) _LIBCPP_ASSERT((cond), _LIBCPP_TOSTRING(cond))

If we want to go down that route, we should probably formalize it a bit so we have a robust way to test this configuration and so we can have a clear "contract" between LLVM-libc and libc++.

@vonosmas
Copy link
Contributor Author

vonosmas commented Dec 1, 2025

LMK if #170200 fixes it as well by removing wint_t. It's not a principled fix like yours (since we really don't need to include any wctype_utils), but may unblock you, and is a positive change anyway.

I would defer to @michaelrj-google regarding the contract for _LIBC_HAS_NO_WIDE_CHARACTERS. I think it might be reasonable for libc/shared/ users to #define _LIBC_HAS_NO_WIDE_CHARACTERS and the likes before including libc/shared/ headers to get a subset of functionality they want (like you do in your diff).

IMO using full-build of llvm-libc (and thus depend on its own type definitions) might be dangerous, since a programmer would probably assume that when llvm-libc is used as-a-library it would use system headers and definitions for types and macro.

@ldionne
Copy link
Member

ldionne commented Dec 1, 2025

Yup, I commented on your PR, thanks. It does fix my issue.

vonosmas added a commit that referenced this pull request Dec 1, 2025
They are not used anywhere except for the btowc/wctob entrypoints, so just move the implementation there. Internal code should probably be using a safer mbrtowc variants anyway, if applicable. This allows us to remove the use of wint_t, which is problematic for some uses through `libc/shared/` when a host system doesn't have wide-character support (see PR #165884 comments). There's no such problems with `wchar_t`, since it's a fundamental type in C++.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bazel "Peripheral" support tier build system: utils/bazel libc

4 participants