[Mingw-users] RFC: mbrtowc() and wcrtomb() implementation problems

Keith Marshall keith****@users*****
Fri Feb 14 02:58:49 JST 2020


On 13/02/2020 14:42, Eli Zaretskii wrote, quoting me:
>>
>> [...snip...]
>>
>> 2) Representation of wchar_t as UTF-16LE entities makes it
>> impossible to have effective handling, in mbrtowc() and wcrtomb(),
>> for code points which lie off the BMP.  Such code points are
>> represented by surrogate pairs, in UTF-16, and there is no
>> standards conformant mechanism for passing such surrogate pairs
>> through the single wchar_t argument to either of these functions.
>>
>> I can address the first of these issues; the second is more
>> problematic.
>>
> I think you can give up on 2).  Wide-character support in the CRT
> routines is fundamentally broken on MS-Windows, due to the use of
> UTF-16, for any codepoint beyond the BMP.

I certainly agree with this ... with the benefit of hindsight, I think
we can all agree that Microsoft's decision to standardize on UTF-16 for
their wchar_t was myopic, in the extreme, and the blinkered insistence,
which prevailed throughout their documentation for years, that Unicode
could mean nothing other than UTF-16LE, did them no credit at all.
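
For anyone who wants to see the off-BMP problem concretely, here is a
minimal sketch of how a single code point becomes two 16-bit units in
UTF-16.  The helper name utf16_encode() is hypothetical, invented for
this illustration; it is not part of the CRT or of libmingwex.a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encode one Unicode code
 * point as UTF-16 code units.  Returns the number of units stored in
 * out[]: 1 for a BMP code point, 2 (a surrogate pair) for anything
 * beyond the BMP.
 */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
  /* Reject values outside Unicode, and the surrogate range itself. */
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  if (cp < 0x10000) {
    out[0] = (uint16_t)cp;
    return 1;
  }
  cp -= 0x10000;                               /* 20 bits remain   */
  out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate   */
  out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate    */
  return 2;
}
```

With a 16-bit wchar_t, the second case needs two wchar_t slots, which
is exactly what a single wchar_t argument cannot carry.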

> [...snip...]
> 
> I don't think this can be fixed as long as wchar_t remains a 16-bit
> data type.  People who need their MinGW programs to do better should
> either (a) convert everything to UTF-8 and write their own code to
> manipulate UTF-8 strings, or (b) use replacements such as Gnulib
> (which, quite expectedly, uses a 32-bit data type for wide
> characters).
> 
> So I think you should just document this as a Windows restriction, and
> move on.

That would be the easy cop-out, but the problem with doing so is that we
have had (fundamentally broken) implementations, in libmingwex.a, for
about fifteen years now, and I'm uncomfortable with providing broken
implementations.  Of course, I could reject bug reports against the
existing implementations, declaring them as no longer supported, and
flagging them as "won't fix".  (Withdrawing them altogether is hardly a
viable option, since it would break any legacy code which may have come
to rely on them ... not least of these being our own enhanced printf
implementation, which requires both mbrtowc() and wcrtomb()!)

It turns out that I can, quite easily, improve on the implementations of
those existing functions, within libmingwex.a, which are related to
wcrtomb(), and which convert from wchar_t to MBCS.  Conversion in the
opposite direction seems trickier; I won't give up on it just yet, but I
may need to document a limitation: calling any mbrtowc() related
function, with a wchar_t return value buffer sufficient for fewer than
two UTF-16 wchar_t entities, may be unsafe.
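
To illustrate why the wchar_t to MBCS direction is the more tractable
one, here is a rough sketch, under stated assumptions, and not the
libmingwex.a code.  It assumes a UTF-8 target encoding, and a
conversion state which can remember a high surrogate between calls.
The names sketch_state and sketch_wc_to_utf8() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a high surrogate seen on an earlier
 * call, or zero when nothing is pending.
 */
typedef struct { uint16_t pending_high; } sketch_state;

/* Hypothetical wcrtomb()-like helper: consume one UTF-16 code unit.
 * Returns the number of UTF-8 bytes stored in out[], 0 when a high
 * surrogate was buffered and no output is due yet, or (size_t)(-1)
 * for an ill-formed sequence.
 */
static size_t
sketch_wc_to_utf8(uint16_t wc, unsigned char out[4], sketch_state *st)
{
  uint32_t cp;
  assert(st != NULL);

  if (st->pending_high) {
    /* A high surrogate must be followed by a low surrogate. */
    if (wc < 0xDC00 || wc > 0xDFFF) {
      st->pending_high = 0;
      return (size_t)(-1);
    }
    cp = 0x10000 + ((uint32_t)(st->pending_high - 0xD800) << 10)
       + (uint32_t)(wc - 0xDC00);
    st->pending_high = 0;
  }
  else if (wc >= 0xD800 && wc <= 0xDBFF) {
    st->pending_high = wc;       /* wait for the second half */
    return 0;
  }
  else if (wc >= 0xDC00 && wc <= 0xDFFF)
    return (size_t)(-1);         /* stray low surrogate */
  else
    cp = wc;

  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}
```

In this direction each call takes one wchar_t in, and the state absorbs
the first half of a pair.  The opposite direction has no such outlet:
one multibyte sequence may have to yield two wchar_t values from a
single call.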

-- 
Regards,
Keith.

Public key available from keys.gnupg.net
Key fingerprint: C19E C018 1547 DE50 E1D4 8F53 C0AD 36C6 347E 5A3F
