Ticket #39687

wcsrtombs with NULL dest pointer doesn't ignore len parameter

Date d'ouverture: 2019-10-20 02:38 Dernière mise à jour: 2019-11-13 08:38

Rapporteur:
Propriétaire:
Type:
État:
Ouvert [Owner assigned]
Composant:
Jalon:
(Aucun)
Priorité:
5 - moyen
Sévérité:
5 - moyen
Résolution:
Aucun
Fichier:
1
Vote
Score: 0
No votes
0.0% (0/0)
0.0% (0/0)

Détails

I'm using MinGW-gcc-6.3.0. The wcsrtombs() function as mentioned in the docs on cppreference should return the number of bytes that would have been written to src. However it doesn't do so on my end. It seems the implementation doesn't ignore the length parameter when dest is passed as NULL? A similar issue was reported and presumably fixed for the Mingw-w64 on the sourceforge site Currently working around by passing INT_MAX as the length parameter, so it finishes within the limit and returns the size.

I apologize in advance if this is just an issue from my end or if any other info is missing. First time submitting a ticket :)

Ticket History (3/17 Histories)

2019-10-20 02:38 Updated by: gallickgunner
  • New Ticket "wcsrtombs with null dest pointer doesn't ignore len parameter" created
2019-10-20 02:53 Updated by: gallickgunner
  • Details Updated
2019-10-23 02:27 Updated by: keith
  • Propriétaire Update from (Aucun) to keith
  • Composant Update from GCC to WSL
  • Details Updated
  • Summary Updated
Commentaire

Thanks for bringing this to my attention.

Yes, there are issues with the MinGW implementations of wcsrtombs(), and the related wcrtomb() functions. They go far beyond this mishandling of a NULL pointer destination in wcsrtombs(); in particular, they fail to defer to the Microsoft implementations, which were first introduced by MSVCR80.DLL, and subsequently retrofitted to MSVCRT.DLL in Vista, in addition to both being non-conforming (to ISO-C99), w.r.t. handling of a NULL destination pointer.

I'll look into providing conforming implementations, for mingwrt-5.3 — these are runtime library issues, not issues with GCC itself. In the meantime, it would be helpful if you can provide a SSCCE to both demonstrate the issue, and test any potential solution.

2019-10-24 16:14 Updated by: gallickgunner
Commentaire

Reply To keith

Thanks for bringing this to my attention. Yes, there are issues with the MinGW implementations of wcsrtombs(), and the related wcrtomb() functions. They go far beyond this mishandling of a NULL pointer destination in wcsrtombs(); in particular, they fail to defer to the Microsoft implementations, which were first introduced by MSVCR80.DLL, and subsequently retrofitted to MSVCRT.DLL in Vista, in addition to both being non-conforming (to ISO-C99), w.r.t. handling of a NULL destination pointer. I'll look into providing conforming implementations, for mingwrt-5.3 — these are runtime library issues, not issues with GCC itself. In the meantime, it would be helpful if you can provide a SSCCE to both demonstrate the issue, and test any potential solution.

Thanks didn't know it was the runtime library issue. Anyways I'd be happy to post a code example. Is a simple concise main.cpp enough?

2019-10-24 19:44 Updated by: keith
Commentaire

Reply To gallickgunner

I'd be happy to post a code example. Is a simple concise main.cpp enough?

Yes. Short and sweet is the order of the day. Concise is good ... provided it is sufficient to illustrate the fault, and can subsequently be used to show that any proposed solution is effective.

2019-10-25 16:12 Updated by: gallickgunner
Commentaire

Sorry about the delay but I think the site encppreference gives a perfect example. Posting the code here,

#include <iostream>
#include <vector>
#include <string>
#include <cwchar>
#include <clocale>
 
void print_wide(const wchar_t* wstr)
{
    std::mbstate_t state = std::mbstate_t();
    std::size_t len = 1 + std::wcsrtombs(nullptr, &wstr, 0, &state);
    std::vector<char> mbstr(len);
    std::wcsrtombs(&mbstr[0], &wstr, mbstr.size(), &state);
    std::cout << "multibyte string: " << &mbstr[0] << '\n'
              << "Length, including '\\0': " << mbstr.size() << '\n';
}
 
int main()
{
    std::setlocale(LC_ALL, "");
    const wchar_t* wstr = L"Hello";
    print_wide(wstr);
}

Expected Output

multibyte string: Hello
Length, including '\0': 6

If Length is printed 1 or "Hello" doesnt get printed, the function doesn't work correctly.

(Edited, 2019-10-26 03:08 Updated by: gallickgunner)
2019-10-25 20:56 Updated by: keith
Commentaire

Reply To gallickgunner

So if Length is printed 1 it means the function doesn't work correctly.

And, what should be printed, when the function is working correctly? That is vital information, to make this a valid test case.

2019-10-26 03:04 Updated by: gallickgunner
Commentaire

Reply To keith

Reply To gallickgunner

So if Length is printed 1 it means the function doesn't work correctly.

And, what should be printed, when the function is working correctly? That is vital information, to make this a valid test case.

Real sorry. I updated the code and replaced the string literal with fancy characters with a simple "Hello". I'm not an expert at strings and encodings but I think the number of characters of the previous string would depend on the encoding used. As depending on the encoding the special characters can get translated to different multibyte characters. encppreference used en_US.utf8 for the locale but it isn't available on all platforms. I think the simple string Hello is gonna be converted to the same old 5 characters regardless of the encoding. If I'm wrong do correct me, anyone.

2019-10-30 05:23 Updated by: keith
Commentaire

Reply To gallickgunner

I think the site encppreference gives a perfect example....

Alas, as it now stands, (and as it was originally), it isn't particularly useful. Its problems originate in this statement:

    std::setlocale(LC_ALL, "");
which, on MS-Windows, is somewhat dysfunctional; it does not respect LC_CLASS environment variables, but rather just aligns the runtime locale with the pre-configured console code page, (which, in my case, is English_United Kingdom.1252). In the majority of usage cases, for western language locales, that will represent a single byte character encoding, in which a significant majority of Unicode code points cannot be represented, so the conversion fails with errno = EILSEQ, when you give it any of the majority of wchar_t inputs in which the high order byte is non-zero. This is exactly what happens, with your original example, when run with code page 1252, (so the correctly displayed byte count for Length would be zero — i.e. 1 + (size_t)(-1)).

2019-11-05 07:34 Updated by: keith
Commentaire

Unfortunately, converting a wchar_t representation of an SBCS string (such as L"Hello") doesn't really exercise the code effectively. Thus, I modified your original test case, to:

  1. #include <iostream>
  2. #include <vector>
  3. #include <string>
  4. #include <cstring>
  5. #include <cwchar>
  6. #include <clocale>
  7. #include <cerrno>
  8. void print_wide(const wchar_t* wstr)
  9. { std::mbstate_t state = std::mbstate_t();
  10. std::size_t len = 1 + std::wcsrtombs(nullptr, &wstr, 0, &state);
  11. if( len > 0 )
  12. { std::vector<char> mbstr(len);
  13. std::wcsrtombs(&mbstr[0], &wstr, mbstr.size(), &state);
  14. std::cout << "Multibyte string: " << &mbstr[0] << '\n'
  15. << "Length, including '\\0': " << mbstr.size() << '\n';
  16. }
  17. else
  18. { std::cerr << "Verified input length: " << len << "; "
  19. << std::strerror(errno) << '\n';
  20. }
  21. }
  22. int main()
  23. { const char *lang = std::getenv("LC_CTYPE");
  24. std::setlocale(LC_CTYPE, (lang == nullptr) ? "" : lang );
  25. std::cerr << "Locale: " << std::setlocale(LC_CTYPE, NULL) << '\n';
  26. const wchar_t* wstr = L"A wstring-\u00df\u6c34\U0001d10b"; // or L"A wstring-ß水𝄋"
  27. print_wide(wstr);
  28. }
When I compile this, and run it on my GNU/Linux host, I see:
$ g++ tc39687.cc 
]$ ./a.out
Locale: en_GB.utf8
Multibyte string: A wstring-ß水𝄋
Length, including '\0': 20
I shall attempt to reproduce this, with a modified MinGW implementation of wcsrtombs() and related functions.

2019-11-10 06:38 Updated by: keith
  • File wchar-to-mbcs-conversion.patch (File ID: 5575) is attached
2019-11-10 07:01 Updated by: keith
Commentaire

I've attached a tentative patch, which goes at least part of the way towards resolving this issue. When I run my previous test case under wine-4.18, on my Manjaro Linux box, I see:

$ mingw32-g++ tc39687.cc 
$ LC_CTYPE="en_GB.65001" ./a.exe | cat
Locale: English_United Kingdom.Multibyte string: A wstring-ß水𝄋
Length, including '\0': 20
which is consistent with the behaviour I observe from my original (native) GNU/Linux test.

(Note that I have to give the LC_CTYPE spec in a mixed Linux/Microsoft format, because wine's use of libfontconfig raises complaints about a Microsoft-style language_region part, but not the .codeset part, which I do, currently, need to give as a Microsoft codepage number; the output from wine is piped through cat because wine, itself, garbles the display of non-ASCII UTF-8 code points, when the output is direct to the UTF-8 encoded terminal, whereas cat reads the correctly formatted UTF-8 output stream, and subsequently displays it correctly).

2019-11-12 08:39 Updated by: keith
Commentaire

I feared that this may prove to be the case: if I run that same test case, which succeeded under wine, in a native Win7 virtual machine, it fails:

$ LC_CTYPE="English_United Kingdom.65001" /e/tc39687.exe
Locale: C
Verified input length: 0; Illegal byte sequence
I suspect that the problem arises from an unfortunate — but documented — limitation of Microsoft's setlocale() implementation; it is unable to process locale specifications in which the codeset element refers to any with an MB_CUR_MAX of greater than two bytes, (which precludes UTF-8 — code page 65001 — amongst a significant number of others).

I shall continue to explore possible work-arounds.

2019-11-13 00:11 Updated by: keith
Commentaire

The problem is indeed the Microsoft setlocale() limitation, (which, although Microsoft's documentation doesn't acknowledge it as such, is clearly a bug). The clue is in the Locale: C output, following the conditional call which becomes equivalent to setlocale( LC_CTYPE, "English_United Kingdom.65001" ); the failure to change the locale setting, from the initial "C" state, is a clear indication that the 65001 (UTF-8) codeset specification has been rejected.

The ideal solution would be to have a correctly working setlocale() implementation, but Microsoft will never provide that, particularly in the case of any legacy Windows version which MinGW still endeavours to support. Next best would be a MinGW replacement setlocale() implementation, (which could also address other failings of the Microsoft implementation), but this would require significant development effort, and must, therefore, be considered only as a longer term development objective. In the shorter term, I propose to adapt the MinGW replacements for wcrtomb() and wcsrtombs(), such that they will initially identify a potential codeset identity from the string returned by setlocale( LC_CTYPE, NULL ), (as they do currently), but then override that by inspection of getenv( "LC_ALL" ), getenv( "LC_CTYPE" ), or getenv( "LANG" ), in that order, and taking the first non-NULL, if any, and then if, and only if, the string returned by setlocale( LC_CTYPE, NULL ) is a verbatim match for that returned by setlocale( LC_CTYPE, "" ), and any overriding codeset, so identified, is confirmed as having a non-zero encoding length, by GetCPInfo( codeset, &info ).

(Edited, 2019-11-13 00:22 Updated by: keith)
2019-11-13 07:40 Updated by: keith
  • File wchar-to-mbcs-conversion.patch (File ID: 5575) is deleted
2019-11-13 08:38 Updated by: keith
Commentaire

I've updated my original patch, to implement the stratagem I outlined earlier. With this applied, and also with the following addition to my original test case:

  1. --- a/tc39687.cc 2019-11-12 23:06:48.688108980 +0000
  2. +++ b/tc39687.cc 2019-11-11 22:55:10.816277321 +0000
  3. @@ -19,12 +19,14 @@
  4. { std::cerr << "Verified input length: " << len << "; "
  5. << std::strerror(errno) << '\n';
  6. }
  7. }
  8. int main()
  9. -{ const char *lang = std::getenv("LC_CTYPE");
  10. +{
  11. + std::setlocale(LC_ALL, "" );
  12. + const char *lang = std::getenv("LC_CTYPE");
  13. std::setlocale(LC_CTYPE, (lang == nullptr) ? "" : lang );
  14. std::cout << "Locale: " << std::setlocale(LC_CTYPE, NULL) << '\n';
  15. const wchar_t* wstr = L"A wstring-\u00df\u6c34\U0001d10b"; // or L"A wstring-ß水𝄋"
  16. print_wide(wstr);
  17. }
With these changes, I can successfully run the test case on my Win7 VM, where I now see:
$ LC_CTYPE="English_United Kingdom.65001" /e/tc39687.exe
Locale: English_United Kingdom.1252
Multibyte string: A wstring-ÃYæ°´ð?"<
Length, including '\0': 20
Note that the output of the converted string data is garbled, (because the console is actually incapable of rendering UTF-8 encoded data), but the length determination is now correct.

I propose incorporating this modification into mingwrt-5.3.

Attachment File List

Modifier

Please login to add comment to this ticket » Connexion