Discussion:
[boost] [review] Review of Nowide (Unicode) starts today
Frédéric Bron via Boost
2017-06-12 04:20:44 UTC
Hi Everyone,

The formal review of Artyom Beilis' Nowide library starts today and
will last until Wed. 21st of June.

Your participation is encouraged, as the proposed library is
uncoupled, focused and rather small. Nowadays, everyone needs to
handle Unicode, but this is very difficult to do in a
platform-independent way using only the standard library. Nowide
offers a very simple way to handle Unicode the same way on
Windows/macOS/Linux.

Key features (a short usage sketch follows the list):

* Work with UTF-8 in your code; Nowide converts to the OS encoding
* Easy-to-use functions for converting UTF-8 to/from UTF-16
* A class that fixes the argc, argv and env main() parameters to use UTF-8
* UTF-8 aware functions:
- stdio.h functions (fopen, freopen, remove, rename)
- stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
- fstream (filebuf, fstream/ofstream/ifstream)
- iostream (cout, cerr, clog, cin)
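
To make the list above concrete, here is a minimal usage sketch. It follows
the documented API (boost::nowide::args plus the fstream/iostream
replacements); take the exact header names from the documentation linked
below rather than from this sketch.

#include <boost/nowide/args.hpp>     // fixes argc/argv/env to UTF-8 on Windows
#include <boost/nowide/fstream.hpp>  // UTF-8 aware file streams
#include <boost/nowide/iostream.hpp> // UTF-8 aware cout/cerr/cin
#include <string>

int main(int argc, char **argv, char **env)
{
    // On Windows this re-creates argv/env as UTF-8; elsewhere it is a no-op.
    boost::nowide::args a(argc, argv, env);

    if(argc < 2) {
        boost::nowide::cerr << "usage: " << argv[0] << " <file>\n";
        return 1;
    }

    // argv[1] may contain non-ASCII characters; on Windows it is converted
    // to UTF-16 internally before the file is opened.
    boost::nowide::ifstream in(argv[1]);
    if(!in) {
        boost::nowide::cerr << "failed to open " << argv[1] << "\n";
        return 1;
    }

    std::string line;
    while(std::getline(in, line))
        boost::nowide::cout << line << "\n";
    return 0;
}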

Documentation:
http://cppcms.com/files/nowide/html

GitHub:
https://github.com/artyom-beilis/nowide
git clone https://github.com/artyom-beilis/nowide.git

Latest tarballs:
- to be unzipped in boost source:
https://github.com/artyom-beilis/nowide/archive/master.zip
- as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip

Nowide has the standard Boost layout, so just put it into the Boost
source tree under the libs/nowide directory to build and run the
tests.
Alternatively, for the header-only part of the library, add the
nowide/include path to your project's include paths.

Please post your comments and review to the boost mailing list
(preferably), or privately to the Review Manager (to me ;-). Here are
some questions you might want to answer in your review:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any
problems?
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
- Are you knowledgeable about the problem domain?

And most importantly:
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.

For more information about Boost Formal Review Process, see:
http://www.boost.org/community/reviews.html

Thank you very much for your time and efforts.

Frédéric

Groke, Paul via Boost
2017-06-12 08:09:01 UTC
Your participation is encouraged, as the proposed library is uncoupled,
focused and rather small. Nowadays, everyone needs to handle Unicode but
this is very difficult using only the standard library in a platform independant
way. Nowide offers a very simple way to handle Unicode the same way on
Windows/MacOS/Linux.
* work with UTF-8 in your code, Nowide converts to OS encoding
* Easy to use functions for converting UTF-8 to/from UTF-16
* A class to fixing argc, argc and env main parameters to use UTF-8
- stdio.h functions (fopen, freopen, remove, rename)
- stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
- fstream (filebuf, fstream/ofstream/ifstream)
- iostream (cout, cerr, clog, cin)
Not a review, just a question (actually two): how does Nowide deal with modified UTF-8? I'm asking because if modified UTF-8 is fully supported (converting from/to, passing modified UTF-8 directly to the UTF-8 aware functions etc.), then Nowide could be valuable to us. Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)?

Thanks,
Paul Groke


Artyom Beilis via Boost
2017-06-12 08:15:58 UTC
Post by Groke, Paul via Boost
Not a review, just a question (actually two): how does Nowide deal with modified UTF-8?
No, modified UTF-8 is not supported; since it isn't UTF-8, it will be
considered an invalid encoding.
Post by Groke, Paul via Boost
Also: are any other platforms aside from Windows/OS X/Linux supported (e.g. AIX, z/OS)?
Actually, if you take a look at the docs, the support of POSIX-based
OSes (or any OS with a native narrow API)
is transparent.
Post by Groke, Paul via Boost
Thanks,
Paul Groke
Artyom Beilis

Yakov Galka via Boost
2017-06-12 09:03:15 UTC
On Mon, Jun 12, 2017 at 11:15 AM, Artyom Beilis via Boost <
Post by Artyom Beilis via Boost
No, modified utf-8 is not supported since it isn't utf-8 it will be
considered invalid encoding.
On a related note, does it support WTF-8? I.e. encoding lone UTF-16
surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is
needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16
data on Windows, and is not an invalid behavior per se, because all valid
UTF-16 sequences still map bijectively onto valid UTF-8 sequences.
--
Yakov Galka
http://stannum.co.il/
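
For readers unfamiliar with WTF-8: it simply applies the usual UTF-8 bit
layout to surrogate code points as well, so a lone surrogate gets a
well-defined three-byte form that strict UTF-8 decoders reject. A small
illustrative sketch (my own, not library code):

#include <cstdint>
#include <cstdio>
#include <string>

// Encode a single code point, including lone surrogates U+D800..U+DFFF,
// using the generalized UTF-8 bit layout that WTF-8 relies on.
std::string wtf8_encode(std::uint32_t cp)
{
    std::string out;
    if(cp < 0x80) {
        out += static_cast<char>(cp);
    } else if(cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if(cp < 0x10000) {  // the surrogate range falls in this branch
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main()
{
    // A lone high surrogate, e.g. taken from a malformed Windows file name.
    std::string bytes = wtf8_encode(0xD800);
    for(unsigned char c : bytes)
        std::printf("%02X ", c);  // prints: ED A0 80
    std::printf("\n");
}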

Artyom Beilis via Boost
2017-06-12 10:52:50 UTC
Post by Yakov Galka via Boost
On a related note, does it support WTF-8? I.e. encoding lone UTF-16
surrogates (malformed UTF-16 sequences) within the UTF-8 scheme. It is
needed to guarantee UTF-16 → UTF-8 → UTF-16 roundtrip of invalid UTF-16 data
on Windows, and is not an invalid behavior per se, because all valid UTF-16
sequences still map bijectively onto valid UTF-8 sequences.
--
Yakov Galka
http://stannum.co.il/
No, it does not.

I considered it before, but I think there is a security risk in creating
or accepting malformed UTF-8 or UTF-16.

Converting invalid UTF-16 to WTF-8 and the other way around is not obvious
behavior and carries a potential security risk, especially for users who
are not aware of such an issue. So invalid UTF-8/16 sequences are
rejected by design.


Artyom

Peter Dimov via Boost
2017-06-12 12:11:25 UTC
Post by Artyom Beilis via Boost
Converting invalid UTF-16 to WTF-8 and other way around is not obvious
behavior and has potential of security risk especially for users that are
not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
design.
Implying that there are Windows file names that can't be handled by the
library?


degski via Boost
2017-06-12 14:17:47 UTC
Post by Artyom Beilis via Boost
Converting invalid UTF-16 to WTF-8 and other way around is not obvious
Post by Artyom Beilis via Boost
behavior and has potential of security risk especially for users that are
not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
design.
Implying that there are Windows file names that can't be handled by the
library?
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
as UB?" How important is this point? I'm oblivious to this problem, and it
sounds like I would like to keep it that way.

degski
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798

Peter Dimov via Boost
2017-06-12 14:57:15 UTC
Post by degski via Boost
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
as UB?"
Of course not. Why would one need to use the library then? It defeats the
whole purpose of it.


degski via Boost
2017-06-13 06:20:23 UTC
Post by degski via Boost
Question: "Shouldn't the passing of invalid UTF-8/16 sequences be defined
Post by degski via Boost
as UB?"
Of course not. Why would one need to use the library then? It defeats the
whole purpose of it.
From WP (read up on it now): "RFC 3629 states "Implementations of the
decoding algorithm MUST protect against decoding invalid sequences."[13]
<https://en.wikipedia.org/wiki/UTF-8#cite_note-rfc3629-13> *The Unicode
Standard* requires decoders to "...treat any ill-formed code unit sequence
as an error condition. This guarantees that it will neither interpret nor
emit an ill-formed code unit sequence.""

So not UB then, but it should not pass either.

Are we talking FAT32 or NTFS? What Windows versions are affected? I also
think, as some posters below (and in another thread) state, that Windows
should not be treated differently. A new Boost library should not
accommodate Windows' bad/sloppy historic quirks. The library *can* require
that the systems it is used on, and their users, adhere to the standard.

Then WP on Overlong encodings: "The standard specifies that the correct
encoding of a code point use only the minimum number of bytes required to
hold the significant bits of the code point. Longer encodings are called
*overlong* and are not valid UTF-8 representations of the code point. This
rule maintains a one-to-one correspondence between code points and their
valid encodings, so that there is a unique valid encoding for each code
point."

The key being: "... are not valid UTF-8 representations ...", i.e. we're
back to the case above.

degski

WP: https://en.wikipedia.org/wiki/UTF-8
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798

Artyom Beilis via Boost
2017-06-12 14:58:32 UTC
On Mon, Jun 12, 2017 at 3:11 PM, Peter Dimov via Boost
Post by Artyom Beilis via Boost
Converting invalid UTF-16 to WTF-8 and other way around is not obvious
behavior and has potential of security risk especially for users that are
not aware of such an issue. So invalid UTF-8/16 sequences are rejected by
design.
Implying that there are Windows file names (i.e. names containing invalid
UTF-16) that can't be handled by the library?
Yes indeed: if you try to use something like FindFirstFileW and it returns
invalid UTF-16, you will get an error trying to convert it to UTF-8.

By definition, you can't handle file names that can't be represented
in UTF-8, as no valid UTF-8 representation exists.

Artyom

Vadim Zeitlin via Boost
2017-06-12 15:05:50 UTC
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost <***@lists.boost.org> wrote:

AB> By definition: you can't handle file names that can't be represented
AB> in UTF-8 as there is no valid UTF-8 representation exist.

This is a nice principle to have in theory, but very unfortunate in
practice because at least under Unix systems such file names do occur in
the wild (maybe less often now than 10 years ago, when UTF-8 was less
ubiquitous, but it's still hard to believe that the problem has completely
disappeared). And there are ways to solve it, e.g. I think glib represents
such file names using special characters from a PUA and there are other
possible approaches, even if, admittedly, none of them is perfect.

While it is reasonable to restrict the file names created by the program
to valid UTF-8, not allowing certain existing files to be opened at all is,
IMHO, not a good idea and probably has security implications as well.

Regards,
VZ
Artyom Beilis via Boost
2017-06-12 17:39:17 UTC
On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
Post by Vadim Zeitlin via Boost
AB> By definition: you can't handle file names that can't be represented
AB> in UTF-8 as there is no valid UTF-8 representation exist.
This is a nice principle to have in theory, but very unfortunate in
practice because at least under Unix systems such file names do occur in
the wild (maybe less often now than 10 years ago, when UTF-8 was less
ubiquitous, but it's still hard to believe that the problem has completely
disappeared). And there are ways to solve it, e.g. I think glib represents
such file names using special characters from a PUA and there are other
possible approaches, even if, admittedly, none of them is perfect.
Please note: under POSIX platforms no conversions are performed
and no UTF-8 validation is done, as doing so would be incorrect:

http://cppcms.com/files/nowide/html/index.html#qna

The only problematic case is when the Windows wide API returns/creates
invalid UTF-16 - which can happen only when invalid UTF-16 surrogate
pairs are generated - and those have no valid UTF-8
representation.

On the other hand, deliberately creating invalid UTF-8 is a very problematic idea.

Regards,
Artyom

Yakov Galka via Boost
2017-06-12 18:07:28 UTC
On Mon, Jun 12, 2017 at 8:39 PM, Artyom Beilis via Boost <
Post by Artyom Beilis via Boost
Please note: Under POSIX platforms no conversions are performed
http://cppcms.com/files/nowide/html/index.html#qna
The only case is when Windows Wide API returns/creates
invalid UTF-16 - which can happen only when invalid surrogate
UTF-16 pairs are generated - and they have no valid UTF-8
representation.
On the other hand creating deliberately invalid UTF-8 is very problematic idea.
That's not convincing. Why does Windows have to be different in this regard? In
both cases, Windows or POSIX, valid input gives valid UTF-8, and invalid
input can give 'invalid' UTF-8. I expect the library to be as transparent
a layer as possible and not to interfere with such issues.
--
Yakov Galka
http://stannum.co.il/

Peter Dimov via Boost
2017-06-12 18:30:57 UTC
Post by Yakov Galka via Boost
That's not convincing. Why Windows has to be different in this regard?
Because it is. The file name on Windows physically is a sequence of
uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any
null-terminated byte sequence results in a perfect roundtrip, as it's stored
unmodified, it's literally the exact same byte sequence as the one you gave.
If the library tried to mess with the file name in any way, this would
result in nothing but problems.

On Windows, it is simply not possible to have the exact same byte sequence
stored as the file name. The file system does not use byte sequences, it
uses uint16_t sequences.

(Mac OS X also stores uint16_t by the way, and the functions require UTF-8.
So it's not quite POSIX vs Windows.)


Yakov Galka via Boost
2017-06-12 18:36:23 UTC
On Mon, Jun 12, 2017 at 9:30 PM, Peter Dimov via Boost <
Post by Yakov Galka via Boost
That's not convincing. Why Windows has to be different in this regard?
Because it is. The file name on Windows physically is a sequence of
uint16_t, whereas on Linux, it's a sequence of uint8_t. On Linux, any
null-terminated byte sequence results in a perfect roundtrip, as it's
stored unmodified, it's literally the exact same byte sequence as the one
you gave. If the library tried to mess with the file name in any way, this
would result in nothing but problems.
On Windows, it is simply not possible to have the exact same byte sequence
stored as the file name. The file system does not use byte sequences, it
uses uint16_t sequences.
Yeah, so? I say that the library can provide a Windows -> std::string ->
Windows roundtrip just as it does with any other platform. If FreeBSD ->
std::string conversion can return invalid UTF-8, then so can the Windows ->
std::string conversion.
--
Yakov Galka
http://stannum.co.il/

Peter Dimov via Boost
2017-06-12 18:51:55 UTC
Post by Yakov Galka via Boost
Yeah, so? I say that the library can provide a Windows -> std::string ->
Windows roundtrip just as it does with any other platform. If FreeBSD ->
std::string conversion can return invalid UTF-8, then so does Windows ->
std::string conversion.
The security concern here is that under FreeBSD the file name is what it is,
and different byte sequences refer to different names, whereas under Windows
if invalid UTF-8 is allowed many different byte sequences may map to the
same file name.

This does not necessarily preclude handling unpaired surrogates though. In
practice the main problem is probably with overlong encodings of '.', '/' and
'\'.

Last time this came up I argued that if you rely on finding '.' as the
literal 8 bit '.' your input validation is wrong anyway, but requiring
strictly valid UTF-8 is a reasonable first line of defense.

And realistically, if you want to validate the input in the correct manner,
you have to #ifdef for each OS anyway, which kind of makes the library
redundant. So in the specific use case where you _do_ use the library to
avoid #ifdef'ing, it does make sense for it to protect you from invalid
UTF-8 on Windows.

With all that said, I don't quite see the concern with WTF-8. What's the
attack we're defending from by disallowing it?
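
To illustrate the overlong-encoding concern mentioned above: the classic
attack is smuggling path-special characters such as '.' (0x2E) or '/'
(0x2F) past a byte-level check via a non-shortest form. A hedged sketch of
why strict validation helps; the rejection rule shown is just the standard
shortest-form requirement for two-byte sequences, not Nowide's actual code:

#include <cstdio>
#include <string>

// A naive byte scan for ".." misses the overlong form "\xC0\xAE\xC0\xAE",
// which a lenient decoder would turn back into "..".
bool naive_has_dotdot(const std::string& path)
{
    return path.find("..") != std::string::npos;
}

// Lead bytes 0xC0 and 0xC1 can only start overlong two-byte sequences
// (they encode code points below 0x80), so they must always be rejected.
bool has_overlong_2byte(const std::string& path)
{
    for(unsigned char c : path)
        if(c == 0xC0 || c == 0xC1)
            return true;
    return false;
}

int main()
{
    std::string evil = "\xC0\xAE\xC0\xAE/passwd";  // overlong ".." followed by "/passwd"
    std::printf("naive check sees '..': %d\n", static_cast<int>(naive_has_dotdot(evil)));   // 0
    std::printf("overlong detected:     %d\n", static_cast<int>(has_overlong_2byte(evil))); // 1
}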


Yakov Galka via Boost
2017-06-12 19:02:29 UTC
On Mon, Jun 12, 2017 at 9:51 PM, Peter Dimov via Boost <
... whereas under Windows if invalid UTF-8 is allowed many different byte
sequences may map to the same file name.
This is a false presumption. Nobody here proposes allowing absolutely ANY
byte sequence, only using WTF-8 as a means of guaranteeing a round trip. And
as far as WTF-8 goes, there is a unique representation for every 16-bit
code-unit sequence.
With all that said, I don't quite see the concern with WTF-8. What's the
attack we're defending from by disallowing it?
There are some concerns with WTF-8, specifically if you concatenate two
WTF-8 strings where one ends in an unpaired surrogate whereas the other
begins with one, then the result is an invalid WTF-8 string. Filenames are
usually parsed and concatenated on ASCII separators, so I don't see a
problem in the typical use-case. As for the non-typical use cases, I would
argue that they are beyond the responsibility of this library.
--
Yakov Galka
http://stannum.co.il/

Groke, Paul via Boost
2017-06-12 18:14:32 UTC
Post by Artyom Beilis via Boost
On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
Post by Vadim Zeitlin via Boost
On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
AB> By definition: you can't handle file names that can't be
AB> represented in UTF-8 as there is no valid UTF-8 representation exist.
This is a nice principle to have in theory, but very unfortunate in
practice because at least under Unix systems such file names do occur
in the wild (maybe less often now than 10 years ago, when UTF-8 was
less ubiquitous, but it's still hard to believe that the problem has
completely disappeared). And there are ways to solve it, e.g. I think
glib represents such file names using special characters from a PUA
and there are other possible approaches, even if, admittedly, none of
them is perfect.
Please note: Under POSIX platforms no conversions are performed and no
http://cppcms.com/files/nowide/html/index.html#qna
Well... what's correct on POSIX platforms is a matter of opinion. If you go with the strict interpretation, then in fact conversion from the current locale to UTF-8 must be considered incorrect. But then you cannot rely on *anything*, except that 0x00 is NUL and 0x2F is the path separator. Which makes any kind of isdigit/toupper/tolower/... string parsing/processing "incorrect".
Post by Artyom Beilis via Boost
The only case is when Windows Wide API returns/creates invalid UTF-16 -
which can happen only when invalid surrogate
UTF-16 pairs are generated - and they have no valid UTF-8 representation.
On the other hand creating deliberately invalid UTF-8 is very problematic idea.
Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) will always be valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16 bit strings? If it's documented that way it shouldn't matter. Especially since users of the library cannot rely on the strings being in UTF-8 anyway, at least not in portable applications.

I agree that the over-long zero/NUL encoding part of modified UTF-8 might still be problematic though, and therefore WTF-8 might be the better choice. That still leaves some files that can theoretically exist on a Windows system inaccessible (i.e. those with embedded NUL characters), but those are not accessible via the "usual" Windows APIs either (CreateFileW etc.). So this should be acceptable.


Zach Laine via Boost
2017-06-12 18:35:23 UTC
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <
Post by Groke, Paul via Boost
Since the UTF-8 conversion is only done on/for Windows, and Windows
doesn't guarantee that all wchar_t paths (or strings in general) will
always be valid UTF-16, wouldn't it make more sense to just *define* that
the library always uses WTF-8, which allows round-tripping of all possible
16 bit strings? If it's documented that way it shouldn't matter. Especially
since users of the library cannot rely on the strings being in UTF-8
anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important.
I also agree that not detecting invalid UTF-8, or failing to produce an
error in such a case, is very important to *avoid*.

Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in
user-selectable string processing cases?

Zach

Groke, Paul via Boost
2017-06-12 19:06:20 UTC
Post by Zach Laine via Boost
On Mon, Jun 12, 2017 at 1:14 PM, Groke, Paul via Boost <
Post by Groke, Paul via Boost
Since the UTF-8 conversion is only done on/for Windows, and Windows
doesn't guarantee that all wchar_t paths (or strings in general) will
always be valid UTF-16, wouldn't it make more sense to just *define*
that the library always uses WTF-8, which allows round-tripping of all
possible
16 bit strings? If it's documented that way it shouldn't matter.
Especially since users of the library cannot rely on the strings being
in UTF-8 anyway, at least not in portable applications.
I agree that round-tripping to wchar_t paths on Windows is very important.
I also agree that not detecting invalid UTF-8, or failing to produce an error in
such a case, is very important to *avoid*.
Can we get both? Is it possible to add a WTF-8 mode, perhaps only used in
user-selectable string processing cases?
Well, I don't see why detecting invalid UTF-8 would be important. In my initial mail I (wrongly) assumed that the library would be translating between the native encoding and UTF-8 also on e.g. Linux (or non-Windows platforms in general). But since this isn't so, I guess the library can simply pass-through strings on all platforms that have narrow APIs. In fact I think it should, since checking for valid UTF-8 would make some files inaccessible on systems like Linux, where you can very easily create file names that aren't valid UTF-8.

In that case the terms "Unicode" and "UTF-8" should not be used in describing the library (name and documentation). The documentation should just say that it transforms strings to some unspecified encoding with the following properties:
- Byte-based
- ASCII compatible (*)
- self-synchronizing
- able to 100% round-trip all characters/NUL terminated strings from the native platform encoding
And for Windows this unspecified encoding then would just happen to be WTF-8.
(* For POSIX systems one cannot even 100% rely on that... so maybe using an even wider set of constraints would be good.)

On platforms like OS X, the API wrappers of Boost.Nowide would then simply produce the same result as the native APIs would -- because the narrow string would simply be passed through unmodified. If it's invalid UTF-8, and the OS decides to fail the call because of the invalid UTF-8, then so be it - using the library wouldn't change anything.

Almost the same for Windows: the library would simply round-trip invalid UTF-16 to the same invalid UTF-16. If the native Windows API decides to fail the call because of that, OK, then it just fails -- it would also have failed in a wchar_t application.

But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)


Artyom Beilis via Boost
2017-06-12 19:37:34 UTC
Post by Groke, Paul via Boost
But maybe I missed something here. If there really is a good reason for enforcing valid UTF-8 in some situation, please let me know :)
OK, I want to summarize my response regarding WTF-8, especially from a
security point of view.

There are lots of dangers in this assumption: generating WTF-8 as
if it were UTF-8 can lead to serious issues.

It is almost never OK to generate invalid UTF-8, especially since in all
likelihood 99% of users will not understand
what we are talking about with round-tripping/WTF-8 and so on - and I'm
talking from experience.

Just to make clear how serious it can be, this is a CVE regarding one
bug in Boost.Locale:

http://people.canonical.com/~ubuntu-security/cve/2013/CVE-2013-0252.html

Let's show a trivial example of what WTF-8 can lead to:

Tool:

- A user-monitoring system that monitors all files and reports
all changes via XML to some central server

Denial-of-Service Attack Example:

- The user creates a file with an invalid UTF-16 name
- The system monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- The user does whatever he wants without monitoring
- The user then removes the file
- No reports were generated during the period the user needed - a DoS attack

Bottom line:

(a) Since 99% of programmers are barely aware of the various Unicode
issues, it is dangerous to assume that providing such a round trip in a
trivial way is OK.

(b) If you need to do complex manipulation of the file system, you may
consider Boost.Filesystem, which keeps its internal representation in the
native format and converts
to UTF-8 for other uses (Nowide provides integration with Boost.Filesystem).

(c) It is possible to add WTF-8 support under the following restrictions:
(c.1) only by adding wtf8_to_wide and wide_to_wtf8 - if the user wants it
explicitly(!)
(c.2) no "C/C++"-like API should accept it - i.e.
nowide::fopen/nowide::fstream must not do so.


Artyom Beilis

Zach Laine via Boost
2017-06-12 20:39:51 UTC
On Mon, Jun 12, 2017 at 2:37 PM, Artyom Beilis via Boost <
Post by Groke, Paul via Boost
Post by Groke, Paul via Boost
But maybe I missed something here. If there really is a good reason for
enforcing valid UTF-8 in some situation, please let me know :)
[snip]
[snip]
To add a bottom line under that, it seems that nowide does not support
round-tripping from arbitrary Windows path names to UTF-8 and back. Is
that correct?

Zach

Peter Dimov via Boost
2017-06-12 21:05:57 UTC
Post by Artyom Beilis via Boost
- User creates a file with invalid UTF-16
- System monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- User does whatever he wants without monitoring
- It removes the file
- There were no reports generated during the period user needed -DOS attack
I can't help but note that the same attack would work under Unix. The user
can easily create a file with an invalid UTF-8 name. And, since the library
doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.


Niall Douglas via Boost
2017-06-12 22:52:47 UTC
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
- User creates a file with invalid UTF-16
- System monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- User does whatever he wants without monitoring
- It removes the file
- There were no reports generated during the period user needed -DOS attack
I can't help but note that the same attack would work under Unix. The
user can easily create a file with an invalid UTF-8 name. And, since the
library doesn't enforce valid UTF-8 on POSIX (right?) it would pass
through.
Also, running sanitisation on UTF sequences is laudable and all, but
what about the 80-90% of code which will never, ever feed strings from
untrusted remote parties to system APIs? Do they pay for the UTF
sequence validation too?

Like Peter and many of us, I have a personal "nowide" library too. It
simply wraps the *W() Win32 APIs consuming 16-bit characters with a call
to the NT kernel function for converting UTF-8 to UTF-16, which is
RtlUTF8ToUnicodeN().

It does nothing else fancy, it leaves it entirely up to the Windows
kernel what that conversion means. I'd suggest that the nowide library
ought to keep things similarly simple.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
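
For readers who have not used it: RtlUTF8ToUnicodeN() is exported from
ntdll.dll, so a user-mode wrapper typically loads it dynamically. Below is
a minimal Windows-only sketch of the kind of shim Niall describes; the
dynamic-loading details and the helper names are my own assumptions, not a
description of his library.

#include <windows.h>
#include <stdexcept>
#include <string>

// Documented signature of RtlUTF8ToUnicodeN (NTSTATUS spelled as LONG here
// to avoid pulling in extra headers).
typedef LONG (NTAPI *RtlUTF8ToUnicodeN_t)(
    PWSTR  UnicodeStringDestination,
    ULONG  UnicodeStringMaxByteCount,
    PULONG UnicodeStringActualByteCount,
    PCCH   UTF8StringSource,
    ULONG  UTF8StringByteCount);

std::wstring utf8_to_wide(const std::string& u8)
{
    static RtlUTF8ToUnicodeN_t fn = reinterpret_cast<RtlUTF8ToUnicodeN_t>(
        ::GetProcAddress(::GetModuleHandleW(L"ntdll.dll"), "RtlUTF8ToUnicodeN"));
    if(!fn)
        throw std::runtime_error("RtlUTF8ToUnicodeN not available");

    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::wstring out(u8.size(), L'\0');
    ULONG written = 0;
    fn(&out[0], static_cast<ULONG>(out.size() * sizeof(wchar_t)), &written,
       u8.data(), static_cast<ULONG>(u8.size()));
    out.resize(written / sizeof(wchar_t));
    return out;
}

// Example shim in the spirit of a personal "nowide": CreateFileW taking a
// UTF-8 path. The function name is hypothetical.
HANDLE create_file_utf8(const std::string& path, DWORD access, DWORD share,
                        DWORD creation, DWORD flags)
{
    return ::CreateFileW(utf8_to_wide(path).c_str(), access, share,
                         nullptr, creation, flags, nullptr);
}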


Artyom Beilis via Boost
2017-06-13 03:49:09 UTC
On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
- User creates a file with invalid UTF-16
- System monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- User does whatever he wants without monitoring
- It removes the file
- There were no reports generated during the period user needed -DOS attack
I can't help but note that the same attack would work under Unix. The user
can easily create a file with an invalid UTF-8 name. And, since the library
doesn't enforce valid UTF-8 on POSIX (right?) it would pass through.
Note: under POSIX the user takes strings as is and can't trust the source.
Under Windows he needs to convert them using Nowide, which can give him the
false assumption that he receives valid UTF-8.

Once again, I have no problem providing wtf8-to-wide and the other way
around as functions, if the user asks for them EXPLICITLY.
But it shall not be the default behavior, or some behavior you turn on with
a global define.


Artyom

Groke, Paul via Boost
2017-06-13 09:30:46 UTC
Post by Artyom Beilis via Boost
On Tue, Jun 13, 2017 at 12:05 AM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
- User creates a file with invalid UTF-16
- System monitors the file system and adds it to the XML report in
WTF-8 format
- The central server does not accept the XML since it fails UTF-8 validation
- User does whatever he wants without monitoring
- It removes the file
- There were no reports generated during the period user needed -DOS attack
I can't help but note that the same attack would work under Unix. The
user can easily create a file with an invalid UTF-8 name. And, since
the library doesn't enforce valid UTF-8 on POSIX (right?) it would pass
through.
Note, under POSIX user takes strings as is and can't trust the source.
Under Windows it need to convert them using nowide which can give him
false assumption that it receives valid UTF-8.
OK, thanks for explaining, I understand your concern now. I don't agree that a library like Nowide should necessarily try to address that kind of issue though.

Regarding your example... isn't the whole point of Nowide to make it possible to write portable applications without #ifdef-ing stuff? So... for this to create a problem on Windows (and at the same time not create a problem on POSIX), a bunch of things would have to coincide:
- The programmer didn't read the manual and therefore assumes he gets UTF-8 on Windows
- The application then wants to write some cmdline argument/path/... into some document that requires valid UTF-8
- The software component that builds/writes that document fails to validate the UTF-8 itself
- The programmer decides to skip the UTF-8 validation on Windows, because he assumes it has already been done
Assuming that the application doesn't need an #ifdef _WIN32 in that part of the code anyway, that would mean that the programmer deliberately introduced an #ifdef _WIN32, just to selectively skip some validation on Windows, because he thinks that it might already have been done.

No, I wouldn't say that's something that should be driving the design decisions of a library like Nowide.

Additionally, I don't see what validating the UTF-8 would improve here. In that case the user would still create a file with invalid UTF-16; the application would then fail even to enumerate the directory or convert the path. And I'm not convinced that kind of error is more likely to be handled correctly.
Post by Artyom Beilis via Boost
Once again I have no problem providing wtf8 to wide and other way around
functions when user EXPLICITLY says it.
But it shell not be default behavior or some behavior you turn on with some
global define.
The whole appeal of a library like Nowide is that it works more or less transparently - i.e. without having to write conversion calls all over the place. And duplicating all APIs is something that I would deem a very bad choice - that would be confusing as hell.


Peter Dimov via Boost
2017-06-13 15:08:48 UTC
Post by Artyom Beilis via Boost
Note, under POSIX user takes strings as is and can't trust the source.
Under Windows it need to convert them using nowide which can give him
false assumption that it receives valid UTF-8.
What use case are we discussing here?

Is it the one where the user gets a path from an external source and passes
it to Nowide? In which case Nowide does no validation on POSIX and strict
UTF-8 validation on Windows? Giving the Windows user the false assumption
that Nowide validates, which turns out to not happen in a later POSIX port.

Or is it the one where the user receives a path from Nowide? Which currently
doesn't occur anywhere, because the library does not return paths, it only
takes them as arguments. (It would occur if you add `readdir`.)
Post by Artyom Beilis via Boost
Once again I have no problem providing wtf8 to wide and other way around
functions when user EXPLICITLY says it.
What would portable user code look like in this scenario?


Artyom Beilis via Boost
2017-06-13 15:42:43 UTC
On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
Note, under POSIX user takes strings as is and can't trust the source.
Under Windows it need to convert them using nowide which can give him
false assumption that it receives valid UTF-8.
What use case are we discussing here?
Is it the one where the user gets a path from an external source and passes
it to Nowide? In which case Nowide does no validation on POSIX and strict
UTF-8 validation on Windows? Giving the Windows user the false assumption
that Nowide validates, which turns out to not happen in a later POSIX port.
Or is it the one where the user receives a path from Nowide? Which currently
doesn't occur anywhere, because the library does not return paths, it only
takes them as arguments. (It would occur if you add `readdir`.)
For that purpose you have Boost.Filesystem, which internally keeps wide strings
on Windows and narrow strings on POSIX platforms.

Nowide does not try to replace Boost.Filesystem. However, once you try to get
a std::string from Boost.Filesystem (using the Nowide integration), you will
get either an error or a valid UTF-8 string.
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
Once again I have no problem providing wtf8 to wide and other way around
functions when user EXPLICITLY says it.
How would portable user code look like in this scenario?
Nowide does not try to provide a solution for every possible portable API,
but it provides a toolset that makes such portable solutions easier.

However, if you convert some wide string coming from a Windows API, you
will expect to get valid UTF-8 and not WTF-8.

Artyom

Peter Dimov via Boost
2017-06-13 15:50:18 UTC
Post by Artyom Beilis via Boost
On Tue, Jun 13, 2017 at 6:08 PM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
Note, under POSIX user takes strings as is and can't trust the source.
Under Windows it need to convert them using nowide which can give him
false assumption that it receives valid UTF-8.
What use case are we discussing here?
Is it the one where the user gets a path from an external source and
passes it to Nowide? In which case Nowide does no validation on POSIX
and strict UTF-8 validation on Windows? Giving the Windows user the
false assumption that Nowide validates, which turns out to not happen in
a later POSIX port.
Or is it the one where the user receives a path from Nowide? Which
currently doesn't occur anywhere, because the library does not return
paths, it only takes them as arguments. (It would occur if you add
`readdir`.)
For that purpose you have Boost.Filesystem that internally keeps wide
strings on Windows and narrow on POSIX platforms.
Nowide does not try to replace Boost.Filestem. However once you try to get
an std::string from Boost.Filesystem (using nowide integration) you will
get either error or valid UTF-8 string.
This doesn't answer my question.

Let me rephrase.

The user gets a path from an external source and passes it to Nowide, for
example, to nowide::fopen.

Nowide does no validation on POSIX and strict UTF-8 validation on Windows.

Why is the Windows-only strict validation a good thing?

What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
under Windows, and passing everything through unvalidated on POSIX?

If the program originates on Windows and as a result comes to rely on
Nowide's strict validation, and is later ported to POSIX, are not the users
left with a false sense of security?


Artyom Beilis via Boost
2017-06-13 20:22:09 UTC
Post by Peter Dimov via Boost
Let me rephrase.
The user gets a path from an external source and passes it to Nowide, for
example, to nowide::fopen.
Nowide does no validation on POSIX and strict UTF-8 validation on Windows.
Why is the Windows-only strict validation a good thing?
What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
under Windows, and passing everything through unvalidated on POSIX?
Ok...

This is a very good question.

On Windows I need to convert, for obvious reasons.

Now the question is what I accept as valid and what is not valid, and
where I draw the line.
---------------------------------------------------------------------

Now, as you have seen, there are many possible "non-standard" UTF-8 variants.

What should I accept?

Should I accept CESU-8 (non-BMP characters encoded as a pair of 3-byte
"UTF-8-like" sequences) as well as UTF-8?

Should I accept WTF-8 with only non-paired surrogates? What if these
surrogates can be combined into correct UTF-16 - is that valid?
(In fact, concatenation of WTF-8 strings isn't a trivial operation, as
a simple string + string does not work and can lead to ill-formed WTF-8.)

Should I accept modified UTF-8 - it was already asked about - (i.e. values
not encoded with the shortest sequence)? For example, should I accept "x"
encoded in two bytes? What about "."?
How should I treat something like "\xFF.txt" <- invalid UTF-8? Should I
convert it to L"\xFF.txt"? Should I convert it to some sort of WTF-16
to preserve the string?

Now, despite how it may look from this discussion, WTF-8 is far from
being a "standard" for representing invalid UTF-16.
Some implementations substitute it with "?", others with U+FFFD, others just
remove the offending code unit. I actually tested some real "bad"
file names on different systems and each did something different.

I don't think WTF-8 is some widely used industry standard; it is
just one of many variants of extending UTF-8.

But there MUST be a clear line between what is accepted and what is not, and
the safest and most standard line to draw is well-defined
UTF-8 and UTF-16, which are (a) 100% convertible to one another and (b)
widely used, accepted standards.

So that was my decision - based on safety and standards (and there is
no such thing as non-strict UTF-8/16).

Does it fit everybody? Of course not!
Is there some line that fits everybody? There is no such thing!
Does it fit the common case for the vast majority of users/developers? IMHO yes.

So as a policy I decided to use UTF-8 and UTF-16, as a selection of
encoding for each side of widen/narrow is required.
Post by Peter Dimov via Boost
If the program originates on Windows and as a result comes to rely on
Nowide's strict validation, and is later ported to POSIX, are not the users
left with a false sense of security?
You have a valid point.

But adding validation on POSIX systems would in general be wrong (as I
noted in the Q&A),
because there is no single encoding for a POSIX OS - it is a runtime
parameter - unlike the Windows API, which has a
wide UTF-16 API. So such validation on Linux/POSIX would likely cause
more issues than it solves.

Thanks,
Artyom

Niall Douglas via Boost
2017-06-13 20:33:00 UTC
Post by Artyom Beilis via Boost
Post by Peter Dimov via Boost
Why is the Windows-only strict validation a good thing?
What attacks are prevented by not accepting WTF-8 in nowide::fopen ONLY
under Windows, and passing everything through unvalidated on POSIX?
Ok...
This is a very good question.
On windows I need to convert for obvious reason.
Now the question is what I accept as valid and what is not valid and
where do I draw the line.
---------------------------------------------------------------------------------------------------------------------------
Now as you have seen there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN()
(https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx)
to do the UTF-8 conversion. Do **nothing** else.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Zach Laine via Boost
2017-06-13 20:52:18 UTC
On Tue, Jun 13, 2017 at 3:33 PM, Niall Douglas via Boost <
Post by Artyom Beilis via Boost
Post by Artyom Beilis via Boost
Now as you have seen there are many possible "non-standard" UTF-8
variants.
Post by Artyom Beilis via Boost
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN()
(https://msdn.microsoft.com/en-us/library/windows/
hardware/ff563018(v=vs.85).aspx)
to do the UTF-8 conversion. Do **nothing** else.
Niall, could you explain why? I don't know any of the Windows-relevant
details.

Zach

Niall Douglas via Boost
2017-06-13 20:59:13 UTC
Post by Niall Douglas via Boost
Post by Artyom Beilis via Boost
Now as you have seen there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN()
(https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx
<https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx>)
to do the UTF-8 conversion. Do **nothing** else.
Niall, could you explain why? I don't know any of the Windows-relevant
details.
1. RtlUTF8ToUnicodeN() is what the NT kernel uses and isn't polluted by
Win32.

2. RtlUTF8ToUnicodeN() has a well designed API unlike the awful Win32
MultiByteToWideChar() function.

3. RtlUTF8ToUnicodeN() is close to as fast as any implementation of the
same thing, unlike MultiByteToWideChar() and some STL implementations of
<codecvt>.

4. RtlUTF8ToUnicodeN() treats invalid input in the way which the rest of
the NT kernel is built to expect. I caveat this with saying that Win32
functions can mangle input in a really unhelpful way, I would only trust
RtlUTF8ToUnicodeN() being fed directly to NT kernel APIs. That I know
works well.

5. I've been using RtlUTF8ToUnicodeN() in my own code for years and have
found it unsurprising and unproblematic. Unlike MultiByteToWideChar() or
C++ 11's UTF-8 support which doesn't quite work right on some standard
libraries.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Artyom Beilis via Boost
2017-06-14 04:51:48 UTC
Post by Niall Douglas via Boost
Post by Artyom Beilis via Boost
This is a very good question.
On windows I need to convert for obvious reason.
Now the question is what I accept as valid and what is not valid and
where do I draw the line.
---------------------------------------------------------------------------------------------------------------------------
Now as you have seen there are many possible "non-standard" UTF-8 variants.
What should I accept?
I still strongly suggest you simply call RtlUTF8ToUnicodeN()
(https://msdn.microsoft.com/en-us/library/windows/hardware/ff563018(v=vs.85).aspx)
to do the UTF-8 conversion. Do **nothing** else.
Niall
Actually, I think you have provided me with a good direction I hadn't considered before.

RtlUTF8ToUnicodeN and its counterpart in the other direction do something very simple:

they substitute invalid code points/encodings with U+FFFD - REPLACEMENT CHARACTER -
which is the standard Unicode way to say "I failed to convert something".

It is similar to the current ANSI/wide conversions producing '?' instead.

It looks like a better way to do it than failing to convert the
entire string altogether.

If you pass an invalid string, the conversion will succeed, but you'll get
special characters (usually rendered as � in a UI)
that actually tell you something was wrong.

This way, for example, getenv on a valid key will not return NULL and
create ambiguity about what happened, and it is actually
behavior that is more common on Windows.

I like it, and I think I'll change the behavior of the conversion
functions in Boost.Nowide to this one.

Thanks!

Artyom Beilis
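
To make the proposed policy concrete, a tiny hypothetical illustration
(this shows the substitution behavior Artyom describes above, not the
behavior of the snapshot under review, which rejects such input; the
convert header name is taken from the docs and may differ):

#include <boost/nowide/convert.hpp>  // boost::nowide::widen / narrow
#include <boost/nowide/iostream.hpp>

int main()
{
    // "\xFF" is not valid UTF-8. Under the substitution policy the call
    // would still succeed, yielding U+FFFD in place of the bad byte rather
    // than failing for the whole string.
    std::wstring w = boost::nowide::widen("\xFF.txt");
    boost::nowide::cout << (w == L"\uFFFD.txt" ? "substituted" : "other policy") << "\n";
}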

Niall Douglas via Boost
2017-06-14 10:41:07 UTC
Post by Artyom Beilis via Boost
Actually I think you provided me a good direction I hadn't considered before.
It substitutes invalid codepoints/encoding with U+FFFD - REPLACEMENT CHARACTER
which is standard Unicode way to say I failed to convert a thing.
It is something similar to current ANSI / Wide conversions creating ? instead.
It looks like it is better way to do it instead of failing to convert
entire string all together.
If you get invalid string conversion will success but you'll get
special characters (that are usually marked as � in UI)
that will actually tell you something was wrong.
Some additional information about how the NT kernel treats the FFFD
character might be useful to you.

Quite a lot of kernel APIs treat UNICODE_STRING as just a bunch of bytes
e.g. the filesystem. You can supply any path with any characters at all,
including one containing zero bytes. The only character you can't use is
the backslash as that is the path separator. This matches how most Unix
filesystems work, and indeed at the NT kernel level path comparisons are
case sensitive as well as byte sensitive (unless you ask otherwise)
because it's just a memcmp(). You can have real fun here creating lots
of paths completely unparseable by Win32, as in, files totally
inaccessible to any Win32 API with some really random Win32 error codes
being returned.

Other NT kernel APIs will refuse strings containing FFFD or illegal
UTF-16 characters, and if they do it's generally because accepting them
would be an obvious security risk. But those are definitely a minority.

If the Win32 layer doesn't get in the way, doing as RtlUTF8ToUnicodeN()
does should be safe in so far as the kernel team feel it is. They have
placed appropriate checks on appropriate kernel APIs. But in the end,
it's down to the programmer to correctly validate and check all
untrusted input. Doing so always is an unnecessary expense for most end
users who have trusted input.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


Frédéric Bron via Boost
2017-06-15 06:59:19 UTC
Post by Artyom Beilis via Boost
It substitutes invalid codepoints/encoding with U+FFFD - REPLACEMENT CHARACTER
which is standard Unicode way to say I failed to convert a thing.
so does this allow roundtrip conversion UTF-16->UTF-8->UTF-16?

Frédéric

Peter Dimov via Boost
2017-06-13 23:33:48 UTC
Post by Artyom Beilis via Boost
Now as you have seen there are many possible "non-standard" UTF-8 variants.
What should I accept?
Others have made their case for WTF-8, which has the desirable property of
only allowing one encoding for any uint16_t sequence (taken in isolation).

My personal choice here is to be more lenient and accept any combination of
valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8),
but I'm not going to argue very strongly for it over WTF-8, if at all.

I think that overlongs should NOT be accepted, including overlong zero.


Groke, Paul via Boost
2017-06-14 08:12:11 UTC
Post by Peter Dimov via Boost
Others have made their case for WTF-8, which has the desirable property of
only allowing one encoding for any uint16_t sequence (taken in isolation).
My personal choice here is to be more lenient and accept any combination of
valid UTF-8 and UTF-8 encoded surrogates (a superset of CESU-8 and WTF-8),
but I'm not going to argue very strongly for it over WTF-8, if at all.
I have to admit that I was confusing WTF-8 and CESU-8. When converting from wide to narrow I would favor CESU-8, mostly because it makes the rules simpler and creates fewer special cases (e.g. when concatenating).

When converting from narrow to wide I agree that accepting any UTF-8/WTF-8/CESU-8 mix would be good.
Post by Peter Dimov via Boost
I think that overlongs should NOT be accepted, including overlong zero.
Agreed.


Frédéric Bron via Boost
2017-06-15 06:43:45 UTC
Post by Peter Dimov via Boost
Or is it the one where the user receives a path from Nowide? Which currently
doesn't occur anywhere, because the library does not return paths,
this is not fully true as you can get a path with getenv.

Frédéric

Frédéric Bron via Boost
2017-06-16 05:00:09 UTC
Post by Artyom Beilis via Boost
Please note: Under POSIX platforms no conversions are performed
http://cppcms.com/files/nowide/html/index.html#qna
I do not quite understand the rationale behind not converting to UTF-8
on POSIX platforms. I naively thought I got UTF-8 in argv because my
system is configured for UTF-8, but I discover that this is not
necessarily always the case. In the example you highlight, I do not see
the difference from the Windows case. You could convert to UTF-8 in
argv and back to the local encoding in nowide::remove. I understand it
is not efficient if you do not really use the content of the filename,
but if you have to write, say, an XML report in UTF-8, you would have
to convert anyway.

Today, what is the portable way to convert argv to UTF-8, i.e. without
having to #ifdef _WIN32...?

Frédéric

Artyom Beilis via Boost
2017-06-16 07:00:43 UTC
Post by Frédéric Bron via Boost
Post by Artyom Beilis via Boost
Please note: Under POSIX platforms no conversions are performed
http://cppcms.com/files/nowide/html/index.html#qna
I do not quite understand the rationale behind not converting to UTF-8
on Posix platforms. I naively though I got UTF-8 in argv because my
system is convigured in UTF-8 but I discover that this is not
necessary always the case. In the example you highlight, I do not see
the difference from the Windows case. You could convert to UTF-8 in
argv and back to the local encoding in nowide::remove. I understand it
is not efficient if you do not really use the content of the filename
but if you have to write, say an xml report in UTF-8, you would have
to convert anyway.
Today, what is the portable way to convert argv to UTF-8? i.e. without
having to #ifdef _WIN32...?
Frédéric
Hello Frederic,

There are several reasons for this.


One of is actually original purpose of the library: use same type of strings
internally without creating broken software on Windows and since only
Windows use native Wide instead of narrow API which is native for C++
only Windows case requires encoding conversion.

However there is another deeper issue. Unlike on Windows where native
wide API has well defined UTF-16 encoding it isn't the case for Unix like OSes.

The encoding is defined by current Locale that can be defined
globally, per user,
per process and even change trivially in the same process during the runtime.

There are also several sources of the Locale/Encoding information:

Environment variables:

- LANG/LC_CTYPE - which is UTF-8 on vast majority of modern Unix like
platforms but frequently can be undefined or defined as "C" locale without
encoding information.

This one is what OS defines for the process.

- C locale: setlocale API - which is by default "C" locale by standard
unless explicitly defined otherwise

- C++ locale: std::locale::global() API - which is by default "C" locale by
standard unless explicitly defined otherwise

They are all can be changed at runtime, they aren't synchronized and
they be modified to whatever encoding user wants.

Additionally, setting std::locale::global to something other than the "C"
locale can lead to some really nasty things, like failing to create valid CSV
files because grouping "," separators get added to numbers.
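(A minimal sketch of that failure mode - not code from the library, and it
assumes a grouping locale such as "en_US.UTF-8" is installed on the system:)

#include <iostream>
#include <locale>
#include <sstream>

int main()
{
    std::ostringstream ok;
    ok << 1234567 << ";" << 89;            // "1234567;89" with the default "C" locale

    std::locale::global(std::locale("en_US.UTF-8")); // throws if not installed
    std::ostringstream broken;             // picks up the new global locale
    broken << 1234567 << ";" << 89;        // "1,234,567;89" - grouping commas
                                           // now corrupt the numeric fields
    std::cout << ok.str() << "\n" << broken.str() << "\n";
    return 0;
}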

So the safest and most correct way to handle it is to pass narrow strings
as-is, without any conversion.

Regards,
Artyom Beilis

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinf
Frédéric Bron via Boost
2017-06-16 07:40:24 UTC
Permalink
Post by Artyom Beilis via Boost
The encoding is defined by current Locale that can be defined
globally, per user,
per process and even change trivially in the same process during the runtime.
They are all can be changed at runtime, they aren't synchronized and
they be modified to whatever encoding user wants.
So the safest and the most correct way to handle it is to pass narrow
strings as is
without any conversion.
I understand that this complexity can be a nightmare.

How would you therefore do this simple task:
1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows,
unknown narrow on Posix)
2. create a file in base_dir whose name is file_name, encoded in UTF-8
(because it is created by the application).

If I understand well, I should NOT do this:

auto f = nowide::ofstream((boost::filesystem::path(base_dir) /
file_name).string());

because this is guaranteed to work only on Windows where I have the
guarantee that base_dir is UTF-8, right?

On Posix, there is a risk that base_dir is for example ISO-8859-1
while file_name is UTF-8 so that the combination made by filesystem is
wrong.
Am I right?

Frédéric

_______________________________________________
Unsubscribe & other cha
Artyom Beilis via Boost
2017-06-16 09:27:26 UTC
Permalink
How would you therefore do this simple task:
1. get a directory from nowide::getenv -> base_dir (UTF-8 on Windows,
unknown narrow on Posix)
2. create a file in base_dir which name is file_name encoded in UTF-8
(because it is created by the application).

If I understand well, I should NOT do this:

auto f = nowide::ofstream((boost::filesystem::path(base_dir) /
file_name).string());

because this is guaranteed to work only on Windows where I have the
guarantee that base_dir is UTF-8, right?


Actually that is exactly what you should do. Once you turn on **nowide
integration** with Boost.Filesystem, path::string() does not perform any
conversions on POSIX platforms.

And you can safely concatenate two strings in different encodings and a valid
file will be created - even if the directory is in ISO 8859-1 and the file
name in UTF-8. The file will be valid even if its name is not representable in
any single encoding.

Under Windows you'll get the correct conversions from narrow to wide via the
UTF-8/UTF-16 facet installed by the Boost.Nowide/Filesystem integration, as on
Windows Boost.Filesystem uses wide strings.
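(A minimal sketch of that workflow, assuming the nowide_filesystem() helper
and the header paths shown in the library's documentation; the environment
variable name is made up:)

#include <boost/nowide/integration/filesystem.hpp>
#include <boost/nowide/cenv.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/filesystem/path.hpp>

int main()
{
    // Install the UTF-8 facet into boost::filesystem::path: wide conversions
    // happen on Windows, narrow strings pass through untouched on POSIX.
    boost::nowide::nowide_filesystem();

    const char* base_dir = boost::nowide::getenv("MYAPP_BASE_DIR"); // hypothetical variable
    if(!base_dir)
        return 1;

    // "\xC3\xA9" is UTF-8 for 'é'; the application chose this name itself.
    boost::filesystem::path p = boost::filesystem::path(base_dir) / "report-\xC3\xA9.xml";
    boost::nowide::ofstream out(p.string().c_str());
    out << "<report/>\n";
    return 0;
}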

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Frédéric Bron via Boost
2017-06-16 09:42:48 UTC
Permalink
Post by Artyom Beilis via Boost
And you can safely concatenate two different strings and valid file will be
created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be
valid even if not representable in any encoding.
but it seems to me that in this case, what I need is a UTF-8 -> ISO 8859-1
conversion of the file name before concatenation with the directory.
Otherwise, OK, I will get a file because the system just asks for a narrow
string, but its name will be displayed wrongly in the OS user interface.

Frédéric

_______________________________________________
Unsubscribe & other changes: htt
Artyom Beilis via Boost
2017-06-16 12:20:10 UTC
Permalink
Post by Frédéric Bron via Boost
Post by Artyom Beilis via Boost
And you can safely concatenate two different strings and valid file will be
created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will be
valid even if not representable in any encoding.
but it seems to me that in this case, what I need is UTF-8->ISO 8859-1
conversion of the file name before concatenation with the directory.
Otherwise, OK I will get a file because the system just ask for narrow
string but its name will be wrong in the OS user interface.
Frédéric
You actually **assume** that the encoding you received from the system (e.g.
via getenv) matches the current locale encoding.

But it is not necessarily the same:

1. The file/directory was created by a user running in a different locale
2. The locale isn't defined properly or was modified
3. You got these files/directories from some other location (e.g. you
unzipped some archive)

In reality the OS does not care about encoding (most of the time).

Unlike Windows, where wchar_t also defines the encoding (UTF-16), under POSIX
platforms a "char *" can contain whatever encoding, and that can change.

Also UTF-8 is the most common encoding on all modern Unix like
systems: Linux, BSD, Mac OS X

So I don't think it is necessary to perform any conversions between UTF-8
and whatever "char *" encoding you get, because:

(a) You can't reliably know what encoding is in use.
(b) The same "char *" may contain parts in different encodings and still
be a valid path.


Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mai
Peter Dimov via Boost
2017-06-16 12:40:31 UTC
Permalink
Post by Artyom Beilis via Boost
Actually it is what exactly you should do. Once you turn on **nowide
integration** with boost filesystem
path::string does not perform any conversions on POSIX platforms.
And you can safely concatenate two different strings and valid file will
be created. Even if dir is in ISO 8859-1 and file in UTF-8. The file will
be valid even if not representable in any encoding.
"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always
UTF-8, not merely by convention, and no mater what locale is set. (They're
stored in UTF-16 in the filesystem, just like in NTFS. Unlike NTFS, it's
_valid_ UTF-16.)

For encoding-neutral Unix-es such as Linux and Solaris, the above is
correct.

In any case, a library that attempts to reencode from the current locale to
UTF-8 in the narrow case would be useless. (If the locale isn't UTF-8, nothing
will work anyway; a library trying to handle that case would just make things
worse.) This is well established empirically by now.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Niall Douglas via Boost
2017-06-16 14:15:35 UTC
Permalink
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
And you can safely concatenate two different strings and valid file
will be created. Even if dir is in ISO 8859-1 and file in UTF-8. The
file will be valid even if not representable in any encoding.
"POSIX" is not correct here. File names under Mac OS X are UTF-8. Always
UTF-8, not merely by convention, and no mater what locale is set.
You were right but are now wrong I am afraid Peter:
https://mjtsai.com/blog/2017/03/24/apfss-bag-of-bytes-filenames/

Apple's shiny new APFS treats paths as a dumb sequence of bytes, and
they are retrofitting HFS+ to behave the same to match.
Post by Peter Dimov via Boost
(They're stored in UTF-16 in the filesystem, just like in NTFS. Unlike
NTFS, it's _valid_ UTF-16.)
You should assume that invalid UTF will be present on all filing systems,
because almost all of them treat paths as dumb byte strings. It's the only
portable assumption. It's why AFIO v2 opts out of case insensitivity on
Windows, which no doubt will surprise quite a few end users, but it does
prevent many security attacks at the AFIO layer (by pushing the problem
further up the stack).

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-12 17:52:55 UTC
Permalink
Post by Artyom Beilis via Boost
Implying that there are Windows file names (that their names contain
invalid UTF-16) that can't be handled by the library?
Yes indeed if you will try to use stuff like FindFileW and it returns
invalid UTF-16 you will get an error trying to convert it to UTF-8.
That's what I meant, already existing files that can't be manipulated by the
library.

Incidentally, the mention of FileFindW leads me to my next question: by
looking at my own collection of functions that basically mirror what Nowide
does (everyone has his own Nowide, I suppose, as this is the only sane way
to program), I see 'stat' and a 'readdir' equivalent.

Have you considered providing these? One could go the filesystem route, of
course, but sometimes one may wish to minimize dependencies.
Post by Artyom Beilis via Boost
By definition: you can't handle file names that can't be represented in
UTF-8, as no valid UTF-8 representation exists.
Well yes in principle, but the file already exists. Although one can well
argue that this is no different from putting ':' in the name or something
like that.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Groke, Paul via Boost
2017-06-12 09:20:39 UTC
Permalink
Post by Groke, Paul via Boost
Post by Groke, Paul via Boost
Not a review, just a question (actually two): how does Nowide deal with
modified UTF-8?
No, modified UTF-8 is not supported; since it isn't UTF-8 it will be considered
an invalid encoding.
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I think it could make sense to support it anyway though. Round tripping (strictly invalid, but possible) file names on Windows, easier interoperability with stuff like JNI, ...
OTOH it would add overhead for systems with native UTF-8 APIs, because Nowide would at least have to check every string for "modified UTF-8 encoded" surrogate pairs and convert the string if necessary. Which of course is a good argument for not supporting modified UTF-8, because then Nowide could just pass the strings through unmodified on those systems.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Yakov Galka via Boost
2017-06-12 09:29:21 UTC
Permalink
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost <
Post by Groke, Paul via Boost
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
think it could make sense to support it anyway though. Round tripping
(strictly invalid, but possible) file names on Windows, easier
interoperability with stuff like JNI, ...
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes the
null character with an overlong sequence, and thus is incompatible with
standard UTF-8, unlike WTF-8 which is a compatible extension.
Post by Groke, Paul via Boost
OTOH it would add overhead for systems with native UTF-8 APIs, because
Nowide would at least have to check every string for "modified UTF-8
encoded" surrogate pairs and convert the string if necessary. Which of
course is a good argument for not supporting modified UTF-8, because then
Nowide could just pass the strings through unmodified on those systems.
Implementing WTF-8 removes a check in UTF-8 → UTF-16 conversion, and
doesn't change anything in the reverse direction when there is a valid
UTF-16. I suspect it isn't slower.
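(For the record, the check being referred to is the surrogate-range rejection
that a strict UTF-8 decoder performs on the decoded scalar value; a sketch,
not library code:)

// A strict UTF-8 validator rejects decoded values in the surrogate range;
// a WTF-8 decoder simply omits that test and lets lone surrogates through.
bool acceptable_scalar(char32_t cp, bool wtf8)
{
    if(cp > 0x10FFFF)
        return false;
    if(!wtf8 && cp >= 0xD800 && cp <= 0xDFFF) // the check WTF-8 removes
        return false;
    return true;
}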
--
Yakov Galka
http://stannum.co.il/

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/l
Groke, Paul via Boost
2017-06-12 09:55:01 UTC
Permalink
Post by Yakov Galka via Boost
On Mon, Jun 12, 2017 at 12:20 PM, Groke, Paul via Boost
Post by Groke, Paul via Boost
I know modified UTF-8 is (can be) invalid UTF-8, that's why I asked. I
think it could make sense to support it anyway though. Round tripping
(strictly invalid, but possible) file names on Windows, easier
interoperability with stuff like JNI, ...
Don't you mean WTF-8 then? AFAIK "Modified UTF-8" is UTF-8 that encodes
the null character with an overlong sequence, and thus is incompatible
with standard UTF-8, unlike WTF-8 which is a compatible extension.
No, I mean modified UTF-8. Modified UTF-8 is UTF-8 plus the following extensions:
- Allow encoding UTF-16 surrogates as if they were code points (=what "WTF-8" does)
- Allow an over-long 2 byte encoding of the NUL character

Neither is strictly UTF-8 compatible, but neither introduces significant
overhead in most situations. I don't see how over-long NUL encodings are
"more incompatible" than UTF-8 encoded surrogates, but then again
that's not really important.
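(Byte-level illustration of the two extensions, for reference - not library
API:)

// Standard UTF-8 encodes U+0000 as the single byte 0x00; Modified UTF-8 uses
// the overlong two-byte form instead, so embedded NULs survive inside
// NUL-terminated strings:
const unsigned char modified_nul[2]   = { 0xC0, 0x80 };

// A lone UTF-16 surrogate such as 0xD800 has no valid UTF-8 encoding;
// Modified UTF-8 and WTF-8 both encode it as if it were a code point:
const unsigned char lone_surrogate[3] = { 0xED, 0xA0, 0x80 };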
Post by Yakov Galka via Boost
Post by Groke, Paul via Boost
OTOH it would add overhead for systems with native UTF-8 APIs, because
Nowide would at least have to check every string for "modified UTF-8
encoded" surrogate pairs and convert the string if necessary. Which of
course is a good argument for not supporting modified UTF-8, because
then Nowide could just > pass the strings through unmodified on those
systems.
Implementing WTF-8 removes a check in UTF-8 -> UTF-16 conversion, and
doesn't change anything in the reverse direction when there is a valid
UTF-16. I suspect it isn't slower.
Supporting modified UTF-8 or WTF-8 adds overhead on systems where the
native OS API accepts UTF-8, but only strictly valid UTF-8.
When some UTF-8 enabled function of the library is called on such a
system, it would have to check for WTF-8 encoded surrogates and
convert them to "true" UTF-8 before passing the string to the OS API.
Because you would expect and want the "normal" UTF-8 encoding for
a string to refer to the same file as the WTF-8 encoding of the same
string.

Regards,
Paul Groke


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Yakov Galka via Boost
2017-06-12 18:12:23 UTC
Permalink
On Mon, Jun 12, 2017 at 12:55 PM, Groke, Paul via Boost <
Post by Groke, Paul via Boost
Supporting modified UTF-8 or WTF-8 adds overhead on systems where the
native OS API accepts UTF-8, but only strictly valid UTF-8.
When some UTF-8 enabled function of the library is called on such a
system, it would have to check for WTF-8 encoded surrogates and
convert them to "true" UTF-8 before passing the string to the OS API.
Because you would expect and want the "normal" UTF-8 encoding for
a string to refer to the same file as the WTF-8 encoding of the same
string.
That's the point: if the string is representable in UTF-8 at all, then its
WTF-8 representation is already valid UTF-8 and no conversion has to be done.
Thus you don't even have to check anything at all. This is how WTF-8 is 'more
compatible' with UTF-8 than Modified UTF-8 is.

By analogy, you don't need to do special checks if you want to pass a UTF-8
string to an ASCII only API, because UTF-8 is a strict superset.
--
Yakov Galka
http://stannum.co.il/

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-12 12:57:32 UTC
Permalink
Post by Frédéric Bron via Boost
http://cppcms.com/files/nowide/html
Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this specified,
but I may have missed it.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Artyom Beilis via Boost
2017-06-12 14:54:58 UTC
Permalink
On Mon, Jun 12, 2017 at 3:57 PM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Frédéric Bron via Boost
http://cppcms.com/files/nowide/html
Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this specified,
but I may have missed it.
It is documented in reference documentation.

http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192

fopen/ifstream etc. would fail in the usual way these functions fail -
failing to open the file.

widen/narrow will return null or throw boost::locale::conv::conversion_error,
depending on the function.
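(A minimal sketch of the fopen case, assuming the cstdio.hpp header from the
docs; the file name is deliberately invalid UTF-8:)

#include <boost/nowide/cstdio.hpp>
#include <cstdio>

int main()
{
    // 0xFF can never appear in valid UTF-8, so on Windows the conversion
    // fails before any OS call is made: NULL is returned, errno is EINVAL.
    std::FILE* f = boost::nowide::fopen("\xFF" "bad.txt", "r");
    if(!f)
        std::perror("fopen");
    else
        std::fclose(f);
    return 0;
}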

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-12 15:03:27 UTC
Permalink
Post by Artyom Beilis via Boost
Post by Peter Dimov via Boost
Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this
specified, but I may have missed it.
It is documented in reference documentation.
http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192
Yes, I see now that it's documented for some, for instance fopen:

If invalid UTF-8 given, NULL is returned and errno is set to EINVAL

but not all, for instance getenv:

UTF-8 aware getenv. Returns 0 if the variable is not set.

or setenv:

UTF-8 aware setenv, key - the variable name, value is a new UTF-8
value,.

and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.

Another minor point is that the library assumes Cygwin is POSIX and tries
`using ::setenv;`, which is a compile error. For the purposes of Nowide,
Cygwin should be considered Windows.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Artyom Beilis via Boost
2017-06-12 17:45:44 UTC
Permalink
On Mon, Jun 12, 2017 at 6:03 PM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Artyom Beilis via Boost
Post by Peter Dimov via Boost
Question: what do the functions do (on Windows) when the passed UTF-8 is
invalid? From a brief look at the documentation, I don't see this >
specified, but I may have missed it.
It is documented in reference documentation.
http://cppcms.com/files/nowide/html/namespaceboost_1_1nowide.html#a6644397c539754076b11e95c535be192
If invalid UTF-8 given, NULL is returned and errno is set to EINVAL
UTF-8 aware getenv. Returns 0 if the variable is not set.
UTF-8 aware setenv, key - the variable name, value is a new UTF-8 value,.
I try to keep the interface similar to the standard one. Having getenv
change errno isn't something one would expect.
Post by Peter Dimov via Boost
and it's not clear what happens when one feeds invalid UTF-8 to cout/cerr.
Yes, you are right, it should be made clearer. In case of invalid UTF-8 the
failbit will be set.
Post by Peter Dimov via Boost
Another minor point is that the library assumes Cygwin is POSIX and tries
`using ::setenv;`, which is a compile error. For the purposes of Nowide,
Cygwin should be considered Windows
Not exactly.

Since Cygwin 1.7 it uses UTF-8 by default, so operations like fopen cooperate
properly with the Windows API. So it is correct to treat Cygwin as POSIX and
not as Windows.

Regarding setenv/getenv specifically, and how (and whether) Cygwin handles
narrow/wide conversions there, I indeed need to check.

Thanks,
Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-13 17:34:03 UTC
Permalink
Post by Peter Dimov via Boost
Another minor point is that the library assumes Cygwin is POSIX and tries
`using ::setenv;`, which is a compile error.
I fixed this by changing cenv.hpp as follows:

#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
#include <boost/nowide/windows.hpp>
#endif

namespace boost {
namespace nowide {
#if !defined(BOOST_WINDOWS) && !defined(__CYGWIN__) &&
!defined(BOOST_NOWIDE_DOXYGEN)

and now everything compiles under Cygwin. I get the following two test
failures:

testing.capture-output
..\..\..\bin.v2\libs\nowide\test\test_system_n.test\gcc-cxx03-6.3.0\debug\test_system_n.run
====== BEGIN OUTPUT ======
Failed Error boost::nowide::getenv("BOOST_NOWIDE_TEST") in
test_system.cpp:25 main
Failed Error boost::nowide::system(command.c_str()) == 0 in
test_system.cpp:75 main

EXIT STATUS: 1
====== END OUTPUT ======


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Artyom Beilis via Boost
2017-06-13 19:49:41 UTC
Permalink
On Tue, Jun 13, 2017 at 8:34 PM, Peter Dimov via Boost
Post by Peter Dimov via Boost
Post by Peter Dimov via Boost
Another minor point is that the library assumes Cygwin is POSIX and tries
`using ::setenv;`, which is a compile error.
#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
#include <boost/nowide/windows.hpp>
#endif
namespace boost {
namespace nowide {
#if !defined(BOOST_WINDOWS) && !defined(__CYGWIN__) &&
!defined(BOOST_NOWIDE_DOXYGEN)
and now everything compiles under Cygwin. I get the following two test
testing.capture-output
..\..\..\bin.v2\libs\nowide\test\test_system_n.test\gcc-cxx03-6.3.0\debug\test_system_n.run
====== BEGIN OUTPUT ======
Failed Error boost::nowide::getenv("BOOST_NOWIDE_TEST") in
test_system.cpp:25 main
Failed Error boost::nowide::system(command.c_str()) == 0 in
test_system.cpp:75 main
EXIT STATUS: 1
====== END OUTPUT ======
OK, I'll test Cygwin (I haven't worked with it for a while) and make sure the
library works on this platform. I also need to check how the native Cygwin
getenv/setenv behaves with regard to Unicode.

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-14 02:02:03 UTC
Permalink
Post by Frédéric Bron via Boost
- What is your evaluation of the design?
The general principles on which the library is build are solid, the design
is good. As already noted, I would prefer `stat` and `opendir`/`readdir` as
additions, while we're at it. Yes, we could use Filesystem. On the other
hand, we _could_ use Filesystem for remove, rename, and fstream, too.
Post by Frédéric Bron via Boost
- What is your evaluation of the implementation?
I didn't look at the implementation too deeply, but I did take a look at,
and run, the test suite.

There are a number of copy/paste references to Boost.System here
(test/Jamfile) and Boost.Locale there (index.html), which should ideally be
corrected. Also, the test Jamfile is missing

import testing ;

which I do not hold against the author in any way as the Boost.System
Jamfile also had it missing (it works without it unless invoked from the
Boost root.)

The doc/ directory is missing a Jamfile, and there's also no
meta/libraries.json.

There are no .travis.yml and appveyor.yml files in the root, which is odd in
this day and age. :-)

The global BOOST_USE_WINDOWS_H macro should probably be respected.
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
Does the job, although not all functions had their behavior documented in
the case the passed UTF-8 sequence is invalid.
Post by Frédéric Bron via Boost
- What is your evaluation of the potential usefulness of the library?
The library, or something isomorphic to it, is indispensable for any serious
Windows work.
Post by Frédéric Bron via Boost
- Did you try to use the library? With what compiler? Did you have any
problems?
As mentioned, I ran the test suite on various versions of msvc (all passed),
MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin
gcc/clang (errors because of missing ::getenv and friends).

The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path
in cenv.hpp) identified was

In file included from test\test_system.cpp:11:0:
..\../boost/nowide/cenv.hpp: In function 'int boost::nowide::putenv(char*)':
..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and
zero character constant [-Wpointer-compare]
while(*key_end!='=' && key_end!='\0')
^~~~

which indicates that (1) the library is missing a test for putenv with the
string not containing '=', and (2) the author should consider testing on
Cygwin/Mingw, either locally, or on Appveyor. These two may not be important
production platforms, but they enable testing the Windows-specific code
paths with g++ and clang++, which as the above shows, may uncover potential
problems that testing on MSVC alone would miss.
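(For reference, a sketch of what the intended comparison presumably is - the
helper and parameter names are assumed, not copied from cenv.hpp:)

const char* find_key_end(const char* keyvalue) // 'keyvalue' = putenv's "KEY=value"
{
    // key_end is a char*, so the warned-about comparison tested the pointer
    // against null; dereferencing it tests the pointed-to character instead.
    const char* key_end = keyvalue;
    while(*key_end != '=' && *key_end != '\0')
        ++key_end;
    return key_end;
}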
Post by Frédéric Bron via Boost
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
Quick reading.
Post by Frédéric Bron via Boost
- Are you knowledgeable about the problem domain?
More or less.
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library?
Yes. Nowide should be accepted.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Artyom Beilis via Boost
2017-06-14 05:06:14 UTC
Permalink
Post by Peter Dimov via Boost
There are a number of copy/paste references to Boost.System here
(test/Jamfile) and Boost.Locale there (index.html), which should ideally be
corrected.
Ok I'll fix it
Post by Peter Dimov via Boost
Also, the test Jamfile is missing
import testing ;
which I do not hold against the author in any way as the Boost.System
Jamfile also had it missing (it works without it unless invoked from the
Boost root.)
What does it do and why do I need it?
Post by Peter Dimov via Boost
The doc/ directory is missing a Jamfile, and there's also no
meta/libraries.json.
Ok.

What I remember from Boost.Locale is that doxygen wasn't run during the doc
build; rather, the HTML was pushed to the directory. If there is now support
for automatically building docs with vanilla Doxygen (outside the QuickBook
scope), I'll add one.
Post by Peter Dimov via Boost
There are no .travis.yml and appveyor.yml files in the root, which is odd in
this day and age. :-)
What am I missing?
Post by Peter Dimov via Boost
The global BOOST_USE_WINDOWS_H macro should probably be respected.
Ok good point I'll fix it.
Post by Peter Dimov via Boost
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
Does the job, although not all functions had their behavior documented in
the case the passed UTF-8 sequence is invalid.
Good point.
Post by Peter Dimov via Boost
Post by Frédéric Bron via Boost
- What is your evaluation of the potential usefulness of the library?
The library, or something isomorphic to it, is indispensable for any serious
Windows work.
Post by Frédéric Bron via Boost
- Did you try to use the library? With what compiler? Did you have any
problems?
As mentioned, I ran the test suite on various versions of msvc (all passed),
MinGW gcc (warnings in c++03 mode, errors in c++11 and above), and Cygwin
gcc/clang (errors because of missing ::getenv and friends).
The problem MinGW gcc 7 (and Cygwin clang, once I enabled the Windows path
in cenv.hpp) identified was
..\../boost/nowide/cenv.hpp:104:41: warning: comparison between pointer and
zero character constant [-Wpointer-compare]
while(*key_end!='=' && key_end!='\0')
^~~~
Oops, that is serious, I'll fix it.

https://github.com/artyom-beilis/nowide/issues/14
Post by Peter Dimov via Boost
which indicates that (1) the library is missing a test for putenv with the
string not containing '=', and (2) the author should consider testing on
Cygwin/Mingw, either locally, or on Appveyor. These two may not be important
production platforms, but they enable testing the Windows-specific code
paths with g++ and clang++, which as the above shows, may uncover potential
problems that testing on MSVC alone would miss.
In general I tested using MinGW and MSVC; I also never tried clang on
Windows - it is probably time to.
Post by Peter Dimov via Boost
Post by Frédéric Bron via Boost
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
Quick reading.
Post by Frédéric Bron via Boost
- Are you knowledgeable about the problem domain?
More or less.
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library?
Yes. Nowide should be accepted.
Thank You for your effort and valuable inputs.

Artyom Beilis

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Peter Dimov via Boost
2017-06-14 13:23:21 UTC
Permalink
Post by Artyom Beilis via Boost
Post by Peter Dimov via Boost
Also, the test Jamfile is missing
import testing ;
...
Post by Artyom Beilis via Boost
What it does and why do I need one?
It contains the definitions of the testing rules such as 'run'. Not having
it is like a missing #include in C++. It only works because someone happened
to #include it first.
Post by Artyom Beilis via Boost
Post by Peter Dimov via Boost
There are no .travis.yml and appveyor.yml files in the root, which is
odd in this day and age. :-)
What do I miss?
If you're not familiar with the Travis and Appveyor CIs, you should check
them out, because they are very valuable. What they do is integrate with
Github and perform an automatic test on every repo push, which, if set up
correctly, includes pull requests. Github then displays the test results on
the PR page. This is very useful if you want to decide whether to merge a
PR.

You can look at the Boost.System .travis.yml and appveyor.yml files as
examples.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Frédéric Bron via Boost
2017-06-15 12:32:03 UTC
Permalink
This simple program compiles nicely on Linux+gcc 6.3.0 but fails to
compile on x86_64-w64-mingw (6.3.0, std=c++14):

#include "nowide/fstream.hpp"
int main() {
auto f = nowide::fstream("toto.log");
return 0;
}

Instead of auto f = ..., writing

nowide::fstream f("toto.log");

works fine.

The error is:
toto.cpp:4:38: error: use of deleted function «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) »
auto f = nowide::fstream("toto.log");
^
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: note : «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) » is implicitly deleted because the
default definition would be ill-formed:
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_iostream<_CharT, _Traits>::basic_iostream(const
std::basic_iostream<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_t
raits<char>] »
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:65:0,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/istream:863:7:
note : declared here
basic_iostream(const basic_iostream&) = delete;
^~~~~~~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: use of deleted function «
std::basic_ios<_CharT, _Traits>::basic_ios(const
std::basic_ios<_CharT, _Traits>&) [with _CharT = char; _Traits =
std::char_traits<char>] »
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ios:44:0,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/ostream:38,
from
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/iterator:64,
from ./nowide/encoding_utf.hpp:14,
from ./nowide/convert.hpp:12,
from nowide/fstream.hpp:13,
from toto.cpp:1:
/softs/mingw64-6.3.0/x86_64-w64-mingw32/include/c++/6.3.0/bits/basic_ios.h:475:7:
note : declared here
basic_ios(const basic_ios&) = delete;
^~~~~~~~~
In file included from toto.cpp:1:0:
nowide/fstream.hpp:190:11: error: «
nowide::scoped_ptr<T>::scoped_ptr(const nowide::scoped_ptr<T>&) [with
T = nowide::basic_filebuf<char>] » is private within this context
class basic_fstream : public std::basic_iostream<CharType,Traits>
^~~~~~~~~~~~~
In file included from nowide/fstream.hpp:14:0,
from toto.cpp:1:
./nowide/scoped_ptr.hpp:31:5: note : declared private here
scoped_ptr(scoped_ptr const &);

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/l
Artyom Beilis via Boost
2017-06-15 18:01:11 UTC
Permalink
On Thu, Jun 15, 2017 at 3:32 PM, Frédéric Bron via Boost
Post by Frédéric Bron via Boost
This simple program compiles nicely on Linux+gcc 6.3.0 but fails to
#include "nowide/fstream.hpp"
int main() {
auto f = nowide::fstream("toto.log");
return 0;
}
instead of
auto f=...,
nowide::fstream f("toto.log")
works fine.
toto.cpp:4:38: erreur : use of deleted function «
nowide::basic_fstream<char>::basic_fstream(const
nowide::basic_fstream<char>&) »
auto f = nowide::fstream("toto.log");
^
Ok I see,

Nowide is missing some of the C++11 interfaces - for example the move
constructor - but it shouldn't be hard to implement them.

Also, I wasn't aware that auto f = nowide::fstream("toto.log"); can
actually call the move constructor.
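(A minimal illustration of why, unrelated to Nowide itself - it is the
pre-C++17 copy-initialization rule:)

struct stream_like
{
    stream_like() {}
    stream_like(const stream_like&) = delete; // a deleted copy constructor also
                                               // suppresses the implicit move
};

int main()
{
    stream_like a;                 // direct-initialization: fine
    // auto b = stream_like();     // copy-initialization: ill-formed before
                                   // C++17, because it needs a usable copy or
                                   // move constructor even though the move
                                   // would be elided in practice.
    (void)a;
    return 0;
}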

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.o
Frédéric Bron via Boost
2017-06-15 18:26:24 UTC
Permalink
Post by Artyom Beilis via Boost
Also I wasn't aware that auto f = nowide::fstream("toto.log"); Can
actually call move
constructor.
I am also surprised, I thought it would be equivalent.
And I use the same gcc version (6.3.0) and it causes no issue on
Linux, only with the MinGW cross compiler...

Frédéric

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/b
Artyom Beilis via Boost
2017-06-15 19:47:00 UTC
Permalink
On Linux, nowide's fstream is just an alias for std::fstream.

Artyom
Post by Artyom Beilis via Boost
Also I wasn't aware that auto f = nowide::fstream("toto.log"); Can
actually call move
constructor.
I am also surprised, I though it would be equivalent.
And I use the same gcc version (6.3.0) and it causes no issue on
linux, only on mingw cross compiler...

Frédéric

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/bo
Frédéric Bron via Boost
2017-06-15 19:53:55 UTC
Permalink
Post by Artyom Beilis via Boost
at Linux nowide fstream is
an alias to std::fstream
Of course!

Frédéric

_______________________________________________
Unsubscribe & other changes: http://lists.
Edward Diener via Boost
2017-06-16 21:06:10 UTC
Permalink
Post by Frédéric Bron via Boost
Hi Everyone,
The formal review of Artyom Beilis' Nowide library starts today and
will last until Wed. 21st of June.
Your participation is encouraged, as the proposed library is
uncoupled, focused and rather small. Nowadays, everyone needs to
handle Unicode but this is very difficult using only the standard
library in a platform independant way. Nowide offers a very simple way
to handle Unicode the same way on
Windows/MacOS/Linux.
* work with UTF-8 in your code, Nowide converts to OS encoding
* Easy to use functions for converting UTF-8 to/from UTF-16
* A class to fixing argc, argc and env main parameters to use UTF-8
- stdio.h functions (fopen, freopen, remove, rename)
- stdlib.h functions (system, getenv, setenv, unsetenv, putenv)
- fstream (filebuf, fstream/ofstream/ifstream)
- iostream (cout, cerr, clog, cin)
http://cppcms.com/files/nowide/html
https://github.com/artyom-beilis/nowide
git clone https://github.com/artyom-beilis/nowide.git
https://github.com/artyom-beilis/nowide/archive/master.zip
- as a standalone library: http://cppcms.com/files/nowide/nowide_standalone.zip
Nowide has standard boost layout. So, just put it into the boost
source tree under libs/nowide directory for building and running
tests.
Or alternatively - for header only part of the library add the
nowide/include path to the include path of your projects.
How do I build the documentation for the nowide library ?
Post by Frédéric Bron via Boost
Please post your comments and review to the boost mailing list
(preferably), or privately to the Review Manager (to me ;-). Here are
- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any
problems?
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
- Are you knowledgeable about the problem domain?
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.
http://www.boost.org/community/reviews.html
Thank you very much for your time and efforts.
Frédéric
_______________________________________________
Unsubscribe & other changes
Artyom Beilis via Boost
2017-06-16 21:13:53 UTC
Permalink
On Sat, Jun 17, 2017 at 12:06 AM, Edward Diener via Boost
Post by Edward Diener via Boost
Post by Frédéric Bron via Boost
Nowide has standard boost layout. So, just put it into the boost
source tree under libs/nowide directory for building and running
tests.
Or alternatively - for header only part of the library add the
nowide/include path to the include path of your projects.
How do I build the documentation for the nowide library ?
Just run doxygen (it hasn't been integrated into Boost.Build yet).

Also note full docs can be found there: http://cppcms.com/files/nowide/html/

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Zach Laine via Boost
2017-06-19 17:19:22 UTC
Permalink
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <
Post by Frédéric Bron via Boost
Hi Everyone,
The formal review of Artyom Beilis' Nowide library starts today and
will last until Wed. 21st of June.
[snip]
Please post your comments and review to the boost mailing list
(preferably), or privately to the Review Manager (to me ;-). Here are
- What is your evaluation of the design?
1) I'd really much rather have an iterator-based interface for the
narrow/wide conversions. There's an existing set of iterators in
Boost.Regex already, and I've recently written one here:

https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp

The reliance on a new custom string type is a substantial mistake, IMO
(boost::nowide::basic_stackstring). Providing an iterator interface
(possibly cribbing one of the two implementations above) would negate
the need for this new string type -- I could use the existing std::string,
MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
that the new interfaces be defined in terms of string_view instead of
string/basic_stackstring (there's also a string_view implementation already
in Boost.Utility). string_view is simply far more usable, since it binds
effortlessly to either a char const * or a string.

2) I don't really understand what happens when a user passes a valid
Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
the invalid UTF-16 filename just broken in the process of trying to convert
it to UTF-8? This is partially a documentation problem, but until I
understand how this is intended to work, I'm also counting it as a design
issue.
Post by Frédéric Bron via Boost
- What is your evaluation of the implementation?
I did not look.
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion
is quite thin, and drilling down into the reference did not answer at least
one question I had (the one above, about invalid UTF-16):

Looking at some example code in the "Using the Library" section, I saw this:

"
To make this program handle Unicode properly, we do the following changes:

#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>
int main(int argc,char **argv)
{
boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
"

Ok, so I clicked "boost::nowide::args", hoping for an answer. The detailed
description for args says:

"
args is a class that fixes standard main() function arguments and changes
them to UTF-8 under Microsoft Windows.

The class uses GetCommandLineW(), CommandLineToArgvW() and
GetEnvironmentStringsW() in order to obtain the information. It does not
relates to actual values of argc,argv and env under Windows.

It restores the original values in its destructor
"

It tells me nothing about what happens when invalid UTF-16 is encountered.
Is there an exception? Is 0xfffd inserted? If the latter, am I just
stuck? I should not have to read any source code to figure this out, but
it looks like I have to.

This criticism can be applied to most of the documentation. My preference
is that the semantics of primary functionality of the library should be
explained in tutorials or other non-reference formats. The current state
of the docs doesn't even explain things in the references. This must be
fixed before this library can be accepted.
Post by Frédéric Bron via Boost
- What is your evaluation of the potential usefulness of the library?
I think this library is attempting to address a real and important issue.
I just can't figure out if it's a complete solution, because how invalid
UTF-16 is treated remains a question.
Post by Frédéric Bron via Boost
- Did you try to use the library? With what compiler? Did you have any
problems?
I did not.

- How much effort did you put into your evaluation? A glance? A quick
Post by Frédéric Bron via Boost
reading? In-depth study?
A quick reading, plus a bit of discussion on the list.
Post by Frédéric Bron via Boost
- Are you knowledgeable about the problem domain?
I understand the UTF-8 issues reasonably well, but am ignorant of the
Windows-specific issues.
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.
I do not think the library should be accepted in its current form. It
seems not to handle malformed UTF-16, which is a requirement for processing
Windows file names (as I understand it -- please correct this if I'm
wrong). Independent of this, I don't find the docs to be sufficient.

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/m
Frédéric Bron via Boost
2017-06-19 17:37:20 UTC
Permalink
Question to all:

Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
1. such files should not exist
2. if they exist, why?
- if it is because it is an old file, can the user just rename it properly?
- if it is because it was produced by a program, why should this
program continue to work without being fixed? Isn't that the surest way to
keep wrong filenames around forever?

I do not understand why we cannot just issue an error.

Thanks for explanations,

Frédéric
--
Frédéric Bron

-----------------------------------------------------------
Frédéric Bron (***@m4x.org)
Villa des 4 chemins, Centre Hospitalier, BP 208
38506 Voiron Cedex
tél. fixe : +33 4 76 67 17 27, tél. port.: +33 6 67 02 77 35

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinf
Zach Laine via Boost
2017-06-19 17:43:34 UTC
Permalink
On Mon, Jun 19, 2017 at 12:37 PM, Frédéric Bron via Boost <
Post by Frédéric Bron via Boost
Why should we try to handle wrong UTF-16 (or wrong UTF-8)?
1. such files should not exist
2. if they exist, why?
- if it is because it is an old file, can the user just rename it properly?
- if it is because it was produced by a program, why should this
program continue to work without fixing? Isn't it the best way that we
get wrong filenames forever?
This came up at C++Now last month. My understanding, from talking to
people who seem to know about this, is that such filenames are considered
valid by Windows. To issue an error in such a case means not allowing
users to access files that Windows sees as well-formed.

Zach

_______________________________________________
Unsubscribe & other changes: http://lis
Frédéric Bron via Boost
2017-06-19 19:40:52 UTC
Permalink
Post by Zach Laine via Boost
This came up at C++Now last month. My understanding, from talking to
people who seem to know about this, is that such filenames are considered
valid by Windows. To issue an error in such a case means not allowing
users to access files that Windows sees as well-formed.
but who/what needs ill-formed filenames? For what reason?

Frédéric

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailm
Peter Dimov via Boost
2017-06-19 19:52:12 UTC
Permalink
Post by Frédéric Bron via Boost
Post by Zach Laine via Boost
This came up at C++Now last month. My understanding, from talking to
people who seem to know about this, is that such filenames are
considered valid by Windows. To issue an error in such a case means not
allowing users to access files that Windows sees as well-formed.
but who/what needs ill-formed filenames? For what reason?
The question whether to support such filenames is more a matter of
principle. Namely, whether it is the job of the Nowide library to tell you
what filenames you may use or not. One might argue that it should be of no
concern to it whether, or for what reason, I need such filenames.


_______________________________________________
Unsubscribe & other changes:
degski via Boost
2017-06-20 05:16:30 UTC
Permalink
Post by Peter Dimov via Boost
The question whether to support such filenames is more a matter of
principle. Namely, whether it is the job of the Nowide library to tell you
what filenames you may use or not. One might argue that it should be of no
concern to it whether, or for what reason, I need such filenames.
I think the ruling principle should be the Unicode standard. Nowide should
support the Unicode standard and no more (at least in its intentions). One
of the intentions of the standard, as I read it, is to guarantee that a
conversion can *unambiguously (and safely) round-trip*. On Windows, if
WTF-8 (what's in a name? Apart from the obvious, even Wobbly sounds
bizarre), CESU-8 or Modified UTF-8 are ways to achieve that, then I think
they should be supported. If not, an exception should be thrown when
encountering these invalid encodings, as this is in my view an IO issue
(the uncertain environment in which an application has to live), in which
context throwing seems to be the norm.

I'm with Frédéric Bron on this one, though. I don't understand why invalid
encodings are found in the wild in the first place and why they should
continue to exist in the wild. The whole thing sounds like a vector for
exploits, malicious code to generate invalid encodings after which
buffer-overruns open up the system.

Something I cannot understand is that some of those on this list who are
most critical of Windows in view of security concerns are also the same
people who happily perpetuate these weaknesses. Microsoft is very much out in
front here, dropping support for old OSes (and therefore their associated
compilers and CRTs); Boost should do the same and adopt rigour: "Ceterum
autem censeo Carthaginem esse delendam"

What's not terribly clear either is whether we are talking about NTFS or
Explorer (and then there is WinFS lurking in the shadows, bound to be
released someday). Windows does (many) other things that are weird, like
the way it deals with capitalized names, long paths etc.; NTFS itself does
not actually have these issues and can do quite a bit more than what the
Explorer shell exposes.

degski
--
"*Ihre sogenannte Religion wirkt bloß wie ein Opiat reizend, betäubend,
Schmerzen aus Schwäche stillend.*" - Novalis 1798

_______________________________________________
Unsubscribe & other ch
Artyom Beilis via Boost
2017-06-19 19:57:24 UTC
Permalink
On Mon, Jun 19, 2017 at 8:19 PM, Zach Laine via Boost
Post by Zach Laine via Boost
On Sun, Jun 11, 2017 at 11:20 PM, Frédéric Bron via Boost <
Post by Frédéric Bron via Boost
- What is your evaluation of the design?
1) I'd really much rather have an iterator-based interface for the
narrow/wide conversions. There's an existing set of iterators in
https://github.com/tzlaine/text/blob/master/boost/text/utf8.hpp
There is one in Boost.Locale as well - Boost.Nowide uses the UTF conversions
from the header-only part of Boost.Locale.
Post by Zach Laine via Boost
The reliance on a new custom string type is a substantial mistake, IMO
(boost::nowide::basic_stackstring).
[snip]
(possibly cribbing the one of the two implementations above) would negate
the need for this new string type -- I could use the existing std::string,
MyString, QString, a char buffer, or whatever. Also, I'd greatly prefer
that the new interfaces be defined in terms of string_view instead of
string/basic_stackstring (there's also a string_view implementation already
Boost.Utility). string_view is simply far more usable, since it binds
effortlessly to either a char const * or a string.
stackstring is merely an on-stack buffer optimization - it isn't a general
string and was never intended to be one.

It isn't in the detail namespace because it can actually be useful outside
the library's scope.
Post by Zach Laine via Boost
Providing an iterator interface
There is no iterator interface in the communication with the OS, so there is
no sense in adding one there. If you want a character-by-character interface,
it exists in Boost.Locale.
Post by Zach Laine via Boost
2) I don't really understand what happens when a user passes a valid
Windows filename that is *invalid* UTF-16 to a program using Nowide. Is
the invalid UTF-16 filename just broken in the process of trying to convert
it to UTF-8? This is partially a documentation problem, but until I
understand how this is intended to work, I'm also counting it as a design
issue.
You are right, it wasn't clear in the docs. My policy was to return an error
in case of invalid encoding, but after the comments I received I will instead
replace invalid sequences with the U+FFFD replacement character, since that is
what the WinAPI usually does.

https://github.com/artyom-beilis/nowide/issues/16

It is something easy to fix.
Post by Zach Laine via Boost
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
I think the documentation needs a bit of work. The non-reference portion
is quite thin, and drilling down into the reference did not answer at least
Actually the documentation does point out that an error is returned in case
of invalid character encoding, and how (in most places - but there are some
misses).

However I agree that it can be improved, and a special section regarding the
bad-encoding conversion policy should and will be added.
Post by Zach Laine via Boost
It tells me nothing about what happens when invalid UTF-16 is encountered.
Is there an exception? Is 0xfffd inserted? If the latter, am I just
stuck? I should not have to read any source code to figure this out, but
it looks like I have to.
Valid point.
Post by Zach Laine via Boost
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library? Be
sure to say this explicitly so that your other comments don't obscure
your overall opinion.
I do not think the library should be accepted in its current form. It
seems not to handle malformed UTF-16, which is a requirement for processing
Windows file names (as I understand it -- please correct this if I'm
wrong).
I strongly disagree with that. Just as converting an invalid ANSI encoding
shouldn't lead to invalid UTF-16, or the other way around.

Since there is no way to represent invalid UTF-16 using valid UTF-8 - just as
there is no way to do it for ANSI <-> UTF-16 - there is no unambiguous way to
provide such support.
Post by Zach Laine via Boost
Zach
Thank You for your review.

Artyom

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/
Norbert Wenzel via Boost
2017-06-20 21:14:09 UTC
Permalink
This is my review of the proposed Boost.Nowide library.
Post by Frédéric Bron via Boost
- What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known as
they resemble APIs from the standard library. I understand (after
reading the discussions on this list) why Nowide does not convert all
input to a common encoding on all platforms but only deals with wide
strings on Windows. It was a bit of a surprise at first, so this design
decision should be documented more prominently as others have already
pointed out.
In general I like the idea that only valid Unicode is accepted as input,
although I had an issue with invalid encoding written to nowide::cout
that set the stream's failbit and ignored all subsequent output. The
author was already aware of that issue and said invalid characters will
be replaced with U+FFFD (the replacement character) after the review. I prefer
this solution to stopping all output after any failure.
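For reference, the kind of code that triggered it was roughly the following
(a reconstruction, not my literal test program; the stray "\xFF" byte stands
in for the invalid input I actually used):

    #include <boost/nowide/iostream.hpp>

    int main()
    {
        // "\xFF" is not valid UTF-8; on Windows the conversion to the wide
        // console API fails and the stream's failbit is set ...
        boost::nowide::cout << "bad byte: \xFF\n";
        // ... and everything written afterwards is silently ignored.
        boost::nowide::cout << "this line never appeared\n";
    }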
Post by Frédéric Bron via Boost
- What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented
when I tested the library.
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
It is enough to start working with the library. I would like to see more
clearly for every function when and how input strings are converted to a
different encoding and what happens in case of a conversion error
(exception, failbit of stream is set, invalid character, etc.).
Post by Frédéric Bron via Boost
- What is your evaluation of the potential usefulness of the library?
The library is very useful and I intend to use it as a portable
fire-and-forget solution to typical Unicode problems on Windows. I think
it will be a very helpful library especially when porting software from
Posix to Windows, since the library enables dealing with Unicode without
dealing with (and first learning about) the Windows Unicode idiosyncrasies.
As it came up in the discussion, I do not think I ever encountered
non-normalized Unicode file paths on Windows so I do not think the lack
of support for such filepaths is detrimental to the usefulness of this
library.
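For context, my expected usage is essentially just a change of namespace,
along these lines (a sketch of my toy test, assuming the Boost-layout
headers):

    #include <boost/nowide/args.hpp>
    #include <boost/nowide/fstream.hpp>
    #include <boost/nowide/iostream.hpp>
    #include <string>

    int main(int argc, char** argv)
    {
        boost::nowide::args a(argc, argv);   // argv arrives as UTF-8 on Windows too
        if(argc < 2)
            return 1;

        boost::nowide::ifstream in(argv[1]); // UTF-8 file name on every platform
        std::string line;
        while(std::getline(in, line))
            boost::nowide::cout << line << '\n';
    }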
Post by Frédéric Bron via Boost
- Did you try to use the library? With what compiler? Did you have any
problems?
I tested with MSVC14 and had no technical problems building or using the
library.
Post by Frédéric Bron via Boost
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
I read the discussions on the list, the documentation and wrote a small
toy program to test my expected usage of Nowide. All in all I invested
about three hours.
Post by Frédéric Bron via Boost
- Are you knowledgeable about the problem domain?
Somewhat. I had to work around typical Unicode issues on Windows
multiple times, but always stopped when the issue was solved. So my
knowledge is mostly driven by the specific issues I had so far and not
profound.
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library?
Yes.

Norbert


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.
Artyom Beilis via Boost
2017-06-22 20:31:41 UTC
Permalink
On Wed, Jun 21, 2017 at 12:14 AM, Norbert Wenzel via Boost
Post by Norbert Wenzel via Boost
This is my review of the proposed Boost.Nowide library.
Post by Frédéric Bron via Boost
- What is your evaluation of the design?
I had no troubles finding my way. Most APIs are already well known as
they resemble APIs from the standard library. I understand (after
reading the discussions on this list) why Nowide does not convert all
input to a common encoding on all platforms but only deals with wide
strings on Windows. It was a bit of a surprise at first, so this design
decision should be documented more prominently as others have already
pointed out.
Yes, indeed, I'll have to provide a section regarding the library policy
in one of the first sections of the tutorial.
Post by Norbert Wenzel via Boost
Post by Frédéric Bron via Boost
- What is your evaluation of the implementation?
I did not look at the implementation. It worked as expected/documented
when I tested the library.
Post by Frédéric Bron via Boost
- What is your evaluation of the documentation?
It is enough to start working with the library. I would like to see more
clearly for every function when and how input strings are converted to a
different encoding and what happens in case of a conversion error
(exception, failbit of stream is set, invalid character, etc.).
Indeed - it should be improved.
Post by Norbert Wenzel via Boost
Post by Frédéric Bron via Boost
- Do you think the library should be accepted as a Boost library?
Yes.
Norbert
Thank You very much for the review and the comments.

Artyom Beilis

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman
