Currently, the encoding of strings in VTK is not explicitly specified. When a string is received from an external library or passed to an operating system call (e.g., reading or writing files), the behavior is often incorrect:
- files that have non-ASCII characters in their name cannot be opened
- when the application locale is changed (so that some necessary special characters can be stored in a single byte), generated files may become invalid (e.g., because the decimal point is replaced by a decimal comma)
- Python and Qt store strings with a known encoding, but there is no way to convert them to/from strings in VTK without loss of information
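The decimal comma problem can be reproduced without installing any system locale. The following is a minimal sketch using only the C++ standard library; the facet `DecimalCommaPunct` and the helpers `FormatWithLocale` and `FormatForFile` are hypothetical names introduced for illustration and are not part of VTK.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that mimics a locale with a decimal comma.
struct DecimalCommaPunct : std::numpunct<char>
{
  char do_decimal_point() const override { return ','; }
};

// Format a value using the number formatting rules of the given locale.
std::string FormatWithLocale(double value, const std::locale& loc)
{
  std::ostringstream os;
  os.imbue(loc);
  os << value;
  return os.str();
}

// Format a value for file output, independently of the application locale,
// by always using the classic "C" locale.
std::string FormatForFile(double value)
{
  return FormatWithLocale(value, std::locale::classic());
}
```

With a locale built as `std::locale(std::locale::classic(), new DecimalCommaPunct)`, `FormatWithLocale(3.5, loc)` writes a comma, while `FormatForFile(3.5)` keeps the decimal point that file parsers typically expect.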
There is a vtkUnicodeString class in VTK that stores strings with a known encoding. It can store and provide strings in UTF-8 and UTF-16 encoding. It is already used extensively in text rendering, arrays, tables, and certain file export/import code, but the majority of VTK still uses const char* and get/set macros.
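To illustrate why a known encoding matters, here is a minimal sketch of lossless conversion between UTF-8 and UTF-16 using only the standard library. It does not use vtkUnicodeString; the function names `Utf8ToUtf16` and `Utf16ToUtf8` are assumptions made for this example, and the decoder is deliberately simple (it rejects malformed lead and continuation bytes but does not reject overlong encodings).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes and re-encode them as UTF-16 code units.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < utf8.size())
  {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::uint32_t cp = 0;
    std::size_t extra = 0; // number of continuation bytes that follow
    if (lead < 0x80) { cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { throw std::invalid_argument("invalid UTF-8 lead byte"); }
    if (i + extra >= utf8.size())
    {
      throw std::invalid_argument("truncated UTF-8 sequence");
    }
    for (std::size_t k = 1; k <= extra; ++k)
    {
      const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
      if ((next & 0xC0) != 0x80)
      {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      cp = (cp << 6) | (next & 0x3F);
    }
    i += extra + 1;
    if (cp < 0x10000)
    {
      out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
      // Code points above U+FFFF need a surrogate pair in UTF-16.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}

// Decode UTF-16 code units and re-encode them as UTF-8 bytes.
std::string Utf16ToUtf8(const std::u16string& utf16)
{
  std::string out;
  for (std::size_t i = 0; i < utf16.size(); ++i)
  {
    std::uint32_t cp = utf16[i];
    if (cp >= 0xD800 && cp <= 0xDBFF)
    {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= utf16.size() || utf16[i + 1] < 0xDC00 || utf16[i + 1] > 0xDFFF)
      {
        throw std::invalid_argument("unpaired high surrogate");
      }
      cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
      ++i;
    }
    else if (cp >= 0xDC00 && cp <= 0xDFFF)
    {
      throw std::invalid_argument("unpaired low surrogate");
    }
    if (cp < 0x80)
    {
      out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Such a round trip is only well defined because both sides know which encoding the bytes are in; a bare const char* carries no such information.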
Using const char* for string storage, managing memory with vtkSetStringMacro/vtkGetStringMacro, and processing strings with C string functions are all very outdated programming practices. VTK-based applications must choose between following VTK's approach and being stuck with outdated practices, or breaking away from it and living with inconsistencies in the code base while managing strings cautiously (additional conversions and null-pointer checks are needed). Neither option is good.
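The difference between the two styles can be sketched as follows. `LegacyReader` is a simplified, hand-written approximation of the const char* pattern, not the actual expansion of vtkSetStringMacro (which, for example, also marks the object as modified); `ModernReader` shows the same property held in a std::string. Both class names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Approximation of the const char* pattern: manual allocation, and a getter
// that may return a null pointer.
class LegacyReader
{
public:
  LegacyReader() : FileName(nullptr) {}
  ~LegacyReader() { delete[] this->FileName; }
  LegacyReader(const LegacyReader&) = delete;
  LegacyReader& operator=(const LegacyReader&) = delete;

  void SetFileName(const char* name)
  {
    if (this->FileName == nullptr && name == nullptr)
    {
      return; // nothing to change
    }
    if (this->FileName && name && std::strcmp(this->FileName, name) == 0)
    {
      return; // same content, keep the existing buffer
    }
    delete[] this->FileName;
    this->FileName = nullptr;
    if (name)
    {
      const std::size_t size = std::strlen(name) + 1;
      this->FileName = new char[size];
      std::memcpy(this->FileName, name, size);
    }
  }

  // Callers must check for nullptr before using the result.
  const char* GetFileName() const { return this->FileName; }

private:
  char* FileName;
};

// The same property with std::string: no manual memory management and no
// null state to check.
class ModernReader
{
public:
  void SetFileName(const std::string& name) { this->FileName = name; }
  const std::string& GetFileName() const { return this->FileName; }

private:
  std::string FileName;
};
```

Note that neither variant records which encoding the file name is in, which is the gap an encoding-aware string class is meant to close.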
It would be possible to declare that "all strings in VTK are UTF-8 encoded" (see the manifesto of the advocates of this approach). It is certainly compelling that applications can be made to support Unicode without many changes, but it would be hard to enforce this requirement across all of VTK and all classes of VTK-based applications. It would be especially difficult to ensure that strings are transcoded whenever data is passed between VTK and other libraries and whenever VTK calls system APIs. In the long term, when all software libraries assume that std::string is UTF-8 encoded and Windows uses a UTF-8 code page, UTF-8 everywhere is definitely a good solution. It is questionable, however, whether we should try to jump there in one step or go via an intermediate step of using a dedicated encoding-aware string class.
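If an "all strings are UTF-8" convention were adopted, one way to detect violations at library boundaries would be to validate incoming strings. The following is a minimal sketch of such a check; the function name `IsValidUtf8` is an assumption for this example, and VTK does not necessarily provide an equivalent.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if the bytes form well-formed UTF-8: correct sequence lengths,
// no overlong encodings, no surrogates, nothing above U+10FFFF.
bool IsValidUtf8(const std::string& s)
{
  std::size_t i = 0;
  while (i < s.size())
  {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80)
    {
      ++i; // plain ASCII
      continue;
    }
    std::size_t extra = 0;   // number of continuation bytes expected
    std::uint32_t cp = 0;    // decoded code point
    std::uint32_t minCp = 0; // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
    else { return false; } // stray continuation byte or invalid lead byte
    if (s.size() - i <= extra)
    {
      return false; // sequence is truncated
    }
    for (std::size_t k = 1; k <= extra; ++k)
    {
      const unsigned char next = static_cast<unsigned char>(s[i + k]);
      if ((next & 0xC0) != 0x80)
      {
        return false;
      }
      cp = (cp << 6) | (next & 0x3F);
    }
    if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    {
      return false; // overlong, out of range, or a surrogate
    }
    i += extra + 1;
  }
  return true;
}
```

A check like this catches a Latin-1 string passed where UTF-8 is expected only when it happens to be malformed UTF-8, which is part of why enforcement by convention alone is fragile.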
See some more discussion here: https://discourse.vtk.org/t/proposal-should-we-replace-vtkstdstring-with-std-string/796/14
Use an encoding-aware string class in VTK to store all strings.
vtkUnicodeString is a good starting point, as it can store any string with a known encoding and it is already in VTK, where it is used in a number of classes. vtkDICOMFilePath contains useful conversion code, too.
It could be renamed to vtkString to make the name shorter. The name is also clearer if it does not include a particular encoding (in the future we might support multiple encodings inside the string class, not necessarily just Unicode). This would also be consistent with how other libraries name their string classes (see, for example, Qt's QString and Gnome's Glib::ustring). There are a number of other implementations, such as tiny-utf8.
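As a rough illustration of what an encoding-aware string class could look like, here is a minimal sketch that always stores UTF-8 internally and converts to and from Latin-1 on request. The class name `EncodedString` and all of its methods are hypothetical; this is not the vtkUnicodeString API, and `FromUtf8` assumes that its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding-aware string that stores UTF-8 internally.
class EncodedString
{
public:
  // Assumes the input is valid UTF-8; no validation is performed here.
  static EncodedString FromUtf8(const std::string& utf8)
  {
    EncodedString s;
    s.Storage = utf8;
    return s;
  }

  // Every Latin-1 byte maps to the Unicode code point of the same value.
  static EncodedString FromLatin1(const std::string& latin1)
  {
    EncodedString s;
    for (char ch : latin1)
    {
      const unsigned char c = static_cast<unsigned char>(ch);
      if (c < 0x80)
      {
        s.Storage += ch;
      }
      else
      {
        s.Storage += static_cast<char>(0xC0 | (c >> 6));
        s.Storage += static_cast<char>(0x80 | (c & 0x3F));
      }
    }
    return s;
  }

  const std::string& ToUtf8() const { return this->Storage; }

  // Writes the Latin-1 form into 'latin1'. Characters above U+00FF are
  // replaced by '?', and the function then returns false.
  bool ToLatin1(std::string& latin1) const
  {
    latin1.clear();
    bool lossless = true;
    const std::size_t size = this->Storage.size();
    std::size_t i = 0;
    while (i < size)
    {
      const unsigned char lead = static_cast<unsigned char>(this->Storage[i]);
      if (lead < 0x80)
      {
        latin1 += static_cast<char>(lead);
        ++i;
      }
      else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < size)
      {
        // Two-byte sequences with these lead bytes cover U+0080..U+00FF.
        const unsigned char next = static_cast<unsigned char>(this->Storage[i + 1]);
        latin1 += static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F));
        i += 2;
      }
      else
      {
        // Not representable in Latin-1: emit '?' and skip the whole sequence.
        latin1 += '?';
        lossless = false;
        ++i;
        while (i < size && (static_cast<unsigned char>(this->Storage[i]) & 0xC0) == 0x80)
        {
          ++i;
        }
      }
    }
    return lossless;
  }

private:
  std::string Storage;
};
```

Because construction always goes through a named factory, the caller has to state the encoding of the incoming bytes, which is the main property that a plain const char* setter lacks.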
vtkDICOMFilePath could perhaps be added to VTK for storing file paths (as vtkFilePath, possibly with vtkString as its parent class). File paths need some extra features, such as conversion of slashes and handling of extended paths (\\?\) on Windows.
- Rename vtkUnicodeString to vtkString. Maybe improve the API by adding methods to get the string as Latin-1 and to set it from Latin-1.
- Replace all string attributes in VTK classes with vtkString (add new get/set macros, create the object instance in the constructor)
- Update Python wrapping
- Add converters to/from QString
- Maybe add automatic converters to const char* (which can be disabled by CMake flags) to make updating application code easier
- Review all operating system calls and make sure strings are properly converted
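As an example of the kind of conversion the last step refers to, here is a minimal sketch of opening a file whose path is given in UTF-8. The helper names `OpenFileUtf8` and `Utf8ToWide` are hypothetical. On Windows the path is converted to UTF-16 with `MultiByteToWideChar` and opened with `_wfopen`; on other platforms the sketch assumes that the file system accepts UTF-8 byte strings directly.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Convert a UTF-8 string to a UTF-16 wide string; returns false on failure.
static bool Utf8ToWide(const std::string& utf8, std::wstring& wide)
{
  // With a length of -1 the returned count includes the terminating null.
  const int count = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
  if (count <= 0)
  {
    return false;
  }
  wide.assign(static_cast<std::size_t>(count), L'\0');
  return MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], count) > 0;
}
#endif

// Open a file whose path is UTF-8 encoded. Returns nullptr on failure.
std::FILE* OpenFileUtf8(const std::string& utf8Path, const std::string& mode)
{
#ifdef _WIN32
  std::wstring widePath;
  std::wstring wideMode;
  if (!Utf8ToWide(utf8Path, widePath) || !Utf8ToWide(mode, wideMode))
  {
    return nullptr;
  }
  return _wfopen(widePath.c_str(), wideMode.c_str());
#else
  // Assumption: the platform treats file names as UTF-8 byte strings.
  return std::fopen(utf8Path.c_str(), mode.c_str());
#endif
}
```

Routing every file open through one helper of this kind would make it easier to audit that no call site passes a string in the wrong encoding to the operating system.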