Currently, the encoding of strings in VTK is not explicitly specified. When a string is received from an external library or passed to an operating system call (e.g., for reading or writing files), the behavior is often incorrect:
- files that have non-ASCII characters in their name cannot be opened
- when the application locale is changed (so that some necessary special characters can be stored in a single byte), generated files may become invalid (e.g., because the decimal point is replaced by a decimal comma)
- Python and Qt store strings with a known encoding, but there is no way to convert them to/from strings in VTK without loss of information
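The decimal comma problem can be reproduced without relying on any locale installed on the system. The following is a minimal sketch, not VTK code: `DecimalComma`, `MakeCommaLocale`, and `FormatWithLocale` are hypothetical names introduced only for illustration. It shows that the same number is written differently depending on the stream's locale, and that imbuing the classic "C" locale keeps the output stable.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that mimics a locale using a decimal comma,
// so the example does not depend on locales installed on the system.
struct DecimalComma : std::numpunct<char>
{
protected:
  char do_decimal_point() const override { return ','; }
};

// Build a locale that behaves like the classic locale except for the decimal separator.
std::locale MakeCommaLocale()
{
  // The locale takes ownership of the facet.
  return std::locale(std::locale::classic(), new DecimalComma);
}

// Format a number using an explicitly chosen locale instead of the global one.
std::string FormatWithLocale(double value, const std::locale& loc)
{
  std::ostringstream os;
  os.imbue(loc);
  os << value;
  return os.str();
}
```

A file writer that always formats numbers with `std::locale::classic()` is not affected by the application changing its global locale for the sake of character encoding.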
There is a vtkUnicodeString class in VTK that can store a string with a known encoding. It can store and provide strings in UTF-8 and UTF-16 encoding. It is already used extensively in text rendering, arrays, tables, and certain file export/import code, but the majority of VTK still uses const char* and get/set macros.
Using const char* for string storage, managing memory with vtkSetStringMacro/vtkGetStringMacro, and processing strings with C string functions are all very outdated programming practices. VTK-based applications must choose between following VTK's approach and being stuck with outdated practices, or breaking away from it and living with inconsistencies in the code base while managing strings cautiously (additional conversions and null-pointer checks are needed). Neither option is good.
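To make the contrast concrete, here is a simplified sketch of the two styles. `LegacyNamed` only approximates the kind of manual memory management that a set/get string macro pair generates (it is not the actual VTK macro expansion), and `ModernNamed` is a hypothetical counterpart using std::string. Both class names are made up for this illustration.

```cpp
#include <cstring>
#include <string>

// Hypothetical class approximating the const char* pattern:
// manual allocation, manual copy, and a getter that may return nullptr.
class LegacyNamed
{
public:
  LegacyNamed() : Name(nullptr) {}
  ~LegacyNamed() { delete[] this->Name; }
  LegacyNamed(const LegacyNamed&) = delete;
  LegacyNamed& operator=(const LegacyNamed&) = delete;

  void SetName(const char* arg)
  {
    // Nothing to do if the value is unchanged.
    if (this->Name == nullptr && arg == nullptr) { return; }
    if (this->Name && arg && std::strcmp(this->Name, arg) == 0) { return; }
    delete[] this->Name;
    this->Name = nullptr;
    if (arg)
    {
      const std::size_t n = std::strlen(arg) + 1; // include the terminating null
      this->Name = new char[n];
      std::memcpy(this->Name, arg, n);
    }
  }

  // Callers must check for nullptr before using the result.
  const char* GetName() const { return this->Name; }

private:
  char* Name;
};

// Hypothetical counterpart: the string object owns its memory and is never null.
class ModernNamed
{
public:
  void SetName(const std::string& arg) { this->Name = arg; }
  const std::string& GetName() const { return this->Name; }

private:
  std::string Name;
};
```

With the first class, every caller needs a null-pointer check before calling a C string function on the result of `GetName()`; with the second, an unset name is simply an empty string.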
See some more discussion here: https://discourse.vtk.org/t/proposal-should-we-replace-vtkstdstring-with-std-string/796/14
Use an encoding-aware string class in VTK to store all strings.
vtkUnicodeString is a good starting point, as it can store any string with a known encoding, is already in VTK, and is used in a number of VTK classes.
It could be renamed to vtkString to make the name shorter. It is also clearer if we don't include the name of a particular encoding in the class name (as in the future we might support multiple encodings inside the string class, not necessarily just Unicode). This would also be consistent with how other libraries manage strings (see for example Qt's QString).
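As a rough illustration of what an encoding-aware string class with a neutral name could look like, here is a minimal sketch. `EncodedString` is a hypothetical stand-in, not an existing VTK class and not the vtkUnicodeString API. It assumes UTF-8 as the internal storage and offers named conversions in the style of QString's fromUtf8/fromLatin1 methods.

```cpp
#include <string>

// Hypothetical stand-in for an encoding-aware string class.
// Internal storage is assumed to be UTF-8.
class EncodedString
{
public:
  static EncodedString FromUtf8(const std::string& utf8)
  {
    EncodedString s;
    s.Utf8 = utf8; // stored as-is; no validation in this sketch
    return s;
  }

  // Latin-1 bytes 0x00-0xFF map directly to Unicode U+0000-U+00FF.
  static EncodedString FromLatin1(const std::string& latin1)
  {
    EncodedString s;
    for (unsigned char c : latin1)
    {
      if (c < 0x80)
      {
        s.Utf8.push_back(static_cast<char>(c));
      }
      else
      {
        // Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
        s.Utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
        s.Utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
      }
    }
    return s;
  }

  const std::string& ToUtf8() const { return this->Utf8; }

  // Lossy conversion: characters above U+00FF and malformed input become '?'.
  std::string ToLatin1() const
  {
    const std::string& s = this->Utf8;
    std::string out;
    std::size_t i = 0;
    while (i < s.size())
    {
      const unsigned char c = static_cast<unsigned char>(s[i]);
      if (c < 0x80)
      {
        out.push_back(static_cast<char>(c));
        ++i;
      }
      else if ((c == 0xC2 || c == 0xC3) && i + 1 < s.size() &&
               (static_cast<unsigned char>(s[i + 1]) & 0xC0) == 0x80)
      {
        // U+0080-U+00FF are encoded with lead byte 0xC2 or 0xC3.
        const unsigned char next = static_cast<unsigned char>(s[i + 1]);
        out.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
        i += 2;
      }
      else
      {
        out.push_back('?');
        ++i; // skip the lead byte and any continuation bytes that follow it
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        {
          ++i;
        }
      }
    }
    return out;
  }

private:
  std::string Utf8;
};
```

Because every entry and exit point names its encoding, callers cannot pass bytes in without stating what they mean, which is the property the plain const char* interface lacks.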
- Rename vtkUnicodeString to vtkString. Maybe improve the API by adding get-as/set-from Latin-1 methods.
- Replace all string attributes in VTK classes by vtkString (add new get/set macros, create object instance in the constructor)
- Update Python wrapping
- Maybe add automatic converters to const char* (can be disabled by CMake flags) to make updating application code easier
- Review all operating system calls and make sure strings are properly converted
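For the last item, the kind of conversion needed at an operating system boundary can be sketched as follows. `OpenUtf8` is a hypothetical helper, not a VTK function. It assumes the file name is stored as UTF-8, converts it to UTF-16 for the wide-character API on Windows, and passes the bytes through unchanged elsewhere.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: open a file whose name is given as a UTF-8 string.
std::FILE* OpenUtf8(const std::string& utf8Path, const char* mode)
{
#ifdef _WIN32
  // Windows narrow-character file APIs interpret names in the active code page,
  // so convert UTF-8 to UTF-16 and call the wide-character variant instead.
  auto widen = [](const std::string& s) -> std::wstring {
    if (s.empty())
    {
      return std::wstring();
    }
    const int n = MultiByteToWideChar(
      CP_UTF8, 0, s.data(), static_cast<int>(s.size()), nullptr, 0);
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(
      CP_UTF8, 0, s.data(), static_cast<int>(s.size()), &w[0], n);
    return w;
  };
  return _wfopen(widen(utf8Path).c_str(), widen(mode).c_str());
#else
  // On other platforms the file name bytes are passed to the system unchanged.
  return std::fopen(utf8Path.c_str(), mode);
#endif
}
```

Centralizing the conversion in one helper like this means each operating system call site only has to be reviewed once, rather than every caller handling encodings on its own.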