Alexander Moisseev
2011-05-25 05:33:26 UTC
- Switch default charset for encoding file names to utf8
- Set general purpose bit 11 for UTF-8 code page
- Set "version made by" field to 0 (MS-DOS) for OEM code pages
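The three changes above can be sketched as one small decision function. This is only an illustration in Python, not the code from the patch: the name `header_fields`, its parameters and the `default_host` fallback are hypothetical. The constants are the bit 11 mask and the host-system values 0 (MS-DOS) and 3 (UNIX) that go in the upper byte of "version made by".

```python
UTF8_FLAG = 0x0800   # general purpose bit 11: file name is UTF-8
HOST_MSDOS = 0       # "version made by" host system: MS-DOS
HOST_UNIX = 3        # "version made by" host system: UNIX


def header_fields(name, charset="utf8", default_host=HOST_UNIX):
    """Return (encoded_name, general_purpose_flags, host_system).

    Hypothetical helper: for UTF-8 it sets bit 11 and leaves the host
    system at default_host (an assumption, since the field is not
    meaningful there); for any other charset, assumed to be an OEM
    code page, it clears bit 11 and reports MS-DOS.
    """
    normalized = charset.lower().replace("-", "").replace("_", "")
    if normalized == "utf8":
        return name.encode("utf-8"), UTF8_FLAG, default_host
    return name.encode(charset), 0, HOST_MSDOS
```

For example, `header_fields("Я.txt", "cp866")` gives the OEM bytes with no flag and host 0, while the default call gives UTF-8 bytes with bit 11 set.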
Testing
See the archiver test results in the attached zip-test.zip. The tests show which applications list file names correctly (V) or incorrectly (X) in zip archives created by BackupPC_zipCreate with different code pages and "version made by" field values. The tests were done for the Russian (cp866, cp1251) and German (cp850, cp1252) code pages. General purpose bit 11 was set for the UTF-8 code page. The "version made by" field has no meaning for UTF-8.
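To check which values a test archive actually carries, the two fields can be read back with Python's standard zipfile module. This is a rough sketch and not part of the attached tests; `list_name_fields` and `make_sample` are hypothetical names, and the sample archive is built in memory only so the example is self-contained.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11


def list_name_fields(archive):
    """Return (file name, bit 11 set?, "version made by" host) per entry."""
    rows = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            utf8 = bool(info.flag_bits & UTF8_FLAG)
            rows.append((info.filename, utf8, info.create_system))
    return rows


def make_sample():
    """Build a one-entry archive in memory with a non-ASCII file name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("тест.txt")
        info.create_system = 0  # 0 = MS-DOS, 3 = UNIX
        zf.writestr(info, b"data")
    buf.seek(0)
    return buf
```

Note that zipfile itself decodes names without bit 11 as cp437 by default, so for archives written in cp866 or cp850 the listed name may be wrong even though the flag and host values are reported correctly.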
Interpreting results
If "version made by" is set to MS-DOS, "old" versions of applications use the OEM code page. If it is set to UNIX, they use the local ANSI code page. Recent versions of archivers and Zip Folders simply ignore "version made by" and always use OEM. So the OEM code page should be used rather than ANSI for better compatibility.
E.g.
cp850 rather than cp1252,
cp866 rather than cp1251,
and so on.
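Why the choice matters can be seen from the raw bytes: the same character maps to different byte values in the OEM and ANSI code pages, so a reader that always assumes OEM will garble ANSI names. A minimal sketch, where `roundtrip` is a hypothetical helper and the reader is modelled as a plain decode:

```python
def roundtrip(name, written_as, read_as):
    """Encode a file name in one code page and decode it in another."""
    return name.encode(written_as).decode(read_as, errors="replace")


# The same letter has different byte values in OEM and ANSI code pages.
russian_oem = "Я".encode("cp866")    # b'\x9f'
russian_ansi = "Я".encode("cp1251")  # b'\xdf'
german_oem = "ä".encode("cp850")     # b'\x84'
german_ansi = "ä".encode("cp1252")   # b'\xe4'

# A reader that always assumes OEM recovers OEM names but not ANSI ones.
oem_ok = roundtrip("Я", "cp866", "cp866") == "Я"
ansi_broken = roundtrip("Я", "cp1251", "cp866") != "Я"
```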
Nowadays most archivers support UTF-8. As utf8 is "language independent", it seems reasonable to select it as the _default_ charset for encoding file names.
References
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
--
Alexander