This commit introduces Generate-CodepointWidthsFromUCD, a powershell (7+) script that will parse a UCD XML database in the UAX 42 format from https://www.unicode.org/Public/UCD/latest/ucdxml/ and generate CodepointWidthDetector's giant width array. By default, it will emit one UnicodeRange for every range of non-narrow glyphs with a different Width + Emoji + Emoji Presentation class; however, it can be run in "packing" and "full" mode. * Packing mode: ignore the width/emoji/pres class and combine adjacent runs that CPWD will treat the same. * This is for optimizing the number of individual ranges emitted into code. * Full mode: include narrow codepoints (helpful for visualization) It also supports overrides, provided in an XML document of the same format as the UCD itself. Entries in the overrides files are applied after the entire UCD is read and will replace any impacted ranges. The output (when packing) looks like this: ```c++ // Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False // on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0. // 66182 (0x10286) codepoints covered. static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{ UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous }, UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous }, UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous }, . . . UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide }, UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide }, UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous }, }; ``` The output (when overriding) looks like this: ```c++ // Generated by Generate-CodepointWidthsFromUCD.ps1 -Pack:True -Full:False -NoOverrides:False // on 5/22/2020 11:17:39 PM (UTC) from Unicode 13.0.0. // 321205 (0x4E6B5) codepoints covered. // 240 (0xF0) codepoints overridden. static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{ UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous }, ... UnicodeRange{ 0xfe20, 0xfe2f, CodepointWidth::Narrow }, // narrow combining ligatures (split into left/right halves, which take 2 columns together) ... UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous }, }; ``` |
||
---|---|---|
.. | ||
alphabet.txt | ||
expect.txt | ||
README.md | ||
web.txt |
The contents of each .txt
file in this directory are merged together.
- alphabet is a sample for alphabet related items
- web is a sample for web/html related items
- expect is the main list of expected items -- there is nothing particularly special about the file name (beyond the extension which is important).
These terms are things which temporarily exist in the project, but which aren't necessarily words.
If something is a word that could come and go, it probably belongs in a dictionary.