This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.
GetQuickCharWidth
-----------------
The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).
GQCW also incorrectly marked 67 reserved codepoints as `Wide` when they
should have been `Narrow`.
The codepoints whose definitions have changed from `Wide` to `Narrow` are:
```
2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67
```
All of them are reserved, but those reserved regions are marked as narrow
in the UCD.
This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.
New in Unicode 13.0
------------------
Some widths have changed due to previously-reserved characters becoming
_used_ such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.
A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)
Validation
----------
I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.
When the console functional tests are running on OneCoreUAP, the
newly-introduced (65bd4e327, #4309) FillOutputCharacterA tests will
actually fail because of radio interference on the return value of GLE.
Fixes MSFT-28163465
The table that we refer to in `CodepointWidthDetector.cpp` to determine
whether or not a codepoint should be rendered as Wide vs Narrow was
based off EastAsianWidth[1]. If a codepoint wasn't included in this
table, they're considered Narrow. Many emojis aren't specified in the
EAW list, so this PR supplements our table with emoji codepoints from
emoji-data[2] in order to render most, if not all, emojis as full-width.
There are certain codepoints I've added to the comments (in case we want
to add them officially to the table in the future) that Microsoft
decided to give an emoji presentation even if it's specified as
Narrow/Ambiguous in the EAW list and are _not_ specified in the Unicode
emoji list. These include all of the Mahjong Tiles block, different
direction pencils (✎✐), different pointing index fingers (☜, ☞) among
others. I have no idea if I've captured all of them, as I don't know of
an easy way to detect which are Microsoft specific emojis.
## Validation Steps Performed
I have looked at so many emojis that I dream emoji.
These screenshots aren't encompassing _all_ emoji but I've tried to grab
a couple from all across the codepoint ranges:
Before:
![before](https://user-images.githubusercontent.com/57155886/81445092-2051a980-912d-11ea-9739-c9f588da407d.png)
After:
![after](https://user-images.githubusercontent.com/57155886/81445107-2778b780-912d-11ea-9615-676c2150e798.png)
[1] http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
[2] https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txtCloses#900
## Summary of the Pull Request
Despite being specified as `noexcept`, `FillConsoleOutputCharacterA` emits an exception when a call to `ConvetToW` is made with an argument character which can't be converted. This PR fixes this throw, by wrapping `ConvertToW` in a try-catch_return.
## PR Checklist
* [x] Closes#4258
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [x] Tests added/passed: thanks @miniksa
## Detailed Description of the Pull Request / Additional comments
Following the semantics of other `FillConsoleOutputCharacter*` the output param `cellsModified` is set to `0`. The try-catch_return is also what other functions of this family perform in case of errors.
## Validation Steps Performed
Original repro no longer crashes.
Generated by https://github.com/jsoref/spelling `f`; to maintain your repo, please consider `fchurn`
I generally try to ignore upstream bits. I've accidentally included some items from the `deps/` directory. I expect someone will give me a list of items to drop, I'm happy to drop whole files/directories, or to split the PR into multiple items (E.g. comments/locals/public).
Closes#4294
This encompasses a handful of problems with column counting.
The Terminal project didn't set a fallback column counter. Oops. I've fixed this to use the `DxEngine` as the fallback.
The `DxEngine` didn't implement its fallback method. Oops. I've fixed this to use the `CustomTextLayout` to figure out the advances based on the same font and fallback pattern as the real final layout, just without "rounding" it into cells yet.
- `CustomTextLayout` has been updated to move the advance-correction into a separate phase from glyph shaping. Previously, we corrected the advances to nice round cell counts during shaping, which is fine for drawing, but hard for column count analysis.
- Now that there are separate phases, an `Analyze` method was added to the `CustomTextLayout` which just performs the text analysis steps and the glyph shaping, but no advance correction to column boundaries nor actual drawing.
I've taken the caching code that I was working on to improve chafa, and I've brought it into this. Now that we're doing a lot of fallback and heavy lifting in terms of analysis via the layout, we should cache the results until the font changes.
I've adjusted how column counting is done overall. It's always been in these phases:
1. We used a quick-lookup of ranges of characters we knew to rapidly decide `Narrow`, `Wide` or `Invalid` (a.k.a. "I dunno")
2. If it was `Invalid`, we consulted a table based off of the Unicode standard that has either `Narrow`, `Wide`, or `Ambiguous` as a result.
3. If it's still `Ambiguous`, we consult a render engine fallback (usually GDI or now DX) to see how many columns it would take.
4. If we still don't know, then it's `Wide` to be safe.
- I've added an additional flow here. The quick-lookup can now return `Ambiguous` off the bat for some glyph characters in the x2000-x3000 range that used to just be simple shapes but have been retroactively recategorized as emoji and are frequently now using full width color glyphs.
- This new state causes the lookup to go immediately to the render engine if it is available instead of consulting the Unicode standard table first because the half/fullwidth table doesn't appear to have been updated for this nuance to reclass these characters as ambiguous, but we'd like to keep that table as a "generated from the spec" sort of table and keep our exceptions in the "quick lookup" function.
I have confirmed the following things "just work" now:
- The windows logo flag from the demo. (⚫⚪💖✅🌌😊)
- The dotted chart on the side of crossterm demo (•)
- The powerline characters that make arrows with the Consolas patched font (██)
- An accented é
- The warning and checkmark symbols appearing same size as the X. (✔⚠🔥)
Related work items: #21167256, #21237515, #21243859, #21274645, #21296827