Commit graph

13 commits

Author SHA1 Message Date
Dustin L. Howett 1df3182865
Fully regenerate CodepointWidthDetector from Unicode 13.0 (#8035)
This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.

GetQuickCharWidth
-----------------

The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).

GQCW also incorrectly marked 67 reserved codepoints as `Wide` when they
should have been `Narrow`.

The codepoints whose definitions have changed from `Wide` to `Narrow` are:

```
2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67
```

All of them are reserved, but those reserved regions are marked as narrow
in the UCD.

This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.

New in Unicode 13.0
------------------

Some widths have changed due to previously-reserved characters becoming
_used_ such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.

A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)

Validation
----------

I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.
2020-10-27 17:36:28 +00:00
Dustin L. Howett (MSFT) ba1a298d6b
Partially regenerate codepoint widths from Emoji 13.0 (#5934)
This removes all glyphs from the emoji list that do not default to
"emoji presentation" (EPres). It removes all local overrides, but retains
the comments about the emoji we left out that are Microsoft-specific.

This brings us fully in line with the most popular Terminals on OS X,
except that we squash our emoji down to fit in one cell and they let
them hang over the edges and damage other characters. Oh well.

## Detailed Description of the Pull Request / Additional comments

Late Friday evening, I tested my emoji test file on iTerm2. In so doing, I realized
that @j4james and @leonMSFT were right the entire time in #5914: Emoji
that require `U+FE0F` must not be double-width by default.

I finally banged up a powershell script that parses the UCD and emits a codepoint
width table. Once checked in, this will be definitive.

Refs #900, #5914.
Fixes #5941.
2020-05-17 13:32:43 -07:00
Dustin L. Howett (MSFT) c39f9c6626
CodepointWidthDetector: reclassify U+25FB, U+25FC as Narrow (#5914)
This seems to be in line with the emoji-sequences table in the latest
version of the Unicode standard: those glyphs require U+FE0F to activate
their emoji presentation. Since we don't support composing U+FE0F, we
should not present them as emoji by default.

Fixes #5910.

Yes, I hate this.
2020-05-14 23:49:08 +00:00
Leon Liang cf62922ad8
Revert some emoji back to narrow width
A couple of codepoints, namely the card suites, male and female signs,
and white and black smiling faces were changed to have a two-column
width as part of #5795 since they were specified as emoji in Unicode's
emoji list v13.0[1]. 

These particular glyphs also show up in some of the most fundamental
code pages, such as CP437[2] and WGL4[3]. We should
not be touching the width of the glyphs in these codepages, as suddenly
changing a long-time-running narrow glyph to use two-columns all of a
sudden will surely break (and has already broken) things.

[1] https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt
[2] https://en.wikipedia.org/wiki/Code_page_437
[3] https://en.wikipedia.org/wiki/Windows_Glyph_List_4

Closes #5822
2020-05-12 19:38:11 +00:00
Leon Liang 7ae34336da
Make most emojis full-width (#5795)
The table that we refer to in `CodepointWidthDetector.cpp` to determine
whether or not a codepoint should be rendered as Wide vs Narrow was
based off EastAsianWidth[1].  If a codepoint wasn't included in this
table, they're considered Narrow. Many emojis aren't specified in the
EAW list, so this PR supplements our table with emoji codepoints from
emoji-data[2] in order to render most, if not all, emojis as full-width. 

There are certain codepoints I've added to the comments (in case we want
to add them officially to the table in the future) that Microsoft
decided to give an emoji presentation even if it's specified as
Narrow/Ambiguous in the EAW list and are _not_ specified in the Unicode
emoji list. These include all of the Mahjong Tiles block, different
direction pencils (✎✐), different pointing index fingers (☜, ☞) among
others. I have no idea if I've captured all of them, as I don't know of
an easy way to detect which are Microsoft specific emojis.

## Validation Steps Performed
I have looked at so many emojis that I dream emoji.

These screenshots aren't encompassing _all_ emoji but I've tried to grab
a couple from all across the codepoint ranges:

Before:
![before](https://user-images.githubusercontent.com/57155886/81445092-2051a980-912d-11ea-9739-c9f588da407d.png)

After:
![after](https://user-images.githubusercontent.com/57155886/81445107-2778b780-912d-11ea-9615-676c2150e798.png)

[1] http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
[2] https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

Closes #900
2020-05-08 22:31:09 +00:00
Chester Liu 4f8acb4b9f
Make CodepointWidthDetector::GetWidth faster (#3727)
This is a subset of #3578 which I think is harmless and the first step towards making things right.
References #3546 #3578 

## Detailed Description of the Pull Request / Additional comments

For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean of `IsWide`. Currently only `IsWide` is widely used and optimized using quick lookup table and fallback cache. This PR moves those optimization into `GetWidth`.

## Validation Steps Performed

API remains unchanged. Things are not broken.
2020-04-04 00:56:22 +00:00
Michael Niksa 6735311fc9 Suppress last two errors (C26455 default constructor throw in DxEngine because it's due for refactoring soon anyway & C26444 custom construction/destruction on OutputCellIterator because I can't see what's going on and it needs more investigation and shouldn't hold this up). Also run codeformat. 2019-09-03 16:18:19 -07:00
Michael Niksa 87f5852a72 Define actual constructor for CodepointWidthDetector as default isn't cutting it. 2019-09-03 15:03:54 -07:00
Michael Niksa 4f1157c044 C26447,C26440 - is noexcept but can throw or doesn't throw but not noexcept 2019-08-29 15:23:07 -07:00
Michael Niksa b33a59816e C26496, mark const if it's never written after creation 2019-08-29 11:27:39 -07:00
Dustin L. Howett (MSFT) 16e1e29a12
Replace CodepointWidthDetector's runtime table with a static one (#2368)
This commit replaces CodepointWidthDetector's
dynamically-generated map with a static constexpr one that's compiled
into the binary.

It also almost totally removes the notion of an `Invalid` width. We
definitely had gaps in our character coverage where we'd report a
character as invalid, but we'd then flatten that down to `Narrow` when
asked. By combining the not-present state and the narrow state, we get
to save a significant chunk of data.

I've tested this by feeding it all 0x10FFFF codepoints (and then some)
and making sure they 100% match the old code's outputs.

|------------------------------|---------------|----------------|
| Metric                       | Then          | Now            |
|------------------------------|---------------|----------------|
| disk space                   | 56k (`.text`) | 3k (`.rdata`)  |
| runtime memory (allocations) | 1088          | 0              |
| runtime memory (bytes)       | 51k           | ~0             |
| memory behavior              | not shared    | fully shared   |
| lookup time                  | ~31ns         | ~9ns           |
| first hit penalty            | ~170000ns     | 0ns            |
| lines of code                | 1088          | 285            |
| clarity                      | extreme       | slightly worse |
|------------------------------|---------------|----------------|

I also took a moment and cleaned up a stray boolean that we didn't need.
2019-08-16 10:54:17 -07:00
Michael Niksa 87e85603b9 Merged PR 3215853: Fix spacing/layout for block characters and many retroactively-recategorized emoji (and more!)
This encompasses a handful of problems with column counting.

The Terminal project didn't set a fallback column counter. Oops. I've fixed this to use the `DxEngine` as the fallback.

The `DxEngine` didn't implement its fallback method. Oops. I've fixed this to use the `CustomTextLayout` to figure out the advances based on the same font and fallback pattern as the real final layout, just without "rounding" it into cells yet.
- `CustomTextLayout` has been updated to move the advance-correction into a separate phase from glyph shaping. Previously, we corrected the advances to nice round cell counts during shaping, which is fine for drawing, but hard for column count analysis.
- Now that there are separate phases, an `Analyze` method was added to the `CustomTextLayout` which just performs the text analysis steps and the glyph shaping, but no advance correction to column boundaries nor actual drawing.

I've taken the caching code that I was working on to improve chafa, and I've brought it into this. Now that we're doing a lot of fallback and heavy lifting in terms of analysis via the layout, we should cache the results until the font changes.

I've adjusted how column counting is done overall. It's always been in these phases:
1. We used a quick-lookup of ranges of characters we knew to rapidly decide `Narrow`, `Wide` or `Invalid` (a.k.a. "I dunno")
2. If it was `Invalid`, we consulted a table based off of the Unicode standard that has either `Narrow`, `Wide`, or `Ambiguous` as a result.
3. If it's still `Ambiguous`, we consult a render engine fallback (usually GDI or now DX) to see how many columns it would take.
4. If we still don't know, then it's `Wide` to be safe.
- I've added an additional flow here. The quick-lookup can now return `Ambiguous` off the bat for some glyph characters in the x2000-x3000 range that used to just be simple shapes but have been retroactively recategorized as emoji and are frequently now using full width color glyphs.
- This new state causes the lookup to go immediately to the render engine if it is available instead of consulting the Unicode standard table first because the half/fullwidth table doesn't appear to have been updated for this nuance to reclass these characters as ambiguous, but we'd like to keep that table as a "generated from the spec" sort of table and keep our exceptions in the "quick lookup" function.

I have confirmed the following things "just work" now:
- The windows logo flag from the demo. (💖🌌😊)
- The dotted chart on the side of crossterm demo (•)
- The powerline characters that make arrows with the Consolas patched font (██)
- An accented é
- The warning and checkmark symbols appearing same size as the X. (✔⚠🔥)

Related work items: #21167256, #21237515, #21243859, #21274645, #21296827
2019-05-02 15:29:10 -07:00
Dustin Howett d4d59fa339 Initial release of the Windows Terminal source code
This commit introduces all of the Windows Terminal and Console Host source,
under the MIT license.
2019-05-02 15:29:04 -07:00