Pattern Matching for C# 7 ========================= Pattern matching extensions for C# enable many of the benefits of algebraic data types and pattern matching from functional languages, but in a way that smoothly integrates with the feel of the underlying language. The basic features are: [record types](records.md), which are types whose semantic meaning is described by the shape of the data; and pattern matching, which is a new expression form that enables extremely concise multilevel decomposition of these data types. Elements of this approach are inspired by related features in the programming languages [F#](http://www.msr-waypoint.net/pubs/79947/p29-syme.pdf "Extensible Pattern Matching Via a Lightweight Language") and [Scala](http://lampwww.epfl.ch/~emir/written/MatchingObjectsWithPatterns-TR.pdf "Matching Objects With Patterns"). ## Is Expression The `is` operator is extended to test an expression against a *pattern*. ```antlr relational_expression : relational_expression 'is' pattern ; ``` This form of *relational_expression* is in addition to the existing forms in the C# specification. It is a compile-time error if the *relational_expression* to the left of the `is` token does not designate a value or does not have a type. Every *identifier* of the pattern introduces a new local variable that is *definitely assigned* after the `is` operator is `true` (i.e. *definitely assigned when true*). > Note: There is technically an ambiguity between *type* in an `is-expression` and *constant_pattern*, either of which might be a valid parse of a qualified identifier. We try to bind it as a type for compatibility with previous versions of the language; only if that fails do we resolve it as we do in other contexts, to the first thing found (which must be either a constant or a type). This ambiguity is only present on the right-hand-side of an `is` expression. ## Patterns Patterns are used in the `is` operator and in a *switch_statement* to express the shape of data against which incoming data is to be compared. Patterns may be recursive so that parts of the data may be matched against sub-patterns. ```antlr pattern : declaration_pattern | constant_pattern | var_pattern ; declaration_pattern : type simple_designation ; constant_pattern : shift_expression ; var_pattern : 'var' simple_designation ; ``` > Note: There is technically an ambiguity between *type* in an `is-expression` and *constant_pattern*, either of which might be a valid parse of a qualified identifier. We try to bind it as a type for compatibility with previous versions of the language; only if that fails do we resolve it as we do in other contexts, to the first thing found (which must be either a constant or a type). This ambiguity is only present on the right-hand-side of an `is` expression. ### Declaration Pattern The *declaration_pattern* both tests that an expression is of a given type and casts it to that type if the test succeeds. If the *simple_designation* is an identifier, it introduces a local variable of the given type named by the given identifier. That local variable is *definitely assigned* when the result of the pattern-matching operation is true. ```antlr declaration_pattern : type simple_designation ; ``` The runtime semantic of this expression is that it tests the runtime type of the left-hand *relational_expression* operand against the *type* in the pattern. If it is of that runtime type (or some subtype), the result of the `is operator` is `true`. It declares a new local variable named by the *identifier* that is assigned the value of the left-hand operand when the result is `true`. Certain combinations of static type of the left-hand-side and the given type are considered incompatible and result in compile-time error. A value of static type `E` is said to be *pattern compatible* with the type `T` if there exists an identity conversion, an implicit reference conversion, a boxing conversion, an explicit reference conversion, or an unboxing conversion from `E` to `T`. It is a compile-time error if an expression of type `E` is not pattern compatible with the type in a type pattern that it is matched with. > Note: [In C# 7.1 we extend this](https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.1/generics-pattern-match.md) to permit a pattern-matching operation if either the input type or the type `T` is an open type. This paragraph is replaced by the following: > > Certain combinations of static type of the left-hand-side and the given type are considered incompatible and result in compile-time error. A value of static type `E` is said to be *pattern compatible* with the type `T` if there exists an identity conversion, an implicit reference conversion, a boxing conversion, an explicit reference conversion, or an unboxing conversion from `E` to `T`, **or if either `E` or `T` is an open type**. It is a compile-time error if an expression of type `E` is not pattern compatible with the type in a type pattern that it is matched with. The declaration pattern is useful for performing run-time type tests of reference types, and replaces the idiom ```cs var v = expr as Type; if (v != null) { // code using v } ``` With the slightly more concise ```cs if (expr is Type v) { // code using v } ``` It is an error if *type* is a nullable value type. The declaration pattern can be used to test values of nullable types: a value of type `Nullable` (or a boxed `T`) matches a type pattern `T2 id` if the value is non-null and the type of `T2` is `T`, or some base type or interface of `T`. For example, in the code fragment ```cs int? x = 3; if (x is int v) { // code using v } ``` The condition of the `if` statement is `true` at runtime and the variable `v` holds the value `3` of type `int` inside the block. ### Constant Pattern ```antlr constant_pattern : shift_expression ; ``` A constant pattern tests the value of an expression against a constant value. The constant may be any constant expression, such as a literal, the name of a declared `const` variable, or an enumeration constant, or a `typeof` expression. If both *e* and *c* are of integral types, the pattern is considered matched if the result of the expression `e == c` is `true`. Otherwise the pattern is considered matching if `object.Equals(e, c)` returns `true`. In this case it is a compile-time error if the static type of *e* is not *pattern compatible* with the type of the constant. ### Var Pattern ```antlr var_pattern : 'var' simple_designation ; ``` An expression *e* matches a *var_pattern* always. In other words, a match to a *var pattern* always succeeds. If the *simple_designation* is an identifier, then at runtime the value of *e* is bound to a newly introduced local variable. The type of the local variable is the static type of *e*. It is an error if the name `var` binds to a type. ## Switch Statement The `switch` statement is extended to select for execution the first block having an associated pattern that matches the *switch expression*. ```antlr switch_label : 'case' complex_pattern case_guard? ':' | 'case' constant_expression case_guard? ':' | 'default' ':' ; case_guard : 'when' expression ; ``` The order in which patterns are matched is not defined. A compiler is permitted to match patterns out of order, and to reuse the results of already matched patterns to compute the result of matching of other patterns. If a *case-guard* is present, its expression is of type `bool`. It is evaluated as an additional condition that must be satisfied for the case to be considered satisfied. It is an error if a *switch_label* can have no effect at runtime because its pattern is subsumed by previous cases. [TODO: We should be more precise about the techniques the compiler is required to use to reach this judgment.] A pattern variable declared in a *switch_label* is definitely assigned in its case block if and only if that case block contains precisely one *switch_label*. [TODO: We should specify when a *switch block* is reachable.] ### Scope of Pattern Variables The scope of a variable declared in a pattern is as follows: - If the pattern is a case label, then the scope of the variable is the *case block*. Otherwise the variable is declared in an *is_pattern* expression, and its scope is based on the construct immediately enclosing the expression containing the *is_pattern* expression as follows: - If the expression is in an expression-bodied lambda, its scope is the body of the lambda. - If the expression is in an expression-bodied method or property, its scope is the body of the method or property. - If the expression is in a `when` clause of a `catch` clause, its scope is that `catch` clause. - If the expression is in an *iteration_statement*, its scope is just that statement. - Otherwise if the expression is in some other statement form, its scope is the scope containing the statement. For the purpose of determing the scope, an *embedded_statement* is considered to be in its own scope. For example, the grammar for an *if_statement* is ``` antlr if_statement : 'if' '(' boolean_expression ')' embedded_statement | 'if' '(' boolean_expression ')' embedded_statement 'else' embedded_statement ; ``` So if the controlled statement of an *if_statement* declares a pattern variable, its scope is restricted to that *embedded_statement*: ``` c# if (x) M(y is var z); ``` In this case the scope of `z` is the embedded statement `M(y is var z);`. Other cases are errors for other reasons (e.g. in a parameter's default value or an attribute, both of which are an error because those contexts require a constant expression). > [In C# 7.3 we added the following contexts](https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.3/expression-variables-in-initializers.md) in which a pattern variable may be declared: > - If the expression is in a *constructor initializer*, its scope is the *constructor initializer* and the constructor's body. > - If the expression is in a field initializer, its scope is the *equals_value_clause* in which it appears. > - If the expression is in a query clause that is specified to be translated into the body of a lambda, its scope is just that expression. ## Changes to syntactic disambiguation There are situations involving generics where the C# grammar is ambiguous, and the language spec says how to resolve those ambiguities: > #### 7.6.5.2 Grammar ambiguities > The productions for *simple-name* (§7.6.3) and *member-access* (§7.6.5) can give rise to ambiguities in the grammar for expressions. For example, the statement: > ```cs > F(G(7)); > ``` > could be interpreted as a call to `F` with two arguments, `G < A` and `B > (7)`. Alternatively, it could be interpreted as a call to `F` with one argument, which is a call to a generic method `G` with two type arguments and one regular argument. > If a sequence of tokens can be parsed (in context) as a *simple-name* (§7.6.3), *member-access* (§7.6.5), or *pointer-member-access* (§18.5.2) ending with a *type-argument-list* (§4.4.1), the token immediately following the closing `>` token is examined. If it is one of > ```none > ( ) ] } : ; , . ? == != | ^ > ``` > then the *type-argument-list* is retained as part of the *simple-name*, *member-access* or *pointer-member-access* and any other possible parse of the sequence of tokens is discarded. Otherwise, the *type-argument-list* is not considered to be part of the *simple-name*, *member-access* or > *pointer-member-access*, even if there is no other possible parse of the sequence of tokens. Note that these rules are not applied when parsing a *type-argument-list* in a *namespace-or-type-name* (§3.8). The statement > ```cs > F(G(7)); > ``` > will, according to this rule, be interpreted as a call to `F` with one argument, which is a call to a generic method `G` with two type arguments and one regular argument. The statements > ```cs > F(G < A, B > 7); > F(G < A, B >> 7); > ``` > will each be interpreted as a call to `F` with two arguments. The statement > ```cs > x = F < A > +y; > ``` > will be interpreted as a less than operator, greater than operator, and unary plus operator, as if the statement had been written `x = (F < A) > (+y)`, instead of as a *simple-name* with a *type-argument-list* followed by a binary plus operator. In the statement > ```cs > x = y is C + z; > ``` > the tokens `C` are interpreted as a *namespace-or-type-name* with a *type-argument-list*. There are a number of changes being introduced in C# 7 that make these disambiguation rules no longer sufficient to handle the complexity of the language. ### out variable declarations It is now possible to declare a variable in an out argument: ```cs M(out Type name); ``` However, the type may be generic: ```cs M(out A name); ``` Since the language grammar for the argument uses *expression*, this context is subject to the disambiguation rule. In this case the closing `>` is followed by an *identifier*, which is not one of the tokens that permits it to be treated as a *type-argument-list*. I therefore propose to **add *identifier* to the set of tokens that triggers the disambiguation to a *type-argument-list*.** ### tuples and deconstruction declarations A tuple literal runs into exactly the same issue. Consider the tuple expression ```cs (A < B, C > D, E < F, G > H) ``` Under the old C# 6 rules for parsing an argument list, this would parse as a tuple with four elements, starting with `A < B` as the first. However, when this appears on the left of a deconstruction, we want the disambiguation triggered by the *identifier* token as described above: ```cs (A D, E H) = e; ``` This is a deconstruction declaration which declares two variables, the first of which is of type `A` and named `D`. In other words, the tuple literal contains two expressions, each of which is a declaration expression. For simplicitly of the specification and compiler, I propose that this tuple literal be parsed as a two-element tuple wherever it appears (whether or not it appears on the left-hand-side of an assignment). That would be a natural result of the disambiguation described in the previous section. ### pattern-matching Pattern matching introduces a new context where the expression-type ambiguity arises. Previously the right-hand-side of an `is` operator was a type. Now it can be a type or expression, and if it is a type it may be followed by an identifier. This can, technically, change the meaning of existing code: ```cs var x = e is T < A > B; ``` This could be parsed under C#6 rules as ```cs var x = ((e is T) < A) > B; ``` but under under C#7 rules (with the disambiguation proposed above) would be parsed as ```cs var x = e is T B; ``` which declares a variable `B` of type `T`. Fortunately, the native and Roslyn compilers have a bug whereby they give a syntax error on the C#6 code. Therefore this particular breaking change is not a concern. Pattern-matching introduces additional tokens that should drive the ambiguity resolution toward selecting a type. The following examples of existing valid C#6 code would be broken without additional disambiguation rules: ```cs var x = e is A && f; // && var x = e is A || f; // || var x = e is A & f; // & var x = e is A[]; // [ ``` ### proposed change to the disambiguation rule I propose to revise the specification to change the list of disambiguating tokens from > ```none ( ) ] } : ; , . ? == != | ^ ``` to > ```none ( ) ] } : ; , . ? == != | ^ && || & [ ``` And, in certain contexts, we treat *identifier* as a disambiguating token. Those contexts are where the sequence of tokens being disambiguated is immediately preceded by one of the keywords `is`, `case`, or `out`, or arises while parsing the first element of a tuple literal (in which case the tokens are preceded by `(` or `:` and the identifier is followed by a `,`) or a subsequent element of a tuple literal. ### modified disambiguation rule The revised disambiguation rule would be something like this > If a sequence of tokens can be parsed (in context) as a *simple-name* (§7.6.3), *member-access* (§7.6.5), or *pointer-member-access* (§18.5.2) ending with a *type-argument-list* (§4.4.1), the token immediately following the closing `>` token is examined, to see if it is > - One of `( ) ] } : ; , . ? == != | ^ && || & [`; or > - One of the relational operators `< > <= >= is as`; or > - A contextual query keyword appearing inside a query expression; or > - In certain contexts, we treat *identifier* as a disambiguating token. Those contexts are where the sequence of tokens being disambiguated is immediately preceded by one of the keywords `is`, `case` or `out`, or arises while parsing the first element of a tuple literal (in which case the tokens are preceded by `(` or `:` and the identifier is followed by a `,`) or a subsequent element of a tuple literal. > > If the following token is among this list, or an identifier in such a context, then the *type-argument-list* is retained as part of the *simple-name*, *member-access* or *pointer-member-access* and any other possible parse of the sequence of tokens is discarded. Otherwise, the *type-argument-list* is not considered to be part of the *simple-name*, *member-access* or *pointer-member-access*, even if there is no other possible parse of the sequence of tokens. Note that these rules are not applied when parsing a *type-argument-list* in a *namespace-or-type-name* (§3.8). ### breaking changes due to this proposal No breaking changes are known due to this proposed disambiguation rule. ### Interesting examples Here are some interesting results of these disambiguation rules: The expression `(A < B, C > D)` is a tuple with two elements, each a comparison. The expression `(A D, E)` is a tuple with two elements, the first of which is a declaration expression. The invocation `M(A < B, C > D, E)` has three arguments. The invocation `M(out A D, E)` has two arguments, the first of which is an `out` declaration. The expression `e is A C` uses a declaration expression. The case label `case A C:` uses a declaration expression. ## Some Examples of Pattern Matching ### Is-As We can replace the idiom ```cs var v = expr as Type; if (v != null) { // code using v } ``` With the slightly more concise and direct ```cs if (expr is Type v) { // code using v } ``` ### Testing nullable We can replace the idiom ```cs Type? v = x?.y?.z; if (v.HasValue) { var value = v.GetValueOrDefault(); // code using value } ``` With the slightly more concise and direct ```cs if (x?.y?.z is Type value) { // code using value } ``` ### Arithmetic simplification Suppose we define a set of recursive types to represent expressions (per a separate proposal): ```cs abstract class Expr; class X() : Expr; class Const(double Value) : Expr; class Add(Expr Left, Expr Right) : Expr; class Mult(Expr Left, Expr Right) : Expr; class Neg(Expr Value) : Expr; ``` Now we can define a function to compute the (unreduced) derivative of an expression: ```cs Expr Deriv(Expr e) { switch (e) { case X(): return Const(1); case Const(*): return Const(0); case Add(var Left, var Right): return Add(Deriv(Left), Deriv(Right)); case Mult(var Left, var Right): return Add(Mult(Deriv(Left), Right), Mult(Left, Deriv(Right))); case Neg(var Value): return Neg(Deriv(Value)); } } ``` An expression simplifier demonstrates positional patterns: ```cs Expr Simplify(Expr e) { switch (e) { case Mult(Const(0), *): return Const(0); case Mult(*, Const(0)): return Const(0); case Mult(Const(1), var x): return Simplify(x); case Mult(var x, Const(1)): return Simplify(x); case Mult(Const(var l), Const(var r)): return Const(l*r); case Add(Const(0), var x): return Simplify(x); case Add(var x, Const(0)): return Simplify(x); case Add(Const(var l), Const(var r)): return Const(l+r); case Neg(Const(var k)): return Const(-k); default: return e; } } ```