pulumi/docs/design/formats.md
2017-03-10 14:43:34 -08:00

315 lines
23 KiB
Markdown

# Coconut Formats
Coconut cloud topologies are described using three formats: CocoLangs, CocoPack/CocoIL, and CocoGL.
At the highest level, developers write Coconut packages using a high-level language. There are multiple to choose from,
and each is a proper subset of an existing popular programming language; CocoJS is a subset of JavaScript, CocoPy is
a subset of Python, CocoRu is a subset of Ruby, and CocoGo is a subset of Go, for example. The restrictions placed on
these languages ensure static analyzability, determinism, and compilability into an intermediate form. That said, they
retain most of their source language's popular features, including the libraries, so that these languages feel familiar.
To distinguish them from their ordinary counterparts, these subsetted language dialects are called CocoLangs.
In the middle, CocoPack is a standard metadata representation for compiled and redistributable Coconut packages. This
is the unit of distribution and dependency in the classical package management sense. CocoPack is multi-langauge and
contains internal and exported modules, types, functions, variables, and code. All code is serialized into CocoIL, the
CocoPack intermediate language, that can be evaluated by the Coconut toolchain. Because of computations, the final
"shape" of a cloud topology is not determined until the CocoPack is evaluated as part of a deployment planning step.
The final shape, Coconut Graph Language (CocoGL), represents a cloud topology with concrete property values and
dependencies. A graph can be compared to another graph to compute a delta, a capability essential to incremental
deployment and drift analysis. Each graph is a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph), in which
nodes are cloud resources, edges are [directed dependencies](https://en.wikipedia.org/wiki/Dependency_graph) between
them, and all input and output properties are known values. Any given CocoPack can create many possible CocoGL graphs,
and identical CocoGL graphs can be created by different CocoPacks, because CocoPack can contain parameterized logic and
conditional computations. A CocoGL graph can also be generated from a live environment, something we call a snapshot.
This document describes these formats, the requirements for a high-level CocoLang (although details for each language
are specified elsehwere), the CocoPack/CocoIL and CocoGL formats, and the overall compilation process.
## Coconut Languages (CocoLangs)
We envision a collection of high-level languages so IT professionals and developers can pick the one they feel most
comfortable with. For example, we currently plan to support JavaScript (CocoJS), Python (CocoPy), Ruby (CocoRu), and Go
(CocoGo). Furthermore, we imagine translators from other cloud topology formats like AWS CloudFormation and Hashicorp
Terraform. These are called metadata languages, or CocoLangs, and we call code artifacts written in them packages.
In principle, there is no limit to the breadth of CocoLangs that Coconut can support -- and it is indeed extensible by
3rd parties -- although we do require that any CocoLang compiles down into CocoPack. This translation requires static
code translation -- admittedly an extra step for most dynamic languages -- however, the CocoPack/CocoIL format was
designed with these dynamic languages in mind and supports full dynamic dispatch, inspection, and so on.
### Configuration versus Runtime Code
It is important to reiterate that CocoLangs are not traditional languages. Packages written in them describe the
desired state of a cloud topology of resources. The evaluation of a CocoLang program results in a CocoGL DAG that
captures the intended state corresponding to physical entities in a target cloud environment, as well as dependencies
between them. The Coconut toolset then takes this DAG, compares it to the existing environment, and "makes it so."
Note that this approach fully embraces the [immutable infrastructure](
http://martinfowler.com/bliki/ImmutableServer.html) philosophy, including embracing [cattle over pets](
https://blog.engineyard.com/2014/pets-vs-cattle). But it leverages the full power of real programming languages.
A CocoLang package itself is just metadata, therefore, and any computations in the language itself exist solely to
determine this DAG. This is a subtle but critical point, so it bears repeating: CocoLang code does not actually execute
within the target cloud environment, and it alone does not perform side-effects; instead, CocoLang code merely describes
the topology of the code and resources, and the Coconut toolchain orchestrates deployments based on analysis of it.
It is possible to mix CocoLang and regular code in the respective source language. This is particularly helpful when
associating runtime code to deployment-time artifacts, and is common when dealing with serverless programming. For
example, instead of needing to manually separate out the description of a cloud lambda and its code, we can just write a
lambda that the compiler will "shred" into a distinct asset that is bundled inside the cloud deployment automatically.
For example, this mixes configuration and runtime code:
let func = new coconut.x.Function(ctx => {
ctx.write("Hello, Coconut!");
});
In this CocoJS program, the `let func = new coconut.x.Function({})` part uses the CocoJS subset. However, the code
inside of the lambda is real JavaScript, and may use NPM packages, perform IO, and so forth. The CocoJS compiler
understands how to compile this in a dual mode so that the resulting artifacts can be deployed independently.
In any case, "language bindings" bind certain resource properties to executable code. This executable code can come in
many forms, aside from the above lambda example. For example, a container resource may bind to a real, physical Docker
container image. As another example, an RPC resource may bind to an entire Go program, with many endpoints implemented
as Go functions. A CocoPack is incomplete without being fully bound to the program assets that must be co-deployed.
### Restricted Subsets
The restrictions placed on CocoLangs streamline the task of producing cloud topology graphs, and ensure that programs
are deterministic. Determinism is important, otherwise two deployments from the exact same source programs might
result in two graphs that differ in surprising and unwanted ways. Evaluation of the the same program must be
idempotent so that graphs and target environments can easily converge and so that failures can be dealt with reliably.
In general, this means CocoLangs may not perform these actions:
* I/O of any kind (network, file, etc).
* Syscalls (except for those excplicitly blessed as being deterministic).
* Invocation of non-CocoLang code (including 3rd party packages).
* Any action that is not transitively analyable through global analysis (like C FFIs).
Examples of existing efforts to define such a subset in JavaScript, simply as an illustration, include: [Gatekeeper](
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/gatekeeper_tr.pdf), [ADsafe](
http://www.adsafe.org/), [Caja](https://github.com/google/caja), [WebPPL](http://dippl.org/chapters/02-webppl.html),
[Deterministic.js](https://deterministic.js.org/), and even JavaScript's own [strict mode](
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode). There are also multiple attempts to
catalogue sources of nondeterminism in [JavaScript](
https://github.com/burg/timelapse/wiki/Note-sources-of-nondeterminism) and [its variants](
https://github.com/WebAssembly/design/blob/master/Nondeterminism.md).
CocoLangs may in fact consume 3rd party packages (e.g., from NPM, Pip, or elsewhere), but they must be blessed by the
CocoLang compiler for your language of choice. This means recompiling packages from source -- and dealing with the
possibility that they will fail to compile if they perform illegal operations -- or, preferably, using a package that
has already been pre-compiled using a CocoLang compiler, likely in CocoPack format, in which case you are guaranteed
that it will work without any unexpected difficulties. The CocoHub contains precompiled CocoPacks for easy consumption.
Each CocoLang program is compiled into a package that leverages the CocoPack and CocoIL formats.
## Coconut Package (CocoPack) and Intermediate Language (CocoIL) Formats
Each Coconut package is encoded in CocoPack and serialized in a JSON/YAML form for easy toolability. The full CocoPack
and CocoIL specifications are available in the [Coconut Package Metadata (CocoPack) design doc](packages.md).
### CocoPack
CocoPack is the unit of sharing and reuse, and includes high-level metadata about the module's contents, in addition to
its modules, types, functions, and variables. Each CocoPack can be either a *library* or an *executable*. A library is
not meant to be instantiated on its own; instead, it packages up abstractions that are useful to other packages. An
executable, on the other hand, is executed in order to create a CocoGL graph and cloud topology all on its own.
For example, the package metadata includes things like a name, description, and dependencies:
{
"name": "acmecorp/aws-example",
"description": "Simple example of using AWS resources.",
"dependencies": {
"aws": "*",
"coconut": "1.0",
"cocojs": "^1.0.7"
},
"modules": {
// CocoIL goes here
}
}
CocoPack uses a "JSON-like" type system to ease cross-language reuse and Internet interoperability. This type system
may be extended with custom types, including custom resources, functions that encapsulate patterns of common
infrastructure logic, and schema types like enums that govern the shape of data and property values being exchanged.
Any of these types and functions may or may not be exported from a package.
### CocoIL
Executable code artifacts are encoded using a simple intermediate language, CocoIL, that Coconut may evaluate.
This captures a deterministic, bounded set of type system and execution constructs that a subset of most higher level
languages can target and consume. The design has been inspired by existing "minimalistic" multi-language intermediate
formats, and is very similar to [Java bytecode](https://docs.oracle.com/javase/specs/jvms/se8/html/) and [.NET CIL](
https://www.ecma-international.org/publications/standards/Ecma-335.htm), with elements of [asm.js](
http://asmjs.org/spec/latest/) and [WebAssembly](https://github.com/WebAssembly/) mixed in for dynamic translation.
All elements within a package are referred to by tokens. For example, the type token for the AWS `Instance` resource
class, within the `aws` package and `ec2/instance` module, would be `aws:ec2/instance:Instance`.
This IL is fully bound by CocoLang compilers, so that IL processing needn't re-parse, re-analyze, or re-bind the
resulting trees. This simplifies the toolchain, particularly given that many languages have their own looking, binding,
and scoping rules that would be difficult for Coconut to perform in a language-agnostic way. The IL is an abstract
syntax tree (AST)-based IL rather than, say, a stack machine. This choice simplifies the creation of new CocoLang
compilers. All order of evaluation is encoded by CocoLang compilers so that again Coconut may remain language-neutral.
An optional verifier can check to ensure ASTs are well-formed:
$ coco pack verify
Unlike many virtual machines, Coconut aggressively performs verification during evaluation of packages. This catches
subtle runtime errors sooner and lets dynamic languages lean on runtime verification. Coconut is concerned more with
correctness and reliability than evaluation performance, which for cloud resources is often already measured in seconds.
### Planning and Deploying
An executable package is eventually deployed as a stack. This takes one of two forms:
* *Plan*: `coco deploy <env> -n` generates a plan, a description of a hypothetical deployment, but does not actually
mutate anything. This is essentially a "dry run" to see what changes will occur. To do this, the package's main
entrypoint is called with the provided arguments. The graph may be incomplete if code depends on resource outputs for
conditional execution or property values, resulting in so-called "holes" that will be shown in the plan's output.
* *Deployment*: `coco deploy <env>` applies a plan to a target environment `<env>`. The primary difference compared to
a plan is that real resources may be created, updated, and deleted. Because side-effecting actions are performed, all
outputs are known, and so the resulting graph will be complete. The resulting deployment record is saved locally.
The result of both steps is a CocoGL graph, the latter being strictly more complete than the former.
## Coconut Graph Language (CocoGL)
CocoGL is the simplest and final frontier of Coconut's formats. Each CocoGL artifact -- something we just call a
graph -- can be an in-memory data structure and/or a serialized into JSON or YAML.
In its most general form, CocoGL is just a graph of objects, their properties, and their dependencies amongst one
another. However, CocoGL is used in a very specific way when it comes to Coconut packages, plans, and deployments.
Specifically, when it comes to deployments, each vertex represents a resource object, complete with properties, and each
edge is a dependeny between resources. Such a graph is required to be a DAG so that resource ordering is well-defined.
Each such graph represents the outcome of some deployment activity, either planned or actually having taken place.
Subtly, the graph is never considered the "source of truth"; only the corresponding live running environment can be the
source of truth. Instead, the graph describes the *intended* eventual state that a deployment activity is meant to
achieve. A *snapshot* may be generated from a live environment to capture the actual state of running resources as a
CocoGL resource graph. A process called *reconciliation* compares differences between the two -- either on-demand or as
part of a continuous monitoring process -- and resolve any differences (through updates in either direction).
Each vertex in a graph carries the resource's provider-assigned ID, a URN, type, and a set of named property values.
The provider ID is opaque to Coconut, and is assigned by the resource provider, while the URN is automatically assigned.
An example AWS EC2 instance ID might be `i-0c5192a1d67810e1a` and URN
`urn:coconut:prod/acmecorp::acmeinfra::index::aws:ec2/instance:Instance::master-node`.
A resource's type tells the CocoGL toolchain how to load a plugin to deal with physical resources that need to be
created, read, updated, or deleted, and governs which properties are legal and their expected types. Any module
references within the CocoGL file still refer to the respective packages, which are resolved during evaluation. All
module references will have been "locked" to a specific version of that CocoPack, however, for repeatability's sake.
Edges between these nodes represent dependencies, and are therefore directed, and must be explicit. The graph itself
fully describes the complete set of dependencies. That is, consider a resource A's property whose value was computed by
reading the output of some other resource B; this dependency from A on B exists in code, and must therefore be
translated into an edge in the resulting graph when evaluating CocoPack/CocoIL and creating CocoGL.
By default, graphs are "complete", and statically link in the resulting graphs from dependency modules. Even though
dependencies on 3rd party modules will remain in the form of tokens and URNs, the graph contains the full [transitive
closure](https://en.wikipedia.org/wiki/Transitive_closure) of resources created by all packages run during evaluation.
Because the graph is a DAG, any cycles in this graph are illegal and will result in an error. It is ideal if higher-
level translation catches this, since each step in the translation process reduces the diagnosability of errors.
Dynamic linking is possible for advanced scenarios, and leverages URNs for discovery, but is described elsewhere.
Please refer to the [design document](https://github.com/pulumi/coconut/blob/master/docs/design/graphs.md) for more
information about CocoGL graphs.
### Resource Providers
In general, the Coconut toolset simply performs an in-order walk of the CocoGL DAG in order to perform a deployment.
However, clearly there must be more to this, in order to actually perform mutations in the target environment.
Some objects are simply abstractions. They exist solely as convenient ways to create other objects, but, at the end
of the day, there are no physical manifestations of them. Some objects, however, correspond to real physical resources
that must be consulted or manipulated during the planning and/or application processes. These are called *resources*,
and are indicated to the Coconut runtime by subclassing from a well-known base class:
export class MyResource extends coconut.Resource {
...
}
The extensibility mechanism used to define this logic is part of the Coconut SDK and is called a *resource provider*.
Each package that defines a resource must also carry with it a corresponding resource provider plugin. These plugins
are loaded dynamically by the Coconut runtime anytime it needs to interact with a resource in a given package. These
plugins implement a simple inteface consisting primarily of create, read, update, and delete (CRUD) operations.
Please refer to the [resources design document](resources.md) for more details on resources and resource providers.
## Scenarios
In this section, we'll walk through a few motivational scenarios beyond the usual compilation process from a high-level
CocoLang, to CocoPack, all the way to CocoGL which is deployed to an environment, to see how the file formats are used.
### Generating CocoGL from a Live Environment
An existing environment can be used to generate a CocoGL snapshot. This is called *graph inference*. The result is an
ordinary CocoGL resource graph containing a description of all resources, their properties, and their dependencies.
Snapshotting can make adoption of Coconut easier if you already have an environment you wish to model. It can also
facilitate identifying "drift" between a desired and actual state; we will see more about this in a moment.
Any CocoGL generated in this manner may have less information than CocoGL generated from true CocoLang source code, due
to the possibility of lossy representations and/or missing abstractions in an actual live environment. For example,
there could be "hidden" implicit dependencies between resources that cannot be discovered and hence will be missing.
Nevertheless, this can be a great first step towards adopting Coconut for your existing environments.
Generating a CocoGL snapshot from a live environment that was created using Coconut, on the other hand, can recover all
of this information reliably, thanks to special tagging that Coconut performs.
Even then, because some objects map to physical resources in a deployment -- like a VM, subnet, etc. -- and other
objects are mere abstractions, there is a limit to how much "reverse engineering" from a live environment can happen.
The creation of an abstraction merely serves to create those physical resources at the "bottom" of the dependency chain.
### Comparing Two CocoGLs
A primary feature of CocoGLs is that two of them can be compared to produce a diff. This has several use cases.
Coconut performs a diff between two CocoGL files to determine a delta for purposes of incremental deployment. This
allows it to change the live environment only where a difference between actual and desired state exists.
As seen above, CocoGL snapshot can be generated from a live environment. As such, a snapshot can be compared to another
graph -- perhaps generated from another live environment -- to determine and reconcile "drift" between them.
This capability can be used to discover differences between environments that are meant to be similar (e.g., in
different zones, staging vs. production, etc). Alternatively, this analysis could be used to to compare an environment
against a CocoLang program's resulting CocoGL, to identify places where manual changes were made to an actual
environment without having made corresponding changes in the sources (be it intentional or accidental).
To cope with some of the potential lossiness during snapshotting, Coconut implements a *semantic diff*, in addition
to a more strict exact diff, algorithm. The semantic diff classifies differences due to lossy inference distinct from
ordinary semantically meaningful differences that could be impacting a live environment's behavior.
### Creating or Updating CocoLang and CocoPack from CocoGL
All artifacts embed debugging information similar to [DWARF](https://en.wikipedia.org/wiki/DWARF) or [PDB](
https://en.wikipedia.org/wiki/Program_database), correlating source lines with the resulting artifacts. As a result,
it is possible to raise CocoGL into CocoPack and, from there, raise CocoPack into your favorite CocoLang.
It is, however, important to note one thing before getting into the details. There are many possible CocoPacks that
could generate a given CocoGL, due to conditional execution of code. There may even be many possible CocoLang programs
that could generate a given CocoPack, since CocoPack's language constructs are intentionally smaller than what might
exist in a higher-level programming language. In short, lowering and re-raising is not a round-trippable operation.
Nevertheless, this raising can come in handy for two reasons.
The first is that, thanks to raising, it is possible to reconcile diffs in part by making changes to the source CocoLang
programs. If we just altered the CocoGL for a given CocoLang program, the process would be incomplete, because then
the developer would be responsible for taking that altered CocoGL and translating it by hand into edits to the
program. Automating this process as much as possible is obviously appealing even if -- similar to manual diff
resolution when applying source patches -- this process requires a little bit of manual, but tool-assistable, work.
The second helps when bootstrapping an existing environment into Coconut for the first time. Not only can we generate
the CocoGL that corresponds to an existing environment, but we can generate a CocoLang in your favorite language, that
will generate an equivalent graph. This is called *program inference*. As with graph inference, the inference might
miss key elements like dependencies, and might not include all of the desirable abstractions and metadata, however this
can serve as a useful starting point for subsequent refactoring that would introduce such things.