Note: for the purposes of this discussion, archives will be treated as
assets, as their differences are not particularly meaningful.
Currently, the identity of an asset is derived from the hash and the
location of its contents (i.e. two assets are equal iff their contents
have the same hash and the same path/URI/inline value). This means that
changing the source of an asset will cause the engine to detect a
difference in the asset even if the source's contents are identical. At
best, this leads to inefficiencies such as unnecessary updates. This
commit changes asset identity so that it is derived solely from an
asset's hash. The source of an asset's contents is no longer part of
the asset's identity, and need only be provided if the contents
themselves may need to be available (e.g. if a hash does not yet exist
for the asset or if the asset's contents might be needed for an update).
This commit also changes the way old assets are exposed to providers.
Currently, an old asset is exposed as both its hash and its contents.
This allows providers to take a dependency on the contents of an old
asset being available, even though this is not an invariant of the
system. These changes remove the contents of old assets from their
serialized form when they are passed to providers, eliminating the
ability of a provider to take such a dependency. In combination with the
changes to asset identity, this allows a provider to detect changes to
an asset simply by comparing its old and new hashes.
This is half of the fix for [pulumi/pulumi-cloud#158]. The other half
involves changes in [pulumi/pulumi-terraform].
We currently have a nasty issue with archive assets wherein they read
their entire contents into memory each time they are accessed (e.g. for
hashing or translation). This interacts badly with scenarios that
place large amounts of data in an archive: aside from limiting the size
of an archive the engine can handle, it also bloats the engine's memory
requirements. This appears to have caused issues when running the PPC in
AWS: evidence suggests that the very high peak memory requirements this
approach implies caused high swap traffic that impacted the service's
availability.
In order to fix this issue, these changes move archives onto a
streaming read model. In order to read an archive, a user:
- Opens the archive with `Archive.Open`. This returns an ArchiveReader.
- Iterates over its contents using `ArchiveReader.Next`. Each returned
blob must be read in full between successive calls to
`ArchiveReader.Next`. This requirement is essentially forced upon us
by the streaming nature of TAR archives.
- Closes the ArchiveReader with `ArchiveReader.Close`.
This model does not require that the complete contents of the archive or
any of its constituent files are in memory at any given time.
Fixes#325.
This change fixes a couple bugs with assets:
* We weren't recursing into subdirectories in the new "path as
archive" feature, which meant we missed most of the files.
* We need to make paths relative to the root of the archive
directory itself, otherwise paths end up redundantly including
the asset's root folder path.
* We need to clean the file paths before adding them to the
archive asset map, otherwise they are inconsistent between the
path, tar, tgz, and zip cases.
* Ignore directories when traversing zips, since they aren't
included in the other formats.
* Tolerate io.EOF errors when reading the ZIP contents into blobs.
* Add test cases for the four different archive kinds.
This fixespulumi/pulumi-aws#50.
This improves a few things about assets:
* Compute and store hashes as input properties, so that changes on
disk are recognized and trigger updates (pulumi/pulumi#153).
* Issue explicit and prompt diagnostics when an asset is missing or
of an unexpected kind, rather than failing late (pulumi/pulumi#156).
* Permit raw directories to be passed as archives, in addition to
archive formats like tar, zip, etc. (pulumi/pulumi#240).
* Permit not only assets as elements of an archive's member list, but
also other archives themselves (pulumi/pulumi#280).
This change brings the same typed serialization we use for RPC
to the serialization of deployments. This ensures that we get
repeatable diffs from one deployment to the next.