We saw an issue where a user was mid-update, and got a networking
error stating `read: operation timed out`. We believe this was simply
a local client error, due to a flaky network. We should be resilient
to such things during updates, particularly when there's no way to
"reattach" to an in-progress udpate (see pulumi/pulumi#762).
This change accomplishes this by changing our retry logic in the
cloud backend's waitForUpdates function. Namely:
* We recognize three types of failure, and react differently:
- Expected HTTP errors. For instance, the 504 Gateway Timeouts
that we already retried in the face of. In these cases, we will
silently retry up to 10 times. After 10 times, we begin warning
the user just in case this is a persistent condition.
- Unexpected HTTP errors. The CLI will quit immediately and issue
an error to the user, in the usual ways. This covers
Unauthorized among other things. Over time, we may find that we
want to intentionally move some HTTP errors into the above.
- Anything else. This covers the transient networking errors case
that we have just seen. I'll admit, it's a wide net, but any
instance of this error issues a warning and it's up to the user
to ^C out of it. We also log the error so that we'll see it if
the user shares their logs with us.
* We implement backoff logic so that we retry very quickly (100ms)
on the first failure, and more slowly thereafter (1.5x, up to a max
of 5 seconds). This helps to avoid accidentally DoSing our service.