Commit graph

35 commits

Author SHA1 Message Date
Harshavardhana bb639d9f29
remove double reads delete versions (#13544)
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,

the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.

additionally reduce the data transfer between
nodes by shortening msgp data values.
2021-11-01 10:50:07 -07:00
Poorna Krishnamoorthy c4373ef290
Add support for multi site replication (#12880) 2021-09-18 13:31:35 -07:00
Klaus Post 24722ddd02
Remove inline data hack (#12946)
move the code down to the storage layer,
this logic decouples the inline data from the 
size parameter making it flexible and future
proof.
2021-08-13 08:25:54 -07:00
Harshavardhana 40a2fa8e81
fix: add more optimizations to putMetacacheObject() (#12916)
- avoid extra lookup for 'xl.meta' since we are
  definitely sure that it doesn't exist.

- use this in newMultipartUpload() as well

- also additionally do not write with O_DSYNC
  to avoid loading the drives, instead create
  'xl.meta' for listing operations without
  O_DSYNC since these are ephemeral objects.

- do the same with newMultipartUpload() since
  it gets synced when the PutObjectPart() is
  attempted, we do not need to tax newMultipartUpload()
  instead.
2021-08-10 11:12:22 -07:00
Krishnan Parthasarathi 29eea52e14
Skip transitioning of object versions if inlined (#12705) 2021-07-16 09:38:27 -07:00
Aditya Manthramurthy 30a3921d3e
[Tiering] Support remote tiers with object versioning (#12342)
- Adds versioning support for S3 based remote tiers that have versioning
enabled. This ensures that when reading or deleting we specify the specific
version ID of the object. In case of deletion, this is important to ensure that
the object version is actually deleted instead of simply being marked for
deletion.

- Stores the remote object's version id in the tier-journal. Tier-journal file
version is not bumped up as serializing the new struct version is
compatible with old journals without the remote object version id.

- `storageRESTVersion` is bumped up as FileInfo struct now includes a
`TransitionRemoteVersionID` member.

- Azure and GCS support for this feature will be added subsequently.

Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
2021-06-03 14:26:51 -07:00
Nitish Tiwari 776589f0da Add free inode metric for Prometheus (#12225) 2021-05-06 12:50:48 -07:00
Krishnan Parthasarathi c829e3a13b Support for remote tier management (#12090)
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.

Some new additions include:

 - Admin APIs remote tier configuration management

 - Simple journal to track remote objects to be 'collected'
   This is used by object API handlers which 'mutate' object versions by
   overwriting/replacing content (Put/CopyObject) or removing the version
   itself (e.g DeleteObjectVersion).

 - Rework of previous ILM transition to fit the new model
   In the new model, a storage class (a.k.a remote tier) is defined by the
   'remote' object storage type (one of s3, azure, GCS), bucket name and a
   prefix.

* Fixed bugs, review comments, and more unit-tests

- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana 069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Anis Elleuch 0eb146e1b2
add additional metrics per disk API latency, API call counts #11250)
```
mc admin info --json
```

provides these details, for now, we shall eventually 
expose this at Prometheus level eventually. 

Co-authored-by: Harshavardhana <harsha@minio.io>
2021-03-16 20:06:57 -07:00
Harshavardhana 9ccc483df6
[feat]: change erasure coding default block size from 10MiB to 1MiB (#11721)
major performance improvements in range GETs to avoid large
read amplification when ranges are tiny and random

```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```

TTFB from 10MiB BlockSize
```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```

TTFB from 1MiB BlockSize
```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```

Full object reads however do see a slight change which won't be
noticeable in real world, so not doing any comparisons

TTFB still had improvements with full object reads with 1MiB

```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```

v/s

TTFB with 10MiB
```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```

This change should affect all new uploads, previous uploads should
continue to work with business as usual. But dramatic improvements can
be seen with these changes.
2021-03-06 14:09:34 -08:00
Harshavardhana b3c56b53fb
fix: metacache should only rename entries during cleanup (#11503)
To avoid large delays in metacache cleanup, use rename
instead of recursive delete calls, renames are cheaper
move the content to minioMetaTmpBucket and then cleanup
this folder once in 24hrs instead.

If the new cache can replace an existing one, we should
let it replace since that is currently being saved anyways,
this avoids pile up of 1000's of metacache entires for
same listing calls that are not necessary to be stored
on disk.
2021-02-11 10:22:03 -08:00
Anis Elleuch 65aa2bc614
ilm: Remove object in HEAD/GET if having an applicable ILM rule (#11296)
Remove an object on the fly if there is a lifecycle rule with delete
expiry action for the corresponding object.
2021-02-01 09:52:11 -08:00
Anis Elleuch e9ac7b0fb7
heal: Remove empty directories (#11354)
Since the introduction of __XLDIR__, an empty directory does not have a
meaning anymore in erasure mode. Make healing removes it wherever it
finds it.
2021-01-27 02:19:28 -08:00
Harshavardhana f21d650ed4
fix: readData in bulk call using messagepack byte wrappers (#11228)
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.

After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.

Benefits are seen from almost all objects from

1KiB - 32KiB

Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.

All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
2021-01-07 19:27:31 -08:00
Harshavardhana d0027c3c41
do not use large buffers if not necessary (#11220)
without this change, there is a performance
regression for small objects GETs, this makes
the overall speed to go back to pre '59d363'
commit days.
2021-01-04 18:51:52 -08:00
Harshavardhana c4131c2798
feat: Small object optimization read data in single bulk call (#11207) 2021-01-03 11:27:57 -08:00
Harshavardhana c19e6ce773
avoid a crash in crawler when lifecycle is not initialized (#11170)
Bonus for static buffers use bytes.NewReader instead of
bytes.NewBuffer, to use a more reader friendly implementation
2020-12-26 22:58:06 -08:00
Poorna Krishnamoorthy 1ebf6f146a Add support for ILM transition (#10565)
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
2020-11-19 18:47:17 -08:00
Harshavardhana 9a34fd5c4a Revert "Revert "Add delete marker replication support (#10396)""
This reverts commit 267d7bf0a9.
2020-11-19 18:43:58 -08:00
Harshavardhana 267d7bf0a9 Revert "Add delete marker replication support (#10396)"
This reverts commit 50c10a5087.

PR is moved to origin/dev branch
2020-11-12 11:43:14 -08:00
Poorna Krishnamoorthy 50c10a5087
Add delete marker replication support (#10396)
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).

This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
2020-11-10 15:24:14 -08:00
Harshavardhana 1a1f00fa15
fix: use internode data for DisksInfo, VolsInfo in message pack (#10821)
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.

Additionally, remove all unused walkers from the storage layer
2020-11-04 10:10:54 -08:00
Klaus Post 37749f4623
Optimize FileInfo(Version) transfer (#10775)
File Info decoding, in particular, is showing up as a major 
allocator and time consumer for internode data transfers

Switch to message pack for cross-server transfers:

```
MSGP:

Size: 945 bytes

BenchmarkEncodeFileInfoMsgp-32    	 1558444	       866 ns/op	   1.16 MB/s	       0 B/op	       0 allocs/op
BenchmarkDecodeFileInfoMsgp-32    	  479968	      2487 ns/op	   0.40 MB/s	     848 B/op	      18 allocs/op

GOB:

Size: 1409 bytes

BenchmarkEncodeFileInfoGOB-32    	  333339	      3237 ns/op	   0.31 MB/s	     576 B/op	      19 allocs/op
BenchmarkDecodeFileInfoGOB-32    	   20869	     57837 ns/op	   0.02 MB/s	   16439 B/op	     428 allocs/op
```
2020-11-02 17:07:52 -08:00
Klaus Post a982baff27
ListObjects Metadata Caching (#10648)
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c

Gist of improvements:

* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no 
  longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
2020-10-28 09:18:35 -07:00
Harshavardhana 17747db93f
fix: support healing older content (#10076)
This PR adds support for healing older
content i.e from 2yrs, 1yr. Also handles
other situations where our config was
not encrypted yet.

This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
2020-07-17 17:41:29 -07:00
Anis Elleuch 21a37e3393
fix: ListObjectVersions should return ordered Version & DeleteMarker (#9959)
The S3 specification says that versions are ordered in the response of
list object versions.

mc snapshot needs this to know which version comes first especially when
two versions have the same exact last-modified field.
2020-07-03 09:15:44 -07:00
Harshavardhana 4915433bd2
Support bucket versioning (#9377)
- Implement a new xl.json 2.0.0 format to support,
  this moves the entire marshaling logic to POSIX
  layer, top layer always consumes a common FileInfo
  construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
  for object placements.

Fixes #2111
2020-06-12 20:04:01 -07:00
poornas 3d3b75fb8d
Avoid overwriting object tags when changing lock (#9794) 2020-06-10 08:16:30 -07:00
Anis Elleuch 48e614b167
honor lifecycle expiration with tag rule (#9604) 2020-05-14 16:21:03 -07:00
Harshavardhana 23a8411732
Add a generic Walk()'er to list a bucket, optinally prefix (#9026)
This generic Walk() is used by likes of Lifecyle, or
KMS to rotate keys or any other functionality which
relies on this functionality.
2020-02-25 21:22:28 +05:30
Harshavardhana b3f22eac56 Offload listing to posix layer (#7611)
This PR adds one API WalkCh which
sorts and sends list over the network

Each disk walks independently in a sorted manner.
2019-05-14 13:49:10 -07:00
kannappanr 5ecac91a55
Replace Minio refs in docs with MinIO and links (#7494) 2019-04-09 11:39:42 -07:00
Anis Elleuch 380d6c6435 Use getObjectInfo() in both FS and XL ListObjects() to simplify and to return complete object information (#3247) 2016-11-13 11:48:02 -08:00
Harshavardhana bccf549463 server: Move all the top level files into cmd folder. (#2490)
This change brings a change which was done for the 'mc'
package to allow for clean repo and have a cleaner
github drop in experience.
2016-08-18 16:23:42 -07:00
Renamed from storage-datatypes.go (Browse further)