minio

Author	SHA1	Message	Date
Harshavardhana	2daba018d6	reduce allocations on multi-disk clusters (#12311) Multi-disk clusters initialize buffer pools per disk, which is perhaps expensive and perhaps not useful for a running server instance, as it may prevent re-use of buffers across sets. This change ensures that buffers can be re-used across sets at the drive level, which can reduce memory usage considerably on large drive setups.	2021-05-17 17:49:48 -07:00
Harshavardhana	2ab9dc7609	do not update bloomFilters for temporary objects	2021-05-15 19:54:07 -07:00
Harshavardhana	4d876d03e8	fix: do not fail upon faulty/non-writable drives. Gracefully start the server if there are other drives available, and print enough information for the administrator to notice the errors in the console. Bonus: for really large streams use a larger buffer for writes.	2021-05-15 12:57:18 -07:00
Klaus Post	229d83bb75	feat: add dynamic usage cache (#12229) A cache structure will be kept with a tree of usages. The cache is a tree structure where each node keeps track of its children. An uncompacted branch contains a count of the files only directly at the branch level, and contains links to child branches or leaves. The leaves are "compacted" based on a number of properties. A compacted leaf contains the totals of all files beneath it. A leaf is only scanned once every dataUsageUpdateDirCycles, rarer if the bloom filter for the path is clean and no lifecycles are applied. Skipped leaves have their totals transferred from the previous cycle. A clean leaf will be included once every healFolderIncludeProb for partial heal scans. When selected there is a one in healObjectSelectProb chance that any object will be chosen for heal scan. Compaction happens when either: - The folder (and subfolders) contains fewer than dataScannerCompactLeastObject objects. - The folder itself contains more than dataScannerCompactAtFolders folders. - The folder only contains objects and no subfolders. - A bucket root will never be compacted. Furthermore, if a has more than dataScannerCompactAtChildren recursive children (uncompacted folders) the tree will be recursively scanned and the branches with the least number of objects will be compacted until the limit is reached. This ensures that any branch will never contain an unreasonable amount of other branches, and also that small branches with few objects don't take up unreasonable amounts of space. Whenever a branch is scanned, it is assumed that it will be un-compacted before it hits any of the above limits. This will make the branch rebalance itself when scanned if the distribution of objects has changed. TL;DR: With current values: No bucket will ever have more than 10000 child nodes recursively. No single folder will have more than 2500 child nodes by itself. All subfolders are compacted if they have fewer than 500 objects in them recursively. We accumulate the (non-deletemarker) version count for paths as well, since we are changing the structure anyway.	2021-05-11 18:36:15 -07:00
Anis Elleuch	56d4d7b8b1	MRF: Better detection of non-stable disks (#12252) MRF does not detect when a node is disconnected and reconnected quickly. This change ensures that MRF is alerted by comparing the last disk reconnection timestamp with the last MRF check time. Signed-off-by: Anis Elleuch <anis@min.io> Co-authored-by: Klaus Post <klauspost@gmail.com>	2021-05-11 09:19:15 -07:00
Harshavardhana	764721e2c6	add root_disk threshold detection (#12259 ) as there is no automatic way to detect if there is a root disk mounted on / or /var for the container environments due to how the root disk information is masked inside overlay root inside container. this PR brings an environment variable to set root disk size threshold manually to detect the root disks in such situations.	2021-05-08 15:40:29 -07:00
Klaus Post	254698f126	fix: minor allocation improvements in xlMetaV2 (#12133 )	2021-05-07 09:11:05 -07:00
Nitish Tiwari	776589f0da	Add free inode metric for Prometheus (#12225 )	2021-05-06 12:50:48 -07:00
Harshavardhana	c8050bc079	fix: sleeper behavior in data scanner (#12164 ) do not apply healReplication() for ILM expired, transitioned objects	2021-04-27 08:24:44 -07:00
Poorna Krishnamoorthy	4be0f92067	Fix multipart restore to remove part match (#12161 ) Part ETags are not available after multipart finalizes, removing this check as not useful. Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-26 18:24:06 -07:00
Krishnan Parthasarathi	c829e3a13b	Support for remote tier management (#12090 ) With this change, MinIO's ILM supports transitioning objects to a remote tier. This change includes support for Azure Blob Storage, AWS S3 compatible object storage incl. MinIO and Google Cloud Storage as remote tier storage backends. Some new additions include: - Admin APIs remote tier configuration management - Simple journal to track remote objects to be 'collected' This is used by object API handlers which 'mutate' object versions by overwriting/replacing content (Put/CopyObject) or removing the version itself (e.g DeleteObjectVersion). - Rework of previous ILM transition to fit the new model In the new model, a storage class (a.k.a remote tier) is defined by the 'remote' object storage type (one of s3, azure, GCS), bucket name and a prefix. * Fixed bugs, review comments, and more unit-tests - Leverage inline small object feature - Migrate legacy objects to the latest object format before transitioning - Fix restore to particular version if specified - Extend SharedDataDirCount to handle transitioned and restored objects - Restore-object should accept version-id for version-suspended bucket (#12091) - Check if remote tier creds have sufficient permissions - Bonus minor fixes to existing error messages Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Krishna Srinivas <krishna@minio.io> Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	2ef824bbb2	collapse two distinct calls into single RenameData() call (#12093 ) This is an optimization by reducing one extra system call, and many network operations. This reduction should increase the performance for small file workloads.	2021-04-20 10:44:39 -07:00
Harshavardhana	1456f9f090	fix: preserve shared dataDir during suspend overwrites (#12058) When CopyObject() shares a dataDir, that dataDir needs to be preserved; overwrites on versioning-suspended buckets should still preserve the dataDir as well.	2021-04-15 08:44:05 -07:00
Harshavardhana	928ee1a7b2	remove null version dataDir upon overwrites (#12023 )	2021-04-08 19:55:44 -07:00
Klaus Post	d2ac2f758e	odirectReader: handle EOF correctly (#11998 ) EOF may be sent along with data so queue it up and return it when the buffer is empty. Also, when reading data without direct io don't add a buffer that only results in extra memcopy.	2021-04-07 08:32:59 -07:00
Klaus Post	788a8bc254	Fix disk info race (#11984) Protect updated members in xlStorage. ``` WARNING: DATA RACE Write at 0x00c004b4ee78 by goroutine 1491: github.com/minio/minio/cmd.(*xlStorage).GetDiskID() d:/minio/minio/cmd/xl-storage.go:590 +0x1078 github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).checkDiskStale() d:/minio/minio/cmd/xl-storage-disk-id-check.go:195 +0x84 github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol() d:/minio/minio/cmd/xl-storage-disk-id-check.go:284 +0x16a github.com/minio/minio/cmd.erasureObjects.getBucketInfo.func1() d:/minio/minio/cmd/erasure-bucket.go:100 +0x1a5 github.com/minio/minio/pkg/sync/errgroup.(*Group).Go.func1() d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7 Previous read at 0x00c004b4ee78 by goroutine 1087: github.com/minio/minio/cmd.(*xlStorage).CheckFile.func1() d:/minio/minio/cmd/xl-storage.go:1699 +0x384 github.com/minio/minio/cmd.(*xlStorage).CheckFile() d:/minio/minio/cmd/xl-storage.go:1726 +0x13c github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).CheckFile() d:/minio/minio/cmd/xl-storage-disk-id-check.go:446 +0x23b github.com/minio/minio/cmd.erasureObjects.parentDirIsObject.func1() d:/minio/minio/cmd/erasure-common.go:173 +0x194 github.com/minio/minio/pkg/sync/errgroup.(*Group).Go.func1() d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7 ```	2021-04-06 11:33:42 -07:00
Harshavardhana	5cce9361bc	fix: avoid an extra rename when there is no dataDir (#11964 ) also perform globalSync() in defer when enabled for RenameData(), to ensure all calls are flushed to disk.	2021-04-05 08:52:28 -07:00
Harshavardhana	d46386246f	api: Introduce metadata update APIs to update only metadata (#11962) Current implementation heavily relies on readAllFileInfo, but with the advent of xl.meta inlined with data we cannot easily avoid reading data when we are only interested in updating metadata. This invariably leads to write amplification during metadata updates, repeatedly reading data when only the metadata needs to change. This PR ensures that we implement a metadata-only update API at the storage layer that handles updates to metadata alone for any given version, given the version is valid and present. This helps reduce the chattiness for the following calls: - PutObjectTags - DeleteObjectTags - PutObjectLegalHold - PutObjectRetention - ReplicateObject (updates metadata on replication status)	2021-04-04 13:32:31 -07:00
Poorna Krishnamoorthy	47c09a1e6f	Various improvements in replication (#11949 ) - collect real time replication metrics for prometheus. - add pending_count, failed_count metric for total pending/failed replication operations. - add API to get replication metrics - add MRF worker to handle spill-over replication operations - multiple issues found with replication - fixes an issue when client sends a bucket name with `/` at the end from SetRemoteTarget API call make sure to trim the bucket name to avoid any extra `/`. - hold write locks in GetObjectNInfo during replication to ensure that object version stack is not overwritten while reading the content. - add additional protection during WriteMetadata() to ensure that we always write a valid FileInfo{} and avoid ever writing empty FileInfo{} to the lowest layers. Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-03 09:03:42 -07:00
Harshavardhana	434e5c0cfe	allow preserving legacyXLv1 with inline data format (#11951) Current master breaks an important requirement: we need to preserve the legacyXLv1 format. It is currently simply ignored and overwritten, causing a myriad of issues by leaving stale files on the namespace etc. For now let's still use the two-phase approach of writing to `tmp` and then renaming the content to the actual namespace.	2021-04-01 22:12:03 -07:00
Harshavardhana	204c610d84	do not use dataDir to reference inline data, use versionID (#11942) versionID is the one that needs to be preserved as well as overwritten in case of replication, transition etc. - dataDir is an ephemeral entity that changes during overwrites - make sure that versionID is used to save the object content. This would break things if you are already running the latest master; please wipe your current content and re-do your setup after this change.	2021-04-01 13:09:23 -07:00
Klaus Post	2623338dc5	Inline small file data in xl.meta file (#11758 )	2021-03-29 17:00:55 -07:00
Harshavardhana	d93c6cb9c7	use Access() instead of Lstat() for frequent use (#11911) Using Lstat() causes tiny memory allocations that are usually wasted and never used; instead we can simply use an Access() call, which does 0 memory allocations.	2021-03-29 08:07:23 -07:00
Harshavardhana	d7f32ad649	xl: avoid sending Delete() remote call for fully successful runs. An optimization to avoid extra syscalls in PutObject(), which add up in our PutObject response times.	2021-03-24 17:32:12 -07:00
Harshavardhana	410e84d273	xl: add checks for minioTmpMetaBucket in CreateFile	2021-03-24 09:36:10 -07:00
Harshavardhana	75741dbf4a	xl: remove cleanupDir instead use Delete() (#11880 ) use a single call to remove directly at disk instead of doing recursively at network layer.	2021-03-24 09:08:05 -07:00
Ritesh H Shukla	6a2ed44095	fix: optionally enable tracing posix calls	2021-03-23 22:23:08 -07:00
Anis Elleuch	98ff91b484	xl: Reduce usage of isDirEmpty() (#11838) When an object is removed, its parent directory is inspected to check whether it is empty, and removed if so. However, we can use os.Remove() directly since it is only able to remove a file or an empty directory.	2021-03-19 15:42:01 -07:00
Anis Elleuch	4d86384dc7	xl: Remove non needed check for empty dir (#11835 ) RenameData renames xl.meta and data dir and removes the parent directory if empty, however, there is a duplicate check for empty dir, since the parent dir of xl.meta is always the same as the data-dir.	2021-03-19 12:26:53 -07:00
Harshavardhana	b92a220db1	fix: handle weird drives sporadic read O_DIRECT behavior (#11832 ) on freshReads if drive returns errInvalidArgument, we should simply turn-off DirectIO and read normally, there are situations in k8s like environments where the drives behave sporadically in a single deployment and may not have been implemented properly to handle O_DIRECT for reads.	2021-03-18 20:16:50 -07:00
Harshavardhana	51a8619a79	[feat] Add configurable deadline for writers (#11822 ) This PR adds deadlines per Write() calls, such that slow drives are timed-out appropriately and the overall responsiveness for Writes() is always up to a predefined threshold providing applications sustained latency even if one of the drives is slow to respond.	2021-03-18 14:09:55 -07:00
Harshavardhana	60b0f2324e	storage write call path optimizations (#11805) - write in O_DSYNC instead of O_DIRECT for smaller objects to avoid unaligned double Write() situations that may arise for smaller objects < 128KiB - avoid fallocate() as it's not useful since we do not use Append() semantics anymore; fallocate is not useful for streaming I/O, so we can save on a syscall - createFile() doesn't need to validate `bucket` name with a Lstat() call since createFile() is only used to write at `minioTmpBucket` - use io.Copy() for unaligned writes to allow usage of ReadFrom() from *os.File, providing zero buffer writes.	2021-03-17 09:38:38 -07:00
Anis Elleuch	0eb146e1b2	add additional metrics per disk API latency, API call counts (#11250) ``` mc admin info --json ``` provides these details for now; we shall eventually expose this at Prometheus level. Co-authored-by: Harshavardhana <harsha@minio.io>	2021-03-16 20:06:57 -07:00
Klaus Post	fdc2f69218	truncate xl.meta files upon rewrites (#11749) If the destination file exists and is larger, junk data will be left at the end of the file.	2021-03-09 14:42:24 -08:00
Harshavardhana	d971061305	use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675 )	2021-03-06 09:25:48 -08:00
Klaus Post	fa9cf1251b	Improve healing and reporting (#11312) * Provide information on actively healing, buckets healed/queued, objects healed/failed. * Add concurrent healing of multiple sets (typically on startup). * Add bucket level resume, so restarts will only heal non-healed buckets. * Print summary after healing a disk is done.	2021-03-04 14:36:23 -08:00
Harshavardhana	c6a120df0e	fix: Prometheus metrics to re-use storage disks (#11647 ) also re-use storage disks for all `mc admin server info` calls as well, implement a new LocalStorageInfo() API call at ObjectLayer to lookup local disks storageInfo also fixes bugs where there were double calls to StorageInfo()	2021-03-02 17:28:04 -08:00
Harshavardhana	2f4af09c01	fix: allow changes to readAllData to decrement activeCount()	2021-02-28 20:09:23 -08:00
Harshavardhana	37960cbc2f	fix: avoid writing more content on network with O_DIRECT reads (#11659) An io.LimitReader was missing for the 'length' parameter for ranged requests, which would cause clients to get truncated responses and errors. fixes #11651	2021-02-28 15:33:03 -08:00
Harshavardhana	9171d6ef65	rename all references from crawl -> scanner (#11621 )	2021-02-26 15:11:42 -08:00
Harshavardhana	6386b45c08	[feat] use rename instead of recursive deletes (#11641) Most of the delete calls today spend time in a blocking operation where multiple calls need to be recursively sent to delete the objects. Instead we can use a rename operation to atomically move the objects from the namespace to `tmp/.trash`. We can schedule deletion of objects at this location once every 15 or 30 minutes, and we can also add wait times between each delete operation. This allows us to make deletes faster as well as less chatty on the drives; each server locally runs a goroutine which cleans this up regularly.	2021-02-26 09:52:27 -08:00
Harshavardhana	b517c791e9	[feat]: use DSYNC for xl.meta writes and NOATIME for reads (#11615) Instead of using O_SYNC, we are better off using O_DSYNC since we are only ever interested in data being persisted to disk, not the associated filesystem metadata. For reads we ask customers to turn off atime, but instead we can proactively use the O_NOATIME flag to avoid atime updates upon reads.	2021-02-24 00:14:16 -08:00
Harshavardhana	18ec933085	fix: for containers use root-disk detection cleverly (#11593) The root-disk detection implemented currently had issues where root disk partitions getting modified might race and provide incorrect results; to avoid this, let's go back to relying on DeviceID and match it instead. In case of containers, `/data` is one such extra entity that needs to be verified for root disk, due to how the 'overlay' filesystem works: the 'overlay' presents a completely different 'device' id. Using `/data` as another entity for fallback helps because our containers describe a 'VOLUME' parameter that allows containers to automatically have a virtual `/data` that points to the container root path; this can either be at `/` or `/var/lib/` (on a different partition).	2021-02-22 10:32:21 -08:00
Harshavardhana	8778828a03	fix: read metadata in O_DIRECT if configured and supported (#11594 ) reduce the page-cache pressure completely by moving the entire read-phase of our operations to O_DIRECT, primarily this is going to be very useful for chatty metadata operations such as listing, scanner, ilm, healing like operations to avoid filling up the page-cache upon repeated runs.	2021-02-22 01:36:17 -08:00
Harshavardhana	7875d472bc	avoid notification for non-existent delete objects (#11514 ) Skip notifications on objects that might have had an error during deletion, this also avoids unnecessary replication attempt on such objects. Refactor some places to make sure that we have notified the client before we - notify - schedule for replication - lifecycle etc.	2021-02-10 22:00:42 -08:00
Harshavardhana	0e3211f4ad	fix: server upgrades should have more descriptive error messages (#11476) During rolling upgrade, provide a more descriptive error message and discourage rolling upgrade in such situations, allowing users to take action. Additionally rename `slashpath -> pathutil` to avoid a slightly mis-pronounced usage of the `path` package.	2021-02-08 10:15:12 -08:00
Harshavardhana	c9b0f595b9	support directory objects in listing in certain scenarios (#11452) When a directory object is presented as a `prefix` param, our implementation tends to list only the objects sharing the `prefix` rather than the `prefix` itself. To mimic AWS S3-like flat key behavior, this PR ensures that if `prefix` is a directory object, it is automatically considered to be part of the eventual listing result. fixes #11370	2021-02-05 10:12:25 -08:00
Harshavardhana	f108873c48	fix: replication metadata comparison and other fixes (#11410) - using miniogo.ObjectInfo.UserMetadata is not correct - using UserTags from Map->String() can change order - ContentType comparison needs to be removed. - Compare both lowercase and uppercase key names. - do not silently error out constructing PutObjectOptions if tag parsing fails - avoid notification for empty object info, failed operations should rely on valid objInfo for notification in all situations - optimize copyObject implementation, also introduce a new replication event - clone ObjectInfo() before scheduling for replication - add additional headers for comparison - remove strings.EqualFold comparison to avoid unexpected bugs - fix pool based proxying with multiple pools - compare only specific metadata Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>	2021-02-03 20:41:33 -08:00
Anis Elleuch	b3f81e75f6	xl: Make it clear when to create delete marker for a non-existent object (#11423)	2021-02-03 10:33:43 -08:00
Anis Elleuch	6ef678663e	xl: Create a delete-marker when no other version exists (#11362) Currently, it is not possible to create a delete-marker when xl.meta does not exist (no version is created for that object yet). This causes a problem for replication and mc mirroring with versioning enabled. This also follows the S3 specification.	2021-02-01 13:23:50 -08:00
Anis Elleuch	65aa2bc614	ilm: Remove object in HEAD/GET if having an applicable ILM rule (#11296 ) Remove an object on the fly if there is a lifecycle rule with delete expiry action for the corresponding object.	2021-02-01 09:52:11 -08:00
Anis Elleuch	e9ac7b0fb7	heal: Remove empty directories (#11354) Since the introduction of __XLDIR__, an empty directory does not have a meaning anymore in erasure mode. Make healing remove it wherever it is found.	2021-01-27 02:19:28 -08:00
Harshavardhana	43f973c4cf	fix: check for O_DIRECT support for reads and writes (#11331 ) In-case user enables O_DIRECT for reads and backend does not support it we shall proceed to turn it off instead and print a warning. This validation avoids any unexpected downtimes that users may incur.	2021-01-22 15:38:21 -08:00
Harshavardhana	d1a8f0b786	fix possible crashes on deleteMarker replication (#11308 ) Delete marker can have `metaSys` set to nil, that can lead to crashes after the delete marker has been healed. Additionally also fix isObjectDangling check for transitioned objects, that do not have parts should be treated similar to Delete marker.	2021-01-20 13:12:12 -08:00
Harshavardhana	b5049d541f	fix: reduce an extra readdir() attempted on non-legacy setups (#11301) To verify moving content and preserving legacy content, we have a way to detect the objects through readdir(). This path is not necessary for most common cases on newer setups, so avoid readdir() to save multiple system calls. Also fix the CheckFile behavior for the most common use case, i.e. without legacy format.	2021-01-19 10:01:06 -08:00
Harshavardhana	3ca6330661	fix: optimize parentDirIsObject by moving isObject to storage layer (#11291) For objects with `N` prefix depth, this PR reduces `N` such network operations by converting `CheckFile` into a single bulk operation. Reduction in chattiness here would allow disks to be utilized more cleanly while maintaining the same functionality; in addition, one extra volume check stat() call is removed. Update tests to test multiple sets scenario	2021-01-18 12:25:22 -08:00
Harshavardhana	f903cae6ff	Support variable server pools (#11256 ) Current implementation requires server pools to have same erasure stripe sizes, to facilitate same SLA and expectations. This PR allows server pools to be variadic, i.e they do not have to be same erasure stripe sizes - instead they should have SLA for parity ratio. If the parity ratio cannot be guaranteed by the new server pool, the deployment is rejected i.e server pool expansion is not allowed.	2021-01-16 12:08:02 -08:00
Harshavardhana	1a5775e2e8	enable small and large file optimization (#11260) - for large objects we found that a 1MiB block works well for r/w respectively. - for small objects we found that a 128KiB block works well for r/w respectively.	2021-01-12 10:20:39 -08:00
Harshavardhana	e4e117faab	fix: enable xl.json to xl.meta only if legacy drive is found (#11255 ) another optimization is renameLegacyMetadata() never needs to validate bucket with os.Stat() again, leading to reduction in one extra syscall.	2021-01-11 02:27:04 -08:00
Harshavardhana	4593b146be	fix: print errors only when metacache status has errors (#11248 )	2021-01-08 16:52:19 +05:30
Harshavardhana	f21d650ed4	fix: readData in bulk call using messagepack byte wrappers (#11228) This PR refactors the way we use buffers for O_DIRECT and re-uses those buffers for the messagepack reader and writer. After some extensive benchmarking we found that not all objects see this benefit; only objects smaller than 64KiB benefit overall. Benefits are seen for almost all objects from 1KiB - 32KiB. Beyond this no objects see benefit with the bulk call approach, as the latency of bytes sent over the wire v/s streaming content directly from disk negate each other with no remarkable benefits. All other optimizations include reuse of msgp.Reader, msgp.Writer using sync.Pool's for all internode calls.	2021-01-07 19:27:31 -08:00
Harshavardhana	76e2713ffe	fix: use buffers only when necessary for io.Copy() (#11229) Use separate sync.Pool for writes/reads. Avoid passing buffers to io.CopyBuffer() if the reader or writer implements io.WriterTo or io.ReaderFrom respectively; it is then useless for sync.Pool to allocate buffers on its own since they will be completely ignored by the io.CopyBuffer Go implementation. Improve this wherever we see this to be optimal. This allows us to be more efficient on memory usage. ``` // copyBuffer is the actual implementation of Copy and CopyBuffer. // if buf is nil, one is allocated. func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) { // If the reader has a WriteTo method, use it to do the copy. // Avoids an allocation and a copy. if wt, ok := src.(WriterTo); ok { return wt.WriteTo(dst) } // Similarly, if the writer has a ReadFrom method, use it to do the copy. if rt, ok := dst.(ReaderFrom); ok { return rt.ReadFrom(src) } ``` From readahead package ``` // WriteTo writes data to w until there's no more data to write or when an error occurs. // The return value n is the number of bytes written. // Any error encountered during the write is also returned. func (a *reader) WriteTo(w io.Writer) (n int64, err error) { if a.err != nil { return 0, a.err } n = 0 for { err = a.fill() if err != nil { return n, err } n2, err := w.Write(a.cur.buffer()) a.cur.inc(n2) n += int64(n2) if err != nil { return n, err } ```	2021-01-06 09:36:55 -08:00
Harshavardhana	d0027c3c41	do not use large buffers if not necessary (#11220 ) without this change, there is a performance regression for small objects GETs, this makes the overall speed to go back to pre '59d363' commit days.	2021-01-04 18:51:52 -08:00
Harshavardhana	c4b1d394d6	erasure: avoid io.Copy in hotpaths to reduce allocation (#11213 )	2021-01-03 16:27:34 -08:00
Harshavardhana	c4131c2798	feat: Small object optimization read data in single bulk call (#11207 )	2021-01-03 11:27:57 -08:00
Anis Elleuch	c9d502e6fa	parentDirIsObject() to return quickly with non-existent parent (#11204) Rewrite parentDirIsObject() function. Currently if a client uploads a/b/c/d, we always check if c, b, a are actual objects or not. The new code will check in the reverse order and quit quickly if the segment doesn't exist. So if a, b, c in 'a/b/c' do not exist in the first place, it returns false quickly.	2021-01-02 12:01:29 -08:00
Anis Elleuch	677e80c0f8	xl: Remove check-dir in ReadVersion (#11200) The only purpose of the check-dir flag in ReadVersion is to return 404 when an object has xl.meta but no data. This is causing an extra call to the disk, which can be penalizing in case of a busy system where disks receive many concurrent accesses.	2021-01-02 10:35:57 -08:00
Anis Elleuch	a317d220ed	xl-storage: Do not stat bucket assuming the object exists (#11201 ) In HEAD/GET, only STAT the bucket if the object does not exist to return the correct error response.	2021-01-01 09:44:36 -08:00
Harshavardhana	cc457f1798	fix: enhance logging in crawler use console.Debug instead of logger.Info (#11179 )	2020-12-29 01:57:28 -08:00
Harshavardhana	445a9bd827	fix: heal optimizations in crawler to avoid multiple healing attempts (#11173 ) Fixes two problems - Double healing when bitrot is enabled, instead heal attempt once in applyActions() before lifecycle is applied. - If applyActions() is successful and getSize() returns proper value, then object is accounted for and should be removed from the oldCache namespace map to avoid double heal attempts.	2020-12-28 10:31:00 -08:00
Harshavardhana	c19e6ce773	avoid a crash in crawler when lifecycle is not initialized (#11170 ) Bonus for static buffers use bytes.NewReader instead of bytes.NewBuffer, to use a more reader friendly implementation	2020-12-26 22:58:06 -08:00
Harshavardhana	a773cf48d8	fix: overlapping object and prefix rejected (#11130 ) fixes #11129	2020-12-18 08:51:09 -08:00
Harshavardhana	3e83643320	lifecycle improvements and additional debug logging (#11096 ) Bonus change fix browser assets	2020-12-13 12:05:54 -08:00
Anis Elleuch	f164085227	xl: Always set root disk to true in test environment (#11094) Test environments (go test or manual testing) should always consider the passed disks to be root disks and should not rely on the disk.IsRootDisk() function. The reason is that the latter can return a false negative when called in a busy system. However, returning a false negative will only occur in a testing environment and not in production, so we can accept this trade-off for now.	2020-12-12 16:10:07 -08:00
Harshavardhana	d8c1f93de6	reject mixed drive situations with drives on root disks (#11057) Until now we matched the inode number of the root drive against the drive path minio would use; if they matched, we knew that it's a root disk. This may not be true in all situations, such as running inside a container environment where the container might be mounted from a different partition altogether, so root disk detection might fail.	2020-12-09 00:27:02 -08:00
Ritesh H Shukla	038bcd9079	Add replication capacity metrics support in crawler (#10786 )	2020-12-07 13:47:48 -08:00
Klaus Post	a896125490	Add crawler delay config + dynamic config values (#11018 )	2020-12-04 09:32:35 -08:00
Harshavardhana	96c0ce1f0c	add support for tuning healing to make healing more aggressive (#11003 ) supports `mc admin config set <alias> heal sleep=100ms` to enable more aggressive healing under certain times. also optimize some areas that were doing extra checks than necessary when bitrotscan was enabled, avoid double sleeps make healing more predictable. fixes #10497	2020-12-02 11:12:00 -08:00
Harshavardhana	bdd094bc39	fix: avoid sending errors on missing objects on locked buckets (#10994 ) make sure multi-object delete returned errors that are AWS S3 compatible	2020-11-28 21:15:45 -08:00
Harshavardhana	df93102235	fix: unwrapping issues with os.Is* functions (#10949 ) reduces 3 stat calls, reducing the overall startup time significantly.	2020-11-23 08:36:49 -08:00
Poorna Krishnamoorthy	39f3d5493b	Show Delete replication status header (#10946) X-Minio-Replication-Delete-Status header shows the status of the replication of a permanent delete of a version. All GETs are disallowed and return 405 on this object version. In the case of replicating delete markers, X-Minio-Replication-DeleteMarker-Status shows the status of replication, and would similarly return 405. Additionally, this PR adds reporting of delete marker event completion and updates documentation.	2020-11-21 23:48:50 -08:00
Poorna Krishnamoorthy	1ebf6f146a	Add support for ILM transition (#10565 ) This PR adds transition support for ILM to transition data to another MinIO target represented by a storage class ARN. Subsequent GET or HEAD for that object will be streamed from the transition tier. If PostRestoreObject API is invoked, the transitioned object can be restored for duration specified to the source cluster.	2020-11-19 18:47:17 -08:00
Harshavardhana	9a34fd5c4a	Revert "Revert "Add delete marker replication support (#10396 )"" This reverts commit `267d7bf0a9`.	2020-11-19 18:43:58 -08:00
Harshavardhana	267d7bf0a9	Revert "Add delete marker replication support (#10396 )" This reverts commit `50c10a5087`. PR is moved to origin/dev branch	2020-11-12 11:43:14 -08:00
Poorna Krishnamoorthy	50c10a5087	Add delete marker replication support (#10396 ) Delete marker replication is implemented for V2 configuration specified in AWS spec (though AWS allows it only in the V1 configuration). This PR also brings in a MinIO only extension of replicating permanent deletes, i.e. deletes specifying version id are replicated to target cluster.	2020-11-10 15:24:14 -08:00
Harshavardhana	fde3299bf3	re-use optimized readdir for isDirEmpty() (#10829 ) reduces effective memory usage by an order of magnitude, also increases performance for small objects	2020-11-04 13:05:21 -08:00
Harshavardhana	1a1f00fa15	fix: use internode data for DisksInfo, VolsInfo in message pack (#10821 ) Similar to #10775 for fewer memory allocations, since we use getOnlineDisks() extensively for listing we should optimize it further. Additionally, remove all unused walkers from the storage layer	2020-11-04 10:10:54 -08:00
Klaus Post	37749f4623	Optimize FileInfo(Version) transfer (#10775 ) File Info decoding, in particular, is showing up as a major allocator and time consumer for internode data transfers. Switch to message pack for cross-server transfers: ``` MSGP: Size: 945 bytes BenchmarkEncodeFileInfoMsgp-32 1558444 866 ns/op 1.16 MB/s 0 B/op 0 allocs/op BenchmarkDecodeFileInfoMsgp-32 479968 2487 ns/op 0.40 MB/s 848 B/op 18 allocs/op GOB: Size: 1409 bytes BenchmarkEncodeFileInfoGOB-32 333339 3237 ns/op 0.31 MB/s 576 B/op 19 allocs/op BenchmarkDecodeFileInfoGOB-32 20869 57837 ns/op 0.02 MB/s 16439 B/op 428 allocs/op ```	2020-11-02 17:07:52 -08:00
Klaus Post	86e0d272f3	Reduce WriteAll allocs (#10810 ) WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers used by `io.CopyBuffer` even if they are pooled. Since all writers appear to write byte buffers, just send those instead and write directly. The files are opened through the `os` package so they have no special properties anyway. This removes the alloc and copy for each operation. REST sends content length so a precise alloc can be made.	2020-11-02 16:14:31 -08:00
Krishna Srinivas	3a2f89b3c0	fix: add support for O_DIRECT reads for erasure backends (#10718 )	2020-10-30 11:04:29 -07:00
Klaus Post	a982baff27	ListObjects Metadata Caching (#10648 ) Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c Gist of improvements: * Cross-server caching and listing will use the same data across servers and requests. * Lists can be arbitrarily resumed at a constant speed. * Metadata for all files scanned is stored for streaming retrieval. * The existing bloom filters controlled by the crawler are used for validating caches. * Concurrent requests for the same data (or parts of it) will not spawn additional walkers. * Listing a subdirectory of an existing recursive cache will use the cache. * All listing operations are fully streamable so the number of objects in a bucket no longer dictates the amount of memory. * Listings can be handled by any server within the cluster. * Caches are cleaned up when out of date or superseded by a more recent one.	2020-10-28 09:18:35 -07:00
Anis Elleuch	eb95353cb1	fix: Get/HeadObject return 404 on non quorum objects (#10753 )	2020-10-26 10:30:46 -07:00
Anis Elleuch	00124c56d9	erasure: Commit data before xl.meta in RenameData() (#10734 ) This will reduce the chance to have updated xl.meta without data.	2020-10-23 21:54:58 -07:00
Harshavardhana	2042d4873c	rename crawler config option to heal (#10678 )	2020-10-14 13:51:51 -07:00
Klaus Post	03991c5d41	crawler: Remove waitForLowActiveIO (#10667 ) Only use dynamic delays for the crawler. Even though the max wait was 1 second the number of waits could severely impact crawler speed. Instead of relying on a global metric, we use the stateless local delays to keep the crawler running at a speed more adjusted to current conditions. The only case we keep it is before bitrot checks when enabled.	2020-10-13 13:45:08 -07:00
Harshavardhana	a0d0645128	remove safeMode behavior in startup (#10645 ) In almost all scenarios MinIO now is mostly ready for all sub-systems independently; safe-mode is no longer useful and does not serve its originally intended purpose. allow server to be fully functional even with config partially configured, this is to cater for availability of actual I/O v/s manually fixing the server. In k8s like environments it will never make sense to take a pod into safe-mode state, because there is no real access to perform any remote operation on them.	2020-10-09 09:59:52 -07:00
Harshavardhana	736e58dd68	fix: handle concurrent lockers with multiple optimizations (#10640 ) - select lockers which are non-local and online to have affinity towards remote servers for lock contention - optimize lock retry interval to avoid sending too many messages during lock contention, reduces average CPU usage as well - if bucket is not set, when deleteObject fails make sure setPutObjHeaders() honors lifecycle only if bucket name is set. - fix top locks to always list out the oldest lockers, avoid getting bogged down by map's unordered nature.	2020-10-08 12:32:32 -07:00
Harshavardhana	2b4eb87d77	pick disks which are common maximally used (#10600 ) further optimization to ensure that good disks are always used for listing, other than healing we only use disks that are maximally used.	2020-09-29 22:54:02 -07:00
Harshavardhana	00eb6f6bc9	cache DiskInfo at storage layer for performance (#10586 ) `mc admin info` on busy setups will not move HDD heads unnecessarily for repeated calls, providing better responsiveness for the call overall. Bonus change: allow listTolerancePerSet to be N-1 for good entries, to avoid skipping entries when, for some reason, one of the disks went offline.	2020-09-29 09:54:41 -07:00

184 commits