Commit graph

48 commits

Author SHA1 Message Date
Klaus Post
017456df63 Wait clearing the close channel (#8250)
Close channel should not be nilled before goroutines have exited.

Fixes potential hang on closing.
2019-09-16 16:18:01 -07:00
Klaus Post
ddea0bdf11 Concurrent CSV parsing and reduce S3 select allocations (#8200)
```
CSV parsing, BEFORE:
BenchmarkReaderBasic-12         	    2842	    407533 ns/op	  397860 B/op	     957 allocs/op
BenchmarkReaderReplace-12       	    2718	    429914 ns/op	  397844 B/op	     957 allocs/op
BenchmarkReaderReplaceTwo-12    	    2718	    435556 ns/op	  397855 B/op	     957 allocs/op
BenchmarkAggregateCount_100K-12    	     171	   6798974 ns/op	16667102 B/op	  308077 allocs/op
BenchmarkAggregateCount_1M-12    	      19	  65657411 ns/op	168057743 B/op	 3146610 allocs/op
BenchmarkSelectAll_10M-12    	       1	20882119900 ns/op	2758799896 B/op	41978762 allocs/op

CSV parsing, AFTER:
BenchmarkReaderBasic-12         	    3721	    312549 ns/op	  101920 B/op	     338 allocs/op
BenchmarkReaderReplace-12       	    3776	    318810 ns/op	  101993 B/op	     340 allocs/op
BenchmarkReaderReplaceTwo-12    	    3610	    330967 ns/op	  102012 B/op	     341 allocs/op
BenchmarkAggregateCount_100K-12    	     295	   4149588 ns/op	 3553623 B/op	  103261 allocs/op
BenchmarkAggregateCount_1M-12    	      30	  37746503 ns/op	33827931 B/op	 1049435 allocs/op
BenchmarkSelectAll_10M-12    	       1	17608495800 ns/op	1416504040 B/op	21007082 allocs/op

~ benchcmp old.txt new.txt
benchmark                           old ns/op       new ns/op       delta
BenchmarkReaderBasic-12             407533          312549          -23.31%
BenchmarkReaderReplace-12           429914          318810          -25.84%
BenchmarkReaderReplaceTwo-12        435556          330967          -24.01%
BenchmarkAggregateCount_100K-12     6798974         4149588         -38.97%
BenchmarkAggregateCount_1M-12       65657411        37746503        -42.51%
BenchmarkSelectAll_10M-12           20882119900     17608495800     -15.68%

benchmark                           old allocs     new allocs     delta
BenchmarkReaderBasic-12             957            338            -64.68%
BenchmarkReaderReplace-12           957            340            -64.47%
BenchmarkReaderReplaceTwo-12        957            341            -64.37%
BenchmarkAggregateCount_100K-12     308077         103261         -66.48%
BenchmarkAggregateCount_1M-12       3146610        1049435        -66.65%
BenchmarkSelectAll_10M-12           41978762       21007082       -49.96%

benchmark                           old bytes      new bytes      delta
BenchmarkReaderBasic-12             397860         101920         -74.38%
BenchmarkReaderReplace-12           397844         101993         -74.36%
BenchmarkReaderReplaceTwo-12        397855         102012         -74.36%
BenchmarkAggregateCount_100K-12     16667102       3553623        -78.68%
BenchmarkAggregateCount_1M-12       168057743      33827931       -79.87%
BenchmarkSelectAll_10M-12           2758799896     1416504040     -48.66%
```

```
BenchmarkReaderHuge/97K-12         	    2200	    540840 ns/op	 184.32 MB/s	 1604450 B/op	     687 allocs/op
BenchmarkReaderHuge/194K-12        	    1522	    752257 ns/op	 265.04 MB/s	 2143135 B/op	    1335 allocs/op
BenchmarkReaderHuge/389K-12        	    1190	    947858 ns/op	 420.69 MB/s	 3221831 B/op	    2630 allocs/op
BenchmarkReaderHuge/778K-12        	     806	   1472486 ns/op	 541.61 MB/s	 5201856 B/op	    5187 allocs/op
BenchmarkReaderHuge/1557K-12       	     426	   2575269 ns/op	 619.36 MB/s	 9101330 B/op	   10233 allocs/op
BenchmarkReaderHuge/3115K-12       	     286	   4034656 ns/op	 790.66 MB/s	12397968 B/op	   16099 allocs/op
BenchmarkReaderHuge/6230K-12       	     172	   6830563 ns/op	 934.05 MB/s	16008416 B/op	   26844 allocs/op
BenchmarkReaderHuge/12461K-12      	     100	  11409467 ns/op	1118.39 MB/s	22655163 B/op	   48107 allocs/op
BenchmarkReaderHuge/24922K-12      	      66	  19780395 ns/op	1290.19 MB/s	35158559 B/op	   90216 allocs/op
BenchmarkReaderHuge/49844K-12      	      34	  37282559 ns/op	1369.03 MB/s	60528624 B/op	  174497 allocs/op
```
2019-09-13 14:18:35 -07:00
Yao Zongyou
18fedc67d5 friendly prompt for s3select MalformedXML error (#8171)
partly fix #7911
2019-09-09 21:33:27 -07:00
Yao Zongyou
ec9bfd3aef speed up the performance of s3select on csv (#7945) 2019-08-31 00:07:40 -07:00
Kanagaraj M
12353caf35 Fix: Support Unicode delimiters in s3 select (#7931) 2019-07-17 19:10:17 +01:00
Yao Zongyou
c4f480a839 fix csv read bug (#7885) 2019-07-05 12:08:56 -07:00
Yao Zongyou
60831e3299 aggregation functions' argument may already has been cast to numeric (#7876) 2019-07-05 10:38:38 -07:00
Yao Zongyou
037319066f fix unicode support related bugs in s3select (#7877) 2019-07-05 09:43:10 -07:00
Ryan Tam
bd56f80250 Fix ignored alias for aggregate result in S3 Select (#7849)
The SQL parser as it stands right now ignores alias for aggregate
result, e.g. `SELECT COUNT(*) AS thing FROM s3object` doesn't actually
return record like `{"thing": 42}`, it returns a record like `{"_1": 42}`.
Column alias for aggregate result is supported in AWS's S3 Select, so
this commit fixes that by respecting the `expr.As` in the expression.

Also improve test for S3 select

On top of testing a simple `SELECT` query, we want to test a few more
"advanced" queries (e.g. aggregation).

Convert existing tests into table driven tests[1], and add the new test
cases with "advanced" queries into them.

[1] - https://github.com/golang/go/wiki/TableDrivenTests
2019-07-03 16:34:54 -07:00
Yao Zongyou
941fed8e4a s3Select: call Close on error to release the read lock (#7830) 2019-06-25 13:30:48 -07:00
Yao Zongyou
55092bede1 add timestamp compare support (#7832) 2019-06-25 11:05:37 -07:00
Yao Zongyou
90a3b830f4 fix typo and the string representation of the time.Time value (#7831) 2019-06-25 09:54:14 -07:00
Yao Zongyou
23b9df0694 Fix s3select TRIM function's nil pointer dereference bug (#7817) 2019-06-24 16:59:33 -07:00
Joe Stevens
a19cf063b5 Fixes for multiplatform dev and testing from forks (#7734)
Add support for correct dependency URLs on all platforms

only build mountinfo.go on linux

make testfile path relative to support fork work
2019-06-04 00:59:40 -07:00
kannappanr
5ecac91a55
Replace Minio refs in docs with MinIO and links (#7494) 2019-04-09 11:39:42 -07:00
Aditya Manthramurthy
b1b1d77893 Set S3 Select record message length to 128KiB (#7475)
- Previously this limit was a little more than 1MiB, and it broke
  compatibility with AWS SDK Java causing a buffer overflow error.
2019-04-04 00:41:52 -07:00
Kirill Motkov
3d29ab4059 Rewrite if-else chains to switch statements (#7382) 2019-03-18 07:46:20 -07:00
Harshavardhana
91d85a0d53
Fix stale locks held by SelectParquet API (#7364)
Vendorize upstream parquet-go to fix this issue.
2019-03-13 20:33:18 -07:00
Aditya Manthramurthy
e463386921 Add JSON Path expression evaluation support (#7315)
- Includes support for FROM clause JSON path
2019-03-09 08:13:37 -08:00
Aditya Manthramurthy
f4879ed96d Use jstream to serialize records to JSON format in S3Select (#7318)
- Also, switch to jstream to generate internal record representation
  from CSV/JSON readers

- This fixes a bug in which JSON output objects have their keys
  reversed from the order they are specified in the Select columns.

- Also includes a fix for tests.
2019-03-07 00:20:10 -08:00
Harshavardhana
2520e535a0
Allow lazyQuotes for certain types of CSV (#7278)
Set lazyQuotes to true, to allow a quote to appear
in an unquote field and a non-doubled quote may
appear in a quoted field.
2019-02-24 06:51:02 -08:00
Aditya Manthramurthy
80a351633f Update vendorized bcicen/jstream (#7257)
- Includes an error handling fix that is waiting to be merged upstream
- Uses order-preserving (un)marshalling for JSON objects.
2019-02-20 23:59:23 -08:00
Aditya Manthramurthy
8a405cab2f COUNT() function in select should return an int (#7243) 2019-02-13 16:32:59 -08:00
Harshavardhana
df35d7db9d Introduce staticcheck for stricter builds (#7035) 2019-02-13 18:29:36 +05:30
Aditya Manthramurthy
ee5b3622a5 Evaluate where clause in aggregation queries (#7235) 2019-02-12 13:54:26 -08:00
Harshavardhana
85e939636f Fix JSON parser handling for certain objects (#7162)
This PR also adds some comments and simplifies
the code. Primary handling is done to ensure
that we make sure to honor cached buffer.

Added unit tests as well

Fixes #7141
2019-02-07 08:04:42 +05:30
Aditya Manthramurthy
4aa9ee153b Fix S3 Select request XML parsing (#7202) 2019-02-06 13:25:52 -08:00
Aditya Manthramurthy
fd4e15c116 Flush the records staging buffer periodically (#7193)
- Staging buffer is flushed every 500ms. In cases where the result
  records are slowly generated (e.g. when a where condition
  matches very few records), this change causes the server to send
  results even though the staging buffer is not full.

- Refactor messageWriter code to use simpler channel based
  co-ordination instead of atomic variables.
2019-02-06 16:03:05 +05:30
Aditya Manthramurthy
f04f8bbc78 Add support for Timestamp data type in SQL Select (#7185)
This change adds support for casting strings to Timestamp via CAST:
`CAST('2010T' AS TIMESTAMP)`

It also implements the following date-time functions:
  - UTCNOW()
  - DATE_ADD()
  - DATE_DIFF()
  - EXTRACT()

For values passed to these functions, date-types are automatically
inferred.
2019-02-04 20:54:45 -08:00
Aditya Manthramurthy
91c839ad28 Use a buffer to collect SQL Select result rows (#7158)
Batching records into a single SQL Select message in the response
leads to significant speed up as the message header overhead is made
negligible.

This change leads to a speed up of 3-5x for queries that select many
small records.
2019-01-28 20:00:18 -08:00
Aditya Manthramurthy
2786055df4 Add new SQL parser to support S3 Select syntax (#7102)
- New parser written from scratch, allows easier and complete parsing
  of the full S3 Select SQL syntax. Parser definition is directly
  provided by the AST defined for the SQL grammar.

- Bring support to parse and interpret SQL involving JSON path
  expressions; evaluation of JSON path expressions will be
  subsequently added.

- Bring automatic type inference and conversion for untyped
  values (e.g. CSV data).
2019-01-28 17:59:48 -08:00
Bala FA
e23a42305c Rebase minio/parquet-go and fix null handling. (#7067) 2019-01-16 21:52:04 +05:30
Bala FA
b0deea27df Refactor s3select to support parquet. (#7023)
Also handle pretty formatted JSON documents.
2019-01-08 16:53:04 -08:00
Aditya Manthramurthy
2aeb3fbe86 Fix csv output delimiter bug (#6994) 2018-12-19 11:49:06 +05:30
Harshavardhana
4c7c571875 Support JSON to CSV and CSV to JSON output format conversion (#6910)
This PR implements one of the pending items in issue #6286
in S3 API a user can request CSV output for a JSON document
and a JSON output for a CSV document. This PR refactors
the code a little bit to bring this feature.
2018-12-07 14:55:32 -08:00
Harshavardhana
272b8003d6 Honor header only when requested for use (#6815) 2018-11-16 10:27:48 -08:00
Harshavardhana
7e1661f4fa Performance improvements to SELECT API on certain query operations (#6752)
This improves the performance of certain queries dramatically,
such as 'count(*)' etc.

Without this PR
```
~ time mc select --query "select count(*) from S3Object" myminio/sjm-airlines/star2000.csv.gz
2173762

real	0m42.464s
user	0m0.071s
sys	0m0.010s
```

With this PR
```
~ time mc select --query "select count(*) from S3Object" myminio/sjm-airlines/star2000.csv.gz
2173762

real	0m17.603s
user	0m0.093s
sys	0m0.008s
```

Almost a 250% improvement in performance. This PR avoids a lot of type
conversions and instead relies on raw sequences of data and interprets
them lazily.

```
benchcmp old new
benchmark                        old ns/op       new ns/op       delta
BenchmarkSQLAggregate_100K-4     551213          259782          -52.87%
BenchmarkSQLAggregate_1M-4       6981901985      2432413729      -65.16%
BenchmarkSQLAggregate_2M-4       13511978488     4536903552      -66.42%
BenchmarkSQLAggregate_10M-4      68427084908     23266283336     -66.00%

benchmark                        old allocs     new allocs     delta
BenchmarkSQLAggregate_100K-4     2366           485            -79.50%
BenchmarkSQLAggregate_1M-4       47455492       21462860       -54.77%
BenchmarkSQLAggregate_2M-4       95163637       43110771       -54.70%
BenchmarkSQLAggregate_10M-4      476959550      216906510      -54.52%

benchmark                        old bytes       new bytes      delta
BenchmarkSQLAggregate_100K-4     1233079         1086024        -11.93%
BenchmarkSQLAggregate_1M-4       2607984120      557038536      -78.64%
BenchmarkSQLAggregate_2M-4       5254103616      1128149168     -78.53%
BenchmarkSQLAggregate_10M-4      26443524872     5722715992     -78.36%
```
2018-11-14 15:55:10 -08:00
Harshavardhana
f162d7bd97 Performance improvements by re-using record buffer (#6622)
Avoid unnecessary pointer reference allocations
when not needed, for example

- *SelectFuncs{}
- *Row{}
2018-10-31 08:48:01 +05:30
Ashish Kumar Sinha
c0b4bf0a3e SQL select query for CSV/JSON (#6648)
select * , select column names have been implemented for CSV.
select * is implemented for JSON.
2018-10-22 12:12:22 -07:00
Praveen raj Mani
cef044178c Treat columns with spaces inbetween [s3Select] (#6597)
replace the double/single quotes with backticks for the xwb1989/sqlparser
to recognise such queries.

Fixes #6589
2018-10-17 11:01:26 -07:00
Aditya Manthramurthy
e3eec89d24 Optimize string processing in select (#6593)
Reduce allocations during string concatenation and simplify some
processing code.
2018-10-09 14:02:19 -07:00
Aditya Manthramurthy
16a100b597 Fix out-of-bound array access crash in select processing (#6594)
Fix test case.
2018-10-09 09:45:56 -07:00
Ashish Kumar Sinha
670f9788e3 Count(*) to give integer value (#6564)
The Max, Min functions were giving float value even when they were integers.  
Resolved max and Min to return integers in that scenario.

Fixes #6472
2018-10-04 17:33:53 -07:00
Harshavardhana
a0683d3c1f Send progress only when requested by client in SelectObject (#6467) 2018-09-17 11:52:46 +05:30
Praveen raj Mani
30d4a2cf53 s3select should honour custom record delimiter (#6419)
Allow custom delimiters like `\r\n`, `a`, `\r` etc in input csv and 
replace with `\n`.

Fixes #6403
2018-09-10 21:50:28 +05:30
Raphael Randschau
8601f29d95 select: fix int overflow of math.MaxInt64 on ARM (#6317) 2018-08-22 16:16:04 +05:30
Harshavardhana
5a4a57700b Add select docs and fix return values for Select API (#6300) 2018-08-17 17:11:39 -07:00
Arjun Mishra
7c14cdb60e S3 Select API Support for CSV (#6127)
Add support for trivial where clause cases
2018-08-15 03:30:19 -07:00