0.4.0

0.4.0 release introduced the theta-sketch based distinct count function, an S3 filesystem plugin, a unified star-tree index implementation, migration from TimeFieldSpec to DateTimeFieldSpec, etc.

Summary

0.4.0 release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.

Notable New Features

  • Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)

    • Used time column from table config instead of schema (#5320)

    • Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392

    • Used DATE_TIME as the primary time column for Pinot tables (#5399)

  • Supported range queries using indexes (#5240)

  • Supported complex aggregation functions

    • Supported Aggregation functions with multiple arguments (#5261)

    • Added api in AggregationFunction to get compiled input expressions (#5339)

  • Added a simple PinotFS benchmark driver (#5160)

  • Supported default star-tree (#5147)

  • Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)

    • One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregatinoFunction (#5382)

  • Added access control for Pinot server segment download api (#5260)

  • Added Pinot S3 Filesystem Plugin (#5249)

  • Text search improvement

    • Pruned stop words for text index (#5297)

    • Used 8byte offsets in chunk based raw index creator (#5285)

    • Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)

    • Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)

    • Made text index query cache a configurable option (#5176)

    • Added Lucene DocId to PinotDocId cache to improve performance (#5177)

    • Removed the construction of second bitmap in text index reader to improve performance (#5199)

  • Tooling/usability improvement

    • Added template support for Pinot Ingestion Job Spec (#5341)

    • Allowed user to specify zk data dir and don't do clean up during zk shutdown (#5295)

    • Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)

    • Update JVM settings for scripts (#5127)

    • Added Stream github events demo (#5189)

    • Moved docs link from gitbook to docs.pinot.apache.org (#5193)

  • Re-implemented ORCRecordReader (#5267)

  • Evaluated schema transform expressions during ingestion (#5238)

  • Handled count distinct query in selection list (#5223)

  • Enabled async processing in pinot broker query api (#5229)

  • Supported bootstrap mode for table rebalance (#5224)

  • Supported order-by on BYTES column (#5213)

  • Added Nightly publish to binary (#5190)

  • Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)

  • Supported inbuilt transform functions (#5312)

    • Added date time transform functions (#5326)

  • Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)

  • APIs Additions/Changes

    • Added a new server api for download of segments

      • /GET /segments/{tableNameWithType}/{segmentName}

  • Upgraded helix to 0.9.7 (#5411)

  • Added support to execute functions during query compilation (#5406)

  • Other notable refactoring

    • Moved table config into pinot-spi (#5194)

    • Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)

    • Added jsonExtractScalar function to extract field from json object (#4597)

    • Added template support for Pinot Ingestion Job Spec #5372

    • Cleaned up AggregationFunctionContext (#5364)

    • Optimized real-time range predicate when cardinality is high (#5331)

    • Made PinotOutputFormat use table config and schema to create segments (#5350)

    • Tracked unavailable segments in InstanceSelector (#5337)

    • Added a new best effort segment uploader with bounded upload time (#5314)

    • In SegmentPurger, used table config to generate the segment (#5325)

    • Decoupled schema from RecordReader and StreamMessageDecoder (#5309)

    • Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)

    • Improved GroupBy query performance (#5291)

    • Optimized ExpressionFilterOperator (#5132)

Major Bug Fixes

  • Do not release the PinotDataBuffer when closing the index (#5400)

  • Handled a no-arg function in query parsing and expression tree (#5375)

  • Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)

  • Fixed missing error message from pinot-admin command (#5305)

  • Fixed HDFS copy logic (#5218)

  • Fixed spark ingestion issue (#5216)

  • Fixed the capacity of the DistinctTable (#5204)

  • Fixed various links in the Pinot website

Work in Progress

  • Upsert: support overriding data in the real-time table (#4261).

    • Add pinot upsert features to pinot common (#5175)

  • Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc

Backward Incompatible Changes

  • TableConfig no longer support de-serialization from json string of nested json string (i.e. no \" inside the json) (#5194)

  • The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):

    void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
    void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
    void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);