0.9.0

Summary

This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin adds support for Dropwizard. As usual, the release also brings new functionality and many UI and performance improvements.

The release was cut from commit 13c9ee9, with the following cherry-picks: 668b5e0, ee887b9.

Support Segment Merge and Roll-up

LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This led to slow operations in Helix/ZooKeeper, long-running queries because there were too many tasks to process, and higher space usage because of a lack of compression.

To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.

At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.
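
To make the roll-up idea concrete, here is a minimal, self-contained sketch of what aggregating older data by time bucket means. It is an illustration only, not Pinot's implementation: the function name `roll_up`, the row layout `(timestamp_ms, dimension, metric)`, and the choice of `sum` as the aggregation are all assumptions made for this example.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000  # one-day bucket, chosen for illustration


def roll_up(rows, bucket_ms=DAY_MS):
    """Sum the metric of rows that share a time bucket and dimension value.

    `rows` is an iterable of (timestamp_ms, dimension, metric) tuples.
    This row layout is hypothetical and is not Pinot's segment format.
    """
    totals = defaultdict(int)
    for timestamp_ms, dimension, metric in rows:
        # Truncate the timestamp to the start of its bucket.
        bucket_start = timestamp_ms - timestamp_ms % bucket_ms
        totals[(bucket_start, dimension)] += metric
    return sorted((bucket, dim, total) for (bucket, dim), total in totals.items())


rows = [
    (1_000, "US", 3),
    (2_000, "US", 4),
    (3_000, "EU", 5),
    (DAY_MS + 1_000, "US", 7),
]
rolled = roll_up(rows)
for record in rolled:
    print(record)
```

Four input rows collapse into three: the two "US" rows in the first day are merged into one record, which is the space and query-time saving that roll-up is after.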

Major Changes:

  • Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor (#7180)

  • Merge/Rollup task scheduler for offline tables. (#7178)

  • Fix MergeRollupTask uploading segments not updating their metadata (#7289)

  • MergeRollupTask integration tests (#7283)

  • Add mergeRollupTask delay metrics (#7368)

  • MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling (#7481)

  • Use maxEndTimeMs for merge/roll-up delay metrics. (#7617)

UI Improvements

This release also sees improvements to Pinot’s query console UI.

  • Cmd+Enter shortcut to run query in query console (#7359)

  • Showing tooltip in SQL Editor (#7387)

  • Make the SQL Editor box expandable (#7381)

  • Fix tables ordering by number of segments (#7564)

SQL Improvements

There have also been improvements and additions to Pinot’s SQL implementation.

New functions:

  • IN (#7542)

  • LASTWITHTIME (#7584)

  • ID_SET on MV columns (#7355)

  • Raw results for Percentile TDigest and Est (#7226)

  • Add timezone as argument in function toDateTime (#7552)
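
As a rough illustration of the LASTWITHTIME aggregation listed above, the sketch below returns the data value from the row with the greatest time value. It models the semantics only: the helper name `last_with_time`, the dictionary row layout, the `None` result for empty input, and the tie-breaking behavior (the first row with the maximum time wins) are assumptions of this example, not statements about Pinot's behavior.

```python
def last_with_time(rows, data_key, time_key):
    """Return rows[i][data_key] for the row with the largest rows[i][time_key].

    Returns None for empty input (an assumption made for this sketch).
    """
    if not rows:
        return None
    latest = max(rows, key=lambda row: row[time_key])
    return latest[data_key]


readings = [
    {"sensor": "a", "value": 10, "ts": 100},
    {"sensor": "a", "value": 12, "ts": 300},
    {"sensor": "a", "value": 11, "ts": 200},
]
latest_value = last_with_time(readings, "value", "ts")
print(latest_value)
```

Note that the result depends on the time column, not on row order: the second row wins here because it has the largest `ts`.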

New predicates are supported:

Query compatibility improvements:

  • Infer data type for Literal (#7332)

  • Support logical identifier in predicate (#7347)

  • Support JSON queries with top-level array path expression. (#7511)

  • Support configurable group by trim size to improve results accuracy (#7241)

Performance Improvements

This release contains many performance improvements, which you may notice in your day-to-day queries. Thanks to everyone who made the contributions listed below:

  • Reduce the disk usage for segment conversion task (#7193)

  • Simplify association between Java Class and PinotDataType for faster mapping (#7402)

  • Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation (#7412)

  • Replace MINUS with STRCMP (#7394)

  • Bit-sliced range index for int, long, float, double, dictionarized SV columns (#7454)

  • Use MethodHandle to access vectorized unsigned comparison on JDK9+ (#7487)

  • Add option to limit thread usage per query (#7492)

  • Improved range queries (#7513)

  • Faster bitmap scans (#7530)

  • Optimize EmptySegmentPruner to skip pruning when there are no empty segments (#7531)

  • Map bitmaps through a bounded window to avoid excessive disk pressure (#7535)

  • Allow RLE compression of bitmaps for smaller file sizes (#7582)

  • Support raw index properties for columns with JSON and RANGE indexes (#7615)

  • Enhance BloomFilter rule to include IN predicate (#7444) (#7624)

  • Introduce LZ4_WITH_LENGTH chunk compression type (#7655)

  • Enhance ColumnValueSegmentPruner and support bloom filter prefetch (#7654)

  • Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func (#7630)

  • During segment pruning, release the bloom filter after each segment is processed (#7668)

  • Fix JSONPath cache inefficiency issue (#7409)

  • Optimize getUnpaddedString with SWAR padding search (#7708)

  • Lighter weight LiteralTransformFunction, avoid excessive array fills (#7707)

  • Inline binary comparison ops to prevent function call overhead (#7709)

  • Memoize literals in query context in order to deduplicate them (#7720)
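
For readers unfamiliar with the bit-sliced range index mentioned in the list above, the following is a minimal sketch of the general technique: keep one bitmap per bit position of the column values, then answer `value <= threshold` with bitwise operations alone. It is a textbook-style illustration that uses Python integers as bitmaps and assumes non-negative integer values; the function names are hypothetical, and it does not reflect how Pinot implements or stores its index.

```python
def build_bit_slices(values, num_bits):
    """Return one bitmap per bit position.

    Bit r of slices[i] is set when row r's value has bit i set.
    Values are assumed to be non-negative and to fit in num_bits.
    """
    slices = [0] * num_bits
    for row, value in enumerate(values):
        for bit in range(num_bits):
            if (value >> bit) & 1:
                slices[bit] |= 1 << row
    return slices


def rows_less_or_equal(slices, num_rows, threshold):
    """Return a bitmap of rows whose value is <= threshold."""
    all_rows = (1 << num_rows) - 1
    if threshold < 0:
        return 0
    if threshold >= 1 << len(slices):
        return all_rows
    less_than = 0      # rows already known to be strictly smaller
    equal = all_rows   # rows that match the threshold on all bits seen so far
    for bit in reversed(range(len(slices))):
        if (threshold >> bit) & 1:
            # Rows with a 0 where the threshold has a 1 are strictly smaller.
            less_than |= equal & ~slices[bit]
            equal &= slices[bit]
        else:
            # Rows with a 1 where the threshold has a 0 are larger; drop them.
            equal &= ~slices[bit]
    return less_than | equal


def bitmap_to_rows(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


values = [5, 1, 7, 3, 0, 6]
slices = build_bit_slices(values, num_bits=3)
matching = bitmap_to_rows(rows_less_or_equal(slices, len(values), threshold=3))
print(matching)
```

The work per query is proportional to the number of bit slices rather than the number of distinct values, which is why this layout suits range predicates on numeric columns.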

Other Notable New Features and Changes

  • Human Readable Controller Configs (#7173)

  • Add the support of geoToH3 function (#7182)

  • Add Apache Pulsar as Pinot Plugin (#7223) (#7247)

  • Add dropwizard metrics plugin (#7263)

  • Introduce OR Predicate Execution On Star Tree Index (#7184)

  • Allow to extract values from array of objects with jsonPathArray (#7208)

  • Add Realtime table metadata and indexes API. (#7169)

  • Support array with mixing data types (#7234)

  • Support force download segment in reload API (#7249)

  • Show uncompressed znRecord from zk api (#7304)

  • Add debug endpoint to get minion task status. (#7300)

  • Validate CSV Header For Configured Delimiter (#7237)

  • Add auth tokens and user/password support to ingestion job command (#7233)

  • Add option to store the hash of the upsert primary key (#7246)

  • Add null support for time column (#7269)

  • Add mode aggregation function (#7318)

  • Support disable swagger in Pinot servers (#7341)

  • Delete metadata properly on table deletion (#7329)

  • Add basic Obfuscator Support (#7407)

  • Add AWS sts dependency to enable auth using web identity token. (#7017) (#7445)

  • Mask credentials in debug endpoint /appconfigs (#7452)

  • Fix /sql query endpoint to be compatible with auth (#7230)

  • Fix case sensitive issue in BasicAuthPrincipal permission check (#7354)

  • Fix auth token injection in SegmentGenerationAndPushTaskExecutor (#7464)

  • Add segmentNameGeneratorType config to IndexingConfig (#7346)

  • Support trigger PeriodicTask manually (#7174)

  • Add endpoint to check minion task status for a single task. (#7353)

  • Showing partial status of segment and counting CONSUMING state as good segment status (#7327)

  • Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule (#7282)

  • Check schema backward-compatibility when updating schema through addSchema with override (#7374)

  • Optimize IndexedTable (#7373)

  • Support indices remove in V3 segment format (#7301)

  • Optimize TableResizer (#7392)

  • Introduce resultSize in IndexedTable (#7420)

  • Offset based real-time consumption status checker (#7267)

  • Add causes to stack trace return (#7460)

  • Create controller resource packages config key (#7488)

  • Enhance TableCache to support schema name different from table name (#7525)

  • Add validation for realtimeToOffline task (#7523)

  • Unify CombineOperator multi-threading logic (#7450)

  • Support no downtime rebalance for table with 1 replica in TableRebalancer (#7532)

  • Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. (#7516)

  • Adjust tuner api (#7553)

  • Adding config for metrics library (#7551)

  • Add geo type conversion scalar functions (#7573)

  • Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types (#7581)

  • Add MV raw forward index and MV BYTES data type (#7595)

  • Enhance TableRebalancer to offload the segments from most loaded instances first (#7574)

  • Improve get tenant API to differentiate offline and real-time tenants (#7548)

  • Refactor query rewriter to interfaces and implementations to allow customization (#7576)

  • In ServiceStartable, apply global cluster config in ZK to instance config (#7593)

  • Make dimension tables creation bypass tenant validation (#7559)

  • Allow Metadata and Dictionary Based Plans for No Op Filters (#7563)

  • Reject query with identifiers not in schema (#7590)

  • Round Robin IP addresses when retry uploading/downloading segments (#7585)

  • Support multi-value derived column in offline table reload (#7632)

  • Support segmentNamePostfix in segment name (#7646)

  • Add select segments API (#7651)

  • Controller getTableInstance() call now returns the list of live brokers of a table. (#7556)

  • Allow MV Field Support For Raw Columns in Text Indices (#7638)

  • Allow override distinctCount to segmentPartitionedDistinctCount (#7664)

  • Add a quick start with both UPSERT and JSON index (#7669)

  • Add revertSegmentReplacement API (#7662)

  • Smooth segment reloading with non blocking semantic (#7675)

  • Clear the reused record in PartitionUpsertMetadataManager (#7676)

  • Replace args4j with picocli (#7665)

  • Handle datetime column consistently (#7645) (#7705)

  • Allow to carry headers with query requests (#7696) (#7712)

  • Allow adding JSON data type for dimension column types (#7718)

  • Separate SegmentDirectoryLoader and tierBackend concepts (#7737)

  • Implement size balanced V4 raw chunk format (#7661)

  • Add presto-pinot-driver lib (#7384)

Major Bug Fixes

  • Fix null pointer exception for non-existent metric columns in schema for JDBC driver (#7175)

  • Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD (#7198)

  • Fixed pinot java client to add zkClient close (#7196)

  • Ignore query json parse errors (#7165)

  • Fix shutdown hook for PinotServiceManager (#7251) (#7253)

  • Make STRING to BOOLEAN data type change as backward compatible schema change (#7259)

  • Replace gcp hardcoded values with generic annotations (#6985)

  • Fix segment conversion executor for in-place conversion (#7265)

  • Fix reporting consuming rate when the Kafka partition level consumer isn't stopped (#7322)

  • Fix the issue with concurrent modification for segment lineage (#7343)

  • Fix TableNotFound error message in PinotHelixResourceManager (#7340)

  • Fix upload LLC segment endpoint truncated download URL (#7361)

  • Fix task scheduling on table update (#7362)

  • Fix metric method for ONLINE_MINION_INSTANCES metric (#7363)

  • Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema (#7366)

  • Fix currentOffset volatility in consuming segment (#7365)

  • Fix misleading error msg for missing URI (#7367)

  • Fix the correctness of getColumnIndices method (#7370)

  • Fix SegmentZKMetadata time handling (#7375)

  • Fix retention for cleaning up segment lineage (#7424)

  • Fix segment generator to not return illegal filenames (#7085)

  • Fix missing LLC segments in segment store by adding controller periodic task to upload them (#6778)

  • Fix parsing error messages returned to FileUploadDownloadClient (#7428)

  • Fix manifest scan which drives /version endpoint (#7456)

  • Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection (#7470)

  • Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks (#7427)

  • Fix pql double quote checker exception (#7485)

  • Fix minion metrics exporter config (#7496)

  • Fix segment retry failure by catching timeout exception during segment replace (#7509)

  • Add Exception to Broker Response When Not All Segments Are Available (Partial Response) (#7397)

  • Fix segment generation commands (#7527)

  • Return non zero from main with exception (#7482)

  • Fix parquet plugin shading error (#7570)

  • Fix the case where the lowest partition id is not 0 for LLC (#7066)

  • Fix star-tree index map when column name contains '.' (#7623)

  • Fix cluster manager URLs encoding issue (#7639)

  • Fix fieldConfig nullable validation (#7648)

  • Fix verifyHostname issue in FileUploadDownloadClient (#7703)

  • Fix TableCache schema to include the built-in virtual columns (#7706)

  • Fix DISTINCT with AS function (#7678)

  • Fix SDF pattern in DataPreprocessingHelper (#7721)

  • Fix fields missing issue in the source in ParquetNativeRecordReader (#7742)