1.2.0

Release Notes for 1.2.0

This release brings improvements and bug fixes for the Multistage Engine, Upserts, and Compaction, along with many smaller features and general bug fixes.

Multistage Engine Improvements

Features

New Window Functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE #12878 #13340

  • LEAD allows you to access values after the current row in a frame.

  • LAG allows you to access values before the current row in a frame.

  • FIRST_VALUE and LAST_VALUE return the value from the first and the last row of the frame, respectively.
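The semantics of these four functions can be illustrated outside Pinot. The following is a minimal Python sketch, not Pinot code: the helper names, the `offset` and `default` parameters, and the sample data are all assumptions chosen for illustration, and the list stands in for the ordered rows of a single window.

```python
from typing import List, Optional, Sequence


def lead(rows: Sequence[int], i: int, offset: int = 1,
         default: Optional[int] = None) -> Optional[int]:
    """Value `offset` rows after row i, or `default` past the end."""
    j = i + offset
    return rows[j] if 0 <= j < len(rows) else default


def lag(rows: Sequence[int], i: int, offset: int = 1,
        default: Optional[int] = None) -> Optional[int]:
    """Value `offset` rows before row i, or `default` before the start."""
    j = i - offset
    return rows[j] if 0 <= j < len(rows) else default


def first_value(frame: Sequence[int]) -> int:
    """Value from the first row of the frame."""
    return frame[0]


def last_value(frame: Sequence[int]) -> int:
    """Value from the last row of the frame."""
    return frame[-1]


# Hypothetical ordered rows of one window.
prices: List[int] = [10, 12, 9, 15]

print([lead(prices, i) for i in range(len(prices))])  # [12, 9, 15, None]
print([lag(prices, i) for i in range(len(prices))])   # [None, 10, 12, 9]
print(first_value(prices), last_value(prices))        # 10 15
```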

Support for Logical Database in V2 Engine #12591 #12695

  • V2 Engine now supports a "database" construct, enabling table namespace isolation within the same Pinot cluster.

  • Improves user experience when multiple users are using the same Pinot Cluster.

  • Access control policies can be set at the database level.

  • Database can be selected in a query using a SET statement, such as SET database=my_db;.

Improved Multi-Value (MV) and Array Function Support

Support for WITHIN GROUP Clause and ListAgg #13146

  • WITHIN GROUP Clause can be used to process rows in a given order within a group.

  • One of the most common use-cases for this is the ListAgg function, which when combined with WITHIN GROUP can be used to concatenate strings in a given order.
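To make the ordered-concatenation behavior concrete, here is a small Python sketch of what ListAgg combined with WITHIN GROUP computes conceptually. It is not Pinot's implementation; the function name, its parameters, and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def list_agg(rows: List[Dict[str, Any]], group_key: str, value_key: str,
             order_key: str, separator: str = ",") -> Dict[Any, str]:
    """Concatenate `value_key` per group, ordered by `order_key` within each group."""
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: separator.join(
            str(r[value_key]) for r in sorted(members, key=lambda r: r[order_key])
        )
        for group, members in groups.items()
    }


# Hypothetical page-view events, deliberately out of order.
events = [
    {"user": "a", "page": "cart", "ts": 2},
    {"user": "a", "page": "home", "ts": 1},
    {"user": "b", "page": "home", "ts": 5},
    {"user": "a", "page": "checkout", "ts": 3},
]

print(list_agg(events, "user", "page", "ts", separator=" > "))
# {'a': 'home > cart > checkout', 'b': 'home'}
```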

Scalar/Transform Function and Set Operation Improvements

Improved Literal Handling Support

Metrics Improvements

Notable Improvements and Bug Fixes

Upsert Compaction and Minion Improvements

Features and Improvements

Minion Resource Isolation #12459 #12786

  • Minions now support resource isolation based on an instance tag.

  • Instance tag is configured at table level, and can be set for each task on a table.

  • This enables you to implement arbitrary resource isolation strategies. For example, you can dedicate a set of Minion nodes to running any set of tasks across any set of tables.

Greedy Upsert Compaction Scheduling #12461

  • Upsert compaction now schedules segments for compaction based on the number of invalid docs.

  • This helps the compaction task to handle arbitrary temporal distribution of invalid docs.
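The greedy idea can be sketched in a few lines of Python. This is an illustration only, not the actual task generator: the function name, the `max_tasks` and `min_invalid_docs` parameters, and the segment names are assumptions, and the real task applies its own configured thresholds.

```python
from typing import Dict, List


def pick_segments_for_compaction(invalid_docs: Dict[str, int], max_tasks: int,
                                 min_invalid_docs: int = 1) -> List[str]:
    """Pick up to `max_tasks` segments, preferring those with the most invalid docs."""
    candidates = [
        (count, segment)
        for segment, count in invalid_docs.items()
        if count >= min_invalid_docs
    ]
    # Most invalid docs first; the segment name breaks ties deterministically.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [segment for _, segment in candidates[:max_tasks]]


# Hypothetical invalid-doc counts per segment.
counts = {"seg_0": 120, "seg_1": 0, "seg_2": 9500, "seg_3": 4300}
print(pick_segments_for_compaction(counts, max_tasks=2))  # ['seg_2', 'seg_3']
```

Because selection depends only on the invalid-doc count, old and new segments are treated alike, which is what lets the task cope with invalid docs spread unevenly over time.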

Notable Improvements

Bug Fixes

  • Minions can now handle invalid instance tags in Task Configs gracefully. Prior to this change, Minions would be stuck in IN_PROGRESS state until task timeout. #13092

  • Fixed a bug so that validDocIDsMetadata is returned from all servers. #12431

  • Fixed a bug where upsert compaction did not retain maxLength information and trimmed string fields. #13157

Upsert Improvements

Features and Improvements

Consistent Table View for Upsert Tables #12976

  • Adds different modes of consistency guarantees for Upsert tables.

  • Adds a new UpsertConfig called consistencyMode which can be set to NONE, SYNC, SNAPSHOT.

  • SYNC is optimized for data freshness but can lead to elevated query latencies, so it is best suited to low-QPS use-cases. In this mode, the ingestion threads take a write lock when updating validDocID bitmaps.

  • SNAPSHOT mode can handle high-qps/high-ingestion use-cases by getting the list of valid docs from a snapshot of validDocID. The snapshot can be refreshed every few seconds and the tolerance can be set via a query option upsertViewFreshnessMs.
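As a rough illustration of where these settings live, the sketch below builds a table config fragment and a query string in Python. Only `consistencyMode` and `upsertViewFreshnessMs` come from these notes; the `"mode": "FULL"` entry, the table name, the 5000 ms value, and passing the option through a SET statement are assumptions made for the example.

```python
import json

# Fragment of a table config; everything outside "consistencyMode" is an assumption.
table_config_fragment = {
    "upsertConfig": {
        "mode": "FULL",
        "consistencyMode": "SNAPSHOT",  # one of NONE, SYNC, SNAPSHOT
    }
}

# Hypothetical query tolerating an upsert view that is up to 5 seconds stale.
query = "SET upsertViewFreshnessMs=5000; SELECT COUNT(*) FROM my_upsert_table"

print(json.dumps(table_config_fragment, indent=2))
print(query)
```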

Pluggable Partial Upsert Merger #11983

  • Partial Upsert merges the old record and the new incoming record to generate the final ingested record.

  • Pinot now allows users to customize how this merge of an old row and the new row is computed.

  • This allows a column value in the new row to be an arbitrary function of the old and the new row.
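Conceptually, a custom merger is a function of the old row and the new row. The Python sketch below shows that idea only; it does not mirror Pinot's plugin interface, and the function name, the per-column merger map, and the sample rows are hypothetical.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
ColumnMerger = Callable[[Row, Row], Any]


def merge_rows(old: Row, new: Row, mergers: Dict[str, ColumnMerger]) -> Row:
    """Start from the new row, then recompute selected columns from both rows."""
    merged = dict(new)
    for column, merge_fn in mergers.items():
        merged[column] = merge_fn(old, new)
    return merged


# Hypothetical merge rules: accumulate clicks, keep the old status if the new one is null.
mergers: Dict[str, ColumnMerger] = {
    "clicks": lambda old, new: old["clicks"] + new["clicks"],
    "status": lambda old, new: new["status"] if new["status"] is not None else old["status"],
}

old_row = {"id": 1, "clicks": 10, "status": "active"}
new_row = {"id": 1, "clicks": 3, "status": None}

print(merge_rows(old_row, new_row, mergers))
# {'id': 1, 'clicks': 13, 'status': 'active'}
```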

Support for Uploading Externally Partitioned Segments for Upsert Backfill #13107

  • Segments uploaded for Upsert Backfill can now explicitly specify the Kafka partition they belong to.

  • This enables backfilling an Upsert table where the externally generated segments are partitioned using an arbitrary hash function on an arbitrary primary key.

Misc Improvements and Bug Fixes

  • Fixed a bug in handling equal comparison column values in Upsert, which could lead to data inconsistency. #12395

  • Upsert snapshot will now snapshot only those segments which have updates. #13285

Notable Features

JSON Support Improvements

  • JSON Index can now be used for evaluating Regex and Range Predicates. #12568

  • jsonExtractIndex now supports contextual array filters. #12683 #12531

  • JSON column type now supports filter predicates like =, !=, IN and NOT IN. This is convenient for scenarios where the JSON values are very small. #13283

  • JSON_MATCH now supports exclusive predicates correctly. For instance, you can use predicates such as JSON_MATCH(person, '"$.addresses[*].country" != ''us''') to find all people who have at least one address that is not in the US. #13139

  • jsonExtractIndex supports extracting Multi-Value JSON Fields, and also supports providing any default value when the key doesn't exist. #12748

  • Added isJson UDF, which gives you more options for handling invalid JSON. This can be used in queries and for filtering invalid JSON column values in ingestion. #12603

  • Fixed an ArrayIndexOutOfBoundsException in jsonExtractIndex. #13479
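The JSON_MATCH exclusive predicate described in the list above reads as "at least one array element differs", not "no element equals". The following Python sketch shows that interpretation on in-memory data. It is not how Pinot evaluates the predicate; the helper name and sample records are hypothetical, and skipping addresses that lack a country key is an assumption.

```python
from typing import Any, Dict, List


def has_address_outside(person: Dict[str, Any], country: str) -> bool:
    """True if at least one address has a country that differs from `country`."""
    return any(
        "country" in address and address["country"] != country
        for address in person.get("addresses", [])
    )


# Hypothetical person records.
people: List[Dict[str, Any]] = [
    {"name": "ann", "addresses": [{"country": "us"}, {"country": "ca"}]},
    {"name": "bob", "addresses": [{"country": "us"}]},
    {"name": "cy", "addresses": []},
]

print([p["name"] for p in people if has_address_outside(p, "us")])  # ['ann']
```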

Lucene and Text Search Improvements

  • Improved Segment Build Time for Lucene Text Index by 40-60%. This improvement is realized when a consuming segment commits and changes to an ImmutableSegment. This significantly helps in lowering ingestion lag at commit time due to a large text index. #12744 #13094 #13050

  • Phrase Search can run 3x faster when the Lucene Index Config enablePrefixSuffixMatchingInPhraseQueries is set to true. This is achieved by rewriting the phrase search query into a wildcard and prefix matching query. #12680

  • Fixed a bug in TextMatchFilterOptimizer that was not applying precedence to the filter expressions properly, which could lead to incorrect results. #13009

  • Fixed a bug in handling NOT text_match which could have returned incorrect results. #12372

  • Added SchemaConformingTransformerV2 to enhance text search abilities. #12788

  • Added metrics to track Lucene NRT Refresh Delay. #13307

  • Switched to NRTCachingDirectory for Realtime segments and prevented duplicates in the Realtime Lucene Index to avoid IndexOutOfBounds query time exceptions. #13308

  • Lucene Version is upgraded to 9.11.1. #13505

  • Added funnelMaxStep function, which can be used to calculate max funnel steps for a given sliding window.

  • Added funnelCompleteCount to calculate the number of completed funnels, and funnelMatchStep to get the funnel match array.

Support for Interning for OnHeapByteDictionary #12342

  • This can reduce the heap usage of a dictionary encoded byte column, for a certain distribution of duplicate values. See #12223 for details.

Column Major Builder On By Default for New Tables #12770

  • Prior to this feature, on a segment commit, Pinot would convert all the columnar data from the Mutable Segment to row-major, and then re-build column major Immutable Segments.

  • This feature skips the row-major conversion and is expected to be both space and time efficient.

  • It can help lower ingestion lag from segment commits, especially helpful when your segments are large.

Support for SQL Formatting in Query Editor #11725

  • You can now prettify SQL right in the Controller UI!

Hash Function for UUID Primary Keys #12538

  • Added a new lossless hash-function for Upsert Primary Keys optimized for UUIDs.

  • The hash function can reduce Old Gen heap usage by up to 30%.

  • It maps a UUID to a 16-byte array, whereas encoding it as a UTF-8 string would take 36 bytes.
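The size difference is easy to verify with Python's standard `uuid` module. This sketch only demonstrates the 36-byte versus 16-byte representations and that the mapping is reversible; it says nothing about how Pinot implements the hash function, and the sample key is arbitrary.

```python
import uuid

# Arbitrary sample primary key in canonical UUID text form.
key = "123e4567-e89b-12d3-a456-426614174000"

as_text = key.encode("utf-8")     # 32 hex digits + 4 hyphens
as_bytes = uuid.UUID(key).bytes   # raw 128-bit value

print(len(as_text), len(as_bytes))  # 36 16

# The mapping is lossless: the canonical string can be rebuilt from the bytes.
restored = str(uuid.UUID(bytes=as_bytes))
print(restored == key)  # True
```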

Column Level Index Skip Query Option #12414

  • Convenient for debugging impact of indexes on query performance or results.

  • You can add the skipIndexes option to your query to skip any number of indexes. e.g. SET skipIndexes=inverted,range;

New UDFs and Scalar Functions

  • New GeoHash functions: encodeGeoHash, decodeGeoHash, decodeGeoHashLatitude and decodeGeoHashLongitude.

  • dateBin can be used to align a timestamp to the nearest time bucket.

  • prefixes, suffixes and uniqueNgrams UDFs for generating all respective string subsequences from a string input. #12392

  • Added isJson UDF, which gives you more options for handling invalid JSON. This can be used in queries and for filtering invalid JSON column values in ingestion. #12603

  • splitPart UDF has minor improvements. #12437
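For readers unfamiliar with the three string UDFs named above, the Python sketch below shows the kind of output they produce. The signatures, the `max_length` and `length` parameters, and the output ordering are assumptions for illustration and may differ from the Pinot functions.

```python
from typing import List


def prefixes(s: str, max_length: int) -> List[str]:
    """All prefixes of s with 1..max_length characters."""
    return [s[:n] for n in range(1, min(max_length, len(s)) + 1)]


def suffixes(s: str, max_length: int) -> List[str]:
    """All suffixes of s with 1..max_length characters."""
    return [s[-n:] for n in range(1, min(max_length, len(s)) + 1)]


def unique_ngrams(s: str, length: int) -> List[str]:
    """Distinct character n-grams of the given length, in first-seen order."""
    grams = (s[i:i + length] for i in range(len(s) - length + 1))
    return list(dict.fromkeys(grams))


print(prefixes("pinot", 3))        # ['p', 'pi', 'pin']
print(suffixes("pinot", 3))        # ['t', 'ot', 'not']
print(unique_ngrams("banana", 2))  # ['ba', 'an', 'na']
```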

CLP Compression Codec in Forward Indexes #12504

  • CLP is a compressed log processor that achieves a very high compression ratio for certain log types.

  • To enable this, you can set the compressionCodec in the fieldConfigList of the column you want to target.
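A field config entry for this might look like the fragment built below. The column name `message` is hypothetical, and the `"encodingType": "RAW"` entry and the `"CLP"` codec value are assumptions about the surrounding table config rather than details stated in these notes.

```python
import json

# Hypothetical field config for a raw log-message column.
field_config = {
    "name": "message",
    "encodingType": "RAW",      # assumption: CLP applies to a raw forward index
    "compressionCodec": "CLP",  # assumption: codec value as named in this section
}

table_config_fragment = {"fieldConfigList": [field_config]}

print(json.dumps(table_config_fragment, indent=2))
```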

Misc. Improvements

Bug Fixes
