Schema
Schema is used to define the names, data types and other information for the columns of a Pinot table.
Types of columns
Columns in a Pinot table can be broadly categorized into three categories:

| Column Category | Description |
| --- | --- |
| Dimension | Dimension columns are typically used in slice-and-dice operations for answering business queries. Frequent operations done on dimension columns: GROUP BY, filter clauses such as WHERE |
| Metric | These columns represent quantitative data of the table. Such columns are frequently used in aggregation operations. In data warehouse terminology, these are also referred to as fact or measure columns. Frequent operations done on metric columns: aggregation (SUM, MIN, MAX, COUNT, AVG etc.), filter clauses such as WHERE |
| DateTime | This column represents time columns in the data. There can be multiple time columns in a table, but only one of them is the primary time column. The primary time column is the one that is set in the segmentsConfig. This primary time column is used by Pinot for maintaining the time boundary between offline and realtime data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND, and optional if the push type is REFRESH. Common operations done on time columns: GROUP BY, filter clauses such as WHERE |
| Time (deprecated) | This has been deprecated; use the DateTime column type for time columns instead. This column represents a timestamp, and there can be at most one time column in a table. The time column is used internally by Pinot in the same way as the primary DateTime column described above, for maintaining the time boundary between offline and realtime data in a hybrid table and for retention management, and is mandatory under the same push-type condition. |
Schema format
A Pinot schema is written in JSON format. Here's an example which shows all the fields of a schema
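The following is a minimal sketch of such a schema rather than an exhaustive one; the schema and column names (flights, airline, and so on) are illustrative:

```json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "airline",
      "dataType": "STRING"
    },
    {
      "name": "origin",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "15:MINUTES"
    }
  ]
}
```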
The Pinot schema is composed of
| schema fields | description |
| --- | --- |
| schemaName | Defines the name of the schema. This is usually the same as the table name. The offline and the realtime table of a hybrid table should use the same schema. |
| dimensionFieldSpecs | A dimensionFieldSpec is defined for each dimension column. For more details, scroll down to dimensionFieldSpec. |
| metricFieldSpecs | A metricFieldSpec is defined for each metric column. For more details, scroll down to metricFieldSpec. |
| dateTimeFieldSpecs | A dateTimeFieldSpec is defined for each time column. There can be multiple time columns. For more details, scroll down to dateTimeFieldSpec. |
| timeFieldSpec | Deprecated. Use dateTimeFieldSpecs instead. A timeFieldSpec is defined for the time column. There can only be one time column. For more details, scroll down to timeFieldSpec. |
Below is a detailed description of each type of field spec.
dimensionFieldSpecs
A dimensionFieldSpec is defined for each dimension column. Here's a list of the fields in the dimensionFieldSpec
| field | description |
| --- | --- |
| name | Name of the dimension column |
| dataType | Data type of the dimension column. Can be STRING, BOOLEAN, INT, LONG, DOUBLE, FLOAT, BYTES |
| defaultNullValue | Represents null values in the data, since Pinot doesn't support storing null column values natively (as part of its on-disk storage format). If not specified, an internal default null value is used, as listed here |
| singleValueField | Boolean indicating if this is a single-value or a multi-value column. Defaults to true; set it to false for multi-value columns. |
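As a sketch, the following specs define one single-value and one multi-value dimension (the column names are hypothetical):

```json
"dimensionFieldSpecs": [
  {
    "name": "country",
    "dataType": "STRING"
  },
  {
    "name": "tags",
    "dataType": "STRING",
    "singleValueField": false,
    "defaultNullValue": "unknown"
  }
]
```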
Internal default null values for dimension
| Data Type | Internal Default Null Value |
| --- | --- |
| INT | Integer.MIN_VALUE |
| LONG | Long.MIN_VALUE |
| FLOAT | Float.NEGATIVE_INFINITY |
| DOUBLE | Double.NEGATIVE_INFINITY |
| STRING | "null" |
| BYTES | byte array of length 0 |
metricFieldSpecs
A metricFieldSpec is defined for each metric column. Here's a list of fields in the metricFieldSpec
| field | description |
| --- | --- |
| name | Name of the metric column |
| dataType | Data type of the column. Can be INT, LONG, DOUBLE, FLOAT, BYTES (for specialized representations such as HLL, TDigest, etc., where the column stores a byte-serialized version of the value) |
| defaultNullValue | Represents null values in the data. If not specified, an internal default null value is used, as listed here. |
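For example, a sketch of a metric spec (the column name is hypothetical):

```json
"metricFieldSpecs": [
  {
    "name": "clicks",
    "dataType": "LONG",
    "defaultNullValue": 0
  }
]
```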
Internal default null values for metric
| Data Type | Internal Default Null Value |
| --- | --- |
| INT | 0 |
| LONG | 0 |
| FLOAT | 0.0 |
| DOUBLE | 0.0 |
| STRING | "null" |
| BYTES | byte array of length 0 |
dateTimeFieldSpec
A dateTimeFieldSpec is used to define time columns of the table. Here's a list of the fields in a dateTimeFieldSpec
| field | description |
| --- | --- |
| name | Name of the date time column |
| dataType | Data type of the date time column. Can be STRING, INT, LONG |
| format | The format of the time column. The syntax of the format is `timeSize:timeUnit:timeFormat`, where timeFormat can be either EPOCH or SIMPLE_DATE_FORMAT. If it is SIMPLE_DATE_FORMAT, the pattern string is also appended, as `timeSize:timeUnit:SIMPLE_DATE_FORMAT:pattern`. For example: `1:MILLISECONDS:EPOCH` - epoch millis; `1:HOURS:EPOCH` - epoch hours; `1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd` - a date specified like `20200501`; `1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy` - a date specified like `Fri May 01 00:00:00 UTC 2020` |
| granularity | The granularity in which the column is bucketed. The syntax of granularity is `bucketSize:bucketUnit`, for example `15:MINUTES` |
| defaultNullValue | Represents null values in the data. If not specified, an internal default null value is used, as listed here. The values are the same as those used for dimensionFieldSpec. |
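For example, a sketch of a dateTimeFieldSpec for an epoch-millis column bucketed to 15 minutes (the column name is hypothetical):

```json
"dateTimeFieldSpecs": [
  {
    "name": "millisSinceEpoch",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "15:MINUTES"
  }
]
```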
timeFieldSpec
This has been deprecated. Older schemas containing timeFieldSpec will be supported. But for new schemas, use DateTimeFieldSpec instead.
A timeFieldSpec is defined for the time column and is composed of an incomingGranularitySpec and an outgoingGranularitySpec, which together can be used to transform the time column from the incoming format to the outgoing format. If both are specified, the segment creation process will convert the time column from the incoming format to the outgoing format. If no time column transformation is required, you can specify just the incomingGranularitySpec.
| timeFieldSpec fields | Description |
| --- | --- |
| incomingGranularitySpec | Details of the time column in the incoming data |
| outgoingGranularitySpec | Details of the format to which the time column should be converted for use in Pinot |
The incoming and outgoing granularitySpec are defined as:
| field | description |
| --- | --- |
| name | Name of the time column. In the incomingGranularitySpec, this is the name of the time column in the incoming data. In the outgoingGranularitySpec, this is the name of the column you wish to transform it to and see in Pinot. |
| dataType | Data type of the time column. Can be INT, LONG or STRING |
| timeType | Indicates the time unit. Can be one of DAYS, SECONDS, HOURS, MILLISECONDS, MICROSECONDS and NANOSECONDS |
| timeUnitSize | Indicates the bucket length. 1 by default. For example, a timeUnitSize of 5 with timeType MINUTES means the time value is rounded to 5-minute buckets (e.g. a fiveMinutesSinceEpoch column). |
| timeFormat | EPOCH (millisSinceEpoch, hoursSinceEpoch etc.) or SIMPLE_DATE_FORMAT (yyyyMMdd, yyyyMMdd:hhssmm etc.) |
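As a sketch, a deprecated timeFieldSpec that converts an incoming epoch-millis column into 5-minute buckets might look like this (both column names are hypothetical):

```json
"timeFieldSpec": {
  "incomingGranularitySpec": {
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "timeType": "MILLISECONDS",
    "timeUnitSize": 1,
    "timeFormat": "EPOCH"
  },
  "outgoingGranularitySpec": {
    "name": "fiveMinutesSinceEpoch",
    "dataType": "LONG",
    "timeType": "MINUTES",
    "timeUnitSize": 5,
    "timeFormat": "EPOCH"
  }
}
```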
Advanced fields
Apart from these, there are some advanced fields. These are common to all field specs.

| field name | description |
| --- | --- |
| maxLength | Max length of this column |
| transformFunction | Transform function to generate this column. See the section below. |
| virtualColumnProvider | Column value provider |
Ingestion Transform Functions
Transform functions can be defined on columns in the schema.

Currently, there is support for two kinds of functions:

- Groovy functions
- Inbuilt functions
Note
Currently, the arguments must be from the source data. They cannot be columns from the Pinot schema which have been created through transformations.
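As a sketch, a transform function is attached to a field spec via the transformFunction field; the column names below are hypothetical:

```json
"dimensionFieldSpecs": [
  {
    "name": "fullName",
    "dataType": "STRING",
    "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
  }
]
```

Here, fullName is generated at ingestion time from the source columns firstName and lastName.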
Groovy functions
Groovy functions can be defined using the syntax `Groovy({groovy script}, argument1, argument2...argumentN)`.
Here are some examples of commonly needed functions. Any valid Groovy expression can be used.

String concatenation
Concat firstName and lastName to get fullName.

Find an element in an array
Find the max value in the array bids.

Time transformation
Convert timestamp from MILLISECONDS to HOURS.

Column name change
Simply change the name of the column from user_id to userId.

Ternary operation
If eventType is IMPRESSION, set impression to 1. Similarly for CLICK.

AVRO Map
Store an AVRO Map in Pinot as two multi-value columns: 1) the keys of the map as map_keys, and 2) the values of the map as map_values. Sort the keys to maintain the mapping between them.
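Sketches of the first few of these transforms as transformFunction values; the column names come from the examples above, and the exact Groovy expressions are illustrative:

```json
"transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
"transformFunction": "Groovy({bids.max()}, bids)"
"transformFunction": "Groovy({timestamp / (1000 * 60 * 60)}, timestamp)"
"transformFunction": "Groovy({user_id}, user_id)"
"transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1 : 0}, eventType)"
```

Each line would appear inside the corresponding field spec, as shown in the schema examples earlier.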
Inbuilt Pinot functions
We have several inbuilt functions that can be used directly as ingestion transform functions.
DateTime functions
These are functions which enable commonly needed time transformations.
toEpochXXX
Converts from epoch milliseconds to a higher granularity.
| Function name | Description |
| --- | --- |
| toEpochSeconds | Converts epoch millis to epoch seconds. Usage: `toEpochSeconds(millis)` |
| toEpochMinutes | Converts epoch millis to epoch minutes. Usage: `toEpochMinutes(millis)` |
| toEpochHours | Converts epoch millis to epoch hours. Usage: `toEpochHours(millis)` |
| toEpochDays | Converts epoch millis to epoch days. Usage: `toEpochDays(millis)` |
toEpochXXXRounded
Converts from epoch milliseconds to another granularity, rounding to the nearest rounding bucket. For example, `1588469352000` (2020-05-03 01:29:12 UTC) is `26474489` minutesSinceEpoch. Rounding to buckets of 10 minutes, `toEpochMinutesRounded(1588469352000, 10) = 26474480` (2020-05-03 01:20:00 UTC).
| Function Name | Description |
| --- | --- |
| toEpochSecondsRounded | Converts epoch millis to epoch seconds, rounding to the nearest rounding bucket. Usage: `toEpochSecondsRounded(millis, bucketSize)` |
| toEpochMinutesRounded | Converts epoch millis to epoch minutes, rounding to the nearest rounding bucket. Usage: `toEpochMinutesRounded(millis, bucketSize)` |
| toEpochHoursRounded | Converts epoch millis to epoch hours, rounding to the nearest rounding bucket. Usage: `toEpochHoursRounded(millis, bucketSize)` |
| toEpochDaysRounded | Converts epoch millis to epoch days, rounding to the nearest rounding bucket. Usage: `toEpochDaysRounded(millis, bucketSize)` |
fromEpochXXX
Converts from an epoch granularity to milliseconds.
| Function Name | Description |
| --- | --- |
| fromEpochSeconds | Converts from epoch seconds to milliseconds. Usage: `fromEpochSeconds(seconds)` |
| fromEpochMinutes | Converts from epoch minutes to milliseconds. Usage: `fromEpochMinutes(minutes)` |
| fromEpochHours | Converts from epoch hours to milliseconds. Usage: `fromEpochHours(hours)` |
| fromEpochDays | Converts from epoch days to milliseconds. Usage: `fromEpochDays(days)` |
Simple date format
Converts simple date format strings to milliseconds and vice versa, as per the provided pattern string.

| Function name | Description |
| --- | --- |
| toDateTime | Converts from milliseconds to a formatted date time string, as per the provided pattern. Usage: `toDateTime(millis, pattern)`, e.g. `toDateTime(millis, 'yyyy-MM-dd')` |
| fromDateTime | Converts a formatted date time string to milliseconds, as per the provided pattern. Usage: `fromDateTime(dateTimeStr, pattern)`, e.g. `fromDateTime(dateTimeStr, 'EEE MMM dd HH:mm:ss ZZZ yyyy')` |
Json functions
| Function name | Description |
| --- | --- |
| toJsonMapStr | Converts a JSON/Avro map to a string. This JSON map can then be queried using the jsonExtractScalar function. Usage: `toJsonMapStr(jsonMapField)` |
Creating a Schema
Create a schema for your data, or refer to the examples. Make sure you've set up the cluster first.
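The schema can then be uploaded with the pinot-admin tool. A sketch, assuming a schema file named transcript-schema.json and a controller running on localhost:9000:

```
bin/pinot-admin.sh AddSchema \
  -schemaFile transcript-schema.json \
  -controllerHost localhost \
  -controllerPort 9000 \
  -exec
```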
Note: schema can also be created as part of table creation, refer to Creating a table.
Check out the schema in the Rest API to make sure it was successfully uploaded