Schema
A schema defines the names, data types, and other information for the columns of a Pinot table.
Types of columns
Columns in a Pinot table can be broadly categorized into three categories (a fourth, Time, is deprecated):
Column Category
Description
Dimension
Dimension columns are typically used in slice and dice operations for answering business queries. Frequent operations done on dimension columns:
GROUP BY - group by one or more dimension columns along with aggregations on one or more metric columns
Filter processing
Metric
These columns represent quantitative data of the table. Such columns are frequently used in aggregation operations. In data warehouse terminology, these are also referred to as fact or measure columns.
Frequent operations done on metric columns:
Aggregation - SUM, MIN, MAX, COUNT, AVG etc
Filter processing
DateTime
DateTime columns represent the time columns in the data. There can be multiple time columns in a table, but only one of them is the primary time column, which is the one set in the segmentConfig. Pinot uses the primary time column to maintain the time boundary between offline and realtime data in a hybrid table, and for retention management. A primary time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH.
Common operations done on time column:
GROUP BY
Filter processing
Time
This has been deprecated. Use DateTime column type for time columns.
This column represents a timestamp. There can be at most one time column in a table. Common operations done on time column:
GROUP BY
Filter processing
The time column is also used internally by Pinot to maintain the time boundary between offline and realtime data in a hybrid table, and for retention management. A time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH.
Schema format
A Pinot schema is written in JSON format. Here's an example which shows all the fields of a schema
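A minimal sketch of such a schema, covering the non-deprecated sections. The schema name and column names (flights, flightNumber, tags, price, hoursSinceEpoch) are illustrative, and the plural key dateTimeFieldSpecs is an assumption about how the dateTimeFieldSpec section described below is keyed in the JSON; verify it against your Pinot version.

```json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    { "name": "flightNumber", "dataType": "LONG" },
    { "name": "tags", "dataType": "STRING", "singleValueField": false, "defaultNullValue": "null" }
  ],
  "metricFieldSpecs": [
    { "name": "price", "dataType": "DOUBLE", "defaultNullValue": 0 }
  ],
  "dateTimeFieldSpecs": [
    { "name": "hoursSinceEpoch", "dataType": "INT", "format": "1:HOURS:EPOCH", "granularity": "1:HOURS" }
  ]
}
```

Here tags is declared multi-valued by setting singleValueField to false, while the other columns fall back to the single-value default.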
The Pinot schema is composed of the following fields:
| schema field | description |
| --- | --- |
| schemaName | Defines the name of the schema. This is usually the same as the table name. The offline and the realtime table of a hybrid table should use the same schema. |
| dimensionFieldSpecs | A dimensionFieldSpec is defined for each dimension column. For more details, see the dimensionFieldSpecs section below. |
| metricFieldSpecs | A metricFieldSpec is defined for each metric column. For more details, see the metricFieldSpecs section below. |
| dateTimeFieldSpec | A dateTimeFieldSpec is defined for the time columns. There can be multiple time columns. For more details, see the dateTimeFieldSpec section below. |
| timeFieldSpec | Deprecated. Use dateTimeFieldSpec instead. A timeFieldSpec is defined for the time column. There can only be one time column. For more details, see the timeFieldSpec section below. |
Below is a detailed description of each type of field spec.
dimensionFieldSpecs
A dimensionFieldSpec is defined for each dimension column. It has the following fields:
| field | description |
| --- | --- |
| name | Name of the dimension column |
| dataType | Data type of the dimension column. Can be STRING, BOOLEAN, INT, LONG, DOUBLE, FLOAT, BYTES |
| defaultNullValue | Represents null values in the data, since Pinot doesn't support storing null column values natively (as part of its on-disk storage format). If not specified, an internal default null value is used, as listed in the table below. |
| singleValueField | Boolean indicating whether this is a single-value or a multi-value column. For example, a dimension named tags can be multi-valued, meaning that a particular row can hold several values, say tag1, tag2, tag3. Individual rows of a multi-valued column don't need to have the same number of values. A typical use case is a column such as skillSet for a person (one row in the table), which can have multiple values such as Real Estate, Mortgages. |
Internal default null values for dimension
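| Data Type | Internal Default Null Value |
| --- | --- |
| INT | Integer.MIN_VALUE |
| LONG | Long.MIN_VALUE |
| FLOAT | Float.NEGATIVE_INFINITY |
| DOUBLE | Double.NEGATIVE_INFINITY |
| STRING | "null" |
| BYTES | byte array of length 0 |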
metricFieldSpecs
A metricFieldSpec is defined for each metric column. It has the following fields:
| field | description |
| --- | --- |
| name | Name of the metric column |
| dataType | Data type of the column. Can be INT, LONG, DOUBLE, FLOAT, BYTES (for specialized representations such as HLL, TDigest, etc., where the column stores a byte-serialized version of the value) |
| defaultNullValue | Represents null values in the data. If not specified, an internal default null value is used, as listed in the table below. |
Internal default null values for metric
| Data Type | Internal Default Null Value |
| --- | --- |
| INT | 0 |
| LONG | 0 |
| FLOAT | 0.0 |
| DOUBLE | 0.0 |
| STRING | "null" |
| BYTES | byte array of length 0 |
dateTimeFieldSpec
A dateTimeFieldSpec is used to define the time columns of the table. It has the following fields:
| field | description |
| --- | --- |
| name | Name of the date time column |
| dataType | Data type of the date time column. Can be STRING, INT, LONG |
| format | The format of the time column, written as timeSize:timeUnit:timeFormat. timeFormat can be either EPOCH or SIMPLE_DATE_FORMAT; for SIMPLE_DATE_FORMAT, the pattern string is appended as well. Examples: 1:MILLISECONDS:EPOCH (epoch millis); 1:HOURS:EPOCH (epoch hours); 1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd (date specified like 20191018); 1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy (date specified like Mon Aug 24 12:36:50 America/Los_Angeles 2019) |
| granularity | The granularity in which the column is bucketed, written as bucket size:bucket unit. For example, the format can be milliseconds (1:MILLISECONDS:EPOCH) while the values are bucketed to 15 minutes, i.e. there is only one value for every 15-minute interval; in that case the granularity is 15:MINUTES |
| defaultNullValue | Represents null values in the data. If not specified, an internal default null value is used. The values are the same as those used for dimensionFieldSpec. |
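The format and granularity syntax can be sketched with two time columns. The column names (millisSinceEpoch, date) are illustrative, and the enclosing dateTimeFieldSpecs key is an assumption about the schema JSON; the format and granularity strings are taken from the examples above.

```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "date",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd",
      "granularity": "1:DAYS"
    }
  ]
}
```

The first column stores epoch milliseconds but carries only one distinct value per 15-minute interval; the second stores day-level strings such as 20191018.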
timeFieldSpec
This has been deprecated. Older schemas containing timeFieldSpec are still supported, but new schemas should use dateTimeFieldSpec instead.
A timeFieldSpec is defined for the time column. A timeFieldSpec is composed of an incomingGranularitySpec and an outgoingGranularitySpec. IncomingGranularitySpec in combination with outgoingGranularitySpec can be used to transform the time column from incoming format to the outgoing format. If both of them are specified, the segment creation process will convert the time column from the incoming format to the outgoing format. If no time column transformation is required, you can specify just the incomingGranularitySpec.
| timeFieldSpec field | description |
| --- | --- |
| incomingGranularitySpec | Details of the time column in the incoming data |
| outgoingGranularitySpec | Details of the format to which the time column should be converted for use in Pinot |
The incoming and outgoing granularitySpec are defined as:
| field | description |
| --- | --- |
| name | Name of the time column. For incomingGranularitySpec, this is the name of the time column in the incoming data. For outgoingGranularitySpec, this is the name of the column you wish to transform it to and see in Pinot |
| dataType | Data type of the time column. Can be INT, LONG or STRING |
| timeType | Indicates the time unit. Can be one of DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, MICROSECONDS and NANOSECONDS |
| timeUnitSize | Indicates the bucket length. Defaults to 1. For example, an outgoing time column named fiveMinutesSinceEpoch is rounded to 5-minute buckets |
| timeFormat | EPOCH (millisSinceEpoch, hoursSinceEpoch etc) or SIMPLE_DATE_FORMAT (yyyyMMdd, yyyyMMdd:hhssmm etc) |
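As a sketch of how the two granularity specs fit together, the deprecated timeFieldSpec below converts an incoming epoch-millis column into 5-minute buckets. The incoming column name incomingTime is hypothetical, and MINUTES is assumed to be an accepted timeType; the field names come from the table above.

```json
{
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "incomingTime",
      "dataType": "LONG",
      "timeType": "MILLISECONDS",
      "timeFormat": "EPOCH"
    },
    "outgoingGranularitySpec": {
      "name": "fiveMinutesSinceEpoch",
      "dataType": "LONG",
      "timeType": "MINUTES",
      "timeUnitSize": 5,
      "timeFormat": "EPOCH"
    }
  }
}
```

If no conversion is needed, only the incomingGranularitySpec would be specified.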
Advanced fields
Apart from these, there are some advanced fields. These are common to all field specs.
| field name | description |
| --- | --- |
| maxLength | Max length of this column |
| transformFunction | Transform function to generate this column. See the section below. |
| virtualColumnProvider | Column value provider |
Ingestion Transform Functions
Transform functions can be defined on columns in the schema. For example:
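A minimal sketch of field specs that carry a transformFunction. The source fields prices and timestamp and the column names are hypothetical, and the dateTimeFieldSpecs key is an assumption about the schema JSON; the first entry uses a Groovy function and the second an inbuilt function, both described in the sections below.

```json
{
  "metricFieldSpecs": [
    {
      "name": "maxPrice",
      "dataType": "DOUBLE",
      "transformFunction": "Groovy({prices.max()}, prices)"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "1:HOURS:EPOCH",
      "granularity": "1:HOURS",
      "transformFunction": "toEpochHours(timestamp)"
    }
  ]
}
```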
Currently, two kinds of functions are supported:
Groovy functions
Inbuilt functions
Note
Currently, the arguments must be from the source data. They cannot be columns from the Pinot schema which have been created through transformations.
Groovy functions
Groovy functions can be defined using the syntax:
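As a sketch of that syntax: the Groovy script goes inside braces, followed by the source fields it reads. This form is recalled from Pinot's documentation rather than stated in this section, and the placeholder names are illustrative, so confirm it against your Pinot version.

```json
{
  "transformFunction": "Groovy({groovy script}, argument1, argument2...argumentN)"
}
```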
Here are some examples of commonly needed functions. Any valid Groovy expression can be used.
String concatenation
Concat firstName and lastName to get fullName
Find element in an array
Find max value in array bids
Time transformation
Convert timestamp from MILLISECONDS to HOURS
Column name change
Simply change the name of the column from user_id to userId
Ternary operation
If eventType is IMPRESSION, set impression to 1. Similar for CLICK.
AVRO Map
Store an AVRO Map in Pinot as two multi-value columns. Sort the keys to maintain the mapping.
1) The keys of the map as map_keys
2) The values of the map as map_values
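The cases above could be written roughly as follows. This is a sketch, not the original examples: the names maxBid, click, hoursSinceEpoch and the source map field map2 are hypothetical, the data types are guesses, and the Groovy({script}, args) wrapper and dateTimeFieldSpecs key are assumptions to verify against your Pinot version. Note that Groovy's / operator on integers can yield a decimal, so the time conversion may need an explicit integer division depending on how the result is converted to the column type.

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "fullName",
      "dataType": "STRING",
      "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
    },
    {
      "name": "userId",
      "dataType": "LONG",
      "transformFunction": "Groovy({user_id}, user_id)"
    },
    {
      "name": "map_keys",
      "dataType": "STRING",
      "singleValueField": false,
      "transformFunction": "Groovy({map2.sort()*.key}, map2)"
    },
    {
      "name": "map_values",
      "dataType": "INT",
      "singleValueField": false,
      "transformFunction": "Groovy({map2.sort()*.value}, map2)"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "maxBid",
      "dataType": "DOUBLE",
      "transformFunction": "Groovy({bids.max{ it.toBigDecimal() }}, bids)"
    },
    {
      "name": "impression",
      "dataType": "INT",
      "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1 : 0}, eventType)"
    },
    {
      "name": "click",
      "dataType": "INT",
      "transformFunction": "Groovy({eventType == 'CLICK' ? 1 : 0}, eventType)"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "hoursSinceEpoch",
      "dataType": "LONG",
      "format": "1:HOURS:EPOCH",
      "granularity": "1:HOURS",
      "transformFunction": "Groovy({timestamp/(1000*60*60)}, timestamp)"
    }
  ]
}
```

Sorting the map before extracting keys and values keeps the two multi-value columns aligned position by position.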
Inbuilt Pinot functions
Pinot has several inbuilt functions that can be used directly as ingestion transform functions.
DateTime functions
These are functions which enable commonly needed time transformations.
toEpochXXX
Converts from epoch milliseconds to a higher granularity.
| Function name | Description |
| --- | --- |
| toEpochSeconds | Converts epoch millis to epoch seconds. Usage: "transformFunction": "toEpochSeconds(millis)" |
| toEpochMinutes | Converts epoch millis to epoch minutes. Usage: "transformFunction": "toEpochMinutes(millis)" |
| toEpochHours | Converts epoch millis to epoch hours. Usage: "transformFunction": "toEpochHours(millis)" |
| toEpochDays | Converts epoch millis to epoch days. Usage: "transformFunction": "toEpochDays(millis)" |
toEpochXXXRounded
Converts from epoch milliseconds to another granularity, rounding down to the start of the rounding bucket. For example, 1588469352000 (2020-05-03 01:29:12 UTC) is 26474489 minutesSinceEpoch, and toEpochMinutesRounded(1588469352000, 10) = 26474480 (2020-05-03 01:20:00 UTC).
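The arithmetic behind this example can be written out as a worked calculation. This is a sketch that assumes the function truncates to whole minutes and then floors to a multiple of the bucket size 10, which is what the values above imply; it is not taken from the Pinot source.

```latex
\begin{align*}
m &= \left\lfloor \frac{1588469352000}{60 \times 1000} \right\rfloor = 26474489 \\
m_{\text{rounded}} &= \left\lfloor \frac{m}{10} \right\rfloor \times 10 = 2647448 \times 10 = 26474480
\end{align*}
```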
| Function Name | Description |
| --- | --- |
| toEpochSecondsRounded | Converts epoch millis to epoch seconds, rounded to the bucket size given as the second argument. Usage: "transformFunction": "toEpochSecondsRounded(millis, 30)" |
| toEpochMinutesRounded | Converts epoch millis to epoch minutes, rounded to the bucket size given as the second argument. Usage: "transformFunction": "toEpochMinutesRounded(millis, 10)" |
| toEpochHoursRounded | Converts epoch millis to epoch hours, rounded to the bucket size given as the second argument. Usage: "transformFunction": "toEpochHoursRounded(millis, 6)" |
| toEpochDaysRounded | Converts epoch millis to epoch days, rounded to the bucket size given as the second argument. Usage: "transformFunction": "toEpochDaysRounded(millis, 7)" |
fromEpochXXX
Converts from an epoch granularity to milliseconds.
| Function Name | Description |
| --- | --- |
| fromEpochSeconds | Converts from epoch seconds to milliseconds. Usage: "transformFunction": "fromEpochSeconds(secondsSinceEpoch)" |
| fromEpochMinutes | Converts from epoch minutes to milliseconds. Usage: "transformFunction": "fromEpochMinutes(minutesSinceEpoch)" |
| fromEpochHours | Converts from epoch hours to milliseconds. Usage: "transformFunction": "fromEpochHours(hoursSinceEpoch)" |
| fromEpochDays | Converts from epoch days to milliseconds. Usage: "transformFunction": "fromEpochDays(daysSinceEpoch)" |
Simple date format
Converts simple date format strings to milliseconds and vice versa, as per the provided pattern string.
| Function name | Description |
| --- | --- |
| toDateTime | Converts from milliseconds to a formatted date time string, as per the provided pattern. Usage: "transformFunction": "toDateTime(millis, 'yyyy-MM-dd')" |
| fromDateTime | Converts a formatted date time string to milliseconds, as per the provided pattern. Usage: "transformFunction": "fromDateTime(dateTimeStr, 'EEE MMM dd HH:mm:ss ZZZ yyyy')" |
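Placed in a schema, these two functions might look like the sketch below. The column names day and millisSinceEpoch are hypothetical, millis and dateTimeStr stand for fields in the source data, and the dateTimeFieldSpecs key is an assumption about the schema JSON.

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "day",
      "dataType": "STRING",
      "transformFunction": "toDateTime(millis, 'yyyy-MM-dd')"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS",
      "transformFunction": "fromDateTime(dateTimeStr, 'yyyy-MM-dd')"
    }
  ]
}
```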
Json functions
| Function name | Description |
| --- | --- |
| toJsonMapStr | Converts a JSON/Avro map to a string. This JSON map can then be queried using the jsonExtractScalar function. Usage: "transformFunction": "toJsonMapStr(jsonMapField)" |
Creating a Schema
Create a schema for your data, or refer to the examples. Make sure you have set up the cluster first.
Note: a schema can also be created as part of table creation; refer to Creating a table.
After uploading, check the schema in the REST API to make sure it was uploaded successfully.