Schema

Schema is used to define the names, data types and other information for the columns of a Pinot table.

Types of columns

Columns in a Pinot table can be broadly categorized into three categories

Column Category

Description

Dimension

Dimension columns are typically used in slice and dice operations for answering business queries. Frequent operations done on dimension columns:

  • GROUP BY - group by one or more dimension columns along with aggregations on one or more metric columns

  • Filter processing

Metric

These columns represent quantitative data of the table. Such columns are frequently used in aggregation operations. In data warehouse terminology, these are also referred to as fact or measure columns.

Frequent operations done on metric columns:

  • Aggregation - SUM, MIN, MAX, COUNT, AVG etc

  • Filter processing

DateTime

Common operations done on time column:

  • GROUP BY

  • Filter processing

Time

This has been deprecated. Use DateTime column type for time columns.

This column represents a timestamp. There can be at most one time column in a table. Common operations done on time column:

  • GROUP BY

  • Filter processing

The time column is also used internally by Pinot, for maintaining the time boundary between offline and realtime data in a hybrid table and for retention management. A time column is mandatory if the table's push type is APPEND and optional if the push type is REFRESH .

Schema format

A Pinot schema is written in JSON format. Here's an example which shows all the fields of a schema

flights-schema.json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "1:MILLSECONDS:EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "1:HOURS:EPOCH",
      "granularity": "1:HOURS"
    },
    {
      "name": "date",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}

The Pinot schema is composed of

schema fields

description

schemaName

Defines the name of the schema. This is usually the same as the table name. The offline and the realtime table of a hybrid table should use the same schema.

dimensionFieldSpecs

metricFieldSpecs

dateTimeFieldSpec

A dateTimeFieldSpec is defined for the time columns. There can be multiple time columns. For more details, scroll down to dateTimeFieldSpec.

timeFieldSpec

Below is a detailed description of each type of field spec.

dimensionFieldSpecs

A dimensionFieldSpec is defined for each dimension column. Here's a list of the fields in the dimensionFieldSpec

field

description

name

Name of the dimension column

dataType

Data type of the dimension column. Can be STRING, BOOLEAN, INT, LONG, DOUBLE, FLOAT, BYTES

<b></b>

defaultNullValue

Represents null values in the data, since Pinot doesn't support storing null column values natively (as part of its on-disk storage format). If not specified, an internal default null value is used as listed here

singleValueField

Boolean indicating if this is a single value or a multi value column. In the example above, the dimension tags is multi-valued. This means that it can have multiple values for a particular row, say tag1, tag2, tag3. For a multi-valued column, individual rows don’t necessarily need to have the same number of values. Typical use case for this would be a column such as skillSet for a person (one row in the table) that can have multiple values such as Real Estate, Mortgages.

Internal default null values for dimension

Data Type

Internal Default Null Value

INT

LONG

FLOAT

DOUBLE

STRING

"null"

BYTES

byte array of length 0

metricFieldSpecs

A metricFieldSpec is defined for each metric column. Here's a list of fields in the metricFieldSpec

field

description

name

Name of the metric column

dataType

Data type of the column. Can be INT, LONG, DOUBLE, FLOAT, BYTES (for specialized representations such as HLL, TDigest, etc, where the column stores byte serialized version of the value)

defaultNullValue

Represents null values in the data. If not specified, an internal default null value is used, as listed here. The values are the same as those used for dimensionFieldSpec.

Internal default null values for metric

Data Type

Internal Default Null Value

INT

0

LONG

0

FLOAT

0.0

DOUBLE

0.0

STRING

"null"

BYTES

byte array of length 0

dateTimeFieldSpec

A dateTimeFieldSpec is used to define time columns of the table. Here's a list of the fields in a dateTimeFieldSpec

field

description

name

Name of the date time column

dataType

Data type of the date time column. Can be STRING, INT, LONG

format

The format of the time column. The syntax of the format is timeSize:timeUnit:timeFormat

timeFormat can be either EPOCH or SIMPLE_DATE_FORMAT. If it is SIMPLE_DATE_FORMAT, the pattern string is also specified. For example:

1:MILLISECONDS:EPOCH - epoch millis

1:HOURS:EPOCH - epoch hours

1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd - date specified like 20191018

1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy - date specified like Mon Aug 24 12:36:50 America/Los_Angeles 2019

granularity

The granularity in which the column is bucketed. The syntax of granularity is bucket size:bucket unit For example, the format can be milliseconds 1:MILLISECONDS:EPOCH, but bucketed to 15 minutes i.e. we only have one value for every 15 minute interval, in which case granularity can be specified as 15:MINUTES

defaultNullValue

Represents null values in the data. If not specified, an internal default null value is used, as listed here. The values are the same as those used for dimensionFieldSpec.

timeFieldSpec

This has been deprecated. Older schemas containing timeFieldSpec will be supported. But for new schemas, use DateTimeFieldSpec instead.

A timeFieldSpec is defined for the time column. A timeFieldSpec is composed of an incomingGranularitySpec and an outgoingGranularitySpec. IncomingGranularitySpec in combination with outgoingGranularitySpec can be used to transform the time column from incoming format to the outgoing format. If both of them are specified, the segment creation process will convert the time column from the incoming format to the outgoing format. If no time column transformation is required, you can specify just the incomingGranularitySpec.

timeFieldSpec fields

Description

incomingGranularitySpec

Details of the time column in the incoming data

outgoingGranularitySpec

Details of the format to which the time column should be converted for using in Pinot

The incoming and outgoing granularitySpec are defined as:

field

description

name

Name of the time column. If incomingGranularitySpec, this is the name of the time column in the incoming data. If outgoingGranularitySpec, this is the name of the column you wish to transform it to and see in Pinot

dataType

Data type of the time column. Can be INT, LONG or STRING

timeType

Indicates the time unit. Can be one of DAYS, SECONDS, HOURS, MILLISECONDS, MICROSECONDS and NANOSECONDS

timeUnitSize

Indicates the bucket length. By default 1. E.g. in the sample above outgoing time is in fiveMinutesSinceEpoch i.e. rounded to 5 minutes buckets

timeFormat

EPOCH (millisSinceEpoch, hoursSinceEpoch etc) or SIMPLE_DATE_FORMAT (yyyyMMdd, yyyyMMdd:hhssmm etc)

Advanced fields

Apart from these, there's some advanced fields. These are common to all field specs.

field name

description

maxLength

Max length of this column

transformFunction

virtualColumnProvider

Column value provider

Ingestion Transform Functions

Transform functions can be defined on columns in the schema. For example:

"metricFieldSpecs": [
    {
      "name": "maxPrice",
      "dataType": "DOUBLE",
      "transformFunction": "Groovy({prices.max()}, prices)" // groovy function
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "1:HOURS:EPOCH",
      "granularity": "1:HOURS",
      "transformFunction": "toEpochHours(timestamp)" // inbuilt function
    }

Currently, we have support for 2 kinds of functions

  1. Groovy functions

  2. Inbuilt functions

Note

Currently, the arguments must be from the source data. They cannot be columns from the Pinot schema which have been created through transformations.

Groovy functions

Groovy functions can be defined using the syntax:

Groovy({groovy script}, argument1, argument2...argumentN)

Here's some examples of commonly needed functions. Any valid Groovy expression can be used.

String concatenation

Concat firstName and lasName to get fullName

{
      "name": "fullName",
      "dataType": "STRING",
      "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
}

Find element in an array

Find max value in array bids

{
      "name": "maxBid",
      "dataType": "INT",
      "transformFunction": "Groovy({bids.max{ it.toBigDecimal() }}, bids)"
}

Time transformation

Convert timestamp from MILLISECONDS to HOURS

"dateTimeFieldSpecs": [{
    "name": "hoursSinceEpoch",
    "dataType": "LONG",
    "format" : "1:HOURS:EPOCH",
    "granularity": "1:HOURS"
    "transformFunction": "Groovy({timestamp/(1000*60*60)}, timestamp)"
  }]

Column name change

Simply change name of the column from user_id to userId

{
      "name": "userId",
      "dataType": "LONG",
      "transformFunction": "Groovy({user_id}, user_id)"
}

Ternary operation

If eventType is IMPRESSION set impression to 1. Similar for CLICK.

{
    "name": "impressions",
    "dataType": "LONG",
    "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1: 0}, eventType)"
},
{
    "name": "clicks",
    "dataType": "LONG",
    "transformFunction": "Groovy({eventType == 'CLICK' ? 1: 0}, eventType)"
}

AVRO Map

Store an AVRO Map in Pinot as two multi-value columns. Sort the keys, to maintain the mapping. 1) The keys of the map as map_keys 2) The values of the map as map_values

{
      "name": "map2_keys",
      "dataType": "STRING",
      "singleValueField": false,
      "transformFunction": "Groovy({map2.sort()*.key}, map2)"
},
{
      "name": "map2_values",
      "dataType": "INT",
      "singleValueField": false,
      "transformFunction": "Groovy({map2.sort()*.value}, map2)"
}

Inbuilt Pinot functions

We have several inbuilt functions that can be used directly in as ingestion transform functions

DateTime functions

These are functions which enable commonly needed time transformations.

toEpochXXX

Converts from epoch milliseconds to a higher granularity.

Function name

Description

toEpochSeconds

Converts epoch millis to epoch seconds.

Usage: "transformFunction": "toEpochSeconds(millis)"

toEpochMinutes

Converts epoch millis to epoch minutes

Usage: "transformFunction": "toEpochMinutes(millis)"

toEpochHours

Converts epoch millis to epoch hours

Usage: "transformFunction": "toEpochHours(millis)"

toEpochDays

Converts epoch millis to epoch days

Usage: "transformFunction": "toEpochDays(millis)"

toEpochXXXRounded

Converts from epoch milliseconds to another granularity, rounding to the nearest rounding bucket. For example, 1588469352000 (2020-05-01 42:29:12) is 26474489 minutesSinceEpoch. `toEpochMinutesRounded(1588469352000) = 26474480 (2020-05-01 42:20:00)

Function Name

Description

toEpochSecondsRounded

Converts epoch millis to epoch seconds, rounding to nearest rounding bucket

"transformFunction": "toEpochSecondsRounded(millis, 30)"

toEpochMinutesRounded

Converts epoch millis to epoch seconds, rounding to nearest rounding bucket

"transformFunction": "toEpochMinutesRounded(millis, 10)"

toEpochHoursRounded

Converts epoch millis to epoch seconds, rounding to nearest rounding bucket

"transformFunction": "toEpochHoursRounded(millis, 6)"

toEpochDaysRounded

Converts epoch millis to epoch seconds, rounding to nearest rounding bucket

"transformFunction": "toEpochDaysRounded(millis, 7)"

fromEpochXXX

Converts from an epoch granularity to milliseconds.

Function Name

Description

fromEpochSeconds

Converts from epoch seconds to milliseconds

"transformFunction": "fromEpochSeconds(secondsSinceEpoch)"

fromEpochMinutes

Converts from epoch minutes to milliseconds

"transformFunction": "fromEpochMinutes(minutesSinceEpoch)"

fromEpochHours

Converts from epoch hours to milliseconds

"transformFunction": "fromEpochHours(hoursSinceEpoch)"

fromEpochDays

Converts from epoch days to milliseconds

"transformFunction": "fromEpochDays(daysSinceEpoch)"

Simple date format

Converts simple date format strings to milliseconds and vice-a-versa, as per the provided pattern string.

Function name

Description

toDateTime

Converts from milliseconds to a formatted date time string, as per the provided pattern

"transformFunction": "toDateTime(millis, 'yyyy-MM-dd')"

fromDateTime

Converts a formatted date time string to milliseconds, as per the provided pattern

"transformFunction": "fromDateTime(dateTimeStr, 'EEE MMM dd HH:mm:ss ZZZ yyyy')"

Json functions

Function name

Description

toJsonMapStr

"transformFunction": "toJsonMapStr(jsonMapField)"

Creating a Schema

Create a schema for your data, or see examples for examples. Make sure you've setup the cluster

Note: schema can also be created as part of table creation, refer to Creating a table.

bin/pinot-admin.sh AddSchema -schemaFile transcript-schema.json -exec

Check out the schema in the Rest API to make sure it was successfully uploaded

Last updated