Schema

Each table in Pinot is associated with a Schema. A schema defines what fields are present in the table along with the data types.

The schema is stored in the Zookeeper, along with the table configuration.

Categories

A schema also defines what category a column belongs to. Columns in a Pinot table can be categorized into three categories:

Pinot does not enforce strict rules on which of these categories columns belong to, rather the categories can be thought of as hints to Pinot to do internal optimizations.

For example, metrics may be stored without a dictionary and can have a different default null value.

The categories are also relevant when doing segment merge and rollups. Pinot uses the dimension and time fields to identify records against which to apply merge/rollups.

Metrics aggregation is another example where Pinot uses dimensions and time are used as the key, and automatically aggregates values for the metric columns.

Data Types

Data types determine the operations that can be performed on a column. Pinot supports the following data types:

BOOLEAN, TIMESTAMP, JSON are added after release 0.7.1. In release 0.7.1 and older releases, BOOLEAN is equivalent to STRING. BIG_DECIMAL is added after release 0.10.0.

The lowest granularity TIMESTAMP type supports is milliseconds epoch, nanoseconds is not supported.

Pinot also supports columns that contain lists or arrays of items, but there isn't an explicit data type to represent these lists or arrays. Instead, you can indicate that a dimension column accepts multiple values. For more information, see DimensionFieldSpec in the Schema configuration reference.

Date Time Fields

Since Pinot doesn't have a dedicated DATETIME datatype support, you need to input time in either STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format such as epoch timestamp to do operations. You can refer to DateTime field spec configs for more details on supported formats.

Built-in Virtual Columns

There are several built-in virtual columns inside the schema the can be used for debugging purposes:

These virtual columns can be used in queries in a similar way to regular columns.

Creating a Schema

First, Make sure your cluster is up and running.

Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.

For more details on constructing a schema file, see the Schema configuration reference.

flights-schema.json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "EPOCH|HOURS",
      "granularity": "1:HOURS"
    },
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}

Then, we can upload the sample schema provided above using either a Bash command or REST API call.

bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec

OR

bin/pinot-admin.sh AddTable -schemaFile flights-schema.json -tableFile flights-table.json -exec

Check out the schema in the Rest API to make sure it was successfully uploaded

Last updated