Schema

Each table in Pinot is associated with a schema. A schema defines which fields are present in the table, along with their data types.

The schema is stored in ZooKeeper, along with the table configuration.

Categories

A schema also assigns each column to one of the following categories:

Category	Description
Dimension	Dimension columns are typically used in slice-and-dice operations for answering business queries. Common operations on dimension columns: `GROUP BY` (grouping by one or more dimension columns along with aggregations on one or more metric columns) and filter clauses such as `WHERE`.
Metric	Metric columns represent the quantitative data of the table and are used for aggregation. In data warehouse terminology, they are also referred to as fact or measure columns. Common operations on metric columns: aggregations such as `SUM`, `MIN`, `MAX`, `COUNT`, and `AVG`, and filter clauses such as `WHERE`.
DateTime	DateTime columns represent time columns in the data. A table can have multiple time columns, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config. Pinot uses the primary time column to maintain the time boundary between offline and real-time data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is `APPEND` and optional if the push type is `REFRESH`. Common operations on time columns: `GROUP BY` and filter clauses such as `WHERE`.

Data Types

Data types determine the operations that can be performed on a column. Pinot supports the following data types:

Data Type	Default Dimension Value	Default Metric Value
INT	Integer.MIN_VALUE	0
LONG	Long.MIN_VALUE	0
FLOAT	Float.NEGATIVE_INFINITY	0.0
DOUBLE	Double.NEGATIVE_INFINITY	0.0
BIG_DECIMAL	Not supported	0.0
BOOLEAN	0 (false)	N/A
TIMESTAMP	0 (1970-01-01 00:00:00 UTC)	N/A
STRING	"null"	N/A
JSON	"null"	N/A
BYTES	byte array of length 0	byte array of length 0
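
To make the Java constants in the table concrete, here is a minimal Python sketch that spells out the default dimension values as plain values. The dictionary name is hypothetical and exists only for illustration; the values come from the table above, with `Integer.MIN_VALUE` and `Long.MIN_VALUE` written out as the 32-bit and 64-bit signed minimums.

```python
import math

# Hypothetical lookup table mirroring the "Default Dimension Value" column.
# BIG_DECIMAL is omitted because the table lists it as not supported
# for dimension columns.
DEFAULT_DIMENSION_VALUE = {
    "INT": -2**31,         # Integer.MIN_VALUE
    "LONG": -2**63,        # Long.MIN_VALUE
    "FLOAT": -math.inf,    # Float.NEGATIVE_INFINITY
    "DOUBLE": -math.inf,   # Double.NEGATIVE_INFINITY
    "BOOLEAN": 0,          # false
    "TIMESTAMP": 0,        # 1970-01-01 00:00:00 UTC
    "STRING": "null",
    "JSON": "null",
    "BYTES": b"",          # byte array of length 0
}

for data_type, default in DEFAULT_DIMENSION_VALUE.items():
    print(f"{data_type}: {default!r}")
```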

BOOLEAN, TIMESTAMP, and JSON were added after release 0.7.1. In release 0.7.1 and earlier, BOOLEAN is equivalent to STRING. BIG_DECIMAL was added after release 0.10.0.

The finest granularity that the TIMESTAMP type supports is epoch milliseconds; nanoseconds are not supported.

Pinot also supports columns that contain lists or arrays of items, but there isn't an explicit data type to represent these lists or arrays. Instead, you can indicate that a dimension column accepts multiple values. For more information, see DimensionFieldSpec in the Schema configuration reference.

Date Time Fields

Because Pinot doesn't have a dedicated DATETIME data type, you need to provide time values in STRING, LONG, or INT format. Pinot then needs to convert the value into a format it can operate on, such as an epoch timestamp. See the DateTime field spec configs for more details on supported formats.
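
As an illustration of what those three input types can look like, the following Python sketch encodes one instant as epoch milliseconds (LONG), epoch hours (INT), and a `yyyy-MM-dd` string (STRING). The sample instant and the variable names are assumptions chosen for this example; the sketch only shows the encodings and does not reproduce Pinot's own conversion logic.

```python
from datetime import datetime, timezone

# Hypothetical sample instant, chosen only for illustration.
instant = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# LONG: milliseconds since 1970-01-01 00:00:00 UTC.
millis_since_epoch = int(instant.timestamp()) * 1000

# INT: whole hours since the epoch (3,600,000 ms per hour).
hours_since_epoch = millis_since_epoch // 3_600_000

# STRING: calendar date in the yyyy-MM-dd pattern.
date_string = instant.strftime("%Y-%m-%d")

print(millis_since_epoch)  # 1609502400000
print(hours_since_epoch)   # 447084
print(date_string)         # 2021-01-01
```

Coarser encodings lose information: the hours value drops minutes and seconds, and the date string drops the time of day entirely.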

Built-in Virtual Columns

There are several built-in virtual columns in the schema that can be used for debugging purposes:

Column Name	Column Type	Data Type	Description
$hostName	Dimension	STRING	Name of the server hosting the data
$segmentName	Dimension	STRING	Name of the segment containing the record
$docId	Dimension	INT	Document id of the record within the segment

These virtual columns can be used in queries in a similar way to regular columns.
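
For example, a debugging query might select the virtual columns next to a regular column. The Python sketch below only builds the SQL text and wraps it in a JSON body. The table name `flights`, the regular column `flightNumber`, and the `{"sql": ...}` payload shape are assumptions made for illustration, so check them against your own table and the query API of your Pinot version.

```python
import json

# Hypothetical table and column names, used only for illustration.
table = "flights"
regular_column = "flightNumber"

# Virtual columns are listed in the SELECT clause like any other column.
sql = (
    f"SELECT $hostName, $segmentName, $docId, {regular_column} "
    f"FROM {table} LIMIT 10"
)

# Assumed request body shape for submitting the query as JSON.
payload = json.dumps({"sql": sql})
print(payload)
```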

Creating a Schema

First, make sure your cluster is up and running.

Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.

For more details on constructing a schema file, see the Schema configuration reference.

flights-schema.json

{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "EPOCH|HOURS",
      "granularity": "1:HOURS"
    },
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}
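
Before uploading, it can help to sanity-check the file locally. The sketch below is one possible way to do that in Python, not an official Pinot tool: it parses a trimmed copy of the schema above and reports each column's category, data type, and whether it is multi-value. The helper name `summarize` is hypothetical, and treating a missing `singleValueField` as `true` is an assumption that matches the example, where only `tags` sets the flag.

```python
import json

# Trimmed copy of the flights schema above (a subset of fields, for brevity).
SCHEMA_JSON = """
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {"name": "flightNumber", "dataType": "LONG"},
    {"name": "tags", "dataType": "STRING", "singleValueField": false}
  ],
  "metricFieldSpecs": [
    {"name": "price", "dataType": "DOUBLE", "defaultNullValue": 0}
  ],
  "dateTimeFieldSpecs": [
    {"name": "millisSinceEpoch", "dataType": "LONG",
     "format": "EPOCH", "granularity": "15:MINUTES"}
  ]
}
"""

# Maps each field spec list in the schema to its column category.
CATEGORIES = {
    "dimensionFieldSpecs": "Dimension",
    "metricFieldSpecs": "Metric",
    "dateTimeFieldSpecs": "DateTime",
}


def summarize(schema):
    """Return {column name: (category, data type, is_multi_value)}."""
    summary = {}
    for key, category in CATEGORIES.items():
        for spec in schema.get(key, []):
            # Assumption: a column is single-value unless it says otherwise.
            multi_value = not spec.get("singleValueField", True)
            # Raises KeyError if a spec lacks "name" or "dataType".
            summary[spec["name"]] = (category, spec["dataType"], multi_value)
    return summary


summary = summarize(json.loads(SCHEMA_JSON))
for name, info in summary.items():
    print(name, info)
```

A malformed file fails early here, either in `json.loads` or with a `KeyError` for a spec that is missing `name` or `dataType`.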

Then, upload the sample schema provided above using either the pinot-admin.sh command or a REST API call.

Using pinot-admin.sh, add the schema on its own:

bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec

OR add the schema together with a table:

bin/pinot-admin.sh AddTable -schemaFile flights-schema.json -tableFile flights-table.json -exec

Using the REST API:

curl -F schemaName=@flights-schema.json localhost:9000/schemas

Check the schema through the REST API to make sure it was uploaded successfully.
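
One way to script that check is sketched below in Python, using only the standard library. It assumes the controller listens on `localhost:9000` as in the curl command above and that a schema can be read back with a GET request on `/schemas/<schemaName>`. The helper names `schema_url` and `fetch_schema` are hypothetical.

```python
import json
import urllib.request


def schema_url(controller: str, schema_name: str) -> str:
    """Build the URL for reading one schema from the controller."""
    return f"{controller.rstrip('/')}/schemas/{schema_name}"


def fetch_schema(controller: str, schema_name: str) -> dict:
    """Fetch a schema as a dict; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(schema_url(controller, schema_name)) as resp:
        return json.load(resp)


print(schema_url("http://localhost:9000", "flights"))
```

With a running cluster, `fetch_schema("http://localhost:9000", "flights")` should return the uploaded schema, and its `schemaName` field can be compared against `flights`.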
