Each table in Pinot is associated with a schema. A schema defines what fields are present in the table along with their data types. The schema is stored in Zookeeper along with the table configuration.
A schema also defines what category a column belongs to. Columns in a Pinot table can be broadly categorized into three categories:
| Category | Description |
| --- | --- |
| Dimension | Dimension columns are typically used in slice and dice operations for answering business queries. Some operations for which dimension columns are used: GROUP BY (group by one or more dimension columns, along with aggregations on one or more metric columns) and filter clauses such as WHERE. |
| Metric | These columns represent the quantitative data of the table. Such columns are used for aggregation. In data warehouse terminology, these can also be referred to as fact or measure columns. Some operations for which metric columns are used: aggregations such as SUM, MIN, MAX, COUNT, AVG, and filter clauses such as WHERE. |
| DateTime | These columns represent time columns in the data. There can be multiple time columns in a table, but only one of them can be treated as primary. The primary time column is the one that is present in the segment config. The primary time column is used by Pinot for maintaining the time boundary between offline and realtime data in a hybrid table and for retention management. A primary time column is mandatory if the table's push type is APPEND. Common operations which can be done on time columns: GROUP BY and filter clauses such as WHERE. |
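In the schema JSON, these categories map to dedicated sections: `dimensionFieldSpecs`, `metricFieldSpecs`, and `dateTimeFieldSpecs`. Here is a minimal sketch of that structure (the schema and column names are illustrative placeholders, not from any real table):

```json
{
  "schemaName": "mySchema",
  "dimensionFieldSpecs": [
    { "name": "country", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "totalPrice", "dataType": "DOUBLE" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "millisSinceEpoch", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "15:MINUTES" }
  ]
}
```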
The data type helps determine the operations that can be performed on a column. The following data types are supported by Pinot:
| Data type | Dimension default value | Metric default value |
| --- | --- | --- |
| INT | Integer.MIN_VALUE | 0 |
| LONG | Long.MIN_VALUE | 0 |
| FLOAT | Float.NEGATIVE_INFINITY | 0.0 |
| DOUBLE | Double.NEGATIVE_INFINITY | 0.0 |
| STRING | "null" | "null" |
| BYTES | byte array of length 0 | byte array of length 0 |
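These defaults are what Pinot fills in when a value is missing; they can be overridden per column with the `defaultNullValue` property, as the flights schema later on this page does. A minimal sketch (column names borrowed from that example):

```json
{
  "dimensionFieldSpecs": [
    { "name": "tags", "dataType": "STRING", "defaultNullValue": "null" }
  ],
  "metricFieldSpecs": [
    { "name": "price", "dataType": "DOUBLE", "defaultNullValue": 0 }
  ]
}
```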
Since Pinot doesn't have a dedicated DATETIME data type, the user needs to input time in either STRING, LONG, or INT format. However, Pinot needs to convert the date into an understandable format, such as an epoch timestamp, to perform operations on it. To achieve this conversion, users need to provide the format of the date along with the data type in the schema. The syntax of the format is `timeSize:timeUnit:timeFormat:pattern`.
- timeSize - the size of the time unit. This size is multiplied by the value in the time column to get the actual timestamp. For example, if timeSize is 5, timeUnit is MINUTES, and the value in the time column is 4996308, the value converted to an epoch timestamp will be 4996308 * 5 * 60 * 1000 = 1498892400000 milliseconds (see the sketch after this list). If your date is not in EPOCH format, this value is not used and can be set to 1 or any other integer.
- timeUnit - one of the TimeUnit enum values, e.g. HOURS, MINUTES, etc. If your date is not in EPOCH format, this value is not used and can be set to MILLISECONDS or any other unit.
- timeFormat - either EPOCH or SIMPLE_DATE_FORMAT. If it is SIMPLE_DATE_FORMAT, the pattern string must also be specified.
- pattern - optional, and only specified when the date is in SIMPLE_DATE_FORMAT. The pattern should be specified using the Java SimpleDateFormat representation, e.g. 2020-08-21 can be represented as yyyy-MM-dd.
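To make the timeSize arithmetic above concrete, here is a sketch of a dateTimeFieldSpec for a hypothetical column that stores values as multiples of 5 minutes since epoch (the column name fiveMinutesSinceEpoch is an assumption for illustration):

```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "fiveMinutesSinceEpoch",
      "dataType": "LONG",
      "format": "5:MINUTES:EPOCH",
      "granularity": "5:MINUTES"
    }
  ]
}
```

With this format, Pinot interprets a stored value of 4996308 as 4996308 * 5 minutes since epoch, i.e. 1498892400000 milliseconds.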
Here are some sample date-time formats you can use in the schema:

- `1:MILLISECONDS:EPOCH` - used when the timestamp is in epoch milliseconds and stored in LONG format
- `1:HOURS:EPOCH` - used when the timestamp is in epoch hours and stored in LONG or INT format
- `1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd` - used when the date is in STRING format and has the pattern year-month-date, e.g. 2020-08-21
- `1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy` - used when the date is in STRING format, e.g. Mon Aug 24 12:36:50 America/Los_Angeles 2019
First, make sure your cluster is up and running. Let's create a schema and put it in a JSON file. For this example, we have created a schema for flight data.
flights-schema.json

```json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "1:HOURS:EPOCH",
      "granularity": "1:HOURS"
    },
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}
```
Then, we can upload the sample schema provided above using either a bash command or a REST API call.

```bash
bin/pinot-admin.sh AddSchema -schemaFile flights-schema.json -exec
# OR
bin/pinot-admin.sh AddTable -schemaFile flights-schema.json -tableFile flights-table.json -exec
```

```bash
curl -F schemaName=@flights-schema.json localhost:9000/schemas
```
Check out the schema in the REST API to make sure it was successfully uploaded.
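As a quick sanity check, you can also fetch the schema back by name from the controller (assuming the controller runs on localhost:9000, as in the upload command above):

```bash
# Fetch the uploaded schema by name; adjust host/port to match your controller
curl localhost:9000/schemas/flights
```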