
Schema

Schema is used to define the names, data types, and other information for the columns of a Pinot table.

The Pinot schema is composed of:

Field
Description

schemaName

Defines the name of the schema. This is usually the same as the table name. The offline and the realtime table of a hybrid table should use the same schema.

dimensionFieldSpec

A dimensionFieldSpec is defined for each dimension column. For more details, scroll down to DimensionFieldSpec.

metricFieldSpec

A metricFieldSpec is defined for each metric column. For more details, scroll down to MetricFieldSpec.

dateTimeFieldSpec

A dateTimeFieldSpec is defined for the time columns. There can be multiple time columns. For more details, scroll down to DateTimeFieldSpec.

flights-schema.json
{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    {
      "name": "flightNumber",
      "dataType": "LONG"
    },
    {
      "name": "tags",
      "dataType": "STRING",
      "singleValueField": false,
      "defaultNullValue": "null"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "price",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "millisSinceEpoch",
      "dataType": "LONG",
      "format": "EPOCH",
      "granularity": "15:MINUTES"
    },
    {
      "name": "hoursSinceEpoch",
      "dataType": "INT",
      "format": "EPOCH|HOURS",
      "granularity": "1:HOURS"
    },
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}

The above JSON configuration is an example of a Pinot schema derived from flight data. As seen in the example, the schema is composed of four parts: schemaName, dimensionFieldSpec, metricFieldSpec, and dateTimeFieldSpec. Below is a detailed description of each type of field spec.

DimensionFieldSpec

A dimensionFieldSpec is defined for each dimension column. Here's a list of the fields in the dimensionFieldSpec:

Property
Description

name

Name of the dimension column.

dataType

Data type of the dimension column. Can be INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING, BYTES, JSON.

defaultNullValue

Represents null values in the data, since Pinot doesn't support storing null column values natively (as part of its on-disk storage format). If not specified, an internal default null value is used, as listed below.

singleValueField

Boolean indicating if this is a single-valued or a multi-valued column. A multi-valued column is modeled as a list, where the order of the values is preserved and duplicate values are allowed. Individual rows don't necessarily have the same number of values. A typical use case for this would be a column such as skillSet for a person (one row in the table) that can have multiple values such as Real Estate, Mortgages. The default null value for a multi-valued column is a single defaultNullValue, e.g. [Integer.MIN_VALUE].
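
Following the skillSet example above, a multi-valued dimension could be declared like this sketch (the column name and default null value shown are illustrative):

{
  "name": "skillSet",
  "dataType": "STRING",
  "singleValueField": false,
  "defaultNullValue": "null"
}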

Internal default null values for dimension

Data Type
Internal Default Null Value

INT

Integer.MIN_VALUE

LONG

Long.MIN_VALUE

FLOAT

Float.NEGATIVE_INFINITY

DOUBLE

Double.NEGATIVE_INFINITY

BOOLEAN

0 (false)

TIMESTAMP

0 (1970-01-01 00:00:00 UTC)

STRING

"null"

BYTES

byte array of length 0

JSON

"null"

MetricFieldSpec

A metricFieldSpec is defined for each metric column. Here's a list of the fields in the metricFieldSpec:

Property
Description

name

Name of the metric column

dataType

Data type of the column. Can be INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, BYTES (for specialized representations such as HLL, TDigest, etc., where the column stores a byte-serialized version of the value).

defaultNullValue

Represents null values in the data. If not specified, an internal default null value is used, as listed below.
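
As a sketch of the specialized BYTES case mentioned above, a metric column holding a byte-serialized HLL sketch could be declared as follows (the column name is illustrative):

{
  "name": "distinctUsersHLL",
  "dataType": "BYTES"
}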

Internal default null values for metric

Data Type
Internal Default Null Value

INT

0

LONG

0

FLOAT

0.0

DOUBLE

0.0

BIG_DECIMAL

0.0

BYTES

byte array of length 0

DateTimeFieldSpec

A dateTimeFieldSpec is used to define the time columns of the table. The following fields can be configured in the date time field spec:

Property
Description

name

Name of the date time column

dataType

Data type of the date time column. Can be STRING, INT, LONG or TIMESTAMP

format

The format in which the datetime is present in the column. Refer to the date time formats sections below for supported formats.

granularity

The granularity in which the column is bucketed. The syntax of granularity is bucket size:bucket unit. For example, the format can be milliseconds 1:MILLISECONDS:EPOCH, but bucketed to 15 minutes, i.e. we only have one value for every 15-minute interval, in which case granularity can be specified as 15:MINUTES. Currently this is just for documentation purposes; Pinot won't automatically round the time value to the specified granularity.

defaultNullValue

Represents null values in the data. If not specified, an internal default null value is used. If the date time is in STRING format, the default value is null; if it is a timestamp, the default is epoch 0 (i.e. 1970-01-01 00:00:00).

For the main time column of the table (the time column specified in the segmentsConfig section of the table config), the value must be in the range of 1971-01-01 UTC to 2071-01-01 UTC for segment management purposes (e.g. retention, time boundary). If the specified default null value is not within this range, the segment creation time is used.
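
For reference, the main time column is selected in the table config's segmentsConfig section, roughly as in this trimmed sketch (the retention settings shown are arbitrary examples, and other required table config sections are omitted):

{
  "tableName": "flights",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "millisSinceEpoch",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365"
  }
}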

New DateTime Formats

In the Pinot master branch (0.12.0-SNAPSHOT), we have simplified the date time formats. The formats now follow the pattern timeFormat|pattern/timeUnit|[timeZone/timeSize]. The fields in [] are completely optional. timeFormat can be one of EPOCH, SIMPLE_DATE_FORMAT or TIMESTAMP.

  • TIMESTAMP - This represents a timestamp in milliseconds. It is equivalent to specifying EPOCH:MILLISECONDS:1. Examples -

    • TIMESTAMP

  • EPOCH - This represents time in timeUnit since 00:00:00 UTC on 1 January 1970, where timeUnit is one of the TimeUnit enum values, e.g. HOURS, MINUTES etc. You can also specify a timeSize parameter; this size is multiplied by the value present in the time column to get the actual timestamp. e.g. if the time size is 5 and the value in the time column is 4996308 minutes, the value converted to an epoch timestamp will be 4996308 * 5 * 60 * 1000 = 1498892400000 milliseconds. In simplest terms, EPOCH|SECONDS|5 denotes the count of intervals of length 5 seconds from epoch 0 to now. Examples -

    • EPOCH - Defaults to MILLISECONDS (only in master branch)

    • EPOCH|SECONDS

    • EPOCH|SECONDS|5

  • SIMPLE_DATE_FORMAT - This represents time in string format. The pattern should be specified using Joda's representation. In the master branch build, if no pattern is specified, the ISO 8601 DateTimeFormat is used to parse the date times. Optionals are supported with the ISO format, so users can specify a date time string as yyyy, yyyy-MM, yyyy-MM-dd and so on. You can also specify an optional timeZone parameter, which is the ID for a TimeZone: either an abbreviation such as PST, a full name such as America/Los_Angeles, or a custom ID such as GMT-8:00. Examples -

    • SIMPLE_DATE_FORMAT (only in master branch)

    • SIMPLE_DATE_FORMAT|yyyy-MM-dd HH:mm:ss

    • SIMPLE_DATE_FORMAT|yyyy-MM-dd|IST
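
Putting these together, complete dateTimeFieldSpec entries using the new formats might look like the following sketch (the column names are illustrative):

{
  "name": "secondsSinceEpochBucket",
  "dataType": "LONG",
  "format": "EPOCH|SECONDS|5",
  "granularity": "5:SECONDS"
},
{
  "name": "localDate",
  "dataType": "STRING",
  "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd|IST",
  "granularity": "1:DAYS"
}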

Only date time formats in lexicographical order are supported in Pinot, so yyyy-MM-dd, MM-dd and yyyy-dd are valid, while MM-dd-yyyy is not. The order is decided as year > month > day > hour > minutes > seconds.

Old date time formats

These date time formats are still supported in Pinot for backward compatibility. However, new users should prefer the formats described in the previous section.

You will need to provide the format of the date along with the data type in the schema. The format is described using the following syntax: timeSize:timeUnit:timeFormat:pattern.

  • time size - the size of the time unit. This size is multiplied by the value present in the time column to get the actual timestamp. e.g. if the time size is 5 and the value in the time column is 4996308 minutes, the value converted to an epoch timestamp will be 4996308 * 5 * 60 * 1000 = 1498892400000 milliseconds. If your date is not in EPOCH format, this value is not used and can be set to 1 or any other integer.

  • time unit - one of the TimeUnit enum values, e.g. HOURS, MINUTES etc. If your date is not in EPOCH format, this value is not used and can be set to MILLISECONDS or any other unit.

  • timeFormat - can be either EPOCH or SIMPLE_DATE_FORMAT. If it is SIMPLE_DATE_FORMAT, the pattern string is also specified.

  • pattern - this is optional and is only specified when the date is in SIMPLE_DATE_FORMAT. The pattern should be specified using Joda's representation. e.g. 2020-08-21 can be represented as yyyy-MM-dd.

Here are some sample date-time formats you can use in the schema:

  • 1:MILLISECONDS:EPOCH - used when the timestamp is in epoch milliseconds and stored in LONG format

  • 1:HOURS:EPOCH - used when the timestamp is in epoch hours and stored in LONG or INT format

  • 1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd - used when the date is in STRING format and has the pattern year-month-day, e.g. 2020-08-21

  • 1:HOURS:SIMPLE_DATE_FORMAT:EEE MMM dd HH:mm:ss ZZZ yyyy - used when the date is in STRING format, e.g. Mon Aug 24 12:36:50 America/Los_Angeles 2019
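
As a concrete example, a dateTimeFieldSpec using the old syntax might look like this sketch:

{
  "name": "millisSinceEpoch",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "15:MINUTES"
}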

Advanced fields

Apart from these, there are some advanced fields. These are common to all field specs.

Property
Description

maxLength

Max length of this column

virtualColumnProvider

Column value provider
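
As an illustration, maxLength can be set on a string dimension like this sketch (the column name and value shown are arbitrary):

{
  "name": "description",
  "dataType": "STRING",
  "maxLength": 1024
}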
