GapFill Function For Time-Series Dataset

The GapFill function is only supported with the single-stage query engine (v1).

Many datasets are time series in nature, tracking the state changes of entities over time. The recorded data points may be sparse, or events may be missing entirely due to network and device issues common in IoT environments. Analytics applications tracking the state of these entities over time, however, may query for values at a finer granularity than the interval at which events were recorded.

Here is a sample dataset tracking the occupancy status of parking lots in a parking space.

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:01:00.000 | 1           |
| P2    | 2021-10-01 09:17:00.000 | 1           |
| P1    | 2021-10-01 09:33:00.000 | 0           |
| P1    | 2021-10-01 09:47:00.000 | 1           |
| P3    | 2021-10-01 10:05:00.000 | 1           |
| P2    | 2021-10-01 10:06:00.000 | 0           |
| P2    | 2021-10-01 10:16:00.000 | 1           |
| P2    | 2021-10-01 10:31:00.000 | 0           |
| P3    | 2021-10-01 11:17:00.000 | 0           |
| P1    | 2021-10-01 11:54:00.000 | 0           |

We want to find the total number of parking lots that are occupied over a period of time, a common use case for a company that manages parking spaces.

Let us take a 30-minute time bucket as an example:

| timeBucket/lotId        | P1  | P2  | P3 |
|-------------------------|-----|-----|----|
| 2021-10-01 09:00:00.000 | 1   | 1   |    |
| 2021-10-01 09:30:00.000 | 0,1 |     |    |
| 2021-10-01 10:00:00.000 |     | 0,1 | 1  |
| 2021-10-01 10:30:00.000 |     | 0   |    |
| 2021-10-01 11:00:00.000 |     |     | 0  |
| 2021-10-01 11:30:00.000 | 0   |     |    |

Looking at the table above, you can see a lot of missing data for parking lots within the time buckets. To calculate the number of occupied parking lots per time bucket, we need to gap-fill the missing data.

The Ways of Gap Filling the Data

There are two ways of gap-filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.

FILL_PREVIOUS_VALUE means that missing data is filled with the previous value for the specific entity (in this case, the parking lot), if a previous value exists. Otherwise, it is filled with the default value.

FILL_DEFAULT_VALUE means that missing data is filled with the default value for the column type. For numeric columns, the default value is 0. For BOOLEAN columns, the default value is false. For TIMESTAMP columns, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON, and BYTES columns, it is an empty string. For array columns, it is an empty array.
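
As an illustration, switching modes only changes the FILL argument of the GAPFILL call. The following sketch adapts the Select/Gapfill query shown later on this page to fill gaps with the type default instead of the previous value:

```sql
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES',
               -- every missing data point becomes 0, the default for this numeric column
               FILL(is_occupied, 'FILL_DEFAULT_VALUE'),
               TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
ORDER BY 1
LIMIT 100
```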

We will use the following query to calculate the total number of occupied parking lots per time bucket.

Aggregation/Gapfill/Aggregation

Query Syntax

```sql
SELECT time_col, SUM(status) AS occupied_slots_count
FROM (
    SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
                   '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
                   TIMESERIESON(lotId)), lotId, status
    FROM (
        SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
               lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
        FROM parking_data
        WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
        GROUP BY 1, 2
        ORDER BY 1
        LIMIT 100)
    LIMIT 100)
GROUP BY 1
LIMIT 100
```
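
To make the arguments easier to follow, here is the GAPFILL call from the query above with each parameter annotated. The roles are inferred from this example; note that the window start appears to be inclusive and the end exclusive, since buckets from 09:00:00.000 through 11:30:00.000 are produced but 12:00:00.000 is not. In the WHERE clause, 1633078800000 and 1633089600000 are the epoch-millisecond equivalents of 2021-10-01 09:00:00.000 and 12:00:00.000 UTC.

```sql
GAPFILL(
    time_col,                                                     -- bucketed time expression to gap-fill
    '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS',  -- format of the time values
    '2021-10-01 09:00:00.000',                                    -- gapfill window start (inclusive)
    '2021-10-01 12:00:00.000',                                    -- gapfill window end (exclusive)
    '30:MINUTES',                                                 -- time bucket size
    FILL(status, 'FILL_PREVIOUS_VALUE'),                          -- fill strategy for the status column
    TIMESERIESON(lotId))                                          -- key column(s) identifying each time series
```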

Workflow

The most deeply nested SQL converts the raw event table into the following table.

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |

The second-most nested SQL gap-fills the returned data as follows:

| timeBucket/lotId        | P1 | P2 | P3 |
|-------------------------|----|----|----|
| 2021-10-01 09:00:00.000 | 1  | 1  | 0  |
| 2021-10-01 09:30:00.000 | 1  | 1  | 0  |
| 2021-10-01 10:00:00.000 | 1  | 1  | 1  |
| 2021-10-01 10:30:00.000 | 1  | 0  | 1  |
| 2021-10-01 11:00:00.000 | 1  | 0  | 0  |
| 2021-10-01 11:30:00.000 | 0  | 0  | 0  |

The outermost query aggregates the gap-filled data as follows (for example, in the 10:00 bucket all three lots are occupied, P1 = 1, P2 = 1, P3 = 1, giving a total of 3):

| timeBucket              | totalNumOfOccupiedSlots |
|-------------------------|-------------------------|
| 2021-10-01 09:00:00.000 | 2                       |
| 2021-10-01 09:30:00.000 | 2                       |
| 2021-10-01 10:00:00.000 | 3                       |
| 2021-10-01 10:30:00.000 | 2                       |
| 2021-10-01 11:00:00.000 | 1                       |
| 2021-10-01 11:30:00.000 | 0                       |

There is one assumption made here: the raw data is sorted by timestamp. The gapfill and post-gapfill aggregation steps will not sort the data, which is why the inner queries include ORDER BY 1.

The above example shows the use case where all three steps happen:

  1. The raw data will be aggregated;

  2. The aggregated data will be gapfilled;

  3. The gapfilled data will be aggregated.

Three more scenarios are also supported.

Select/Gapfill

If we want to gap-fill the missing data per half-hour time bucket, here is the query:

Query Syntax

```sql
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
               '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
FROM parking_data
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
ORDER BY 1
LIMIT 100
```
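
Note that this form keeps every raw event, so a bucket can still contain more than one row for the same lot: in the gap-filled output below, P1 has both 0 and 1 at 09:30, and P2 has both 0 and 1 at 10:00. If you need a single value per lot per bucket, aggregate before gap-filling, as in the Aggregate/Gapfill example further down.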

Workflow

First, the raw data is transformed as follows:

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P1    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 0           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |

Then it is gap-filled as follows:

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P3    | 2021-10-01 09:00:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P2    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 10:00:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 0           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P1    | 2021-10-01 10:30:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 10:30:00.000 | 1           |
| P1    | 2021-10-01 11:00:00.000 | 1           |
| P2    | 2021-10-01 11:00:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |
| P2    | 2021-10-01 11:30:00.000 | 0           |
| P3    | 2021-10-01 11:30:00.000 | 0           |

Aggregate/Gapfill

Query Syntax

```sql
SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
               '2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
               TIMESERIESON(lotId)), lotId, status
FROM (
    SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
           lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
    FROM parking_data
    WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
    GROUP BY 1, 2
    ORDER BY 1
    LIMIT 100)
LIMIT 100
```

Workflow

The nested SQL converts the raw event table into the following table.

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |

The outer SQL gap-fills the returned data as follows:

| timeBucket/lotId        | P1 | P2 | P3 |
|-------------------------|----|----|----|
| 2021-10-01 09:00:00.000 | 1  | 1  | 0  |
| 2021-10-01 09:30:00.000 | 1  | 1  | 0  |
| 2021-10-01 10:00:00.000 | 1  | 1  | 1  |
| 2021-10-01 10:30:00.000 | 1  | 0  | 1  |
| 2021-10-01 11:00:00.000 | 1  | 0  | 0  |
| 2021-10-01 11:30:00.000 | 0  | 0  | 0  |

Gapfill/Aggregate

Query Syntax

```sql
SELECT time_col, SUM(is_occupied) AS occupied_slots_count
FROM (
    SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
           '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
           '2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
           TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
    FROM parking_data
    WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
    ORDER BY 1
    LIMIT 100)
GROUP BY 1
LIMIT 100
```

Workflow

First, the raw data is transformed as follows:

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P1    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 0           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |

The transformed data is gap-filled as follows:

| lotId | event_time              | is_occupied |
|-------|-------------------------|-------------|
| P1    | 2021-10-01 09:00:00.000 | 1           |
| P2    | 2021-10-01 09:00:00.000 | 1           |
| P3    | 2021-10-01 09:00:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 09:30:00.000 | 1           |
| P2    | 2021-10-01 09:30:00.000 | 1           |
| P3    | 2021-10-01 09:30:00.000 | 0           |
| P1    | 2021-10-01 10:00:00.000 | 1           |
| P3    | 2021-10-01 10:00:00.000 | 1           |
| P2    | 2021-10-01 10:00:00.000 | 0           |
| P2    | 2021-10-01 10:00:00.000 | 1           |
| P1    | 2021-10-01 10:30:00.000 | 1           |
| P2    | 2021-10-01 10:30:00.000 | 0           |
| P3    | 2021-10-01 10:30:00.000 | 1           |
| P1    | 2021-10-01 11:00:00.000 | 1           |
| P2    | 2021-10-01 11:00:00.000 | 0           |
| P3    | 2021-10-01 11:00:00.000 | 0           |
| P1    | 2021-10-01 11:30:00.000 | 0           |
| P2    | 2021-10-01 11:30:00.000 | 0           |
| P3    | 2021-10-01 11:30:00.000 | 0           |

The aggregation generates the following table:

| timeBucket              | totalNumOfOccupiedSlots |
|-------------------------|-------------------------|
| 2021-10-01 09:00:00.000 | 2                       |
| 2021-10-01 09:30:00.000 | 2                       |
| 2021-10-01 10:00:00.000 | 3                       |
| 2021-10-01 10:30:00.000 | 2                       |
| 2021-10-01 11:00:00.000 | 1                       |
| 2021-10-01 11:30:00.000 | 0                       |
