GapFill Function For Time-Series Dataset
Many of the datasets are time series in nature, tracking state change of an entity over time. The granularity of recorded data points might be sparse or the events could be missing due to network and other device issues in the IOT environment. But analytics applications which are tracking the state change of these entities over time, might be querying for values at lower granularity than the metric interval.
Here is the sample data set tracking the status of parking lots in parking space.
lotId
event_time
is_occupied
P1
2021-10-01 09:01:00.000
1
P2
2021-10-01 09:17:00.000
1
P1
2021-10-01 09:33:00.000
0
P1
2021-10-01 09:47:00.000
1
P3
2021-10-01 10:05:00.000
1
P2
2021-10-01 10:06:00.000
0
P2
2021-10-01 10:16:00.000
1
P2
2021-10-01 10:31:00.000
0
P3
2021-10-01 11:17:00.000
0
P1
2021-10-01 11:54:00.000
0
We want to find out the total number of parking lots that are occupied over a period of time which would be a common use case for a company that manages parking spaces.
Let us take 30 minutes' time bucket as an example:
timeBucket/lotId
P1
P2
P3
2021-10-01 09:00:00.000
1
1
2021-10-01 09:30:00.000
0,1
2021-10-01 10:00:00.000
0,1
1
2021-10-01 10:30:00.000
0
2021-10-01 11:00:00.000
0
2021-10-01 11:30:00.000
0
If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied park lots per time bucket, we need gap fill the missing data.

The Ways of Gap Filling the Data

There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.
FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity, in this case, park lot, if the previous value exists. Otherwise, it will be filled with the default value.
FILL_DEFAULT_VALUE means that the missing data will be filled with the default value. For numeric column, the defaul value is 0. For Boolean column type, the default value is false. For TimeStamp, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON and BYTES, it is empty String. For Array type of column, it is empty array.
We will leverage the following the query to calculate the total occupied parking lots per time bucket.

Aggregation/Gapfill/Aggregation

Query Syntax

1
SELECT time_col, SUM(status) AS occupied_slots_count
2
FROM (
3
SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
4
'2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
5
TIMESERIESON(lotId)), lotId, status
6
FROM (
7
SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
8
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
9
lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
10
FROM parking_data
11
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
12
GROUP BY 1, 2
13
ORDER BY 1
14
LIMIT 100)
15
LIMIT 100)
16
GROUP BY 1
17
LIMIT 100
Copied!

Workflow

The most nested sql will convert the raw event table to the following table.
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
The second most nested sql will gap fill the returned data as following:
timeBucket/lotId
P1
P2
P3
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
1
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0
The outermost query will aggregate the gapfilled data as follows:
timeBucket
totalNumOfOccuppiedSlots
2021-10-01 09:00:00.000
2
2021-10-01 09:30:00.000
2
2021-10-01 10:00:00.000
3
2021-10-01 10:30:00.000
2
2021-10-01 11:00:00.000
1
2021-10-01 11:30:00.000
0
There is one assumption we made here that the raw data is sorted by the timestamp. The Gapfill and Post-Gapfill Aggregation will not sort the data.
The above example just shows the use case where the three steps happen:
  1. 1.
    The raw data will be aggregated;
  2. 2.
    The aggregated data will be gapfilled;
  3. 3.
    The gapfilled data will be aggregated.
There are three more scenarios we can support.

Select/Gapfill

If we want to gapfill the missing data per half an hour time bucket, here is the query:

Query Syntax

1
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
2
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
3
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
4
'2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
5
TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
6
FROM parking_data
7
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
8
ORDER BY 1
9
LIMIT 100
Copied!

Workflow

At first the raw data will be transformed as follows:
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
Then it will be gapfilled as follows:
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P3
2021-10-01 09:00:00.000
0
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P2
2021-10-01 09:30:00.000
1
P3
2021-10-01 09:30:00.000
0
P1
2021-10-01 10:00:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P1
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 10:30:00.000
1
P1
2021-10-01 11:00:00.000
1
P2
2021-10-01 11:00:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P2
2021-10-01 11:30:00.000
0
P3
2021-10-01 11:30:00.000
0

Aggregate/Gapfill

Query Syntax

1
SELECT GAPFILL(time_col,'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
2
'2021-10-01 12:00:00.000','30:MINUTES', FILL(status, 'FILL_PREVIOUS_VALUE'),
3
TIMESERIESON(lotId)), lotId, status
4
FROM (
5
SELECT DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
6
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES') AS time_col,
7
lotId, lastWithTime(is_occupied, event_time, 'INT') AS status
8
FROM parking_data
9
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
10
GROUP BY 1, 2
11
ORDER BY 1
12
LIMIT 100)
13
LIMIT 100
Copied!

Workflow

The nested sql will convert the raw event table to the following table.
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
The outer sql will gap fill the returned data as following:
timeBucket/lotId
P1
P2
P3
2021-10-01 09:00:00.000
1
1
0
2021-10-01 09:30:00.000
1
1
0
2021-10-01 10:00:00.000
1
1
1
2021-10-01 10:30:00.000
1
0
1
2021-10-01 11:00:00.000
1
0
0
2021-10-01 11:30:00.000
0
0
0

Gapfill/Aggregate

Query Syntax

1
SELECT time_col, SUM(is_occupied) AS occupied_slots_count
2
FROM (
3
SELECT GAPFILL(DATETIMECONVERT(event_time,'1:MILLISECONDS:EPOCH',
4
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','30:MINUTES'),
5
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS','2021-10-01 09:00:00.000',
6
'2021-10-01 12:00:00.000','30:MINUTES', FILL(is_occupied, 'FILL_PREVIOUS_VALUE'),
7
TIMESERIESON(lotId)) AS time_col, lotId, is_occupied
8
FROM parking_data
9
WHERE event_time >= 1633078800000 AND event_time <= 1633089600000
10
ORDER BY 1
11
LIMIT 100)
12
GROUP BY 1
13
LIMIT 100
Copied!

Workflow

The raw data will be transformed as following at first:
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
The transformed data will be gap filled as follows:
lotId
event_time
is_occupied
P1
2021-10-01 09:00:00.000
1
P2
2021-10-01 09:00:00.000
1
P3
2021-10-01 09:00:00.000
0
P1
2021-10-01 09:30:00.000
0
P1
2021-10-01 09:30:00.000
1
P2
2021-10-01 09:30:00.000
1
P3
2021-10-01 09:30:00.000
0
P1
2021-10-01 10:00:00.000
1
P3
2021-10-01 10:00:00.000
1
P2
2021-10-01 10:00:00.000
0
P2
2021-10-01 10:00:00.000
1
P1
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P3
2021-10-01 10:30:00.000
1
P2
2021-10-01 10:30:00.000
0
P1
2021-10-01 11:00:00.000
1
P2
2021-10-01 11:00:00.000
0
P3
2021-10-01 11:00:00.000
0
P1
2021-10-01 11:30:00.000
0
P2
2021-10-01 11:30:00.000
0
The aggregation will generate the following table:
timeBucket
totalNumOfOccuppiedSlots
2021-10-01 09:00:00.000
2
2021-10-01 09:30:00.000
2
2021-10-01 10:00:00.000
3
2021-10-01 10:30:00.000
2
2021-10-01 11:00:00.000
1
2021-10-01 11:30:00.000
0