GapFill Function For Time-Series Dataset
Many of the datasets are time series in nature, tracking state change of an entity over time. The granularity of recorded data points might be sparse or the events could be missing due to network and other device issues in the IOT environment. But analytics applications which are tracking the state change of these entities over time, might be querying for values at lower granularity than the metric interval.
Here is the sample data set tracking the status of parking lots in parking space.
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:01:00.000 | 1 |
P2 | 2021-10-01 09:17:00.000 | 1 |
P1 | 2021-10-01 09:33:00.000 | 0 |
P1 | 2021-10-01 09:47:00.000 | 1 |
P3 | 2021-10-01 10:05:00.000 | 1 |
P2 | 2021-10-01 10:06:00.000 | 0 |
P2 | 2021-10-01 10:16:00.000 | 1 |
P2 | 2021-10-01 10:31:00.000 | 0 |
P3 | 2021-10-01 11:17:00.000 | 0 |
P1 | 2021-10-01 11:54:00.000 | 0 |
We want to find out the total number of parking lots that are occupied over a period of time which would be a common use case for a company that manages parking spaces.
Let us take 30 minutes' time bucket as an example:
timeBucket/lotId | P1 | P2 | P3 |
---|---|---|---|
2021-10-01 09:00:00.000 | 1 | 1 | |
2021-10-01 09:30:00.000 | 0,1 | ||
2021-10-01 10:00:00.000 | 0,1 | 1 | |
2021-10-01 10:30:00.000 | 0 | ||
2021-10-01 11:00:00.000 | 0 | ||
2021-10-01 11:30:00.000 | 0 |
If you look at the above table, you will see a lot of missing data for parking lots inside the time buckets. In order to calculate the number of occupied park lots per time bucket, we need gap fill the missing data.
The Ways of Gap Filling the Data
There are two ways of gap filling the data: FILL_PREVIOUS_VALUE and FILL_DEFAULT_VALUE.
FILL_PREVIOUS_VALUE means the missing data will be filled with the previous value for the specific entity, in this case, park lot, if the previous value exists. Otherwise, it will be filled with the default value.
FILL_DEFAULT_VALUE means that the missing data will be filled with the default value. For numeric column, the defaul value is 0. For Boolean column type, the default value is false. For TimeStamp, it is January 1, 1970, 00:00:00 GMT. For STRING, JSON and BYTES, it is empty String. For Array type of column, it is empty array.
We will leverage the following the query to calculate the total occupied parking lots per time bucket.
Aggregation/Gapfill/Aggregation
Query Syntax
Workflow
The most nested sql will convert the raw event table to the following table.
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
The second most nested sql will gap fill the returned data as following:
timeBucket/lotId | P1 | P2 | P3 |
---|---|---|---|
2021-10-01 09:00:00.000 | 1 | 1 | 0 |
2021-10-01 09:30:00.000 | 1 | 1 | 0 |
2021-10-01 10:00:00.000 | 1 | 1 | 1 |
2021-10-01 10:30:00.000 | 1 | 0 | 1 |
2021-10-01 11:00:00.000 | 1 | 0 | 0 |
2021-10-01 11:30:00.000 | 0 | 0 | 0 |
The outermost query will aggregate the gapfilled data as follows:
timeBucket | totalNumOfOccuppiedSlots |
---|---|
2021-10-01 09:00:00.000 | 2 |
2021-10-01 09:30:00.000 | 2 |
2021-10-01 10:00:00.000 | 3 |
2021-10-01 10:30:00.000 | 2 |
2021-10-01 11:00:00.000 | 1 |
2021-10-01 11:30:00.000 | 0 |
There is one assumption we made here that the raw data is sorted by the timestamp. The Gapfill and Post-Gapfill Aggregation will not sort the data.
The above example just shows the use case where the three steps happen:
The raw data will be aggregated;
The aggregated data will be gapfilled;
The gapfilled data will be aggregated.
There are three more scenarios we can support.
Select/Gapfill
If we want to gapfill the missing data per half an hour time bucket, here is the query:
Query Syntax
Workflow
At first the raw data will be transformed as follows:
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
Then it will be gapfilled as follows:
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P3 | 2021-10-01 09:00:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P2 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 10:00:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P1 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 10:30:00.000 | 1 |
P1 | 2021-10-01 11:00:00.000 | 1 |
P2 | 2021-10-01 11:00:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P2 | 2021-10-01 11:30:00.000 | 0 |
P3 | 2021-10-01 11:30:00.000 | 0 |
Aggregate/Gapfill
Query Syntax
Workflow
The nested sql will convert the raw event table to the following table.
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
The outer sql will gap fill the returned data as following:
timeBucket/lotId | P1 | P2 | P3 |
---|---|---|---|
2021-10-01 09:00:00.000 | 1 | 1 | 0 |
2021-10-01 09:30:00.000 | 1 | 1 | 0 |
2021-10-01 10:00:00.000 | 1 | 1 | 1 |
2021-10-01 10:30:00.000 | 1 | 0 | 1 |
2021-10-01 11:00:00.000 | 1 | 0 | 0 |
2021-10-01 11:30:00.000 | 0 | 0 | 0 |
Gapfill/Aggregate
Query Syntax
Workflow
The raw data will be transformed as following at first:
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
The transformed data will be gap filled as follows:
lotId | event_time | is_occupied |
---|---|---|
P1 | 2021-10-01 09:00:00.000 | 1 |
P2 | 2021-10-01 09:00:00.000 | 1 |
P3 | 2021-10-01 09:00:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 09:30:00.000 | 1 |
P2 | 2021-10-01 09:30:00.000 | 1 |
P3 | 2021-10-01 09:30:00.000 | 0 |
P1 | 2021-10-01 10:00:00.000 | 1 |
P3 | 2021-10-01 10:00:00.000 | 1 |
P2 | 2021-10-01 10:00:00.000 | 0 |
P2 | 2021-10-01 10:00:00.000 | 1 |
P1 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P3 | 2021-10-01 10:30:00.000 | 1 |
P2 | 2021-10-01 10:30:00.000 | 0 |
P1 | 2021-10-01 11:00:00.000 | 1 |
P2 | 2021-10-01 11:00:00.000 | 0 |
P3 | 2021-10-01 11:00:00.000 | 0 |
P1 | 2021-10-01 11:30:00.000 | 0 |
P2 | 2021-10-01 11:30:00.000 | 0 |
P3 | 2021-10-01 11:30:00.000 | 0 |
The aggregation will generate the following table:
timeBucket | totalNumOfOccuppiedSlots |
---|---|
2021-10-01 09:00:00.000 | 2 |
2021-10-01 09:30:00.000 | 2 |
2021-10-01 10:00:00.000 | 3 |
2021-10-01 10:30:00.000 | 2 |
2021-10-01 11:00:00.000 | 1 |
2021-10-01 11:30:00.000 | 0 |