githubEdit

Statistical Functions

Pinot provides statistical aggregation functions for computing variance, standard deviation, covariance, skewness, and kurtosis. These functions are useful for data analysis, quality monitoring, and understanding the distribution of your data.

Variance Functions

VAR_POP(col)

Returns the population variance of a numeric column as Double. Population variance measures how far values are spread out from the mean, using all data points.

SELECT VAR_POP(playerScore) AS score_variance
FROM gameStats

VAR_SAMP(col)

Returns the sample variance of a numeric column as Double. Sample variance uses Bessel's correction (divides by N-1 instead of N), making it an unbiased estimator when working with a sample of the population.

SELECT VAR_SAMP(responseTime) AS response_variance
FROM apiMetrics

Standard Deviation Functions

Returns the population standard deviation of a numeric column as Double. This is the square root of the population variance.

Returns the sample standard deviation of a numeric column as Double. This is the square root of the sample variance.

Covariance Functions

COVAR_POP / COVARPOP

Returns the population covariance between two numeric columns as Double. Covariance measures how two variables change together.

COVAR_SAMP / COVARSAMP

Returns the sample covariance between two numeric columns as Double.

Distribution Shape Functions

SKEWNESS

Returns the skewness of a numeric column as Double. Skewness measures the asymmetry of the probability distribution. A positive skew indicates a longer right tail, while a negative skew indicates a longer left tail.

KURTOSIS

Returns the kurtosis of a numeric column as Double. Kurtosis measures the "tailedness" of the probability distribution. Higher kurtosis indicates more outliers (heavier tails).

Notes

  • All statistical functions return Double.NaN or Double.NEGATIVE_INFINITY when no records are selected.

  • These functions operate on numeric columns only (INT, LONG, FLOAT, DOUBLE).

  • For variance and standard deviation, use the _POP variant when you have the entire population and the _SAMP variant when working with a sample.

  • All statistical functions are supported in both the Single-Stage Engine (SSE) and Multi-Stage Engine (MSE).

Last updated

Was this helpful?