Batch Ingestion (Docker)

Step-by-step guide for batch importing data into Pinot running in Docker

This guide walks you through importing batch data into a Pinot cluster running in Docker. Make sure you have completed Running Pinot in Docker first.

Prepare your data

Create a directory for your data files:

mkdir -p /tmp/pinot-quick-start/rawdata

Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, and ORC. If you don't have sample data, you can use the sample CSV below.

/tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

Create a schema

A schema defines the columns and data types of a Pinot table. For a detailed overview, see Schema.

| Column Type | Description |
|-------------|-------------|
| Dimensions | Typically used in filters and group-by clauses, for slicing and dicing the data |
| Metrics | Typically used in aggregations; represents quantitative data |
| Time | Optional column; represents the timestamp associated with each row |

In our example, the studentID, firstName, lastName, gender, subject columns are the dimensions, score is the metric, and timestampInEpoch is the time column.

/tmp/pinot-quick-start/transcript-schema.json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

Create a table configuration
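
A table configuration ties the schema to a physical table and controls settings such as indexing and replication. As a sketch, an OFFLINE table config for the transcript table might look like the following (the file name, inverted-index columns, and replication factor are illustrative choices, not requirements):

/tmp/pinot-quick-start/transcript-table-offline.json

```json
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript",
    "timeColumnName": "timestampInEpoch",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["gender", "subject"]
  },
  "metadata": {}
}
```

The tableName must match the schemaName from the previous step so Pinot can associate the two.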

Upload the schema and table configuration
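
With the schema and table config on disk, you can register both with the controller using the AddTable command shipped in the Pinot image. A sketch, assuming the Docker network is named pinot-demo, the controller container is reachable as manual-pinot-controller, and the table config was saved as /tmp/pinot-quick-start/transcript-table-offline.json (substitute the names from your own setup):

```shell
# Register the schema and table config with the Pinot controller.
# Network name, container names, and file paths are assumptions from this guide's setup.
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
    -controllerHost manual-pinot-controller \
    -controllerPort 9000 \
    -exec
```

Mounting /tmp/pinot-quick-start into the container lets the command read the files you created on the host.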

Create and push a segment

  1. Create a job specification file.

  2. Run the ingestion job.
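
The two steps above can be sketched as follows. First, a job spec using the standalone execution framework that reads the CSV from rawdata, generates a segment, and pushes it to the controller. The file name, output directory, and controller URI are assumptions from this guide's setup; adjust them to match yours:

/tmp/pinot-quick-start/docker-job-spec.yml

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://manual-pinot-controller:9000'
```

Then launch the job with the LaunchDataIngestionJob command from the same image:

```shell
# Run the batch ingestion job defined in the spec above.
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec.yml
```

When the job completes, the generated segment is pushed to the controller and the table becomes queryable.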

Query your data

Open the Query Console in your browser to run queries against the transcript table.
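
For example, you could verify that all four rows landed and compute the average score per subject with queries like these (illustrative only):

```sql
SELECT COUNT(*) FROM transcript;

SELECT subject, AVG(score) AS avgScore
FROM transcript
GROUP BY subject;
```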
