Batch import example
Step-by-step guide on pushing your own data into the Pinot cluster
So far, we've set up our cluster, run some queries, and explored the admin endpoints. Now it's time to get our own data into Pinot. The rest of these instructions assume you're running Pinot in Docker (inside a pinot-quickstart container).

Preparing your data

Let's gather our data files and put them in /tmp/pinot-quick-start/rawdata.
mkdir -p /tmp/pinot-quick-start/rawdata
Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT and ORC. If you don't have sample data, you can use this sample CSV.
/tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
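One way to create this file with the sample rows (a minimal sketch; the path matches the one used throughout this guide) is a shell heredoc:

cat > /tmp/pinot-quick-start/rawdata/transcript.csv << 'EOF'
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
EOF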

Creating a schema

A schema is used to define the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.
Briefly, we categorize our columns into 3 types:

Column Type | Description
Dimensions | Typically used in filters and group by, for slicing and dicing into data
Metrics | Typically used in aggregations; represents the quantitative data
Time | Optional column; represents the timestamp associated with each row
For example, in our sample transcript table, the studentID, firstName, lastName, gender and subject columns are the dimensions, the score column is the metric, and timestampInEpoch is the time column.
Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.
/tmp/pinot-quick-start/transcript-schema.json
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

Creating a table config

A table config defines the configuration for a Pinot table. A detailed overview of the table can be found in Table.
Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.
/tmp/pinot-quick-start/transcript-table-offline.json
{
  "tableName": "transcript",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "transcript"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}

Uploading your table config and schema

Check the directory structure so far
$ ls /tmp/pinot-quick-start
rawdata transcript-schema.json transcript-table-offline.json

$ ls /tmp/pinot-quick-start/rawdata
transcript.csv
Upload the schema and table config using the following command:
Docker
Launcher Script
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 -exec
bin/pinot-admin.sh AddTable \
    -tableConfigFile /tmp/pinot-quick-start/transcript-table-offline.json \
    -schemaFile /tmp/pinot-quick-start/transcript-schema.json -exec
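Alternatively, you can upload the two files straight through the controller's REST API instead of the admin scripts. A rough sketch, assuming the controller is reachable at localhost:9000:

curl -F schemaName=@/tmp/pinot-quick-start/transcript-schema.json localhost:9000/schemas

curl -X POST -H 'Content-Type: application/json' \
    -d @/tmp/pinot-quick-start/transcript-table-offline.json \
    localhost:9000/tables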
Check out the table config and schema in the Rest API to make sure they were successfully uploaded.
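For instance, a quick check from the command line (again assuming the controller runs on localhost:9000):

curl localhost:9000/schemas/transcript

curl localhost:9000/tables/transcript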

Creating a segment

A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.
To generate a segment, we first need to create a job spec YAML file. The job spec YAML file contains all the information regarding the data format, input data location, and Pinot cluster coordinates. You can just copy over this job spec file. If you're using your own data, be sure to 1) replace transcript with your table name, and 2) set the right recordReaderSpec.
Docker
Launcher Script
/tmp/pinot-quick-start/docker-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://pinot-quickstart:9000/tables/transcript/schema'
  tableConfigURI: 'http://pinot-quickstart:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-quickstart:9000'
/tmp/pinot-quick-start/batch-job-spec.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
Use the following command to generate a segment and upload it
Docker
Using launcher scripts
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec.yml
bin/pinot-admin.sh LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/batch-job-spec.yml
Sample output
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**\/*.csv
inputDirURI: /tmp/pinot-quick-start/rawdata/
jobType: SegmentCreationAndTarPush
outputDirURI: /tmp/pinot-quick-start/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader,
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig,
  configs: null, dataFormat: csv}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/transcript/schema', tableConfigURI: 'http://localhost:9000/tables/transcript',
  tableName: transcript}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Finished building StatsCollector!
Collected stats for 4 documents
Using fixed bytes value dictionary for column: studentID, size: 9
Created dictionary for STRING column: studentID with cardinality: 3, max length in bytes: 3, range: 200 to 202
Using fixed bytes value dictionary for column: firstName, size: 12
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Using fixed bytes value dictionary for column: lastName, size: 15
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Using fixed bytes value dictionary for column: gender, size: 12
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Using fixed bytes value dictionary for column: subject, size: 21
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Created dictionary for LONG column: timestampInEpoch with cardinality: 4, range: 1570863600000 to 1572418800000
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to v3 format
v3 segment location for segment: transcript_OFFLINE_1570863600000_1572418800000_0 is /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3
Deleting files in v1 segment directory: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0
Starting building 1 star-trees with configs: [StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[[email protected]3a48efdc],maxLeafRecords=1]] using OFF_HEAP builder
Starting building star-tree with config: StarTreeV2BuilderConfig[splitOrder=[studentID, firstName],skipStarNodeCreation=[],functionColumnPairs=[[email protected]3a48efdc],maxLeafRecords=1]
Generated 3 star-tree records from 4 segment records
Finished constructing star-tree, got 9 tree nodes and 4 records under star-node
Finished creating aggregated documents, got 6 aggregated records
Finished building star-tree in 10ms
Finished building 1 star-trees in 27ms
Computed crc = 3454627653, based on files [/var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/columns.psf, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/index_map, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/metadata.properties, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index, /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0/v3/star_tree_index_map]
Driver, record read time : 0
Driver, stats collector time : 0
Driver, indexing time : 0
Tarring segment from: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0 to: /var/folders/3z/qn6k60qs6ps1bb6s2c26gx040000gn/T/pinot-1583443148720/output/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz
Size for segment: transcript_OFFLINE_1570863600000_1572418800000_0, uncompressed: 6.73KB, compressed: 1.89KB
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: [/tmp/pinot-quick-start/segments/transcript_OFFLINE_1570863600000_1572418800000_0.tar.gz]... to locations: [[email protected]f91] for table transcript
Pushing segment: transcript_OFFLINE_1570863600000_1572418800000_0 to location: http://localhost:9000 for table transcript
Sending request: http://localhost:9000/v2/segments?tableName=transcript to controller: nehas-mbp.hsd1.ca.comcast.net, version: Unknown
Response for pushing table transcript segment transcript_OFFLINE_1570863600000_1572418800000_0 to location http://localhost:9000 - 200: {"status":"Successfully uploaded segment: transcript_OFFLINE_1570863600000_1572418800000_0 of table: transcript"}
Check that your segment made it to the table using the Rest API
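For example, listing the table's segments should now show the newly pushed segment (a quick check, assuming the controller runs on localhost:9000):

curl localhost:9000/segments/transcript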

Querying your data

You're all set! You should see your table in the Query Console and be able to run queries against it now.
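For instance, these queries in the Query Console should return the four rows we just ingested and a per-subject average score:

SELECT * FROM transcript LIMIT 10

SELECT subject, AVG(score) FROM transcript GROUP BY subject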