Batch import example

Step-by-step guide on pushing your own data into the Pinot cluster

So far, we have set up our cluster, run some queries, and explored the admin endpoints. Now it's time to get our own data into Pinot.

Preparing your data

Let's gather our data files and put them in /tmp/pinot-quick-start/rawdata.

mkdir -p /tmp/pinot-quick-start/rawdata

Supported file formats are CSV, JSON, AVRO, PARQUET, THRIFT, and ORC. If you don't have sample data, you can use this sample CSV.

/tmp/pinot-quick-start/rawdata/transcript.csv
studentID,firstName,lastName,gender,subject,score,timestamp
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

Creating a schema

A schema defines the columns and data types of the Pinot table. A detailed overview of the schema can be found in Schema.

Briefly, we categorize our columns into 3 types:

Column Type    Description
Dimensions     Typically used in filters and group by, for slicing and dicing into data
Metrics        Typically used in aggregations, represents the quantitative data
Time           Optional column, represents the timestamp associated with each row

For example, in our sample table, the playerID, yearID, teamID, league, and playerName columns are the dimensions; the playerStint, numberOfgames, numberOfGamesAsBatter, AtBatting, runs, hits, doubles, triples, homeRuns, runsBattedIn, stolenBases, caughtStealing, baseOnBalls, strikeouts, intentionalWalks, hitsByPitch, sacrificeHits, sacrificeFlies, groundedIntoDoublePlays, and G_old columns are the metrics; and there is no time column.

Once you have identified the dimensions, metrics and time columns, create a schema for your data, using the reference below.
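For the sample transcript.csv above, a schema along the following lines should work. This is a sketch rather than the definitive schema: the file name transcript-schema.json and the choice of data types (INT, STRING, FLOAT, LONG) are assumptions based on the sample rows, so adjust them for your own data.

```shell
# Assumption: the schema is saved under /tmp/pinot-quick-start with this file name.
mkdir -p /tmp/pinot-quick-start

# Quoted 'EOF' keeps the shell from expanding anything inside the JSON.
cat > /tmp/pinot-quick-start/transcript-schema.json <<'EOF'
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "score", "dataType": "FLOAT"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
EOF
```

Here studentID, firstName, lastName, gender, and subject are treated as dimensions, score as the metric, and timestamp (epoch milliseconds) as the time column.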

Creating a table config

A table config defines the configuration of the Pinot table. A detailed overview of the table can be found in Table.

Here's the table config for the sample CSV file. You can use this as a reference to build your own table config. Simply edit the tableName and schemaName.
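A minimal offline table config for the sample data might look like the sketch below. The file name transcript-table-offline.json, the DefaultTenant tenant names, and the replication factor of 1 are assumptions that suit a single-node quick-start cluster; check them against your own setup.

```shell
# Assumption: the table config is saved under /tmp/pinot-quick-start with this file name.
mkdir -p /tmp/pinot-quick-start

cat > /tmp/pinot-quick-start/transcript-table-offline.json <<'EOF'
{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "transcript"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}
EOF
```

For your own table, change tableName and schemaName, and point timeColumnName at your time column if you have one.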

Uploading your table config and schema

Check the directory structure so far

Upload the table config and schema using the following command
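The upload can be done with the AddTable command of pinot-admin.sh, roughly as sketched here. The PINOT_HOME variable and both file paths are assumptions; substitute the directory of your Pinot distribution and the locations where you saved your table config and schema.

```shell
# Assumptions: PINOT_HOME points at an unpacked Pinot distribution, and the two
# files below are where the table config and schema were saved.
PINOT_HOME="${PINOT_HOME:-.}"
TABLE_CONFIG=/tmp/pinot-quick-start/transcript-table-offline.json
SCHEMA=/tmp/pinot-quick-start/transcript-schema.json

if [ -x "$PINOT_HOME/bin/pinot-admin.sh" ]; then
  # -exec sends the request to the controller instead of only printing it.
  "$PINOT_HOME/bin/pinot-admin.sh" AddTable \
    -tableConfigFile "$TABLE_CONFIG" \
    -schemaFile "$SCHEMA" \
    -exec
else
  echo "pinot-admin.sh not found under $PINOT_HOME/bin; set PINOT_HOME first" >&2
fi
```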

Check the table config and schema in the REST API to make sure they were uploaded successfully.
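From a terminal, the same check can be sketched with curl against the controller. The controller address http://localhost:9000 and the table and schema name transcript are assumptions matching the sample setup.

```shell
# Assumption: the Pinot controller listens on localhost:9000.
CONTROLLER="${CONTROLLER:-http://localhost:9000}"

# Fetch the stored table config.
curl -s --max-time 5 "$CONTROLLER/tables/transcript" \
  || echo "could not reach the controller at $CONTROLLER" >&2
echo

# Fetch the stored schema.
curl -s --max-time 5 "$CONTROLLER/schemas/transcript" \
  || echo "could not reach the controller at $CONTROLLER" >&2
echo
```

Each request should return the JSON you uploaded if the table and schema were created.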

Creating a segment

A Pinot table's data is stored as Pinot segments. A detailed overview of the segment can be found in Segment.

To generate a segment, we first need to create a job spec YAML file. The job spec contains all the information about the data format, the input data location, and the Pinot cluster coordinates. You can copy this job spec file as-is. If you're using your own data, be sure to 1) replace transcript with your table name and 2) set the right recordReaderSpec.
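A job spec for the sample CSV could look like the sketch below. The file name batch-job-spec.yml, the output directory, and the controller address are assumptions, and the class names are those of Pinot's standalone batch ingestion and CSV input format plugins as I understand them; verify them against the Pinot version you run.

```shell
# Assumption: the job spec is saved under /tmp/pinot-quick-start with this file name.
mkdir -p /tmp/pinot-quick-start

cat > /tmp/pinot-quick-start/batch-job-spec.yml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
# Build the segment and push it to the cluster in one job.
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
# Change this section when the input is not CSV.
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
# Replace transcript with your own table name.
tableSpec:
  tableName: 'transcript'
  schemaURI: 'http://localhost:9000/tables/transcript/schema'
  tableConfigURI: 'http://localhost:9000/tables/transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF
```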

Use the following command to generate a segment and upload it
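With a job spec in place, the ingestion job can be launched through pinot-admin.sh, as in this sketch. PINOT_HOME and the job spec path are assumptions; point them at your Pinot distribution and your own job spec file.

```shell
# Assumptions: PINOT_HOME points at an unpacked Pinot distribution, and the job
# spec was saved at the path below.
PINOT_HOME="${PINOT_HOME:-.}"
JOB_SPEC=/tmp/pinot-quick-start/batch-job-spec.yml

if [ -x "$PINOT_HOME/bin/pinot-admin.sh" ]; then
  # Generates the segment from the input files and pushes it to the controller.
  "$PINOT_HOME/bin/pinot-admin.sh" LaunchDataIngestionJob \
    -jobSpecFile "$JOB_SPEC"
else
  echo "pinot-admin.sh not found under $PINOT_HOME/bin; set PINOT_HOME first" >&2
fi
```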

Sample output

Check that your segment made it to the table using the REST API.
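One way to run this check from a terminal is to list the table's segments through the controller. The controller address http://localhost:9000 and the table name transcript are assumptions matching the sample setup.

```shell
# Assumption: the Pinot controller listens on localhost:9000.
CONTROLLER="${CONTROLLER:-http://localhost:9000}"

# List the segments registered for the transcript table.
curl -s --max-time 5 "$CONTROLLER/segments/transcript" \
  || echo "could not reach the controller at $CONTROLLER" >&2
echo
```

The response should include the name of the segment that the ingestion job pushed.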

Querying your data

You're all set! You should see your table in the Query Console and be able to run queries against it now.

select * from transcript
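If you prefer the command line to the Query Console, the same query can be sketched as a POST to the controller's SQL endpoint. The controller address http://localhost:9000 is an assumption, and the limit 10 clause is added here only to keep the output short.

```shell
# Assumption: the Pinot controller listens on localhost:9000.
CONTROLLER="${CONTROLLER:-http://localhost:9000}"
QUERY='{"sql": "select * from transcript limit 10"}'

# Send the query as a JSON body to the controller's /sql endpoint.
curl -s --max-time 5 -X POST \
  -H "Content-Type: application/json" \
  -d "$QUERY" \
  "$CONTROLLER/sql" \
  || echo "could not reach the controller at $CONTROLLER" >&2
echo
```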
