One of the primary advantages of using Pinot is its pluggable architecture. Plugins make it easy to add support for third-party systems such as execution frameworks, filesystems, or input formats.
In this tutorial, we will use three such plugins to ingest data and push it to our Pinot cluster. The plugins we will be using are:
pinot-batch-ingestion-spark
pinot-s3
pinot-parquet
You can check out the Pinot documentation on batch ingestion, file systems, and input formats for all the available plugins.
Setup
We are using the following tools and frameworks for this tutorial:
Apache Spark 2.4.0 (although any Spark 2.X/3.X version should work)
Apache Parquet 1.8.2
Check out the Spark ingestion guide for the latest configuration and FAQs.
Input Data
We need to get input data to ingest first. For our demo, we'll just create some small Parquet files and upload them to our S3 bucket. The easiest way is to create CSV files and then convert them to Parquet. CSV is human-readable, which makes the input easier to modify if something fails in our demo. We will call this file students.csv.
Now, we'll create Parquet files from the above CSV file using Spark. Since this is a small program, we will use the Spark shell instead of writing a full-fledged Spark application.
scala> val df = spark.read.format("csv").option("header", true).load("path/to/students.csv")
scala> df.write.option("compression","none").mode("overwrite").parquet("/path/to/batch_input/")
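As an optional sanity check, the output can be read back in the same spark-shell session before uploading it. This is only a sketch that reuses the paths from the commands above. Because the CSV was loaded without schema inference, Spark reads every column as a string, so the Parquet schema printed here will contain only string columns.

```scala
// Entered at the scala> prompt; `spark` is the session that spark-shell provides.
// Read the Parquet files back from the directory written above.
val written = spark.read.parquet("/path/to/batch_input/")

// Print the column names and types stored in the Parquet files.
written.printSchema()

// Show the first few rows without truncating long values.
written.show(5, false)
```

If numeric columns are needed in the Parquet files, adding .option("inferSchema", true) to the CSV read is one way to get them.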
The .parquet files can now be found in the /path/to/batch_input directory. You can now upload this directory to S3, either through the S3 console or with the AWS CLI (for example, aws s3 cp with the --recursive flag).
Now that our data is available in S3 and the table exists in Pinot, we can start ingesting the data. Data ingestion in Pinot involves the following steps:
Read data and generate compressed segment files from input
Upload the compressed segment files to output location
Push the location of the segment files to the controller
Once the location is available to the controller, it can notify the servers to download the segment files and populate the tables.
The above steps can be performed using any distributed executor of your choice, such as Hadoop, Spark, or Flink. For this demo, we will be using Apache Spark to execute the steps.
Pinot provides runners for Spark out of the box. So as a user, you don't need to write a single line of code. You can write runners for any other executor using our provided interfaces.
First, we will create a job spec configuration file for our data ingestion process.
In the job spec, we have set the execution framework to spark and configured the appropriate runners for each of our steps. We also need a temporary stagingDir for our Spark job. This directory is cleaned up after the job has executed.
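As a rough sketch, a job spec along the following lines covers the pieces described here. The bucket name, paths, region, table name, and controller address are placeholders, and the class names are the ones we would expect from the Spark 2.x, S3, and Parquet plugins; check them against the plugin versions shipped with your Pinot distribution before relying on them.

```yaml
# Hypothetical job spec; every bucket, path, region, and host below is a placeholder.
executionFrameworkSpec:
  name: 'spark'
  # Runners for the segment generation and push steps (assumed Spark 2.x plugin class names).
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    # Temporary directory used by the Spark job.
    stagingDir: 's3://my-bucket/spark/staging/'

# Generate segments, then push their location (URI) to the controller.
jobType: SegmentCreationAndUriPush

inputDirURI: 's3://my-bucket/batch_input/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/segments/'
overwriteOutput: true

# Filesystem implementation for the s3:// scheme.
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'

# Reader for the Parquet input files.
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'

tableSpec:
  tableName: 'students'

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```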
We can now run our Spark job to execute all the steps and populate data in Pinot.
We need to create a table to query the data that will be ingested. All tables in Pinot are associated with a schema. You can check out the table configuration and schema configuration documentation for more details on creating configurations.
You can check out the Pinot command-line interface documentation for all the available commands.
Our table will now be available in the
We also provide the S3 filesystem and Parquet reader implementations to use in the config. You can refer to the ingestion job spec documentation for the complete list of configuration options.
You can go through the FAQ section of our Spark ingestion guide in case you face any errors.