Spark
Pinot supports Apache Spark as a processor to create and push segment files to the database. The Pinot distribution is bundled with the Spark code needed to process your files, convert them into segments, and upload them to Pinot.
You can follow the wiki to build the Pinot distribution from source. The resulting JAR file can be found at pinot/target/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar.
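For reference, a typical source build looks roughly like the sketch below; it assumes a JDK and Maven are already installed, and the repository URL and build profile shown are the standard ones rather than anything specific to this guide. Check the wiki for the authoritative steps.

```bash
# Clone the Apache Pinot sources and build the binary distribution.
# (Sketch only - follow the wiki if these steps differ for your version.)
git clone https://github.com/apache/pinot.git
cd pinot
mvn clean install -DskipTests -Pbin-dist
```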
Next, you need to change the execution config in the job spec to the following:
```yaml
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:

  # name: execution framework name
  name: 'spark'

  # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'

  # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'

  # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'

  # segmentMetadataPushJobRunnerClassName: class name implements org.apache.pinot.spi.ingestion.batch.runner.IngestionJobRunner interface.
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'

  # extraConfigs: extra configs for execution framework.
  extraConfigs:

    # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
    stagingDir: your/local/dir/staging
```
You can check out the sample job spec here.
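For orientation, a complete Spark ingestion job spec also covers the input and output locations, the record reader, and the target table and cluster. The trimmed skeleton below is illustrative only; every value is a placeholder rather than content taken from the bundled example, so refer to the sample job spec for the real settings.

```yaml
# Illustrative skeleton of the remaining job spec sections (placeholder values).
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'

# jobType: which ingestion job to run, e.g. segment creation followed by a tar push.
jobType: SegmentCreationAndTarPush

# Input and output locations (placeholders).
inputDirURI: 'examples/batch/airlineStats/rawdata'
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'examples/batch/airlineStats/segments'
overwriteOutput: true

# Record reader for the input format (placeholder: Avro).
recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'

# Target table and the Pinot controller to push segments to (placeholders).
tableSpec:
  tableName: 'airlineStats'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```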
Now, add the Pinot JAR to Spark's classpath using one of the following options:
```
spark.driver.extraJavaOptions =>
-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins

OR

spark.driver.extraClassPath =>
pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar
```
Please ensure that the environment variables PINOT_ROOT_DIR and PINOT_VERSION are set properly.
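For example, something along these lines; the install path below is purely a placeholder for wherever your Pinot checkout or distribution lives.

```bash
# Placeholder values - point these at your actual Pinot checkout and version.
export PINOT_VERSION=0.8.0
export PINOT_ROOT_DIR=/path/to/pinot
```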
Finally, execute the Spark job using the following command:
```bash
export PINOT_VERSION=0.8.0
export PINOT_DISTRIBUTION_DIR=${PINOT_ROOT_DIR}/pinot-distribution/target/apache-pinot-${PINOT_VERSION}-bin/apache-pinot-${PINOT_VERSION}-bin

cd ${PINOT_DISTRIBUTION_DIR}

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/sparkIngestionJobSpec.yaml
```
Note: For production, you should change the master to yarn and the deploy-mode to cluster, as sketched below.
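As a rough illustration of that change only: the command below reuses the same placeholder paths as above, and it assumes the Pinot JAR and job spec are reachable from the cluster nodes (for example, by installing the distribution at the same path on every node, or by staging the files on a distributed filesystem).

```bash
# Illustrative cluster-mode submission (not a verified production recipe).
# The driver runs on the cluster, so local paths must exist on the nodes
# or be replaced with locations on a shared/distributed filesystem.
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/sparkIngestionJobSpec.yaml
```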