Apache Pinot Docs
HDFS
This guide shows you how to import data from HDFS.
Enable HDFS support with the pinot-hdfs plugin. On the controller or server, add the config:

-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs

By default Pinot loads all plugins, so you can simply drop this plugin into the plugins directory. If you do specify -Dplugins.include, however, you must list every plugin you want to use, e.g. pinot-json, pinot-avro, pinot-kafka-2.0.
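As a sketch, assuming a standard Pinot distribution layout (the paths and config file name below are placeholders), the flags can be passed to the controller via JVM options, which the start scripts pick up from JAVA_OPTS:

```
# Hypothetical startup sketch: plugin directory and config file path are placeholders.
export JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-hdfs"
bin/pinot-admin.sh StartController -configFileName conf/controller.conf
```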
The HDFS implementation provides the following options:

    hadoop.conf.path : absolute path of the directory containing Hadoop XML configuration files, such as hdfs-site.xml and core-site.xml.
    hadoop.write.checksum : create a checksum while pushing an object. Default is false.
    hadoop.kerberos.principle
    hadoop.kerberos.keytab

Each of these properties should be prefixed by pinot.[node].storage.factory.hdfs. where node is either controller or server depending on the config.
The Kerberos configs should be used only if your Hadoop installation is secured with Kerberos. See the Hadoop Kerberos guide for how to generate Kerberos security identification.
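For instance, on the controller the options above expand to fully qualified property names like the following (values are placeholders; the same keys apply to the server with the pinot.server. prefix):

```
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/etc/hadoop/conf
pinot.controller.storage.factory.hdfs.hadoop.write.checksum=false
pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>
```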
You will also need to provide the proper Hadoop dependency jars from your Hadoop installation to your Pinot startup scripts.

export HADOOP_HOME=/local/hadoop/
export HADOOP_VERSION=2.7.1
export HADOOP_GUAVA_VERSION=11.0.2
export HADOOP_GSON_VERSION=2.2.4
export CLASSPATH_PREFIX="${HADOOP_HOME}/share/hadoop/hdfs/hadoop-hdfs-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-annotations-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/hadoop-auth-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/hadoop-common-${HADOOP_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/guava-${HADOOP_GUAVA_VERSION}.jar:${HADOOP_HOME}/share/hadoop/common/lib/gson-${HADOOP_GSON_VERSION}.jar"

Push HDFS segment to Pinot Controller

To push HDFS segment files to the Pinot controller, ensure you have the proper Hadoop configuration described above. Your remote segment creation/push job can then send the HDFS path of the newly created segment files to the Pinot controller and let it download the files.

For example, the following curl request notifies the controller to download the segment file into the proper table:

curl -X POST -H "UPLOAD_TYPE:URI" -H "DOWNLOAD_URI:hdfs://nameservice1/hadoop/path/to/segment/file" -H "content-type:application/json" -d '' localhost:9000/segments
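The same URI-based push can be issued from code. The sketch below (a hypothetical helper, not part of Pinot) builds the equivalent request with Python's standard library; the controller URL and HDFS path are the placeholder values from the curl example:

```python
# Hypothetical helper mirroring the curl call above: it builds a POST to the
# controller's /segments endpoint with the UPLOAD_TYPE/DOWNLOAD_URI headers
# that tell the controller to pull the segment from HDFS itself.
from urllib import request

def build_uri_push_request(controller: str, segment_uri: str) -> request.Request:
    """Build (but do not send) a metadata-only URI segment push request."""
    return request.Request(
        url=f"{controller}/segments",
        data=b"",                          # empty body, matching curl -d ''
        headers={
            "UPLOAD_TYPE": "URI",          # ask the controller to download the segment
            "DOWNLOAD_URI": segment_uri,   # HDFS location of the segment file
            "content-type": "application/json",
        },
        method="POST",
    )

req = build_uri_push_request(
    "http://localhost:9000",
    "hdfs://nameservice1/hadoop/path/to/segment/file",
)
print(req.full_url)      # http://localhost:9000/segments
print(req.get_method())  # POST
# Against a live controller you would then call: request.urlopen(req)
```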

Examples

Job spec

Standalone Job:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: 'path/to/conf/directory/'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'students'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
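Assuming a standard Pinot distribution (the spec file path below is a placeholder), a standalone job spec like the one above can be launched with the LaunchDataIngestionJob command:

```
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job-spec.yaml
```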
Hadoop Job:

executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///path/to/staging/directory/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input/directory/'
outputDirURI: 'hdfs:///path/to/output/directory/'
includeFileNamePattern: 'glob:**/*.csv'
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'students'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

Controller config

controller.data.dir=hdfs://path/to/data/directory/
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.controller.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>

Server config

pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.hdfs.hadoop.conf.path=path/to/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.principle=<your kerberos principal>
pinot.server.segment.fetcher.hdfs.hadoop.kerberos.keytab=<your kerberos keytab>