Use OSS as Deep Storage for Pinot
Configure AliCloud Object Storage Service (OSS) as Pinot deep storage
OSS can be used as deep storage for Apache Pinot through the HDFS (Hadoop) file system plugin, without implementing a dedicated OSS file system plugin. Follow the steps below:
1. Configure the hdfs-site.xml and core-site.xml files. Put these configuration files under any desired path, then set the pinot.<node>.storage.factory.oss.hadoop.conf.path config in the controller/server configs to that path.
For hdfs-site.xml, you do not need to add any configuration:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
</configuration>
For core-site.xml, you need to provide the OSS access/secret keys and bucket configurations, as below:
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>oss://your-bucket-name/</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>your-access-key-id</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>your-access-key-secret</value>
    </property>
    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.emr.fs.oss.OssFileSystem</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>your-oss-endpoint</value>
    </property>
</configuration>
2. To access OSS, find the Hadoop jars related to OSS and put them under PINOT_DIR/lib. You can use the jars below, but be careful about versions to avoid conflicts (a small classpath check sketch follows the list).
  • smartdata-aliyun-oss
  • smartdata-hadoop-common
  • guava
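As a quick sanity check that the jars were picked up, a minimal Java sketch like the one below can be run with PINOT_DIR/lib on the classpath. It simply tries to load the OSS file system class configured as fs.oss.impl in core-site.xml and prints which jar it came from; the class name CheckOssClasspath is hypothetical and not part of Pinot.

// Minimal sketch (hypothetical helper class): verify the OSS FileSystem
// implementation is loadable from the jars placed under PINOT_DIR/lib.
public class CheckOssClasspath {
    public static void main(String[] args) throws ClassNotFoundException {
        // Same class as the fs.oss.impl value in core-site.xml
        Class<?> ossFs = Class.forName("com.aliyun.emr.fs.oss.OssFileSystem");
        // Print the jar the class was actually loaded from
        System.out.println("Loaded " + ossFs.getName() + " from "
                + ossFs.getProtectionDomain().getCodeSource().getLocation());
    }
}

If the class cannot be found, the jar is missing from the lib directory or shadowed by a conflicting version.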
3. Set the OSS deep storage configs in controller.conf and server.conf:
Controller config
controller.data.dir=oss://your-bucket-name/path/to/segments
controller.local.temp.dir=/path/to/local/temp/directory
controller.enable.split.commit=true
pinot.controller.storage.factory.class.oss=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.oss.hadoop.conf.path=path/to/conf/directory/
pinot.controller.segment.fetcher.protocols=file,http,oss
pinot.controller.segment.fetcher.oss.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Server config
pinot.server.instance.enable.split.commit=true
pinot.server.storage.factory.class.oss=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.oss.hadoop.conf.path=path/to/conf/directory/
pinot.server.segment.fetcher.protocols=file,http,oss
pinot.server.segment.fetcher.oss.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
Example Job Spec
Using the same Hadoop deep storage configs and jars, you can read data from OSS, create segments, and push them back to OSS. An example standalone batch ingestion job spec is shown below:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush
inputDirURI: 'oss://your-bucket-name/input'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'oss://your-bucket-name/output'
overwriteOutput: true
pinotFSSpecs:
  - scheme: oss
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/path/to/hadoop/conf'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
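Assuming a standard Pinot distribution, a spec like this is typically launched with the pinot-admin.sh LaunchDataIngestionJob command, passing the file above via the -jobSpecFile option.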