Configure AliCloud Object Storage Service (OSS) as Pinot deep storage
OSS can be used as deep storage for Apache Pinot through the HDFS plugin, without implementing a dedicated OSS file system plugin. Follow the steps below:
1. Configure the hdfs-site.xml and core-site.xml files, place them under any desired path, and then set the pinot.<node>.storage.factory.oss.hadoop.conf config in the controller/server configs to that path.
For hdfs-site.xml, no configuration is required.
For core-site.xml, you have to provide the OSS access/secret and bucket configurations, as in the example below:
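A minimal sketch of such a core-site.xml, assuming the standard Hadoop OSS connector property names (fs.oss.endpoint, fs.oss.accessKeyId, fs.oss.accessKeySecret); all values are placeholders, and the exact keys depend on the OSS connector jars you use, so verify them against your connector's documentation:

```xml
<configuration>
  <!-- Bucket used as Pinot deep storage (placeholder value) -->
  <property>
    <name>fs.defaultFS</name>
    <value>oss://your-bucket-name/</value>
  </property>
  <!-- OSS endpoint for your region (placeholder value) -->
  <property>
    <name>fs.oss.endpoint</name>
    <value>oss-cn-hangzhou.aliyuncs.com</value>
  </property>
  <!-- OSS credentials (placeholder values) -->
  <property>
    <name>fs.oss.accessKeyId</name>
    <value>your-access-key-id</value>
  </property>
  <property>
    <name>fs.oss.accessKeySecret</name>
    <value>your-access-key-secret</value>
  </property>
</configuration>
```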
2. In order to access OSS, find the HDFS jars related to OSS and put them under PINOT_DIR/lib. You can use the jars below, but be careful about versions to avoid conflicts.
smartdata-aliyun-oss
smartdata-hadoop-common
guava
3. Set the OSS deep storage configs in controller.conf and server.conf:
Controller config
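A sketch of the controller-side settings, assuming the HadoopPinotFS-style keys used for HDFS deep storage; bucket paths and local directories are placeholders, and the exact key names (e.g. ...hadoop.conf.path) should be verified against your Pinot version:

```properties
controller.data.dir=oss://your-bucket/path/to/controller/data
controller.local.temp.dir=/tmp/pinot/controller
pinot.controller.storage.factory.class.oss=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.oss.hadoop.conf.path=/path/to/hadoop/conf
pinot.controller.segment.fetcher.protocols=file,http,oss
pinot.controller.segment.fetcher.oss.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```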
Server config
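And a corresponding sketch for the server side, under the same assumptions (local directories are placeholders):

```properties
pinot.server.instance.dataDir=/path/to/server/data
pinot.server.instance.segmentTarDir=/path/to/server/segmentTars
pinot.server.storage.factory.class.oss=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.storage.factory.oss.hadoop.conf.path=/path/to/hadoop/conf
pinot.server.segment.fetcher.protocols=file,http,oss
pinot.server.segment.fetcher.oss.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```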
Example Job Spec
Using the same HDFS deep storage configs and jars, you can read data from OSS, create segments, and push them back to OSS. An example standalone batch ingestion job spec is shown below:
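A sketch of such a standalone job spec; the bucket paths, table name (myTable), input format, and controller address are placeholders to adjust to your environment:

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'oss://your-bucket/path/to/input/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'oss://your-bucket/path/to/output/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: oss
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/path/to/hadoop/conf'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```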
So far, you've seen how to create a new schema for a Pinot table. In this tutorial, we'll see how to evolve the schema (e.g., add a new column to the schema). This guide assumes you have a Pinot cluster up and running (e.g., as described in https://docs.pinot.apache.org/basics/getting-started/running-pinot-locally) and that there is an existing table baseballStats created as part of the batch quick start.
Pinot only allows adding new columns to the schema. To drop a column, or to change a column's name or data type, a new table has to be created.
Let's begin by fetching the existing schema. We can do this using the controller API:
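For example, assuming the controller is listening on localhost:9000:

```bash
curl localhost:9000/schemas/baseballStats > baseballStats.schema
```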
Let's add a new column at the end of the schema by editing baseballStats.schema. In this example, we're adding a new column called yearsOfExperience with a default value of 1.
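The added entry in the dimensionFieldSpecs section of the schema might look like this (a sketch; the rest of the schema stays unchanged):

```json
{
  "dataType": "INT",
  "name": "yearsOfExperience",
  "defaultNullValue": 1
}
```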
You can now update the schema using the following command:
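One way to do this is via the pinot-admin script (a sketch; the file path is a placeholder, and you can equivalently POST the schema file to the controller's /schemas endpoint):

```bash
bin/pinot-admin.sh AddSchema \
  -schemaFile baseballStats.schema \
  -exec
```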
Please note: the new column will not be reflected immediately. You need to reload the table segments for the column to show up, which can be done as follows:
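For example, using the controller's segment reload API (assuming the controller at localhost:9000):

```bash
curl -X POST localhost:9000/segments/baseballStats/reload
```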
After the reload, you can query the new column as shown below:
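For example (playerName is assumed to be an existing column in the baseballStats quick-start schema):

```sql
SELECT playerName, yearsOfExperience
FROM baseballStats
LIMIT 10
```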
Real-Time Pinot table: In case of real-time tables, make sure the "pinot.server.instance.reload.consumingSegment" config is set to true inside Server config. Without this, the current consuming segment(s) will not reflect the default null value for newly added columns.
Note that the real values for the newly added columns won't be reflected within the current consuming segment(s). The next consuming segment(s) will start consuming the real values.
New columns can also be added with ingestion transforms. If all the source columns for the new column exist in the schema, the transformed values will be generated for the new column instead of the default values. Note that the derived column, along with its data type, needs to be defined in the schema before the ingestion transform is added to the table config; a sketch of such a transform follows.
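A sketch of the transform section of the table config, assuming a hypothetical source column debutYear exists in the schema:

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "yearsOfExperience",
      "transformFunction": "2011 - debutYear"
    }
  ]
}
```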
As you can observe, the query returns the defaultNullValue for the newly added column. In order to populate this column with real values, you will need to re-run the batch ingestion job for the past dates.
Real-Time Pinot table: Backfilling data does not work for real-time tables. If you only have a real-time table, you can convert it to a hybrid table, by adding an offline counterpart that uses the same schema. Then you can backfill the offline table and fill in values for the newly added column. More on hybrid tables here.
Pinot segments can be created offline on Hadoop, or via the command line from data files. The controller REST endpoint can then be used to add the segment to the table it belongs to. Pinot segments can also be created by ingesting data from real-time sources (such as Kafka).
Offline Pinot workflow
To create Pinot segments on Hadoop, a workflow can be created to complete the following steps:
Pre-aggregate, clean up and prepare the data, writing it as Avro format files in a single HDFS directory
Create segments
Upload segments to the Pinot cluster
Step one can be done using your favorite tool (such as Pig, Hive, or Spark); Pinot provides two MapReduce jobs to do steps two and three.
Create a job properties configuration file, such as the one below:
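A sketch of such a properties file; the key names follow the legacy Pinot Hadoop segment-creation job and should be treated as assumptions to verify against your Pinot version, and the paths, table name, and hosts are placeholders:

```properties
# === Segment creation job config ===
# HDFS input directory containing the Avro files
path.to.input=/user/pinot/input/data
# HDFS output directory where segments will be written
path.to.output=/user/pinot/output/segments
# Local path to the table schema file
path.to.schema=flights-schema.json
# Name of the table for which segments are generated
segment.table.name=flights

# === Segment tar push job config ===
# Comma-separated list of controller hosts to push segments to
push.to.hosts=controller_host_0,controller_host_1
# Port on which the controllers listen
push.to.port=9000
```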
The Pinot Hadoop module contains a job that you can incorporate into your workflow to generate Pinot segments.
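A hedged sketch of how such jobs are typically launched; the jar name and job names here are assumptions that depend on how you built the Pinot Hadoop module, so adjust them to your build:

```bash
# Illustrative invocation; replace the jar path and job names with those from your build
hadoop jar pinot-hadoop-<version>-shaded.jar SegmentCreation job.properties
hadoop jar pinot-hadoop-<version>-shaded.jar SegmentTarPush job.properties
```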