
Kubernetes Deployment

The Pinot community provides a Helm-based Kubernetes deployment template.

You can deploy it by simply running a helm install command.
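For example, a minimal sketch of such an install, assuming a local checkout of the Pinot repo and Helm 3; the chart path, release name, and namespace below are assumptions that may differ for your Pinot version:

    # Clone the Pinot repo and install the bundled Helm chart from the local checkout.
    git clone https://github.com/apache/incubator-pinot.git pinot
    cd pinot/kubernetes/helm
    helm install pinot . -n pinot-quickstart --create-namespace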

However, there are a few things to note before starting a benchmark or moving to production.

Container Resources

We recommend running Pinot with pre-defined resources for the container, and making the requests and limits the same.

This ensures the container won't be killed if there is a sudden spike in workload.

It also makes it simpler to benchmark the system, e.g. to find the broker QPS limit.

Below is an example of the values to set in the values.yaml file. By default, resources are not set.

    resources:
      requests:
        cpu: 1
        memory: 1G
      limits:
        cpu: 1
        memory: 1G

JVM Setting

Pinot Controller/Broker

The JVM settings should be compliant with the container resources for the Pinot Controller and Pinot Broker.

You can configure the JVM as below to make -Xmx the same size as your container.

    resources:
      requests:
        cpu: 1
        memory: 1G
      limits:
        cpu: 1
        memory: 1G
    jvmOpts: "-Xms256M -Xmx1G"

Pinot Server

For Pinot Server, the heap is mostly used for query processing and metadata management, while off-heap memory is used for data loading/persistence and memory-mapped file page caching. So we recommend keeping the JVM requirement minimal and leaving the rest of the container for off-heap data operations.

E.g. assume the data is 100 GB on disk and the container size is 4 CPU, 10GB memory.

For the JVM, limit -Xmx to not exceed 50% of the container memory limit, so that the rest of the container can be leveraged by the off-heap operations.

    resources:
      requests:
        cpu: 4
        memory: 10G
      limits:
        cpu: 4
        memory: 10G
    jvmOpts: "-Xms1G -Xmx4G"

Deep storage

Pinot uses remote storage as deep storage to back up segments.

The default deployment creates a mounted disk (e.g. Amazon EBS) as deep storage in the controller.

You can configure your own S3, Azure Data Lake, or Google Cloud Storage as deep storage by following the corresponding configuration guide.
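As an illustration only, a controller configuration sketch for S3-style deep storage might look like the snippet below; the bucket name, region, and exact property keys are assumptions to adapt to your setup and Pinot version:

    # Hypothetical controller settings pointing segment storage at S3 instead of a mounted disk.
    controller.data.dir=s3://my-pinot-bucket/pinot-segments
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-west-2
    pinot.controller.segment.fetcher.protocols=file,http,s3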

Build Docker Images

Overview

The scripts to build Pinot-related Docker images are located in the docker/images directory of the Pinot repo.

You can access those scripts by running the commands below to check out the Pinot repo:

    git clone git@github.com:apache/incubator-pinot.git pinot
    cd pinot/docker/images

You can find the 3 currently supported images in this directory:


  • Pinot: Pinot all-in-one distribution image.

  • Pinot-Presto: Presto image with Presto-Pinot Connector built-in.

  • Pinot-Superset: Superset image with Pinot connector built-in.

Pinot

This is a Docker image of Apache Pinot.

How to build a docker image

There is a docker build script which will build a given Git repo/branch and tag the image.

Usage:

    ./docker-build.sh [Docker Tag] [Git Branch] [Pinot Git URL]

This script will check out the Pinot repo [Pinot Git URL] on branch [Git Branch] and build a docker image for it.

The docker image is tagged as [Docker Tag].

Docker Tag: Name and tag your docker image. Default is pinot:latest.

Git Branch: The Pinot branch to build. Default is master.

Pinot Git URL: The Pinot git repo to build; users can set it to their own fork. Please note that the URL is https:// based, not git://. Default is the Apache repo: https://github.com/apache/incubator-pinot.git.

  • Example of building and tagging a snapshot on your own fork:

    ./docker-build.sh pinot_fork:snapshot-5.2 snapshot-5.2 https://github.com/your_own_fork/pinot.git

  • Example of building a release version:

    ./docker-build.sh pinot:release-0.1.0 release-0.1.0 https://github.com/apache/incubator-pinot.git

  • Example of building the current master branch as a snapshot:

    ./docker-build.sh apachepinot/pinot:0.3.0-SNAPSHOT master

How to publish a docker image

The script docker-push.sh publishes a given docker image to your docker registry.

In order to push to your own repo, the image needs to be explicitly tagged with the repo name.

  • Example of publishing an image to the apachepinot/pinot Docker Hub repo:

    ./docker-push.sh apachepinot/pinot:latest

  • You can also tag a built image, then push it:

    docker tag pinot:release-0.1.0 apachepinot/pinot:release-0.1.0
    docker push apachepinot/pinot:release-0.1.0

The script docker-build-and-push.sh builds and publishes a docker image to your docker registry after the build.

  • Example of building and publishing an image to the apachepinot/pinot Docker Hub repo:

    ./docker-build-and-push.sh apachepinot/pinot:latest master https://github.com/apache/incubator-pinot.git

    Kubernetes Examples

    Please refer to Kubernetes Quickstart for deployment examples.

Pinot Presto

Docker image for Presto with Pinot integration.

This docker build project is specialized for Pinot.

How to build

Usage:

    ./docker-build.sh [Docker Tag] [Git Branch] [Presto Git URL]

This script will check out the Presto repo [Presto Git URL] on branch [Git Branch] and build a docker image for it.

The docker image is tagged as [Docker Tag].

Docker Tag: Name and tag your docker image. Default is pinot-presto:latest.

Git Branch: The Presto branch to build. Default is master.

Presto Git URL: The Presto git repo to build; users can set it to their own fork. Please note that the URL is https:// based, not git://. Default is the official Presto repo: https://github.com/prestodb/presto.git.

How to push

    docker push apachepinot/pinot-presto:latest

Configuration

Follow the instructions provided by Presto for writing your own configuration files under the etc directory.

Volumes

The image defines two data volumes: one for mounting configuration into the container, and one for data.

The configuration volume is located at /home/presto/etc, which contains all the configuration and plugins.

The data volume is located at /home/presto/data.
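For example, a hypothetical docker run invocation that mounts both volumes from the host; the local paths and the published port are assumptions:

    # Mount local Presto configuration and data directories into the container.
    docker run -d --name pinot-presto -p 8080:8080 \
      -v /path/to/local/etc:/home/presto/etc \
      -v /path/to/local/data:/home/presto/data \
      apachepinot/pinot-presto:latest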

Kubernetes Examples

Please refer to presto-coordinator.yaml as a k8s deployment example.

Pinot Superset

Docker image for Superset with Pinot integration.

This docker build project is based on the docker-superset project and specialized for Pinot.

How to build

Please modify the Makefile to change image and superset_version accordingly.

The command below will build the docker image and tag it as superset_version and latest:

    make latest

You can also build directly with the docker build command by setting build arguments:

    docker build \
        --build-arg NODE_VERSION=latest \
        --build-arg PYTHON_VERSION=3.6 \
        --build-arg SUPERSET_VERSION=0.34.1 \
        --tag apachepinot/pinot-superset:0.34.1 \
        --target build .

How to push

    make push

Configuration

Follow the instructions provided by Apache Superset for writing your own superset_config.py.

    Place this file in a local directory and mount this directory to /etc/superset inside the container. This location is included in the image's PYTHONPATH. Mounting this file to a different location is possible, but it will need to be in the PYTHONPATH.
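For instance, a hypothetical docker run invocation that mounts a local directory containing superset_config.py; the local path and published port are assumptions:

    # Mount the directory holding superset_config.py to /etc/superset, which is on the image's PYTHONPATH.
    docker run -d --name pinot-superset -p 8088:8088 \
      -v /path/to/local/superset-config:/etc/superset \
      apachepinot/pinot-superset:0.34.1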

    Volumes

The image defines two data volumes: one for mounting configuration into the container, and one for data (logs, SQLite DBs, etc.).

The configuration volume is located at either /etc/superset or /home/superset; both are acceptable. Both of these directories are included in the PYTHONPATH of the image. Mount any configuration (specifically the superset_config.py file) here to have it read by the app on startup.

The data volume is located at /var/lib/superset and is where you would mount your SQLite file (if you are using that as your backend), or a volume to collect any logs that are routed there. This location is used as the value of the SUPERSET_HOME environment variable.

Kubernetes Examples

Please refer to superset.yaml as a k8s deployment example.


    Running Pinot in Production

    Requirements

You will need the following in order to run Pinot in production:

    • Hardware for controller/broker/servers as per your load

  • Working installation of ZooKeeper that Pinot can use. We recommend setting aside a path within ZooKeeper and including that path in pinot.controller.zkStr. Pinot will create its own cluster under this path (cluster name decided by pinot.controller.helixClusterName); see the sketch after this list.

    • Shared storage mounted on controllers (if you plan to have multiple controllers for the same cluster). Alternatively, an implementation of PinotFS that the Pinot hosts have access to.

    • HTTP load balancers for spraying queries across brokers (or other mechanism to balance queries)

    • HTTP load balancers for spraying controller requests (e.g. segment push, or other controller APIs) or other mechanisms for distribution of these requests.
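A minimal controller configuration sketch for the ZooKeeper-related settings mentioned above; the host names, the /pinot chroot path, and the cluster name are assumptions:

    # Point the controller at a dedicated path inside the shared ZooKeeper ensemble.
    pinot.controller.zkStr=zk-1:2181,zk-2:2181,zk-3:2181/pinot
    pinot.controller.helixClusterName=PinotCluster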

    Deploying Pinot

In general, when deploying Pinot services, it is best to adhere to a specific ordering in which the various components are deployed. This ordering is recommended so that, if there are protocol or other significant differences between versions, the deployments go out in a predictable order and failures due to these changes can be avoided.

    The ordering is as follows:

    1. pinot-controller

    2. pinot-broker

3. pinot-server

4. pinot-minion

    Managing Pinot

Pinot provides a web-based management console and a command-line utility (pinot-admin.sh) to help provision and manage Pinot clusters.

    Pinot Management Console

The web-based management console allows operations on tables, tenants, segments and schemas. You can access the console via http://controller-host:port/help. The console also allows you to enter queries for interactive debugging. Here are some screenshots from the console.

    Listing all the schemas in the Pinot cluster:
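The same listing is also available from the controller REST API; for example (assuming the controller listens on port 9000):

    # List all schema names registered in the cluster.
    curl http://controller-host:9000/schemas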

    Rebalancing segments of a table:

    Command line utility (pinot-admin.sh)

    The command line utility (pinot-admin.sh) can be generated by running mvn install package -DskipTests -Pbin-dist in the directory in which you checked out Pinot.

    Here is an example of invoking the command to create a pinot segment:

    $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh CreateSegment -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -tableName baseballStats -segmentName baseballStats_data -overwrite -schemaFile ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/sample_data/baseballStats_schema.json
    Executing command: CreateSegment  -generatorConfigFile null -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -overwrite true -tableName baseballStats -segmentName baseballStats_data -timeColumnName null -schemaFile ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/sample_data/baseballStats_schema.json -readerConfigFile null -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
    Accepted files: [/Users/host1/Desktop/test/baseballStats_data.csv]
    Finished building StatsCollector!
    Collected stats for 97889 documents
    Created dictionary for INT column: homeRuns with cardinality: 67, range: 0 to 73
    Created dictionary for INT column: playerStint with cardinality: 5, range: 1 to 5
    Created dictionary for INT column: groundedIntoDoublePlays with cardinality: 35, range: 0 to 36
    Created dictionary for INT column: numberOfGames with cardinality: 165, range: 1 to 165
    Created dictionary for INT column: AtBatting with cardinality: 699, range: 0 to 716
    Created dictionary for INT column: stolenBases with cardinality: 114, range: 0 to 138
    Created dictionary for INT column: tripples with cardinality: 32, range: 0 to 36
    Created dictionary for INT column: hitsByPitch with cardinality: 41, range: 0 to 51
    Created dictionary for STRING column: teamID with cardinality: 149, max length in bytes: 3, range: ALT to WSU
    Created dictionary for INT column: numberOfGamesAsBatter with cardinality: 166, range: 0 to 165
    Created dictionary for INT column: strikeouts with cardinality: 199, range: 0 to 223
    Created dictionary for INT column: sacrificeFlies with cardinality: 20, range: 0 to 19
    Created dictionary for INT column: caughtStealing with cardinality: 36, range: 0 to 42
    Created dictionary for INT column: baseOnBalls with cardinality: 154, range: 0 to 232
    Created dictionary for STRING column: playerName with cardinality: 11976, max length in bytes: 43, range:  to Zoilo Casanova
    Created dictionary for INT column: doules with cardinality: 64, range: 0 to 67
    Created dictionary for STRING column: league with cardinality: 7, max length in bytes: 2, range: AA to UA
    Created dictionary for INT column: yearID with cardinality: 143, range: 1871 to 2013
    Created dictionary for INT column: hits with cardinality: 250, range: 0 to 262
    Created dictionary for INT column: runsBattedIn with cardinality: 175, range: 0 to 191
    Created dictionary for INT column: G_old with cardinality: 166, range: 0 to 165
    Created dictionary for INT column: sacrificeHits with cardinality: 54, range: 0 to 67
    Created dictionary for INT column: intentionalWalks with cardinality: 45, range: 0 to 120
    Created dictionary for INT column: runs with cardinality: 167, range: 0 to 192
    Created dictionary for STRING column: playerID with cardinality: 18107, max length in bytes: 9, range: aardsda01 to zwilldu01
    Start building IndexCreator!
    Finished records indexing in IndexCreator!
    Finished segment seal!
    Converting segment: /Users/host1/Desktop/test2/baseballStats_data_0 to v3 format
    v3 segment location for segment: baseballStats_data_0 is /Users/host1/Desktop/test2/baseballStats_data_0/v3
    Deleting files in v1 segment directory: /Users/host1/Desktop/test2/baseballStats_data_0
    Driver, record read time : 369
    Driver, stats collector time : 0
    Driver, indexing time : 373
Here is an example of executing a query on a Pinot table:

    $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -query "select count(*) from baseballStats"
    Executing command: PostQuery -brokerHost [broker_host] -brokerPort [broker_port] -query select count(*) from baseballStats
    Result: {"aggregationResults":[{"function":"count_star","value":"97889"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":97889,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":97889,"timeUsedMs":107,"segmentStatistics":[],"traceInfo":{}}

Monitoring Pinot

Pinot exposes several metrics to monitor the service and ensure that Pinot users are not experiencing issues. In this section we discuss some of the key metrics that are useful to monitor. A full list of metrics is available in the Metrics section.

Pinot Server

  • Missing Segments - NUM_MISSING_SEGMENTS

    • Number of missing segments that the broker queried for (expected to be on the server) but the server didn't have. This can be due to retention or a stale routing table.

  • Query latency - TOTAL_QUERY_TIME

    • Total time taken from receiving the query to finishing executing it.

  • Query Execution Exceptions - QUERY_EXECUTION_EXCEPTIONS

    • The number of exceptions which might have occurred during query execution.

  • Realtime Consumption Status - LLC_PARTITION_CONSUMING

    • This gives a binary value based on whether low-level consumption is healthy (1) or unhealthy (0). It's important to ensure at least a single replica of each partition is consuming.

  • Realtime Highest Offset Consumed - HIGHEST_STREAM_OFFSET_CONSUMED

    • The highest offset which has been consumed so far.

Pinot Broker

  • Incoming QPS (per broker) - QUERIES

    • The rate at which an individual broker is receiving queries. Units are in QPS.

  • Dropped Requests - REQUEST_DROPPED_DUE_TO_SEND_ERROR, REQUEST_DROPPED_DUE_TO_CONNECTION_ERROR, REQUEST_DROPPED_DUE_TO_ACCESS_ERROR

    • These metrics indicate whether a query was dropped, i.e. the processing of that query was forfeited for some reason.

  • Partial Responses - BROKER_RESPONSES_WITH_PARTIAL_SERVERS_RESPONDED

    • Indicates a count of partial responses. A partial response is when at least one of the requested servers fails to respond to the query.

  • Table QPS quota exceeded - QUERY_QUOTA_EXCEEDED

    • Binary metric which indicates whether the configured QPS quota for a table is exceeded (1) or there is capacity remaining (0).

  • Table QPS quota usage percent - QUERY_QUOTA_CAPACITY_UTILIZATION_RATE

    • Percentage of the configured QPS quota being utilized.

Pinot Controller

Many of the controller metrics include a table name and thus are dynamically generated in the code. The metrics below point to the classes which generate the corresponding metrics.

To get the real metric name, the easiest route is to spin up a controller instance, create a table with the desired name and look through the generated metrics.

Todo

Give a more detailed explanation of how metrics are generated, how to identify real metric names and where to find them in the code.

  • Percent Segments Available - PERCENT_SEGMENTS_AVAILABLE

    • Percentage of complete online replicas in external view as compared to replicas in ideal state.

  • Segments in Error State - SEGMENTS_IN_ERROR_STATE

    • Number of segments in an ERROR state for a given table.

  • Last push delay - Generated in the ValidationMetrics class.

    • The time in hours since the last time an offline segment has been pushed to the controller.

  • Percent of replicas up - PERCENT_OF_REPLICAS

    • Percentage of complete online replicas in external view as compared to replicas in ideal state.

  • Table storage quota usage percent - TABLE_STORAGE_QUOTA_UTILIZATION

    • Shows how much of the table's storage quota is currently being used, as a percentage of the entire quota.

Tutorials

Here you will find a collection of how-to guides for operators or developers.

    Amazon EKS (Kafka)

If you need to connect non-EKS AWS jobs (Lambdas/EC2) to a Kafka cluster running inside AWS EKS

General steps: update Kafka's advertised.listeners and make sure Kafka is accessible (e.g. allow inbound traffic in the Security Groups).

    You will probably face the following problems.


If you want to connect to Kafka from outside of EKS, you will need to change advertised.listeners. When a client connects to a single Kafka bootstrap server, that server sends the client a list of addresses for all brokers. If you want to connect to a Kafka running in EKS, these default values will not be correct. This post provides an excellent explanation of the field.
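A sketch of what the relevant broker settings can look like when exposing Kafka both inside and outside the cluster; the listener names, ports, and external DNS name are assumptions for illustration:

    # server.properties: advertise an internal address for in-cluster clients
    # and an externally reachable address for clients outside EKS.
    listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
    advertised.listeners=INTERNAL://kafka-0.kafka-headless.kafka.svc.cluster.local:9092,EXTERNAL://my-kafka-nlb.example.com:9094
    listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    inter.broker.listener.name=INTERNAL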


If you use Helm to deploy Kafka to AWS EKS, please review the chart's README. It describes multiple setups for communicating into EKS.


Running helm upgrade on the Kafka chart does not always update the pods. The exact reason is unknown; it's probably an issue with the chart's implementation. You should run kubectl describe pod and other commands to see the current status of the pods. During initial development, you can run helm uninstall and then helm install to force the values to update.
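For example, a hypothetical sequence for inspecting the pods and force-refreshing the deployment; the release name, namespace, label selector, and chart reference are assumptions:

    # Inspect the current pod state, then reinstall the chart to force updated values.
    kubectl describe pod -n kafka -l app=kafka
    helm uninstall kafka -n kafka
    helm install kafka incubator/kafka -n kafka -f values.yaml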


    Amazon MSK (Kafka)

    How to Connect Pinot with Amazon Managed Streaming for Apache Kafka (Amazon MSK)

This wiki documents how to connect Pinot deployed in Amazon EKS to Amazon Managed Kafka.

    Prerequisite

Please follow this AWS Quickstart Wiki to run Pinot on Amazon EKS.

    Create an Amazon MSK Cluster

Please go to the MSK Landing Page to create a Kafka cluster.


    Note:

1. For demo simplicity, this MSK cluster reuses the same VPC created by the EKS cluster in the previous step. Otherwise, VPC Peering is required to ensure the two VPCs can talk to each other.

2. Under the Encryption section, choose Both TLS encrypted and plaintext traffic allowed.

    Below is a sample screenshot to create an Amazon MSK cluster.

After clicking the Create button, you can take a coffee break and come back.

    Once the cluster is created, you can view it and click View client information to see the Zookeeper and Kafka Broker list.

    Sample Client Information

    Connect to MSK

Configure Security Groups

Until now, the MSK cluster is still not accessible; you can follow this Wiki to create an EC2 instance to connect to it for topic creation, and to run the console producer and consumer.

In order to connect MSK to EKS, we need to allow traffic to flow between them.

This is configured through the Amazon VPC page.

1. Record the Amazon MSK SecurityGroup from the cluster page; in the above demo, it's sg-01e7ab1320a77f1a9.

2. Open the Amazon VPC page and click on Security Groups in the left bar. Find the EKS security group: eksctl-${PINOT_EKS_CLUSTER}-cluster/ClusterSharedNodeSecurityGroup.

3. In Security Groups, click on the MSK security group (sg-01e7ab1320a77f1a9), then click Edit Rules and add the above ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) to it.

4. Click the EKS security group ClusterSharedNodeSecurityGroup (sg-0402b59d7e440f8d1) and add an inbound rule for the MSK security group (sg-01e7ab1320a77f1a9).

Now the EKS cluster should be able to talk to Amazon MSK.
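If you prefer the AWS CLI over the console, a roughly equivalent sketch of the steps above; the security group IDs are the demo values, and the ports assume plaintext Kafka on 9092 and ZooKeeper on 2181:

    # Allow EKS nodes to reach the MSK brokers and ZooKeeper.
    aws ec2 authorize-security-group-ingress --group-id sg-01e7ab1320a77f1a9 \
      --protocol tcp --port 9092 --source-group sg-0402b59d7e440f8d1
    aws ec2 authorize-security-group-ingress --group-id sg-01e7ab1320a77f1a9 \
      --protocol tcp --port 2181 --source-group sg-0402b59d7e440f8d1
    # Allow MSK to reach back into the EKS nodes.
    aws ec2 authorize-security-group-ingress --group-id sg-0402b59d7e440f8d1 \
      --protocol tcp --port 0-65535 --source-group sg-01e7ab1320a77f1a9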

    Create Kafka topic

To run the commands below, please ensure you set two environment variables, ZOOKEEPER_CONNECT_STRING and BROKER_LIST_STRING (use plaintext), from the Amazon MSK client information, and replace the variables accordingly.

    E.g.

You can log into one EKS node or container and run the command below to create a topic.

E.g. enter the Pinot controller container:

Then install wget and download the Kafka binary.

    Create a Kafka topic:

Topic creation succeeds with the message below:

    Write sample data into Kafka

Once the topic is created, we can start a simple application to produce to it.

You can download the github-events-aws-msk-demo.yaml file below, then replace:

  • ${ZOOKEEPER_CONNECT_STRING} -> the MSK Zookeeper connect string

  • ${BROKER_LIST_STRING} -> the MSK plaintext broker list string in the deployment

  • ${GITHUB_PERSONAL_ACCESS_TOKEN} -> a personal GitHub Personal Access Token generated from your GitHub account settings; please grant all read permissions to it. The source code of the application that generates these GitHub events is available in the Pinot repository.

Then apply the YAML file by running the kubectl apply command shown below.

    Once the pod is up, you can verify by running a console consumer to read from it.


Try to run this from the Pinot controller container entered in the step above.

    Create a Pinot table

    This step is relatively easy.

Since we already put the table creation request into the ConfigMap, we can just enter the pinot-github-events-data-into-msk-kafka pod to execute the command.

    • Check if the pod is running:

    Sample output:

    • Enter into the pod

    • Create Table

    Sample output:

  • Then you can open the Pinot Query Console to browse the data.

    ZOOKEEPER_CONNECT_STRING="z-3.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181,z-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:2181"
    BROKER_LIST_STRING="b-1.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092,b-2.pinot-quickstart-msk-d.ky727f.c3.kafka.us-west-2.amazonaws.com:9092"
    kubectl exec -it pod/pinot-controller-0  -n pinot-quickstart bash
    apt-get update
    apt-get install wget -y
    wget https://archive.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz
    tar -xzf kafka_2.12-2.2.1.tgz
    cd kafka_2.12-2.2.1
    bin/kafka-topics.sh \
      --zookeeper ${ZOOKEEPER_CONNECT_STRING} \
      --create \
      --topic pullRequestMergedEventsAwsMskDemo \
      --replication-factor 1 \
      --partitions 1
    Created topic "pullRequestMergedEventsAwsMskDemo".
    kubectl apply -f github-events-aws-msk-demo.yaml
    bin/kafka-console-consumer.sh \
      --bootstrap-server ${BROKER_LIST_STRING} \
      --topic pullRequestMergedEventsAwsMskDemo
    kubectl get pod -n pinot-quickstart  |grep pinot-github-events-data-into-msk-kafka
    pinot-github-events-data-into-msk-kafka-68948fb4cd-rrzlf   1/1     Running     0          14m
    podname=`kubectl get pod -n pinot-quickstart  |grep pinot-github-events-data-into-msk-kafka|awk '{print $1}'`
    kubectl exec -it ${podname} -n pinot-quickstart bash
    bin/pinot-admin.sh AddTable \
      -controllerHost pinot-controller \
      -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json \
      -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json \
      -exec
    Executing command: AddTable -tableConfigFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_realtime_table_config.json -schemaFile /var/pinot/examples/pullRequestMergedEventsAwsMskDemo_schema.json -controllerHost pinot-controller -controllerPort 9000 -exec
    Sending request: http://pinot-controller:9000/schemas to controller: pinot-controller-0.pinot-controller-headless.pinot-quickstart.svc.cluster.local, version: Unknown
    {"status":"Table pullRequestMergedEventsAwsMskDemo_REALTIME succesfully added"}

    Batch Data Ingestion In Practice

    In practice, we need to run Pinot data ingestion as a pipeline or a scheduled job.
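For instance, a hypothetical cron entry that runs the standalone ingestion job nightly; the distribution path, spec file, and log location are assumptions:

    # Run the batch ingestion job every night at 2am and append its output to a log file.
    0 2 * * * cd /opt/pinot && bin/pinot-ingestion-job.sh examples/batch/airlineStats/ingestionJobSpec.yaml >> /var/log/pinot-ingestion.log 2>&1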

Assuming pinot-distribution is already built, you can find several sample table layouts inside the examples directory.

    Table Layout

    Usually each table deserves its own directory, like airlineStats.

Inside the table directory, a rawdata directory is created to hold all the input data.

Typically, for data events with timestamps, we partition the data and store it in daily folders. E.g. a typical layout would follow this pattern: rawdata/%yyyy%/%mm%/%dd%/[daily_input_files].

    Configuring batch ingestion job

    Create a batch ingestion job spec file to describe how to ingest the data.

Below is an example (also located at examples/batch/airlineStats/ingestionJobSpec.yaml):

    Executing the job

The command below will create the example table in the Pinot cluster.

The command below will kick off the ingestion job to generate Pinot segments and push them into the cluster.

After the job finishes, segments are stored in examples/batch/airlineStats/segments following the same layout as the input directory.

    Executing the job using Spark

The example below runs in Spark local mode. You can download the Spark distribution and start it by running:

Build the latest Pinot distribution following this Wiki.

The command below shows how to use spark-submit to submit a Spark job using pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar.

Sample Spark ingestion job spec yaml (also located at examples/batch/airlineStats/sparkIngestionJobSpec.yaml):

Please ensure the parameters PINOT_ROOT_DIR and PINOT_VERSION are set properly.

Please ensure you set:

  • spark.driver.extraJavaOptions =>

    -Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins

    Or put all the required plugin jars on the CLASSPATH, then set -Dplugins.dir=${CLASSPATH}

  • spark.driver.extraClassPath =>

    pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar

Executing the job using Hadoop

The command below shows how to use the hadoop jar command to run a Hadoop job using pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar.

Sample Hadoop ingestion job spec yaml (also located at examples/batch/airlineStats/hadoopIngestionJobSpec.yaml):

Please ensure the parameters PINOT_ROOT_DIR and PINOT_VERSION are set properly.
    /var/pinot/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01.avro
    /var/pinot/airlineStats/rawdata/2014/01/02/airlineStats_data_2014-01-02.avro
    /var/pinot/airlineStats/rawdata/2014/01/03/airlineStats_data_2014-01-03.avro
    /var/pinot/airlineStats/rawdata/2014/01/04/airlineStats_data_2014-01-04.avro
    /var/pinot/airlineStats/rawdata/2014/01/05/airlineStats_data_2014-01-05.avro
    /var/pinot/airlineStats/rawdata/2014/01/06/airlineStats_data_2014-01-06.avro
    /var/pinot/airlineStats/rawdata/2014/01/07/airlineStats_data_2014-01-07.avro
    /var/pinot/airlineStats/rawdata/2014/01/08/airlineStats_data_2014-01-08.avro
    /var/pinot/airlineStats/rawdata/2014/01/09/airlineStats_data_2014-01-09.avro
    /var/pinot/airlineStats/rawdata/2014/01/10/airlineStats_data_2014-01-10.avro
    /var/pinot/airlineStats/rawdata/2014/01/11/airlineStats_data_2014-01-11.avro
    /var/pinot/airlineStats/rawdata/2014/01/12/airlineStats_data_2014-01-12.avro
    /var/pinot/airlineStats/rawdata/2014/01/13/airlineStats_data_2014-01-13.avro
    /var/pinot/airlineStats/rawdata/2014/01/14/airlineStats_data_2014-01-14.avro
    /var/pinot/airlineStats/rawdata/2014/01/15/airlineStats_data_2014-01-15.avro
    /var/pinot/airlineStats/rawdata/2014/01/16/airlineStats_data_2014-01-16.avro
    /var/pinot/airlineStats/rawdata/2014/01/17/airlineStats_data_2014-01-17.avro
    /var/pinot/airlineStats/rawdata/2014/01/18/airlineStats_data_2014-01-18.avro
    /var/pinot/airlineStats/rawdata/2014/01/19/airlineStats_data_2014-01-19.avro
    /var/pinot/airlineStats/rawdata/2014/01/20/airlineStats_data_2014-01-20.avro
    /var/pinot/airlineStats/rawdata/2014/01/21/airlineStats_data_2014-01-21.avro
    /var/pinot/airlineStats/rawdata/2014/01/22/airlineStats_data_2014-01-22.avro
    /var/pinot/airlineStats/rawdata/2014/01/23/airlineStats_data_2014-01-23.avro
    /var/pinot/airlineStats/rawdata/2014/01/24/airlineStats_data_2014-01-24.avro
    /var/pinot/airlineStats/rawdata/2014/01/25/airlineStats_data_2014-01-25.avro
    /var/pinot/airlineStats/rawdata/2014/01/26/airlineStats_data_2014-01-26.avro
    /var/pinot/airlineStats/rawdata/2014/01/27/airlineStats_data_2014-01-27.avro
    /var/pinot/airlineStats/rawdata/2014/01/28/airlineStats_data_2014-01-28.avro
    /var/pinot/airlineStats/rawdata/2014/01/29/airlineStats_data_2014-01-29.avro
    /var/pinot/airlineStats/rawdata/2014/01/30/airlineStats_data_2014-01-30.avro
    /var/pinot/airlineStats/rawdata/2014/01/31/airlineStats_data_2014-01-31.avro
    # executionFrameworkSpec: Defines ingestion jobs to be running.
    executionFrameworkSpec:
    
      # name: execution framework name
      name: 'standalone'
    
      # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    
      # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    
      # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    
    # jobType: Pinot ingestion job type.
    # Supported job types are:
    #   'SegmentCreation'
    #   'SegmentTarPush'
    #   'SegmentUriPush'
    #   'SegmentCreationAndTarPush'
    #   'SegmentCreationAndUriPush'
    jobType: SegmentCreationAndTarPush
    
    # inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
    inputDirURI: 'examples/batch/airlineStats/rawdata'
    
    # includeFileNamePattern: include file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will include all the avro files under inputDirURI recursively.
    includeFileNamePattern: 'glob:**/*.avro'
    
    # excludeFileNamePattern: exclude file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will exclude all the avro files under inputDirURI recursively.
    # _excludeFileNamePattern: ''
    
    # outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
    outputDirURI: 'examples/batch/airlineStats/segments'
    
    # overwriteOutput: Overwrite output segments if existed.
    overwriteOutput: true
    
    # pinotFSSpecs: defines all related Pinot file systems.
    pinotFSSpecs:
    
      - # scheme: used to identify a PinotFS.
        # E.g. local, hdfs, dbfs, etc
        scheme: file
    
        # className: Class name used to create the PinotFS instance.
        # E.g.
        #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
        #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
        #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    
    # recordReaderSpec: defines all record reader
    recordReaderSpec:
    
      # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
      dataFormat: 'avro'
    
      # className: Corresponding RecordReader class name.
      # E.g.
      #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
      #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
      #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
      #   org.apache.pinot.plugin.inputformat.json.JsonRecordReader
      #   org.apache.pinot.plugin.inputformat.orc.OrcRecordReader
      #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
      className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
    
    # tableSpec: defines table name and where to fetch corresponding table config and table schema.
    tableSpec:
    
      # tableName: Table name
      tableName: 'airlineStats'
    
      # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_schema.json
      #   http://localhost:9000/tables/myTable/schema
      schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
    
      # tableConfigURI: defines where to read the table config.
      # Supports using PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_config.json
      #   http://localhost:9000/tables/myTable
      # Note that the API to read Pinot table config directly from pinot controller contains a JSON wrapper.
      # The real table config is the object under the field 'OFFLINE'.
      tableConfigURI: 'http://localhost:9000/tables/airlineStats'
    
    # segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
    segmentNameGeneratorSpec:
    
      # type: Current supported type is 'simple' and 'normalizedDate'.
      type: normalizedDate
    
      # configs: Configs to init SegmentNameGenerator.
      configs:
        segment.name.prefix: 'airlineStats_batch'
        exclude.sequence.id: true
    
    # pinotClusterSpecs: defines the Pinot Cluster Access Point.
    pinotClusterSpecs:
      - # controllerURI: used to fetch table/schema information and data push.
        # E.g. http://localhost:9000
        controllerURI: 'http://localhost:9000'
    
    # pushJobSpec: defines segment push job related configuration.
    pushJobSpec:
    
      # pushAttempts: number of attempts for push job, default is 1, which means no retry.
      pushAttempts: 2
    
      # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
      pushRetryIntervalMillis: 1000
    bin/pinot-admin.sh AddTable  -schemaFile examples/batch/airlineStats/airlineStats_schema.json -tableConfigFile examples/batch/airlineStats/airlineStats_offline_table_config.json -exec
    bin/pinot-ingestion-job.sh examples/batch/airlineStats/ingestionJobSpec.yaml
    /var/pinot/airlineStats/segments/2014/01/01/airlineStats_batch_2014-01-01_2014-01-01.tar.gz
    /var/pinot/airlineStats/segments/2014/01/02/airlineStats_batch_2014-01-02_2014-01-02.tar.gz
    /var/pinot/airlineStats/segments/2014/01/03/airlineStats_batch_2014-01-03_2014-01-03.tar.gz
    /var/pinot/airlineStats/segments/2014/01/04/airlineStats_batch_2014-01-04_2014-01-04.tar.gz
    /var/pinot/airlineStats/segments/2014/01/05/airlineStats_batch_2014-01-05_2014-01-05.tar.gz
    /var/pinot/airlineStats/segments/2014/01/06/airlineStats_batch_2014-01-06_2014-01-06.tar.gz
    /var/pinot/airlineStats/segments/2014/01/07/airlineStats_batch_2014-01-07_2014-01-07.tar.gz
    /var/pinot/airlineStats/segments/2014/01/08/airlineStats_batch_2014-01-08_2014-01-08.tar.gz
    /var/pinot/airlineStats/segments/2014/01/09/airlineStats_batch_2014-01-09_2014-01-09.tar.gz
    /var/pinot/airlineStats/segments/2014/01/10/airlineStats_batch_2014-01-10_2014-01-10.tar.gz
    /var/pinot/airlineStats/segments/2014/01/11/airlineStats_batch_2014-01-11_2014-01-11.tar.gz
    /var/pinot/airlineStats/segments/2014/01/12/airlineStats_batch_2014-01-12_2014-01-12.tar.gz
    /var/pinot/airlineStats/segments/2014/01/13/airlineStats_batch_2014-01-13_2014-01-13.tar.gz
    /var/pinot/airlineStats/segments/2014/01/14/airlineStats_batch_2014-01-14_2014-01-14.tar.gz
    /var/pinot/airlineStats/segments/2014/01/15/airlineStats_batch_2014-01-15_2014-01-15.tar.gz
    /var/pinot/airlineStats/segments/2014/01/16/airlineStats_batch_2014-01-16_2014-01-16.tar.gz
    /var/pinot/airlineStats/segments/2014/01/17/airlineStats_batch_2014-01-17_2014-01-17.tar.gz
    /var/pinot/airlineStats/segments/2014/01/18/airlineStats_batch_2014-01-18_2014-01-18.tar.gz
    /var/pinot/airlineStats/segments/2014/01/19/airlineStats_batch_2014-01-19_2014-01-19.tar.gz
    /var/pinot/airlineStats/segments/2014/01/20/airlineStats_batch_2014-01-20_2014-01-20.tar.gz
    /var/pinot/airlineStats/segments/2014/01/21/airlineStats_batch_2014-01-21_2014-01-21.tar.gz
    /var/pinot/airlineStats/segments/2014/01/22/airlineStats_batch_2014-01-22_2014-01-22.tar.gz
    /var/pinot/airlineStats/segments/2014/01/23/airlineStats_batch_2014-01-23_2014-01-23.tar.gz
    /var/pinot/airlineStats/segments/2014/01/24/airlineStats_batch_2014-01-24_2014-01-24.tar.gz
    /var/pinot/airlineStats/segments/2014/01/25/airlineStats_batch_2014-01-25_2014-01-25.tar.gz
    /var/pinot/airlineStats/segments/2014/01/26/airlineStats_batch_2014-01-26_2014-01-26.tar.gz
    /var/pinot/airlineStats/segments/2014/01/27/airlineStats_batch_2014-01-27_2014-01-27.tar.gz
    /var/pinot/airlineStats/segments/2014/01/28/airlineStats_batch_2014-01-28_2014-01-28.tar.gz
    /var/pinot/airlineStats/segments/2014/01/29/airlineStats_batch_2014-01-29_2014-01-29.tar.gz
    /var/pinot/airlineStats/segments/2014/01/30/airlineStats_batch_2014-01-30_2014-01-30.tar.gz
    /var/pinot/airlineStats/segments/2014/01/31/airlineStats_batch_2014-01-31_2014-01-31.tar.gz
    wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
    tar xvf spark-2.4.5-bin-hadoop2.7.tgz
    cd spark-2.4.5-bin-hadoop2.7
    ./bin/spark-shell --master 'local[2]'
    # executionFrameworkSpec: Defines ingestion jobs to be running.
    executionFrameworkSpec:
    
      # name: execution framework name
      name: 'spark'
    
      # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
    
      # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
    
      # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
    
      # extraConfigs: extra configs for execution framework.
      extraConfigs:
    
        # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
        stagingDir: examples/batch/airlineStats/staging
    
    # jobType: Pinot ingestion job type.
    # Supported job types are:
    #   'SegmentCreation'
    #   'SegmentTarPush'
    #   'SegmentUriPush'
    #   'SegmentCreationAndTarPush'
    #   'SegmentCreationAndUriPush'
    jobType: SegmentCreationAndTarPush
    
    # inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
    inputDirURI: 'examples/batch/airlineStats/rawdata'
    
    # includeFileNamePattern: include file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will include all the avro files under inputDirURI recursively.
    includeFileNamePattern: 'glob:**/*.avro'
    
    # excludeFileNamePattern: exclude file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will exclude all the avro files under inputDirURI recursively.
    # excludeFileNamePattern: ''
    
    # outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
    outputDirURI: 'examples/batch/airlineStats/segments'
    
    # overwriteOutput: Overwrite output segments if existed.
    overwriteOutput: true
    
    # pinotFSSpecs: defines all related Pinot file systems.
    pinotFSSpecs:
    
      - # scheme: used to identify a PinotFS.
        # E.g. local, hdfs, dbfs, etc
        scheme: file
    
        # className: Class name used to create the PinotFS instance.
        # E.g.
        #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
        #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
        #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
        className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    
    # recordReaderSpec: defines all record reader
    recordReaderSpec:
    
      # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
      dataFormat: 'avro'
    
      # className: Corresponding RecordReader class name.
      # E.g.
      #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
      #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
      #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
      #   org.apache.pinot.plugin.inputformat.json.JsonRecordReader
      #   org.apache.pinot.plugin.inputformat.orc.OrcRecordReader
      #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
      className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
    
    # tableSpec: defines table name and where to fetch corresponding table config and table schema.
    tableSpec:
    
      # tableName: Table name
      tableName: 'airlineStats'
    
      # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_schema.json
      #   http://localhost:9000/tables/myTable/schema
      schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
    
      # tableConfigURI: defines where to read the table config.
      # Supports using PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_config.json
      #   http://localhost:9000/tables/myTable
      # Note that the API to read Pinot table config directly from pinot controller contains a JSON wrapper.
      # The real table config is the object under the field 'OFFLINE'.
      tableConfigURI: 'http://localhost:9000/tables/airlineStats'
    
    # segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
    segmentNameGeneratorSpec:
    
      # type: Current supported type is 'simple' and 'normalizedDate'.
      type: normalizedDate
    
      # configs: Configs to init SegmentNameGenerator.
      configs:
        segment.name.prefix: 'airlineStats_batch'
        exclude.sequence.id: true
    
    # pinotClusterSpecs: defines the Pinot Cluster Access Point.
    pinotClusterSpecs:
      - # controllerURI: used to fetch table/schema information and data push.
        # E.g. http://localhost:9000
        controllerURI: 'http://localhost:9000'
    
    # pushJobSpec: defines segment push job related configuration.
    pushJobSpec:
    
      # pushParallelism: push job parallelism, default is 1.
      pushParallelism: 2
    
      # pushAttempts: number of attempts for push job, default is 1, which means no retry.
      pushAttempts: 2
    
      # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
      pushRetryIntervalMillis: 1000
    export PINOT_VERSION=0.4.0-SNAPSHOT
    export PINOT_DISTRIBUTION_DIR=${PINOT_ROOT_DIR}/pinot-distribution/target/apache-pinot-incubating-${PINOT_VERSION}-bin/apache-pinot-incubating-${PINOT_VERSION}-bin
    cd ${PINOT_DISTRIBUTION_DIR}
    ${SPARK_HOME}/bin/spark-submit \
      --class org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher \
      --master "local[2]" \
      --deploy-mode client \
      --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
      --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
      local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
      ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/sparkIngestionJobSpec.yaml
    # executionFrameworkSpec: Defines ingestion jobs to be running.
    executionFrameworkSpec:
    
      # name: execution framework name
      name: 'hadoop'
    
      # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
    
      # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentTarPushJobRunner'
    
      # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentUriPushJobRunner'
    
      # extraConfigs: extra configs for execution framework.
      extraConfigs:
    
        # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
        stagingDir: examples/batch/airlineStats/staging
    
    # jobType: Pinot ingestion job type.
    # Supported job types are:
    #   'SegmentCreation'
    #   'SegmentTarPush'
    #   'SegmentUriPush'
    #   'SegmentCreationAndTarPush'
    #   'SegmentCreationAndUriPush'
    jobType: SegmentCreationAndTarPush
    
    # inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
    inputDirURI: 'examples/batch/airlineStats/rawdata'
    
    # includeFileNamePattern: include file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will include all the avro files under inputDirURI recursively.
    includeFileNamePattern: 'glob:**/*.avro'
    
    # excludeFileNamePattern: exclude file name pattern, supported glob pattern.
    # Sample usage:
    #   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
    #   'glob:**\/*.avro' will exclude all the avro files under inputDirURI recursively.
    # _excludeFileNamePattern: ''
    
    # outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
    outputDirURI: 'examples/batch/airlineStats/segments'
    
    # overwriteOutput: Overwrite output segments if existed.
    overwriteOutput: true
    
    # pinotFSSpecs: defines all related Pinot file systems.
    pinotFSSpecs:
    
      - # scheme: used to identify a PinotFS.
        # E.g. local, hdfs, dbfs, etc
        scheme: file
    
        # className: Class name used to create the PinotFS instance.
        # E.g.
        #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
        #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
        #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
        className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    
    # recordReaderSpec: defines all record reader
    recordReaderSpec:
    
      # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
      dataFormat: 'avro'
    
      # className: Corresponding RecordReader class name.
      # E.g.
      #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
      #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
      #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
      #   org.apache.pinot.plugin.inputformat.json.JsonRecordReader
      #   org.apache.pinot.plugin.inputformat.orc.OrcRecordReader
      #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
      className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
    
    # tableSpec: defines table name and where to fetch corresponding table config and table schema.
    tableSpec:
    
      # tableName: Table name
      tableName: 'airlineStats'
    
      # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_schema.json
      #   http://localhost:9000/tables/myTable/schema
      schemaURI: 'http://localhost:9000/tables/airlineStats/schema'
    
      # tableConfigURI: defines where to read the table config.
      # Supports using PinotFS or HTTP.
      # E.g.
      #   hdfs://path/to/table_config.json
      #   http://localhost:9000/tables/myTable
      # Note that the API to read Pinot table config directly from pinot controller contains a JSON wrapper.
      # The real table config is the object under the field 'OFFLINE'.
      tableConfigURI: 'http://localhost:9000/tables/airlineStats'
    
    # segmentNameGeneratorSpec: defines how to init a SegmentNameGenerator.
    segmentNameGeneratorSpec:
    
      # type: Current supported type is 'simple' and 'normalizedDate'.
      type: normalizedDate
    
      # configs: Configs to init SegmentNameGenerator.
      configs:
        segment.name.prefix: 'airlineStats_batch'
        exclude.sequence.id: true
    
    # pinotClusterSpecs: defines the Pinot Cluster Access Point.
    pinotClusterSpecs:
      - # controllerURI: used to fetch table/schema information and data push.
        # E.g. http://localhost:9000
        controllerURI: 'http://localhost:9000'
    
    # pushJobSpec: defines segment push job related configuration.
    pushJobSpec:
    
      # pushParallelism: push job parallelism, default is 1.
      pushParallelism: 2
    
      # pushAttempts: number of attempts for push job, default is 1, which means no retry.
      pushAttempts: 2
    
      # pushRetryIntervalMillis: retry wait Ms, default to 1 second.
      pushRetryIntervalMillis: 1000
    export PINOT_VERSION=0.4.0-SNAPSHOT
    export PINOT_DISTRIBUTION_DIR=${PINOT_ROOT_DIR}/pinot-distribution/target/apache-pinot-incubating-${PINOT_VERSION}-bin/apache-pinot-incubating-${PINOT_VERSION}-bin
    export HADOOP_CLIENT_OPTS="-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml"
    hadoop jar  \
            ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
            org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher \
            ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpec.yaml