Record Reader

Pinot supports indexing data from various file formats. To support reading from a file format, a record reader needs to be provided to read the file and convert records into the general format that the indexing engine can understand. The record reader serves as the connector from each individual file format to the Pinot record format.

The Pinot package provides the following record readers out of the box:

  • Avro record reader: record reader for Avro format files

  • CSV record reader: record reader for CSV format files

  • JSON record reader: record reader for JSON format files

  • ORC record reader: record reader for ORC format files

  • Thrift record reader: record reader for Thrift format files

  • Pinot segment record reader: record reader for Pinot segments

Initialize Record Reader

To initialize a record reader, the data file and table schema should be provided (for the Pinot segment record reader, only the index directory is needed because the schema can be derived from the segment). The output records will follow the table schema provided.

For the Avro, JSON, ORC, and Pinot segment record readers, no extra configuration is required, as column names and multi-values are embedded in the data file.

For the CSV and Thrift record readers, extra configuration may need to be provided to determine the column names and multi-values for the data.
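
For illustration, here is a minimal sketch of initializing and driving a CSV record reader by hand. The class names, the Schema.fromFile helper, and the init(dataFile, schema, config) signature are assumptions based on the description above rather than a verified API for this release, so check the CSV input-format plugin for the exact classes and method signatures.

// Sketch only: class names and signatures below are assumptions, not verified 0.4.0 APIs.
File dataFile = new File("/path/to/data.csv");
Schema schema = Schema.fromFile(new File("/path/to/table-schema.json"));

CSVRecordReaderConfig config = new CSVRecordReaderConfig();  // optional for CSV
RecordReader recordReader = new CSVRecordReader();
recordReader.init(dataFile, schema, config);

GenericRow reuse = new GenericRow();
while (recordReader.hasNext()) {
  GenericRow row = recordReader.next(reuse);  // one pass over the data
  // ... hand the row to the indexing engine
}
recordReader.rewind();                        // the engine makes a second pass
recordReader.close();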

CSV Record Reader Config

The CSV record reader config contains the following settings:

  • Header: the header for the CSV file (column names)

  • Column delimiter: delimiter for each column

  • Multi-value delimiter: delimiter for each value for a multi-valued column

If no config is provided, the default settings are used (see the example below):

  • Use the first row in the data file as the header

  • Use ‘,’ as the column delimiter

  • Use ‘;’ as the multi-value delimiter
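
For example, with those defaults, a hypothetical file like the following yields two single-valued columns (name, age) and one multi-valued column (skills) with two values per row:

name,age,skills
Alice,30,java;python
Bob,25,go;scala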

Thrift Record Reader Config

The Thrift record reader config is mandatory. It contains the Thrift class name for the record reader to de-serialize the Thrift objects.
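
A sketch of supplying it programmatically, assuming the config class exposes a setter for the Thrift class name (the class and setter names are assumptions; check the Thrift input-format plugin for the exact API):

// Assumption: ThriftRecordReaderConfig and setThriftClass(...) may be named differently in your release.
ThriftRecordReaderConfig thriftConfig = new ThriftRecordReaderConfig();
thriftConfig.setThriftClass("com.example.thrift.MyThriftRecord");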

ORC Record Reader Config

The following property must be set in your Hadoop properties during segment generation:

record.reader.path: ${FULL_PATH_OF_YOUR_RECORD_READER_CLASS}

For ORC, it would be:

record.reader.path: org.apache.pinot.orc.data.readers.ORCRecordReader

Implement Your Own Record Reader

For other file formats, we provide a general record reader interface, RecordReader. To index a file into a Pinot segment, simply implement the interface and plug it into the index engine, SegmentCreationDriverImpl. We use a 2-pass algorithm to index the file into a Pinot segment, hence the rewind() method is required for the record reader.
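
To illustrate the shape such a reader takes, here is a minimal, self-contained sketch of a reader for a simple pipe-delimited text format. It mirrors the init / hasNext / next / rewind / close life cycle described above, with a plain Map standing in for GenericRow so that the snippet has no Pinot dependencies; a real plugin would implement the RecordReader interface from Pinot and populate a GenericRow instead (exact method signatures vary by release).

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a record reader for files shaped like: value1|value2|value3
// In a real implementation the column names come from the table schema passed to init().
public class PipeDelimitedRecordReader implements AutoCloseable {
  private File _dataFile;
  private List<String> _columnNames;
  private BufferedReader _reader;
  private String _nextLine;

  // Remember the data file and schema-derived column names, then open the file for the first pass.
  public void init(File dataFile, List<String> columnNames) throws IOException {
    _dataFile = dataFile;
    _columnNames = columnNames;
    rewind();
  }

  public boolean hasNext() {
    return _nextLine != null;
  }

  // Convert the current line into a column-name -> value map (the GenericRow stand-in).
  public Map<String, Object> next() throws IOException {
    String[] values = _nextLine.split("\\|");
    Map<String, Object> row = new HashMap<>();
    for (int i = 0; i < _columnNames.size(); i++) {
      // Missing trailing values fall back to null; a real reader would use the schema default.
      row.put(_columnNames.get(i), i < values.length ? values[i] : null);
    }
    _nextLine = _reader.readLine();
    return row;
  }

  // Reopen the file so the indexing engine can make its second pass over the data.
  public void rewind() throws IOException {
    close();
    _reader = new BufferedReader(new FileReader(_dataFile));
    _nextLine = _reader.readLine();
  }

  @Override
  public void close() throws IOException {
    if (_reader != null) {
      _reader.close();
    }
  }
}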

Generic Row

GenericRow is the record abstraction which the index engine can read and index with. It is a map from column name (String) to column value (Object). For a multi-valued column, the value should be an object array (Object[]).
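
For example, a row with one single-valued and one multi-valued column would be populated along these lines (putField is assumed to be the GenericRow setter in this release; use the equivalent setter if yours differs):

GenericRow row = new GenericRow();
row.putField("playerName", "A. Example");              // single-valued column: plain Object
row.putField("teams", new Object[]{"TeamA", "TeamB"}); // multi-valued column: Object[]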

Contracts for Record Reader

There are several contracts for record readers that developers should follow when implementing their own record readers:

  • The output GenericRow should follow the table schema provided, in the sense that:

    • All the columns in the schema should be preserved (if a column does not exist in the original record, put a default value instead; see the sketch after this list)

    • Columns not in the schema should not be included

    • Values for the columns should follow the field spec from the schema (data type, single-valued/multi-valued)

  • For the time column (refer to TimeFieldSpec), the record reader should be able to read both the incoming and outgoing time (we allow conversion from incoming time, the time value in the original data, to outgoing time, the time value stored in Pinot, during index creation):

    • If the incoming and outgoing time column names are the same, use the incoming time field spec

    • If the incoming and outgoing time column names are different, put both of them as specified by the time field spec

    • We keep both the incoming and outgoing time columns to handle cases where the input file contains time values that are already converted
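
A sketch of honoring the first contract when a source record is missing a schema column, assuming the usual Schema/FieldSpec/GenericRow accessors (getAllFieldSpecs, getDefaultNullValue, and putField are assumptions to verify against your release):

// Hypothetical helper: make sure every schema column appears in the output row,
// falling back to the field's declared default when the parsed record lacks it.
static void fillRow(Schema schema, Map<String, Object> parsedRecord, GenericRow row) {
  for (FieldSpec fieldSpec : schema.getAllFieldSpecs()) {
    String column = fieldSpec.getName();
    Object value = parsedRecord.get(column);
    row.putField(column, value != null ? value : fieldSpec.getDefaultNullValue());
  }
}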

