`true`. With this configuration in place, the server allocates off-heap memory by memory-mapping files. Pinot never flushes these files to stable storage (the operating system may do so, depending on memory pressure on the host). The files are discarded when the consuming segment is converted into a completed segment.
`pinot.server.consumerDir`. Given that there is no control over when pages of the memory-mapped consuming segments are flushed, you may want to point this directory to a memory-based file system (such as tmpfs), eliminating wasteful disk I/O.
`true`. In this case, Pinot allocates direct ByteBuffer objects for consuming segments. Direct allocation can result in address space fragmentation.
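As a sketch, the two allocation modes above can be selected in the server configuration file (property names as in the standard Pinot server configuration; verify them against your Pinot version):

```properties
# Memory-map consuming-segment memory off-heap; Pinot never flushes these
# files, so the consumer directory can safely live on a memory-based FS.
pinot.server.instance.realtime.alloc.offheap=true
# Example path only: point the consumer directory at a tmpfs mount to
# eliminate wasteful disk I/O for consuming segments.
pinot.server.consumerDir=/dev/shm/pinot/consumerDir
# Alternatively, allocate direct ByteBuffers instead of memory-mapped files
# (can fragment the address space):
# pinot.server.instance.realtime.alloc.offheap.direct=true
```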
`realtime.segment.flush.threshold.segment.size` setting, as described in the StreamConfigs section. You can run the administrative tool
`pinot-admin.sh RealtimeProvisioningHelper`, which will help you arrive at an optimal setting for the segment size.
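A sample invocation is sketched below, using the arguments described later in this section (all paths and values are placeholders to adapt to your setup):

```shell
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/myTable_REALTIME_table_config.json \
  -numPartitions 8 \
  -numHosts 4,6,8 \
  -numHours 6,12,18,24 \
  -sampleCompletedSegmentDir /path/to/sample/completed/segment \
  -ingestionRate 100 \
  -maxUsableHostMemory 48G \
  -retentionHours 72
```

The tool prints a matrix of memory and segment-size estimates for each combination of `numHosts` and `numHours`, from which you can pick a configuration.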
`completionConfig`, as described in Table Config, can be used to configure this.
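For example, to have non-committing servers download the completed segment from the controller rather than build it themselves, the table config can include a fragment like this (a sketch; check the Table Config reference for the completion modes supported by your version):

```json
"completionConfig": {
  "completionMode": "DOWNLOAD"
}
```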
- Build the segment
- Start a transaction with the lead controller to commit the segment (CommitStart phase)
- Post the completed segment to any of the controllers (and the controller posts it to segment store)
- End the transaction with the lead controller (CommitEnd phase). Optionally, this step can be done with the segment metadata.
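The split-commit steps above can be sketched as follows. This is illustrative Python only; the class and method names are invented for the sketch and do not correspond to Pinot's actual API:

```python
# Illustrative model of the split-commit protocol a server runs when it
# wins the right to commit a consuming segment.
class LeadController:
    """Stands in for the lead controller that owns the commit transaction."""
    def __init__(self):
        self.phase = None

    def commit_start(self, segment_name):
        # CommitStart phase: open the commit transaction for this segment.
        self.phase = "COMMIT_START"
        return True

    def commit_end(self, segment_name, metadata):
        # CommitEnd phase: close the transaction; optionally carries metadata.
        self.phase = "COMMIT_END"
        return True

class SegmentStore:
    """Stands in for the segment store behind any controller."""
    def __init__(self):
        self.segments = {}

    def upload(self, segment_name, data):
        # Any controller can accept the upload and post it to the store.
        self.segments[segment_name] = data

def commit_consuming_segment(name, rows, lead, store):
    data = b"".join(rows)                 # 1. build the segment locally
    if not lead.commit_start(name):       # 2. CommitStart with lead controller
        return False
    store.upload(name, data)              # 3. post segment to any controller
    return lead.commit_end(name, metadata={"sizeBytes": len(data)})  # 4. CommitEnd

lead, store = LeadController(), SegmentStore()
ok = commit_consuming_segment("mytable__0__42", [b"row1", b"row2"], lead, store)
```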
`true` (default is `false`).
tableConfigFile: This is the path to the table config file
numPartitions: Number of partitions in your stream
numHosts: This is a list of the number of hosts for which you need to compute the actual parameters. For example, if you are planning to deploy between 4 and 8 hosts, you may specify 4,6,8. In this case, the parameters will be computed for each configuration -- that of 4 hosts, 6 hosts, and 8 hosts. You can then decide which of these configurations to use.
numHours: This is a list of the maximum number of hours you want your consuming segments to remain in consuming state. After that many hours, the segment moves to completed state even if other criteria (like segment size or number of rows) are not yet met. This value must be smaller than the retention of your stream. If you specify too small a value, you run the risk of creating too many segments, resulting in sub-optimal query performance. If you specify too big a value, you run the risk of having segments that are too large, and of running out of "hot" memory (consuming segments are held in read-write memory). Specify a few different (comma-separated) values, and the command computes the segment size for each of them.
sampleCompletedSegmentDir: The path of the directory in which the sample segment is present. See above if you do not have a sample segment.
pushFrequency: This is optional. If this is a hybrid table, then enter the frequency with which offline segments are pushed (one of "hourly", "daily", "weekly" or "monthly"). This argument is ignored if
maxUsableHostMemory: This is the total memory available in each host for hosting `retentionHours` worth of data (i.e. "hot" data) of this table. Remember to leave some memory for query processing (or for other tables, if they share the same hosts). If your latency needs to be very low, this value should not exceed the physical memory available on each host in your cluster for storing Pinot segments of this table. On the other hand, if you are trying to lower cost and can tolerate higher latencies, consider specifying a bigger value here; Pinot will leave the rest to the operating system to page memory back in as necessary.
retentionHours: This argument specifies how many hours of data will typically be queried on your table; these are assumed to be the most recent hours. If `pushFrequency` is specified, then it is assumed that the older data will be served by the offline table, and the value is derived automatically. For example, if `pushFrequency` is `daily`, this value defaults to `32d`. If neither `pushFrequency` nor `retentionHours` is specified, then this value is assumed to be the retention time of the realtime table (e.g. if the table is retained for six months, it is assumed that most queries will retrieve all six months of data). As an example, if you have a realtime-only table with a 21-day retention and expect that 90% of your queries will be for the most recent 3 days, you can specify a `retentionHours` value of 72. This helps you configure a system that performs much better for most of your queries, while taking a performance hit for those that occasionally query older data.
ingestionRate: Specify the average number of rows ingested per second per partition of your stream.
schemaWithMetadataFile: This is needed if you do not have a sample segment from the topic to be ingested. This argument allows you to specify a schema file with additional information to describe the data characteristics (like number of unique values each column can have, etc.).
numRows: This is an optional argument if you want the tool to generate a segment for you. If it is not given, then a default value of