`true`. With this configuration in place, the server allocates off-heap memory by memory-mapping files. Pinot never flushes these files to stable storage (the operating system may do so, depending on memory pressure on the host). The files are discarded when the consuming segment is converted into a completed segment.
`pinot.server.consumerDir`. Given that there is no control over when pages of the memory-mapped consuming segments are flushed, you may want to point this directory to a memory-based file system (such as tmpfs), eliminating wasteful disk I/O.
`true`. In this case, Pinot allocates direct ByteBuffer objects for consuming segments. Direct allocation can result in address space fragmentation.
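As a sketch, the two allocation modes above can be selected in the server configuration file (property names as in the standard Pinot server configuration; verify them against your Pinot version):

```properties
# Memory-map consuming-segment memory off-heap; Pinot never flushes these
# files, so the consumer directory can safely live on a memory-based FS.
pinot.server.instance.realtime.alloc.offheap=true
# Example path only: point the consumer directory at a tmpfs mount to
# eliminate wasteful disk I/O for consuming segments.
pinot.server.consumerDir=/dev/shm/pinot/consumerDir
# Alternatively, allocate direct ByteBuffers instead of memory-mapped files
# (can fragment the address space):
# pinot.server.instance.realtime.alloc.offheap.direct=true
```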
`realtime.segment.flush.threshold.segment.size` setting, as described in the StreamConfigs section. You can run the administrative tool
`pinot-admin.sh RealtimeProvisioningHelper`, which will help you arrive at an optimal setting for the segment size.
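A sample invocation is sketched below, using the arguments described later in this section (all paths and values are placeholders to adapt to your setup):

```shell
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/myTable_REALTIME_table_config.json \
  -numPartitions 8 \
  -numHosts 4,6,8 \
  -numHours 6,12,18,24 \
  -sampleCompletedSegmentDir /path/to/sample/completed/segment \
  -ingestionRate 100 \
  -maxUsableHostMemory 48G \
  -retentionHours 72
```

The tool prints a matrix of memory and segment-size estimates for each combination of `numHosts` and `numHours`, from which you can pick a configuration.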
`completionConfig`, as described in Table Config, can be used to configure this.
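For example, to have non-committing servers download the completed segment from the controller rather than build it themselves, the table config can include a fragment like this (a sketch; check the Table Config reference for the completion modes supported by your version):

```json
"completionConfig": {
  "completionMode": "DOWNLOAD"
}
```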
- Build the segment
- Start a transaction with the lead controller to commit the segment (CommitStart phase)
- Post the completed segment to any of the controllers (and the controller posts it to segment store)
- End the transaction with the lead controller (CommitEnd phase). Optionally, this step can be done with the segment metadata.
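The split-commit steps above can be sketched as follows. This is illustrative Python only; the class and method names are invented for the sketch and do not correspond to Pinot's actual API:

```python
# Illustrative model of the split-commit protocol a server runs when it
# wins the right to commit a consuming segment.
class LeadController:
    """Stands in for the lead controller that owns the commit transaction."""
    def __init__(self):
        self.phase = None

    def commit_start(self, segment_name):
        # CommitStart phase: open the commit transaction for this segment.
        self.phase = "COMMIT_START"
        return True

    def commit_end(self, segment_name, metadata):
        # CommitEnd phase: close the transaction; optionally carries metadata.
        self.phase = "COMMIT_END"
        return True

class SegmentStore:
    """Stands in for the segment store behind any controller."""
    def __init__(self):
        self.segments = {}

    def upload(self, segment_name, data):
        # Any controller can accept the upload and post it to the store.
        self.segments[segment_name] = data

def commit_consuming_segment(name, rows, lead, store):
    data = b"".join(rows)                 # 1. build the segment locally
    if not lead.commit_start(name):       # 2. CommitStart with lead controller
        return False
    store.upload(name, data)              # 3. post segment to any controller
    return lead.commit_end(name, metadata={"sizeBytes": len(data)})  # 4. CommitEnd

lead, store = LeadController(), SegmentStore()
ok = commit_consuming_segment("mytable__0__42", [b"row1", b"row2"], lead, store)
```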
`true` (default is `false`).
tableConfigFile: This is the path to the table config file
numPartitions: Number of partitions in your stream
numHosts: This is a list of the number of hosts for which you need to compute the actual parameters. For example, if you are planning to deploy between 4 and 8 hosts, you may specify 4,6,8. In this case, the parameters will be computed for each configuration -- that of 4 hosts, 6 hosts, and 8 hosts. You can then decide which of these configurations to use.
numHours: This is a list of the maximum number of hours you want your consuming segments to remain in consuming state. After that many hours, the segment moves to completed state even if other criteria (like segment size or number of rows) are not yet met. This value must be smaller than the retention of your stream. If you specify too small a value, you run the risk of creating too many segments, resulting in sub-optimal query performance. If you specify too big a value, you run the risk of having segments that are too large, and of running out of "hot" memory (consuming segments are held in read-write memory). Specify a few different (comma-separated) values, and the command computes the segment size for each of them.
sampleCompletedSegmentDir: The path of the directory in which the sample segment is present. See above if you do not have a sample segment.
pushFrequency: This is optional. If this is a hybrid table, then enter the frequency with which offline segments are pushed (one of "hourly", "daily", "weekly" or "monthly"). This argument is ignored if
maxUsableHostMemory: This is the total memory available in each host for hosting `retentionHours` worth of data (i.e. "hot" data) of this table. Remember to leave some memory for query processing (or for other tables, if they share the same hosts). If your latency needs to be very low, this value should not exceed the physical memory available on each host in your cluster for storing Pinot segments of this table. On the other hand, if you are trying to lower cost and can tolerate higher latencies, consider specifying a bigger value here; Pinot will leave the rest to the operating system to page memory back in as necessary.
retentionHours: This argument specifies how many hours of data will typically be queried on your table; these are assumed to be the most recent hours. If `pushFrequency` is specified, then it is assumed that the older data will be served by the offline table, and the value is derived automatically. For example, if `pushFrequency` is `daily`, this value defaults to `32d`. If neither `pushFrequency` nor `retentionHours` is specified, then this value is assumed to be the retention time of the realtime table (e.g. if the table is retained for six months, it is assumed that most queries will retrieve all six months of data). As an example, if you have a realtime-only table with a 21-day retention and expect that 90% of your queries will be for the most recent 3 days, you can specify a `retentionHours` value of 72. This helps you configure a system that performs much better for most of your queries, while taking a performance hit for those that occasionally query older data.
ingestionRate: Specify the average number of rows ingested per second per partition of your stream.
schemaWithMetadataFile: This is needed if you do not have a sample segment from the topic to be ingested. This argument allows you to specify a schema file with additional information to describe the data characteristics (like number of unique values each column can have, etc.).
numRows: This is an optional argument if you want the tool to generate a segment for you. If it is not given, then a default value of