Here you will find a collection of ready-made sample applications and examples for real-world data.
Steps for setting up a Pinot cluster and a real-time table which consumes from the GitHub events stream.
In this recipe you will set up an Apache Pinot cluster and a real-time table which consumes data flowing from a GitHub events stream. The stream is based on GitHub pull requests and uses Kafka.
In this recipe you will perform the following steps:
Set up a Pinot cluster. To do this, you will:
a. Start Zookeeper.
b. Start the controller.
c. Start the broker.
d. Start the server.
Set up a Kafka cluster.
Create a Kafka topic, which will be called pullRequestMergedEvents.
Create a real-time table called pullRequestMergedEvents and a schema.
Start a task which reads from the GitHub events API and publishes events about merged pull requests to the topic (a sketch of this task appears after this list).
Query the real-time data.
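As a reference for the event-publishing step above, the following is a minimal sketch of such a task: it polls the public GitHub events API with a personal access token and forwards merged pull request events to the pullRequestMergedEvents topic. The recipe ships its own implementation of this task; the confluent-kafka client, the GITHUB_TOKEN environment variable, and the polling interval below are assumptions, not the recipe's actual code.

```python
# Hypothetical sketch of the "publish events" task: poll the public GitHub
# events API and forward merged pull requests to Kafka.
import json
import os
import time

import requests
from confluent_kafka import Producer

TOPIC = "pullRequestMergedEvents"
producer = Producer({"bootstrap.servers": "localhost:9092"})
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

while True:
    events = requests.get("https://api.github.com/events", headers=headers).json()
    for event in events:
        # Keep only pull requests that were closed by being merged
        if event["type"] == "PullRequestEvent":
            pr = event["payload"]["pull_request"]
            # merged_at is only set when the pull request was actually merged
            if event["payload"]["action"] == "closed" and pr.get("merged_at"):
                producer.produce(TOPIC, value=json.dumps(event))
    producer.flush()
    time.sleep(10)  # stay well under the GitHub API rate limit
```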
Pull the Docker image
Get the latest Docker image.
Long version
Set up the Pinot cluster
Follow the instructions in Advanced Pinot Setup to set up a Pinot cluster with the components:
Zookeeper
Controller
Broker
Server
Kafka
Create a Kafka topic
Create a Kafka topic called pullRequestMergedEvents for the demo.
Add a Pinot table and schema
The schema is present at examples/stream/githubEvents/pullRequestMergedEvents_schema.json and is also pasted below.
The table config is present at examples/stream/githubEvents/docker/pullRequestMergedEvents_realtime_table_config.json and is also pasted below.
Note
If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url and stream.kafka.broker.list correctly, depending on the configuration of your Kafka cluster.
Add the table and schema using the following command:
Publish events
Start streaming GitHub events into the Kafka topic:
Prerequisites
Generate a personal access token on GitHub.
Short version
The short method of setting things up is to use the following command. Make sure to stop any previously running Pinot services.
Get Pinot
Follow the instructions in Build from source to get the latest Pinot code.
Long version
Set up the Pinot cluster
Follow the instructions in Advanced Pinot Setup to set up the Pinot cluster with the components:
Zookeeper
Controller
Broker
Server
Kafka
Create a Kafka topic
Download Apache Kafka.
Create a Kafka topic called pullRequestMergedEvents for the demo.
Add a Pinot table and schema
The schema can be found at /examples/stream/githubevents/ in the release, and is also pasted below:
The table config can be found at /examples/stream/githubevents/ in the release, and is also pasted below.
Note
If you're setting this up on a pre-configured cluster, set the properties stream.kafka.zk.broker.url and stream.kafka.broker.list correctly, depending on the configuration of your Kafka cluster.
Add the table and schema using the command:
Publish events
Start streaming GitHub events into the Kafka topic
Prerequisites
Generate a personal access token on GitHub.
Short version
To set up all of the above steps with a single command:
If you already have a Kubernetes cluster with Pinot and Kafka (see Running Pinot in Kubernetes), first create the topic, then set up the table and streaming using
Browse to the Query Console to view the data.
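You can also query the table programmatically. Below is a minimal sketch using the pinotdb Python client (the same library used in the Redash and Streamlit guides later on); the broker address localhost:8099 and the column names in the second query are assumptions, so adjust them to your cluster and the schema pasted above.

```python
# Minimal sketch: query the real-time table with the pinotdb client.
# Assumes the Pinot broker is reachable on localhost:8099.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Sanity check: how many merged pull request events have been ingested so far?
curs.execute("SELECT COUNT(*) FROM pullRequestMergedEvents")
print(curs.fetchone())

# Hypothetical column names (check the schema pasted above before running):
curs.execute("""
    SELECT repo, SUM(numCommits) AS commits
    FROM pullRequestMergedEvents
    WHERE organization = 'apache'
    GROUP BY repo
    ORDER BY commits DESC
    LIMIT 10
""")
for row in curs:
    print(row)
```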
You can use SuperSet to visualize this data. Some of the interesting insights we captured were:
Repositories by number of commits in the Apache organization
To integrate with SuperSet you can check out the SuperSet Integrations page.
In this Apache Pinot guide, we'll learn how to visualize data using the Streamlit web framework.
In this guide you'll learn how to visualize data from Apache Pinot using Streamlit. Streamlit is a Python library that makes it easy to build interactive, data-based web applications.
We're going to use Streamlit to build a real-time dashboard to visualize the changes being made to Wikimedia properties.
Real-Time Dashboard Architecture
We're going to use the following Docker compose file, which spins up instances of Zookeeper, Kafka, along with a Pinot controller, broker, and server:
docker-compose.yml
Run the following command to launch all the components:
Wikimedia provides a continuous stream of structured event data describing changes made to various Wikimedia properties. The events are published over HTTP using the Server-Sent Events (SSE) protocol.
You can find the endpoint at: stream.wikimedia.org/v2/stream/recentchange
We'll need to install the SSE client library to consume this data:
Next, create a file called wiki.py that contains the following:
wiki.py
The highlighted section shows how we connect to the recent changes feed using the SSE client library.
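If you just want the shape of that connection code, here is a condensed sketch. It assumes the sseclient-py package and prints a few fields from each recent-change event; the full wiki.py in this guide may differ in its details.

```python
# Condensed sketch of consuming the Wikimedia recent changes SSE feed.
# Assumes the sseclient-py package (pip install sseclient-py).
import json

import requests
import sseclient

URL = "https://stream.wikimedia.org/v2/stream/recentchange"

# Open a streaming HTTP connection and wrap it in an SSE client
response = requests.get(URL, stream=True, headers={"Accept": "text/event-stream"})
client = sseclient.SSEClient(response)

for event in client.events():
    if event.event == "message" and event.data:
        change = json.loads(event.data)
        print(change["timestamp"], change["user"], change["title"])
```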
Let's run this script as shown below:
We'll see the following (truncated) output:
Output
Now we're going to import each of the events into Apache Kafka. First let's create a Kafka topic called wiki_events with 5 partitions:
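Alternatively, you can create the topic from Python rather than with the Kafka CLI; a sketch using confluent-kafka's AdminClient follows. The broker address localhost:9092 is an assumption based on the Docker Compose setup above.

```python
# Sketch: create the wiki_events topic programmatically instead of with the
# Kafka CLI. Assumes a broker listening on localhost:9092.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics(
    [NewTopic("wiki_events", num_partitions=5, replication_factor=1)]
)

for topic, future in futures.items():
    future.result()  # raises if topic creation failed
    print(f"Created topic {topic}")
```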
Create a new file called wiki_to_kafka.py and import the following libraries:
wiki_to_kafka.py
Add these functions:
wiki_to_kafka.py
And now let's add the code that calls the recent changes API and imports events into the wiki_events topic:
wiki_to_kafka.py
The highlighted parts of this script indicate where events are ingested into Kafka and then flushed to disk.
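Stripped of the SSE plumbing, that produce-and-flush pattern looks roughly like the sketch below; the confluent-kafka client, the broker address, and the flush-every-100-messages threshold are assumptions that may differ from the actual wiki_to_kafka.py.

```python
# Sketch of the produce-and-flush pattern referenced above. Assumes the
# confluent-kafka client and a broker on localhost:9092. In the real script
# this would be called from the SSE event loop for each change event.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(change: dict, count: int) -> None:
    # Each event is serialized to JSON and written to the wiki_events topic
    producer.produce("wiki_events", value=json.dumps(change))
    # Periodically flush so buffered messages are delivered to the broker
    if count % 100 == 0:
        producer.flush()
        print(f"Flushed {count} messages")
```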
If we run this script:
We'll see a message every time 100 messages are pushed to Kafka, as shown below:
Output
Let's check that the data has made its way into Kafka.
The following command returns the message offset for each partition in the wiki_events topic:
Output
Looks good. We can also stream all the messages in this topic by running the following command:
Output
Now let's configure Pinot to consume the data from Kafka.
We'll have the following schema:
schema.json
And the following table config:
table.json
The highlighted lines are how we connect Pinot to the Kafka topic that contains the events. Create the schema and table by running the following command:
Once you've done that, navigate to the Pinot UI and run the following query to check that the data has made its way into Pinot:
As long as you see some records, everything is working as expected.
Now let's write some more queries against Pinot and display the results in Streamlit.
First, install the following libraries:
Create a file called app.py, then import the libraries and write a header for the page:
app.py
Connect to Pinot and write a query that returns recent changes, along with the users who made the changes, and domains where they were made:
app.py
The highlighted part of the query shows how to count the number of events from the last minute and the minute before that. We then do a similar thing to count the number of unique users and domains.
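For reference, the shape of that query is roughly as follows. This is an illustrative sketch: the column names (ts, user, domain) and the exact filter expressions are assumptions and may differ from the query in this guide's app.py.

```python
# Illustrative sketch of the "last minute vs. the minute before" query pattern.
# Column names (ts, "user", domain) are assumptions; check schema.json above.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

curs.execute("""
    SELECT COUNT(*) FILTER (WHERE ts > ago('PT1M')) AS events1Min,
           COUNT(*) FILTER (WHERE ts <= ago('PT1M') AND ts > ago('PT2M')) AS events1Min2Min,
           DISTINCTCOUNT("user") FILTER (WHERE ts > ago('PT1M')) AS users1Min,
           DISTINCTCOUNT(domain) FILTER (WHERE ts > ago('PT1M')) AS domains1Min
    FROM wiki_events
    WHERE ts > ago('PT2M')
""")
print(curs.fetchone())
```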
Now let's create some metrics based on that data:
app.py
Go back to the terminal and run the following command:
Navigate to localhost:8501 to see the Streamlit app. You should see something like the following:
Next, let's add a line chart that shows the number of changes being done to Wikimedia per minute. Add the following code to app.py:
app.py
Go back to the web browser and you should see something like this:
At the moment we need to refresh our web browser to update the metrics and line chart, but it would be much better if that happened automatically. Let's now add auto refresh functionality.
Add the following code just under the header at the top of app.py:
app.py
And the following code at the very end:
app.py
If we navigate back to our web browser, we'll see the following:
The full script used in this example is shown below:
app.py
In this guide we've learnt how to publish data into Kafka from Wikimedia's event stream, ingest it from there into Pinot, and finally make sense of the data using SQL queries run from Streamlit.
Install Redash and start a running instance by following the Redash setup documentation.
Configure Redash to query Pinot, by doing the following:
Create visualizations, by doing the following:
Apache Pinot provides a Python client library, pinotdb, to query Pinot from Python applications. Install pinotdb inside the Redash worker instance to make network calls to Pinot.
Navigate to the root directory where you've cloned Redash. Run the following command to get the name of the Redash worker container (by default, redash_worker_1):
docker-compose ps
Run the following command (change redash_worker_1 to your own Redash worker container name, if applicable):
Restart Docker.
In Redash, select Settings > Data Sources. Select New Data Source, and then select Python from the list.
On the Redash Settings - Data Source page, add Pinot as the name of the data source, and enter pinotdb in the Modules to import prior to running the script field.
Enter the following optional fields as needed:
AdditionalModulesPaths: Enter a comma-separated list of absolute paths on the Redash server to Python modules to make available when querying from Redash. Useful for private modules unavailable in pip.
AdditionalBuiltins: Specify additional built-in functions as needed. By default, Redash automatically includes 25 Python built-in functions.
Click Save.
Run the following command in a new terminal to spin up an Apache Pinot Docker container in the quick start mode with a baseball stats dataset built in.
In Redash, select Queries > New Query, and then select the Python data source you created earlier. Add Python code to query data. Click Execute to run the query and view results.
You can also include libraries like Pandas to perform more advanced data manipulation on Pinot’s data and visualize the output with Redash.
The following query connects to Pinot and queries the baseballStats table to retrieve the top ten players with the highest scores. The results are transformed into a dictionary format supported by Redash.
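A minimal sketch of such a query script is shown below. It assumes the pinotdb client, a Pinot broker reachable on localhost:8099, and Redash's add_result_column/add_result_row helpers from the Python data source; treat the column names and the query itself as illustrative rather than this guide's literal code.

```python
# Sketch of a Redash Python data source query against Pinot.
# Assumes pinotdb is installed in the Redash worker and the broker is on
# localhost:8099; add_result_column/add_result_row are provided by Redash's
# Python data source.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
curs.execute("""
    SELECT playerName, SUM(runs) AS totalRuns
    FROM baseballStats
    GROUP BY playerName
    ORDER BY totalRuns DESC
    LIMIT 10
""")

# Build the result dictionary in the shape Redash expects
result = {}
add_result_column(result, "playerName", "Player", "string")
add_result_column(result, "totalRuns", "Total runs", "integer")
for player_name, total_runs in curs:
    add_result_row(result, {"playerName": player_name, "totalRuns": total_runs})
```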
In Redash, after you've run your query, click the New Visualization tab, and select the type of visualization you want to create, for example, Bar Chart. The Visualization Editor appears with your chart.
For example, you may want to create a bar chart to view the top 10 players with highest scores.
You may want to create a line chart to view the total variation in strikeouts over time.
Create a dashboard with one or more visualizations (widgets).
In Redash, go to Dashboards > New Dashboards. Add the widgets to your dashboard. For example, by adding the three visualizations from above, you create a baseball stats dashboard.
In this Apache Pinot guide, we'll learn how to visualize data using the Dash web framework.
In this guide you'll learn how to visualize data from Apache Pinot using Plotly's web framework. Dash is the most downloaded, trusted Python framework for building ML & data science web apps.
We're going to use Dash to build a real-time dashboard to visualize the changes being made to Wikimedia properties.
Real-Time Dashboard Architecture
We're going to use the following Docker compose file, which spins up instances of Zookeeper, Kafka, along with a Pinot controller, broker, and server:
docker-compose.yml
Run the following command to launch all the components:
Wikimedia provides a continuous stream of structured event data describing changes made to various Wikimedia properties. The events are published over HTTP using the Server-Sent Events (SSE) protocol. You can find the endpoint at: stream.wikimedia.org/v2/stream/recentchange
We'll need to install the SSE client library to consume this data:
Next, create a file called wiki.py that contains the following:
wiki.py
The highlighted section shows how we connect to the recent changes feed using the SSE client library.
Let's run this script as shown below:
We'll see the following (truncated) output:
Output
Now we're going to import each of the events into Apache Kafka. First let's create a Kafka topic called wiki_events with 5 partitions:
Create a new file called wiki_to_kafka.py and import the following libraries:
wiki_to_kafka.py
Add these functions:
wiki_to_kafka.py
And now let's add the code that calls the recent changes API and imports events into the wiki_events
topic:
wiki_to_kafka.py
The highlighted parts of this script indicate where events are ingested into Kafka and then flushed to disk.
If we run this script:
We'll see a message every time 100 messages are pushed to Kafka, as shown below:
Output
Let's check that the data has made its way into Kafka.
The following command returns the message offset for each partition in the wiki_events topic:
Output
Looks good. We can also stream all the messages in this topic by running the following command:
Output
Now let's configure Pinot to consume the data from Kafka.
We'll have the following schema:
schema.json
And the following table config:
table.json
The highlighted lines are how we connect Pinot to the Kafka topic that contains the events. Create the schema and table by running the following command:
Once you've done that, navigate to the Pinot UI and run the following query to check that the data has made its way into Pinot:
As long as you see some records, everything is working as expected.
Now let's write some more queries against Pinot and display the results in Dash.
First, install the following libraries:
Create a file called dashboard.py, then import the libraries and write a header for the page:
app.py
Connect to Pinot and write a query that returns recent changes, along with the users who made the changes, and domains where they were made:
app.py
The highlighted part of the query shows how to count the number of events from the last minute and the minute before that. We then do a similar thing to count the number of unique users and domains.
Now let's create some metrics based on that data.
First, let's create a couple of helper functions for creating these metrics:
dash_utils.py
And now let's add the following import to app.py:
app.py
And the following code at the end of the file:
app.py
Go back to the terminal and run the following command:
Navigate to the Dash app in your browser (the terminal output shows the local URL). You should see something like the following:
Next, let's add a line chart that shows the number of changes being done to Wikimedia per minute. Update app.py as follows:
app.py
Go back to the web browser and you should see something like this:
At the moment we need to refresh our web browser to update the metrics and line chart, but it would be much better if that happened automatically. Let's now add auto refresh functionality.
This will require some restructuring of our application so that each component is rendered from a function annotated with a callback that causes the function to be called on an interval.
The app layout now looks like this:
app.py
interval-component is configured to fire a callback every 1,000 milliseconds.
latest-timestamp is a container that will contain the latest timestamp.
indicators will contain indicators with the latest counts of users, domains, and changes.
time-series will contain the time series line chart.
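Putting the layout together with the interval wiring looks roughly like the self-contained sketch below; it is an illustration of the pattern rather than this guide's full app.py and only refreshes the timestamp container. The individual callbacks used in the guide are shown one by one after it.

```python
# Minimal illustration of the Interval-driven refresh pattern used above.
# This is a standalone sketch, not the guide's full app.py.
from datetime import datetime

import dash
from dash import dcc, html
from dash.dependencies import Input, Output

app = dash.Dash(__name__)

app.layout = html.Div([
    html.H1("Wiki Recent Changes Dashboard"),
    # Fires the callbacks below once every 1,000 milliseconds
    dcc.Interval(id="interval-component", interval=1000, n_intervals=0),
    html.Div(id="latest-timestamp"),
])

@app.callback(
    Output("latest-timestamp", "children"),
    Input("interval-component", "n_intervals"),
)
def update_timestamp(n_intervals):
    # Re-rendered on every interval tick
    return html.Span(f"Last updated: {datetime.now()}")

if __name__ == "__main__":
    app.run_server(debug=True)
```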
The timestamp is refreshed by the following callback function:
app.py
The indicators are refreshed by this function:
app.py
And finally, the following function refreshes the line chart:
app.py
If we navigate back to our web browser, we'll see the following:
The full script used in this example is shown below:
dashboard.py
In this guide we've learnt how to publish data into Kafka from Wikimedia's event stream, ingest it from there into Pinot, and finally make sense of the data using SQL queries run from Dash.