RRAP (Real-time Reddit Analysis Platform)

A real-time trend analysis platform for Reddit's worldnews subreddit, capable of processing and visualizing the popularity of topics over time.

Features

Real-time data ingestion from Reddit using PRAW
Stream processing with Apache Flink
Persistent storage in Amazon RDS
Data visualization with Grafana

Prerequisites

Before you begin, ensure you meet the following requirements:

Java 8 or higher is installed on your system.
An active AWS account with an RDS instance setup.
Apache Flink and Kafka are installed and configured on your system or in your cloud environment.
Grafana is installed and configured for visualizing the processed data. Some necessary libraries' versions have been specified in the requirements.txt

Installation

To get this project up and running on your system, follow these steps:

Clone the project repository:

git clone https://github.com/yourusername/RRAP.git

Access Reddit Data Stream: Reddit offers a vast stream of data from various subreddits that can be used for real-time analytics.
Show instructions
1. You can access Reddit's data through their API. To do this, you'll need to create a Reddit account, register an application, and get your API credentials (client ID, client secret, and user agent).
2. Create a .env file in the root directory of the project and specify these arguments.
```
   REDDIT_CLIENT_ID='YOUR_CLIENT_ID'
   REDDIT_CLIENT_SECRET='YOUR_CLIENT_SECRE'
   REDDIT_USER_AGENT='YOUR_APP_NAME/version by /u/YOUR_REDDIT_USERNAME'
```
Running Apache Kafka:
1. Start the Kafka server: Run this command in the root directory of installed Kafka:
```
bin/kafka-server-start.sh config/server.properties
```
2. Create a Topic:
```
bin/kafka-topics.sh --create --topic worldnews --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
```
3. Use a Process Manager to Start Kafka Broker Automatically:
  - To ensure that your Kafka broker starts automatically when the EC2 instance boots, you can create a systemd service unit for Kafka: Create a new service file in /etc/systemd/system/, e.g., kafka.service:
```
sudo nano /etc/systemd/system/kafka_broker.service
```
  - Add the following content to the file, making sure to replace /path/to/kafka with the actual directory path where Kafka is installed:
```
[Unit]
Description=Apache Kafka Server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
After=zookeeper.service

[Service]
Type=simple
User=kafka
ExecStart=/path/to/kafka/bin/kafka-server-start.sh /path/to/kafka/config/server.properties
ExecStop=/path/to/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
```
  - After that, type following commands to Enable and Start Kafka Service:
```
sudo systemctl daemon-reload
sudo systemctl enable kafka_broker.service
sudo systemctl start kafka_broker.service
```
4. Use a Process Manager to Automatically Publish Data to Kafka Topics:
  - Create a new service file in /etc/systemd/system/, e.g., call_reddit_kafka.service:
```
sudo nano /etc/systemd/system/call_reddit_kafka.service
```
  - Add following to the file:
```
[Unit]
Description=Reddit Kafka Service
After=network.target

[Service]
User=ec2-user
ExecStart=/usr/bin/python3 /path/to/call-apis.py
Restart=always

[Install]
WantedBy=multi-user.target
```
5. Flink Stream Processing Application:
  - In the root folder of Flink, start Flink's local cluster with the following command: ./bin/start-cluster.sh
  - In the root folder of the app, compile your application into a JAR file.: ./gradlew jar
  - Submit the JAR to your Flink cluster using the Flink CLI: ./bin/flink run -c flinkJob.trendDetection.TrendDetectionJob ~/RRAP/flink-app/build/libs/flink-app.jar
  - Monitor the Flink Dashboard at http://localhost:8081 to see your job's status and metrics.
    Important Note: Ensure that the Apache Flink dashboard is accessible on port 8081.
  [Optional] Automate Flink Start on Boot:
  Show instructions
  * Create a Flink Service File (/etc/systemd/system/flink_listener.service):
```
 ```ini
       [Unit]
       Description=Apache Flink
       After=network.target
  
       [Service]
       Type=simple
       User=ubuntu
       ExecStart=/path/to/flink/bin/start-cluster.sh
       ExecStop=/path/to/flink/bin/stop-cluster.sh
       Restart=on-failure
       RestartSec=10
  
       [Install]
       WantedBy=multi-user.target
 ```
 
 * Enabel and start flink:

 ```sh
    sudo systemctl daemon-reload
    sudo systemctl enable flink_listener.service
    sudo systemctl start flink_listener.service
 ```
```
6. RDS Database:
  - Specify your AWS credentials, in your .env file like the following:
```
DB_USER="Your-RDS-User"
DB_PASSWORD="Your-RDS-Password"
DB_INSTANCE="Your-RDS-Instance-Endpoint"
DB_NAME="Your-RDS-DB-Name"
```
  - After your RDS instance is up and running, connect to it using a MySQL client and create your database and table. Here is the SQL command you would use:
```
CREATE DATABASE IF NOT EXISTS DB_NAME;

USE DB_NAME;

CREATE TABLE IF NOT EXISTS DB_TABLE (
 trend_id INT AUTO_INCREMENT PRIMARY KEY,
 title VARCHAR(255) NOT NULL,
 upvotes INT NOT NULL,
 timestamp DATE_FORMAT NOT NULL
);
```
  - Ensure you're security group allows inbound and outbound traffic for the port 3306.
7. Grafana Dashboard:
  - Install Grafana and Run Grafana Dashboard: Follow the instructions in this link.
  - Ensure you're security group allows inbound traffic for the port 3000.
  - Open Grafana dashboard at http://localhost:3000 to make your visualizations.

Contributing

Contributions to the Realtime Reddit Analysis Platform are welcome!

Fork the repository.
Create a new branch: git checkout -b branch_name.
Make changes and test.
Submit Pull Request with comprehensive description of changes.

Contributors

@AliDavoodi98

If you have any questions or want to reach out, please contact me on LinkedIn

License

This project is licensed under the MIT License - see the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

RRAP (Real-time Reddit Analysis Platform)

Features

Prerequisites

Installation

Important Note: Ensure that the Apache Flink dashboard is accessible on port 8081.

Contributing

Contributors

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

RRAP (Real-time Reddit Analysis Platform)

Features

Prerequisites

Installation

Important Note: Ensure that the Apache Flink dashboard is accessible on port 8081.

Contributing

Contributors

License