big-data-architecture

About

This repository contains code to consume Twosides, a comprehensive database of adverse side effects from pharmacological combinations. The dataset is 4.3GB and contains around 72 million combinations of drugs, side effects, and data-mined/FDA-reported propensity scores for each tuple.

The intention is twofold: to demonstrate the layers of the lambda architecture, and to provide a scalable platform to do research on this topic using matrix factorization and machine learning.

Functionality

The batch layer scripts:

  • Download and extract the Twosides data
  • Ingest the raw CSV data into HDFS
  • Create a Hive table containing the raw CSV data
  • Copy the raw Hive table into a new table in ORC format (see the sketch below)
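As a rough sketch, the final CSV-to-ORC step can be expressed as a single Hive CTAS statement (the table names here are illustrative; the actual schema is defined in run.sh):

# Hypothetical sketch of the raw-to-ORC conversion, not the exact DDL from run.sh
hive -e "CREATE TABLE twosides_orc STORED AS ORC AS SELECT * FROM twosides_csv;"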

The serving layer scripts:

  • Create an external HBase table from the ORC table (see the sketch below)
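In Hive this is typically done with the HBase storage handler. A minimal sketch, assuming an illustrative one-column mapping (only the table name and the 'interactions' column family below come from this project; the real scripts define the full schema):

# Hypothetical sketch using Hive's HBase storage handler
hive -e "
CREATE EXTERNAL TABLE twosides_hbase (rowkey STRING, prr DOUBLE)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,interactions:prr')
TBLPROPERTIES ('hbase.table.name' = 'hannifan_twosides_hbase');"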

The ui script:

  • Starts the Node.js web application on localhost:30000.
  • Generates tables with GET requests to HBase.
  • See localhost:30000/conditions.html for an example (a command-line check follows).
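Once the app is running, an endpoint can be exercised from the command line:

# Fetch the example page (assumes the app is running on this host)
curl http://localhost:30000/conditions.html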

The speed layer scripts:

  • Accepts user input from report.html and sends it to a Kafka broker.
  • The StreamReactions Scala object runs in the background via spark-submit; it reads from the Kafka broker and writes a speed table in HBase (see the sketch below).
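For testing, a record can be pushed to the same topic from the command line; the JSON payload here is a hypothetical example, not the exact format produced by report.html:

# Hypothetical smoke test: publish one record to the topic the form writes to
echo '{"drug1":"aspirin","drug2":"ibuprofen"}' | \
  sh /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh \
  --broker-list mpcs53014c10-m-6-20191016152730.us-central1-a.c.mpcs53014-2019.internal:6667 \
  --topic hannifan_latest_interactions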

Running on the cluster:

  1. Generate the batch layer: ingest the CSV data into HDFS and use Hive to create batch views
git clone https://github.com/timhannifan/big-data-architecture application
cd ~/application/src/batch-layer && sh run.sh
  2. Generate the serving layer: use Hive to create HBase tables for API consumption

First, create an HBase table with a column family for the complete data, then a second table for a view grouped by condition.

$ hbase shell
create 'hannifan_twosides_hbase', 'interactions'
create 'hannifan_twosides_conditions', 'c'
exit

Then run the Hive scripts in the serving-layer directory to create the HBase views:

cd ~/application/src/serving-layer && sh run.sh
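To sanity-check that the views were populated, scan a few rows of each table:

hbase shell
hbase> scan 'hannifan_twosides_hbase', {LIMIT => 5}
hbase> scan 'hannifan_twosides_conditions', {LIMIT => 5}
hbase> exit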
  3. Start the speed layer

Create a Kafka topic named hannifan_latest_interactions for incoming data:

cd /usr/hdp/current/kafka-broker/bin

# Create a Kafka topic named hannifan_latest_interactions
sh kafka-topics.sh --create --zookeeper mpcs53014c10-m-6-20191016152730.us-central1-a.c.mpcs53014-2019.internal:2181 --replication-factor 1 --partitions 1 --topic hannifan_latest_interactions

# Check that it exists
sh kafka-topics.sh --list --zookeeper mpcs53014c10-m-6-20191016152730.us-central1-a.c.mpcs53014-2019.internal:2181 | grep 'hannifan'

Create an HBase table for the real-time (combined serving and speed) data:

hbase shell
hbase> create 'hannifan_hbase_realtime', 'reactions'
  4. Start the web application on the webserver
gcloud compute ssh hannifan@webserver

# The app is transferred from the local machine; a copy already exists on the webserver.
cd ~/ui && node app.js

The application can then be viewed in a browser on port 30000 of the webserver.

  5. To simulate the speed layer, go to the submission form and submit the default information. It will be sent to the Kafka topic hannifan_latest_interactions, then consumed by a Spark job and added to the HBase table hannifan_hbase_realtime. To kick off the Spark job:
cd ~/application/src/speed-layer/speed_layer_reactions/target

spark-submit --class StreamReactions uber-speed_layer_reactions-0.0.1-SNAPSHOT.jar mpcs53014c10-m-6-20191016152730.us-central1-a.c.mpcs53014-2019.internal:6667
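After submitting the form, you can confirm the record reached the speed table:

hbase shell
hbase> scan 'hannifan_hbase_realtime', {LIMIT => 5}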

Running locally in a Docker container:

  1. Clone this repo. The second line runs a script to start Docker, copy the source code to the container, and log in as the root user.
git clone https://github.com/timhannifan/big-data-architecture application
cd application && sh run_docker.sh
  2. Install prerequisites and give the hadoop user ownership of the application directory.
sudo apt-get install xz-utils
sudo chown -R hadoop:hadoop /home/hadoop/application/
  3. Change to the hadoop user and start services.
sudo su hadoop
start-all.sh
start-hbase.sh
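If a later step fails, it is worth confirming the daemons came up before rerunning anything; on a standard Hadoop/HBase install, jps should list processes such as NameNode, DataNode, and HMaster:

# List running JVM processes to verify the Hadoop and HBase daemons started
jps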
  4. Run the batch layer.
cd ~/application/batch-layer && sh run.sh
  5. Create the HBase tables for the serving layer, then create the serving views.
$ hbase shell
create 'hannifan_twosides_hbase', 'interactions'
create 'hannifan_twosides_conditions', 'c'
exit

$ cd ~/application/serving-layer && sh run.sh
  6. Start the web application
$ cd ~/application/ui && node app.js