Logstash Connector POC

This repo contains boilerplate code for configuring a Logstash pipeline that replicates an RDS database into an AWS-hosted Elasticsearch cluster. It is being used to prove out Logstash and its ability to assist in that replication.

Installation

Development Setup

  • Set up a Python 3 virtual environment and install dependencies (see the sketch after this list)
  • Store a dev.env file at the root of your project with the configurations listed under Environment Configuration below
  • Configure an alias in your shell profile so that logstash executes the Logstash program on your machine. Example: alias logstash=/usr/local/bin/logstash
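
Concretely, setup looks something like the following sketch. The requirements.txt file and the shell profile path are assumptions; adjust them to your machine.

    # Create and activate a Python 3 virtual environment
    python3 -m venv venv
    source venv/bin/activate

    # Install dependencies (assumes a requirements.txt at the project root)
    pip install -r requirements.txt

    # Point the logstash alias at your local Logstash install
    echo 'alias logstash=/usr/local/bin/logstash' >> ~/.bash_profile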

Environment Configuration

You'll need to create a local dev.env file with the following settings. This file is used to populate your environment when calling app/config.py explicitly or when running run_logstash.sh. A hypothetical filled-in example follows the list of settings.

   ELASTICSEARCH_URL=""
   RDS_PASSWORD=""
   RDS_USERNAME=""
   JDBC_DRIVER_LIBRARY=""
   JDBC_CONNECTION_STRING=""
   SQL_DB_NAME=""
   JDBC_DRIVER_CLASS=""
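
As a reference point, here is one hypothetical PostgreSQL-flavored example. The repo does not pin a specific database or driver, so treat every value below as an assumption to be replaced with your own.

    ELASTICSEARCH_URL="http://localhost:9200"
    RDS_PASSWORD="changeme"
    RDS_USERNAME="admin"
    JDBC_DRIVER_LIBRARY="/path/to/postgresql-42.2.5.jar"
    JDBC_CONNECTION_STRING="jdbc:postgresql://my-instance.rds.amazonaws.com:5432/mydb"
    SQL_DB_NAME="mydb"
    JDBC_DRIVER_CLASS="org.postgresql.Driver"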

Running the Logstash Project

The entrypoint for this project is run_logstash.sh. If you step through the file, you will notice it does a few things (a condensed sketch follows the list):

  • python app/config.py is called, which parses your dev.env file and produces a dev_env.sh file that will export all your variables to your primary shell process
  • source dev_env.sh sources the aforementioned file.
  • Logstash's -r flag watches for changes in the .conf file that configures the pipeline. This is useful for debugging purposes.
  • The -f flag indicates that the next argument is the path to the configuration file.
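
Put together, the script boils down to something like this sketch (the configuration path is illustrative, not the repo's actual filename):

    python app/config.py                  # parse dev.env and generate dev_env.sh
    source dev_env.sh                     # export the variables into the current shell
    logstash -r -f path/to/pipeline.conf  # run the pipeline, reloading on config changes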

Things to note:

  • If running a pipeline whose output is an Elasticsearch instance, make sure Elasticsearch is up and running (one quick check is sketched below).
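
A quick way to verify this before kicking off a pipeline, assuming your dev.env variables have already been exported:

    source dev_env.sh
    curl -s "${ELASTICSEARCH_URL}/_cluster/health?pretty"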

Python Elasticsearch interface

Primary manual interface for interacting with Elasticsearch.

  • Mappings for each index are created in index_settings.py
  • Calling python app/interface.py will instantiate the mappings in the Elasticsearch instance for each index defined in index_settings.py
  • search.py will eventually be used to query against the Elasticsearch instance

Pipeline Configurations

Each ls_conf file is a different pipeline configuration for use with Logstash.

Each logstash.conf file is made up of 3 parts: input, filter, and output. The *_dev.conf files are a development workaround in which the output dumps results to a JSON file, avoiding the need to query an Elasticsearch index during quick iterations.

input
  • used to establish a connection with a database
  • establishes a secure connection with the database via JDBC (Java Database Connectivity)
  • runs a query against the database (queries stored in sql_scripts)
filter
  • parses and transforms the results of the input query and prepares them for delivery to the output
output
  • establishes a connection with the outbound data store (in this project, Elasticsearch)
  • there are also various textual output streams that can be used

For more detail, check out the Logstash docs (see Other Resources below).

Note the inline comments for more information...

input {
  # Connection via the JDBC input plugin. Each ${var} is picked up from the shell environment
  jdbc {
    jdbc_driver_library => "${JDBC_DRIVER_LIBRARY}"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_driver_class => "${JDBC_DRIVER_CLASS}"
    jdbc_user => "${RDS_USERNAME}"
    jdbc_password => "${RDS_PASSWORD}"
    # Path to the SQL query to be executed and consequently mapped to Elasticsearch
    statement_filepath => "sql_scripts/update_index.sql"
  }
}

filter {
  # Assumes the input query returns a jsonified RDS response in the "rows" field;
  # parse it into top-level fields and drop the raw "rows" field
  json {
    id => "id"
    source => "rows"
    remove_field => "rows"
  }
}

output {
  # ES configuration. Notice there are no mappings here. Mappings are configured by the Python interface as of now
  elasticsearch {
    document_id => "%{task_id}"
    document_type => "index"
    index => "indexs"
    codec => "json"
    hosts => "${ELASTICSEARCH_URL}"
  }
}

Other Resources

[logstash]: https://www.elastic.co/guide/en/logstash/current/index.html
