This repo is being used to prove out Logstash and its ability to assist in replicating an RDS database in Elasticsearch.
- Elasticsearch
- [Logstash][]
- JDBC Driver
- Set up a Python 3 virtual environment and install dependencies
- Store a `dev.env` file at the root of your project with the following configurations
- Configure an alias in your shell profile such that `logstash` executes the `logstash` program on your machine. Example: `alias logstash=/usr/local/bin/logstash`
You'll need to contribute a local `dev.env` file with the following configurations. This file will be used to populate your environment when calling `app/config.py` explicitly or when running the `run_logstash.sh` file.
```
ELASTICSEARCH_URL=""
RDS_PASSWORD=""
RDS_USERNAME=""
JDBC_DRIVER_LIBRARY=""
JDBC_CONNECTION_STRING=""
SQL_DB_NAME=""
JDBC_DRIVER_CLASS=""
```
The entrypoint for this project is `run_logstash.sh`. If you step through the file you will notice it's doing a few things:

- `python app/config.py` is called, which parses your `dev.env` file and produces a `dev_env.sh` file that exports all your variables to your primary shell process (see the sketch after this list)
- `source dev_env.sh` sources the aforementioned file
- `logstash` comes with a `-r` flag, which watches for changes in the `.conf` file that configures the pipeline. This is useful for debugging purposes.
- The `-f` flag indicates that the next argument is the path to the configuration file.
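For illustration, the `config.py` step might look something like the minimal sketch below. It assumes `dev.env` holds simple `KEY=value` pairs; this is an assumption drawn from the description above, not the repo's actual implementation.

```python
# Hypothetical sketch of app/config.py: read KEY=value pairs from dev.env
# and write a dev_env.sh that exports them into the calling shell.
from pathlib import Path

def build_dev_env(src: str = "dev.env", dst: str = "dev_env.sh") -> None:
    exports = []
    for raw in Path(src).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        exports.append(f"export {key.strip()}={value.strip()}")
    Path(dst).write_text("\n".join(exports) + "\n")

if __name__ == "__main__":
    build_dev_env()
```

`run_logstash.sh` can then `source dev_env.sh`, which is what lets Logstash resolve the `${var}` references in the pipeline configuration below.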
Things to note:

- If running a pipeline whose output is an Elasticsearch instance, make sure Elasticsearch is up and running (a quick liveness check is sketched below).
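For example, a minimal check (not part of this repo) could hit the cluster health endpoint using the `ELASTICSEARCH_URL` from your environment:

```python
# Hypothetical liveness check: confirm Elasticsearch is reachable before
# starting a pipeline whose output is an Elasticsearch instance.
import os
import urllib.request

url = os.environ["ELASTICSEARCH_URL"].rstrip("/") + "/_cluster/health"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())  # JSON cluster health; a connection error means ES is down
```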
`app/interface.py` is the primary manual interface for interacting with Elasticsearch.
- Mappings for each index are created in `index_settings.py`
- Calling `python app/interface.py` will instantiate the mappings in the Elasticsearch instance for each index defined in `index_settings.py` (see the sketch after this list)
- `search.py` will eventually be used to query against the Elasticsearch instance
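As a rough illustration of how those two files might fit together, here is a sketch assuming the official `elasticsearch` Python client (7.x-style `body=` argument). The index name and mapping fields are hypothetical placeholders, not the repo's actual settings:

```python
# Hypothetical index_settings.py: one mapping definition per index.
INDEX_SETTINGS = {
    "tasks": {  # placeholder index name
        "mappings": {
            "properties": {
                "task_id": {"type": "keyword"},
                "created_at": {"type": "date"},
            }
        }
    }
}

# Hypothetical app/interface.py: instantiate each index with its mappings.
import os
from elasticsearch import Elasticsearch

if __name__ == "__main__":
    es = Elasticsearch(os.environ["ELASTICSEARCH_URL"])
    for name, body in INDEX_SETTINGS.items():
        if not es.indices.exists(index=name):
            es.indices.create(index=name, body=body)
```

A future `search.py` would presumably reuse the same client and call its `search` API against these indices.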
Each `ls_conf` file is a different pipeline configuration for use with Logstash. Each Logstash `.conf` file is made up of three parts: input, filter, and output. The `*_dev.conf` files are a workaround in which the output dumps the results to a JSON file for quicker development, avoiding the need to query an Elasticsearch index.
Input

- used to establish a connection with a database
- establishes a secure connection with the database via JDBC (Java Database Connectivity)
- runs a query against the database (queries stored in `sql_scripts`)

Filter

- parses and configures results of the input query and prepares them for delivery to the output

Output

- establishes a connection with the outbound datastore (in this project, Elasticsearch)
- there are also various textual output streams that can be used
For more detail, check out the [Logstash docs](https://www.elastic.co/guide/en/logstash/current/index.html).
Note the inline comments for more information:

```
input {
  # Connection to the JDBC connector. Each ${var} is picked up from the shell environment
  jdbc {
    jdbc_driver_library => "${JDBC_DRIVER_LIBRARY}"
    jdbc_connection_string => "${JDBC_CONNECTION_STRING}"
    jdbc_driver_class => "${JDBC_DRIVER_CLASS}"
    jdbc_user => "${RDS_USERNAME}"
    jdbc_password => "${RDS_PASSWORD}"
    # Path to the SQL query to be executed and consequently mapped to Elasticsearch
    statement_filepath => "sql_scripts/update_index.sql"
  }
}

filter {
  # Assumes a jsonified RDS query response from the input above
  json {
    id => "id"
    source => "rows"
    remove_field => ["rows"]
  }
}

output {
  # ES configuration. Notice there are no mappings here. Mappings are configured by the Python interface as of now
  elasticsearch {
    document_id => "%{task_id}"
    document_type => "index"
    index => "indexs"
    codec => "json"
    hosts => "${ELASTICSEARCH_URL}"
  }
}
```
- Aggregate
- Mapping
- Indexing
[logstash]: https://www.elastic.co/logstash