[Tiny EPIC] Support SOLR standalone on ECS #3826

Closed
28 of 39 tasks
jbrown-xentity opened this issue May 13, 2022 · 3 comments
Labels
CI/CD component/solr-service Related to Solr-as-a-Service, a brokered Solr offering component/ssb Epic Feature

Comments

@jbrown-xentity

jbrown-xentity commented May 13, 2022

Purpose

We want a security-compliant SOLR, but we're not sure how to get there.

Given this need, a trial ECS deployment with the SOLR8 image is needed to give us concrete knowledge for planning future steps.

Acceptance Criteria

[ACs should be clearly demo-able/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a CKAN system needs a SOLR instance
    WHEN 5 days have expired
    THEN CKAN can connect to SOLR
    AND an index is started
    AND the necessary stories are created to implement this in a production way
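The "CKAN can connect to SOLR" criterion could be demoed with a small connectivity check. The following is only a sketch, assuming the core URL is the same value CKAN uses for its `solr_url` setting, that the core exposes Solr's standard ping handler at `/admin/ping`, and that no authentication is needed for the request. The function names are hypothetical.

```python
import json
import urllib.request


def ping_url(core_url: str) -> str:
    """Build the ping-handler URL for a Solr core URL (e.g. CKAN's solr_url)."""
    return f"{core_url.rstrip('/')}/admin/ping?wt=json"


def ping_ok(response_body: str) -> bool:
    """Return True when a ping response body reports status OK."""
    try:
        return json.loads(response_body).get("status") == "OK"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_solr(core_url: str, timeout: float = 5.0) -> bool:
    """Call the ping handler; any network, HTTP or URL error counts as 'not connected'."""
    try:
        with urllib.request.urlopen(ping_url(core_url), timeout=timeout) as resp:
            return ping_ok(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
```

With authentication enabled on Solr, the request would also need credentials, which this sketch leaves out.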

Background

This is meant to work around security issues with supporting EKS in AWS.
We already deployed SOLR8 on EKS in AWS; that deployment should be portable to terraform.

Sketch

Feasibility Testing

  • Convert James' DOI Solr deployment to terraform (in datagov-brokerpak-solr repo)
    • Add Secure DNS
    • Add SSL
    • Add encryption (at-rest and in-transit) for EFS volume
    • Add authentication to Solr
    • Create a secure way of generating an initial Admin password
    • Enable logging for ECS services
    • Update appropriate IAM roles for ECS/EFS/LB/et cetera...
    • Add CloudMap Service Discovery for container-to-container dns communication
  • On branch of catalog, move back to a single Solr deployment
  • Push Solr Classic docker image to GHCR
  • Iterate until bugs are fixed
  • Connect to local/cloud.gov catalog instance (whichever is more convenient)
  • Validate solr search index works or solr harvest works (whichever is more convenient)
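For the "secure way of generating an initial Admin password" item above, one possible approach is to draw the password from the operating system's cryptographic random source. This is a sketch only, not what the brokerpak actually does: the function name, alphabet and minimum length are assumptions, and in terraform the `random_password` resource from the hashicorp/random provider serves a similar purpose.

```python
import secrets
import string

# Letters and digits only, so the password needs no escaping in URLs or shell commands.
ALPHABET = string.ascii_letters + string.digits


def generate_admin_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < 16:
        raise ValueError("refusing to generate a password shorter than 16 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The generated value would still need to be stored somewhere safe (for example, in the service binding credentials) rather than written to logs.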

Make service broker-able

  • Package terraform code into brokerpak definition
  • Rework processes and procedures to support AWS Credentials in repo
    • Copy local testing from eks-brokerpak
    • Setup Github Action secrets for CI tests to pass
  • (Local) Make sure solr-on-ecs service can provision/bind and unbind/deprovision
  • Preserve the original solrcloud service as a sibling service that the solr brokerpak supports
  • (Local) Make sure solr-cloud service can still provision/bind and unbind/deprovision
  • Get Github Action tests working
  • Solr on ECS GSA-TTS/datagov-brokerpak-solr#36 (merge this)

Follow-on work

  • Test reindex speed on a production-comparable DB size
  • Test harvesting reliability
  • Perform load testing to gauge the performance of solr standalone
  • Make a new ticket for Leader-Follower paradigm (if necessary)
  • Make a new ticket for EFS performance boosting (if necessary)

Example: Add Leader-Follower paradigm

  • Make a new PR
  • Add new ECS services for the followers
  • Create script to initialize Follower containers
  • Ensure Leader and Follower Solr instances can communicate
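One way to check that the Leader and a Follower are communicating is to compare what each reports for its index. The sketch below assumes both instances expose Solr's replication handler at `/replication`, and that the `indexversion` command returns JSON containing `indexversion` and `generation` fields; the function names and the core URLs passed in are hypothetical.

```python
import json


def replication_url(core_url: str, command: str = "indexversion") -> str:
    """Build a replication-handler URL for a Solr core URL."""
    return f"{core_url.rstrip('/')}/replication?command={command}&wt=json"


def follower_in_sync(leader_body: str, follower_body: str) -> bool:
    """Compare the index version and generation reported by leader and follower."""
    leader = json.loads(leader_body)
    follower = json.loads(follower_body)
    # Assumed response fields: "indexversion" and "generation".
    return (leader["indexversion"], leader["generation"]) == (
        follower["indexversion"],
        follower["generation"],
    )
```

A follower that has not yet pulled the latest commit will report an older generation, so a mismatch right after indexing is not necessarily an error; a mismatch that persists past the polling interval is the signal worth investigating.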

Example: Boost EFS performance

@nickumia-reisys

List of references (in case we ever need them..):

@nickumia-reisys

As a retrospective comment to anyone who wanders onto this ticket (probably future-me 😅), this was a very important effort in the cloud migration of Data.gov apps, specifically https://catalog.data.gov and https://inventory.data.gov. I attempted to link all of the follow-on work that came after this pivotal ticket and helped catapult the migration over the finish line; however, there are probably a few items that still slipped through the cracks.

I consider this one of my biggest contributions to Data.gov. There were many possible paths for how Data.gov could procure a production Solr setup. Prior to me joining Data.gov, there was work to create Solr as an application on cloud.gov. This proved impractical because CloudFoundry only allows a maximum of 6GB of persistent storage and our Solr instance (as of writing) requires ~22GB. The next step was converting the app into a custom cloud.gov service based on Apache's solr-operator/solrcloud and AWS's EKS architecture. There were many forks and convoluted paths that we (@mogul, me and others before me) struggled with in its design. The EKS Brokerpak is still alive and very much practical for other projects. While this path never hit a hard wall or dead end, the entire design was overly complex and had many, many moving parts. It was abandoned after a host of mystifying errors and bugs with indiscernible causes, specifically around solrcloud. It was at this point that inspiration from @jbrown-xentity led us to a pure Solr implementation on ECS (this ticket). This simplified our design on two fronts: (1) ECS had better defaults than EKS for us. It was like using docker-compose over kubernetes. (2) Data.gov was more familiar with solr than solrcloud. It had a Solr deployment on its older platform, and the team had more confidence that Solr would be stable.

In terms of how this was implemented, I don't really like it (...I know, I wrote it). As a relatively large user and producer of open-source code, Data.gov strives to stick closely with the communities we pull from and give back meaningful contributions as well. The less customization we have, the better we're able to develop, stay secure and remain integrated. Our Solr deployment is very special. It is at the intersection of many open-source communities ... Solr ... CKAN ... AWS ... Terraform ... Cloud.gov ... Brokerpaks ... (and maybe a few more). I feel like there were too many customizations to this code to meet the unspoken requirements of Data.gov. This is code we had to write because we couldn't borrow from something that already existed, and it isn't abstracted well enough for others to use it for anything else. I believe this was a necessary evil for the position we were in, but this code will likely face a painful death in the future.

With more than 3 months of production catalog (and 4 months of production inventory) using this code, I think it's safe to say that it is rather successful. There were a few bugs and concerns after the initial release, but thanks to @FuhuXia's diligence, we've been able to monitor Solr's performance and health and ensure problems are taken care of. Since the initial release of the Leader-Follower design, there have not been major changes to the core code or infrastructure. And I take that as a win.

I'm leaving this comment as a reminder for me and as counsel for whoever this code may affect in the future. This endeavor was an unwelcomingly large part of my life for almost a year. It wasn't very fun working on this. Did I learn a lot? Yes. Was it challenging? I don't think for the right reason haha, but yes. Probably the only thing that got me through this was the encouragement, support and guidance that I received from my team. Very warm and hearty thanks to @mogul @jbrown-xentity @FuhuXia 🙇

Follow-on tickets:
