"atlantis apply" intermittently gets stuck when running terraform that opens github PRs #4892

Open
transient1 opened this issue Sep 3, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@transient1

transient1 commented Sep 3, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We have 13 Terraform files that set up and control 13 repositories. These files all live in a "parent" repo. Each file in the parent repo defines a git_commit changeset (using this provider https://github.com/arl-sh/terraform-provider-git) and a github_repository_pull_request using the official GitHub Terraform provider. Each of the target repos has an atlantis.yaml file at its root that points to a directory that holds Terraform. The parent repo also has an atlantis.yaml file at its root that points to the current directory.

When we open a PR in the parent repo, atlantis plan runs and completes. Then we comment atlantis apply. At this point the Atlantis user, which has a GitHub PAT that gives it permissions on the target repos, runs Terraform that is supposed to open PRs against the target repos, where each commit consists of the files designated in the git_commit_changeset resource. Sometimes this works without a hitch. Other times, in the Atlantis UI for the parent repo, we see entries like the following:

github_repository_pull_request.${PR1 NAME}: Still creating.... [16m30s elapsed]
github_repository_pull_request.${PR2 NAME}: Still creating.... [35m20s elapsed]
...

It hangs repeating this message (with incrementing times) for every target repo until we force restart the statefulset.
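
To spot this condition without watching the UI, the progress lines above can be scanned for resources whose elapsed time keeps growing. The following is a minimal Python sketch, not part of Atlantis or Terraform: the function names and the 10-minute threshold are hypothetical, and it assumes the output lines follow the "Still creating... [XmYs elapsed]" shape shown above.

```python
import re

# Matches progress lines shaped like the ones quoted in this issue, e.g.
#   github_repository_pull_request.example: Still creating.... [16m30s elapsed]
# The resource name may contain spaces, so it is matched lazily up to the colon.
PROGRESS = re.compile(
    r"^(?P<resource>.+?): Still creating\.+ "
    r"\[(?P<elapsed>(?:\d+h)?(?:\d+m)?(?:\d+s)?) elapsed\]"
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def elapsed_seconds(text):
    """Convert a duration such as '16m30s' into a number of seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def stuck_resources(lines, threshold_seconds=600):
    """Return {resource: seconds} for resources whose most recent
    progress line reports more than threshold_seconds elapsed."""
    latest = {}
    for line in lines:
        match = PROGRESS.match(line.strip())
        if match:
            latest[match.group("resource")] = elapsed_seconds(match.group("elapsed"))
    return {name: secs for name, secs in latest.items() if secs > threshold_seconds}
```

Feeding the captured apply output to stuck_resources gives a list of resources that could trigger an alert or a manual restart. It only detects the hang and does nothing to explain or fix it.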

Reproduction Steps

This might be an issue of scale, so I'm not sure it can be easily reproduced. Essentially you'd need a setup like the above, where one repo is responsible for having Atlantis run Terraform that opens PRs against a number of target repos.

Logs

{"level":"warn","ts":"2024-09-03T19:39:41.554Z","caller":"events/events_controller.go:747","msg":"payload signature check failed","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:747\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).handleGithubPost\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:161\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).Post\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:104\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2938\nnet/http.(*conn).serve\n\tnet/http/server.go:2009"}

The above is the only thing we see in the logs.
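
For context on that warning: GitHub signs each webhook delivery with an HMAC-SHA256 of the request body, keyed by the webhook secret, and sends it in the X-Hub-Signature-256 header as "sha256=<hex digest>". A "payload signature check failed" message therefore suggests that some request reached the events endpoint with a signature that did not match the secret configured in ATLANTIS_GH_WEBHOOK_SECRET. The Python sketch below illustrates the general scheme only. It is not Atlantis's implementation, the function name is hypothetical, and nothing in the log establishes whether the warning is related to the hang.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the raw body.

    secret is the shared webhook secret and body is the unmodified request
    payload; header_value is the header as received, e.g. 'sha256=ab12...'.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the mismatch.
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

A mismatch can come from a webhook configured with a different secret than the server, or from a request that did not originate from GitHub at all, so comparing the webhook settings on each repo with the server's secret is one way to narrow it down.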

Environment details

  • Atlantis version: v0.25.0

  • Deployment method: ArgoCD (templated Helm manifests)

  • If not running the latest Atlantis version have you tried to reproduce this issue on the latest version: no

  • Atlantis flags:

      - name: ATLANTIS_FAIL_ON_PRE_WORKFLOW_HOOK_ERROR
        value: "true"
      - name: ATLANTIS_GH_ORG
        value: REDACTED
      - name: ATLANTIS_HIDE_PREV_PLAN_COMMENTS
        value: "true"
      - name: ATLANTIS_LOG_LEVEL
        value: info
      - name: ATLANTIS_SILENCE_ALLOWLIST_ERRORS
        value: "true"
      - name: ATLANTIS_SILENCE_NO_PROJECTS
        value: "false"
      - name: ATLANTIS_SILENCE_VCS_STATUS_NO_PLANS
        value: "false"
      - name: GITHUB_OWNER
        value: REDACTED
      - name: TF_CLI_CONFIG_FILE
        value: REDACTED
      - name: ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT
        value: "true"
      - name: ATLANTIS_DATA_DIR
        value: /atlantis-data
      - name: ATLANTIS_REPO_ALLOWLIST
        value: REDACTED
      - name: ATLANTIS_PORT
        value: REDACTED
      - name: ATLANTIS_REPO_CONFIG
        value: REDACTED
      - name: ATLANTIS_ATLANTIS_URL
        value: REDACTED
      - name: ATLANTIS_GH_USER
        value: REDACTED
      - name: ATLANTIS_GH_TOKEN
        valueFrom:
          secretKeyRef:
            REDACTED
      - name: ATLANTIS_GH_WEBHOOK_SECRET
        valueFrom:
          secretKeyRef:
            REDACTED
    

Atlantis server-side config file:
Nothing here but pre-workflow hooks to copy necessary secrets and tokens from vault

Repo atlantis.yaml file:

version: 3
automerge: true
projects:
- name: REDACTED
  dir: "./"
  workspace: "default"

We're running Atlantis as a statefulset in a Kubernetes cluster. Due to our setup, multiple people can be working on the same parent repo and attempting to run Atlantis at the same time, in which case Atlantis responds that it can't run an apply because another PR holds the lock. When that occurs we either wait until the other PR has been applied and merged, or we run atlantis unlock on the other PR and then run the one we want. We're not sure whether this is a contributing factor.

Terraform state is kept in an S3 bucket.

Additional Context

@transient1 transient1 added the bug Something isn't working label Sep 3, 2024
@anryko
Contributor

anryko commented Sep 13, 2024

This doesn't look like an Atlantis issue to me. Atlantis is just executing your Terraform code, which uses the above-mentioned terraform-provider-git provider. The issue must be in the provider, which is hanging on a failed interaction with the GitHub API.
