Major documentation structure for helping our team debug issues #1167

Closed
11 of 13 tasks
choldgraf opened this issue Apr 3, 2022 · 13 comments

@choldgraf
Member

choldgraf commented Apr 3, 2022

Context

During our latest incident on CarbonPlan, @yuvipanda and I talked about ways to lower the barrier for people to debug our cloud infrastructure when incidents occur. We agreed it would be helpful to have some basic documentation to help our team members get started with debugging common things.

Proposal

We've got an outline of major areas of documentation to write to make it easier for people to use our docs in the debugging/operations process. Here's that document:

https://hackmd.io/omFosDsjS3-2UEtAIZstcA

Here are the documents outlined there that we should create

We should flesh out the major sections in that HackMD, and update this issue as we do so with refs to PRs that implement things. We can close the issue once we've got docs that cover the major parts of that HackMD.

Updates and actions

@yuvipanda
Member

A more structured way to think about this is in terms of the 'objects' we have and the 'actions' that can be performed on them.

A starter pack would be:

| Object | Verbs |
| --- | --- |
| Cloud account | Provision, authenticate, view web console, etc. |
| Kubernetes cluster | Authenticate, explore objects in kubectl, look at logs, etc. |
| Hub | View full config of, change config of, create, decommission, deploy change to, look at logs for, find out community rep, etc. |
| User image | Create repo, set up image, debug changes, etc. |

A lot of these have detailed documentation and training provided elsewhere (kubectl, for example), so we should strike a balance between writing our own custom docs, linking to upstream docs, and improving upstream docs.
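
To make the object/verb idea concrete, here is a minimal sketch that stores the starter pack as data and expands it into candidate guide titles. The dictionary contents are paraphrased from the table above; the "How do I ...?" title phrasing and the names `OBJECT_VERBS` and `guide_titles` are assumptions for illustration, not an agreed structure.

```python
# Hypothetical sketch: the object/verb starter pack as plain data.
# Verb phrasings are lightly reworded so they read naturally in a title.
OBJECT_VERBS = {
    "cloud account": [
        "provision",
        "authenticate to",
        "view the web console of",
    ],
    "Kubernetes cluster": [
        "authenticate to",
        "explore objects in",
        "look at logs for",
    ],
    "hub": [
        "view the full config of",
        "change the config of",
        "create",
        "decommission",
        "deploy a change to",
        "look at logs for",
        "find the community rep of",
    ],
    "user image": [
        "create a repo for",
        "set up",
        "debug changes to",
    ],
}


def guide_titles(object_verbs):
    """Return one 'How do I <verb> a <object>?' title per (object, verb) pair."""
    return [
        f"How do I {verb} a {obj}?"
        for obj, verbs in object_verbs.items()
        for verb in verbs
    ]


if __name__ == "__main__":
    for title in guide_titles(OBJECT_VERBS):
        print(title)
```

A list like this could serve as a checklist: each title is either covered by our own docs, covered by a link upstream, or still missing.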

@yuvipanda
Member

I think it'll also help us identify required training that we can ask folks to take as part of onboarding so they have time set aside to learn how to use and navigate the tools we use. We stand on the shoulders of giants and we gotta use them!

@damianavila
Contributor

This exercise would also help us identify which pieces we can build to help in the process (e.g. a quick way to get a command line ready to run kubectl commands and start debugging).
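
As a rough illustration of that kind of helper, the sketch below only builds a few standard kubectl command lines for a first look at a hub; it does not run anything. The function names, the `staging` namespace, and the `hub` deployment name are assumptions for the example, not a description of existing tooling.

```python
import shlex


def debug_commands(namespace, deployment="hub", tail=100):
    """Build (but do not run) kubectl commands for a first look at a hub.

    `namespace` and `deployment` are placeholders; the real names depend
    on how the cluster and hub were deployed.
    """
    ns = ["--namespace", namespace]
    return [
        # List pods to spot anything not in the Running state.
        ["kubectl", "get", "pods", *ns],
        # Show recent events, oldest first.
        ["kubectl", "get", "events", *ns, "--sort-by=.lastTimestamp"],
        # Tail the logs of the hub deployment.
        ["kubectl", "logs", *ns, f"deployment/{deployment}", f"--tail={tail}"],
    ]


def as_shell(commands):
    """Render each argument list as a copy-pasteable shell line."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in as_shell(debug_commands("staging")):
        print(line)
```

Printing the commands rather than executing them keeps the helper safe to run anywhere and leaves authentication to whatever process the docs end up describing.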

@yuvipanda
Member

We are working on structure for more docs to be written to make support steward work less stressful - https://hackmd.io/omFosDsjS3-2UEtAIZstcA.

@choldgraf changed the title from "Chris and Yuvi document some cluster debugging steps" to "Major documentation structure for helping our team debug issues" on Apr 13, 2022
@choldgraf
Member Author

choldgraf commented Apr 21, 2022

I see that the two PRs attached to this one have now been merged. Can we close this one? If not, can we define some glanceable deliverables that we can use to know when this issue should be closed?

@yuvipanda
Member

@choldgraf I've lifted the list of documents from the HackMD to the issue body. We can close this when those are written!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 26, 2022
- Reword title to complete 'How do I...?'
- Remove cmd-access.md, and link instead to tutorials/getting-started,
  which has the same content.
- Document the new behavior of health checks
- Reword to emphasize that local deploys are ok but you *must* get them
  into CI asap, with reasoning.

Ref 2i2c-org#1167
@yuvipanda
Member

From #1314 (comment), we should also develop guides for GPU debugging.

@choldgraf
Member Author

added to the list at the top 👍

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 15, 2022
Docker images with data science related packages can be *huge*, and very
difficult to build locally! We run a remote [docker-in-docker hack](https://gist.github.com/yuvipanda/48100eb9e15dae808052c7dc9fb22edb)
on our 2i2c cluster to make this a lot more painless. This document describes
*how* you can use this to build docker images from your laptop much faster.
This frees up your laptop's resources and gives you datacenter-scale
upload / download speeds.

Ref 2i2c-org#1167
@choldgraf
Member Author

I'd like to unassign myself from this one - I am more than happy to help with documentation, but I don't think that I will have the time to spearhead any of the efforts on this one. I would be happy to be tagged-in as a support person though, or to review PRs etc.

@damianavila
Contributor

damianavila commented Oct 26, 2022

  1. Break this up into smaller issues (@damianavila).
  2. Revisit our documentation to see if the checkboxes are actually relevant (@sgibson91 and @yuvipanda).

@sgibson91
Member

sgibson91 commented Oct 27, 2022

Comment migrated to #1826 (comment).

@damianavila
Contributor

@sgibson91, I moved your comment over to the dedicated issue referenced above.

@yuvipanda
Member

I'm going to close this one, as I think whatever improvements were made during this push are complete by now.
