
Report physical disks through the inventory system #5145

Merged: 17 commits merged into main from disk-in-inventory on Mar 13, 2024

Conversation

@smklein (Collaborator) commented Feb 27, 2024:

  • In Sled Agent, send info about physical storage up to Nexus
  • In Nexus, process physical disks and add them to inventory in a new inv_physical_disk table (a rough sketch of the row shape follows below)
  • This PR also adjusts inventory tests to account for this new "physical disk" information being added alongside sleds, and adds some disks to the "representative inventory data"

Fixes #5150
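For orientation, here is a minimal sketch of the kind of row inv_physical_disk holds per collection. The struct and field names are illustrative assumptions, not the PR's actual nexus-db-model definitions:

    use uuid::Uuid;

    // Illustrative shape of a row in `inv_physical_disk`; the real model in
    // nexus-db-model may differ in field names and types.
    pub struct InvPhysicalDiskRow {
        // Which inventory collection this observation belongs to.
        pub inv_collection_id: Uuid,
        // The sled that reported the disk.
        pub sled_id: Uuid,
        // Physical slot on the sled (added during review; also used for paging).
        pub slot: i64,
        // Identity as reported by the sled agent.
        pub vendor: String,
        pub serial: String,
        pub model: String,
    }

    fn main() {
        let row = InvPhysicalDiskRow {
            inv_collection_id: Uuid::nil(),
            sled_id: Uuid::nil(),
            slot: 0,
            vendor: "fake-vendor".to_string(),
            serial: "fake-serial".to_string(),
            model: "fake-model".to_string(),
        };
        println!("disk {} in slot {}", row.serial, row.slot);
    }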

@smklein smklein marked this pull request as ready for review February 28, 2024 00:04
@jgallagher (Contributor) left a comment:

LGTM; just two tiny nits picking at memory usage that probably doesn't matter - feel free to ignore either or both.

nexus/db-queries/src/db/datastore/inventory.rs (two resolved review threads, now outdated)
@@ -0,0 +1,13 @@
CREATE TABLE IF NOT EXISTS omicron.public.inv_physical_disk (
Collaborator:

Can/should we include the slot number here? (if so, can/should we use that for pagination/uniqueness?)

@smklein (Collaborator Author):

Sure, we'll probably want that out of inventory anyway. Also, I agree, this seems like a better PK ("sled id" + "slot") than the reported values of serial/model/vendor.

Updated to plumb the slot through from the sled agent, and it's now used as the pagination key in this PR.
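As an aside, a rough sketch of why ("sled id", "slot") makes a convenient pagination key: it is unique and totally ordered per sled. This is a generic in-memory illustration with made-up types, not the datastore's actual Diesel pagination helpers:

    use uuid::Uuid;

    // Cursor for paging through physical disks: (sled id, slot) is unique and
    // totally ordered, unlike the reported vendor/model/serial strings.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
    struct DiskCursor {
        sled_id: Uuid,
        slot: i64,
    }

    // Illustrative page fetch: rows strictly after `after` in cursor order,
    // capped at `page_size`. The real datastore does this in SQL.
    fn next_page(
        rows: &[(DiskCursor, String)],
        after: Option<DiskCursor>,
        page_size: usize,
    ) -> Vec<(DiskCursor, String)> {
        let mut page: Vec<_> = rows
            .iter()
            .filter(|(cursor, _)| after.map_or(true, |a| *cursor > a))
            .cloned()
            .collect();
        page.sort_by_key(|(cursor, _)| *cursor);
        page.truncate(page_size);
        page
    }

    fn main() {
        let rows = vec![
            (DiskCursor { sled_id: Uuid::nil(), slot: 1 }, "disk-b".to_string()),
            (DiskCursor { sled_id: Uuid::nil(), slot: 0 }, "disk-a".to_string()),
        ];
        println!("{:?}", next_page(&rows, None, 1));
    }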

@@ -750,6 +752,17 @@ impl SledAgent {
self.config.hardware.reservoir_ram,
)
.context("reservoir_size")?,
disks: self
.storage
.lock()
Collaborator:

Are there any situations where this lock might be held for a long time (thinking tens of seconds or minutes)?

@smklein (Collaborator Author):

I don't believe so, especially not since this is the simulated sled agent. At a cursory glance, I think it's mostly just getting locked to populate / grab contents from a group of hashmaps.
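Roughly the access pattern being described, as a sketch with invented types rather than the simulated sled agent's real ones: the lock is taken only long enough to clone the current contents out.

    use std::collections::HashMap;
    use std::sync::Mutex;

    // Stand-in for the simulated sled agent's storage state: a handful of
    // maps guarded by one lock.
    struct SimStorage {
        disks: HashMap<String, String>, // identity -> simulated disk info
    }

    // The lock is held only while cloning the current contents; no I/O or
    // other long-running work happens under it.
    fn snapshot_disks(storage: &Mutex<SimStorage>) -> Vec<(String, String)> {
        let inner = storage.lock().unwrap();
        inner.disks.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }

    fn main() {
        let storage = Mutex::new(SimStorage { disks: HashMap::new() });
        println!("{} disks", snapshot_disks(&storage).len());
    }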

use db::schema::inv_physical_disk::dsl;

let mut iter =
disks.chunks(SQL_BATCH_SIZE.get().try_into().unwrap());
Collaborator:

In general I like the idea of inserting in batches. But this is all going into one transaction so I'm not sure it's any better in that case.

@smklein (Collaborator Author):

The batch size is still fairly large (1,000), so it's unlikely we'll be looping much on a single rack or two. But I could see this blowing past the size of a "single SQL statement" if we consider larger multi-rack cases?
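For illustration, the chunking shape under discussion, sketched without the Diesel specifics (insert_batch here is a hypothetical stand-in): each chunk becomes one INSERT statement, while all chunks still run inside the same transaction.

    // Hypothetical stand-in for one multi-row INSERT; in the real code each
    // chunk becomes a Diesel insert inside the transaction.
    fn insert_batch(chunk: &[u64]) {
        println!("inserting {} rows in one statement", chunk.len());
    }

    fn insert_all(disk_rows: &[u64]) {
        // SQL_BATCH_SIZE is 1,000 in the datastore, so one or two racks
        // rarely loop more than once, but chunking keeps any single
        // statement bounded even for much larger fleets.
        const BATCH_SIZE: usize = 1000;
        for chunk in disk_rows.chunks(BATCH_SIZE) {
            insert_batch(chunk);
        }
    }

    fn main() {
        let rows: Vec<u64> = (0..2500).collect();
        insert_all(&rows); // three statements: 1000 + 1000 + 500 rows
    }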

@smklein (Collaborator Author) commented Mar 12, 2024:

Thank you @jgallagher and @davepacheco for the feedback! Apologies for the delay in responding; I've been building off this PR in #5172 and trying to make sure it'll work -- seems like yes.

I believe I've responded to all feedback -- I know y'all gave me the +1 to merge, but I'll wait another ~half day in case there are any other comments.

@smklein (Collaborator Author) commented Mar 12, 2024:

reconfigurator-cli::test_basic test_complex looks like it's failing, and I think updating it requires a particular lab environment, so I guess this PR is blocked for the moment. This seems like a bit of a hard bottleneck (it can only be safely edited from a single, specific deployment)?

InvPhysicalDisk::new(collection_id, *sled_id, disk.clone())
})
})
.collect();
Contributor:

Tiny nit: I think you can drop this collect() (and the corresponding : Vec<_> type hint), make this mut, and then remove the disks.into_iter() call on line 659.

@smklein (Collaborator Author):

This causes the following compilation error:

error: implementation of `FnOnce` is not general enough
  --> nexus/src/app/background/inventory_collection.rs:58:9
   |
58 | /         async {
59 | |             match inventory_activate(
60 | |                 opctx,
61 | |                 &self.datastore,
...  |
88 | |         }
89 | |         .boxed()
   | |________________^ implementation of `FnOnce` is not general enough
   |
   = note: closure with signature `fn(&'0 nexus_types::inventory::PhysicalDisk) -> InvPhysicalDisk` must implement `FnOnce<(&'1 nexus_types::inventory::PhysicalDisk,)>`, for any two lifetimes `'0` and `'1`...
   = note: ...but it actually implements `FnOnce<(&nexus_types::inventory::PhysicalDisk,)>`

I believe disks.into_iter() is called within a transaction, which is where the FnOnce comes from?

I think I'm going to keep this pattern of "collect it into a Vec, pass that Vec to the FnOnce" for the moment, because the lifetime complexity doesn't seem worth trying to optimize right now.
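A stripped-down sketch of the pattern being kept: collect into an owned Vec first, then move it into the FnOnce closure. run_transaction below is a hypothetical stand-in for the real async Diesel transaction wrapper, not its actual API.

    // Generic stand-in for the datastore's transaction wrapper, which takes
    // an `FnOnce` closure (the real one is async and Diesel-based).
    fn run_transaction<T>(body: impl FnOnce() -> T) -> T {
        body()
    }

    fn insert_disks(disk_serials: &[&str]) -> usize {
        // Materialize the rows first so the closure captures an owned Vec
        // rather than borrowing from an iterator chain; moving owned data
        // into the `FnOnce` sidesteps the higher-ranked lifetime error
        // shown above.
        let rows: Vec<String> =
            disk_serials.iter().map(|serial| serial.to_string()).collect();
        run_transaction(move || rows.len())
    }

    fn main() {
        println!("inserted {}", insert_disks(&["serial-0", "serial-1"]));
    }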

Contributor:

Grrrr, this is very annoying. I checked that diff against nexus-db-queries and it was fine, but didn't check whether it caused compilation errors down the line. That feels like a bug - if my function signature hasn't changed, why can implementation changes cause client breakage?

@jgallagher (Contributor) commented Mar 12, 2024:

Thanks to @cbiffle, who found we're hitting rust-lang/rust#99492 here; this ought to work. I found a couple of weird-looking workarounds using hints from that issue:

@@ -127,15 +127,10 @@ impl DataStore {
             })
             .collect::<Result<Vec<_>, Error>>()?;
         // Pull disks out of all sled agents
-        let disks: Vec<_> = collection
-            .sled_agents
-            .iter()
-            .flat_map(|(sled_id, sled_agent)| {
-                sled_agent.disks.iter().map(|disk| {
-                    InvPhysicalDisk::new(collection_id, *sled_id, disk.clone())
-                })
-            })
-            .collect();
+        let mut disks =
+            collection.sled_agents.iter().flat_map(|(sled_id, sled_agent)| {
+                std::iter::repeat(sled_id).zip(sled_agent.disks.iter())
+            });

         // Partition the sled agents into those with an associated baseboard id
         // and those without one.  We handle these pretty differently.
@@ -656,10 +651,18 @@ impl DataStore {
                 use db::schema::inv_physical_disk::dsl;

                 let batch_size = SQL_BATCH_SIZE.get().try_into().unwrap();
-                let mut disks = disks.into_iter();
                 loop {
-                    let some_disks =
-                        disks.by_ref().take(batch_size).collect::<Vec<_>>();
+                    let some_disks = disks
+                        .by_ref()
+                        .take(batch_size)
+                        .map(|(&sled_id, disk)| {
+                            InvPhysicalDisk::new(
+                                collection_id,
+                                sled_id,
+                                disk.clone(),
+                            )
+                        })
+                        .collect::<Vec<_>>();
                     if some_disks.is_empty() {
                         break;
                     }

or if we don't like collection_id floating around, it can go in the repeat too:

@@ -127,15 +127,11 @@ impl DataStore {
             })
             .collect::<Result<Vec<_>, Error>>()?;
         // Pull disks out of all sled agents
-        let disks: Vec<_> = collection
-            .sled_agents
-            .iter()
-            .flat_map(|(sled_id, sled_agent)| {
-                sled_agent.disks.iter().map(|disk| {
-                    InvPhysicalDisk::new(collection_id, *sled_id, disk.clone())
-                })
-            })
-            .collect();
+        let mut disks =
+            collection.sled_agents.iter().flat_map(|(sled_id, sled_agent)| {
+                std::iter::repeat((collection_id, *sled_id))
+                    .zip(sled_agent.disks.iter())
+            });

         // Partition the sled agents into those with an associated baseboard id
         // and those without one.  We handle these pretty differently.
@@ -656,10 +652,18 @@ impl DataStore {
                 use db::schema::inv_physical_disk::dsl;

                 let batch_size = SQL_BATCH_SIZE.get().try_into().unwrap();
-                let mut disks = disks.into_iter();
                 loop {
-                    let some_disks =
-                        disks.by_ref().take(batch_size).collect::<Vec<_>>();
+                    let some_disks = disks
+                        .by_ref()
+                        .take(batch_size)
+                        .map(|((collection_id, sled_id), disk)| {
+                            InvPhysicalDisk::new(
+                                collection_id,
+                                sled_id,
+                                disk.clone(),
+                            )
+                        })
+                        .collect::<Vec<_>>();
                     if some_disks.is_empty() {
                         break;
                     }

I don't know that either of these is worth it just to save the intermediate .collect(), but at least we (kind of) understand it.

Contributor:

Yeah I ran into this yesterday. It's a pretty bad bug in rustc.

@@ -0,0 +1,828 @@
{
Contributor:

Is this file used? I recently ran into something where I considered updating the RSS on-disk plan format, and found a way to not do that to avoid the hassle.

@smklein (Collaborator Author) commented Mar 12, 2024:

It is used, and is stored, but it's not used particularly frequently.

  • We always try to "load a plan if one already exists" before creating a new one:

        let service_plan = if let Some(plan) =
            ServicePlan::load(&self.log, storage_manager).await?
        {
            plan
        } else {
            ServicePlan::create(
                &self.log,
                &config,
                &storage_manager,
                &plan.sleds,
            )
            .await?
        };

  • This tries to load the plan with the latest version:

        if let Some(ledger) = ledger {
            info!(log, "RSS plan already created, loading from file");
            Ok(Some(ledger.data().clone()))
        } else if Self::has_v1(storage_manager).await.map_err(|err| {
            PlanError::Io {
                message: String::from("looking for v1 RSS plan"),
                err,
            }
        })? {
            // If we found no current-version service plan, but we _do_ find
            // a v1 plan present, bail out. We do not expect to ever see this
            // in practice because that would indicate that:
            //
            // - We ran RSS previously on this same system using an older
            //   version of the software that generates v1 service plans and it
            //   got far enough through RSS to have written the v1 service plan.
            // - That means it must have finished initializing all sled agents,
            //   including itself, causing it to record a
            //   `StartSledAgentRequest`s in its ledger -- while still running
            //   the older RSS.
            // - But we're currently running software that knows about v2
            //   service plans. Thus, this process started some time after that
            //   ledger was written.
            // - But the bootstrap agent refuses to execute RSS if it has a
            //   local `StartSledAgentRequest` ledgered. So we shouldn't get
            //   here if all of the above happened.
            //
            // This sounds like a complicated set of assumptions. If we got
            // this wrong, we'll fail spuriously here and we'll have to figure
            // out what happened. But the alternative is doing extra work to
            // support a condition that we do not believe can ever happen in any
            // system.
            Err(PlanError::FoundV1)
        } else {
            Ok(None)
        }

    In other cases, it at least tries to report a better error.

In the limit: I think we may get more mileage out of "wiping everything and forcing RSS to restart" in the case where we restart, which would make these intermediate files removable.

Contributor:

But those bits of code are still referencing the "rss-service-plan-v2.json" path, right? Should we bump that to v3? (I don't see anywhere where the schema json is checked, either? Sorry if I'm just missing it.)

@smklein (Collaborator Author):

Ugh, thank you, good catch. I'm using this in #5172, but not here yet. I'll pull this into the appropriate PR.

@davepacheco (Collaborator):

> reconfigurator-cli::test_basic test_complex looks like it's failing, and I think updating it requires a particular lab environment, so I guess this PR is blocked for the moment. This seems like a bit of a hard bottleneck (it can only be safely edited from a single, specific deployment)?

The purpose of that file is to exercise the CLI with data saved from a real system. That seemed useful because part of the goal was to be able to import data from real systems. But it could be more trouble than it's worth while we're still evolving the blueprint format.

Also, it doesn't have to be any particular system. The test will dynamically pick an inventory collection and blueprints that it finds.

@smklein (Collaborator Author) commented Mar 12, 2024:

> > reconfigurator-cli::test_basic test_complex looks like it's failing, and I think updating it requires a particular lab environment, so I guess this PR is blocked for the moment. This seems like a bit of a hard bottleneck (it can only be safely edited from a single, specific deployment)?
>
> The purpose of that file is to exercise the CLI with data saved from a real system. That seemed useful because part of the goal was to be able to import data from real systems. But it could be more trouble than it's worth while we're still evolving the blueprint format.
>
> Also, it doesn't have to be any particular system. The test will dynamically pick an inventory collection and blueprints that it finds.

I logged onto a machine set up by a4x2, configured my ssh keys to be able to copy files into and out of that machine, used zlogin to access the switch zone, and ran omdb db reconfigurator-save as the test recommended, but it is still failing because I didn't generate enough blueprints.

I can spin up this environment and take another lap - to clarify, do I need to run omdb nexus blueprints regenerate twice? I'm not sure if I'm being dumb and there are some docs I should be following for "how to do this the right way".

@smklein (Collaborator Author) commented Mar 12, 2024:

Okay, I ran:

  • omdb nexus background-tasks show, and parsed that output to find the last collection id, which I used to run
  • omdb -w nexus blueprints generate-from-collection. I then waited for inventory to be collected again, and re-ran those commands, before running
  • omdb db reconfigurator-save complex.json
  • I then copied that file from my switch zone to my global zone, added my ssh keys from my "hosting" illumos box to ~/.ssh/authorized_keys on the "target" illumos box via the serial console, then copied complex.json out to my host machine, and re-ran the complex test
  • That failed, but re-running with EXPECTORATE=overwrite caused it to succeed

@smklein merged commit cf422c4 into main on Mar 13, 2024 (21 checks passed).
@smklein deleted the disk-in-inventory branch on March 13, 2024 at 21:05.
Successfully merging this pull request may close these issues: Add physical disk information to the Nexus inventory subsystem.

4 participants