Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update profiling records without reseting catalog #519

Merged

Conversation

outtanames
Copy link
Contributor

@outtanames outtanames commented Jan 15, 2024

Variety of updates/fixes to the profiler to support generating profiling catalogs across local/cloud:

  • profiling runs now load and update the existing catalog rather than regenerating the entire thing, so we can do incremental model profiling runs for a specific subset of models
  • Various fixes/checks when reading/writing to/from the profiling catalog, edge cases were breaking e.g. gpu utilization
  • Refactor profiler to make different overloads of prof less ambiguous between the Profiler itself, context managers, actual profiling results etc.
  • --catalog-path flag so we can write profiling results to a filemount in skypilot rather than needing to track down the catalog inside the nos cache each time.
Screenshot 2024-01-22 at 3 34 28 PM

Summary

Related issues

Checks

  • make lint: I've run make lint to lint the changes in this PR.
  • make test: I've made sure the tests (make test-cpu or make test) are passing.
  • Additional tests:
    • Benchmark tests (when contributing new models)
    • GPU/HW tests

@spillai
Copy link
Contributor

spillai commented Jan 15, 2024

Can you add tests for these?

@@ -228,6 +228,22 @@ def as_df(self) -> pd.DataFrame:
)
return df

def from_df(self, df: pd.DataFrame) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a classmethod where you return cls(records)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@classmethod
def from_df(cls, df: pd.DataFrame) -> "Profiler":
    ...
    return cls(records)

record.update(key, value)
self.records.append(record)

def from_json_path(self, filename: Union[Path, str]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, needs to be a classmethod that returns return cls.from_df(df)

self.prof = Profiler()
self.prof.from_json_path(NOS_PROFILE_CATALOG_PATH)
else:
logger.debug("No prof catalog found")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use f"Profile catalog not found (filename={NOS_PROFILE_CATALOG_PATH})."

with Profiler() as self.prof, torch.inference_mode():
from nos.constants import NOS_PROFILE_CATALOG_PATH

self.prof = Profiler()
Copy link
Contributor

@spillai spillai Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.prof = Profiler.from_json_path(NOS_PROFILE_CATALOG_PATH)

@outtanames outtanames force-pushed the sloftin/inline-profiling-records branch from ae09f21 to 49b158e Compare January 16, 2024 08:32
@spillai spillai changed the title update profiling records without reseting catalog Update profiling records without reseting catalog Jan 16, 2024
@outtanames outtanames force-pushed the sloftin/inline-profiling-records branch from 0c0c126 to c2b6b9f Compare January 22, 2024 23:54
@outtanames
Copy link
Contributor Author

Can you add tests for these?

see additions, we now check for gpu utilization in the catalog for clip-vit-patch32 (along with a successful profiling run overall).

@outtanames outtanames merged commit 09b0acd into autonomi-ai:main Jan 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants