Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Make sure Client parameters are strings #1577

Merged
merged 5 commits into from
Feb 20, 2024
Merged

Conversation

beggers
Copy link
Contributor

@beggers beggers commented Dec 23, 2023

Description of changes

Summarize the changes made by this PR.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

@jeffchuber
Copy link
Contributor

@beggers should we force port to be an int? Is that a better solution here? I agree with the other inputs. Also do end users see these type hints?

@@ -255,3 +282,9 @@ def AdminClient(settings: Settings = Settings()) -> AdminAPI:

"""
return AdminClientCreator(settings=settings)


def _stringify_headers(headers: Optional[Dict[str, str]]) -> None:
Copy link
Contributor

@rancomp rancomp Dec 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we can be more suggestive with the type-hinting here. If headers is already Dict[str, str] then we would not need to cast. More explicitly,

Suggested change
def _stringify_headers(headers: Optional[Dict[str, str]]) -> None:
def _stringify_headers(headers: Optional[Dict[str, str | int]]) -> None:

I would have suggested a SupportsStr type but I think every(?) object in python has a __str__ method? see related exchange https://www.mail-archive.com/[email protected]/msg24545.html

@beggers
Copy link
Contributor Author

beggers commented Dec 23, 2023

@jeffchuber int makes more sense; changed it everywhere. End users see these type hints if they use mypy but many pythonistas don't do that. Importantly, I believe these typechecks only happen statically (never at runtime) when folks (us included) manually run mypy on their code -- there's nothing actually preventing end-users from passing in ints, strings, or even objects as these parameters. I think this is a reasonable way to solve this problem for now. @HammadB can correct me if my understanding of our typechecking is wrong since I'm definitely not an expert here.

@rancomp for the reasons stated above we probably want headers to be Dict[str, str]. That will help keep us honest and careful in our own codebase and provide better hinting to end users. That said, there's still currently nothing preventing users from passing in a Dict with any types in keys and values so we should just stringify for them. WDYT?

@rancomp
Copy link
Contributor

rancomp commented Dec 23, 2023

@beggers my understanding of type-hints in python is similar to yours and I agree with you that headers should be typed as Dict[str, str]. However, I was more focused on the internal function _stringify_headers. This function is part of our back-end and serves as a validation layer. I think that, for our own code documentation, we should specify what types we would want to stringify.

PS, because practically anything can be stringified, I want to bring up the option to directly validate the values types through assert isinstance(value, ⋅).

PSS, I thought about it. Do we want to mutate headers? It's unintuitive that a function like HttpClient mutates its arguments. So another suggestion is to change _stringify_headers to return a new dictionary

def _stringify_headers(headers: Optional[Dict[str, str]]) -> Dict[str, str]:
    if headers is not None:
        return {key: str(value) for key, value in headers.items()}
    return dict()

@beggers
Copy link
Contributor Author

beggers commented Dec 23, 2023

Returning new headers makes sense, done.

I still think the type annotation should be Dict[str, str] though. We don't want anything else in headers. This function is protection against that case but it's not intended to be a blessed path.

@tazarov
Copy link
Contributor

tazarov commented Dec 27, 2023

@beggers, what do you think of replacing the host and port altogether in favor of a URI-based endpoint?

Here are some wins:

  • cohesion - endpoint string like http://localhost:8000 is a clear and cohesive representation of the resource location
  • Simpler - a single endpoint string is more straightforward to reason about than host and port individually
  • Validation - it is generally simpler to validate URI instead of host + port + implicit scheme + path.
  • Works well with cloud - endpoint strings are well suited for anything behind proxy (aka AWS API GW)
  • Standardization - using endpoints for REST APIs is more natural than host and port

I know this does not fit in this PR, but we can work in this direction in parallel.

Regarding the headers, all three versions of HTTP:

State that headers are US-ASCII chars so it does make sense to have them as Dict[str,str] without explicitly enforcing the type check as the underlying libs usually do this for us (or throw an appropriate error)

@beggers
Copy link
Contributor Author

beggers commented Jan 2, 2024

I would prefer we keep our current separate host and port setup. There are some benefits to using a single URI as you point out, but it feels overall less legible to me: I prefer more, smaller, labeled and type-annotated parameters to fewer, larger ones. It's a pretty minor change either way but I don't think it provides any value.

host = str(host)
port = int(port)
ssl = bool(ssl)
headers = _stringify_headers(headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we insist all headers are strings?

Copy link
Contributor Author

@beggers beggers Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this with the type annotations.

However type annotations are statically checked only when you feel like it and don't actually enforce anything at runtime -- this PR exists because a user was hitting a subtle bug when they passed in an int for port instead of a string.

Our options here are to stringify headers or simply pass them as they're passed to us. @tazarov requested below that we pass them as-is and I don't feel strongly about it so that's what we do now. Removed the _stringify_headers stuff.

return ClientCreator(tenant=tenant, database=database, settings=settings)


def HttpClient(
host: str = "localhost",
port: str = "8000",
port: int = 8000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if the user currently is passing a string? just a type error, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can pass whatever they want -- these are just type hints which users can choose to statically check or not. If a user is currently passing a string we'll turn it into an int with the port = int(port) line below. This will not break anyone.

"""

# https://github.com/chroma-core/chroma/issues/1573
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to link to this issue in 4 separate places? just feels a little messy- id rather add one-liner with the rationale

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed these all to a comment, lmk what you think. Happy to change.



# Despite type hints, users may pass in non-string values for headers.
def _stringify_headers(headers: Optional[Dict[str, str]]) -> Optional[Dict[str, str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be converting headers to strings here. Instead, we should let requests library error out on headers that are not strings () - psf/requests#3491 (comment) and https://github.com/psf/requests/blob/72eccc8dd8b7c272e520f22b0256386c80864e94/src/requests/utils.py#L1040

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, removed _stringify_headers

@tazarov
Copy link
Contributor

tazarov commented Jan 2, 2024

I would prefer we keep our current separate host and port setup. There are some benefits to using a single URI as you point out, but it feels overall less legible to me: I prefer more, smaller, labeled and type-annotated parameters to fewer, larger ones. It's a pretty minor change either way but I don't think it provides any value.

I have to disagree with it not bringing any value. Consider having only host and port; how would one add a path or query param in the URL (adding more params to the HttpClient would make it messier)?
Recently, several community members had issues with Chroma running behind proxies where paths may be needed to distinguish between different services behind the proxy (e.g., AWS API GW). The way we worked around this was to support full-blown URLs in the host parameter https://example.com:8443/chroma.
All said and done, I think we should move to have an endpoint-like parameter and, with it, make things a little more future-proof WRT supporting different schemes (e.g. grpc) and connection string query params (much like most SQL databases nowadays).

@beggers
Copy link
Contributor Author

beggers commented Jan 3, 2024

I would prefer we keep our current separate host and port setup. There are some benefits to using a single URI as you point out, but it feels overall less legible to me: I prefer more, smaller, labeled and type-annotated parameters to fewer, larger ones. It's a pretty minor change either way but I don't think it provides any value.

I have to disagree with it not bringing any value. Consider having only host and port; how would one add a path or query param in the URL (adding more params to the HttpClient would make it messier)? Recently, several community members had issues with Chroma running behind proxies where paths may be needed to distinguish between different services behind the proxy (e.g., AWS API GW). The way we worked around this was to support full-blown URLs in the host parameter https://example.com:8443/chroma. All said and done, I think we should move to have an endpoint-like parameter and, with it, make things a little more future-proof WRT supporting different schemes (e.g. grpc) and connection string query params (much like most SQL databases nowadays).

Sure, seems reasonable. I would be very happy if we supported a path-like parameter now and a protocol parameter in the future (we currently almost have this since we take ssl= and use http or https depending on it).

I would also be fine if we added an connection_string-like param, though removing the existing params would require a code change from everyone using Chroma so we should leave them. In any event, this conversation has moved pretty far off-base from this PR but I'd be happy to continue it elsewhere. Though it seems like you and I are in agreement about adding more configurability to this API.

@beggers beggers changed the title Make sure Client parameters are strings [BUG] Make sure Client parameters are strings Jan 4, 2024
@@ -165,6 +173,13 @@ def HttpClient(
if settings is None:
settings = Settings()

# Make sure paramaters are the correct types -- users can pass anything.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the int conversions have a try/catch in case of failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're fine erroring if someone passes in a port that's not an int (same applies to other types). All we would do is throw another error.

@beggers beggers merged commit 8a0f67e into main Feb 20, 2024
98 checks passed
atroyn pushed a commit to csbasil/chroma that referenced this pull request Apr 3, 2024
## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- Stringify all paremeters to `Client`s which are meant to be strings.
At present some parameters -- `port` in particular -- can be reasonably
passed as integers which causes weird and unexpected behavior.
	 - Fixes chroma-core#1573

## Test plan
*How are these changes tested?*

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Started ChromaDB in Docker Compose. Client reporting port 8000 is not the same as 8000.
5 participants