-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor search code #5197
Refactor search code #5197
Conversation
This does a number of things: * Removes the simple_search endpoint, so that we only have 1 entry point for search * Re-adds the search signals that we removed in the refactor, these are required for the .com * A few small UI/UX cleanup things to make search results nicer
readthedocs/search/tests/test_api.py
Outdated
@@ -12,8 +12,8 @@ | |||
@pytest.mark.search | |||
class TestDocumentSearch(object): | |||
|
|||
def __init__(self): | |||
# This reverse needs to be inside the ``__init__`` method because from | |||
def setUp(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search/tests/test_api.py:13
/Users/eric/projects/readthedocs.org/readthedocs/search/tests/test_api.py:13: PytestWarning: cannot collect test class 'TestDocumentSearch' because it has a __init__ constructor
This need to have a careful review. I will review it tonight. |
We had two different entry points for search, which means we had to repeat logic a bunch of places. Why do we need |
Its actually necessary. One is search in API, which does not need aggregated data. But the project search and file search do need aggregated data and the number, so they do need aggregated data.
We need simple search for not overwhelming the API search endpoint. Aggregated query in large dataset are comparable slower than non aggregated query. So we should not run aggregated query when its not needed. |
I think we can ship this for now, and simplify if we have issues. It seems like having two totally different code paths for search is more complexity than value to me. This makes things much simpler, and allows us to keep all the logic in one place. |
This cleans up a lot of logic and makes it easier to read. It also moves indexing away from the Django-Elasticsearch-DSL code for any updates or deletes, and does it all in Celery
readthedocs/search/api.py
Outdated
kwargs = {} | ||
kwargs['projects_list'] = [p.slug for p in self.get_all_projects()] | ||
kwargs['versions_list'] = self.request.query_params.get('version') | ||
user = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a better pattern to default to the AnonymousUser
instead. So, anywhere where this is used, all the method are still available and returning the proper values.
I think you can pass self.request.user
directly without checking anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree here, unless there is some significance to ES handling the user as an empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, was just the old way we were doing it. Fixed it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't quite grokked the original changes before refactor, so be warned that I'm not super effective reviewing this. I noted a couple of JS changes -- I did try to add a method of bubbling DEBUG up to our search javascript, but went over my time limit without success, so we can revist adding debug to our local output. It's fairly easy to point out when Sphinx index is used in production at the moment anyways
@@ -32,6 +32,7 @@ function attach_elastic_search_query(data) { | |||
var total_count = data.count || 0; | |||
|
|||
if (hit_list.length) { | |||
console.debug('Read the Docs search got a result. Showing results.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should drop debug/log statements like this for prod. You can tell if the Sphinx indexes are used as the search result return will be empty -- or easier, you'll see a flood of requests for Sphinx's index files.
I do think this is helpfu though. I think an addition to our JS could be to log when DEBUG = True, but I took a really quick swing at this and hit issues. We need to pass through our footer most likely. I'd say lets remove these statements and find a method of exposing DEBUG to our JS.
contents.html(content_text); | ||
contents.find('em').addClass('highlighted'); | ||
list_item.append(contents); | ||
for (index in highlight.content) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, JS quirk here. for ... in
isn't actually for iterables, it's for properties on an object. It's better to use the old for (var i; ...)
approach for arrays:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...in
There is for ... of
, which loops over iterables, but browser support is still new:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of#Browser_compatibility
@@ -97,6 +105,7 @@ function attach_elastic_search_query(data) { | |||
}, | |||
complete: function (resp, status_code) { | |||
if (status_code !== 'success' || resp.responseJSON.count === 0) { | |||
console.debug('Read the Docs search failed, skipping loading search content.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another debug statement
@@ -127,14 +127,16 @@ <h3>{% blocktrans with query=query|default:"" %}Results for {{ query }}{% endblo | |||
{% if result.name %} | |||
|
|||
{# Project #} | |||
<a href="{{ result.url }}">{{ result.name }}</a> | |||
<a href="{{ result.url }}">{{ result.name }} (<em>{{ result.slug }}</em>)</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the notation for subproject results, perhaps we should mention "(from project {{slug}})" or "(from project {{name}})" would be even better. I don't have an example of how this looks with subproject search in doc right now though. If in-doc we omit "from project", we can omit here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mostly because project's can have different names and Slugs, which can be confusing. Eg. in prod we have like 5000 projects with the name "Docs" or similar, so this helps make it explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that we had a bug that makes all the results to say (from project blah)
but I'm not sure what was the mistake. I think it was a problem with JS and the ===
instead of ==
. It may worth to check the history in case it's related to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments.
I don't understand this code completely, though.
I tested this branch locally (some commits "ago") and I did work properly: generating index via management command, index after docs built, search and get results (I had to add the port to CORS).
class Meta(object): | ||
model = HTMLFile | ||
fields = ('commit',) | ||
ignore_signals = settings.ES_PAGE_IGNORE_SIGNALS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search.rst
should be updated accordingly. This setting is not used anymore.
from readthedocs.search.documents import PageDocument, ProjectDocument | ||
from readthedocs.search.signals import before_file_search, before_project_search | ||
|
||
log = logging.getLogger(__name__) | ||
|
||
|
||
class RTDFacetedSearch(FacetedSearch): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. Got confused by the comment.
readthedocs/search/faceted_search.py
Outdated
|
||
# need to search for both 'and' and 'or' operations | ||
# the score of and should be higher as it satisfies both or and and | ||
for operator in ['and', 'or']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case, we were using AND
and OR
, not sure if it affects.
|
||
def elastic_project_search(request, project_slug): | ||
"""Use elastic search to search in a project.""" | ||
queryset = Project.objects.protected(request.user) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why .protected
is used here instead of .public
? If it's because of .com, shouldn't be a combination of .public
+ .for_user
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what all the dashboard views in projects.views.public
do.
@@ -127,14 +127,16 @@ <h3>{% blocktrans with query=query|default:"" %}Results for {{ query }}{% endblo | |||
{% if result.name %} | |||
|
|||
{# Project #} | |||
<a href="{{ result.url }}">{{ result.name }}</a> | |||
<a href="{{ result.url }}">{{ result.name }} (<em>{{ result.slug }}</em>)</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that we had a bug that makes all the results to say (from project blah)
but I'm not sure what was the mistake. I think it was a problem with JS and the ===
instead of ==
. It may worth to check the history in case it's related to this.
docs/development/search.rst
Outdated
the other part is responsible for querying the Index to show the proper results to users. | ||
We use the `django-elasticsearch-dsl`_ package mostly to the keep the search working. | ||
|
||
* One part is responsible for **indexing** the documents and projects (`documents.py`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to use the ``
here instead of the single one.
…rch. Also support passing a version_slug to get_project_list_or_404 in order to filter by version privacy instead of Project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest changes look good.
I left some nitpick comments to consider.
readthedocs/search/faceted_search.py
Outdated
@@ -21,19 +21,10 @@ def __init__(self, user, **kwargs): | |||
but is used on the .com | |||
""" | |||
self.user = user | |||
if 'filter_user' in kwargs: | |||
self.filter_user = kwargs.pop('filter_user') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: this can be written as
self.filter_user = kwargs.pop('filter_user', None)
to avoid the if
.
readthedocs/search/utils.py
Outdated
subprojects = Project.objects.filter(superprojects__parent_id=main_project.id) | ||
for project in list(subprojects) + [main_project]: | ||
version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug) | ||
if version.count(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: .exists()
is better for this use case.
readthedocs/search/utils.py
Outdated
for project in list(subprojects) + [main_project]: | ||
version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug) | ||
if version.count(): | ||
project_list.append(version[0].project) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I think using .first()
is the Django-way for this.
project_list = [] | ||
main_project = get_object_or_404(Project.objects.all(), slug=project_slug) | ||
subprojects = Project.objects.filter(superprojects__parent_id=main_project.id) | ||
for project in list(subprojects) + [main_project]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I'm thinking that this for could be replaced by a query itself, like:
versions = Version.objects.public(user)
.filter(project__in=projects, slug=version_slug)
.values_list('id', flat=True)
projects = Project.objects.filter(versions__id__in=versions)
Use it if you consider it clearer.
readthedocs/search/utils.py
Outdated
""" | ||
# Support private projects with public versions | ||
project_list = [] | ||
main_project = get_object_or_404(Project.objects.all(), slug=project_slug) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: no need to .objects.all()
, just get_object_or_404(Project, slug=project_slug)
works
readthedocs/search/api.py
Outdated
@@ -62,7 +62,7 @@ def get_queryset(self): | |||
# Validate all the required params are there | |||
self.validate_query_params() | |||
query = self.request.query_params.get('q', '') | |||
kwargs = {} | |||
kwargs = {'filter_user': False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I'd like to come up with a better name for this. I'm thinking about filter_by_user
which is a little more explicit.
What we want to communicate here is "filter versions by users permissions" I suppose, but I didn't find a good name for that :(
kwargs = { | ||
'using': using or cls._doc_type.using, | ||
'index': index or cls._doc_type.index, | ||
'doc_types': [cls], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericholscher I think we can pass the doc_type
from here to avoide the lazy import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take a peek at that in a refactor, I think it's OK for now.
kwargs = { | ||
'using': using or cls._doc_type.using, | ||
'index': index or cls._doc_type.index, | ||
'doc_types': [cls], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, pass the doc_types
in order to avoide the lazy import
I have run the management command with
I was expecting this error because this is trying to index projects and documents by the index alias. If the index alias has more than one index associate with it, it will raise error. This only works when running with |
This PR has gotten pretty large, so I think the easiest thing to do is actually to just check it out and read the code in
readthedocs/search
. I'd really like more eyes from @rtfd/core on this, since we're all now responsible for maintaining this code :)This does a number of things:
Closes #5167 #5168