Refactor search code #5197

ericholscher · 2019-01-29T18:20:42Z

This PR has gotten pretty large, so I think the easiest thing to do is actually to just check it out and read the code in readthedocs/search. I'd really like more eyes from @rtfd/core on this, since we're all now responsible for maintaining this code :)

This does a number of things:

* Removes the simple_search endpoint, so that we only have one entry point for search
* Re-adds the search signals that we removed in the refactor; these are required for the .com
* A few small UI/UX cleanups to make search results nicer
* Some optimizations that reduce the size of the ES results that we get back from the server
* Moves all ES updates/deletes to Celery and totally removes the default django-elasticsearch-dsl signals
  - This allows us to remove the custom logic that we needed to remove invalid HTMLFiles, and removes the entire RTDDocType class

Closes #5167 #5168

ericholscher · 2019-01-29T22:44:32Z

readthedocs/search/tests/test_api.py

@@ -12,8 +12,8 @@
 @pytest.mark.search
 class TestDocumentSearch(object):

-    def __init__(self):
-        # This reverse needs to be inside the ``__init__`` method because from
+    def setUp(self):


search/tests/test_api.py:13 /Users/eric/projects/readthedocs.org/readthedocs/search/tests/test_api.py:13: PytestWarning: cannot collect test class 'TestDocumentSearch' because it has a __init__ constructor
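
For context, the warning above is pytest declining to collect a test class that defines a constructor. A minimal sketch of one common alternative follows, assuming pytest's xunit-style `setup_method` hook; the class name and URL are hypothetical stand-ins, not the actual test code in this PR:

```python
class TestDocumentSearchSketch:
    # No __init__ here: pytest refuses to collect test classes
    # that define their own constructor.

    def setup_method(self):
        # pytest calls this hook before each test method in the class.
        # The URL is a hypothetical placeholder for a reverse() lookup.
        self.url = '/hypothetical/search/'

    def test_url_is_set(self):
        assert self.url.endswith('/search/')
```

Per-test state lives on `self` exactly as it would with a constructor, but the class stays collectable.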

safwanrahman · 2019-01-30T07:39:55Z

This needs a careful review. I will review it tonight.
@ericholscher Can you explain more about removing the simple_search? I think we can override the search method to pass the signal.

ericholscher · 2019-01-30T14:23:43Z

@ericholscher Can you explain more about removing the simple_search? I think we can override the search method to pass the signal.

We had two different entry points for search, which means we had to repeat logic a bunch of places. Why do we need simple_search instead of just always using faceted search?

safwanrahman · 2019-01-30T17:16:37Z

We had two different entry points for search, which means we had to repeat logic a bunch of places.

It's actually necessary. One entry point is the API search, which does not need aggregated data. The project search and file search, however, do need the aggregated data and the counts.

Why do we need simple_search instead of just always using faceted search?

We need simple search so that we don't overwhelm the API search endpoint. Aggregated queries on a large dataset are comparatively slower than non-aggregated queries, so we should not run an aggregated query when it's not needed.
If we use faceted search all the time, it will just make the query slower and may affect our Elasticsearch cluster. If we can keep the query simple and fast, we can implement search-as-you-type, suggestions and other things without breaking the user experience.
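
To make the trade-off concrete, here is a rough sketch of the two request bodies being compared, written as plain dicts. The helper name and the field names (`title`, `content`, `project`, `version`) are assumptions for illustration, not the code in this PR:

```python
def build_search_body(query, with_aggregations=False):
    """Return an Elasticsearch search request body as a plain dict."""
    body = {
        'query': {
            'simple_query_string': {
                'query': query,
                'fields': ['title', 'content'],  # assumed field names
            },
        },
    }
    if with_aggregations:
        # Terms aggregations are what produce the per-facet counts.
        # They are extra work for the cluster on every request.
        body['aggs'] = {
            'project': {'terms': {'field': 'project'}},
            'version': {'terms': {'field': 'version'}},
        }
    return body


simple = build_search_body('celery')
faceted = build_search_body('celery', with_aggregations=True)
```

The only difference between the two bodies is the `aggs` section, which is the part the API endpoint would not need.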

ericholscher · 2019-01-30T19:18:05Z

We need simple search for not overwhelming the API search endpoint. Aggregated query in large dataset are comparable slower than non aggregated query. So we should not run aggregated query when its not needed.

I think we can ship this for now, and simplify if we have issues. It seems like having two totally different code paths for search is more complexity than value to me. This makes things much simpler, and allows us to keep all the logic in one place.

This cleans up a lot of logic and makes it easier to read. It also moves indexing away from the Django-Elasticsearch-DSL code for any updates or deletes, and does it all in Celery

humitos · 2019-02-05T18:57:01Z

readthedocs/search/api.py

+        kwargs = {}
+        kwargs['projects_list'] = [p.slug for p in self.get_all_projects()]
+        kwargs['versions_list'] = self.request.query_params.get('version')
+        user = ''


I think it's a better pattern to default to the AnonymousUser instead. So, anywhere where this is used, all the method are still available and returning the proper values.

I think you can pass self.request.user directly without checking anything.

I agree here, unless there is some significance to ES handling the user as an empty string.

Nope, was just the old way we were doing it. Fixed it now.

agjohnson

I haven't quite grokked the original changes before the refactor, so be warned that I'm not super effective at reviewing this. I noted a couple of JS changes -- I did try to add a method of bubbling DEBUG up to our search JavaScript, but went over my time limit without success, so we can revisit adding debug to our local output. It's fairly easy to tell when the Sphinx index is used in production at the moment anyway.

agjohnson · 2019-02-05T18:30:27Z

readthedocs/core/static-src/core/js/doc-embed/search.js

@@ -32,6 +32,7 @@ function attach_elastic_search_query(data) {
                var total_count = data.count || 0;

                if (hit_list.length) {
+                    console.debug('Read the Docs search got a result. Showing results.')


We should drop debug/log statements like this for prod. You can tell whether the Sphinx indexes are being used because the search results will come back empty -- or, easier, you'll see a flood of requests for Sphinx's index files.

I do think this is helpful though. I think an addition to our JS could be to log when DEBUG = True, but I took a really quick swing at this and hit issues. We most likely need to pass it through our footer. I'd say let's remove these statements and find a method of exposing DEBUG to our JS.

agjohnson · 2019-02-05T18:35:45Z

readthedocs/core/static-src/core/js/doc-embed/search.js

-                            contents.html(content_text);
-                            contents.find('em').addClass('highlighted');
-                            list_item.append(contents);
+                            for (index in highlight.content) {


So, JS quirk here. for ... in isn't actually for iterables, it's for properties on an object. It's better to use the old for (var i; ...) approach for arrays:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...in

There is for ... of, which loops over iterables, but browser support is still new:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of#Browser_compatibility

agjohnson · 2019-02-05T18:36:46Z

readthedocs/core/static-src/core/js/doc-embed/search.js

@@ -97,6 +105,7 @@ function attach_elastic_search_query(data) {
            },
            complete: function (resp, status_code) {
                if (status_code !== 'success' || resp.responseJSON.count === 0) {
+                    console.debug('Read the Docs search failed, skipping loading search content.')


Another debug statement

agjohnson · 2019-02-05T18:55:12Z

readthedocs/templates/search/elastic_search.html

@@ -127,14 +127,16 @@ <h3>{% blocktrans with query=query|default:"" %}Results for {{ query }}{% endblo
                            {% if result.name %}

                              {# Project #}
-                              <a href="{{ result.url }}">{{ result.name }}</a>
+                              <a href="{{ result.url }}">{{ result.name }} (<em>{{ result.slug }}</em>)</a>


If this is the notation for subproject results, perhaps we should mention "(from project {{slug}})" or "(from project {{name}})" would be even better. I don't have an example of how this looks with subproject search in doc right now though. If in-doc we omit "from project", we can omit here too.

It's mostly because projects can have different names and slugs, which can be confusing. E.g. in prod we have something like 5000 projects with the name "Docs" or similar, so this helps make it explicit.

I remember that we had a bug that made all the results say (from project blah), but I'm not sure what the mistake was. I think it was a problem in the JS with === instead of ==. It may be worth checking the history in case it's related to this.

humitos

I left some comments.

I don't understand this code completely, though.

I tested this branch locally (some commits "ago") and it worked properly: generating the index via the management command, indexing after docs are built, searching and getting results (I had to add the port to CORS).

humitos · 2019-02-05T19:02:04Z

readthedocs/search/documents.py

-    class Meta(object):
-        model = HTMLFile
-        fields = ('commit',)
-        ignore_signals = settings.ES_PAGE_IGNORE_SIGNALS


search.rst should be updated accordingly. This setting is not used anymore.

humitos · 2019-02-05T19:04:11Z

readthedocs/search/faceted_search.py

 from readthedocs.search.documents import PageDocument, ProjectDocument
-from readthedocs.search.signals import before_file_search, before_project_search

 log = logging.getLogger(__name__)


 class RTDFacetedSearch(FacetedSearch):


Nevermind. Got confused by the comment.

humitos · 2019-02-05T19:07:43Z

readthedocs/search/faceted_search.py

+
+        # need to search for both 'and' and 'or' operations
+        # the score of and should be higher as it satisfies both or and and
+        for operator in ['and', 'or']:


Just in case: we were previously using uppercase AND and OR. I'm not sure whether the case makes a difference.
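
As a rough illustration of the loop in the diff above, the combined query could be sketched as a raw dict like this. It assumes a `simple_query_string` clause per operator wrapped in a `bool`/`should`; the helper name and the field list are hypothetical, and the uppercase spelling follows how the Elasticsearch docs list the `default_operator` values:

```python
def build_and_or_query(query, fields):
    """Combine an AND and an OR variant of the same query."""
    should = []
    for operator in ['AND', 'OR']:
        should.append({
            'simple_query_string': {
                'query': query,
                'fields': fields,
                'default_operator': operator,
            },
        })
    # A document matching every term satisfies both clauses, so it
    # collects score from both and ranks above partial matches.
    return {'bool': {'should': should}}


combined = build_and_or_query('celery tasks', ['title', 'content'])
```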

humitos · 2019-02-05T19:18:51Z

readthedocs/search/views.py

+
+def elastic_project_search(request, project_slug):
+    """Use elastic search to search in a project."""
+    queryset = Project.objects.protected(request.user)


Why is .protected used here instead of .public? If it's because of .com, shouldn't it be a combination of .public + .for_user?

This is what all the dashboard views in projects.views.public do.

humitos · 2019-02-05T19:21:20Z

readthedocs/templates/search/elastic_search.html

@@ -127,14 +127,16 @@ <h3>{% blocktrans with query=query|default:"" %}Results for {{ query }}{% endblo
                            {% if result.name %}

                              {# Project #}
-                              <a href="{{ result.url }}">{{ result.name }}</a>
+                              <a href="{{ result.url }}">{{ result.name }} (<em>{{ result.slug }}</em>)</a>


I remember that we had a bug that made all the results say (from project blah), but I'm not sure what the mistake was. I think it was a problem in the JS with === instead of ==. It may be worth checking the history in case it's related to this.

humitos · 2019-02-06T10:08:05Z

docs/development/search.rst

-the other part is responsible for querying the Index to show the proper results to users.
-We use the `django-elasticsearch-dsl`_ package mostly to the keep the search working.
+
+* One part is responsible for **indexing** the documents and projects (`documents.py`)


I think you want to use double backticks (``) here instead of single ones.

humitos

Latest changes look good.

I left some nitpick comments to consider.

humitos · 2019-02-06T14:48:05Z

readthedocs/search/faceted_search.py

@@ -21,19 +21,10 @@ def __init__(self, user, **kwargs):
            but is used on the .com
        """
        self.user = user
+        if 'filter_user' in kwargs:
+            self.filter_user = kwargs.pop('filter_user')


nitpick: this can be written as

self.filter_user = kwargs.pop('filter_user', None)

to avoid the if.
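
A small sketch of the two variants side by side, using hypothetical stand-in classes rather than the real RTDFacetedSearch. One behavioral difference is worth noting: with the default, the attribute always exists (as None when the argument is omitted), whereas the `if` version leaves it unset unless the class defines a class-level default elsewhere:

```python
class WithIf:
    def __init__(self, user, **kwargs):
        self.user = user
        if 'filter_user' in kwargs:
            self.filter_user = kwargs.pop('filter_user')
        self.extra = kwargs  # whatever is left for the parent class


class WithPopDefault:
    def __init__(self, user, **kwargs):
        self.user = user
        # pop() with a default never raises and always sets the attribute.
        self.filter_user = kwargs.pop('filter_user', None)
        self.extra = kwargs


a = WithIf('someone')
b = WithPopDefault('someone')
c = WithPopDefault('someone', filter_user=False, size=10)
```

Here `a` has no `filter_user` attribute at all, `b.filter_user` is None, and `c.filter_user` is False with only `size` left in `c.extra`.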

humitos · 2019-02-06T14:50:12Z

readthedocs/search/utils.py

+    subprojects = Project.objects.filter(superprojects__parent_id=main_project.id)
+    for project in list(subprojects) + [main_project]:
+        version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug)
+        if version.count():


nitpick: .exists() is better for this use case.

humitos · 2019-02-06T14:51:35Z

readthedocs/search/utils.py

+    for project in list(subprojects) + [main_project]:
+        version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug)
+        if version.count():
+            project_list.append(version[0].project)


nitpick: I think using .first() is the Django-way for this.

humitos · 2019-02-06T14:56:16Z

readthedocs/search/utils.py

+    project_list = []
+    main_project = get_object_or_404(Project.objects.all(), slug=project_slug)
+    subprojects = Project.objects.filter(superprojects__parent_id=main_project.id)
+    for project in list(subprojects) + [main_project]:


Not sure, but I'm thinking that this for could be replaced by a query itself, like:

versions = (
    Version.objects.public(user)
    .filter(project__in=projects, slug=version_slug)
    .values_list('id', flat=True)
)
projects = Project.objects.filter(versions__id__in=versions)

Use it if you consider it clearer.

humitos · 2019-02-06T14:56:51Z

readthedocs/search/utils.py

+    """
+    # Support private projects with public versions
+    project_list = []
+    main_project = get_object_or_404(Project.objects.all(), slug=project_slug)


nitpick: no need to .objects.all(), just get_object_or_404(Project, slug=project_slug) works

humitos · 2019-02-06T15:05:49Z

readthedocs/search/api.py

@@ -62,7 +62,7 @@ def get_queryset(self):
        # Validate all the required params are there
        self.validate_query_params()
        query = self.request.query_params.get('q', '')
-        kwargs = {}
+        kwargs = {'filter_user': False}


nitpick: I'd like to come up with a better name for this. I'm thinking about filter_by_user which is a little more explicit.

What we want to communicate here is "filter versions by users permissions" I suppose, but I didn't find a good name for that :(

safwanrahman · 2019-02-06T15:58:43Z

readthedocs/search/documents.py

        kwargs = {
-            'using': using or cls._doc_type.using,
-            'index': index or cls._doc_type.index,
-            'doc_types': [cls],


@ericholscher I think we can pass the doc_type from here to avoid the lazy import.

Will take a peek at that in a refactor, I think it's OK for now.

safwanrahman · 2019-02-06T15:59:19Z

readthedocs/search/documents.py

        kwargs = {
-            'using': using or cls._doc_type.using,
-            'index': index or cls._doc_type.index,
-            'doc_types': [cls],


Same here, pass the doc_types in order to avoid the lazy import.

safwanrahman · 2019-02-06T16:50:00Z

I ran the management command with CELERY_ALWAYS_EAGER=False and it raises the following error.

[06/Feb/2019 16:41:36] celery.app.trace:249[1564]: ERROR Task readthedocs.search.tasks.index_objects_to_es[676db212-aaf4-4fee-830c-5bd6f173b3cc] raised unexpected: RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op"}], 'type': 'illegal_argument_exception', 'reason': "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op"}, 'status': 400})
Traceback (most recent call last):
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/safwan/readthedocs/readthedocs/search/tasks.py", line 36, in index_objects_to_es
    doc_obj.update(queryset.iterator())
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/django_elasticsearch_dsl/documents.py", line 231, in update
    self._get_actions(object_list, action), **kwargs
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/django_elasticsearch_dsl/documents.py", line 191, in bulk
    return bulk(client=self.connection, actions=actions, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 257, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 192, in streaming_bulk
    raise_on_error, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 99, in _process_bulk_chunk
    raise e
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 95, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1150, in bulk
    headers={'content-type': 'application/x-ndjson'})
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/transport.py", line 314, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 180, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'illegal_argument_exception', "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op")

I was expecting this error because this is trying to index projects and documents through the index alias. If the index alias has more than one index associated with it, Elasticsearch raises an error.
In the code currently in master, the new index name is passed to the task, and the task indexes into the new index using that name. This functionality is broken by this PR.

This only works when running with CELERY_ALWAYS_EAGER=True as the task will run in the same instance as the management command.
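
The failure mode can be modelled with a toy in-memory stand-in for the cluster. This is an illustration only: `FakeCluster`, `AliasError` and `index_objects` are hypothetical names, not Elasticsearch or Read the Docs APIs. It sketches why a write through an alias breaks once the alias points at two indices, and why passing the concrete index name to the task avoids that:

```python
class AliasError(Exception):
    """Raised when a write target cannot be resolved to exactly one index."""


class FakeCluster:
    """Tiny in-memory model of indices and aliases (illustration only)."""

    def __init__(self):
        self.indices = {}  # index name -> list of documents
        self.aliases = {}  # alias name -> set of index names

    def create_index(self, name, alias):
        self.indices[name] = []
        self.aliases.setdefault(alias, set()).add(name)

    def index_document(self, target, doc):
        if target in self.indices:
            # Concrete index name: always unambiguous.
            self.indices[target].append(doc)
            return target
        names = self.aliases.get(target, set())
        if len(names) != 1:
            raise AliasError(
                'cannot write via %r: it resolves to %d indices'
                % (target, len(names))
            )
        (name,) = names
        self.indices[name].append(doc)
        return name


def index_objects(cluster, docs, index_name):
    """Sketch of a task that receives the target index explicitly."""
    for doc in docs:
        cluster.index_document(index_name, doc)


cluster = FakeCluster()
cluster.create_index('project_index_old', alias='project_index')
cluster.create_index('project_index_new', alias='project_index')
index_objects(cluster, [{'slug': 'docs'}], index_name='project_index_new')
```

During a reindex both the old and the new index sit behind the alias, so only the call that names the new index explicitly succeeds.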

ericholscher added 5 commits January 29, 2019 13:16

Show more results in JS search results

a84bb15

Fix docstring

6f36acd

Fix lint

0bfa998

Fix tests

936465d

ericholscher requested a review from a team January 29, 2019 21:41

ericholscher added 5 commits January 29, 2019 16:44

Small fixes for search

0496cf6

Fix comment.

ab28491

Move to pformat and real logging.

617f47f

Standardize search result listings

d0f6082

Only request 3 fragments, as that’s all we display

af8a28f

ericholscher requested a review from safwanrahman January 29, 2019 22:40

ericholscher commented Jan 29, 2019

Fix pytest test logic.

5f197ec

ericholscher added 4 commits January 30, 2019 10:16

Handle changing newlines to periods.

465628e

Fix tests.

e3e245c

Check for content to highlight

e909759

Attempt to fix tests agian.

49098d7

Delete some old code, and remove single function/class files.

07dc26b

ericholscher added 6 commits January 30, 2019 14:59

Fix lint error.

e9f3f3f

Keep all search views in the search app

d6d02da

remove need to pass around an index_name

336e674

Properly filter project search

eb65817

Refactor the Document and FacetedSearch classes

144ed0e

This cleans up a lot of logic and makes it easier to read. It also moves indexing away from the Django-Elasticsearch-DSL code for any updates or deletes, and does it all in Celery

Merge remote-tracking branch 'origin/master' into readd-search-signals

d314f58

ericholscher changed the title ~~Reactor search code~~ Refactor search code Jan 31, 2019

Merge remote-tracking branch 'origin/master' into readd-search-signals

e062191

humitos reviewed Feb 5, 2019

agjohnson reviewed Feb 5, 2019

ericholscher added 6 commits February 5, 2019 16:02

Nicer highlight replacement syntax

8e4cc2b

Remove search debug logging.

8c7bda4

Use normal user object everywhere.

5b9f460

Merge remote-tracking branch 'origin/readd-search-signals' into readd…

e2e271b

…-search-signals

Fix typo

a993f08

Use classic JS loop

d52b968

humitos previously approved these changes Feb 5, 2019

Update docs

fc277fa

ericholscher dismissed humitos’s stale review via fc277fa February 5, 2019 19:39

ericholscher added 3 commits February 5, 2019 16:40

Cap operators

417ea45

Fix lint again

80c58c7

Once more with the linting

72d867f

humitos reviewed Feb 6, 2019

ericholscher added 3 commits February 6, 2019 08:23

Change API queryset filter to public(user)

5f118ff

Small doc fixup

e263e69

Support filter_user argument for not filtering users in corporate sea…

1a3e146

…rch. Also support passing a version_slug to get_project_list_or_404 in order to filter by version privacy instead of Project.

humitos previously approved these changes Feb 6, 2019

Address review feedback

fab9f42

ericholscher dismissed humitos’s stale review via fab9f42 February 6, 2019 15:31

More cleanup.

0a06726

safwanrahman reviewed Feb 6, 2019

ericholscher merged commit 05b7c3f into master Feb 6, 2019

delete-merged-branch bot deleted the readd-search-signals branch February 6, 2019 16:40

stsewd mentioned this pull request Feb 6, 2019

Convert docsearch newlines into BR's #5168

Closed
