Use fuse.js for dbt docs search #162

joellabes · 2021-01-03T08:15:08Z

resolves #143, resolves #166

Description

Moving on from the work done in PR #145, use Fuse's fuzzy search instead of the hand-rolled version.

Extremely WIP-y. I'm running into issues when running yarn install locally, so I'm kicking the work off to the CI version on Netlify.

Checklist

I have signed the CLA
I have generated docs locally, and this change appears to resolve the stated issue
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

joellabes · 2021-01-03T10:35:43Z

From the look of things, there will need to be some level of transformation from the raw data in manifest.json, because Fuse expects a list of items but each model in the manifest is its own object. The data that gets passed into Fuse will need to look more like this:

[
	{
		"node": "model.educationperfect.classes",
		"description": "All classes, including staff classes and Unlicensed Users",
		"alias": "classes",
		"columns": [
			{
				"name": "class_id",
				"description": "",
				"tags": []
			},
			{
				"name": "is_current_academic_year",
				"description": "Current means that today's date is inside the bounds of the academic year set on the class (if one is set)",
				"tags": ["some-tag"]
			}
		],
		"tags": ["another-tag"],
		"raw_sql": "with source as (\r\n    select *\r\n    from {{ source('classes', 'classes') }}\r\n),\r\n\r\nrenamed as (\r\n    select     \r\n        id as class_id,\r\n        name as class_name,\r\n        school as org_id,\r\n        datecreated as class_created_at,\r\n        case\r\n            when classtypeid = 1 then 'Standard'\r\n            when classtypeid = 2 then 'Unlicenced Users'\r\n            when classtypeid = 3 then 'Staff'\r\n            when classtypeid = 4 then 'CRM Staff'\r\n        end as class_type,\r\n        academicyearid as academic_year_id,\r\n        classtypeid = 1 as is_student_class \r\n    from source\r\n),\r\n\r\nacademic_years as (\r\n    select *\r\n    from {{ref('classes__academic_years')}}\r\n),\r\n\r\ncombined as (\r\n    select renamed.*,\r\n        coalesce(academic_years.is_currently_active, false) as is_current_academic_year\r\n    from renamed\r\n    left outer join academic_years using (academic_year_id)\r\n)\r\n\r\nselect * from combined"
	},
	
	{
		"node": "model.educationperfect.classes__academic_years",
		"description": "",
		"alias": "classes__academic_years",
		"columns": [
			{
				"name": "academic_year_id",
				"description": "",
				"tags": []
			},
			{
				"name": "start_date",
				"description": "",
				"tags": []
			},
			{
				"name": "end_date",
				"description": "",
				"tags": []
			}
		],
		"tags": [],
		"raw_sql": "with source as (\r\n    select * from {{source('classes', 'academicyear')}}\r\n),\r\n\r\nrenamed as (\r\n    select id as academic_year_id,\r\n            name as academic_year_name,\r\n            hemisphereid as hemisphere_id,\r\n            startdate as start_date,\r\n            enddate as end_date,\r\n            startdate <= date_trunc('day', getdate()) and enddate >= date_trunc('day', getdate()) as is_currently_active\r\n    from source\r\n)\r\n\r\nselect *\r\nfrom renamed"
	}	
]

than the default:

"columns": {
        "class_id": {
          "name": "class_id",
          "description": "",
          "tags": []
        }
...
}

I'll work on that next for a while, because that doesn't depend on working out why Fuse gives "Object is not a constructor" errors 🤔
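A minimal sketch of that reshaping, assuming the manifest's nodes are an object keyed by unique ID and each node's `columns` is keyed by column name, as in the default snippet above. `toFuseList` is a hypothetical name, not something in the dbt-docs codebase, and the fallbacks for missing descriptions and tags are assumptions:

```javascript
// Hypothetical helper: flatten manifest-style nodes into the list shape Fuse expects.
function toFuseList(nodes) {
  return Object.keys(nodes).map(function (uniqueId) {
    var node = nodes[uniqueId];
    // columns is an object keyed by column name; Fuse needs an array
    var columns = Object.values(node.columns || {}).map(function (column) {
      return {
        name: column.name,
        description: column.description || "",
        tags: column.tags || []
      };
    });
    return {
      node: uniqueId,
      description: node.description || "",
      alias: node.alias,
      columns: columns,
      tags: node.tags || [],
      raw_sql: node.raw_sql || ""
    };
  });
}
```

Each entry keeps the unique ID in `node`, so a search hit can still be mapped back to the original manifest entry.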

joellabes · 2021-01-04T09:22:21Z

@drewbanin Happy New Year! When you're back on deck, I'd appreciate a conceptual review before I go too much further down this rabbit hole:

In particular, is the prepareModelsForSearching work necessary? Is there a JS-ier way to do it? I've punted on giving things good names while I'm experimenting - I'll fix them up before the final review.
I also notice that a lot of the heavy lifting for search is taking place in project_service.js as opposed to search.js. For now, I've stuck with the example, but once everything's working it feels like it might need to be shuffled around a bit.

The other blocker I have at the moment is successfully installing Fuse. I got it working in a new Glitch project perfectly happily, which seems to be identical to the docs project. Obviously I've done something wrong though, or there's a weird interplay between Fuse and some other lib that docs uses.

Once Fuse actually loads and I can get some results, I can dive into highlighting results etc instead of just dumping stuff in the console.

joellabes · 2021-01-08T08:50:28Z

src/app/services/project_service.js

+		  // make a copy
+		  var newModel = _.assign({}, model);
+		  newModel.columns = _.values(model.columns);
+		  newModel.searchableName = getModelName(model);


An interesting side-effect of this approach is that macros from other packages (e.g. dbt-utils and codegen) are surfaced much more often than before, especially on short search terms. This is because the package name is now passed in as searchable text, as opposed to just being added for disambiguation before rendering on the page.

It does the same thing for sources (in that the schema name is searchable as well now). This feels more clear-cut as a benefit.

I've done it this way because fuse gives back a range of matching characters for highlighting, and tacking on an extra string afterwards makes that a bit less elegant than I'd like (and means that schema names etc wouldn't highlight at all). I think this is somewhere between a net positive and neutral, but might be lying to myself so that I don't have to fix it. Open to alternative views.
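For illustration, a rough sketch of turning such ranges into highlighted markup. It assumes each range is a `[start, end]` pair with an inclusive end, and that the ranges arrive sorted and non-overlapping. `highlightRanges` and the `<mark>` wrapper are placeholders, not what this PR renders, and the text is not HTML-escaped here:

```javascript
// Hypothetical helper: wrap each matched range of `text` in <mark> tags.
// Assumes ranges are sorted, non-overlapping [start, end] pairs (end inclusive).
function highlightRanges(text, ranges) {
  var out = "";
  var cursor = 0;
  ranges.forEach(function (range) {
    var start = range[0];
    var end = range[1] + 1; // convert the inclusive end to an exclusive one
    out += text.slice(cursor, start);
    out += "<mark>" + text.slice(start, end) + "</mark>";
    cursor = end;
  });
  // append whatever follows the last match
  return out + text.slice(cursor);
}
```

Because the package or schema name is part of the searched string itself, the ranges line up with the rendered name without any offset arithmetic.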

In general, I'm a bit bemused at the number of different ways different resources can be named, and that there isn't a canonical field for this. Or is there?

haha - ok - this one sounds like a can of worms. dbt does have "unique identifiers" but they are not semantically super useful (eg. they look like model.fishtown.fct_users). It would be really good to give every node type a canonical name for search representation, but that's something we'd need to layer in as you've done here AFAIK

Going to come back to this to take a deeper dive soon

joellabes · 2021-01-08T08:56:35Z

src/app/components/search/search.js

+                           if(
+                           		   (show_names && match.key === "searchableName") 
+                           		|| (show_descriptions && (match.key === "description" || match.key == "columns.description")) 
+                           		|| (show_columns && match.key === "columns.name") 
+                           		|| (show_code && match.key === "raw_sql") 
+                           		|| (show_tags && (match.key === "tags" || match.key == "columns.tags"))
+                           ) {


Technically, these tickboxes no longer do exactly what they used to. They used to be "search only these fields for the query term", but are now "only return results where this field was one of the ones that included a query term".

In real life, this means that a model tagged with my-cool-tag will lose out to a different model called my_cool_tag and tagged with something-else when searching for cool tag with Tags Only ticked.

As far as I can see, to get around this would require a re-instantiation of Fuse with a new index (by passing just the relevant keys) each time these tick boxes were toggled.
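A sketch of what that re-instantiation could look like, assuming Fuse's `keys` option and the key names used in the filter above. `keysForCheckboxes`, `searchableModels` and the `checkboxStatus` shape are illustrative names; the `new Fuse(...)` call is left as a comment so the snippet runs without the library:

```javascript
// Which Fuse keys each tick box should contribute (names taken from the filter above).
var KEYS_BY_CHECKBOX = {
  show_names: ["searchableName"],
  show_descriptions: ["description", "columns.description"],
  show_columns: ["columns.name"],
  show_code: ["raw_sql"],
  show_tags: ["tags", "columns.tags"]
};

// Hypothetical helper: collect the keys for every ticked box.
function keysForCheckboxes(checkboxStatus) {
  var keys = [];
  Object.keys(KEYS_BY_CHECKBOX).forEach(function (box) {
    if (checkboxStatus[box]) {
      keys = keys.concat(KEYS_BY_CHECKBOX[box]);
    }
  });
  return keys;
}

// On each toggle, the index would be rebuilt with only the ticked fields, e.g.:
//   service.fuse = new Fuse(searchableModels, Object.assign({}, fuseOptions, {
//     keys: keysForCheckboxes(checkboxStatus)
//   }));
```

The trade-off is that every toggle pays the indexing cost again, which matters if indexing is already slow on larger projects.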

joellabes · 2021-01-08T09:15:36Z

src/app/services/project_service.js

+			const fuseOptions = {
+			  includeScore: true,
+			  includeMatches: true,
+			  ignoreLocation: true,


My understanding of Fuse's options is that by setting ignoreLocation, there's no point in setting threshold. Does that sound right to you?

I'd like to tone down the fuzziness a bit, but what I want to limit isn't how close to the start of the string a term is; it's how aggressively Fuse applies character replacements/transpositions to get a match. And I don't think it offers any other levers to pull.

hm hm hm! I am reading this slightly differently - it sounds to me like ignoreLocation will score matches in any part of the string, but I believe that threshold still has a role to play! I played around with this by setting threshold to 0.1, then searching for "zz" and did not see any matches. Next, I changed the threshold to 0.9 and searched for "zz" and I did see matches!

I bet we can make the search less fuzzy by reducing the threshold from the default (0.6) to a smaller number... unclear to me how to pick that number beyond guess-and-check'ing though :)

You're right! Currently on 0.4 - I tried 0.3 but that excluded some things I thought should be allowed.

joellabes · 2021-01-08T11:28:25Z

src/app/components/search/search.html

+	   overflow: hidden;
+	   text-overflow: ellipsis;
+	   display: -webkit-box;
+	   -webkit-line-clamp: 1; 


It feels like there should be some way to extend a base line-clamp, but my CSS isn't good enough so just copy-pasting for now.

joellabes · 2021-01-08T11:41:59Z

@drewbanin I'm pretty sure everything works end-to-end now! I've left comments and questions in different places.

Next step is to work out why it performs so slowly on larger projects - if you've got any hints on that front that'd be great. I've spent some time enabling and disabling indexing by various keys, but haven't found a specific smoking gun.

I have noticed that it seems to be calling some functions twice, but no idea where or why.

joellabes · 2021-01-10T06:26:31Z

src/app/services/project_service.js

+			  includeScore: true,
+			  includeMatches: true,
+			  ignoreLocation: true,
+			  //useExtendedSearch: true,


I really want to enable extended search, because it means you can prefix things with a single quote to force exact matches over fuzzy ones. However when I enabled it, I found that the matches array has way more sharp edges (to name a few: unsorted, duplicate records, overlapping ranges).

The duplicates and sorting are pretty trivial to resolve, but I haven't come up with an elegant solution to making sense of [[1, 3], [1, 1], [1, 2], [2, 3]]. I think this might be something to come back and have a second pass at once the bulk of it's merged in.
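One possible way to normalise such an array, sketched under the assumption that each pair is an inclusive `[start, end]` range and that touching ranges should be joined for highlighting. `mergeRanges` is a hypothetical helper, not part of Fuse:

```javascript
// Hypothetical helper: sort, de-duplicate and merge overlapping or adjacent ranges.
function mergeRanges(ranges) {
  // sort by start, then by end, without mutating the input array
  var sorted = ranges.slice().sort(function (a, b) {
    return a[0] - b[0] || a[1] - b[1];
  });
  var merged = [];
  sorted.forEach(function (range) {
    var last = merged[merged.length - 1];
    if (last && range[0] <= last[1] + 1) {
      // overlapping or adjacent: extend the previous range
      last[1] = Math.max(last[1], range[1]);
    } else {
      merged.push([range[0], range[1]]);
    }
  });
  return merged;
}
```

With the example above, `mergeRanges([[1, 3], [1, 1], [1, 2], [2, 3]])` collapses to `[[1, 3]]`.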

right on - totally agree - let's get fuse in here and then make it more sophisticated in the future!

drewbanin

@joellabes I took another pass at this today - thanks for your patience!! Looking really good so far. For my first pass, mostly focused on the UX side of things. Need to come back here to dig a little bit deeper into the code and how everything works. Let me know if you have any questions, thoughts, or comments!!

drewbanin · 2021-02-09T14:20:00Z

src/app/main/index.js

@@ -145,67 +145,9 @@ angular
    });

    $scope.$watch('search.query', function(q) {
-        $scope.search.results = assignSearchRelevance(projectService.search(q));
+        $scope.search.results = projectService.search(q);


I think that both this line and the line in components/search/search.js quoted below are called when the search.query variable changes. I'm seeing some input lag between typing a key and seeing results (~150 models in my test project) and I wonder if running the search twice for each keystroke contributes to the latency?

scope.results = filterResults(projectService.search(scope.query), scope.checkboxStatus);

I made a quick loom with some of my observations & open questions - check it out here! https://www.loom.com/share/8422b5b0865d40e0a477cf25cb0815e5

I've taken the redundant search call out of index.js - on my computer it's still sitting around 350ms per keystroke which is odd, I would have expected that to be much lower now that it's only doing the work once. I also did some basic Googling on $$debounceViewValueCommit but nothing jumped right out at me :(

drewbanin · 2021-02-09T14:35:53Z

src/app/services/project_service.js

+		try {
+			//As search terms become longer, be less tolerant of tiny fuzzy matches
+			var shortestWord = q.split(' ').sort(function(a, b){ return a.length - b.length})[0]
+			service.fuse.options.minMatchCharLength = Math.max(1, shortestWord.length - 2);


huh... did you find that search results were too fuzzy before adding this code? I wonder if ratcheting down the threshold will help with this without needing to hack into the search options for each keystroke
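For reference, the per-keystroke rule in the diff above can be isolated as a pure function. This is only a sketch: `minMatchCharLengthFor` is a hypothetical name, and skipping empty words (for example after a trailing space) is an added assumption, not something the PR code does:

```javascript
// Hypothetical helper mirroring the rule in the diff:
// the shortest word in the query, minus two characters, but never below 1.
function minMatchCharLengthFor(query) {
  var words = query.split(" ").filter(function (word) {
    return word.length > 0; // assumption: ignore empty words from extra spaces
  });
  if (words.length === 0) {
    return 1;
  }
  var shortest = words.reduce(function (min, word) {
    return Math.min(min, word.length);
  }, Infinity);
  return Math.max(1, shortest - 2);
}
```

Keeping the calculation separate from `service.fuse.options` makes it easier to compare against a fixed `threshold` when deciding whether the per-keystroke tweak is still needed.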

joellabes · 2021-02-21T08:12:08Z

@drewbanin I’ve gotten rid of the duplicate searches, but it still performs really badly. I don’t have the JS chops to debug this unfortunately - do you mind if I throw the heavy lifting on this one at you? Happy to jump on a call or something and pair on it if you think that’d help, but I’d mostly be along for the ride 😬

joellabes · 2021-04-06T09:34:59Z

@drewbanin another bump here sorry! Would love to get this closed out, but I'm sure there's a lot going on over there 🏎️

joellabes · 2022-04-05T04:22:49Z

I'm abandoning this in favour of unabandoning PR 145, which is simple and works! Fuse.js's weighting didn't feel good to me. Reading the docs doesn't make me confident that those weighting/fuzziness issues are readily resolvable, so I'm not eager to put a lot of effort into resolving crummy performance (especially since there's not really any SMEs available right now!) just to discover that the whole premise is flawed.

Hello world for fuse installation

a13fd9c

cla-bot bot added the cla:yes label Jan 3, 2021

joellabes mentioned this pull request Jan 3, 2021

Swap from exact substring matches to each individual word matching #145

Merged

3 tasks

joellabes marked this pull request as draft January 3, 2021 08:23

Joel Labes added 6 commits January 3, 2021 21:23

Remove console logging of keyboard, add fuse logging

6713574

Import fuse at index.js

925d1a5

Fix accessing uninitialised variable

364b69e

Missing quotation mark

9ab246a

Try importing Fuse

4f57431

Revert "Try importing Fuse"

e0e6ad9

This reverts commit 4f57431.

Joel Labes added 18 commits January 4, 2021 19:04

Test prepare model for search

8d11cb2

Push not add

3acd117

Array

8e7561c

Add rest of transformed node

b01b17d

Missing comma

2d2ccf6

Loop instead of being clever with maps

414d790

declare var

0750f8c

columns fix

09125d7

Coalesce null tags and descriptions

be4e2ce

Explicit not-null check

04a67bf

Revert "Try importing Fuse"

c039ff0

This reverts commit 4f57431.

Add rest of transformed node

3b236c7

columns fix

94a6b1c

Merge branch 'feat-143-fusejs-search' of https://github.com/joellabes…

7c999bc

…/dbt-docs into feat-143-fusejs-search

Tidy fuse setup

146b10b

catch block

2e180fb

DIsable fuse while it doesn't work

8517ffb

Search options config

0f18f01

joellabes commented Jan 8, 2021

Joel Labes added 3 commits January 8, 2021 22:16

Tidy up commented out code

1685f28

Add loop support for column IDs

fbbd8d4

Handle tags

1560eed

joellabes commented Jan 8, 2021

Highlight sql with new approach, remove old approach

3d28b36

Joel Labes added 11 commits January 10, 2021 13:30

Sort by score asc

7c9648b

Remove column filtering

06f1a3e

Only show columns and tags when there are some

3fbe0fa

enable extended search

31a044e

Add link to extended search docs (needs a proper svg)

d80aadf

Revert "Add link to extended search docs (needs a proper svg)"

77eb95e

This reverts commit d80aadf.

Add search hint and link to docs

767ede5

Deal with off-by-ones caused by different indexing when enabling exte…

ee5ce9b

…ndedSearch

Revert "Deal with off-by-ones caused by different indexing when enabl…

0e1c113

…ing extendedSearch" This reverts commit ee5ce9b.

Disable extended search again

6935a10

Remove reference to extended search

4389011

joellabes commented Jan 10, 2021

joellabes marked this pull request as ready for review January 10, 2021 06:27

drewbanin reviewed Feb 9, 2021

Joel Labes added 2 commits February 10, 2021 23:23

Delete duplicate search, tweak threshold to reduce fuzziness

acbef83

Put minMatchCharLength back

82bc8cc

joellabes requested a review from drewbanin February 21, 2021 08:08

joellabes closed this Apr 5, 2022
