Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use fuse.js for dbt docs search #162

Closed
wants to merge 60 commits into from

Conversation

joellabes
Copy link
Contributor

@joellabes joellabes commented Jan 3, 2021

resolves #143, resolves #166

Description

Moving on from the work done in PR #145, use Fuse's fuzzy search instead of hand-rolled version.

Extremely WIP-y, running into issues when running yarn install locally so kicking the work off to the CI version on Netlify

Checklist

  • I have signed the CLA
  • I have generated docs locally, and this change appears to resolve the stated issue
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@joellabes
Copy link
Contributor Author

From the look of things, there will need to be some level of transformation from the raw data in manifest.json, because Fuse expects a list of items but each model in the manifest is its own object. The data that gets passed into Fuse will need to look more like this:

[
	{
		"node": "model.educationperfect.classes",
		"description": "All classes, including staff classes and Unlicensed Users",
		"alias": "classes",
		"columns": [
			{
				"name": "class_id",
				"description": "",
				"tags": []
			},
			{
				"name": "is_current_academic_year",
				"description": "Current means that today's date is inside the bounds of the academic year set on the class (if one is set)",
				"tags": ["some-tag"]
			}
		],
		"tags": ["another-tag"],
		"raw_sql": "with source as (\r\n    select *\r\n    from {{ source('classes', 'classes') }}\r\n),\r\n\r\nrenamed as (\r\n    select     \r\n        id as class_id,\r\n        name as class_name,\r\n        school as org_id,\r\n        datecreated as class_created_at,\r\n        case\r\n            when classtypeid = 1 then 'Standard'\r\n            when classtypeid = 2 then 'Unlicenced Users'\r\n            when classtypeid = 3 then 'Staff'\r\n            when classtypeid = 4 then 'CRM Staff'\r\n        end as class_type,\r\n        academicyearid as academic_year_id,\r\n        classtypeid = 1 as is_student_class \r\n    from source\r\n),\r\n\r\nacademic_years as (\r\n    select *\r\n    from {{ref('classes__academic_years')}}\r\n),\r\n\r\ncombined as (\r\n    select renamed.*,\r\n        coalesce(academic_years.is_currently_active, false) as is_current_academic_year\r\n    from renamed\r\n    left outer join academic_years using (academic_year_id)\r\n)\r\n\r\nselect * from combined",
	},
	
	{
		"node": "model.educationperfect.classes__academic_years",
		"description": "",
		"alias": "classes__academic_years",
		"columns": [
			{
				"name": "academic_year_id",
				"description": "",
				"tags": []
			},
			{
				"name": "start_date",
				"description": "",
				"tags": []
			},
			{
				"name": "end_date",
				"description": "",
				"tags": []
			}
		],
		"tags": [],
      	"raw_sql": "with source as (\r\n    select * from {{source('classes', 'academicyear')}}\r\n),\r\n\r\nrenamed as (\r\n    select id as academic_year_id,\r\n            name as academic_year_name,\r\n            hemisphereid as hemisphere_id,\r\n            startdate as start_date,\r\n            enddate as end_date,\r\n            startdate <= date_trunc('day', getdate()) and enddate >= date_trunc('day', getdate()) as is_currently_active\r\n    from source\r\n)\r\n\r\nselect *\r\nfrom renamed",
	}	
]

than the default:

"columns": {
        "class_id": {
          "name": "class_id",
          "description": "",
          "tags": []
        }
...
}

I'll work on that next for a while, because that doesn't depend on working out why Fuse gives Object is not a constructor errors 🤔

@joellabes
Copy link
Contributor Author

@drewbanin Happy New Year! When you're back on deck, I'd appreciate a conceptual review before I go too much further down this rabbit hole:

  • In particular, is the prepareModelsForSearching work necessary? Is there a JS-ier way to do it? I've punted on giving things good names while I'm experimenting - I'll fix them up before the final review.
  • I also notice that a lot of the heavy lifting for search is taking place in project-service.js as opposed to search.js. For now, I've stuck with the example, but once everything's working it feels like it might need to be shuffled around a bit.

The other blocker I have at the moment is successfully installing Fuse. I got it working in a new Glitch project perfectly happily, which seems to be identical to the docs project. Obviously I've done something wrong though, or there's a weird interplay between Fuse and some other lib that docs uses.

Once Fuse actually loads and I can get some results, I can dive into highlighting results etc instead of just dumping stuff in the console.

// make a copy
var newModel = _.assign({}, model);
newModel.columns = _.values(model.columns);
newModel.searchableName = getModelName(model);
Copy link
Contributor Author

@joellabes joellabes Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting side-effect of this approach is that macros from other packages become much surfaced much more often than before (esp on short search terms), e.g. dbt-utils and codegen, because the package name is also passed in as a search option, as opposed to just for disambiguation before being rendered on the page.

It does the same thing for sources (in that the schema name is searchable as well now). This feels more clearly cut as a benefit.

I've done it this way because fuse gives back a range of matching characters for highlighting, and tacking on an extra string afterwards makes that a bit less elegant than I'd like (and means that schema names etc wouldn't highlight at all). I think this is somewhere between a net positive and neutral, but might be lying to myself so that I don't have to fix it. Open to alternative views.

In general, I'm a bit bemused at the number of different ways different resources can be named, and that there isn't a canonical field for this. Or is there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha - ok - this one sounds like a can of worms. dbt does have "unique identifiers" but they are not semantically super useful (eg. they look like model.fishtown.fct_users). It would be really good to give every node type a canonical name for search representation, but that's something we'd need to layer in as you've done here AFAIK

Going to come back to this to take a deeper dive soon

Comment on lines +55 to +61
if(
(show_names && match.key === "searchableName")
|| (show_descriptions && (match.key === "description" || match.key == "columns.description"))
|| (show_columns && match.key === "columns.name")
|| (show_code && match.key === "raw_sql")
|| (show_tags && (match.key === "tags" || match.key == "columns.tags"))
) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, these tickboxes no longer do exactly what they used to. They used to be "search only these fields for the query term", but are now "only return results where this field was one of the ones that included a query term".

In real life, this means that a model tagged with my-cool-tag will lose out to a different model called my_cool_tag and tagged with something-else when searching for cool tag with Tags Only ticked.

As far as I can see, to get around this would require a re-instantiation of Fuse with a new index (by passing just the relevant keys) each time these tick boxes were toggled.

const fuseOptions = {
includeScore: true,
includeMatches: true,
ignoreLocation: true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of Fuse's options is that by setting ignoreLocation, there's no point in setting threshold. Does that sound right to you?

I'd like to tone down the fuzziness a bit, but it's not in terms of how close to the start of the string a term is, it's how aggressively it thinks character replacements/transpositions should be to get a match. And I don't think it offers any other levers to pull.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm hm hm! I am reading this slightly differently - it sounds so me like ignoreLocation will score matches in any part of the string, but I believe that threshold still has a role to play! I played around with this by setting threshold to 0.1, then searching for "zz" and did not see any matches. Next, I changed the threshold to 0.9 and searched for "zz" and I did see matches!

I bet we can make the search less fuzzy by reducing the threshold from the default (0.6) to a smaller number... unclear to me how to pick that number beyond guess-and-check'ing though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! Currently on 0.4 - I tried 0.3 but that excluded some things I thought should be allowed.

overflow: hidden;
text-overflow: ellipsis;
display: -webkit-box;
-webkit-line-clamp: 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like there should be some way to extend a base line-clamp, but my CSS isn't good enough so just copy-pasting for now.

@joellabes
Copy link
Contributor Author

joellabes commented Jan 8, 2021

@drewbanin I'm pretty sure everything works end-to-end now! I've left comments and questions in different places.

Next step is to work out why it performs so slowly on larger projects - if you've got any hints on that front that'd be great. I've spent some time enabling and disabling indexing by various keys, but haven't found a specific smoking gun.

I have noticed that it seems to be calling some functions twice, but no idea where or why.

includeScore: true,
includeMatches: true,
ignoreLocation: true,
//useExtendedSearch: true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really want to enable extended search, because it means you can prefix things with a single quote to force exact matches over fuzzy ones. However when I enabled it, I found that the matches array has way more sharp edges (to name a few: unsorted, duplicate records, overlapping ranges).

The duplicates and sorting are pretty trivial to resolve, but I haven't come up with an elegant solution to making sense of [[1, 3], [1, 1], [1, 2], [2, 3]]. I think this might be something to come back and have a second pass at once the bulk of it's merged in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right on - totally agree - let's get fuse in here and then make it more sophisticated in the future!

@joellabes joellabes marked this pull request as ready for review January 10, 2021 06:27
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joellabes I took another pass at this today - thanks for your patience!! Looking really good so far. For my first pass, mostly focused on the UX side of things. Need to come back here to dig a little bit deeper into the code and how everything works. Let me know if you have any questions, thoughts, or comments!!

@@ -145,67 +145,9 @@ angular
});

$scope.$watch('search.query', function(q) {
$scope.search.results = assignSearchRelevance(projectService.search(q));
$scope.search.results = projectService.search(q);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that both this line and this line in components/search/search.js are both called when the search.query variable changes. I'm seeing some input lag between typing a key and seeing results (~150 models in my test project) and I wonder if running the search twice for each keystroke contributes to the latency?

scope.results = filterResults(projectService.search(scope.query), scope.checkboxStatus);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a quick loom with some of my observations & open questions - check it out here! https://www.loom.com/share/8422b5b0865d40e0a477cf25cb0815e5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken the redundant search call out of index.js - on my computer it's still sitting around 350ms per keystroke which is odd, I would have expected that to be much lower now that it's only doing the work once. I also did some basic Googling on $$debounceViewValueCommit but nothing jumped right out at me :(

const fuseOptions = {
includeScore: true,
includeMatches: true,
ignoreLocation: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm hm hm! I am reading this slightly differently - it sounds so me like ignoreLocation will score matches in any part of the string, but I believe that threshold still has a role to play! I played around with this by setting threshold to 0.1, then searching for "zz" and did not see any matches. Next, I changed the threshold to 0.9 and searched for "zz" and I did see matches!

I bet we can make the search less fuzzy by reducing the threshold from the default (0.6) to a smaller number... unclear to me how to pick that number beyond guess-and-check'ing though :)

includeScore: true,
includeMatches: true,
ignoreLocation: true,
//useExtendedSearch: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right on - totally agree - let's get fuse in here and then make it more sophisticated in the future!

try {
//As search terms become longer, be less tolerant of tiny fuzzy matches
var shortestWord = q.split(' ').sort(function(a, b){ return a.length - b.length})[0]
service.fuse.options.minMatchCharLength = Math.max(1, shortestWord.length - 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh... did you find that search results were too fuzzy before adding this code? I wonder it ratcheting down the threshold will help with this without needing to hack into the search options for each keystroke

// make a copy
var newModel = _.assign({}, model);
newModel.columns = _.values(model.columns);
newModel.searchableName = getModelName(model);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha - ok - this one sounds like a can of worms. dbt does have "unique identifiers" but they are not semantically super useful (eg. they look like model.fishtown.fct_users). It would be really good to give every node type a canonical name for search representation, but that's something we'd need to layer in as you've done here AFAIK

Going to come back to this to take a deeper dive soon

@joellabes
Copy link
Contributor Author

@drewbanin I’ve gotten rid of the duplicate searches, but it still performs really badly. I don’t have the JS chops to debug this unfortunately - do you mind if I throw the heavy lifting on this one at you? Happy to jump on a call or something and pair on it if you think that’d help, but I’d mostly be along for the ride 😬

@joellabes
Copy link
Contributor Author

@drewbanin another bump here sorry! Would love to get this closed out, but I'm sure there's a lot going on over there 🏎️

@joellabes
Copy link
Contributor Author

I'm abandoning this in favour of unabandoning PR 145, which is simple and works! Fuse.js's weighting didn't feel good to me. Reading the docs doesn't make me confident that those weighting/fuzziness issues are readily resolvable, so I'm not eager to put a lot of effort into resolving crummy performance (especially since there's not really any SMEs available right now!) just to discover that the whole premise is flawed.

@joellabes joellabes closed this Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Searches can be typed in before search is initialised Implement fuzzier search for model names in docs
2 participants