User testing for dataset factories syntax #2602

Closed
merelcht opened this issue May 22, 2023 · 1 comment

merelcht commented May 22, 2023

Description

We need to test that the proposed dataset factories implementation for #2423 actually works for users.

Exercise

  1. Open GitPod link
  2. This contains a Kedro spaceflight project
    1. Contains two pipelines: data processing and data science
    2. Price-prediction project for return trips to the Moon
    3. Data: companies, reviews, shuttles
  3. Based on the explanation and example of dataset factories, modify the project to reduce dataset repetition in the catalog (see the sketch after this list).
  4. What repetitive patterns do you see?
  5. What parts of the project should be updated? (catalog.yml + pipeline.py for both pipelines)
  6. What do you think of this feature? Ease of use, syntax, possibilities.
  7. Is there anything missing?
  8. Run kedro catalog resolve
  9. What do you think of that CLI command?
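
For context, this is roughly the kind of change the exercise asks for. A minimal catalog.yml sketch, assuming the standard spaceflights starter layout (the dataset names, types and file paths are assumptions, not the exact project contents):

```yaml
# Before: one near-identical entry per preprocessed table
# preprocessed_companies:
#   type: pandas.ParquetDataSet
#   filepath: data/02_intermediate/preprocessed_companies.pq
# preprocessed_shuttles:
#   type: pandas.ParquetDataSet
#   filepath: data/02_intermediate/preprocessed_shuttles.pq

# After: one factory pattern; {name} is filled in from whatever dataset name
# the pipelines reference, e.g. "preprocessed_companies" or "preprocessed_shuttles".
"preprocessed_{name}":
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_{name}.pq
```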

Metrics that define success of the feature:

  • Do users find the syntax intuitive to use?
  • Can users come up with working syntax themselves?
  • Does it solve their problem of long repetitive catalogs?
  • Does the “resolve” CLI command help in understanding the dataset factories?

Open questions:

  • Do we also need to test the parsing rules?
  • What's the best way to show the user the explanation? Send it to them beforehand? Share during the interview?
@merelcht merelcht added the Stage: User Research 🔬 Ticket needs to undergo user research before implementation label May 22, 2023
@merelcht merelcht self-assigned this May 23, 2023

merelcht commented Jun 15, 2023

Summary

Interviewed 10 users:

  • 2 users from the open-source community, 8 QuantumBlack users
  • 1 data engineer, 6 data scientists, 2 software/MLOps engineers

Personal favourite user quotes:

"It gets rid of the need to do Jinja and it also gets rid of the need to like define a global list that you use in your catalog." 🔥

"I love it definitely love it. I feel it's actually, it's more descriptive than, than we had previously." ❤️

"It seems super simple. It's actually, I wish this was a month ago that was already an option." 🤩

"Decreasing the size of the catalog and potentially even having just one generic catalog that you only need this one file and it will work for all of your different pipelines because the pipeline will change the catalog could be extremely powerful." 💪

About kedro catalog resolve: "Unreal. That's an amazing feature." 💥

Insights from user interviews

1. Users would use dataset factories

9 users explicitly said they would use this feature.

"Yeah, absolutely.[...] it gets rid of the funky, I guess it replaces the funky Jinja with a funky wild carding but I think this is much more compact and like makes a lot more sense [...] it is more compact than using a big for loop and like having the structure of your code be changed."

"Awesome. Like, yeah, to be honest, I super like it right now. Yeah. It's it's super good. It's super descriptive."

"It would be super simple to use. Yeah, it's, I will be a, a big fan of it."

"So I guess, you know, refactoring huge catalog to small one is always a, a good place because you, you decrease the surface of errors. "

However, several users did mention they were cautious about using it by default:

  • Users have doubts about using it for all catalogs

    "Yeah, I think I would use it but I'm cautious about over-engineering it. I think like using it in narrow spaces like this for all my input data makes sense."

    "It's intuitive but the suffixes make me think about it. Do I need to have those suffixes and so on? I like the, the original names better, the, the ones that are straightforward without the specific implementation detail, like intermediate or not. But you know, at the end of the day I can live with that. If it's, if it simplifies the catalog to, to only a few entries, then it's fine for me."

    "So I think unless my catalog is really, really long, I will probably not use it. I think that's the vibe that I'm getting."

2. There should be a warning about the catch-all pattern that replaces default dataset creation

5 users mentioned this explicitly:

"if you did a dataset that replaced all the memory dataset, would we get like a warning or an error or something?"

"Maybe it'll be nice to, for users if they define something like that, to give them like warning or even to give them a prompt that they made or ask them if it's, if it's expected. Because I guess that that will be the, the first mistake everyone makes when they start working with the feature."

3. Users need clear guidelines/explanation about how the pattern matching works

8 users:

"Only my problem [...] I do not really understand what will happen with Layer here."

"I guess maybe that's the question. Is the layer it has to be in the dataset name or does it get like injected if I add it to the node definition?"

"I guess point of ambiguity for me, like companies underscore csv, how does it know companies isn't a layer and CSV isn't a data set name if it matches that. Yeah, like something underscore something that feels very unclear."

"what happens in the previous example if you have multiple namespaces added to [the dataset], that's possible if I remember correctly. So here you have only one namespace, but you can define a nested structure of name spaces. So what happens with that example?"

"You might want to have maybe more examples in the documentation so that people can, can go there and, and try other, other more specific things."

"I think it's having just some more examples would kind of be helpful. I think, well now doing this [together] [...] makes it super clear but I think to have some more documentation on that."

Users want to know how dataset factories work with existing solutions
"A question that I had is the ordering on parsing. Like if I did the classic like yaml anchoring and had something like this, does this play nicely with the, the wild card that we defined or the factory that we defined?"

On how to use this with OmegaConfigLoader: "For people who don't know Kedro, why is this like defined with dollar sign but this one is defined without dollar sign and why is this result in some other place?"

Users need clarity about which patterns match which datasets
About overlapping patterns: "that there [is a] warning saying that so like that you know, that you're [...] pointing to the same kind of directly to the same dataset with two different patterns."

There is some confusion about the meaning of special characters in patterns and whether they conflict with existing features such as transcoding
"[...] when comes to the transcoding feature is this, does this have like a special treatment in the framework itself underneath the hood? If there's the @ sign in the dataset name or it's just a convention, I could just as well use other special character or transcoding."

4. Dataset factories make dataset names in the pipeline very verbose

3 users mentioned this:

"if we're happy to put sort of quite verbose and and descriptive names in the pipeline, then that's fine, but it just [...] you're moving a problem from one place to another and maybe it's better in that place than the other"

"that's my fear that it'll get convoluted like really fast and yeah, that, that may be a thing"

"I mean the fact that you had to create some suffix is really problematic, especially when you move between environments and for example, the idea behind the catalog was that you don't have to give a specific names."

5. Users like having an explicit catalog

3 users:

"my general instinct is I like to be super explicit about things and one of the things that to me is that like I'm not gonna have any clue what the catalog is actually gonna look like until I run the thing."

"I'm generally of the belief that like we shouldn't over-engineer the catalog. Like Jinja is a great example of that. I think Jinja overall decreases readability, maintainability, even though it, you know, decreases the number of like keys you need to hit."

6. Dataset factories will enforce structure and naming conventions

3 users:

"So [for] bigger projects it's more valuable to enforce that structure. I think maybe that's one of the benefits of this is it really pushes people into a corner. They've gotta meet a standard and all the names are going to match each other, which is fantastic."

"it's [...] very prescriptive way to, instead of just using generic names or [...] you can use very descriptive, non like nonfunctional name, which is awesome I feel."

7. Dataset factories replace current solutions

3 users:

"The only time I've ever used template a config loader was for loops. So this is basically a complete replacement. Like if you, as long as you can have the, the factory for a name space that completely removes the need to do a for loop in my opinion."

"What we are seeing right now on the screen for the use cases with namespaces, that actually simplifies everything for them. And I mean it's much simpler and much more logical to define it once like it's shown here instead of, you know, having those copy pasting stuff done in, in catalog yaml."

8. Users find kedro catalog resolve useful, especially for debugging

3 users:

"If it was just a way to see like which things are getting matched for at the very least debugging. Yeah, no this is great. [...] I think it works great. This is definitely something I would use"

"yeah, that's nice. It's, it'll be useful for the debugging especially."

"Unreal. That's an amazing feature."

9. Users would like additional flags for kedro catalog resolve

  • "if you could like filter with a wild card match or something to say like filter to just things that end in csv"
  • "Something like by namespace would be great rather than like spitting out everything, you just spit it out for one particular namespace"
  • "if you would then do a Kedro catalog resolved, then at least you have a good overview. Especially if you [...] specify the environment"
  • "you could add an option to save to a file"

10. Users like that dataset factories allow overwriting default dataset creation

2 users:

"And that's actually a cool feature if you know about it. I, because if that's actually override the default dataset, that means that yeah, it you don't have to to create specific runner anymore, right?"

"So like one thing I would use a lot is probably default to a pickle dataset for everything and then explicitly call out parque data sets for a sub pattern."

"Actually kind of a cool feature. I didn't think about that. Yeah, that's really cool."
