Question about package domain scope: please clarify "data munging" #21

Closed
eriknw opened this issue Dec 30, 2022 · 9 comments

Comments

@eriknw
Contributor

eriknw commented Dec 30, 2022

I was reading this:
https://www.pyopensci.org/peer-review-guide/software-peer-review-guide/author-guide.html#python-package-domain-scope

and it seems to me that "data munging" is potentially the largest scope (i.e., lots of packages "do something with data"), but it doesn't seem clearly defined or explained (IMHO). I think it would help to have examples. For example, would numpy, scipy, pandas, networkx, and scikit-learn be within scope if they weren't already well-established packages with communities? What are canonical examples of "data munging" packages that are within scope?

Thanks!

@NickleDave
Contributor

NickleDave commented Dec 30, 2022

Hi @eriknw and welcome to pyOpenSci!

Thanks for your feedback.

Just for context (in case anyone hasn't read the page you linked), data munging is defined there as "Tools for processing data from scientific data formats."

You are right that it would be good to have examples.

As you probably saw, we are finishing up an overhaul of our guides. Once we finish, we will be able to start actively reviewing packages again, now that we have a new fiscal sponsor.

So the first answer to your question is: we will probably have a lot more examples soon!

> What are canonical examples of "data munging" packages that are within scope?

> would numpy, scipy, pandas, networkx, and scikit-learn be within scope if they weren't already well-established packages with communities?

We are focused on more domain-specific packages that build on top of the well-established packages you name, as @lwasser explains here: https://www.pyopensci.org/blog/what-makes-open-source-python-package-healthy.html#a-note-about-our-pyopensci-packages

So, no, those packages would not be in scope, although you are right that, under a very loose definition of "data from scientific formats", they all technically have functionality for data munging.

Some examples of packages we've already reviewed that have data munging are:

Similarly, you can see that packages with data-munging functionality from our sister org rOpenSci focus on more or less domain-specific data formats, e.g., medical record transcription data and spatial data:
https://ropensci.org/tags/data-munging/

@lwasser maybe we should:

  • include examples for each category on the scope page
    • in the current revision or later? It would be nice to be able to click on each category and get a list of packages populated by a tag. I know, easier said than done, just trying to think of how we might help make these definitions more obvious
  • revisit categories once you recover from revising the guides
    • I think these were originally adopted from rOpenSci but I don't actually see "munging" any more as a category there.

@eriknw I hear what you are saying that "data munging" is a very broad term.

Would linking to examples address this issue in your mind?
I'll let @lwasser reply (she is out of office until after New Year's).
I am not sure what else we could do to clarify "data munging" specifically but please feel free to discuss categories and scope more broadly on the forum: https://pyopensci.discourse.group/

@lwasser
Member

lwasser commented Dec 30, 2022

Hi @eriknw!! Welcome to pyOpenSci!! 👋 I'm mostly offline through end of day next Monday, but I wanted to say hello! AND thank you for the question 🎆

I'm just curious: are you considering submitting a package to us and trying to better understand scope?

Or are you trying to help us (THANK YOU!) clarify those areas in our scope so others can better understand what is in vs. out of scope? Perhaps those bullets are confusing (this has actually been brought up before by @arianesasso, so it is well worth considering carefully!).

Or maybe both?

@eriknw
Contributor Author

eriknw commented Dec 30, 2022

Hi there, thanks for the replies!

The answer to @lwasser's question is, potentially, both. I was trying to understand scope, purpose, and vision, and thought data munging in particular needs clarification for the benefit of everyone.

To add context, I'm considering submitting python-graphblas, but I haven't brought it up with the team yet, and it's still not clear to me whether it would be in scope (and I know the proper way to determine this is to raise an issue to ask the question). The super-short summary is that python-graphblas is like NumPy for graphs. Where scipy.sparse and networkx are used, which is across all scientific domains, python-graphblas can likely make things faster and more scalable, and enable new analytics. This is why I asked whether numpy, scipy, and networkx would be within scope if they weren't already major projects.

The examples I've seen (thanks @NickleDave!) tend to be very niche, specific, and close to the science or application. I think python-graphblas is a weird example in many ways. It's niche b/c sparse linear algebra and graphs are niche, but it's also very general and not close to any particular domain or application.

@NickleDave
Contributor

Thank you for your quick reply @eriknw.
I should have started by asking whether you were asking WRT submitting a package.

I don't want to speak for @lwasser again in some way that makes her reply while she's trying to not work 😬 but I would not say that our scope is limited to very particular domains or applications. One of our goals is to help packages in those domains achieve consistent standards that align with core scientific Python packages.

My immediate impression is that python-graphblas would be in scope, because it provides a Python interface to GraphBLAS. Providing access to research software tools written in other languages through Python is definitely something we want to do. For a similar package we've reviewed, see pygmt.

@lwasser I actually can't find good language in the guide right now about this, do we need to say more about "Python wrappers / interfaces for tools in other languages" somewhere? (We do talk about API wrappers but that's obvs not the same thing).

@eriknw
Contributor Author

eriknw commented Dec 30, 2022

Right on, thanks for the quick replies all! (And do please enjoy the holidays and time off if applicable)

I think it's good enough for this issue that python-graphblas may be within scope. Staying on topic and following up on @NickleDave's last reply, I think it could be more clear why it or packages like it may be in scope. Does it fall under data munging or other (perhaps to-be-created) categories?

As an outsider reading the pyOpenSci website, the vision and purpose seem to be fairly broad and inclusive w.r.t. scientifically oriented packages. The specific section "Python package domain scope" seems narrower. It's probably pretty difficult to adequately define scope in such a way.

@eriknw
Contributor Author

eriknw commented Dec 30, 2022

Also, my specific questions have been answered, so feel free to close. Thanks again!

@lwasser
Member

lwasser commented Jan 3, 2023

Hi @eriknw, I just wanted to follow up after reading the comments above, and then I will close.

Our scope categories came out of early pyOpenSci meetings. Early on it made sense to be broad and to focus on things I was working on (geospatial & education!), since I had more expertise there. As such, I think we need to revisit them. You aren't the first with this question!

> For example, would numpy, scipy, pandas, networkx, and scikit-learn be in scope

I want to modify @NickleDave's response: those packages are definitely in our domain scope to be reviewed.

However, because they have huge maintainer teams and high-quality infrastructure, and are widely visible, they aren't target packages for us to review. But, for example, rOpenSci did an early review of tidyverse (which is big and widely used in the R world). So I don't want to say that they aren't in scope. They technically are. They just aren't the types of packages we are focused on now.

Your package does NOT need to be tightly linked to a specific domain for us to review it (even tho you may see examples of this right now in our ecosystem). It can have general applications and still be in scope. And I do think it IS in scope for us (as David said too!).

And the fact that it wraps another tool would not make it out of scope. We just want to ensure it does so using best practices (particularly thinking about future maintenance if a maintainer steps down). That is a technical scope issue rather than a domain scope.

I hope that helps. We shall revisit this for sure!
Just for information collecting, If you could "classify" your package what categories (y) would you put it in?

lwasser closed this as completed Jan 3, 2023

@lwasser
Member

lwasser commented Jan 3, 2023

oh also happy new year!!!

@eriknw
Contributor Author

eriknw commented Jan 5, 2023

> Just for information collecting, If you could "classify" your package what categories (y) would you put it in?

I would categorize it with numpy, scipy.sparse, pandas, xarray, awkward, and networkx. So maybe "Data Structures", "Restricted Computational Domain" (a James Powell term), and "Graph Theory / Network Analysis". Of the current categories, probably "Data Munging".
