Trying to define "data frame" #2

wesm opened this issue May 14, 2020 · 19 comments

@wesm

wesm commented May 14, 2020

There was a question on the sync call today about defining "what is a data frame?". People may have different perspectives, but I wanted to offer mine:


A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, is a collection of columns each having their own logical data type) in a general purpose programming language. The interface often exposes imperative, composable constructs where operations consist of multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.


Things that IMHO should not be included in the definition because they are implementation-specific concerns; any given "data frame library" may handle them differently:

  • Presumptions of data representation (for example, most data frame libraries in Python have a bespoke / custom representation based on lower-level data containers). This includes type-specific questions like "how is a timestamp represented" or "how is categorical data represented", since these are implementation dependent. Also, just because the programming interface has "columns" does not guarantee that the data representation is columnar.
  • Presumptions of data locality (in-memory, distributed in-memory, out-of-core, remote)
  • Presumptions of execution semantics (eager vs. deferred)

Hopefully one objective of this group will be to define a standardized programming interface that avoids commingling implementation-specific details into the interface.

That said, there may be people who want to create "RPandas" (see RPython) -- i.e. to provide for substituting new objects into existing code that uses pandas. If that's what some people want, we will need to clarify that up front.

@TomAugspurger

Thanks Wes. I think the only thing I would clarify in your definition is that the columns are named. I think it's safe to require named columns in a data frame (whether to require uniqueness, or to allow only string names, is a topic to discuss though). It probably isn't appropriate to include row labels in the definition, despite them being present in at least R and pandas.

Agreed with all of your exclusions from the definition.


Can you clarify a question about your RPandas comment? Are you saying that the API we define shouldn't limit itself to a subset of the pandas API? If so, then agreed.

@aregm

aregm commented May 15, 2020

I think that this definition is too high-level. The terms "dataset", "column", and "logical data type" require precise definition, and the same goes for "data frame" itself. The data frame as a concept was introduced in S and has a very specific meaning in S, and then in R: a table-like structure that allows matrix-like operations. I strongly prefer to have a definition in terms of an algebra, similar to the relational model, which would allow defining the semantics of the operations, and thus the execution model, from which one can derive an API. SQL is just one example of an API implementing Codd's relational algebra. I think that Kepner's associative array math is a good example of a formalism that could power this.
I agree that implementation specifics should be kept out from the definition.

@wesm
Author

wesm commented May 15, 2020

Can you clarify a question about your RPandas comment? Are you saying that the API we define shouldn't limit itself to a subset of the pandas API? If so, then agreed.

Yes, that's what I mean. One productive thing we could do would be to highlight some of the known deficiencies / limitations of the pandas API as examples of ways that we ideally would like to do better going forward.

@wesm
Author

wesm commented May 15, 2020

I strongly prefer to have a definition in terms of algebra, similar to the relational model, which will allow defining semantics of the operations, thus defining the execution model and then one can derive API.

This is tricky. If you survey the spectrum of libraries that are considered by users to be "data frame" libraries, the only real commonality you have is that

  • Data consists of named columns that have different types
  • API calls either access data in that structure or apply manipulations to it, yielding new tabular data objects

In R/S, a data.frame is simply a subclass of an S/R list object. Its only built-in functionality consists of:

  • Adding or removing columns
  • Subsetting columns
  • Subsetting rows
  • Setting row names (as any R vector)

We can say that the goal is to define a general purpose "data frame algebra" (this is what Ibis does, for example), but there are plenty of "data frame" libraries that have no such algebra implemented. They merely provide a simple programming interface for interacting with tabular data. Berkeley BIDS for example created a minimalistic data frame library for pedagogical purposes because they felt the expansive nature of pandas got in the way of teaching certain things.

This presentation of mine from 2015 goes more in depth on this topic of defining "data frame".

@devin-petersohn
Member

A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, is a collection of columns each having their own logical data type) in a general purpose programming language. The interface often exposes imperative, composable constructs where operations consist of multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.

@wesm I think this viewpoint is problematic for a number of reasons.

1.) If it is just a programming interface for tables, is an ORM a dataframe? ORMs are composable interfaces for relational tables, so why isn't SQLAlchemy a dataframe by this definition? If SQLAlchemy is a dataframe, then why would anyone use pandas?

2.) If a dataframe is just an interface for a table, we have nothing to discuss because standards are well defined for relational structures. Dataframes were created in S to treat objects as matrix-like and table-like with no pre-defined schema: Read Here!. What is the difference between a dataframe and a relation in your definition?

3.) It leaves out some important widely used components of dataframes. I am of the mindset that we should try to understand, support, and optimize for what users are trying to do.

We need to differentiate the dataframe from other existing, well-defined data structures, or agree that it is just a table/matrix/spreadsheet. There is no sense in defining new APIs on existing APIs/standards/structures.

It is much easier to answer questions like "How is a dataframe different from X?" than it is to just say "What is a dataframe?", so that is where I propose we start.


My perspective is taken from lessons I have learned from understanding and studying user behaviors. I think it is important that we try to support and maintain those behaviors and functionalities and optimize from that constraint. It is also informed by history: the dataframe emerged from a specific need that is/was not met by existing data structures.

While the dataframe has roots in both relational and linear algebra systems, it is neither a table nor a matrix. We can conceptualize dataframes from both relational and linear algebra points of view, but the dataframe has some data model differences that ultimately conflict with the fundamental data model of each.

From a relational viewpoint, dataframes consist of:

  • An ordered table
  • Named rows that can be any data type
  • Column names that can be any data type
  • Column and row symmetry
  • A lazily-induced schema ⭐
  • Support for linear algebra operators (e.g. matrix multiply)

The lazily-induced schema essentially allows the dataframe system to infer the types itself rather than requiring that they be declared upfront. This is something that relational systems cannot do.

From a linear algebra viewpoint, dataframes consist of:

  • Heterogeneous matrix-like data structure
  • Both numeric and non-numeric types
  • Explicit row and column labels
  • Support for relational algebra operators (e.g. join)

It is important to note that we don't know how to solve some of these problems optimally yet. That is the exciting thing about working on dataframes: there are plenty of unsolved problems. It won't just be an engineering exercise.


I think this thread will likely get very cluttered if we try to discuss each component of the dataframe in one place, and it will be difficult to gauge consensus. It is very likely that there will be disagreement on certain components of the data model, since a disparate set of tools with very different capabilities is represented here.

@wesm
Author

wesm commented May 15, 2020

1.) If it is just a programming interface for tables, is an ORM a dataframe? ORMs are composable interfaces for relational tables, so why isn't SQLAlchemy a dataframe by this definition? If SQLAlchemy is a dataframe, then why would anyone use pandas?

The primary modality of the programming interface is "expressing data manipulations and analytical operations on tabular datasets". I don't think that describes the primary modality of an ORM like SQLAlchemy.

For example, some backends of Ibis are implemented using SQLAlchemy. The SQL query SELECT sum(x) FROM table is expressed in Ibis as table.x.sum(). Creating this query in SQLAlchemy requires using an interface whose modality is not designed for data analysis.
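A rough sketch of that difference in modality (construction details vary across versions of both libraries; the table and column here are made up):

```python
import ibis
from sqlalchemy import column, func, select, table

# Ibis: the dominant modality is expressing analytics on the table itself.
t_ibis = ibis.table([("x", "double")], name="t")  # unbound table expression
expr = t_ibis.x.sum()                             # ~ SELECT sum(x) FROM t

# SQLAlchemy: the same query assembled through a SQL-construction interface
# (1.4+/2.0-style select()).
t_sa = table("t", column("x"))
stmt = select(func.sum(t_sa.c.x))                 # SELECT sum(t.x) FROM t
```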

My point really is that "data frame" is just a name and it means different things to different people. It happens that some "data frame libraries" have substantially different scopes of features, but what they have in common is providing a programming interface whose dominant modality is data manipulation and analytics on tabular datasets.

@devin-petersohn
Member

This is tricky. If you survey the spectrum of libraries that are considered by users to be "data frame" libraries...

I think we should not try to include all systems that market themselves as dataframes in this definition. We should be opinionated and precise, otherwise we will end up where we are now, with no well-defined way of determining what a dataframe is or what the API should be. Users will call something what it is marketed as, so I don't think that calling a project a dataframe makes it so.

My point really is that "data frame" is just a name and it means different things to different people.

Yes, I completely agree that this is the problem. It has more or less become a marketing term. In my opinion we should define some standard that will determine whether a system conforms to a precise and specific definition; otherwise we may end up not making meaningful progress in defining the API either.

@maartenbreddels

Maybe we shouldn't try to push a very specific definition onto the word dataframe if it is used very broadly/loosely.

What about having descriptive names for different use cases? Let's say (for lack of a better name) we call a 'dataframe level-0' a bag of ordered columns, as described above by Wes. That definition fits almost all dataframes and would allow us to define an interchange API. Maybe this is what we should call a dataframe in the end, but if we want to be specific in a discussion, we may want to give it a more explicit name.

Later on, when we work on a computational API/operations/features, we can talk about a dataframe level-1 (again, for lack of a better name).

Maybe we can start by having very specific/boring names that match very specific goals and see if we can regroup/rename them when we have a better perspective.

For instance, I don't think Vaex will ever match the description by @devin-petersohn, but it would still be good to have names/APIs for the overlap between Vaex/Pandas/Modin, and to be able to describe what Modin/Pandas can do in addition to Vaex and vice versa (if we ever get there).

@maartenbreddels

To make this concrete, we could consider adopting some of the nomenclature in https://arxiv.org/pdf/2001.00888.pdf, such as a matrix dataframe (all columns of the same type, int or float).

@rgommers
Member

I think that this definition is too high-level. The terms "dataset", "column", and "logical data type" require precise definition, and the same goes for "data frame" itself.
...
There is no sense in defining new APIs on existing APIs/standards/structures.

@aregm and @devin-petersohn I think it would be useful to separate concerns a little here. You are diving straight into defining things so precisely that a lot of semantics are fixed. Those are good things to worry about and we'll indeed need to deal with that, however it's the most detailed level at which we should define things in an API.

A few thoughts:

  • @wesm's initial definition seems like a good starting point. The exclusions are useful, few assumptions are made, and it doesn't prohibit making things more specific.
  • The layered definition @maartenbreddels talks about is also what I had in mind, at least conceptually. The more detailed it gets, the fewer libraries will be (or be able to become) fully compliant. That seems fine; it's then still useful to be compliant only up to some level and to document deviations on other levels.
  • Beyond the definition of "what is a data frame", there'll be questions to answer about interchange formats (this is #1, "Related topic: dataframe protocol for data interchange/export"), about functions/methods/signatures in an API, and then about semantics. This is related to "what is an API standard", see https://github.com/pydata-apis/workgroup/issues/2
  • Having use cases defined will really help here (I'll start a separate issue to solicit those). For example, "I need well-defined semantics for a set of operations X, so I can build a much more performant implementation". There are also participants with existing dataframe implementations - Pandas, Dask, Vaex, Modin, OmniSci, Riptide, maybe others - that may need a viable path to compatibility at some level (perhaps with some backwards-compat breaks, but making things so specific in the main definition that there's no such path cannot really work).

I think we should not try to include all systems that market themselves as dataframes in this definition. We should be opinionated and precise, otherwise we will end up where we are now, with no well-defined way of determining what a dataframe is or what the API should be

Agreed with "all systems". The ones I listed above though (plus cuDF, and probably a few more) all seem reasonable to take into account and consider impact on. It would also be useful to list ones that do market themselves as dataframes but are so different that there's not much sense in taking them into account. Do you have any in mind?

One productive thing we could do would be to highlight some of the known deficiencies / limitations of the pandas API as examples of ways that we ideally would like to do better going forward.

This sounds like a good idea. It would be good to have a separate tracking issue for this, and then if needed split off from there to go into detail on specific methods/topics. @TomAugspurger I'm sure you have a bunch in mind, would you be able to start this issue?

@TomAugspurger

Opened #4 for the sub-discussion of avoiding issues with pandas API.


meta-comment: I'd like to see the discussion on definitions go on a bit further, and then try to summarize the definition in a HackMD document that we can achieve consensus on. I'd be happy to write / co-write a draft sometime next week.

@devin-petersohn
Member

@maartenbreddels I think Vaex is close to the definition, much closer than many other systems 😄. I have a list below of classifications, similar to what you were mentioning, about what certain types of systems can do vs. others; feel free to edit!

@rgommers I see your point, my definition is more along the lines of traditional dataframes. We do need specific definitions and precision to meaningfully describe an API and standard. Perhaps binning systems to define multiple standards will help?

Here is a candidate high-level binning from my perspective, each of which can potentially have its own standard. My PhD thesis focuses on what I have labeled "Traditional Dataframes", so that set of properties is going to be more precise at first than the others (system maintainers, feel free to edit to add properties or put your system in the right bin; you will know your system better than I do).

Note: My intention with this binning is to create standards; misclassifying a system will make it difficult to create a standard for that group, so we should try to be as precise as possible.


Traditional Dataframes

Properties:

  • Explicit row/column labels from any type
  • No predefined schema, able to interpret schema at runtime
  • Multiple data types supported
  • Row/Column symmetry (transposable)
  • Support for relational/linear algebra (e.g. joins, matrix-multiply)
  • Indexing by label or by position

Systems:

  • S data.frame
  • R data.frame
  • pandas DataFrame
  • Modin pandas.DataFrame and internal modin_frame
  • cuDF DataFrame (minus true symmetry; only symmetrical if the dataframe is homogeneous)
  • ...

Columnar Dataframes

Properties:

  • Explicit column labels
  • Columns contain single data type
  • No predefined schema, able to interpret schema at runtime
  • No row/column symmetry (no row labels, not transposable)
  • ...

Systems:

  • pyarrow Table (Not a user-facing dataframe API) @wesm to verify
  • Vaex
  • R tidyverse
  • ...

Relational Dataframes

Properties:

  • Purely relational data model
  • An interface for expressing data manipulations and analytical operations (@wesm stated above)
  • Schema must be predefined
  • No row/column symmetry

Systems:

  • Ibis
  • Spark
  • OmniSci
  • R data.table
  • ...

Unclassified Dataframes

Systems:

  • Dask (somewhere between relational and traditional, not columnar) @TomAugspurger to verify
  • Riptide (Sorry I do not know much about this one)
  • ...

...

@rgommers
Member

Thanks @devin-petersohn. I think your binning is interesting and we'll need it at some point; it does drag in a lot of assumptions though, for example on underlying storage.

Re schema predefined vs at runtime: this is execution rather than API related I think.

Re matrix multiply: that's quite specific (e.g. not defined for heterogeneously typed columns, and extremely inefficient unless one has 2-D contiguous storage and can call the BLAS-optimized implementation on the underlying array). If you're doing linear algebra as a user, you probably should be using arrays, rather than dataframes that act as arrays with axis labels tacked on.

Re row/column symmetry: also a little artificial probably. Normally rows and columns have different semantic meaning (rows -> observations, columns -> features), there's typically a dtype per column, not per row, etc.

Note: My intention with this binning is for creating standards, misclassifying a system will make it difficult to create a standard about that group

This is an important point. I think we have a bit of a misunderstanding here. The aim is a single API standard for things that you split into multiple "bins", not multiple standards. Or maybe what you mean is similar to what @maartenbreddels called "levels". At the lowest (most core/common) levels one can write, e.g., programs that Pandas, Dask and Vaex can all run and give the same results for (just with different performance profiles), even though you put them in different bins.

We didn't have a chance to talk beforehand about this consortium; maybe it's worth having a quick call. I'd also be interested to learn what your main use case or objective is. I'll send you an email.

@devin-petersohn
Member

devin-petersohn commented May 17, 2020

Thanks @rgommers, I'll sync offline about the consortium discussion, but here are some comments on your points. I understand the purpose of this group; I am just trying to be precise. Without precision there is nothing to discuss, and we may end up with a non-impactful API.

Re schema predefined vs at runtime: this is execution rather than API related I think.

I must completely disagree here. If a schema is required before building a table, you are limited to APIs that have a known output schema given the input schema and operator. Relational systems can only do compatible relational operators, so if that is the minimum common subset then a lot is left out of the other types of systems that are commonly used and valuable for users. I don't think it will be meaningful if we just decide to go with a new API for doing SQL queries, which is the lowest common denominator of all of these systems. This is why precisely deciding on "what is a dataframe" is so important to do at the outset.

Re matrix multiply...

This was an illustrative example, and yes, it is hard to optimize in a dataframe setting, but that is a challenge specific to the implementation of those systems. All listed systems support it, which was the main point of the example. Those systems are hybrids of the relational and linear algebra data models.

Re row/column symmetry...

You are correct that there is a schematic asymmetry, but I did not want to get that low-level in those points. The interchangeability of columns and rows is possible in each of those systems and is commonly used in pandas, often because data in the wild comes in many different formats and the orientation of the data may not be correct in the source files.
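As a minimal pandas sketch of that reorientation use case (data made up for illustration):

```python
import pandas as pd

# A source file arrived with observations as columns and fields as rows.
raw = pd.DataFrame({"obs1": [1, "a"], "obs2": [2, "b"]},
                   index=["value", "label"])

# Transposing swaps rows and columns; dtypes are re-inferred afterwards.
tidy = raw.T.infer_objects()  # one row per observation, columns "value"/"label"
```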

The bins do have a hierarchy, and @maartenbreddels's comment was what I was getting at. The groups later in that list cannot/do not implement data model features from earlier in the list, and the groups earlier in the list are a superset of, or can emulate, the data model of every group below them.

@aregm

aregm commented May 18, 2020

I strongly prefer to have a definition in terms of algebra, similar to the relational model, which will allow defining semantics of the operations, thus defining the execution model and then one can derive API.

This is tricky. If you survey the spectrum of libraries that are considered by users to be "data frame" libraries, the only real commonality you have is that

  • Data consists of named columns that have different types
  • API calls either access data in that structure or apply manipulations to it, yielding new tabular data objects

You are jumping here to the implementation. See remarks below.

We can say that the goal is to define a general purpose "data frame algebra" (this is what Ibis does, for example), but there are plenty of "data frame" libraries that have no such algebra implemented. They merely provide a simple programming interface for interacting with tabular data. Berkeley BIDS for example created a minimalistic data frame library for pedagogical purposes because they felt the expansive nature of pandas got in the way of teaching certain things.

This presentation of mine from 2015 goes more in depth on this topic of defining "data frame".

The Ibis definition relies heavily on relational algebra as a basis and on Python semantics for operations. This is one approach, but IMHO it is one step ahead of the thing I am talking about. Let me try to explain here.

First of all, what I understand by "algebra". In broad terms, the Chambers Dictionary definition: "any of a number of systems using symbols and involving reasoning about relationships and operations". More specifically, an algebra consists of a set of objects and a set of operators that together satisfy certain axioms or laws (such as closure, associativity, commutativity, etc.).

What you are referring to as a set of APIs is, at the foundational level, called a calculus. In the same way that relational algebra served as a vehicle for implementing the relational calculus, which led to QUEL -> SEQUEL -> SQL, a dataframe algebra can serve as a vehicle to define a dataframe calculus, which may lead to many systems and help existing dataframe systems remove redundancies and actually be called dataframe systems.

But to get there we need as a first step to define the underlying object. Is the dataframe a simple relation? Is it a relation on top of lists? Is it a simple collection of associative arrays? Is it a matrix of simple data types? Is it a matrix of structured data types?

The calculus - do we need a reduction algorithm? Should we follow Python, R, or SparkSQL? Or maybe S? Or maybe they all got it wrong?

I am not suggesting that we lay down a mathematical foundation (I guess nobody is interested), but I would suggest that, at least in this workgroup and at least at an intuitive level, we have a common understanding of what the dataframe object is, what the operators are, and what the laws are. Maybe it is associative arrays, maybe Python/pandas, maybe SparkSQL - but let's give it a thought and discuss. Maybe interested parties can come up with their definitions and we can discuss and compare them?

@rgommers
Member

But to get there we need as a first step to define the underlying object. Is the dataframe a simple relation? Is it a relation on top of lists? Is it a simple collection of associative arrays? Is it a matrix of simple data types? Is it a matrix of structured data types?

"associative arrays" are basically dictionaries if I read Wikipedia correctly. Which sounds about right; a mapping with column names as keys and 1-D arrays (homogeneous data type per column) as values.

@wesm said "a collection of columns each having their own logical data type".

your "matrix of simple data types" could be the same as well (under the condition that matrix doesn't imply underlying 2-D contiguous storage or other restriction beyond a set of 1-D arrays of the same length).

It sounds like we're all saying similar things in slightly different words.
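In code, that shared mental model is roughly a mapping from column names to equal-length 1-D arrays (a conceptual sketch, not a proposed API; the column names are made up):

```python
import numpy as np

# An "associative array" view of a dataframe: column names as keys,
# equal-length 1-D arrays as values, one logical data type per column.
frame = {
    "id":    np.array([1, 2, 3]),
    "name":  np.array(["a", "b", "c"]),
    "score": np.array([0.5, 0.7, 0.9]),
}
```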

Is it a matrix of structured data types?

I'd suggest adding structured data types as out of scope for the API standard (v1 at least), even if it's possible to include structured dtypes in some current dataframe libraries. Data types should be limited to something tractable (e.g. integers, floats, strings, datetime types, categoricals). Everything else can be postponed until a future version of the API standard.

.... to define dataframe calculus, which will lead to maybe many systems and help existing dataframe systems to remove redundancies and actually be called dataframe systems. ...

doing this exercise from the perspective of:

  1. what's currently present in APIs
  2. data on how functionality is used
  3. removing duplication
  4. excluding things based on experience with rationales for why

is going to be a lot more productive than doing it from first principles, I believe.

The calculus - do we need a reduction algorithm?

The answer to that should be motivated by:

  • do current libraries have reduction algorithms?
  • does our API usage data say it gets used enough?
  • (if the answer to the above two questions is yes) do we see a reason to exclude it based on the use cases & requirements we will formulate?

Should we follow Python, R, or SparkSQL? Or maybe S? Or maybe they all caught it wrong?

Hard to believe they got it all wrong. Mostly they implement the same math/logic, with different APIs. We'd like to end up with a clean, Pythonic API I'd think.

@TomAugspurger

So, an attempt to summarize the discussion thus far.

@wesm started with a general definition

A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, is a collection of columns each having their own logical data type) in a general purpose programming language[...].

And a few properties that the definition explicitly takes no stance on.

To this I would add the restriction (brought up by @devin-petersohn and others) that the rows are ordered, and possibly that there are row labels.

There's some concern that this definition leaves out crucial components of what makes a dataframe a dataframe. I think this is mostly around "what operations must a dataframe support" (linear algebra? joins? etc.), but I'm not sure it's necessary to pin that down here. That will be made clear by the API standard. This relates to the "levels" brought up by @maartenbreddels in #2 (comment).

I think it's worth asking: what is this definition for? For me, this will aid us in designing the API. When we're discussing any given method we'll use the definition to inform the answer (should the dataframe have an is_fortran_contiguous property? Answer: No, because the definition does not make any assumption on data representation).

One property I don't understand is "a lazily induced schema". Could you expand on that @devin-petersohn?

@devin-petersohn
Member

Sure, happy to expand on lazily induced schema.

The idea of the lazily induced schema is that the user need not define their data schema upfront, nor does the output schema of every operator need to be known in advance. This is in contrast with relational systems, which require the schema to be defined before the data.

Operators in systems that support lazily induced schemas do not need to have a known output schema for any given input schema (but they can). The dtypes of a given column can be determined by the system after an operation has completed. This is particularly helpful for user-defined functions. It also enables dataframes to treat columns and rows as equivalent, supporting operators along either columns or rows (the axis argument in pandas).

This idea is all about flexibility. Data in the wild is often schemaless and semi-structured. Relational systems have (mostly) solved the problem of structured data; dataframes like pandas fit the need for schemaless, poorly structured data.
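A small pandas illustration of a lazily induced schema (pandas used only as an example of a system with this property; data made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "bb", "ccc"]})

# The output type of this user-defined function is declared nowhere; the
# system induces it only after the operation has run.
out = df.apply(lambda row: row["x"] if row["x"] > 1 else row["y"], axis=1)
print(out.dtype)  # object -- a mix of int and str, determined at runtime
```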

@TomAugspurger

nor does the schema need to be known for every output operator.

Thanks, I think this is the component I was missing earlier. An example would be something like pivot_table(index="x", columns="y"), where the metadata (the columns, which are the unique values of y) is known only by computing the result? Or something like a get_dummies. I think in the name of predictability we'll want to minimize this, but I agree that the ability to do this kind of operation is an important component of a dataframe.
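A minimal sketch of that pivot_table case (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": ["r1", "r1", "r2"],
                   "y": ["a", "b", "a"],
                   "v": [1, 2, 3]})

# The columns of the result are the unique values of "y"; they cannot be
# known without computing on the data itself.
wide = df.pivot_table(index="x", columns="y", values="v")
print(list(wide.columns))  # ['a', 'b'] -- induced from the data
```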
