
POC of Proposal: Refactor values.Value into a struct #3437

Closed
jsternberg opened this issue Jan 14, 2021 · 8 comments
@jsternberg
Contributor

Problem

Value is used in some of the most performance-sensitive areas of flux: it is used anywhere a user writes code, and it also serves as an intermediary for many transformations that manipulate data. For that reason, it is very important that this type be as fast as possible.

While the current implementation of Value isn't slow, it also isn't fast enough for performance-critical code. A few of the ways it is slow:

  • None of the data access functions can be inlined because Value is an interface. Every access requires a vtable lookup along with a function call. Normally this wouldn't be a problem, but this is very performance-sensitive code.
  • Since it is an interface, constructing a Value always requires a memory allocation. This negates the benefit of cheap native data types like floats or integers.

As a separate problem, Value is presently a very large interface. It has 16 methods and gains a new method each time a new type is added. Adding a method, such as Dict(), involves finding every implementation of the interface and adding an unexpected-kind panic to most of them. This is tedious and a maintainability problem. In addition, it is very infrequent that a new Value implementation implements anything genuinely new: most end up very similar to existing implementations, and the exceptions involve complex data types like functions rather than something simple like integers.

Proposal

I think we should refactor the values.Value interface into a struct. Instead of an interface, I think we can create a struct with the following signature:

type Value struct {
    t semantic.Nature
    v uint64
    data interface{}
}

I think this simple structure will improve the speed and maintainability of the values package in some very performance-sensitive areas. For one, the uint64 field can store the value for float64, int64, and uint64. These three types would benefit from inlining and would have faster data access. As a comparison, #3436 got the call to Int() for integers down to 3.36 ns/op; an implementation using the struct above runs at 0.469 ns/op.

As an aside, we would not see this benefit for floats until a future version of Go, possibly Go 1.17. math.Float64frombits() was previously judged too expensive by the inliner and would not be inlined. That has been fixed on the master branch: golang/go#42788 (comment).

For value types that are more complex, where an allocation would be expected anyway, the data interface{} field can hold the value as it does today. These accessors would still benefit from inlining.
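To make the proposal concrete, here is a minimal sketch of how the primitive types could be stored inline. The Nature constants and constructor names here are assumptions for illustration, not the actual flux API:

```go
package main

import (
	"fmt"
	"math"
)

// Nature is a stand-in for semantic.Nature (hypothetical simplification).
type Nature int

const (
	Int Nature = iota
	UInt
	Float
)

// Value stores primitive values inline in v; complex types would use data.
type Value struct {
	t    Nature
	v    uint64
	data interface{}
}

// Constructors build a Value with no heap allocation for primitives.
func NewInt(i int64) Value     { return Value{t: Int, v: uint64(i)} }
func NewUInt(u uint64) Value   { return Value{t: UInt, v: u} }
func NewFloat(f float64) Value { return Value{t: Float, v: math.Float64bits(f)} }

// Accessors are small enough for the compiler to inline.
func (v Value) Int() int64     { return int64(v.v) }
func (v Value) UInt() uint64   { return v.v }
func (v Value) Float() float64 { return math.Float64frombits(v.v) }

func main() {
	fmt.Println(NewInt(-3).Int())      // -3
	fmt.Println(NewFloat(1.5).Float()) // 1.5
}
```

Since the constructors and accessors contain no calls other than math.Float64bits/Float64frombits, they sit well under the inliner's cost budget.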

@rogpeppe
Contributor

math.Float64frombits() was previously marking itself as more expensive by the inliner and would not be inlined.

I don't think that's quite true. I think that issue was pointing out that the "cost" (as calculated by the compiler) of a Float32bits function was unexpectedly high such that a function involving many calls to it was not itself inlined. Float64bits has itself been inlineable at least since Go 1.6 (I didn't try any further back).

For one, the uint64 can be used to store the value for float64, int64, and uint64

I'd also include bool in this as one of the basic Flux types.

@rogpeppe
Contributor

BTW, I think the proposal should also suggest what static methods might be defined on the Value type.

@jsternberg
Contributor Author

I don't think that's quite true. I think that issue was pointing out that the "cost" (as calculated by the compiler) of a Float32bits function was unexpectedly high such that a function involving many calls to it was not itself inlined. Float64bits has itself been inlineable at least since Go 1.6 (I didn't try any further back).

I took another look at my prototype and I suspect the reason why this wasn't getting inlined was because of an issue fixed in #3436. I'll give this another try with those changes but I suspect that it will inline fine since the current cost is 83 without the change to CheckKind that lowered its inline cost.

@nathanielc
Contributor

Let's do a timeboxed POC for this proposal and compare it directly to the vectorization proposal for the compiler package.

The hypothesis to test is whether the large gains we saw from the vectorization POC came from fixing/reducing unnecessary allocations, or from taking advantage of CPU cache locality. This value struct proposal fixes the unnecessary allocations in a different way and will help us understand where the gains are coming from.

DOD for the POC:

  • Add a new benchmark along similar lines to those used in the vectorization POC so we can compare three implementations: baseline, vectorization, and value struct.

@nathanielc nathanielc changed the title Proposal: Refactor values.Value into a struct POC of Proposal: Refactor values.Value into a struct Aug 16, 2021
@jsternberg jsternberg self-assigned this Aug 17, 2021
@jsternberg
Contributor Author

jsternberg commented Aug 18, 2021

A few basic changes were made to the compiler in order to determine if there were other optimizations that would help to improve the compiler. The following changes were attempted:

  • The value interface was changed to a struct and memory allocations were optimized.
  • Access to object and scope values were optimized to use indices discovered from type inference rather than doing a name lookup.
  • Argument setting from arrow arrays was optimized to use a single level of indirection rather than two levels of indirection.
  • Typed evaluators were used for primitives instead of Value.

For comparison, the benchmark for vectorization produced a 95% speedup over the base case. This comparison was only done on filter, but the optimizations should be pretty similar for map expressions. In particular, map expressions benefit from optimizing the creation of object expressions, which is similar to the second optimization discussed above.

The memory optimizations for Value produced a 75% speedup over the base case. They reduced total memory allocations from 9,010 allocs/op to 6 allocs/op, and memory used from 273,564 B/op to 1,412 B/op. In comparison, vectorization produced 101 allocs/op and 31,568 B/op. While the value optimizations produced a larger drop in allocs/op, I believe that once allocations are given proper attention, the remaining difference in allocation counts has a minimal effect on overall system performance. Vectorization is also not mutually exclusive with these memory optimizations.

This does indicate a pretty significant gain from paying more attention to memory allocations.

The second optimization was attempted after noticing in flamegraphs that a substantial amount of time was spent in identifier and member expression access. It takes advantage of type inference to retrieve the exact index of a member of a record, or the exact index of a value in the scope. For these lookups, we can record the index at compile time and access the value directly, instead of resolving the name dynamically at runtime. When these optimizations were put into place, an improvement of 78% over the base case was observed.
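The idea can be sketched as follows. This is a deliberate simplification: the real scope holds Value entries and the indices come from type inference, and all names here are hypothetical:

```go
package main

import "fmt"

// Dynamic lookup: every access hashes the name and probes a map.
type NameScope struct{ vals map[string]int64 }

func (s NameScope) Lookup(name string) int64 { return s.vals[name] }

// Static lookup: type inference resolves a name like "r.x" to an
// index once at compile time, so evaluation is a direct slice access.
type IndexScope struct{ vals []int64 }

func (s IndexScope) At(i int) int64 { return s.vals[i] }

func main() {
	byName := NameScope{vals: map[string]int64{"x": 10, "y": 20}}
	byIndex := IndexScope{vals: []int64{10, 20}} // index 1 assigned to "y" at compile time

	fmt.Println(byName.Lookup("y"), byIndex.At(1)) // both print 20
}
```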

A third optimization improved how values were set in arguments. The previous version accessed the array repeatedly and then read the value from the column reader. If you instead access the array once, wrap it in an interface, and set the values through that interface, the speedup improves to 82%.
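A simplified sketch of that change, with the arrow column reader replaced by a plain slice and all names hypothetical: the array is fetched once up front, and the per-row work goes through a single interface call rather than re-fetching the column each time.

```go
package main

import "fmt"

// argSetter abstracts setting a typed argument for one row, so the
// evaluation loop performs only one interface call per row.
type argSetter interface{ Set(row int) }

type floatColumn struct {
	src []float64 // stand-in for an arrow array, fetched once
	dst *float64  // the argument slot being filled
}

func (c floatColumn) Set(row int) { *c.dst = c.src[row] }

func main() {
	var arg float64
	setter := argSetter(floatColumn{src: []float64{1.5, 2.5, 3.5}, dst: &arg})
	total := 0.0
	for row := 0; row < 3; row++ {
		setter.Set(row) // single level of indirection per row
		total += arg
	}
	fmt.Println(total) // 7.5
}
```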

The fourth optimization added typed evaluators for the primitives. The binary expression was also constructed explicitly rather than relying on an indirect function call to implement the operation. This last optimization kept the same memory improvements but improved the speed further, to an 87% speedup over the base version. Typed evaluators were part of the original design, but they were removed because type inference could not support them at the time, and debugging the system was difficult when that was the case.
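A minimal sketch of the difference, with hypothetical names: the generic path dispatches the operator through a function value and boxes the operands, while the typed path knows both the operand types and the operation at compile time, so the call is concrete and inlinable.

```go
package main

import "fmt"

// Generic evaluation: the operator is dispatched through a function
// value, and operands pass through a boxed interface representation.
type binaryFunc func(l, r interface{}) interface{}

func evalGeneric(op binaryFunc, l, r interface{}) interface{} { return op(l, r) }

// Typed evaluation: the operand types and the operator are fixed at
// compile time, so the expression is a concrete, inlinable method call.
type floatAddEvaluator struct{ l, r float64 }

func (e floatAddEvaluator) Eval() float64 { return e.l + e.r }

func main() {
	add := func(l, r interface{}) interface{} { return l.(float64) + r.(float64) }
	fmt.Println(evalGeneric(add, 1.0, 2.0))           // 3, via dynamic dispatch
	fmt.Println(floatAddEvaluator{l: 1, r: 2}.Eval()) // 3, via a static call
}
```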

Some areas where I think further work would be beneficial and could indicate other areas of speedup:

  • Remove recursive calls
  • Create typed scopes so that Value does not have to be used
  • Place constant values in the scope with the compiler rather than using an evaluator

I have specific ideas on how to do each of these. I am going to prepare some thoughts on where I think we should go with this information for next steps.

In summary, we can get an estimated 87% speedup through optimizing Go code related to the compiler.

@jsternberg
Contributor Author

I'm still going to write out the thoughts for next steps but I have pushed up the proof of concept that I created that demonstrates the performance improvements mentioned above to the feat/vectorize branch so that we can have a larger discussion about it. I'm going to close this issue.

@Marwes
Contributor

Marwes commented Nov 15, 2021

Not having unions/ADTs makes me sad, but I think we can still do a bit better here by storing the Nature in the data field, something like:

type Value struct {
    v uint64
    data HasNature
}

type HasNature interface {
    Nature() semantic.Nature
}

Since values stored in v do not need any value-specific data in data, all such values can point to a singleton HasNature value for each type (int/uint/float/etc.).
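A minimal sketch of the singleton idea (all names hypothetical): one zero-size nature value per primitive type is shared by every Value of that type, so constructing a primitive still allocates nothing.

```go
package main

import "fmt"

type Nature int

const (
	IntNature Nature = iota
	FloatNature
)

// HasNature replaces the explicit Nature field on Value.
type HasNature interface{ Nature() Nature }

// One zero-size singleton per primitive type, shared by every value.
type intNature struct{}
type floatNature struct{}

func (intNature) Nature() Nature   { return IntNature }
func (floatNature) Nature() Nature { return FloatNature }

var (
	theIntNature   HasNature = intNature{}
	theFloatNature HasNature = floatNature{}
)

type Value struct {
	v    uint64
	data HasNature
}

func NewInt(i int64) Value { return Value{v: uint64(i), data: theIntNature} }

func main() {
	a, b := NewInt(1), NewInt(2)
	// Both values share the same singleton, so there is no per-value allocation.
	fmt.Println(a.data == b.data) // true
}
```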

@jsternberg
Contributor Author

The problem with using an interface for the singleton access is that I fear it will end up killing performance: there is still an indirect virtual call to go through the interface. It might be worth trying, but I think we want a direct check of the enum type for the nature.
