Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reference: feat: replace v1 with v2 #1690

Closed
wants to merge 69 commits into from
Closed

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Sep 9, 2022

This is just me playing around with removing v1. Should you want to do this yourself, no problem!

  • Replace v1 tests with v2.
  • Fix up setup.cfg linting rules.
  • Move to malloc and free instead of Awkward allocators.
  • Remove preliminary set of unused kernels. Any unused specialisations are not yet removed.
  • Replace use of contents in Forth with NumPy arrays or import v2 objects
  • Remove kernel dispatch, replace kernel::malloc with malloc or new
  • Support datetime.datetime in from_iter
  • Move array_deleter to include/awkward/util.h
  • Replace pyobject_deleter with the appropriate pybind11 feature.
  • Move LayoutBuilder / ArrayBuilder to _ext
    Fixes ak._v2.from_iter should recognize Python datetimes/timedeltas #1701

@agoose77 agoose77 marked this pull request as draft September 9, 2022 13:45
@agoose77 agoose77 force-pushed the agoose77/split-v1-v2 branch 2 times, most recently from 9b7f7d0 to 344276e Compare September 10, 2022 14:22
@agoose77
Copy link
Collaborator Author

Removing the v1 Python layer is fairly easy. Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets, and let the Python binding layer build the Array object from C++ via pybind Python execution. I haven't thought yet whether I like the idea of the ForthMachine returning pybind11's py::array outside of the Python layer, so I think I'm leaning towards removing the toNumpyArray interface and doing that at the bindings level.

@jpivarski
Copy link
Member

Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets

Knowing that the split is coming up soon, we've been using the ForthMachine in ways that don't depend on the return values being ContentPtr and IndexPtr: we've been immediately casting them as NumPy arrays.

The bytecodes method/property is not used much. I think nothing would break if it were removed entirely, but it's probably better to replace it with just a 2-tuple of offsets and content, as you describe. It exists as a debugging tool (though the decompile one, which prints out source code reconstructed from the bytecodes, is more useful), and was constructed as a jagged array just because it occurred to me that one could do that. Unlike imperative bytecode, Forth bytecode does not have arbitrary jumps; it happens to be a structured language in the sense of not having GOTOs, a fact that was emphasized by Forth enthusiasts in the 80's, but nowadays nearly all languages are structured at the human-readable source code level. The only thing that stands out for Forth is that its bytecode can also be blocked out as subroutines, a jagged array.1

Anyway, the point is that we've been using the ContentPtr/IndexPtr-dependent parts of ForthMachine in a minimal way so that they can be replaced by raw buffers (py::buffer_info in pybind11) without issues. Along the same lines, we will be removing the custom growable-buffer of ForthOutputBuffer to use Manasvi's new GrowableBuffer (and reap the benefits of non-contiguous filling), but that is an entirely separate step that can wait until the v1 pruning is entirely done. (ForthOutputBuffer doesn't depend on the old GrowableBuffer—that one can be safely deleted.)

Footnotes

  1. Although somebody might point out that this is now true of other bytecodes that I don't know about. Maybe JVM? I'm pretty sure Python bytecode has unconditional jumps.

@agoose77 agoose77 force-pushed the agoose77/split-v1-v2 branch 2 times, most recently from d3cd98e to 511356f Compare September 13, 2022 13:34
@agoose77
Copy link
Collaborator Author

Phew.

Now v1-less Awkward compiles and passes most tests, having removed the array and type machinery.

I've removed the low hanging fruit in the C++ side, so I've therefore left a much reduced version of Index.cpp. My rationale there is that older C++ standards don't have std::span yet, so either we start using pybind features inside of AwkwardForth, or we keep this small abstraction.

New bugs:

from_iter doesn't currently handle native Python datetime objects, so many tests fail. We should fix this.

@jpivarski
Copy link
Member

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

@agoose77
Copy link
Collaborator Author

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

I think it depends if we want to be passing around unbounded arrays. Actually, this is barely a point at the moment because the Index impl doesn't so anything safety wise (i.e. provide an iterator).

IIRC you never get an index directly; rather, you get the output and call toIndexXX on it, so we can just use the metadata from the output.

Otherwise, I still need to remove the unused kernels...

@jpivarski
Copy link
Member

In the way AwkwardForth is used, the output of toNumpyArray and toIndex are used, they're immediately wrapped with np.asarray, so we only need one way to get a buffer out. We can even change the name (breaking Uproot), since Uproot v5 and Awkward v2 releases will be coordinated. A name change won't be very disruptive.

Otherwise, I still need to remove the unused kernels...

Perhaps removing unused kernels should be a separate PR. We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

@jpivarski
Copy link
Member

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

@agoose77
Copy link
Collaborator Author

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

Cool. I found these ones by directly looking with (Xonsh)

import yaml
import re

kernel_spec = yaml.safe_load(p"/home/angus/Git/awkward/kernel-specification.yml".read_text())['kernels']
in_use_specs = set($(rg r"awkward_\w+" src/awkward/contents src/awkward/_reducers.py -oNI).splitlines())
name_to_specs = {g['name']: [s['name'] for s in g['specializations']] for g in kernel_spec}
spec_to_name = {v: k for k, s in name_to_specs.items() for v in s}

# For cleaning up the spec file
in_use_names = (name_to_specs.keys() & in_use_specs)

# For removing kernels
for f in [f for f in pg`src/cpu-kernels/awkward*.cpp` if not any(s in f.read_text() for s in used_specs)]:
    f.unlink()

@agoose77
Copy link
Collaborator Author

agoose77 commented Sep 13, 2022

the output of toNumpyArray and toIndex are used

Yeah, now that all the removed code is gone, it's easier to see that we don't need an object that holds both length and ptr (besides the output itself). The main "issue" is internally within the Forth machine, rather than the bindings. It should be possible to just access the (shared) pointer directly within the Forth routines, and wrap it in the bindings.

@agoose77
Copy link
Collaborator Author

We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

Agreed, there are several that I know of (not sure if they're still carried through in this PR, haven't checked) ;)

@agoose77
Copy link
Collaborator Author

OK, this now removes Index as well, so it's nearly all the way there. I've rebased my commits where possible to give a clean-ish history. I can't guarantee that every commit builds a wheel that passes the test suite (they certainly don't when I start moving files), but after the large-scale refactoring commits, most revisions then compile.

@agoose77
Copy link
Collaborator Author

I haven't fully pulled out the kernel dispatch mechanism yet. I'll tentatively note that I'm not intimately familiar with this code, but my current understanding is that our kernel handling has all been superseded by the nplike dispatch mechnism.

@jpivarski
Copy link
Member

Any kernel-dispatch in C++ is old. We have an entirely new kernel-dispatch system in Python, using ctypes.

Anything you find in C++ about startup functions, particularly for setting up awkward-cuda-kernels, can be removed. That's all Python now as well.

Another thing that can go is the dlpack git submodule. That was all for supporting CUDA arrays in C++, which is now handled by CuPy in nplike.

@agoose77
Copy link
Collaborator Author

Another thing that can go is the dlpack git submodule

It's already gone!

@agoose77 agoose77 changed the title chore: remove v1 chore: replace v1 with v2 Sep 13, 2022
@agoose77
Copy link
Collaborator Author

OK, now we're at a point where there's not much else (anything?) that needs to be removed.

The main changes are now described in this PRs description.

@agoose77
Copy link
Collaborator Author

I've added some rudimentary time handling. I originally tried employing std::chrono, but I clearly don't fully understand how C++ is handling times vs Python:

If I calculate the local epoch using:

+std::chrono::system_clock::time_point local_epoch() {
+    std::tm tm = {
+       /* .tm_sec  = */ 0,
+       /* .tm_min  = */ 0,
+       /* .tm_hour = */ 0,
+       /* .tm_mday = */ 1,
+       /* .tm_mon  = */ 1 - 1,
+       /* .tm_year = */ 1970 - 1900,
+    };
+    // Use local time zone DST
+    tm.tm_isdst = -1;
+    return std::chrono::system_clock::from_time_t(std::mktime(&tm));
+}

Then when I use this in these calculations, I get one-hour shifts for particular datetimes:

+    auto time = obj.cast<std::chrono::system_clock::time_point>();
+    int64_t time_since_epoch_us = std::chrono::duration_cast<std::chrono::microseconds>(
+        time - local_epoch()
+    ).count();
+    std::cout << "time since epoch     " << time_since_epoch_us << std::endl;
+    std::cout << "time since epoch cpp " << std::chrono::duration_cast<std::chrono::microseconds>(time.time_since_epoch()).count() << std::endl;
+    self.datetime(time_since_epoch_us, "datetime64[us]");

I am currently under the impression that this is because the timezones differ between Python and C++. I'm looking into this.

@agoose77
Copy link
Collaborator Author

agoose77 commented Sep 16, 2022

The above code example is wrong - the epoch should not be specified in local time (via std::tm). Instead, it's just std::time_t (0).

Now that I'm more familiar with mktime vs localtime, some thoughts on my timezone (BST)

  • NumPy's datetime objects are naive and don't attempt to determine any DST offsets

  • datetime.datetime.timestamp does try to use the local timezone to figure out DST offsets

  • (dt - epoch) does not try to do this, so behaves like NumPy

  • std::mktime (used by pybind11) converts from local clock into UTC. The tm_isdst flag is used to help mktime figure out how to transform the input. It's set to -1 by default, which means that mktime uses the local clock to figure out if the time period should be during DST, and corrects it if so

    tm_isdst expected DST? offset new tm_isdst
    1 yes 0 1
    1 no -1 0
    0 yes 1 1
    0 no 0 0
    -1 yes 0 1
    -1 no 0 0

    I'm least confident on -1 for tm_isdst. I believe that -1 just signals to mktime to determine from the local time whether the given time is DST, and reflect that guess in the field. It doesn't modify the time fields after deduction - it assumes the user entered the correct local time for the timezone. mktime is idempotent, as it only changes fields if tm_isdst is changed.

  • time_t literally returns the same as time_since_epoch()

  • mktime considers whether tm_isdst changes when building the result time_t.

@agoose77
Copy link
Collaborator Author

Actually, I didn't think about this - the timezone boundaries do not map onto distinct values: 01:00 and 02:00 (UTC) in my current timezone on 2022/03/27 are both the same local times (01:00):

  • 1 AM BST = 1 AM UTC
  • 2 AM BST = 1 AM UTC
    so, it's not possible to losslessly recover the input (local) time from the UTC output (in std::time_t). Therefore, we either need to modify the timezone (set it to UTC), or do this in Python.

@jpivarski
Copy link
Member

Since NumPy does not deal with timezones (its times are timezone-naive, which is to say that knowledge of the timezone is not contained within the array, but is managed by the user in some external metadata somewhere), then we have to operate at the same level (also timezone-naive).

Python's datetime library can operate in both modes: https://stackoverflow.com/a/7065242/1623645

Probably the least-surprising way to turn timezone-aware datetimes (datetimes with the tzinfo set to non-None) into timezone-naive datetimes would be to remove the tzinfo, rather than converting to UTC, as in the StackOverflow example. The best we can do is to be consistent and document it.

This is not the worst thing to happen to people who deal with time-data...

@agoose77
Copy link
Collaborator Author

We're on the same page w.r.t whether we support timezones!

In this PR, the issue I'm running into is concerned with the pybind11 bindings for datetime.datetime objects. If we want to bump performance by not calling into the Python interpreter several times per datetime.datetime object, then it's best if we convert them once into a native C++ type, e.g. std::time_t. However, the pybind11 converters use mktime to produce a std::chrono::system_clock::time_point object (which is convertible to time_t) according to the localtime settings (see pybind/pybind11#1638). This means that the conversion result depends upon the user's system.

Actually, writing this response has given me an idea for a "hack". These problems all go away if the localtime is UTC, because there are no DST periods. However, setting the TZ env var from requires us to add special code to handle windows _putenv vs linux putenv. Or, we could just call into Python and let Python handle this, once per from_iter call. I'd prefer to just fix this at the root, which would probably require pybind11 to ship a third-party header-only library, (or we could do this and forgo pybind11's conversion).

@agoose77
Copy link
Collaborator Author

@jpivarski NB I haven't touched "studies/" yet, because that should be a simple find-replace and I didn't want to grow this stage of the PR review yet.

@agoose77
Copy link
Collaborator Author

    elif isinstance(array, ak.layout.ArrayBuilder):
        form = ak.forms.from_json(array.form())
        return ak.types.ArrayType(
            form.type_from_behavior(self._behavior), len(self._layout)
        )

Eventually, we'll want to get rid of ak.layout as a module. Now it only contains the low-level ArrayBuilder and two instances of LayoutBuilder-as-Python that we'll be removing. Maybe that's too much to tack onto this PR. But after that point, the low-level ArrayBuilder can be referred to simply as ak._ext.ArrayBuilder.

I'll make this a todo for later review

@jpivarski
Copy link
Member

@jpivarski NB I haven't touched "studies/" yet, because that should be a simple find-replace and I didn't want to grow this stage of the PR review yet.

The studies/ directory should be left as-is. It's just a record of the temporary work that we did to figure out the algorithms, and so removing v1 doesn't change anything there.

The code in studies/ doesn't even need to be able to run. It's a historical record.

@jpivarski
Copy link
Member

There's a lot of JSON-handling code that can now be removed, too. Would you mind if I git pull and do it?

That is, you're not working on something that would be hard to merge if I remove some C++?

@agoose77
Copy link
Collaborator Author

agoose77 commented Sep 22, 2022

@jpivarski sure, which module is this? I thought I'd removed the unused parts. Perhaps some of src/libawkward/io/json.cpp is not required actually.

@jpivarski
Copy link
Member

There's more, but it's not specifically called out.

Also, I can remove the non-C++ template LayoutBuilder (i.e. the old one). I can do the JSON and the LayoutBuilder in two separate commits, for record-keeping.

@jpivarski
Copy link
Member

I just did git pull now, for the sake of synchronization.

@agoose77
Copy link
Collaborator Author

Ah yes - I left the old layout builder because it still works, and it can be used right-now (whilst we don't have a jitted version of the new layout builder).

@jpivarski
Copy link
Member

Are there tests with the old LayoutBuilder?

The motivation for reimplementing LayoutBuilder was so that it could be optimized for use in compiled loops. A Python interface to such a thing doesn't make sense because you wouldn't be able to use it in compiled loops. The C++ LayoutBuilder will eventually be usable in C++ JIT-compiled code (cppyy) and another implementation is planned for Numba.

But if we have any tests now with LayoutBuilder in a Python loop, we can just remove them without replacement, since future uses of LayoutBuilder won't involve anything like Python loops.

@agoose77
Copy link
Collaborator Author

@jpivarski Yes, tests/test_0924-layout-builder.py and tests/test_1302-layout-builder-nested-list.py

@agoose77
Copy link
Collaborator Author

Ah good catch @jpivarski. And I hadn't realised that the ToJSON wasn't used any more. I wonder if it was used for the forms, which I had temporarily left in after removing the contents? 🤔

@jpivarski
Copy link
Member

I don't think so; I think the forms built JSON with strings.

As for the two versions of from-JSON: there would have been no way to know. I recently revamped the JSON-handling code, and when I needed a C++ function, I wrote a new one beside the old one so that it would be easier to remove the old one (what I just did). There wasn't an obvious indication that one was old and the other new.

I'm almost done with the LayoutBuilder.

@@ -1,12 +1,12 @@
import awkward as ak
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file (being in the studies/ directory) didn't need to be changed, but it's not bad to drop all the "._v2"s.

@jpivarski jpivarski mentioned this pull request Sep 23, 2022
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

image

As you know, I made some changes. Look it over and see if you agree/have any changes to make based on these. I have created a main-v1 branch. If you think this is done, you can do the honors and merge it into main as a single commit.

(After that, PR #1666 will have to be closed and re-applied to main and possibly main-v1. I don't see how it can be easily merged with this, and anyway it's mostly generated by a pre-commit configuration.)

Final (?) compilation times (pip install .) on a Mac with 8 cores:

  • with v1 (old): 265.87s user 13.39s system 333% cpu 1:23.73 total
  • this PR (new): 86.52s user 7.11s system 371% cpu 25.186 total

Compiled binary sizes:

  • with v1 (old): 9.5M _ext.cpython-310-darwin.so, 1.3M libawkward-cpu-kernels.dylib, 6.1M libawkward.dylib
  • this PR (new): 1.1M _ext.cpython-310-darwin.so, 729K libawkward-cpu-kernels.dylib, 921K libawkward.dylib

Loading (import awkward) times:

  • with v1 (old): 95 ms
  • this PR (new): 83 ms (I guess it wasn't dominated by loading compiled symbols)

Anyway, awesome work! Disentangling v1 from v2 was not trivial!

@agoose77
Copy link
Collaborator Author

I'm happy to merge this. I won't update the docs, because we need to do more than just replace ak.layout in any documented examples of the mid-layer anyway.

There may well be some bugs / quirks in this branch, but I think at this point we pass the test suite, and its had 1/2 other pairs of eyes on it, so I am happy to iterate in main for now.

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Just a few minor comments. Please, check

src/awkward/_connect/avro.py Outdated Show resolved Hide resolved
include/awkward/python/util.h Show resolved Hide resolved
src/awkward/_connect/numexpr.py Outdated Show resolved Hide resolved
src/awkward/_connect/numexpr.py Outdated Show resolved Hide resolved
src/awkward/_connect/numpy.py Show resolved Hide resolved
@agoose77 agoose77 changed the title feat: replace v1 with v2 reference: feat: replace v1 with v2 Sep 23, 2022
@agoose77
Copy link
Collaborator Author

Closed by #1721

@agoose77 agoose77 closed this Sep 24, 2022
@agoose77 agoose77 deleted the agoose77/split-v1-v2 branch October 1, 2022 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ak._v2.from_iter should recognize Python datetimes/timedeltas
3 participants