reference: feat: replace v1 with v2 #1690

agoose77 · 2022-09-09T13:45:31Z

This is just me playing around with removing v1. Should you want to do this yourself, no problem!

agoose77 · 2022-09-10T16:10:14Z

Removing the v1 Python layer is fairly easy. Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets, and let the Python binding layer build the Array object from C++ via pybind Python execution. I haven't thought yet whether I like the idea of the ForthMachine returning pybind11's py::array outside of the Python layer, so I think I'm leaning towards removing the toNumpyArray interface and doing that at the bindings level.

jpivarski · 2022-09-10T19:07:46Z

Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets

Knowing that the split is coming up soon, we've been using the ForthMachine in ways that don't depend on the return values being ContentPtr and IndexPtr: we've been immediately casting them as NumPy arrays.

The bytecodes method/property is not used much. I think nothing would break if it were removed entirely, but it's probably better to replace it with just a 2-tuple of offsets and content, as you describe. It exists as a debugging tool (though the decompile one, which prints out source code reconstructed from the bytecodes, is more useful), and was constructed as a jagged array just because it occurred to me that one could do that. Unlike imperative bytecode, Forth bytecode does not have arbitrary jumps; it happens to be a structured language in the sense of not having GOTOs, a fact that was emphasized by Forth enthusiasts in the 80's, but nowadays nearly all languages are structured at the human-readable source code level. The only thing that stands out for Forth is that its bytecode can also be blocked out as subroutines, a jagged array.¹

Anyway, the point is that we've been using the ContentPtr/IndexPtr-dependent parts of ForthMachine in a minimal way so that they can be replaced by raw buffers (py::buffer_info in pybind11) without issues. Along the same lines, we will be removing the custom growable-buffer of ForthOutputBuffer to use Manasvi's new GrowableBuffer (and reap the benefits of non-contiguous filling), but that is an entirely separate step that can wait until the v1 pruning is entirely done. (ForthOutputBuffer doesn't depend on the old GrowableBuffer—that one can be safely deleted.)

Although somebody might point out that this is now true of other bytecodes that I don't know about. Maybe JVM? I'm pretty sure Python bytecode has unconditional jumps. ↩

agoose77 · 2022-09-13T13:35:48Z

Phew.

Now v1-less Awkward compiles and passes most tests, having removed the array and type machinery.

I've removed the low hanging fruit in the C++ side, so I've therefore left a much reduced version of Index.cpp. My rationale there is that older C++ standards don't have std::span yet, so either we start using pybind features inside of AwkwardForth, or we keep this small abstraction.

New bugs:

from_iter doesn't currently handle native Python datetime objects, so many tests fail. We should fix this.

jpivarski · 2022-09-13T13:48:13Z

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

agoose77 · 2022-09-13T15:21:02Z

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

I think it depends if we want to be passing around unbounded arrays. Actually, this is barely a point at the moment because the Index impl doesn't so anything safety wise (i.e. provide an iterator).

IIRC you never get an index directly; rather, you get the output and call toIndexXX on it, so we can just use the metadata from the output.

Otherwise, I still need to remove the unused kernels...

jpivarski · 2022-09-13T15:27:16Z

In the way AwkwardForth is used, the output of toNumpyArray and toIndex are used, they're immediately wrapped with np.asarray, so we only need one way to get a buffer out. We can even change the name (breaking Uproot), since Uproot v5 and Awkward v2 releases will be coordinated. A name change won't be very disruptive.

Otherwise, I still need to remove the unused kernels...

Perhaps removing unused kernels should be a separate PR. We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

jpivarski · 2022-09-13T15:28:12Z

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

agoose77 · 2022-09-13T15:38:42Z

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

Cool. I found these ones by directly looking with (Xonsh)

import yaml
import re

kernel_spec = yaml.safe_load(p"/home/angus/Git/awkward/kernel-specification.yml".read_text())['kernels']
in_use_specs = set($(rg r"awkward_\w+" src/awkward/contents src/awkward/_reducers.py -oNI).splitlines())
name_to_specs = {g['name']: [s['name'] for s in g['specializations']] for g in kernel_spec}
spec_to_name = {v: k for k, s in name_to_specs.items() for v in s}

# For cleaning up the spec file
in_use_names = (name_to_specs.keys() & in_use_specs)

# For removing kernels
for f in [f for f in pg`src/cpu-kernels/awkward*.cpp` if not any(s in f.read_text() for s in used_specs)]:
    f.unlink()

agoose77 · 2022-09-13T15:40:45Z

the output of toNumpyArray and toIndex are used

Yeah, now that all the removed code is gone, it's easier to see that we don't need an object that holds both length and ptr (besides the output itself). The main "issue" is internally within the Forth machine, rather than the bindings. It should be possible to just access the (shared) pointer directly within the Forth routines, and wrap it in the bindings.

agoose77 · 2022-09-13T15:42:39Z

We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

Agreed, there are several that I know of (not sure if they're still carried through in this PR, haven't checked) ;)

agoose77 · 2022-09-13T16:34:30Z

OK, this now removes Index as well, so it's nearly all the way there. I've rebased my commits where possible to give a clean-ish history. I can't guarantee that every commit builds a wheel that passes the test suite (they certainly don't when I start moving files), but after the large-scale refactoring commits, most revisions then compile.

agoose77 · 2022-09-13T16:55:39Z

I haven't fully pulled out the kernel dispatch mechanism yet. I'll tentatively note that I'm not intimately familiar with this code, but my current understanding is that our kernel handling has all been superseded by the nplike dispatch mechnism.

jpivarski · 2022-09-13T16:59:43Z

Any kernel-dispatch in C++ is old. We have an entirely new kernel-dispatch system in Python, using ctypes.

Anything you find in C++ about startup functions, particularly for setting up awkward-cuda-kernels, can be removed. That's all Python now as well.

Another thing that can go is the dlpack git submodule. That was all for supporting CUDA arrays in C++, which is now handled by CuPy in nplike.

agoose77 · 2022-09-13T19:07:06Z

Another thing that can go is the dlpack git submodule

It's already gone!

agoose77 · 2022-09-13T21:09:21Z

OK, now we're at a point where there's not much else (anything?) that needs to be removed.

The main changes are now described in this PRs description.

agoose77 · 2022-09-15T22:21:26Z

I've added some rudimentary time handling. I originally tried employing std::chrono, but I clearly don't fully understand how C++ is handling times vs Python:

If I calculate the local epoch using:

+std::chrono::system_clock::time_point local_epoch() {
+    std::tm tm = {
+       /* .tm_sec  = */ 0,
+       /* .tm_min  = */ 0,
+       /* .tm_hour = */ 0,
+       /* .tm_mday = */ 1,
+       /* .tm_mon  = */ 1 - 1,
+       /* .tm_year = */ 1970 - 1900,
+    };
+    // Use local time zone DST
+    tm.tm_isdst = -1;
+    return std::chrono::system_clock::from_time_t(std::mktime(&tm));
+}

Then when I use this in these calculations, I get one-hour shifts for particular datetimes:

+    auto time = obj.cast<std::chrono::system_clock::time_point>();
+    int64_t time_since_epoch_us = std::chrono::duration_cast<std::chrono::microseconds>(
+        time - local_epoch()
+    ).count();
+    std::cout << "time since epoch     " << time_since_epoch_us << std::endl;
+    std::cout << "time since epoch cpp " << std::chrono::duration_cast<std::chrono::microseconds>(time.time_since_epoch()).count() << std::endl;
+    self.datetime(time_since_epoch_us, "datetime64[us]");

I am currently under the impression that this is because the timezones differ between Python and C++. I'm looking into this.

agoose77 · 2022-09-16T00:03:39Z

The above code example is wrong - the epoch should not be specified in local time (via std::tm). Instead, it's just std::time_t (0).

Now that I'm more familiar with mktime vs localtime, some thoughts on my timezone (BST)

NumPy's datetime objects are naive and don't attempt to determine any DST offsets
datetime.datetime.timestamp does try to use the local timezone to figure out DST offsets
(dt - epoch) does not try to do this, so behaves like NumPy
std::mktime (used by pybind11) converts from local clock into UTC. The tm_isdst flag is used to help mktime figure out how to transform the input. It's set to -1 by default, which means that mktime uses the local clock to figure out if the time period should be during DST, and corrects it if so

tm_isdst expected DST? offset new tm_isdst

1 yes 0 1

1 no -1 0

0 yes 1 1

0 no 0 0

-1 yes 0 1

-1 no 0 0

I'm least confident on -1 for tm_isdst. I believe that -1 just signals to mktime to determine from the local time whether the given time is DST, and reflect that guess in the field. It doesn't modify the time fields after deduction - it assumes the user entered the correct local time for the timezone. mktime is idempotent, as it only changes fields if tm_isdst is changed.
time_t literally returns the same as time_since_epoch()
mktime considers whether tm_isdst changes when building the result time_t.

agoose77 · 2022-09-16T12:53:58Z

Actually, I didn't think about this - the timezone boundaries do not map onto distinct values: 01:00 and 02:00 (UTC) in my current timezone on 2022/03/27 are both the same local times (01:00):

1 AM BST = 1 AM UTC
2 AM BST = 1 AM UTC
so, it's not possible to losslessly recover the input (local) time from the UTC output (in std::time_t). Therefore, we either need to modify the timezone (set it to UTC), or do this in Python.

jpivarski · 2022-09-16T13:12:25Z

Since NumPy does not deal with timezones (its times are timezone-naive, which is to say that knowledge of the timezone is not contained within the array, but is managed by the user in some external metadata somewhere), then we have to operate at the same level (also timezone-naive).

Python's datetime library can operate in both modes: https://stackoverflow.com/a/7065242/1623645

Probably the least-surprising way to turn timezone-aware datetimes (datetimes with the tzinfo set to non-None) into timezone-naive datetimes would be to remove the tzinfo, rather than converting to UTC, as in the StackOverflow example. The best we can do is to be consistent and document it.

This is not the worst thing to happen to people who deal with time-data...

agoose77 · 2022-09-16T13:52:06Z

We're on the same page w.r.t whether we support timezones!

In this PR, the issue I'm running into is concerned with the pybind11 bindings for datetime.datetime objects. If we want to bump performance by not calling into the Python interpreter several times per datetime.datetime object, then it's best if we convert them once into a native C++ type, e.g. std::time_t. However, the pybind11 converters use mktime to produce a std::chrono::system_clock::time_point object (which is convertible to time_t) according to the localtime settings (see pybind/pybind11#1638). This means that the conversion result depends upon the user's system.

Actually, writing this response has given me an idea for a "hack". These problems all go away if the localtime is UTC, because there are no DST periods. However, setting the TZ env var from requires us to add special code to handle windows _putenv vs linux putenv. Or, we could just call into Python and let Python handle this, once per from_iter call. I'd prefer to just fix this at the root, which would probably require pybind11 to ship a third-party header-only library, (or we could do this and forgo pybind11's conversion).

agoose77 · 2022-09-22T22:30:23Z

@jpivarski NB I haven't touched "studies/" yet, because that should be a simple find-replace and I didn't want to grow this stage of the PR review yet.

agoose77 · 2022-09-22T22:31:00Z

    elif isinstance(array, ak.layout.ArrayBuilder):
        form = ak.forms.from_json(array.form())
        return ak.types.ArrayType(
            form.type_from_behavior(self._behavior), len(self._layout)
        )
Eventually, we'll want to get rid of ak.layout as a module. Now it only contains the low-level ArrayBuilder and two instances of LayoutBuilder-as-Python that we'll be removing. Maybe that's too much to tack onto this PR. But after that point, the low-level ArrayBuilder can be referred to simply as ak._ext.ArrayBuilder.

I'll make this a todo for later review

jpivarski · 2022-09-22T22:40:38Z

@jpivarski NB I haven't touched "studies/" yet, because that should be a simple find-replace and I didn't want to grow this stage of the PR review yet.

The studies/ directory should be left as-is. It's just a record of the temporary work that we did to figure out the algorithms, and so removing v1 doesn't change anything there.

The code in studies/ doesn't even need to be able to run. It's a historical record.

jpivarski · 2022-09-22T22:41:46Z

There's a lot of JSON-handling code that can now be removed, too. Would you mind if I git pull and do it?

That is, you're not working on something that would be hard to merge if I remove some C++?

agoose77 · 2022-09-22T22:42:55Z

@jpivarski sure, which module is this? I thought I'd removed the unused parts. Perhaps some of src/libawkward/io/json.cpp is not required actually.

jpivarski · 2022-09-22T22:44:06Z

There's more, but it's not specifically called out.

Also, I can remove the non-C++ template LayoutBuilder (i.e. the old one). I can do the JSON and the LayoutBuilder in two separate commits, for record-keeping.

jpivarski · 2022-09-22T22:44:32Z

I just did git pull now, for the sake of synchronization.

agoose77 · 2022-09-22T22:45:10Z

Ah yes - I left the old layout builder because it still works, and it can be used right-now (whilst we don't have a jitted version of the new layout builder).

jpivarski · 2022-09-22T22:50:16Z

Are there tests with the old LayoutBuilder?

The motivation for reimplementing LayoutBuilder was so that it could be optimized for use in compiled loops. A Python interface to such a thing doesn't make sense because you wouldn't be able to use it in compiled loops. The C++ LayoutBuilder will eventually be usable in C++ JIT-compiled code (cppyy) and another implementation is planned for Numba.

But if we have any tests now with LayoutBuilder in a Python loop, we can just remove them without replacement, since future uses of LayoutBuilder won't involve anything like Python loops.

agoose77 · 2022-09-22T22:53:39Z

@jpivarski Yes, tests/test_0924-layout-builder.py and tests/test_1302-layout-builder-nested-list.py

agoose77 · 2022-09-22T23:08:33Z

Ah good catch @jpivarski. And I hadn't realised that the ToJSON wasn't used any more. I wonder if it was used for the forms, which I had temporarily left in after removing the contents? 🤔

jpivarski · 2022-09-22T23:12:30Z

I don't think so; I think the forms built JSON with strings.

As for the two versions of from-JSON: there would have been no way to know. I recently revamped the JSON-handling code, and when I needed a C++ function, I wrote a new one beside the old one so that it would be easier to remove the old one (what I just did). There wasn't an obvious indication that one was old and the other new.

I'm almost done with the LayoutBuilder.

jpivarski · 2022-09-22T23:52:50Z

studies/awkward_cupy_perfscript.py

@@ -1,12 +1,12 @@
 import awkward as ak


This file (being in the studies/ directory) didn't need to be changed, but it's not bad to drop all the "._v2"s.

jpivarski

LGTM!

As you know, I made some changes. Look it over and see if you agree/have any changes to make based on these. I have created a main-v1 branch. If you think this is done, you can do the honors and merge it into main as a single commit.

(After that, PR #1666 will have to be closed and re-applied to main and possibly main-v1. I don't see how it can be easily merged with this, and anyway it's mostly generated by a pre-commit configuration.)

Final (?) compilation times (pip install .) on a Mac with 8 cores:

with v1 (old): 265.87s user 13.39s system 333% cpu 1:23.73 total
this PR (new): 86.52s user 7.11s system 371% cpu 25.186 total

Compiled binary sizes:

with v1 (old): 9.5M _ext.cpython-310-darwin.so, 1.3M libawkward-cpu-kernels.dylib, 6.1M libawkward.dylib
this PR (new): 1.1M _ext.cpython-310-darwin.so, 729K libawkward-cpu-kernels.dylib, 921K libawkward.dylib

Loading (import awkward) times:

with v1 (old): 95 ms
this PR (new): 83 ms (I guess it wasn't dominated by loading compiled symbols)

Anyway, awesome work! Disentangling v1 from v2 was not trivial!

agoose77 · 2022-09-23T08:00:02Z

I'm happy to merge this. I won't update the docs, because we need to do more than just replace ak.layout in any documented examples of the mid-layer anyway.

There may well be some bugs / quirks in this branch, but I think at this point we pass the test suite, and its had 1/2 other pairs of eyes on it, so I am happy to iterate in main for now.

ianna

Great job! Just a few minor comments. Please, check

src/awkward/_connect/avro.py

include/awkward/python/util.h

src/awkward/_connect/numexpr.py

src/awkward/_connect/numpy.py

agoose77 · 2022-09-24T12:48:50Z

Closed by #1721

agoose77 marked this pull request as draft September 9, 2022 13:45

agoose77 force-pushed the agoose77/split-v1-v2 branch 2 times, most recently from 9b7f7d0 to 344276e Compare September 10, 2022 14:22

agoose77 force-pushed the agoose77/split-v1-v2 branch 2 times, most recently from d3cd98e to 511356f Compare September 13, 2022 13:34

agoose77 force-pushed the agoose77/split-v1-v2 branch from 83f5a1a to 987ec50 Compare September 13, 2022 16:32

agoose77 force-pushed the agoose77/split-v1-v2 branch from 15de1cd to ec8390c Compare September 13, 2022 20:20

agoose77 changed the title ~~chore: remove v1~~ chore: replace v1 with v2 Sep 13, 2022

agoose77 force-pushed the agoose77/split-v1-v2 branch from 3546e43 to a992b09 Compare September 13, 2022 21:09

agoose77 force-pushed the agoose77/split-v1-v2 branch from a992b09 to be6445a Compare September 14, 2022 17:42

jpivarski added 2 commits September 22, 2022 17:54

Removed ToJson and its subclasses.

cf2b2b2

Removed the old/unused from-json functions, leaving the new ones intact.

2fc8876

jpivarski added 2 commits September 22, 2022 18:18

Removed the Python-interfaced (i.e. old) LayoutBuilder.

8954c1d

Removed ak.layout in favor of getting ArrayBuilder directly from _ext.

4889c2a

jpivarski reviewed Sep 22, 2022

View reviewed changes

jpivarski mentioned this pull request Sep 23, 2022

chore: add isort #1666

Merged

jpivarski approved these changes Sep 23, 2022

View reviewed changes

ianna reviewed Sep 23, 2022

View reviewed changes

agoose77 added 4 commits September 23, 2022 09:05

refactor: remove redundant if

d2faec9

chore: remove old comment

44513cf

chore: remove old code

f24d21a

chore: remove commented debug code

6e696b1

agoose77 mentioned this pull request Sep 23, 2022

feat: replace v1 with v2 (preserve-history) #1721

Merged

agoose77 changed the title ~~feat: replace v1 with v2~~ reference: feat: replace v1 with v2 Sep 23, 2022

agoose77 closed this Sep 24, 2022

agoose77 deleted the agoose77/split-v1-v2 branch October 1, 2022 13:57

reference: feat: replace v1 with v2 #1690

reference: feat: replace v1 with v2 #1690

Conversation

agoose77 commented Sep 9, 2022 • edited Loading

agoose77 commented Sep 10, 2022

jpivarski commented Sep 10, 2022

Footnotes

agoose77 commented Sep 13, 2022

jpivarski commented Sep 13, 2022

agoose77 commented Sep 13, 2022

jpivarski commented Sep 13, 2022

jpivarski commented Sep 13, 2022

agoose77 commented Sep 13, 2022

agoose77 commented Sep 13, 2022 • edited Loading

agoose77 commented Sep 13, 2022

agoose77 commented Sep 13, 2022

agoose77 commented Sep 13, 2022

jpivarski commented Sep 13, 2022

agoose77 commented Sep 13, 2022

agoose77 commented Sep 13, 2022

agoose77 commented Sep 15, 2022

agoose77 commented Sep 16, 2022 • edited Loading

agoose77 commented Sep 16, 2022

jpivarski commented Sep 16, 2022

agoose77 commented Sep 16, 2022

agoose77 commented Sep 22, 2022

agoose77 commented Sep 22, 2022

jpivarski commented Sep 22, 2022

jpivarski commented Sep 22, 2022

agoose77 commented Sep 22, 2022 • edited Loading

jpivarski commented Sep 22, 2022

jpivarski commented Sep 22, 2022

agoose77 commented Sep 22, 2022

jpivarski commented Sep 22, 2022

agoose77 commented Sep 22, 2022

agoose77 commented Sep 22, 2022

jpivarski commented Sep 22, 2022

jpivarski Sep 22, 2022

Choose a reason for hiding this comment

jpivarski left a comment

Choose a reason for hiding this comment

agoose77 commented Sep 23, 2022

ianna left a comment

Choose a reason for hiding this comment

agoose77 commented Sep 24, 2022

agoose77 commented Sep 9, 2022 •

edited

Loading

agoose77 commented Sep 13, 2022 •

edited

Loading

agoose77 commented Sep 16, 2022 •

edited

Loading

agoose77 commented Sep 22, 2022 •

edited

Loading