Precomputed offset changes. #82
Conversation
Simplify code ✔️ Sounds good to me :-)
Some other benefits of this that I thought of afterwards.

Subsetting: because the data format for a zone is self-contained and well defined, writing code to get a subset of years (2014-2020, for example) is just a matter of deserializing, filtering out entries outside the range, and re-serializing.

Other data sources: @mj1856 may be able to confirm this, but I believe Windows does not have the same data source as https://github.com/eggert/tz for its timezones. It is now possible to generate data sets from non-tzdb sources, as the data is no longer a Rule/Zone data set, but a timestamp/offset/abbr data set.
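As a rough illustration of that subsetting idea, here is a minimal sketch. The zone shape (parallel `abbrs`, `offsets`, and `untils` arrays, with timestamps in seconds) is an assumption for the example, not the library's actual API:

```js
// Hypothetical sketch of trimming an unpacked zone to a year range.
// Entry i is assumed to apply from untils[i - 1] up to untils[i] (seconds).
function subsetZone(zone, startYear, endYear) {
  const start = Date.UTC(startYear, 0, 1) / 1000;
  const end = Date.UTC(endYear + 1, 0, 1) / 1000;
  const out = { name: zone.name, abbrs: [], offsets: [], untils: [] };
  for (let i = 0; i < zone.untils.length; i++) {
    const appliesFrom = i === 0 ? -Infinity : zone.untils[i - 1];
    // keep any entry whose interval overlaps [start, end)
    if (zone.untils[i] > start && appliesFrom < end) {
      out.abbrs.push(zone.abbrs[i]);
      out.offsets.push(zone.offsets[i]);
      out.untils.push(zone.untils[i]);
    }
  }
  return out; // re-serialize this the same way the full zone was packed
}
```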
Ok, after looking at this again, I think we can make even more optimizations. The new data format combines the index arrays for both offsets and abbreviations into one array. This comes with slight duplication in the abbreviation and offset maps, but it is not major.
The filesize savings for this change are fairly substantial: 10% raw and 14% gzipped.
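If I am reading the combined-index idea correctly, it amounts to deduplicating (abbreviation, offset) pairs rather than each map separately, so a single index string can address both maps at once. A rough sketch, with made-up names, under that assumption:

```js
// Hypothetical sketch: build one shared index over (abbr, offset) pairs.
// The abbr and offset maps may repeat values so they stay aligned,
// but only one index string needs to be shipped per zone.
function combineIndices(entries) {
  // entries: [{ abbr, offset }, ...] in chronological order
  const pairKeys = [];
  const abbrMap = [];
  const offsetMap = [];
  const index = [];
  for (const { abbr, offset } of entries) {
    const key = abbr + '|' + offset;
    let i = pairKeys.indexOf(key);
    if (i === -1) {
      i = pairKeys.length;
      pairKeys.push(key);
      abbrMap.push(abbr);     // may duplicate an abbr already in the map
      offsetMap.push(offset); // may duplicate an offset already in the map
    }
    index.push(i);
  }
  return { abbrMap, offsetMap, index };
}
```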
Another change I made was to switch from base62 to base60. While this seems like it would actually increase the data size, there is a nice correlation between our data and the number 60: there are 60 seconds in a minute and 60 minutes in an hour, so every value that is a whole number of hours ends with a 0. This also has the benefit of making the offsets easier to read.
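For illustration, here is a tiny base60 encoder. The digit alphabet below (0-9, a-z, then A-X) is an assumption made for the sketch, not necessarily what the build scripts use:

```js
// Assumed digit alphabet: 0-9, a-z, A-X (60 symbols).
const BASE60 = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWX';

// Encode a non-negative integer in base60 (enough for this illustration).
function toBase60(n) {
  let out = '';
  do {
    out = BASE60[n % 60] + out;
    n = Math.floor(n / 60);
  } while (n > 0);
  return out;
}

// Offsets in seconds: whole hours end with trailing zeros.
console.log(toBase60(3600));  // "100" (1 hour)
console.log(toBase60(28800)); // "800" (8 hours)
console.log(toBase60(19800)); // "5u0" (5 hours 30 minutes; digit 30 is 'u')
```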
Use zdump to generate the data set as well as the unit tests. Move timezone-specific unit tests to tests/zones. Change timezone tests to use a common helper method.
I think I might be doing something wrong, but it seems to me that the latest changes on the develop branch break DST support.
@richardfickling, sorry, yes, the develop branch is going to be broken for a little while. I'm working on a pretty substantial rewrite of the internals of moment-timezone, and only a few of the APIs will stay the same. You can follow along with the task list in #87, but if you need a working version, I'd suggest using version 0.0.6: https://github.com/moment/moment-timezone/tree/0.0.6
No problem, that's why they call it the develop branch :) I just wanted to confirm that it wasn't something that was expected to be working.
Regarding the comment about Windows timezone sources - don't bother with that. Anything serious uses the IANA TZDB, even on Windows. I'm more concerned with avoiding use of the
Also, should we really be using data straight from eggert/tz? Technically, that's the unofficial release. The official releases come from iana.org/time-zones. AFAIK, they're identical, but I'm not sure if it's a sound process. If we are to continue to link to eggert/tz as a submodule, we should be sure to use the specific commits that are tagged as releases with the corresponding IANA version. The version number should be somewhere in the data, and there should also be a function to easily retrieve the version number.
Ah, I was not aware of http://www.iana.org/time-zones; I'll make sure to use that as the source when I'm working on the data build process.

For the data, I was planning on storing all previous releases in their packed and unpacked formats. I'll make sure the data version number is included as well, probably set as a property.
For the DST issue, I just wrote this up separately as #88.
Storing DST occurrences directly seems like the right thing to do. I still don't quite understand how (and why) there is another (the original) format to store timezone data.

Another cool thing is that we can offer a slimmer version that only supports years after 1990 (for example) -- it won't have a bunch of weird irregular DST changes of the past.

About all the size/performance improvements -- I wouldn't go too crazy with the encoding, because that would complicate the build scripts. Using base60 is a bit extreme -- I agree it saves a few bytes, but where do you get that from? (I guess you did a custom implementation in JS.) We can keep it that way, just please don't make it crazier.

What about the extra data (zone aliases and metadata), which is not part of the new format? Isn't it still needed? Would it be present in the same format as now?
The original format is based on the source format of the tzdb.

For the slimmer versions, we should have a couple of builds easily available for download on momentjs.com.

The compressed encoding was originally base62, which is a pretty common poor man's compression. I switched because the data lends itself very well to base60: hours have 60 minutes and minutes have 60 seconds, so whole hours come out as round numbers.

The zone aliases got lost as a side effect of the new build process. I think it would be a nice optimization in the build process to create aliases out of zones that have the same data. This is especially useful when we start subsetting data by year ranges and the rules that would have differentiated two zones no longer apply. See the madhouse that is Indiana before 1970.

The zone metadata contained latitude, longitude, and the rules used. Because we no longer use rules, those can be removed. The lat/long data is still useful, but I think it would be better suited outside the core zone data.
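A sketch of how that alias optimization could look in a build script, assuming packed strings whose first pipe-separated section is the zone name (as in the format described in the pull request body below); the function name and return shape are made up for illustration:

```js
// Hypothetical sketch: if two zones pack to identical data (ignoring the name),
// keep one as canonical and record the other as an alias.
function dedupeZones(packedZones) {
  const byData = new Map(); // packed data (without name) -> canonical zone name
  const zones = [];
  const aliases = [];
  for (const packed of packedZones) {
    const [name, ...rest] = packed.split('|');
    const data = rest.join('|');
    if (byData.has(data)) {
      aliases.push([name, byData.get(data)]); // alias -> canonical name
    } else {
      byData.set(data, name);
      zones.push(packed);
    }
  }
  return { zones, aliases };
}
```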
Here's an idea - when you create the data builder to go along with the new format - how about a field to pick the earliest year you want in the data? We could then truncate accordingly. I'm sure many people only care about the date their app launches and later. Some might want a few years back for legacy data. Few probably need the entire history of the TZDB. |
This is a pretty major change, so bear with me.
Thought Process
After looking at #79, I was trying to think of how we could cache computations better.
I was thinking we could iterate through all the zones and rules and build up an array of relevant data (start or end timestamp, offset, and abbreviation). Then, when looking up the offset for a moment, we would just iterate through the array until we reached the relevant index.
I took a couple attempts at building this list from the ZoneSet, Zone, RuleSet, and Rules that are currently in place and things were becoming increasingly complicated.
Then I realized, we are already using zdump(8) to get data in this exact format for constructing the tests. Why not use zdump(8) to construct the data as well? (I know it's a little redundant to use the same data source for getting data and constructing tests, but the benefits are pretty substantial.)
Once I switched to using precomputed arrays of offset changes, the core lib was simplified dramatically. The RuleSets, Rules, and ZoneSets were all removed, and replaced with a new Zone object which can be constructed with a single string.
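For reference, the lookup this enables is roughly the following; the array and property names here are assumptions for the sketch, not the final API:

```js
// Walk the sorted `untils` array until the timestamp falls before an entry,
// then read the abbreviation and offset at that same index.
function zoneInfoAt(zone, timestamp) {
  const { untils, abbrs, offsets } = zone; // parallel arrays, ascending untils
  for (let i = 0; i < untils.length; i++) {
    if (timestamp < untils[i]) {
      return { abbr: abbrs[i], offset: offsets[i] };
    }
  }
  // past the last known change: fall back to the final entry
  return { abbr: abbrs[abbrs.length - 1], offset: offsets[offsets.length - 1] };
}
```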
Filesize changes
One of the concerns I had with precomputed arrays was the large number of offset changes that would need to be shipped to the client. America/St_Johns is a notable example with 239 offset changes between 1900 and 2038. I wanted to make sure we wouldn't take too much of a hit on the data filesizes.
After running some filesize tests, the increase is not that bad. moment-timezone.json contains all the old rules and zones, and moment-timezone-2.json contains the new zones. It's not a 100% direct comparison, as moment-timezone.json also contains some zone metadata and the zone aliases, but it's at least a good frame of reference.
One thing to note is that the new data gzips much better than the old data.
New Data Format
Here are the two data formats compared.
The new version is more portable and durable. It no longer requires complicated logic to figure out which rules need to be included for which zones. Each zone is completely self-sufficient.
In order to save as many bytes as possible, I used this super compact format to store the data.
The data is split into 6 sections separated by pipes.
America/Los_Angeles|PST PDT PWT PPT|01010230101010101010101010101011010101010101010101010|7uw 6ys|01010110101010101010101010101011010101010101010101010|-1Mx224 1e796 TQkw 1e796 LBHj2 7uX9S gPhS 5ePXq 1IcHC
0. Name
The canonical name of the timezone.
1. Abbreviation Map
A space separated list of all the abbreviations ever used in this timezone.
2. Abbreviation Index
A tightly packed array of indices into the abbreviation map. These are all in base62, so it will support up to 62 abbreviations per timezone. All zones have less than 10 unique abbreviations right now, so it should be fairly future proof to represent each index with a single character.
3. Offset Map
A space separated list of all the offsets ever used in this timezone in seconds in base62.
4. Offset Index
A tightly packed array of indices into the offset map. These are also in base62.
5. Timestamp Diff
This is where the timestamps are stored.
All these numbers represent the 'until' value of a change. The associated abbreviation and offset should only be used if the time being evaluated is before this value.
Because we are dealing with a sorted list of integers, we can just store the diff from the last integer rather than storing the full integer.
The first item in the array is a unix timestamp in seconds in base62. All items after the first item are numbers of seconds in base62 to be added to the previous value during unpacking.
One of the nice things about using diffs instead of whole integers is that the diffs are often the same from one year to the next, making gzipping much more effective.
I experimented with using a map + index for these values, like the offsets and abbreviations, but some of the timezones with many changes overflowed the 62-value index. Additionally, the gzipped size increased when switching to a map + index, even though the un-gzipped size decreased.
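To make the unpacking concrete, here is a hedged sketch. The base62 digit alphabet (0-9, a-z, A-Z) is an assumption on my part, though it is consistent with the example above, where "7uw" decodes to 28800 seconds, i.e. the 8-hour offset of PST:

```js
const BASE62 = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function fromBase62(str) {
  const negative = str[0] === '-';
  let value = 0;
  for (const ch of negative ? str.slice(1) : str) {
    value = value * 62 + BASE62.indexOf(ch);
  }
  return negative ? -value : value;
}

function unpack(packed) {
  const [name, abbrMap, abbrIndex, offsetMap, offsetIndex, diffs] =
    packed.split('|');

  const abbrs = abbrMap.split(' ');
  const offsets = offsetMap.split(' ').map(fromBase62); // seconds

  // Rebuild absolute timestamps from the diffs: the first value is a unix
  // timestamp in seconds, each later value is added to the previous one.
  const untils = [];
  let last = 0;
  for (const diff of diffs.split(' ')) {
    last += fromBase62(diff);
    untils.push(last);
  }

  return {
    name,
    untils, // seconds since the epoch, ascending
    abbrs: abbrIndex.split('').map(i => abbrs[fromBase62(i)]),
    offsets: offsetIndex.split('').map(i => offsets[fromBase62(i)]),
  };
}
```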
Speed
The original implementation was fairly slow. Looping through ZoneSets and Zones and RuleSets and Rules took considerable time. When memoization was added in #39, that time was significantly lessened. However, it came at the expense of memory leaks as noted in #79.
Because we no longer need to loop through ZoneSets, Zones, RuleSets and Rules, things are even faster. Here are the times for running all unit tests.
Memory
Because we are no longer memoizing results for each new moment, the memory used is constant per timezone. Each timezone will have 3 arrays of the same length (186 in the case of America/Los_Angeles, for example). One array will contain strings, and the other two will contain numbers. This will probably be even less memory than the original moment-timezone used.
Additionally, the data is only unpacked from the source string the first time a timezone is used. Even if all the zone strings are loaded into moment-timezone, only the ones that are used will be unpacked into their array values.
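As a sketch of that lazy unpacking (the names here are illustrative, and `unpack` refers to the sketch above, not the library's real function):

```js
const packedZones = {};   // name -> packed string, registered up front
const unpackedZones = {}; // name -> unpacked zone, populated on demand

function addZone(packed) {
  packedZones[packed.split('|')[0]] = packed;
}

function getZone(name) {
  // Unpack into the three parallel arrays only on first use, then cache.
  if (!unpackedZones[name] && packedZones[name]) {
    unpackedZones[name] = unpack(packedZones[name]);
  }
  return unpackedZones[name];
}
```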
Complexity
As you can see from the moment-timezone.js diff below, the library is now significantly less complex. Of the 532 lines, 381 were deleted and 89 added for a new total of 240 lines.
The old version took 80+ dev hours to get all the tests passing, as each little edge case would break another edge case. As a result, the current code is very fragile, and refactoring is unrealistic.
With the new version, the code path for finding an offset for a moment is just a handful of statements, which makes it much more accessible to people unfamiliar with the codebase.
I think this should also make solving #27 much easier. I have taken a few stabs at solving #27, and each time hit a roadblock after a couple days of work.
TLDR;
Switching to the new data format makes the library faster, smaller, and simpler, and it no longer leaks memory.