
Precomputed offset changes. #82

Merged: timrwood merged 1 commit into develop on May 23, 2014

Conversation

timrwood
Member

This is a pretty major change, so bear with me.

Thought Process

After looking at #79, I was trying to think of how we could cache computations better.

I was thinking we could iterate through all the zones and rules and build up an array of relevant data (start or end timestamp, offset, and abbreviation). Then, when looking up the offset for a moment, we would just iterate through the array until we reached the relevant index.

I took a couple of attempts at building this list from the ZoneSet, Zone, RuleSet, and Rule objects currently in place, and things were becoming increasingly complicated.

Then I realized we are already using zdump(8) to get data in this exact format for constructing the tests. Why not use zdump(8) to construct the data as well? (I know it's a little redundant to use the same data source for getting data and constructing tests, but the benefits are pretty substantial.)

Once I switched to using precomputed arrays of offset changes, the core lib was simplified dramatically. The RuleSets, Rules, and ZoneSets were all removed, and replaced with a new Zone object which can be constructed with a single string.

Filesize changes

One of the concerns I had with precomputed arrays was the large number of offset changes that would need to be shipped to the client. America/St_Johns is a notable example with 239 offset changes between 1900 and 2038. I wanted to make sure we wouldn't take too much of a hit on the data filesizes.

After running some filesize tests, the increase is not that bad. moment-timezone.json contains all the old rules and zones, and moment-timezone-2.json contains the new zones. It's not a 100% direct comparison, as moment-timezone.json also contains some zone metadata and the zone aliases, but it's at least a good frame of reference.

moment-timezone.json    raw 182574  gzip 36712
moment-timezone-2.json  raw 220336  gzip 25077

One thing to note is that the new data gzips much better than the old data.

New Data Format

Here are the two data formats compared.

// OLD
{
    "zones": {
        "America/Los_Angeles": [
            "-7:52:58 - LMT 1883_10_18_12_7_2 -7:52:58",
            "-8 US P%sT 1946 -8",
            "-8 CA P%sT 1967 -8",
            "-8 US P%sT"
        ]
    },
    "rules": {
        "US": [
            "1918 1919 2 0 8 2 0 1 D",
            "1918 1919 9 0 8 2 0 0 S",
            "1942 1942 1 9 7 2 0 1 W",
            "1945 1945 7 14 7 23 1 1 P",
            "1945 1945 8 30 7 2 0 0 S",
            "1967 2006 9 0 8 2 0 0 S",
            "1967 1973 3 0 8 2 0 1 D",
            "1974 1974 0 6 7 2 0 1 D",
            "1975 1975 1 23 7 2 0 1 D",
            "1976 1986 3 0 8 2 0 1 D",
            "1987 2006 3 1 0 2 0 1 D",
            "2007 9999 2 8 0 2 0 1 D",
            "2007 9999 10 1 0 2 0 0 S"
        ],
        "CA": [
            "1948 1948 2 14 7 2 0 1 D",
            "1949 1949 0 1 7 2 0 0 S",
            "1950 1966 3 0 8 2 0 1 D",
            "1950 1961 8 0 8 2 0 0 S",
            "1962 1966 9 0 8 2 0 0 S"
        ]
    }
}

// NEW
"America/Los_Angeles|PST PDT PWT PPT|010102301010101010101010101010101
0101010101010101010101010101010101010101010101010101010101010101010101
0101010101010101010101010101010101010101010101010101010101010101010101
0101010101010|7uw 6ys|010101101010101010101010101010101010101010101010
1010101010101010101010101010101010101010101010101010101010101010101010
10101010101010101010101010101010101010101010101010101010101010101010|
-1Mx224 1e796 TQkw 1e796 LBHj2 7uX9S gPhS 5ePXq 1IcHC 2PtBu Rh7y 1gGm4
TOso 1e91e TOso 1e91e TOso 1e91e TOso 1e91e TOso 1gGm4 TOso 1e91e TOso
1e91e TOso 1e91e TOso 1e91e TOso 1gGm4 Rh7y 1gGm4 13XNK 13ZFS 13XNK
13ZFS 13XNK 13ZFS 16v8A 11sl2 16v8A 13ZFS 13XNK 13ZFS 13XNK 13ZFS 13XNK
13ZFS 13XNK 13ZFS 16v8A 13ZFS 13XNK 13ZFS 13XNK pois 1Izba H9Ek 1qNPi
13ZFS 16v8A 11sl2 16v8A 13ZFS 13XNK 13ZFS 13XNK 13ZFS 13XNK 13ZFS 13XNK
13ZFS 16v8A 11sl2 16v8A 13ZFS 13XNK 13ZFS 13XNK 13ZFS 13XNK WnFm 1bzOg
WnFm 1e796 TQkw 1e796 TQkw 1e796 WnFm 1bzOg WnFm 1bzOg WnFm 1e796 TQkw
1e796 TQkw 1e796 WnFm 1bzOg WnFm 1bzOg WnFm 1bzOg WnFm 1e796 TQkw 1e796
TQkw 1e796 WnFm 1bzOg WnFm 1bzOg WnFm 1e796 TQkw 1e796 TQkw 1e796 Mek0
1ogus JGZa 1ogus JGZa 1ogus Mek0 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus
JGZa 1ogus JGZa 1ogus Mek0 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus JGZa
1ogus Mek0 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus
Mek0 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus Mek0 1ogus JGZa
1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus JGZa 1ogus",

The new version is more portable and durable. It no longer requires complicated logic to figure out which rules need to be included for which zones; each zone is completely self-sufficient.

In order to save as many bytes as possible, I used this super compact format to store the data.

The data is split into 6 sections separated by pipes.

#  Type                Example
0  Name                America/Los_Angeles
1  Abbreviation Map    PST PDT PWT PPT
2  Abbreviation Index  01010230101010101010101010101011010101010101010101010
3  Offset Map          7uw 6ys
4  Offset Index        01010110101010101010101010101011010101010101010101010
5  Timestamp Diff      -1Mx224 1e796 TQkw 1e796 LBHj2 7uX9S gPhS 5ePXq 1IcHC
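
For reference, here is a minimal sketch (not the library's actual code, and with a hypothetical helper name) of splitting a packed zone string into those six pipe-separated sections:

// Only the split itself is shown here, not the unpacking of the
// individual sections.
function splitPacked(packed) {
    var s = packed.split('|');
    return {
        name:        s[0],            // "America/Los_Angeles"
        abbrMap:     s[1].split(' '), // ["PST", "PDT", "PWT", "PPT"]
        abbrIndex:   s[2],            // "0101023010..."
        offsetMap:   s[3].split(' '), // ["7uw", "6ys"]
        offsetIndex: s[4],            // "0101011010..."
        diffs:       s[5].split(' ')  // ["-1Mx224", "1e796", ...]
    };
}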

0. Name

The canonical name of the timezone.

1. Abbreviation Map

A space separated list of all the abbreviations ever used in this timezone.

2. Abbreviation Index

A tightly packed array of indices into the abbreviation map. These are all in base62, so it supports up to 62 abbreviations per timezone. All zones currently have fewer than 10 unique abbreviations, so representing each index with a single character should be fairly future proof.

3. Offset Map

A space separated list of all the offsets ever used in this timezone, in seconds, encoded in base62.

4. Offset Index

A tightly packed array of indices into the offset map. These are also in base62.
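
To make the base62 values concrete, here is a minimal decoding sketch. It assumes the digit alphabet 0-9, a-z, A-Z, which matches the sample data above ("7uw" decodes to 28800 seconds = 8:00, and "6ys" to 25200 seconds = 7:00):

var BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

function unpackBase62(str) {
    var sign = 1, value = 0, i;
    if (str.charAt(0) === '-') {
        sign = -1;
        str = str.slice(1);
    }
    for (i = 0; i < str.length; i++) {
        value = value * 62 + BASE62.indexOf(str.charAt(i));
    }
    return sign * value;
}

unpackBase62("7uw"); // 28800 seconds = 8:00 (the PST offset)
unpackBase62("6ys"); // 25200 seconds = 7:00 (the PDT offset)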

5. Timestamp Diff

This is where the timestamps are stored.

All these numbers represent the 'until' value of a change. The associated abbreviation and offset should only be used if the time being evaluated is before this value.

Because we are dealing with a sorted list of integers, we can just store the diff from the last integer rather than storing the full integer.

The first item in the array is a unix timestamp in seconds in base62. All items after the first item are numbers of seconds in base62 to be added to the previous value during unpacking.

  packed : -100  20  30  60
unpacked : -100 -80 -50  10
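
As a sketch, unpacking is a single running sum over the decoded values (reusing the unpackBase62 sketch above; note the packed/unpacked example here uses base10 numbers for readability, while the real data is base62):

function unpackUntils(section) {
    var parts = section.split(' '),
        untils = [],
        last = 0,
        i;
    for (i = 0; i < parts.length; i++) {
        last += unpackBase62(parts[i]);
        untils.push(last); // absolute unix timestamp in seconds
    }
    return untils;
}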

One of the nice things about using diffs instead of whole integers is that the diffs are often the same from one year to the next, making gzipping much more effective.

I experimented with using a map + index for these values like the offsets and abbreviations, but some of the timezones with many changes overflowed the 62-value index. Additionally, the gzipped size increased when switching to a map + index, even though the un-gzipped size decreased.

Speed

The original implementation was fairly slow. Looping through ZoneSets and Zones and RuleSets and Rules took considerable time. When memoization was added in #39, that time was significantly lessened. However, it came at the expense of memory leaks as noted in #79.

Because we no longer need to loop through ZoneSets, Zones, RuleSets and Rules, things are even faster. Here are the times for running all unit tests.

   original 63s
   memoized 18s
precomputed  7s

Memory

Because we are no longer memoizing results for each new moment, the memory used is constant per timezone. Each timezone will have 3 arrays of the same length (186 in the case of America/Los_Angeles, for example). One array will contain strings, and the other two will contain numbers. This will probably be even less memory than the original moment-timezone used.

Additionally, the data is only unpacked from the source string the first time a timezone is used. Even if all the zone strings are loaded into moment-timezone, only the ones that are actually used will be unpacked into their array values.
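
A minimal sketch of that lazy unpacking, with hypothetical names (unpackZone stands in for whatever parses the packed string):

var zones = {}; // zone name -> packed string, or unpacked zone object

function getZone(name) {
    if (typeof zones[name] === 'string') {
        // first use: replace the packed string with its unpacked form
        zones[name] = unpackZone(zones[name]);
    }
    return zones[name];
}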

Complexity

As you can see from the moment-timezone.js diff below, the library is now significantly less complex. Of the original 532 lines, 381 were deleted and 89 added, for a new total of 240 lines.

The old version took 80+ dev hours to get all the tests passing, as each little edge case would break another edge case. As a result, the current code is very fragile, and refactoring is unrealistic.

With the new version, the code path for finding the offset for a moment is just a handful of statements, which is much more accessible to people unfamiliar with the codebase.
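
A sketch of that lookup path, assuming hypothetical field names where untils holds the unpacked, sorted change timestamps and offsets is the parallel array of offsets (each entry applies to times before its until, per the format above):

function offsetAt(zone, timestamp) {
    var i;
    for (i = 0; i < zone.untils.length; i++) {
        if (timestamp < zone.untils[i]) {
            return zone.offsets[i];
        }
    }
    // past the last change; fall back to the final offset
    return zone.offsets[zone.offsets.length - 1];
}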

I think this should also make solving #27 much easier. I have taken a few stabs at solving #27, and each time hit a roadblock after a couple days of work.

TL;DR

Switching to the new data format makes the library faster, smaller, and simpler, and it no longer leaks memory.

@gjohnson

Simplify code ✔️
Don't leak memory ✔️

Sounds good to me :-)

@timrwood
Member Author

Some other benefits of this that I thought of afterwards.

Subsetting

Because the data format for a zone is self contained and well defined, writing code to get a subset of years (2014-2020 for example) is just a matter of deserializing, filtering out entries outside the range, and re-serializing.
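
A rough sketch of that subsetting flow, assuming hypothetical unpackZone/packZone helpers and parallel untils/offsets/abbrs arrays (boundary handling, like keeping the entry in force at the range start, is elided):

function subsetZone(packed, startYear, endYear) {
    var zone  = unpackZone(packed),
        start = Date.UTC(startYear, 0, 1) / 1000,
        end   = Date.UTC(endYear + 1, 0, 1) / 1000,
        out   = { name: zone.name, untils: [], offsets: [], abbrs: [] },
        i;
    for (i = 0; i < zone.untils.length; i++) {
        if (zone.untils[i] >= start && zone.untils[i] < end) {
            out.untils.push(zone.untils[i]);
            out.offsets.push(zone.offsets[i]);
            out.abbrs.push(zone.abbrs[i]);
        }
    }
    return packZone(out); // hypothetical re-serializer
}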

Other data sources

@mj1856 may be able to confirm this, but I believe Windows does not use the same data source as https://github.com/eggert/tz for its timezones. It is now possible to generate data sets from non-tzdb sources, as this is no longer a Rule/Zone data set, but a timestamp/offset/abbr data set.

@timrwood
Member Author

Ok, after looking at this again, I think we can make even more optimizations.

The new data format combines the index arrays for both offsets and abbreviations into one array. This comes with slight duplication in the abbreviation and offset maps, but it is not major (see the sketch after the table below).

#  Type                       Example
0  Name                       America/Los_Angeles
1  Abbreviation Map           PST PDT PWT PPT
2  Offset Map                 800 600 600 600
3  Offset/Abbreviation Index  01010230101010101010101010101011010101010101010101010
4  Timestamp Diff             -261q00 1nX00 11B00 1nX00 SgN00 8x100 iy00 5Wp00 1Vb00
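
A sketch of reading change i under the combined index, with hypothetical field names; because the two maps are kept at the same length (hence the slight duplication), a single index value addresses both:

function changeAt(zone, i) {
    var idx = zone.index[i]; // index already unpacked to a number
    return {
        abbr:   zone.abbrMap[idx],   // e.g. "PWT"
        offset: zone.offsetMap[idx], // the matching offset
        until:  zone.untils[i]
    };
}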

The filesize savings for this change are fairly substantial. 10% raw savings, 14% gzipped savings.

moment-timezone.json    raw 182574  gzip 36712 
moment-timezone-2.json  raw 220336  gzip 25077  # original pr
moment-timezone-2.json  raw 200837  gzip 21777  # commit 3ac60eb

Another change I made was to switch from base62 to base60. While this seems like it would actually increase the data size, there is a nice correlation between our data and the number 60: there are 60 seconds in a minute and 60 minutes in an hour, so every value that is a whole number of hours ends with 00. Because so many values now end in 0, gzip is able to compress the data much better than before.

This also has the benefit of making the offsets easier to read. Compare 7uw 6ys from the original pull request with 800 600 from the changes in 3ac60eb.
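
A worked comparison for 8 hours = 28800 seconds, assuming the base62 alphabet 0-9a-zA-Z and base60 as its first 60 characters:

// base62: 28800 = 7*62*62 + 30*62 + 32  -> "7uw"
// base60: 28800 = 8*60*60 +  0*60 +  0  -> "800"
// Whole hours always end in "00" in base60, which is what gzip exploits.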

This was referenced May 20, 2014:

Use zdump to generate data set as well as unit tests.
Move timezone specific unit tests to tests/zones.
Change timezone tests to use a common helper method.

timrwood added a commit that referenced this pull request May 23, 2014
timrwood merged commit aea1e39 into develop May 23, 2014
timrwood deleted the feature/memoize-timestamps branch May 23, 2014 15:35
timrwood mentioned this pull request May 27, 2014
@dickfickling

I think I might be doing something wrong, but it seems to me that the latest changes on the develop branch break DST support:

moment.tz("2012-01-01", "America/Los_Angeles").zoneAbbr()
-> 'PDT'

moment.tz("2012-06-01", "America/Los_Angeles").zoneAbbr()
-> 'PDT'

@timrwood
Member Author

@richardfickling, sorry, yes, the develop branch is going to be broken for a little while.

I'm working on a pretty substantial rewrite of the internals of moment-timezone. The only APIs that will stay the same are moment.tz(..., zoneName) and moment.fn.tz(zoneName).

You can follow along with the task list in #87, but if you need a working version, I'd suggest using version 0.0.6. https://github.com/moment/moment-timezone/tree/0.0.6

@dickfickling

No problem, that's why they call it the develop branch :) I just wanted to confirm that it wasn't something that was expected to be working.

@mattjohnsonpint
Contributor

Regarding the comment about Windows timezone sources - don't bother with that. Anything serious uses the IANA TZDB. Even on Windows.

I'm more concerned with avoiding use of the Date object - or at least sidestepping its buggy DST algorithm. In general, if moment-timezone is applied, then moment functions like zone and isDSTShifted should use the moment-timezone data. I think this is already happening? Or are more changes to moment required?

@mattjohnsonpint
Contributor

Also, should we really be using data straight from eggert/tz? Technically, that's the unofficial release. The official releases come from iana.org/time-zones. AFAIK, they're identical, but I'm not sure if it's a sound process.

If we are to continue to link to eggert/tz as a submodule, we should be sure to use the specific commits that are tagged as releases with the corresponding IANA version. The version number should be somewhere in the data, and there should also be a function to easily retrieve the version number.

@timrwood
Member Author

Ah, I was not aware of http://www.iana.org/time-zones, I'll make sure when I'm working on the data build process to use that as the source.

Everything should be working as expected when the Date object crosses DST in moment core. We use Date#setUTC* instead of Date#set*, and the UTC setters have no notion of DST. Tests for keeping the offset in sync while changing the time are here.
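
A small illustration of why the UTC setters sidestep DST (plain Date, no moment required):

var d = new Date(Date.UTC(2012, 2, 11, 9, 0, 0)); // 2012-03-11T09:00:00Z
d.setUTCHours(10);  // operates on the UTC fields, so local DST never applies
d.toISOString();    // "2012-03-11T10:00:00.000Z"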

For the data, I was planning on storing all previous releases in their packed and unpacked formats, and having a packed latest.json file as well. I'm not sure if it is completely necessary, but it is a good at-a-glance indicator of how up to date the data is.

/data
    /packed
        latest.json
        2014a.json
        2014b.json
    /unpacked
        2014a.json
        2014b.json
    /zdump
        2014a.json
        2014b.json

I'll make sure the data version number is included as well, probably by setting a property like moment.tz.dataVersion when loading in data.

@mattjohnsonpint
Contributor

For the DST issue, I just wrote this up separately as #88

@ichernev
Contributor

Storing DST occurrences directly seems like the right thing to do. I still don't quite understand how (and why) there is another (the original) format for storing timezone data. Another cool thing is that we can offer a slimmer version that only supports years after 1990 (for example) -- it won't have a bunch of the weird, irregular DST changes of the past.

About all the size/performance improvements -- I wouldn't go too crazy with the encoding, because that would complicate the build scripts. Using base60 is a bit extreme -- I agree it saves a few bytes, but where do you get that from? (I guess you did a custom implementation in JS.) We can keep it that way, just please don't make it crazier.

What about the extra data (zone aliases and metadata), which is not part of the new format? Isn't it still needed? Would it be present in the same format as now?

@timrwood
Member Author

The original format is based on the source format of the tzdb.

For the slimmer versions, we should have a couple of builds easily available for download on momentjs.com and in the min directory, similar to how we have moment-with-langs.js. Maybe something like moment-timezone-2000-2020.js. I was also planning to rebuild the data builder to support year ranges.

The compressed encoding was originally base62, which is a pretty common poor man's compression. I switched because the data lends itself very well to base60: hours have 60 minutes and minutes have 60 seconds. 3 hours is 180 minutes and 10800 seconds in base10, which become 30 and 300 in base60, versus 2U and 2Oc in base62.

The zone aliases got lost in the new build process as a side effect of using zic(8) and zdump(8) to construct the data, as they don't differentiate between links and actual zones. See Africa/Timbuktu in the old data and Africa/Timbuktu in the new data.

I think it would be a nice optimization in the build process to create aliases out of zones that have the same data. This is especially useful when we start subsetting data by year ranges and the rules that would have differentiated two zones no longer apply. See the madhouse that is Indiana before 1970.

The zone metadata contained latitude, longitude, and the rules used. Because we no longer use rules, those can be removed. The lat/long data is still useful, but I think it would be better suited to a /data/meta/latest.json file, as it was only added for the interactive map on the momentjs.com/timezone homepage.

@mattjohnsonpint
Contributor

Here's an idea - when you create the data builder to go along with the new format - how about a field to pick the earliest year you want in the data? We could then truncate accordingly. I'm sure many people only care about the date their app launches and later. Some might want a few years back for legacy data. Few probably need the entire history of the TZDB.
