Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use text-based columnar storage by default for big memory and processing savings #96

Closed
wants to merge 4 commits into from

Conversation

jeremyevans
Copy link
Contributor

This changes the default data for mime-types from a JSON
file to a set of text files, with one line per mime type, and one
file per mime type attribute. This only loads the content_type and
extensions for mime-types by default, reducing default memory usage
of mime-types from over 22MB to just over 2MB. Additionally, it
speeds up requiring from 0.38 seconds to about 0.10 seconds.

Even when loading all of the attributes for the mime types, the
memory usage of mime-types stays under 11MB, though load time
increases to 0.52 seconds. I think that is an acceptable trade off.

If the txt files are shipped instead of the json file, there is also
a disk space savings of 167KB (392KB instead of 560KB).

rake convert:yaml:txt has been added for converting the yaml files
to the necessary txt files. In order to get the conversion to work,
I had to make convert.rb turn on lazy loading for mime types.

This should be completely backwards compatible if a path is given
containing the types. The only current backwards compatibility
difference is for the "text/directory - DEPRECATED by RFC6350" mime
type, which looks to me like an invalid mime type anyway, and
probably should be removed.

This has been tested on ruby 1.9 - 2.2 and jruby 1.7

I've tested this with the mail gem tests and mail also passes all tests
when using this.

This should address #94 and #83 and possibly also #64 and #62.

…ing savings

This changes the default data for mime-types from a JSON
file to a set of text files, with one line per mime type, and one
file per mime type attribute.  This only loads the content_type and
extensions for mime-types by default, reducing default memory usage
of mime-types from over 22MB to just over 2MB.  Additionally, it
speeds up requiring from 0.38 seconds to about 0.10 seconds.

Even when loading all of the attributes for the mime types, the
memory usage of mime-types stays under 11MB, though load time
increases to 0.52 seconds.  I think that is an acceptable trade off.

If the txt files are shipped instead of the json file, there is also
a disk space savings of 167KB (392KB instead of 560KB).

rake convert:yaml:txt has been added for converting the yaml files
to the necessary txt files.  In order to get the conversion to work,
I had to make convert.rb turn on lazy loading for mime types.

This should be completely backwards compatible if a path is given
containing the types.  The only current backwards compatibility
difference is for the "text/directory - DEPRECATED by RFC6350" mime
type, which looks to me like an invalid mime type anyway, and
probably should be removed.
@halostatue
Copy link
Member

In light of #95, I want to get clearance from you that you will be fine if this is relicensed to be MIT-only when mime-types 3.x comes out.

@jeremyevans
Copy link
Contributor Author

Definitely. I license all of my own libraries MIT.

@halostatue
Copy link
Member

Thanks for this. I’m going to take a couple of days to review this, benchmark it, etc., and compare what you’ve made against my future plans for the library (as I now say in the README, master is essentially the 3.0 branch and I’m going to have a 2.99.0 that no longer imports the deprecated fields without removing the methods; this may become 2.99.0) and how I’m going to make this fit. I think I may still want to implement some of the other proposed enhancements (shared strings, for at least some of the strings like application, image, etc.) but I need to benchmark those out and see the next steps against the other open issues.

Use a single string for each encoding.  This decreases the number of
encoding strings stored by default from 1909 to 4, saving about 700KB
of memory by default on 64-bit MRI.

While here, reduce object allocations while loading data from the
mime type txt files.
@jeremyevans
Copy link
Contributor Author

Sounds good. I'm going to push another commit that adds pooling for encoding strings, since that seems like the lowest hanging fruit (only 4 possible strings). But I'll stop after that so you can review.

If possible and you agree this is backwards compatible, could you consider releasing 2.6 with just this change? This will significantly reduce memory consumption for most applications using mime-types directly or indirectly, and it would be great to get this out without waiting for 2.99, which I'm guessing won't be released for another 6 months.

@halostatue halostatue self-assigned this Apr 26, 2015
@halostatue halostatue added this to the 2.99 milestone Apr 26, 2015
@halostatue
Copy link
Member

If it’s backwards compatible, we can do that. The latest date that 2.99 and 3.0 will be released is November 2015; if they are ready before then, I will release them before then. I’m going to have to do a dependency search on mime-types again and probably suggest another “apocalypse” post to Peter Cooper because I want people who depend on the data that is being dropped to change their gem requirements to >= 2.0, < 2.99 instead of >= 2.0, < 3.0 or ~> 2.0.

(The truth is that the data isn’t being dropped; just some old representations of it.)

@halostatue
Copy link
Member

BTW, I do have to think about this one for a 2.x release, because I had not planned on turning on lazy loading by default until 3.0 and I have to consider whether this is going to be a breaking change. I’m comfortable with the file format change (it’s purely an internal implementation detail), but I would almost prefer having a different require that people can opt into (I would keep it for 3.0, but it would be the default). Say,

gem 'mime-types', '~> 2.6', require: 'mime/types/lazy'

In 3.0, I want to figure out how to have a configure block so that various pieces can be set in an initialization phase before loading the data.

@jeremyevans
Copy link
Contributor Author

Looks like there is a single assertion failing on JRuby. It doesn't always fail, though, it only fails sometimes. Of the Travis builds, it looks like it failed 3 times out of 4. I can get it to fail on my machine, but not as often.

This appears to be either a race condition or an order-dependent test. I'm going to try to use minitest-bisect to see if it is order-dependent test and can be fixed. If it is a race condition, I'm not sure why it would be in the code I added, as the mutex should catch any attempt to access the lazy loaded data before it is loaded. If the bisect doesn't turn up anything, I'll try to add more debugging to see what could be going wrong.

I'm OK with a separate require, or using a separate environment variable to enable it. As long as there is some way to enable it.

@halostatue
Copy link
Member

A couple more review comments:

  • I need this to conform to the PATH protocol that I created for 2.x and to probably restrict the search pattern for JSON loads to **/*.json (rather than ** as it is now). Without the PATH protocol, this will make it harder for system packagers to use newer versions (of course, Debian is still shipping 1.25 even on systems where they no longer offer Ruby 1.8, which is stupid because I built this protocol for them). If we use an environment variable or separate require for this, then we can still use PATH, but look for File.join(PATH, '**/*.txt') instead of File.join(PATH, '**/*.json').
  • It feels like this (columnar loading) is something we could do using the normal MIME::Type class rather than a subclass; even if we have it as a subclass for mime-types 2.6+, I would probably convert it to an integral part of the MIME::Type class for 3.x, so we would want to probably # :nodoc: the LazyType class. The normal MIME::Type class accepts a Hash as an initializer parameter, so we can easily initialize with container: container on load and have that be internally handled.
  • Even though these are text files, I almost want to treat them as if they were something different, at least signalling by extension that these should be considered untouchable. Maybe instead of txt, it’s .col or .column or something else? It’s a small thing, but I’ve had people editing the JSON directly in the past (despite the Contributing document saying not to do that), so anything that discourages random modifications to these files is a net benefit, IMO. This isn’t a deal breaker, though.

@jeremyevans
Copy link
Contributor Author

OK, the JRuby failure definitely appears to be an order dependent bug. SEED=5339 jruby -S rake always passes for me, SEED=35357 jruby -S rake always fails. Unfortunately, I haven't been able to get minitest_bisect to work on JRuby yet.

:nodoc for LazyType and LazyTypes seems like a good idea to me. Do you want me to push changes for fixes like that?

As for the PATH protocol, this pull request is backwards compatible in that if PATH is set by a packager, the old behavior is used. So people that set PATH won't get the reduced memory use, but everything will still work. Because PATH doesn't appear to deal with interdependent file sets, I'm not sure it can be used the same way for the txt file as it can for the YAML and JSON fails. There's probably a way to get it to work, but I'm guessing it will require additional changes.

This was causing the test-order dependent bug on JRuby.
The clear_cache_file code was taken from the cache test.
@jeremyevans
Copy link
Contributor Author

I never did get minitest_bisect to work on JRuby, but I was able to figure out the underlying test order bug and fix it. :)

@halostatue
Copy link
Member

Yeah. Parts of the PATH protocol are broken, but those are bugs. It also turns out that, at least for the main Linux distributions, no one is using it (everyone is still shipping mime-types 1.x as packages even if they do not provide ruby 1.8). I’m still going to maintain it—it’s what gems that provide data and code should do—but I’m calling that out for me to look at as much as anything.

You’ve done quite a bit of amazing work here with this and you don’t have to solve all the problems.

def _each_file_line(type, lookup=true)
LOAD_MUTEX.synchronize do
next if @loaded_attributes.include?(type)
File.open(File.join(ROOT, "mime-types-#{type}.txt"), 'r:UTF-8') do |file|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do this a little cleaner with a Pathname and/or with each_with_index

type = "friendly"
ROOT = Pathname.new(File.expand_path('../../../data', __FILE__).freeze)
ROOT.join("mime-types-#{type}.txt").each_line($/, mode: "r:UTF-8").each_with_index do |line, index| 
  puts [line, index].inspect
end

It's fine as is, I just really like working with pathnames 😸

@schneems
Copy link
Contributor

I ran my app with this PR and it looks great! There's a 39% decrease in memory useage (from 52.5 to 31.8). The mail gem goes from 19 mb to about 3 mb.

The bigger memory delta is likely due to a high allocation factor, we could probably tune down by decreasing RUBY_GC_HEAP_GROWTH_FACTOR, but we would still expect to see a 10-16mb decrease.

Everything looks good. My one suggestion might be the addition of a way for gems to explicitly declare and load the data columns they need at require time, that way we take advantage of CoW when using a forking webserver. Maybe something like MIME::Type.needs(:friendly, :"content-type"). That should probably go in on a different discussion.

Here's my data.

Before

application: 52.5313 mb
  maildown: 20.6484 mb
    maildown/ext/action_mailer: 20.5625 mb
      mail: 19.3789 mb
        mime/types: 14.4727 mb
        mail/field: 1.9219 mb
        mail/message: 0.9023 mb
      action_view/view_paths: 0.4219 mb
        action_view/base: 0.3828 mb
  rails/all: 17.7422 mb
    active_record/railtie: 8.6289 mb
      active_record: 7.1172 mb
        arel: 3.0586 mb
          arel/nodes: 1.2617 mb
          arel/visitors: 1.0664 mb

After

application: 31.832 mb
  rails/all: 12.6211 mb
    rails: 5.3555 mb
      active_support: 2.5 mb
        active_support/dependencies/autoload: 1.6719 mb
          active_support/inflector/methods: 1.6133 mb
            active_support/inflections: 1.4102 mb
              active_support/inflector/inflections: 1.2188 mb
                active_support/i18n: 0.6172 mb
                  i18n: 0.3594 mb
        active_support/logger: 0.5859 mb
          logger: 0.3203 mb
      rails/application: 1.6289 mb
        rails/engine: 1.4844 mb
          rails/railtie: 1.3906 mb
            rails/configuration: 1.3203 mb
              active_support/core_ext/object: 1.2227 mb
                active_support/core_ext/object/conversions: 0.9141 mb
                  active_support/core_ext/hash/conversions: 0.7617 mb
                    active_support/time: 0.7539 mb
                      active_support/core_ext/time: 0.6875 mb
                        active_support/core_ext/time/calculations: 0.6523 mb
                          active_support/core_ext/time/conversions: 0.4531 mb
                            active_support/values/time_zone: 0.4453 mb
                              tzinfo: 0.3242 mb

@halostatue
Copy link
Member

I’m almost ready with this. Another couple of days and I’ll have whipped it into shape with the restrictions that I had wanted.

The interface will (for 2.6) default to not using the columnar store (so the gem will continue to include both formats). To get the columnar store by default, require 'mime/types/columnar'.

@jeremyevans
Copy link
Contributor Author

@halostatue That seems like a fine compromise to me. After 2.6 is released, I'll submit a pull request to mail to require mime-types 2.6 and use mime/types/columnar.

I assume that in future versions, if columnar is made the default, mime/types/columnar will still work as an alias to mime/types, correct?

@halostatue
Copy link
Member

That is correct. The JSON capabilities will probably stick around until mime-types 4, although deprecated and replaced with a default columnar store. The only thing that may change is if we find a more efficient pure-Ruby columnar store rather than text. I’ll commit now that, until at least mime-types 5, requiring mime/types/columnar will work. If I decide that it needs to go away for mime-types 5, there will be warnings starting with mime-types 4.

Given that I see at least a couple of years growth once I make mime-types 3, we’re probably talking around 2020 or so.

@halostatue
Copy link
Member

@jeremyevans, @schneems: I’m still a couple of days away from finalizing the release—I take my documentation fairly seriously, and there are no tests regarding columnar support yet, but I wanted to see if you guys had time to look over what I’ve got on my working branch prior to rebase/squash and merge.

https://github.com/mime-types/ruby-mime-types/tree/2.6-work

One thing that I’ve done is added a Columnar report to rake benchmark:load, and even though the memory is substantially lower, the load time is about 2x slower (the default is for 50 loads in the benchmark).

                        user     system      total        real
Normal:             5.510000   0.100000   5.610000 (  5.701441)
Columnar:          10.010000   0.100000  10.110000 ( 10.358063)
Lazy:               0.590000   0.020000   0.610000 (  0.620081)
Lazy+Load:          5.410000   0.090000   5.500000 (  5.570651)
Cached:             3.560000   0.060000   3.620000 (  3.667201)
Lazy Cached:        0.690000   0.030000   0.720000 (  0.729733)
Lazy Cached Load:   3.470000   0.070000   3.540000 (  3.610310)

I have not yet timed a “full” load with the columnar format. I haven't yet profiled to see where the slowdown might be, but that’ll be something that definitely gets tackled for the 3.0 release and part of why I am happy to make this an opt-in release. I strongly suspect that there are performance boosts to be found with minimal cost in terms of memory. But that’s for a future time after 2.6.

@jeremyevans
Copy link
Contributor Author

I looked over the 2.6-work tree and it looks good to me.

I'm not sure why the benchmark is showing columnar being 2x slower than JSON. I know when I benchmarked it, it was substantially faster, and even if you were loading all attributes from the columnar store, it was only about 30% slower. But in my benchmarking, I was only loading the types once, not multiple times. I was basically doing:

time ruby -r mime/types -e ""

and comparing the times before and after the change.

@halostatue
Copy link
Member

It’s possible I regressed something with the PATH support changes, but I don’t think so. I know that the benchmark is slightly unrealistic.

@schneems
Copy link
Contributor

We probably want to reach out to the top users of mime/types to get them to test and switch over to the new store once this is released. I whipped up a script that gets reverse dependencies from rubygems and sorts them by download number. Here's the script and the top 50 gems: https://gist.github.com/schneems/e09c95da37a8c50047a8

No surprise, mail is number 1.

@schneems
Copy link
Contributor

I did a run through of all of the most recent commits compared to master. On the surface everything looks good. Let me know if you want some help running any specific benchmarks.

@halostatue
Copy link
Member

I was going to reach out to Peter Cooper, as well.

@halostatue halostatue mentioned this pull request May 25, 2015
halostatue added a commit that referenced this pull request May 25, 2015
Release 2.6

Fixes #83 (Decrease Memory Usage).
Fixes #84 (Requiring mime/types accounts for 2% of all application RAM).
Based on and Fixes #96.
tute pushed a commit to thoughtbot/paperclip that referenced this pull request May 27, 2015
Mime-types 2.6.1+ has a columnar store that only loads the data that is
needed for a drastically reduced memory footprint.
cc/ @jeremyevans @halostatue

See: mime-types/ruby-mime-types#96.

[fixes #1873]
jsonn pushed a commit to jsonn/pkgsrc that referenced this pull request Jun 8, 2015
== 2.6.1 / 2015-05-25

* Bugs:
  * Make columnar store handle all supported extensions, not just the first.
  * Avoid circular require when using the columnar store.

== 2.6 / 2015-05-25

* New Feature:
  * Columnar data storage for the MIME::Types registry, contributed by Jeremy
    Evans (@jeremyevans). Reduces default memory use substantially (the mail
    gem drops from 19 Mib to about 3 Mib). Resolves
    {#96}[mime-types/ruby-mime-types#96],
    {#94}[mime-types/ruby-mime-types#94],
    {#83}[mime-types/ruby-mime-types#83]. Partially
    addresses {#64}[mime-types/ruby-mime-types#64]
    and {#62}[mime-types/ruby-mime-types#62].
* Development:
  * Removed caching of deprecation messages in preparation for mime-types 3.0.
    Now, deprecated methods will always warn their deprecation instead of only
    warning once.
  * Added a logger for deprecation messages.
  * Renamed <tt>lib/mime.rb</tt> to <tt>lib/mime/deprecations.rb</tt> to not
    conflict with the {mime}[https://rubygems.org/gems/mime] gem on behalf of
    the maintainers of the {Praxis Framework}[http://praxis-framework.io/].
    Provided by Josep M. Blanquer (@blanquer),
    {#100}[mime-types/ruby-mime-types#100].
  * Added the columnar data conversion tool, also provided by Jeremy Evans.
* Documentation:
  * Improved documentation and ensured that all deprecated methods are marked
    as such in the documentation.
* Development:
  * Added more Ruby variants to Travis CI.
  * Silenced deprecation messages for internal tools. Noisy deprecations are
    noisy, but that's the point.

== 2.5 / 2015-04-25

* Bugs:
  * David Genord (@albus522) fixed a bug in loading MIME::types cache where a
    container loaded from cache did not have the expected +default_proc+,
    {#86}[mime-types/ruby-mime-types#86].
  * Richard Schneeman (@schneems) provided a patch that substantially reduces
    unnecessary allocations.
* Documentation:
  * Tibor Szolár (@flexik) fixed a typo in the README,
    {#82}[mime-types/ruby-mime-types#82]
  * Fixed {#80}[mime-types/ruby-mime-types#80],
    clarifying the relationship of MIME::Type#content_type and
    MIME::Type#simplified, with Ken Ip (@kenips).
* Development:
  * Juanito Fatas (@JuanitoFatas) enabled container mode on Travis CI,
    {#87}[mime-types/ruby-mime-types#87].
* Moved development to a mime-types organization under
  {mime-types/ruby-mime-types}[https://github.com/mime-types/ruby-mime-types].
Bestra added a commit to Aperta-project/Aperta that referenced this pull request Jan 25, 2017
This is based on mime-types/ruby-mime-types#96,
and some kind words from Alex Burkhart.  By using the columnar store
in the mime types gem we should be able to save 10-20MB of memory per
ruby process.
jthurn24 added a commit to jthurn24/ppclip-RoR that referenced this pull request Aug 26, 2021
Mime-types 2.6.1+ has a columnar store that only loads the data that is
needed for a drastically reduced memory footprint.
cc/ @jeremyevans @halostatue

See: mime-types/ruby-mime-types#96.

[fixes #1873]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants