-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache + URL resolver doesn't work as expected #204
Comments
…n. [ci skip] Thanks @delafer for bringing this to my attention.
It's embarrassingly simple really, the cache appears to be completely broken! The first exception is because the CSS resource reader is closed after being read the first time. The second exception is because the bytes of an ImageResource are cleared after being converted to a PDF image, making it invalid to use a second time. We can make this the cache system replacement thread. As outlined in #170 I think we need four caches. The first should be an internal per-page cache, with resources tied to a particular page. The second should be an internal per-document cache. We already have caching at these levels but it is ad-hoc throughout the code. The next cache should be a per-thread cache which will enable us to cache thread-unsafe objects. It will be external of the project but cache opaque objects. Finally we need an external thread-safe cache of last-resort for strings and byte arrays. I think we start with the last one. Something like: public class DefaultMultiThreadCache {
protected final Map<String, String> _textCache = new ConcurrentHashMap<String, String>();
protected final Map<String, byte[]> _byteCache = new ConcurrentHashMap<String, byte[]>();
// Question: How best to synchronize this method?
// Does it even need to be synchronized?
public String getText(String uri) {
String text = _textCache.get(uri);
if (text != null) {
return text;
} else {
byte[] bytes = getBytes(uri);
if (bytes != null) {
return new String(bytes, "UTF-8");
} else {
return null;
}
}
}
public byte[] getBytes(String uri) {
return _byteCache.get(uri);
}
public void put(String uri, String text) {
_textCache.put(uri, text);
}
public void put(String uri, byte[] bytes) {
_byteCache.put(uri, text);
}
} |
Have you considered using the guava caches to implement the caches? It would of course mean to have guava as dependency on this project, but they have done caches right, and you can configure the cache however you want (i.e. with soft/weak refs, timeouts, max item count, ...). See the details at https://github.com/google/guava/wiki/CachesExplained Using guava caches would mean that the cache just works, as they did extensive tuning and fixing. |
no, in this case it's not necessary at all, but we can't use ConcurrentHashMap "as-is", we should delete old or rarely used cached entries or limit maximum amount of entries, and if we will implement a more complex logic suitable for cache functionality , I mean some real cache strategy in this case most likely synchronization will be necessary. I should look the new code and I can help to make it thread-safe. |
Yes, we can use SoftReference's to cache elements. As a default implementation a simple build in cache provider could be used, without third party deps, but if it is not enough, you can provide the opportunity to use more powerful cache solutions. So for example it's done in a hibernate with a connection pool, by default a simple built-in implementation is used, but there is a possibility to use (configure) the third party Implementations for production environments. |
Using as default implementation commons third party libraries has also lead to problems e.G. if project already uses Guava cache for own needs but a different version with incompatible API (deprecated) methods and we will have dependencies resolution conflicts e.G. in maven with different versions , in most cases it's not critical, but it's annoying. It doesn't mean , that its a bad idea, but.. |
I think @delafer is right here: We should only provide a very simple basic cache implementation, maybe even not multithreaded and not shared between runs and allow to provide a custom cache implementation with the Builder. If we have a nice interface it should be trivial to e.g. use the Guava cache to implement a real multithread safe cache, and also the user can configure the cache however he wants. Some applications may have loads of memory and may only need soft refs (if at all), other applications might be tight on memory and then use a size based evict strategy (e.g. in Guava using maximumWeight() and weighter()). Or implement a disk based cache etc. Guava is very useful, but it's true that we could cause a bit of a dependency hell for the users of OpenHTMLToPdf. We could of course create a new openhtmltopdf-cache artefact, which would only contain some Guava based default implementations. But I am not sure if this is worth the effort. |
I think @rototor has right. Cache could speed-up generation of pdf documents, but openhtmltopdf is not a database or JPA layer or big data implementation or SOLR/Lucene search&index frameworks, I mean it's not a library which intensively works with a large amounts of input/output data, with a huge data flow, in other words: in most cases cache does not play a major role here and for most cases a simple solution is enough. it is not worth it |
OK, I've started the new implementation in commit 1fa6701 Tell me what you think. |
@danfickle Your cache interface features a get() and a put() operation. Thats fine per se, but does not allow to benefit from a unique property of the Guava caches: They guarantee that a value for a given key is not calculated twice by two threads at the same time. When fetching network resources this means that the cache can guarantee that you don't fetch the same resource twice and waste bandwidth. If two threads want to get the same key at the same time, one thread will wait for the result of the other thread. To allow this, the FSMultiThreadCache interface must look different:
I.e. the cache decides when to fetch the value. To avoid creating additional objects per cache call the lazyValueFetcher should be an interface like this:
This can be implemented once per UserAgent. It would then just delegate this to openReader()/openStream() and read the value (as it is now already). A trivial default cache implementation would be:
P.S.: Prefixing a template/generic parameter with T is just an old C++ habit ... :) |
little joke: " American astronaut Scott Kelly condemned US President Donald Trump for threatening to start World War III. And what could he do, being held by the KGB agents, standing in front of the airlock? " @danfickle : about last post of @rototor . He is damn right. Your simple map-like API doesn't help avoid loading some resources twice or even more times. Timeline: 100 ms 200 ms 250 ms 300 ms summing up: "sample.jpeg" will be loaded 3 times remotely and 1 time fetched from the cache. This implementation is still thread-safe, but it's not effective enough for caching purposes. Thats why a special API like in guava should be used avoid redundant (remote) requests to save time and to save bandwidth usage / traffic. This API forces other threads to wait until the necessary resource will be loaded. So, second and third, etc. threads should wait for the first thread until first thread finishes loading "sample.jpg" In the next post I will describe how it works without third party libriries like guava. I will show a simple thread safe example using standard java thread locks and notifiers, instead of modern concurrency clases / patterns, which are more difficult to understand. |
I know it looks just like old good c++ or object pascal / delphi code. so I have immediately thought that you are an old-style programmer like me with an experience with c/c++ or with object pascal & delphi RAD or with all at once. |
So, I have written now some simple cache implementation. It should be thread-safe and efficient enought, import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;
public class SimpleCacheImpl<K,V> {
private Map<K,CacheEntry<V>> cacheMRU;
private Map<K,CacheEntry<V>> cacheLRU;
private ReentrantReadWriteLock lock;
public SimpleCacheImpl(final int capacity)
{
init(capacity);
}
private void init(int capacity) {
lock = new ReentrantReadWriteLock();
cacheLRU = new WeakHashMap<>();
cacheMRU = new LinkedHashMap<K,CacheEntry<V>>(capacity*4/3, 0.75f, true) {
protected boolean removeEldestEntry(Map.Entry<K,CacheEntry<V>> entry)
{
if (this.size() > capacity) {
cacheLRU.put(entry.getKey(), entry.getValue());
return true;
}
return false;
};
};
}
public V get(K key)
{
try {
lock.readLock().lock();
CacheEntry<V> value = getCacheEntry(key);
return value != null ? value.get() : null;
} finally {
lock.readLock().unlock();
}
}
public CacheEntry<V> set(K key)
{
try {
lock.writeLock().lock();
CacheEntry<V> entry = getCacheEntry(key);
if (entry == null) {
entry = new CacheEntry<>();
cacheMRU.put(key, entry);
}
return entry;
} finally {
lock.writeLock().unlock();
}
}
private CacheEntry<V> getCacheEntry(K key) {
CacheEntry<V> value = cacheMRU.get(key);
if (value == null) {
value = cacheLRU.get(key);
if (value!=null) {
moveToMRU(key, value);
}
}
return value;
}
private void moveToMRU(K key, CacheEntry<V> value) {
cacheLRU.remove(key);
cacheMRU.put(key, value);
}
/**
* The Class CacheEntry.
*
* @param <V> the value type
*/
static class CacheEntry<V> {
/** The obj. */
private V obj;
private boolean set;
private final static long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(5);
/** The lock obj. */
private final Object lockObj = new Object();
/**
* Instantiates a new cache entry.
*/
public CacheEntry() {
}
/**
* Returns value containing object.
*
* @return the cached value
*/
public V get() {
try {
synchronized (lockObj) {
long init = System.currentTimeMillis();
long current = init;
while (!set && ((current-init)<TIMEOUT_MS)) {
lockObj.wait(TIMEOUT_MS);
current = System.currentTimeMillis();
}
}
} catch (InterruptedException ignore) {}
return obj;
}
/**
* free locked entry
*/
public void release() {
synchronized (lockObj) {
this.obj = null;
this.set = true;
lockObj.notifyAll();
}
}
public void set(V obj) {
synchronized (lockObj) {
this.obj = obj;
this.set = true;
lockObj.notifyAll();
}
}
@Override
public String toString() {
return obj != null ? obj.toString() : "empty";
}
}
} |
Only currently implemented for PDF renderer with CSS resources. Unstable, for discussion.
In regards to the Actually, I'm rethinking whether it is worth having a cache at this level at all. Users are likely to get their files from four main sources:
I think it might be a better use of time to concentrate on a per-thread cache which could cache "compiled" objects such as the in-memory structure of a parsed CSS sheet, or a decompressed PNG. This might give worthwhile performance benefits when processing many documents. What do you think? |
For sure, it's a correct approach (like it's done for e.G. in Chrome, etc), but it is more difficult to implement.
Yes, but if we plug cache here, a non working cache / buggy cache implementation in openhtmltopdf library should be cleaned / removed from code (as a deprecated non-working impl.) or fixed. Why do we need a cache impl. in openhtmltopdf code, which doesn't work rightly ? You are project owner / maintainer, so you should decide. |
IMHO: it's more convenient way doing such things: public Map<String, CacheEntry> mixedCachedEntries = new ConcurrentHashMap<>();
interface CacheEntry {
}
interface ByteArrayCacheEntry extends CacheEntry {
byte[] getData();
}
interface TextCacheEntry extends CacheEntry {
String getData();
} |
@danfickle Have you worked before with jmc.exe (Java Mission Control :: FlightRecorder) profiler tool ? should be used together with oracle JVM and VM options:
It's better as jvisualvm and rather convenient, to find memory leaks and places in the code to optimize (where for example you need a cache or other optimizations steps) |
@danfickle overview-of-memory-saving-techniques-java if briefly: working only with byte[] arrays in a right way could help us to save memory. You could use also off-heap memory areas. Strings in java/jvm (jdk / jvm < 8.xx) always stored internally in unicode format. Latest versions >= 8 using some techniques for strings - to save / spare memory allocation. |
All caches except the font metrics cache are now removed. Future caches will only be introduced if a compelling performance case can be made. Thanks everyone. |
Cache + URL resolver doesn't work as expected
In log you can see next messages:
com.openhtmltopdf.css-parse WARNING:: Couldn't parse stylesheet at URI file:/G:/Programme/tomcat/tomcat8.lst/webapps/ROOT/WEB-INF/classes/styles/resources/standard.css:
The text was updated successfully, but these errors were encountered: