Oct 24, 2012

Caching Data in Java and the LRU Strategy


Q. What is the purpose of using a cache?
A. A cache is an area of local memory that holds a copy of frequently accessed data that is otherwise expensive to get or compute. Examples of such data include the result of a database query, the contents of a disk file, or a generated report. Caching is used primarily for performance and scalability: subsequent requests for the data are served from the cache and hence faster.

Avoid the anti-pattern or pitfall of "Cache Them All", and cache only the data that is hard to get.

Q. What are some of the requirements you need to consider when writing your own cache or using a caching framework like EHCache, OSCache, etc.?
A.
  1. The cache needs to have some boundary conditions defined to limit its memory usage. Since the items in a cache can collectively consume an unbounded amount of memory, the cache should hold its values via SoftReferences, which makes it possible for entries to be reclaimed when memory runs short, even when the cache is not full or the boundary conditions have not been met.
  2. You need a replacement algorithm to purge entries from the cache when the boundary conditions are reached, for example, when the maximum number of entries allowed is exceeded. One such algorithm is LRU (Least Recently Used), in which the cache entries that have not been accessed recently are replaced. If you are writing your own cache, one approach is to record a timestamp for each entry when it is last accessed and evict the entry with the oldest timestamp. But that search is linear, taking O(N) time. A more efficient approach is to maintain the entries in a collection sorted by the order in which they were accessed. Alternatively, you can use a doubly linked list where items accessed via the cache are moved to the end of the list, and entries are purged from the front of the list.
  3. Some caches need to be regularly refreshed to prevent them from becoming stale. This can be accomplished by adding an expiry timestamp. 
  4. If you are writing your own caching mechanism, then the cache insertion and lookup operations need to be fast, preferably in O(1) time. This makes a HashMap a potential candidate. The cached data can be accessed concurrently, hence a ConcurrentHashMap or a synchronized Map would be a good candidate. The whole cache should not be locked while an entry is searched; only the entry being searched should be locked. And in the scenario where multiple threads look up the same key, the value should be computed only once (see the sketch after this list).
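
For the last point, on Java 8+ the compute-once-per-key requirement can be met with ConcurrentHashMap.computeIfAbsent(), which locks only the part of the map that belongs to the key being computed. Below is a minimal sketch assuming that approach; the class and method names are illustrative, and eviction is omitted for brevity.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class ComputeOnceCache<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();

    // If several threads request the same absent key concurrently, the loader
    // function runs only once; the other threads block briefly on that key
    // (and only that key) and then all see the same computed value.
    public V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, loader);
    }
}

For example, cache.get(reportId, id -> runExpensiveReport(id)) would run the expensive report at most once per id, no matter how many threads ask for it at the same time.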


So, this is why you need to apply the "don't reinvent the wheel" principle and favor a proven and tested caching framework like EHCache, which takes care of better memory management with LRU and FIFO eviction strategies, disk overflow, data expiration and many other optional advanced features, as opposed to writing your own. Having said that, there are cases where you might want to write your own simple caching solution, and you might be asked in job interviews a question like: How would you implement an LRU cache in Java?
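
As a taste of what such a framework gives you out of the box, here is a minimal sketch assuming the EHCache 2.x API (net.sf.ehcache); the constructor arguments shown are name, maximum in-memory entries, overflow-to-disk, eternal, time-to-live and time-to-idle in seconds, but check the documentation of your version for the exact signatures.

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhCacheSketch {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, max in-memory entries, overflowToDisk, eternal, TTL secs, TTI secs
        Cache cache = new Cache("reports", 100, false, false, 600, 300);
        manager.addCache(cache);

        cache.put(new Element("monthly", "expensive report result"));
        Element hit = cache.get("monthly");
        System.out.println(hit != null ? hit.getObjectValue() : "miss");

        manager.shutdown();
    }
}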

This is very similar to the question: if Java did not have a HashMap class, how would you go about implementing your own? That question is discussed, with the answer and key considerations, in the book "Core Java Career Essentials".


Q. What are the differences between HashMap, TreeMap (i.e. SortedMap), and a LinkedHashMap?
A. All three classes implement the Map interface and offer mostly the same functionality. The differences lie in how the entries are iterated:

  • HashMap does not offer any guarantees about the iteration order. Its iteration order can change completely when new elements are added.
  • TreeMap will iterate according to the "natural ordering" of the keys, as defined by their compareTo() method, or according to an externally supplied Comparator. Additionally, it implements the SortedMap interface, which contains methods that depend on this sort order.
  • LinkedHashMap will iterate in the order in which the entries were put into the map. It also provides a great starting point for creating an LRU cache by overriding the removeEldestEntry() method. This lets you create a cache that can expire data using criteria that you define.
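
The difference in iteration order is easy to demonstrate; the following small demo prints the key sets of all three maps after identical insertions.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<>();
        Map<String, Integer> tree = new TreeMap<>();
        Map<String, Integer> linked = new LinkedHashMap<>();
        for (String key : new String[] {"banana", "apple", "cherry"}) {
            hash.put(key, key.length());
            tree.put(key, key.length());
            linked.put(key, key.length());
        }
        System.out.println("HashMap       : " + hash.keySet());   // no guaranteed order
        System.out.println("TreeMap       : " + tree.keySet());   // [apple, banana, cherry] -- natural order
        System.out.println("LinkedHashMap : " + linked.keySet()); // [banana, apple, cherry] -- insertion order
    }
}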

Q. What are the different types of algorithms that can be used to remove objects from the cache?
A. A variety of cache expiration mechanisms can remove objects from a cache. These algorithms are based on criteria such as least frequently used (LFU), least recently used (LRU), most recently used (MRU), first in first out (FIFO), last access time, and object size. Each algorithm has advantages and disadvantages. So, analyse your requirements and choose the most appropriate one.
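
As an aside, LinkedHashMap lets you switch between two of these policies with a single constructor flag: with accessOrder set to false, removeEldestEntry() evicts in insertion order (FIFO); with it set to true, eviction follows access order (LRU). A small sketch with illustrative names:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class EvictionOrderDemo {

    // Builds a map bounded to maxEntries; the accessOrder flag chooses
    // between FIFO (false) and LRU (true) eviction.
    static <K, V> Map<K, V> boundedMap(final int maxEntries, boolean accessOrder) {
        return new LinkedHashMap<K, V>(16, 0.75f, accessOrder) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Integer> fifo = boundedMap(2, false);
        Map<String, Integer> lru = boundedMap(2, true);
        for (Map<String, Integer> m : Arrays.asList(fifo, lru)) {
            m.put("a", 1);
            m.put("b", 2);
            m.get("a");    // touching "a" only matters in access order
            m.put("c", 3); // exceeds the bound and evicts the "eldest" entry
        }
        System.out.println("FIFO keeps: " + fifo.keySet()); // [b, c] -- "a" was inserted first
        System.out.println("LRU keeps : " + lru.keySet());  // [a, c] -- "b" was least recently used
    }
}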

In enterprise applications, object caching in a cluster is important because multiple JVMs run in a cluster, and it is crucial to keep all the cluster members' cached data in sync.


Q. How will you implement an LRU cache in Java?
A. Here is some sample code in Java.

First, define an interface so that the implementation can be changed if, for example, a ConcurrentLinkedHashMap class becomes available in the future. A typical interface for a Java cache looks like this:

public interface LRUCache<K, V> {
    V put(K key, V item);
    V get(K key);
    V atomicGetAndSet(K key, V item);
}


Now, the implementation class.

import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

public class LRUCacheImpl<K, V> implements LRUCache<K, V> {

    // SoftReference makes the cache memory friendly: under memory pressure the
    // garbage collector may clear the referenced values, in which case get()
    // returns null for those keys even though the map entries still exist.
    private final Map<K, SoftReference<V>> cache;

    public LRUCacheImpl(final int cacheSize) {

        // 'true' uses the access order instead of the insertion order.
        this.cache = new LinkedHashMap<K, SoftReference<V>>(cacheSize, 0.75f, true) {

            private static final long serialVersionUID = 1L;

            @Override
            protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                // Remove the eldest, i.e. Least Recently Used (LRU), entry
                // once the maximum allowed size is exceeded.
                return size() > cacheSize;
            }
        };
    }

    // The methods are synchronized because an access-ordered LinkedHashMap
    // mutates its internal linked list even on get(), so unsynchronized
    // concurrent access is unsafe.
    @Override
    public synchronized V put(K key, V value) {
        SoftReference<V> previousValueReference = cache.put(key, new SoftReference<V>(value));
        return previousValueReference != null ? previousValueReference.get() : null;
    }

    @Override
    public synchronized V get(K key) {
        SoftReference<V> valueReference = cache.get(key);
        return valueReference != null ? valueReference.get() : null;
    }

    @Override
    public synchronized V atomicGetAndSet(K key, V value) {
        V result = get(key);
        put(key, value);
        return result;
    }
}


Here are some key points on the above code worth mentioning, not only in job interviews to stand out from your competition, but also to write quality code and get a thumbs up in code reviews.

  1. The above code could have been written as LRUCacheImpl<K, V> extends LinkedHashMap<K, V>, but the GoF design principles favor composition over inheritance, as inheritance is more fragile in the face of change.
  2. SoftReference is used as opposed to a hard (i.e. strong) reference, to allow the cached values to be garbage collected when the Java heap runs low on memory.
  3. The methods are synchronized to make the cache thread-safe, since an access-ordered LinkedHashMap restructures itself even on get(). If a ConcurrentLinkedHashMap becomes available in the future, it can be used instead of the LinkedHashMap.
  4. If performance is of utmost importance, then the public V get(K key) method, which will be executed most frequently, can be executed asynchronously via a thread pool.
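
To round things off, here is a quick usage sketch of the interface and implementation above; the expected outputs assume that no soft references were cleared in between.

public class LRUCacheDemo {
    public static void main(String[] args) {
        LRUCache<String, String> cache = new LRUCacheImpl<>(2);
        cache.put("a", "apple");
        cache.put("b", "banana");
        cache.get("a");            // touch "a" so it becomes the most recently used
        cache.put("c", "cherry");  // evicts "b", the least recently used entry
        System.out.println(cache.get("a")); // apple
        System.out.println(cache.get("b")); // null -- evicted
        System.out.println(cache.get("c")); // cherry
    }
}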

Q. What are the different types of caches?
A. There are different types of caches:
  • An application cache is a cache that an application accesses directly. An application benefits from a cache by keeping its most frequently accessed data in memory. This is also known as a first-level cache; in Hibernate, for example, the Session itself acts as the first-level cache.
  • A level 2 (aka L2) cache provides caching services to an object-relational mapping (ORM) framework or a data mapping (DM) framework such as Hibernate or iBatis, respectively, by reducing the number of trips to the database. An L2 cache hides the complexity of the caching logic from the application. Hibernate, for example, can use EHCache, OSCache, SwarmCache, or JBoss TreeCache as its second-level cache.
  • In-memory data grids/clouds provide distributed caching with powerful analysis and management tools, giving you a complete solution for managing fast-changing data across multiple servers, a compute grid, or the cloud. Examples include Apache Hadoop, GridGain, Oracle Coherence, and Hazelcast. A data grid is a reliable distributed cache that uses an external data source to retrieve data not present in the cache, and an external data store to write updates back. In simple terms, you could build this yourself by setting up HTTP servers to farm out requests RESTfully and accept responses the same way, or by using messaging with JMS, RMI or even old-school CORBA. But why reinvent the wheel when there are open-source frameworks like Hadoop, GridGain, Hazelcast, etc.?

Q. Will a WeakHashMap make a good cache? Explain your answer.
A. No. There are two reasons for this.

  1. It uses weak references as the underlying memory management mechanism. If the garbage collector determines that an object is weakly reachable, it atomically clears all weak references to that object. By contrast, if the garbage collector determines that an object is softly reachable (i.e. reachable via a soft reference rather than a weak reference), it may atomically clear all soft references to the object, but only when it finds that memory is running low, or otherwise at its own discretion. This makes soft references far better suited to caching.
  2. In a WeakHashMap, the weak references are used for the keys, not the values. For a cache, it is the values that you want to be softly (or weakly) referenced.


Q. Where would you use a WeakHashMap?
A. It is a good fit for canonical maps, where you associate some extra information with an object to which you hold a strong reference. You put an entry in a WeakHashMap with the object as the key and the extra information as the map value. As long as you keep a strong reference to the object, you will be able to check the map to retrieve the extra information; once you release the object, the map entry will be cleared and the memory used by the extra information released.
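
A small illustration of this usage; note that System.gc() is only a hint to the JVM, so the entry is typically, but not guaranteed to be, gone by the final print.

import java.util.Map;
import java.util.WeakHashMap;

public class WeakHashMapDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> extraInfo = new WeakHashMap<>();
        Object key = new Object();
        extraInfo.put(key, "extra information about this object");

        System.out.println(extraInfo.get(key)); // reachable while we hold the key
        key = null;                             // drop the last strong reference
        System.gc();                            // a hint only; GC timing is not guaranteed
        Thread.sleep(100);
        System.out.println(extraInfo.size());   // typically 0 once the key is collected
    }
}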


5 Comments:

Blogger Unknown said...

Nicely done! Bookmarked.

4:28 PM, December 07, 2013  
Anonymous Anonymous said...

Simple explanation on nontrivial stuff, well done!

10:00 AM, December 12, 2013  
Blogger GopiKrishnan said...

Thanks for sharing your valuable insights about caching. Could you please point me to how to implement object caching in a cluster environment, and what considerations need to be noted? Can the above LRU cache in Java be used in a cluster environment? If two JVMs are on different machines, where can I share the objects between the two JVMs? And how do I update the objects in sync when new objects are added? I hope you got my questions. Thanks.

8:59 PM, January 19, 2014  
Blogger Unknown said...

You need to look at distributed caching, for example EHCache with Terracotta or EHCache with RMI. Have a look at http://ehcache.org/documentation/2.5/configuration/distributed-cache-configuration

11:27 AM, January 20, 2014  
Blogger cmen said...

This article is very helpful, but does it solve my problem?
http://stackoverflow.com/questions/24245989/ehcache-compex-query-for-different-conditions

2:37 AM, June 18, 2014  
