Friday, 2 January 2015

Big Data, Search and Analytics - Elasticsearch

ElasticSearch is a highly scalable open source search engine with a REST API that is hard not to love. In this tutorial we'll look at some of the key concepts when getting started with ElasticSearch.

Downloading and running ElasticSearch

ElasticSearch can be downloaded packaged in various formats such as ZIP and TAR.GZ from elasticsearch.org. After downloading and extracting a package, running it couldn't be much easier, at least if you already have a Java runtime installed.

Running ElasticSearch on Windows

To run ElasticSearch on Windows we run elasticsearch.bat located in the bin folder from a command window. This will start ElasticSearch running in the foreground in the console, meaning we'll see errors in the console and can shut it down using CTRL+C.

If a Java runtime isn't installed or isn't correctly configured, we won't see ElasticSearch's startup output but instead a message saying "JAVA_HOME environment variable must be set!". To fix that, first download and install Java if you don't already have it installed. Second, ensure that you have a JAVA_HOME environment variable configured correctly (Google it if unsure of how).

Running ElasticSearch on OS X

To run ElasticSearch on OS X we run the shell script elasticsearch in the bin folder. This starts ElasticSearch in the background, meaning that if we want to see output from it in the console and be able to shut it down we should add a -f flag.

If the script is unable to find a suitable Java runtime it will help you download it (nice!).

Using the REST API with Sense

Once you have an instance of ElasticSearch up and running you can talk to it using its JSON-based REST API residing at localhost port 9200. You can use any HTTP client to talk to it. In ElasticSearch's own documentation all examples use curl, which makes for concise examples. However, when playing with the API you may find a graphical client such as Fiddler or RESTClient more convenient.

Even more convenient is the Chrome plug-in Sense. Sense provides a simple user interface specifically for using ElasticSearch's REST API. It also has a number of convenient features such as autocomplete for ElasticSearch's query syntax and copying and pasting requests in curl format, making it easy to run examples from the documentation.

We'll be looking at a combination of curl requests and screenshots from Sense throughout this tutorial, and I recommend that you install Sense and use it to follow along.

Once you have installed it you'll find Sense's icon in the upper right corner in Chrome. The first time you click it and run Sense a very simple sample request is prepared for you.

The above request will perform the simplest of search queries, matching all documents in all indexes on the server. Running it against a vanilla installation of ElasticSearch produces an error in the response as there aren't any indexes.

Our next step is to index some data, fixing this issue.

CRUD

While we may want to use ElasticSearch primarily for searching, the first step is to populate an index with some data, meaning the "Create" of CRUD, or rather, "indexing". While we're at it we'll also look at how to update, read and delete individual documents.

Indexing

In ElasticSearch indexing corresponds to both "Create" and "Update" in CRUD - if we index a document with a given type and ID that doesn't already exist it's inserted. If a document with the same type and ID already exists it's overwritten.

In order to index a first JSON object we make a PUT request to the REST API to a URL made up of the index name, type name and ID. That is: http://localhost:9200/<index>/<type>/[<id>].

Index and type are required while the id part is optional. If we don't specify an ID ElasticSearch will generate one for us. However, if we don't specify an id we should use POST instead of PUT.

The index name is arbitrary. If there isn't an index with that name on the server already one will be created using default configuration.

As for the type name it too is arbitrary. It serves several purposes, including:

Each type has its own ID space.
Different types can have different mappings ("schema" that defines how properties/fields should be indexed).
Although it's possible, and common, to search over multiple types, it's easy to search only for one or more specific type(s).
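
To recap the URL rules above, here is a minimal sketch in Java. IndexTarget and forDocument are made-up names for this tutorial, not part of any ElasticSearch client; the sketch only builds the HTTP method and URL and doesn't send anything.

```java
// Hypothetical helper for illustration; not part of any ElasticSearch client library.
final class IndexTarget {
    final String method;
    final String url;

    private IndexTarget(String method, String url) {
        this.method = method;
        this.url = url;
    }

    /** Index and type are required; pass null as id to let the server generate one. */
    static IndexTarget forDocument(String baseUrl, String index, String type, String id) {
        if (index == null || index.isEmpty() || type == null || type.isEmpty()) {
            throw new IllegalArgumentException("index and type are required");
        }
        String collection = baseUrl + "/" + index + "/" + type;
        if (id == null) {
            // No ID given: POST to the type and let ElasticSearch generate the ID.
            return new IndexTarget("POST", collection);
        }
        // ID given: PUT to the full document URL.
        return new IndexTarget("PUT", collection + "/" + id);
    }

    public static void main(String[] args) {
        IndexTarget withId = forDocument("http://localhost:9200", "movies", "movie", "1");
        System.out.println(withId.method + " " + withId.url);
        IndexTarget withoutId = forDocument("http://localhost:9200", "movies", "movie", null);
        System.out.println(withoutId.method + " " + withoutId.url);
    }
}
```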

Let's index something! We can put just about anything into our index as long as it can be represented as a single JSON object. In this tutorial we'll be indexing and searching for movies. Here's a classic one:

{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}

To index that we decide on an index name ("movies"), a type name ("movie") and an id ("1") and make a request following the pattern described above with the JSON object in the body.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}'

You can either run that using curl or use Sense. With Sense you can either populate the URL, method and body yourself or you can copy the above curl example, place the cursor in the body field in Sense and press Ctrl/Command + Shift + V and all of the fields will be populated for you.

After executing the request we receive a response from ElasticSearch in the form of a JSON object.

The response object contains information about the indexing operation, such as whether it was successful ("ok") and the document's ID, which can be of interest if we don't specify that ourselves.

If you executed the default search request that Sense provides, the one that failed before, you can find it again using the "History" button in Sense. Running it now gives a different result.

Instead of an error we're seeing a search result. We'll get to searching later, but for now let's rejoice in the fact that we've indexed something!

Now that we've got a movie in our index let's look at how we can update it, adding a list of genres to it. In order to do that we simply index it again using the same ID. In other words, we make the exact same indexing request as before but with an extended JSON object containing genres.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

The response from ElasticSearch is the same as before with one difference, the _version property in the result object has value two instead of one.

The version number can be used to track how many times a document has been indexed. Its primary purpose, however, is to allow for optimistic concurrency control: we can supply a version in indexing requests as well, and ElasticSearch will then only overwrite the document if the supplied version matches what's in the index (or, when external versioning is used, if it is higher than what's in the index).
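
The general idea of optimistic concurrency control can be sketched in memory: a writer states which version it last read, and the write is rejected if the stored version has moved on. VersionedStore below is a made-up class, not an ElasticSearch API, and the exact comparison rule depends on the versioning mode; this sketch assumes an exact match is required.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory illustration only; VersionedStore is a made-up class, not an ElasticSearch API.
final class VersionedStore {
    private final Map<String, String> bodies = new HashMap<String, String>();
    private final Map<String, Long> versions = new HashMap<String, Long>();

    /** Unconditional write: creates version 1 or bumps the existing version by one. */
    long index(String id, String body) {
        Long current = versions.get(id);
        long next = (current == null) ? 1L : current + 1L;
        bodies.put(id, body);
        versions.put(id, next);
        return next;
    }

    /** Conditional write: rejected when the stored version differs from the one the caller read. */
    long indexIfVersion(String id, String body, long expectedVersion) {
        Long current = versions.get(id);
        if (current == null || current.longValue() != expectedVersion) {
            throw new IllegalStateException(
                "version conflict: expected " + expectedVersion + ", current " + current);
        }
        return index(id, body);
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        System.out.println(store.index("1", "{\"title\": \"The Godfather\"}"));            // first write
        System.out.println(store.indexIfVersion("1", "{\"title\": \"The Godfather\"}", 1)); // we read version 1
        try {
            store.indexIfVersion("1", "{}", 1); // stale: the document has been written since
        } catch (IllegalStateException conflict) {
            System.out.println(conflict.getMessage());
        }
    }
}
```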

Getting by ID

We've so far covered indexing new documents as well as updating existing ones. We've also seen an example of a simple search request and that our indexed movie appeared in that.

While it's possible to search for documents in the index that's overkill if we only want to retrieve a single one with a known ID. A simple and faster approach would be to retrieve it by ID, using GET.

In order to do that we make a GET request to the same URL as when we indexed it, only this time the ID part of the URL is mandatory. In other words, in order to retrieve a document by ID from ElasticSearch we make a GET request to http://localhost:9200/<index>/<type>/<id>.

Let's try it with our movie using the following request:

curl -XGET "http://localhost:9200/movies/movie/1" -d''

As you can see, the result object contains similar metadata to what we saw when indexing, such as index, type and version information. Last but not least it has a property named "_source" which contains the actual document.

There's not much more to say about GET as it's pretty straightforward. Let's move on to the final CRUD operation.

Deleting documents

In order to remove a single document from the index by ID we again use the same URL as for indexing and getting it, only this time we change the HTTP method to DELETE.

curl -XDELETE "http://localhost:9200/movies/movie/1" -d''

The response object contains some of the usual suspects in terms of metadata, along with a property named "found" indicating that the document was indeed found and that the operation was successful.

If we, after executing the DELETE call, switch back to GET we can verify that the document has indeed been deleted.

Searching

So, we've covered the basics of working with data in an ElasticSearch index and it's time to move on to more exciting things - searching. However, considering the last thing we did was to delete the only document we had from our index, we'll first need some sample data. Below are a number of indexing requests that we'll use.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
    "title": "Lawrence of Arabia",
    "director": "David Lean",
    "year": 1962,
    "genres": ["Adventure", "Biography", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
    "title": "To Kill a Mockingbird",
    "director": "Robert Mulligan",
    "year": 1962,
    "genres": ["Crime", "Drama", "Mystery"]
}'

curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
    "title": "Apocalypse Now",
    "director": "Francis Ford Coppola",
    "year": 1979,
    "genres": ["Drama", "War"]
}'

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

curl -XPUT "http://localhost:9200/movies/movie/6" -d'
{
    "title": "The Assassination of Jesse James by the Coward Robert Ford",
    "director": "Andrew Dominik",
    "year": 2007,
    "genres": ["Biography", "Crime", "Drama"]
}'

It's worth pointing out that ElasticSearch has an endpoint (_bulk) for indexing multiple documents with a single request. However, that's out of scope for this tutorial, so we're keeping it simple and using six separate requests.

The _search endpoint

Now that we have put some movies into our index, let's see if we can find them again by searching. In order to search with ElasticSearch we use the _search endpoint, optionally with an index and type. That is, we make requests to a URL following this pattern: <index>/<type>/_search where index and type are both optional.

In other words, in order to search for our movies we can make POST requests to either of the following URLs:

http://localhost:9200/_search - Search across all indexes and all types.
http://localhost:9200/movies/_search - Search across all types in the movies index.
http://localhost:9200/movies/movie/_search - Search explicitly for documents of type movie within the movies index.

As we only have a single index and a single type, which one we use doesn't matter. We'll use the first URL for the sake of brevity.

Search request body and ElasticSearch's query DSL

If we simply send a request to one of the above URLs we'll get all of our movies back. In order to make a more useful search request we also need to supply a request body with a query. The request body should be a JSON object which, among other things, can contain a property named "query" in which we can use ElasticSearch's query DSL.

{
    "query": {
        //Query DSL here
    }
}

One may wonder what the query DSL is. It's ElasticSearch's own domain specific language based on JSON in which queries and filters can be expressed. Think of it like ElasticSearch's equivalent of SQL for a relational database. Here's part of how ElasticSearch's own documentation explains it:

Think of the Query DSL as an AST of queries. Certain queries can contain other queries (like the bool query), other can contain filters (like the constant_score), and some can contain both a query and a filter (like the filtered). Each of those can contain any query of the list of queries or any filter from the list of filters, resulting in the ability to build quite complex (and interesting) queries.

Basic free text search

The query DSL features a long list of different types of queries that we can use. For "ordinary" free text search we'll most likely want to use one called "query string query".

A query string query is an advanced query with a lot of different options that ElasticSearch will parse and transform into a tree of simpler queries. Still, it can be very easy to use if we ignore all of its optional parameters and simply feed it a string to search for.

Let's try a search for the word "kill" which is present in the title of two of our movies:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "kill"
        }
    }
}'

Let's execute the request and take a look at the result.

As expected we're getting two hits, one for each of the movies with the word "kill" in the title. Let's look at another scenario, searching in specific fields.

Specifying fields to search in

In the previous example we used a very simple query, a query string query with only a single property, "query". As mentioned before the query string query has a number of settings that we can specify and if we don't it will use sensible default values.

One such setting is called "fields" and can be used to specify a list of fields to search in. If we don't use that the query will default to searching in a special field called "_all" that ElasticSearch automatically generates based on all of the individual fields in a document.

Let's try to search for movies only by title. That is, if we search for "ford" we want to get a hit for "The Assassination of Jesse James by the Coward Robert Ford" but not for either of the movies directed by Francis Ford Coppola.

In order to do that we modify the previous search request body so that the query string query has a fields property with an array of fields we want to search in:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "ford",
            "fields": ["title"]
        }
    }
}'

Let's execute that and see what happens:

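As expected we get a single hit, the movie with the word "ford" in its title. Compare that to a request where we've removed the fields property from the query: it searches the _all field and therefore also matches the two movies directed by Francis Ford Coppola.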

Filtering

We've covered a couple of simple free text search queries above. Let's look at another one where we search for "drama" without explicitly specifying fields:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "drama"
        }
    }
}'

As we have five movies in our index containing the word "drama" in the _all field (from the genres field) we get five hits for the above query. Now, imagine that we want to limit these hits to movies released in 1962. In order to do that we need to apply a filter requiring the "year" field to equal 1962.

To add such a filter we modify our search request body so that our current top level query, the query string query, is wrapped in a filtered query:

{
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "drama"
                }
            },
            "filter": {
                //Filter to apply to the query
            }
        }
    }
}

A filtered query is a query that has two properties, query and filter. When executed it filters the result of the query using the filter. To finalize the query we'll need to add a filter requiring the year field to have value 1962.

ElasticSearch's query DSL has a wide range of filters to choose from. For this simple case where a certain field should match a specific value a term filter will work well.

"filter": {
    "term": { "year": 1962 }
}

The complete search request now looks like this:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "drama"
                }
            },
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

When we execute it we, as expected, only get two hits, both with year == 1962.
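
To make the query-plus-filter idea concrete, here is a small in-memory sketch in Java over the six sample movies. It is only an illustration: FilteredQuerySketch and its methods are made up, and the "query" part is simplified to a case-insensitive match on the genres rather than ElasticSearch's real free text search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory illustration only; these classes are made up and not part of ElasticSearch.
final class FilteredQuerySketch {
    static final class Movie {
        final String title;
        final int year;
        final List<String> genres;

        Movie(String title, int year, String... genres) {
            this.title = title;
            this.year = year;
            this.genres = Arrays.asList(genres);
        }
    }

    /** The six sample movies indexed earlier (director omitted for brevity). */
    static List<Movie> sampleMovies() {
        return Arrays.asList(
            new Movie("The Godfather", 1972, "Crime", "Drama"),
            new Movie("Lawrence of Arabia", 1962, "Adventure", "Biography", "Drama"),
            new Movie("To Kill a Mockingbird", 1962, "Crime", "Drama", "Mystery"),
            new Movie("Apocalypse Now", 1979, "Drama", "War"),
            new Movie("Kill Bill: Vol. 1", 2003, "Action", "Crime", "Thriller"),
            new Movie("The Assassination of Jesse James by the Coward Robert Ford", 2007,
                      "Biography", "Crime", "Drama"));
    }

    /** Keeps a movie only if it matches the query (a genre word) and passes the filter (the year). */
    static List<String> search(List<Movie> movies, String word, int year) {
        List<String> titles = new ArrayList<String>();
        for (Movie movie : movies) {
            boolean queryHit = false;
            for (String genre : movie.genres) {
                if (genre.equalsIgnoreCase(word)) {
                    queryHit = true;
                }
            }
            boolean filterHit = movie.year == year;
            if (queryHit && filterHit) {
                titles.add(movie.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // prints [Lawrence of Arabia, To Kill a Mockingbird]
        System.out.println(search(sampleMovies(), "drama", 1962));
    }
}
```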

Filtering without a query

In the above example we limit the results of a query string query using a filter. What if all we want to do is apply a filter? That is, we want all movies matching certain criteria.

In such cases we still use the "query" property in the search request body, which expects a query. In other words, we can't just add a filter, we need to wrap it in some sort of query.

One solution for doing this is to modify our current search request, replacing the query string query in the filtered query with a match_all query which is a query that simply matches everything. Like this:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "filtered": {
            "query": {
                "match_all": {
                }
            },
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

Another, simpler option is to use a constant score query:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

Mapping

Let's look at a search request similar to the last one, only this time we filter by director instead of year.

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "director": "Francis Ford Coppola" }
            }
        }
    }
}'

As we have two movies directed by Francis Ford Coppola in our index it doesn't seem too far fetched that this request should result in two hits, right? That's not the case however.

What's going on here? We've obviously indexed two movies with "Francis Ford Coppola" as director and that's what we see in search results as well. Well, while ElasticSearch has a JSON object with that data that it returns to us in search results in the form of the _source property that's not what it has in its index.

When we index a document with ElasticSearch it (simplified) does two things: it stores the original data untouched for later retrieval in the form of _source and it indexes each JSON property into one or more fields in a Lucene index. During the indexing it processes each field according to how the field is mapped. If it isn't mapped, default mappings depending on the field's type (string, number etc.) are used.

As we haven't supplied any mappings for our index ElasticSearch uses the default mappings for strings for the director field. This means that in the index the director field's value isn't "Francis Ford Coppola". Instead it's something more like ["francis", "ford", "coppola"].
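
A rough way to picture this is the Java sketch below. It is a simplified stand-in for the default string analysis (lowercasing and splitting on anything that isn't a letter or digit); the real analyzer does more than that, and AnalysisSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for the default string analysis; the real analyzer does more than this.
final class AnalysisSketch {
    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** A term filter compares the given value, as-is, against the indexed tokens. */
    static boolean termMatches(List<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("Francis Ford Coppola");
        System.out.println(indexed);                                      // [francis, ford, coppola]
        System.out.println(termMatches(indexed, "Francis Ford Coppola")); // false
        System.out.println(termMatches(indexed, "francis"));              // true
    }
}
```

The full name never appears as a single token, which is why the term filter on the exact name finds nothing.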

We can verify that by modifying our filter to instead match "francis" (or "ford" or "coppola"):

So, what to do if we want to filter by the exact name of the director? We modify how it's mapped. There are a number of ways to add mappings to ElasticSearch: through a configuration file, as part of an HTTP request that creates an index, and by calling the _mapping endpoint.

Using the last approach we could in theory fix the above issue by adding a mapping for the "director" field instructing ElasticSearch not to analyze (tokenize etc.) the field at all when indexing it, like this:

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
   "movie": {
      "properties": {
         "director": {
            "type": "string",
            "index": "not_analyzed"
        }
      }
   }
}'

There are however a couple of issues if we do this. First of all, it won't work as there already is a mapping for the field:

In many cases it's not possible to modify existing mappings. Often the easiest workaround for that is to create a new index with the desired mappings and re-index all of the data into the new index.

The second problem with adding the above mapping is that, even if we could add it, we would have limited our ability to search in the director field. That is, while a search for the exact value in the field would match we wouldn't be able to search for single words in the field.

Luckily, there's a simple solution to our problem. We add a mapping that upgrades the field to a multi field. What that means is that we'll map the field multiple times for indexing. Given that one of the ways we map it matches the existing mapping both by name and settings, that will work fine and we won't have to create a new index.

Here's a request that does that:

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
   "movie": {
      "properties": {
         "director": {
            "type": "multi_field",
            "fields": {
                "director": {"type": "string"},
                "original": {"type" : "string", "index" : "not_analyzed"}
            }
         }
      }
   }
}'

This time when we try to add the mappings ElasticSearch is happy to do so.

So, what did we just do? We told ElasticSearch that whenever it sees a property named "director" in a movie document that is about to be indexed in the movies index it should index it multiple times: once into a field with the same name (director) and once into a field named "director.original". The latter field should not be analyzed, maintaining the original value and allowing us to filter by the exact director name.

With our new shiny mapping in place we can re-index one or both of the movies directed by Francis Ford Coppola (copy from the list of initial indexing requests above) and try the search request that filtered by director again. Only, this time we don't filter on the "director" field (which is indexed the same way as before) but instead on the "director.original" field:
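
As a mental model of the multi field, here is a small Java sketch. It is only an illustration: MultiFieldSketch is a made-up class, and the analyzed variant is approximated by lowercasing and splitting on whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only; approximates what ends up in the index for the two fields.
final class MultiFieldSketch {
    /** Indexes one source value twice: analyzed into words, and untouched as a single term. */
    static Map<String, List<String>> indexDirector(String value) {
        Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
        fields.put("director", Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+")));
        fields.put("director.original", Collections.singletonList(value));
        return fields;
    }

    /** A term filter looks for an exact term in the named field. */
    static boolean termFilter(Map<String, List<String>> fields, String field, String term) {
        List<String> terms = fields.get(field);
        return terms != null && terms.contains(term);
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = indexDirector("Francis Ford Coppola");
        System.out.println(termFilter(fields, "director", "Francis Ford Coppola"));          // false
        System.out.println(termFilter(fields, "director.original", "Francis Ford Coppola")); // true
        System.out.println(termFilter(fields, "director", "ford"));                          // true
    }
}
```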

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "director.original": "Francis Ford Coppola" }
            }
        }
    }
}'

Executing it shows that it indeed works:

Where to go from here

We've covered quite a lot of things in this article. Still, we've barely scratched the surface of ElasticSearch's goodness.

For instance, there's a lot more to searching with ElasticSearch than we've seen here. We can create search requests where we specify how many hits we want, use highlighting, get spelling suggestions and much more. Also, the query DSL contains many interesting queries and filters that we can use. Then there's of course also a whole range of facets that we can use to extract statistics from our data or build navigations.

As if that wasn't enough, we can go far, far beyond the simple mapping example we've seen here to accomplish wonderful and interesting things. And then there are of course plenty of performance optimizations and considerations. And functionality to find similar content. And, and, and...

But for now, thanks for reading! I hope you found this tutorial useful on your way to discovering the great open source project ElasticSearch.

Tuesday, 8 April 2014

One Step Ahead With EhCache(Search API)

After so many days, I am finally writing this blog post:

Ehcache 2.4 launched. The big new feature right in the core of Ehcache 2.4 is Search.

It uses a new fluent API which looks like this:

Results results = cache.createQuery().includeKeys().addCriteria(age.eq(32).and(gender.eq("male"))).execute();

In short, it lets you further offload the database. With Ehcache now supporting up to 2TB and linear scale-out you can do more than ever.

What is searchable?

You can search against predefined indexes of keys, values, or attributes extracted from values.

Attributes can be extracted using JavaBeans conventions or by specifying a method to call.

For example to declare a cache searchable and extract age as a JavaBean and gender as a method out of a Person class:

<cache name="cache3" maxElementsInMemory="10000" >
     <searchable>
          <searchAttribute name="age"/>
          <searchAttribute name="gender" expression="value.getGender()"/>
     </searchable>
</cache>

Caches can also be made searchable programmatically. And Custom Value Extractors can be created so that you can index and search against virtually anything you put in the cache.

Search Query Language

Ehcache Search introduces EQL, a fluent, object-oriented query language that follows DSL principles and should feel familiar and natural to Java programmers.

Here is a full example. Search for men whose names start with “Greg”, and then order the results by age. Don’t return more than 10 results. We want to include keys and values in the results. Finally we iterate through the Results.

// Assumes name, gender and age are declared as search attributes of the cache
Attribute<String> name = cache.getSearchAttribute("name");
Attribute<Gender> gender = cache.getSearchAttribute("gender");
Attribute<Integer> age = cache.getSearchAttribute("age");

Query query = cache.createQuery();
query.includeKeys();
query.includeValues();
query.addCriteria(name.ilike("Greg*").and(gender.eq(Gender.MALE))).addOrderBy(age, Direction.ASCENDING).maxResults(10);

Results results = query.execute();
System.out.println(" Size: " + results.size());
for (Result result : results.all()) {
    System.out.println("Got: Key[" + result.getKey()
            + "] Value class [" + result.getValue().getClass()
            + "] Value [" + result.getValue() + "]");
}

EQL is very rich. There is a large number of Criteria such as ilike, lt, gt, between, and and or, which you use to build up complex queries. There are also Aggregators such as min, max, average, sum and count which will summarise the results.

Like NoSQL, EQL executes against a single cache – there are no joins. If you need to combine the results of searches from two caches, you can perform two searches and then combine the results yourself.

Standalone and Distributed

Search is built into Ehcache core. It works with standalone in-process caching and will work for distributed caches in the forthcoming Terracotta 3.5 platform release which goes GA in March and is available as a release candidate now.

Distributed cache search is indexed and executes on the Terracotta Server Array using a scatter gather pattern. The EQL is sent to each cache partition (the scatter), returning partial results (the gather) to the requesting Ehcache node which then combines the results and presents them to the caller. Terracotta servers utilise precomputed indexes to speed queries.
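
The scatter gather pattern itself can be sketched with plain Java threads. This is not how the Terracotta Server Array is implemented; ScatterGatherSketch is a made-up class, the partitions are simple in-memory lists of ages, and the "query" is a fixed minimum-age criterion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration of scatter gather only; not the Terracotta implementation.
final class ScatterGatherSketch {
    /** Evaluates the criterion against a single partition and returns its partial result. */
    static List<Integer> searchPartition(List<Integer> partition, int minAge) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Integer age : partition) {
            if (age >= minAge) {
                hits.add(age);
            }
        }
        return hits;
    }

    /** Scatter: one task per partition. Gather: wait for all partial results and merge them. */
    static List<Integer> searchAtLeast(List<List<Integer>> partitions, final int minAge) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<List<Integer>>> partials = new ArrayList<Future<List<Integer>>>();
            for (final List<Integer> partition : partitions) {
                Callable<List<Integer>> task = () -> searchPartition(partition, minAge);
                partials.add(pool.submit(task));
            }
            List<Integer> combined = new ArrayList<Integer>();
            for (Future<List<Integer>> partial : partials) {
                combined.addAll(partial.get());
            }
            Collections.sort(combined);
            return combined;
        } catch (Exception e) {
            throw new RuntimeException("partition search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(25, 40), Arrays.asList(31, 18), Arrays.asList(52));
        System.out.println(searchAtLeast(partitions, 30)); // [31, 40, 52]
    }
}
```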

Indeed, the distributed cache performance has an important property: searches execute in O(logN)/partitions time. So if you have 50GB of cache in one partition which takes 40ms to search and then you double the data, you can hold the execution time constant by simply adding another partition. Generally, you can hold execution time constant for any size of data with this approach.

The standalone cache takes a different approach. Most in-process caches are relatively small. And Ehcache is lightning fast. We don’t use indexes but instead visit each element in the cache at most once, resolving the EQL. It takes 5ms to run a typical query against a 10,000 element cache. Generally the performance is O(N) but even a 1 million entry cache will take less than a second to search using this approach.
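
The unindexed approach amounts to a single pass over the cache contents. A minimal sketch, assuming the cache is just a Map and the criteria a Predicate (LinearScanSketch is a made-up class, not Ehcache code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustration only; a plain Map stands in for the standalone cache.
final class LinearScanSketch {
    /** Visits every entry exactly once and collects the keys whose value satisfies the criteria. */
    static <K, V> List<K> search(Map<K, V> cache, Predicate<V> criteria) {
        List<K> keys = new ArrayList<K>();
        for (Map.Entry<K, V> entry : cache.entrySet()) {
            if (criteria.test(entry.getValue())) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = new LinkedHashMap<String, Integer>();
        ages.put("alice", 32);
        ages.put("bob", 45);
        ages.put("carol", 32);
        System.out.println(search(ages, age -> age == 32)); // [alice, carol]
    }
}
```

The cost grows linearly with the number of entries, which matches the O(N) behaviour described above.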

Sample Use Cases

Database Search Offload

Take a shipping company that creates 50GB of consignment notes per week. Customers search by consignment note id but also by addressee name. Most searches (95%) are done within two weeks of the creation of a consignment note. The consignment notes get stored in a relational database that grows and grows. Searches against the database now take 650ms, which takes enquiries outside their SLA.

Solution: Put the last two weeks of data in the cache. Index by consignment note id, first name, last name and date. Search the cache first and only search the database if the consignment note is not found. This takes about 50ms and provides a 95% database offload.
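
The "cache first, database second" lookup can be sketched as follows. Both stores are plain Maps here, standing in for the searchable cache and the relational database; CacheFirstLookup and its method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; two Maps stand in for the two-week cache and the relational database.
final class CacheFirstLookup {
    private final Map<String, String> recentNotes;
    private final Map<String, String> database;
    private int databaseSearches = 0;

    CacheFirstLookup(Map<String, String> recentNotes, Map<String, String> database) {
        this.recentNotes = recentNotes;
        this.database = database;
    }

    /** Searches the cache first and falls back to the database only on a miss. */
    String findById(String consignmentNoteId) {
        String note = recentNotes.get(consignmentNoteId);
        if (note != null) {
            return note;
        }
        databaseSearches++;
        return database.get(consignmentNoteId);
    }

    int databaseSearches() {
        return databaseSearches;
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<String, String>();
        database.put("CN-1", "old note");
        database.put("CN-2", "recent note");
        Map<String, String> cache = new HashMap<String, String>();
        cache.put("CN-2", "recent note");

        CacheFirstLookup lookup = new CacheFirstLookup(cache, database);
        System.out.println(lookup.findById("CN-2") + ", database searches: " + lookup.databaseSearches());
        System.out.println(lookup.findById("CN-1") + ", database searches: " + lookup.databaseSearches());
    }
}
```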

Real-Time Analytics Search

In-house analytics engines process large amounts of data and compute some result from it. The results need to be queried very quickly to enable processing within a business transaction. And the results need to be updated through the day in response to business events. Some examples are credit card fraud scoring, or a holding position in a trading application.

Create a distributed cache and index it as required. Various roll-ups are cached and updated after the system of record has been written to with new transactions. Use Ehcache’s bulk loading mode to quickly upload the results of overnight analytics runs.

Searches execute much more quickly than it would take to compute positions from scratch using the system of record, enabling the real-time analytics.

In the next post I will write about Elasticsearch, which uses fewer resources and offers fast document-based search.

Monday, 18 November 2013

Increase Your Application Performance Using EhCache

Ehcache is an open source, standards-based cache for boosting performance, offloading your database, and simplifying scalability. It’s the most widely-used Java-based cache because it’s robust, proven, and full-featured. Ehcache scales from in-process, with one or more nodes, all the way to mixed in-process/out-of-process configurations with terabyte-sized caches. For applications needing a coherent distributed cache, Ehcache uses the open source Terracotta Server Array.

Download
First, we need to download ehcache-core-2.6.2.jar, slf4j-api-1.6.1.jar and slf4j-jdk14-1.6.1.jar, which are included in ehcache-core-2.6.2-distribution.tar.gz from http://ehcache.org/downloads.

Then, add these jars to the libraries folder of your project under \war\WEB-INF\, or declare them in your pom.xml (Maven).

Configuration ehcache.xml
Create an ehcache.xml file in the classpath of your project (this file could be copied from the downloaded package). This file contains a default cache configuration with an implicit name “default”, which is a reserved cache name. This configuration will be applied to caches created programmatically (using CacheManager.addCache(String cacheName)).

So, we will add a sample cache named “myCache1” which will hold a maximum of 10000 elements in memory, and will expire an element if it is idle for more than 5 minutes (300 sec) or lives for more than 10 minutes (600 sec). If there are more than 10000 elements it will overflow to the disk cache, which in this configuration will go to wherever java.io.tmpdir is defined on your system. On a standard Linux system this will be ‘/tmp’, for Windows7 it could be ‘C:\Users\username\AppData\Local\Temp’.
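
The two expiry settings can be read as a simple rule. The sketch below assumes, as I understand Ehcache's timeToIdleSeconds and timeToLiveSeconds, that an element expires as soon as either limit is exceeded and that a value of 0 disables that limit. ExpirySketch is a made-up class, not Ehcache code.

```java
// Illustration of the expiry rule only; not taken from the Ehcache sources.
final class ExpirySketch {
    /**
     * All arguments are in seconds. An element is expired when it has been idle longer than
     * timeToIdle OR has existed longer than timeToLive; a limit of 0 is treated as "no limit".
     */
    static boolean isExpired(long createdAt, long lastAccessedAt, long now,
                             long timeToIdle, long timeToLive) {
        boolean idleTooLong = timeToIdle > 0 && now - lastAccessedAt > timeToIdle;
        boolean livedTooLong = timeToLive > 0 && now - createdAt > timeToLive;
        return idleTooLong || livedTooLong;
    }

    public static void main(String[] args) {
        // myCache1 settings: timeToIdleSeconds=300, timeToLiveSeconds=600
        System.out.println(isExpired(0, 0, 301, 300, 600));   // true: idle for 301 seconds
        System.out.println(isExpired(0, 500, 599, 300, 600)); // false: recently used, not yet 600 seconds old
        System.out.println(isExpired(0, 500, 601, 300, 600)); // true: older than 600 seconds
    }
}
```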

<?xml version="1.0" encoding="UTF-8"?>
<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="ehcache.xsd"
         updateCheck="true" monitoring="autodetect"
         dynamicConfig="true">

    <defaultCache
        maxEntriesLocalHeap="10000"
        eternal="false"
        timeToIdleSeconds="120"
        timeToLiveSeconds="120"
        diskSpoolBufferSizeMB="30"
        maxEntriesLocalDisk="10000000"
        diskExpiryThreadIntervalSeconds="120"
        memoryStoreEvictionPolicy="LRU"
        statistics="false">
    </defaultCache>

    <cache name="myCache1"
        maxEntriesLocalHeap="10000"
        maxEntriesLocalDisk="1000"
        eternal="false"
        diskSpoolBufferSizeMB="20"
        timeToIdleSeconds="300"
        timeToLiveSeconds="600"
        memoryStoreEvictionPolicy="LFU"
        transactionalMode="off">
    </cache>

</ehcache>

Use / Utilization
To prove the effectiveness of Ehcache, we need to create several classes.

First, we will create a utility class named CacheUtil in order to access and exercise the Ehcache:
* Get the cache instance of Ehcache via the method getCache. This method could be synchronized. Moreover, to specify the ehcache configuration file, we could use an environment or a VM variable:

public static CacheManager cacheMgr = null;

// synchronized: the CacheManager is created lazily and several threads call this method
private static synchronized Ehcache getCache(String cacheName){
    if(cacheMgr == null){
        // We could use an environment or a VM variable
        cacheMgr = CacheManager.create("...\\config\\ehcache.xml");
    }
    Ehcache cache = null;
    if(cacheMgr!=null){
        //cache = cacheMgr.addCacheIfAbsent(cacheName);
        // Returns null if no cache with this name is configured
        cache = cacheMgr.getEhcache(cacheName);
    }
    return cache;
}
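
As an illustration of the remark about reading the configuration location from a VM variable or an environment variable, here is a minimal sketch in plain Java. The class CacheConfigLocator, the property name ehcache.config.path, the environment variable EHCACHE_CONFIG_PATH and the fallback path are all hypothetical names chosen for this example; they are not defined by Ehcache.

```java
// Illustrative sketch only: all names below are assumptions made for this example.
final class CacheConfigLocator {
    static final String PROPERTY_NAME = "ehcache.config.path";   // hypothetical VM property
    static final String ENV_NAME = "EHCACHE_CONFIG_PATH";        // hypothetical environment variable
    static final String DEFAULT_PATH = "config/ehcache.xml";     // hypothetical fallback

    // Pure function so the precedence rule is easy to test:
    // VM property first, then environment variable, then the default.
    static String resolve(String propertyValue, String envValue) {
        if (propertyValue != null && !propertyValue.trim().isEmpty()) {
            return propertyValue.trim();
        }
        if (envValue != null && !envValue.trim().isEmpty()) {
            return envValue.trim();
        }
        return DEFAULT_PATH;
    }

    // Reads the actual VM property and environment variable of the running process.
    static String resolve() {
        return resolve(System.getProperty(PROPERTY_NAME), System.getenv(ENV_NAME));
    }

    public static void main(String[] args) {
        // Run with: java -Dehcache.config.path=/some/path/ehcache.xml CacheConfigLocator.java
        System.out.println("Configuration file: " + resolve());
    }
}
```

The resolved string could then be passed to CacheManager.create(...) in getCache in place of the hard-coded path.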

* Get data from the cache via the method getListFromCache. This method could also be synchronized. It takes a threadName argument so that the log output shows which thread made each call.

@SuppressWarnings("unchecked")
public static <T> List<T> getListFromCache(String threadName, String cacheName, String key, CacheCreation<T> cacheCreation){
    List<T> all = new ArrayList<T>();
    Ehcache cache = getCache(cacheName);
    Element element = null;
    if(cache!=null){
        element = cache.get(key);
    }
    if(element==null){
        // Cache miss (or expired element): load the original data
        System.out.println(threadName+" : CacheUtil.getListFromCache() : the element '"+key+"' has not been found in the cache ---> get the original data.");
        all = cacheCreation.getAll();
        // The cache is null when cacheName is not configured; skip the put to avoid a NullPointerException
        if(cache!=null){
            cache.put(new Element(key, all));
            System.out.println(threadName+" : CacheUtil.getListFromCache() : the original data for the element '"+key+"' has been added in the cache.");
        }
    }else{
        // Cache hit: return the stored list
        System.out.println(threadName+" : CacheUtil.getListFromCache() : the element '"+key+"' has been found in the cache.");
        //all = (List<T>) element.getValue();
        all = (List<T>) element.getObjectValue();
    }
    return all;
}
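
The get-then-put sequence in getListFromCache is not atomic, so several threads that miss on the same key at the same moment can each load the original data. Here is a minimal sketch of one way to load each key at most once. It uses a plain ConcurrentHashMap instead of Ehcache, so it has no expiry. The class LoadOnceCache is a hypothetical helper written for this example, not an Ehcache API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only: LoadOnceCache is a hypothetical helper, not part of Ehcache.
final class LoadOnceCache<T> {
    private final ConcurrentHashMap<String, List<T>> store = new ConcurrentHashMap<>();

    // ConcurrentHashMap.computeIfAbsent applies the loader atomically,
    // at most once per key, even when several threads ask for the same key.
    List<T> get(String key, Supplier<List<T>> loader) {
        return store.computeIfAbsent(key, k -> loader.get());
    }

    public static void main(String[] args) throws InterruptedException {
        LoadOnceCache<String> cache = new LoadOnceCache<>();
        AtomicInteger loads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Three threads request the same key concurrently
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> {
                cache.get("data1", () -> {
                    loads.incrementAndGet();
                    return Arrays.asList("data1-value1", "data1-value2");
                });
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        // The loader ran only once although three threads asked for "data1"
        System.out.println("loader calls: " + loads.get());
    }
}
```

The trade-off is that threads asking for a key that is currently being loaded wait until the load has finished, which is usually what we want when the original data is expensive to fetch.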

We have also created an abstract class CacheCreation, which supplies the original data when the cache does not hold it:

public abstract class CacheCreation<T> {
    public abstract List<T> getAll();
}

Then, we will create a class UseCaseClass to exercise and check Ehcache. It contains:
* a main method that creates several threads which query Ehcache:

public static void main(String[] args) {
    int nbThreads = 3;
    ExecutorService execService = Executors.newFixedThreadPool(nbThreads);
    // Create several threads which solicit the Ehcache
    for (int i = 0; i < nbThreads; i++) {
        final int indexFinal = i;
        execService.submit(new Runnable(){
            String threadName = null;
            UseCaseClass useCaseClass = null;
            public void run(){
                try {
                    useCaseClass = new UseCaseClass();
                    threadName = "thread_"+indexFinal;
                    useCaseClass.getAllData1(threadName);
                    {
                        int sleepTime = getRandomSleepTime(1000, 5000);
                        System.out.println(threadName+" will sleep during "+sleepTime+"ms.");
                        // sleep() is a static method, so call it on the class
                        Thread.sleep(sleepTime);
                        System.out.println(threadName+" wakes up");
                    }
                    useCaseClass.getAllData2(threadName);
                    {
                        int sleepTime = getRandomSleepTime(1000, 5000);
                        System.out.println(threadName+" will sleep during "+sleepTime+"ms.");
                        Thread.sleep(sleepTime);
                        System.out.println(threadName+" wakes up");
                    }
                    useCaseClass.getAllData1(threadName);
                    useCaseClass.getAllData2(threadName);
                    useCaseClass.getAllData1(threadName);
                    useCaseClass.getAllData2(threadName);
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }//end-run
            // Random value between min and max, both bounds included
            private int getRandomSleepTime(int min, int max){
                return min + (int)(Math.random() * ((max - min) + 1));
            }
        }//end-runnable
        );//end-submit
    }//end-for
    // Stop accepting new tasks; the pool threads end once the submitted tasks have finished
    execService.shutdown();
}

* Ehcache is queried by two methods, getAllData1 and getAllData2:

private static final String CACHE_NAME = "myCache1";

public List<String> getAllData1(final String threadName){
    return CacheUtil.getListFromCache(threadName, CACHE_NAME, "data1", new CacheCreation<String>(){
        @Override
        public List<String> getAll(){
            System.out.println(threadName+" : UseCaseClass.getAllData1() : the target original method is called to get the values.");
            List<String> list = new ArrayList<String>();
            list.add("data1-value1");
            list.add("data1-value2");
            list.add("data1-value3");
            list.add("data1-value4");
            return list;
        }
    });
}

public List<String> getAllData2(final String threadName){
    return CacheUtil.getListFromCache(threadName, CACHE_NAME, "data2", new CacheCreation<String>(){
        @Override
        public List<String> getAll(){
            System.out.println(threadName+" : UseCaseClass.getAllData2() : the target original method is called to get the values.");
            List<String> list = new ArrayList<String>();
            list.add("data2-value1");
            list.add("data2-value2");
            list.add("data2-value3");
            list.add("data2-value4");
            return list;
        }
    });
}

…So, if we execute the main method:

with the following parameters in ehcache.xml (the cache will expire an element if it is idle for more than 5 minutes or has lived for more than 10 minutes):

<cache name="myCache1"
    maxEntriesLocalHeap="10000"
    maxEntriesLocalDisk="1000"
    eternal="false"
    diskSpoolBufferSizeMB="20"
    timeToIdleSeconds="300"
    timeToLiveSeconds="600"
    memoryStoreEvictionPolicy="LFU"
    transactionalMode="off">
    <persistence strategy="localTempSwap"/>
</cache>

…we obtain the results below. By the end, the elements ‘data1’ and ‘data2’ are found in the cache on every call:

thread_0 : CacheUtil.getListFromCache() : the element 'data1' has not been found in the cache ---> get the original data.

thread_0 : UseCaseClass.getAllData1() : the target original method is called to get the values.

thread_1 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_1 will sleep during 3820ms.

thread_0 : CacheUtil.getListFromCache() : the original data for the element 'data1' has been added in the cache.

thread_0 will sleep during 3252ms.

thread_2 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_2 will sleep during 2626ms.

thread_2 wakes up

thread_2 : CacheUtil.getListFromCache() : the element 'data2' has not been found in the cache ---> get the original data.

thread_2 : UseCaseClass.getAllData2() : the target original method is called to get the values.

thread_2 : CacheUtil.getListFromCache() : the original data for the element 'data2' has been added in the cache.

thread_2 will sleep during 3622ms.

thread_0 wakes up

thread_0 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_0 will sleep during 1956ms.

thread_1 wakes up

thread_1 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_1 will sleep during 2747ms.

thread_0 wakes up

thread_0 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_0 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_0 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_0 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_2 wakes up

thread_2 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_2 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_2 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_2 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_1 wakes up

thread_1 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_1 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_1 : CacheUtil.getListFromCache() : the element 'data1' has been found in the cache.

thread_1 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

with the following parameters in ehcache.xml (the cache will expire an element if it is idle for more than 1 second or has lived for more than 2 seconds):

<cache name="myCache1"
    maxEntriesLocalHeap="10000"
    maxEntriesLocalDisk="1000"
    eternal="false"
    diskSpoolBufferSizeMB="20"
    timeToIdleSeconds="1"
    timeToLiveSeconds="2"
    memoryStoreEvictionPolicy="LFU"
    transactionalMode="off">
    <persistence strategy="localTempSwap"/>
</cache>

…we obtain the results below. At the end, the elements ‘data1’ and ‘data2’ have been found in the cache, but they expire frequently:

thread_2 : CacheUtil.getListFromCache() : the element 'data1' has not been found in the cache ---> get the original data.

thread_2 : UseCaseClass.getAllData1() : the target original method is called to get the values.

thread_1 : CacheUtil.getListFromCache() : the element 'data1' has not been found in the cache ---> get the original data.

thread_1 : UseCaseClass.getAllData1() : the target original method is called to get the values.

thread_0 : CacheUtil.getListFromCache() : the element 'data1' has not been found in the cache ---> get the original data.

thread_0 : UseCaseClass.getAllData1() : the target original method is called to get the values.

thread_2 : CacheUtil.getListFromCache() : the original data for the element 'data1' has been added in the cache.

thread_2 will sleep during 3449ms.

thread_1 : CacheUtil.getListFromCache() : the original data for the element 'data1' has been added in the cache.

thread_1 will sleep during 1335ms.

thread_0 : CacheUtil.getListFromCache() : the original data for the element 'data1' has been added in the cache.

thread_0 will sleep during 2558ms.

thread_1 wakes up

thread_1 : CacheUtil.getListFromCache() : the element 'data2' has not been found in the cache ---> get the original data.

thread_1 : UseCaseClass.getAllData2() : the target original method is called to get the values.

thread_1 : CacheUtil.getListFromCache() : the original data for the element 'data2' has been added in the cache.

thread_1 will sleep during 3509ms.

thread_0 wakes up

thread_0 : CacheUtil.getListFromCache() : the element 'data2' has not been found in the cache ---> get the original data.

thread_0 : UseCaseClass.getAllData2() : the target original method is called to get the values.

thread_0 : CacheUtil.getListFromCache() : the original data for the element 'data2' has been added in the cache.

thread_0 will sleep during 1451ms.

thread_2 wakes up

thread_2 : CacheUtil.getListFromCache() : the element 'data2' has been found in the cache.

thread_2 will sleep during 3111ms.