The Full-Text Index

This section describes the full-text index provided with TM4J. The TM4J Full Text Index API is intended to support integration with any full-text indexing implementation. At the present time, the only implementation distributed with TM4J uses Jakarta Lucene as the full-text search engine.

The TM4J Full Text Index API is encapsulated in the interface org.tm4j.topicmap.index.FullTextIndex. This interface extends the org.tm4j.topicmap.index.Index interface with the following method:

  public QueryResult findByText(String query, boolean includeURIs) throws IndexException;

  1. The parameter query contains a query String following the Query Syntax of the full-text engine being used. At the present time, the syntax of this query string is the Lucene Query Syntax.
  2. The parameter includeURIs indicates whether the search should also include URIs found in occurrences. This can be used, for example, to find all URLs with .com domains in a topic map.
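
As an illustration of the kind of string that can be passed as the query parameter, the following sketch builds a few common Lucene Query Syntax forms: an exact phrase, a prefix match and an AND combination. The LuceneQueries class and its method names are hypothetical helpers written for this example; they are not part of TM4J or Lucene, and the sketch assumes the classic Lucene query syntax (double-quoted phrases, a trailing * wildcard, upper-case AND, parentheses for grouping).

```java
// Hypothetical helper, not part of TM4J or Lucene: builds query strings
// in the Lucene Query Syntax for use with findByText(String, boolean).
final class LuceneQueries {
    private LuceneQueries() {}

    /** Wraps text in double quotes so that it is searched as an exact phrase. */
    static String phrase(String text) {
        // Escape backslashes first, then embedded double quotes.
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    /** Matches terms that start with the given prefix. */
    static String prefix(String prefix) {
        return prefix + "*";
    }

    /** Requires both clauses to match; each clause is grouped in parentheses. */
    static String and(String left, String right) {
        return "(" + left + ") AND (" + right + ")";
    }
}
```

With these helpers, LuceneQueries.and("tm4j", LuceneQueries.prefix("index")) produces the string (tm4j) AND (index*), which could then be passed to findByText.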

The result of the query is returned in an org.tm4j.topicmap.index.QueryResult object, which somewhat resembles a Java array or ArrayList. Its size() method returns the number of hits in the QueryResult, and its getHit(int index) method retrieves the hit at the given index. Every hit is represented by an org.tm4j.topicmap.index.QueryHit object, which stores the org.tm4j.topicmap.TopicMapObject that contains the hit and the score that the search engine assigned to that hit.

Using the Lucene FullText Index

The FullText index is not part of the BasicIndexProvider, but is instead provided by a separate FullTextIndexProvider which must be registered with the IndexManager before you can use the FullTextIndex instance.

The FullTextIndexProvider class has a constructor which takes a java.util.Properties instance as a parameter. This Properties instance provides configuration properties for the index, which control how indexing is done and whether the index is a transient, in-memory index or a persistent, file-based index. The indexing method is controlled by specifying the Lucene Analyzer to be used. Lucene comes with several different analyzers, including analyzers for German and Russian as well as for English. By default, the Lucene StandardAnalyzer is used. For more details on Lucene Analyzers, please refer to the Lucene FAQ on indexing.

This example shows how a FullText index can be created and used:

Example 6.5. Example of using the Lucene FullText Index

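  TopicMap map = getTopicMap(); // get a TopicMap

  // Assumes a FullTextIndexProvider has already been registered with the IndexManager.
  // The cast is needed if getIndex() is declared to return the base Index type.
  FullTextIndex index = (FullTextIndex) map.getIndexManager().getIndex(FullTextIndex.class);
  index.open();
  QueryResult result = index.findByText("tm4j", false);

  // Print the score of every hit, followed by the name data if the hit is a BaseName
  for (int i = 0; i < result.size(); i++) {
    QueryHit hit = result.getHit(i);
    System.out.print(hit.getScore() + " ");
    if (hit.getObject() instanceof BaseName) {
      BaseName name = (BaseName) hit.getObject();
      System.out.println(name.getData());
    } else {
      System.out.println(); // end the line for hits that are not BaseNames
    }
  }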
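
Because every hit carries a score, a result can also be narrowed to its stronger matches. The following is a minimal sketch of that idea. So that it compiles on its own, it uses two stand-in interfaces, HitLike and ResultLike, which mirror only the methods described in this section; they are not the real QueryHit and QueryResult types, and the float return type of getScore() is an assumption. The HitFilter class is likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring only the methods described in this section; they are
// NOT the real org.tm4j.topicmap.index.QueryHit / QueryResult types.
interface HitLike {
    float getScore();   // assumption: the score is a float
    Object getObject();
}

interface ResultLike {
    int size();
    HitLike getHit(int index);
}

// Hypothetical helper: keeps only the hits that reach a minimum score.
final class HitFilter {
    private HitFilter() {}

    /** Returns the hits whose score is at least minScore, in result order. */
    static List<HitLike> atLeast(ResultLike result, float minScore) {
        List<HitLike> kept = new ArrayList<HitLike>();
        for (int i = 0; i < result.size(); i++) {
            HitLike hit = result.getHit(i);
            if (hit.getScore() >= minScore) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

The same loop could be written directly against QueryResult and QueryHit in application code; a suitable threshold depends on how the search engine scores hits for a given query.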

Limitations of the Full-Text Index

The full-text index is a new feature in TM4J release 0.9.0 and the current implementation has a number of limitations which you should be aware of if you plan to use this feature.

Known Limitations of the Full-Text Index

  1. Full-text indexes are not incrementally updated. This means that if you modify the topic map, you will need to call the reindex() method to synchronise the full-text index with the modified topic map data.
  2. Full-text index reindexing requires a walk of all topics in the topic map. This can make reindexing a very computationally expensive task.
  3. The Ozone and Hibernate implementations of the full-text index support only the persistent, file-based index.
  4. There is no client-server component to the full-text index implementation. This means that the file system directory containing the full-text index must be accessible to your application, even if the application is connecting to a remote Ozone or RDBMS server.
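
Since reindexing is expensive and is not triggered automatically, one way to live with the first two limitations is to record that the topic map has changed and to reindex only once, just before the next search. The following is a minimal sketch of that pattern. Reindexable is a hypothetical stand-in for an index that exposes reindex(), and LazyReindexer is a hypothetical wrapper; neither is part of TM4J, and the real reindex() method may declare checked exceptions that are not modelled here.

```java
// Hypothetical stand-in for an index that exposes reindex(); the real TM4J
// method may declare checked exceptions that are not modelled here.
interface Reindexable {
    void reindex();
}

/** Defers the expensive reindex until a search is about to run. */
final class LazyReindexer {
    private final Reindexable index;
    private boolean dirty;

    LazyReindexer(Reindexable index) {
        this.index = index;
    }

    /** Call after every modification of the topic map. */
    synchronized void markDirty() {
        dirty = true;
    }

    /** Call before searching; reindexes only if something has changed. */
    synchronized void ensureFresh() {
        if (dirty) {
            index.reindex();
            dirty = false;
        }
    }
}
```

With this wrapper, any number of modifications between two searches costs a single walk of the topic map instead of one walk per modification.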

Future releases of TM4J will attempt to address these issues.