Trainable Feature Extraction Tutorial

There are many situations in which a feature extractor requires some training. For example, a common feature used in document classification is to generate a TF-IDF score for each word in the document. This requires knowing the term frequencies of words and the distribution of those words across a corpus of documents. Such information can be acquired by counting up words in a corpus and giving that information to the feature extractor in a training step. Another example is to scale a numeric feature to a normalized value whose mean value is zero and standard deviation is 1. This requires knowing the average value of a feature and its standard deviation. Again, this information can be acquired by examining many values of a given numeric feature and computing the mean and standard deviation in a training step for the feature extractor. ClearTK provides a TrainableExtractor interface to support feature extractors which require training.
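
To make the second example concrete, here is a minimal sketch (plain Java, not the ClearTK API) of a zero-mean, unit-variance scaler whose "training" step computes the mean and standard deviation from observed feature values; the class name and methods are illustrative only:

```java
import java.util.Arrays;

// Illustrative sketch, not ClearTK code: the training step for a
// normalizing feature extractor computes corpus statistics (mean and
// standard deviation) that are later used at extraction time.
public class ZScoreScaler {
    private double mean;
    private double stddev;

    // Training step: examine many values of the feature.
    public void train(double[] values) {
        mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
            .map(v -> (v - mean) * (v - mean))
            .average().orElse(0.0);
        stddev = Math.sqrt(variance);
    }

    // Extraction step: scale a raw value using the trained statistics.
    public double transform(double value) {
        return stddev == 0.0 ? 0.0 : (value - mean) / stddev;
    }
}
```

After training on the values {2, 4, 4, 4, 5, 5, 7, 9} (mean 5, standard deviation 2), transform(9) yields 2.0.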

Example

As mentioned above, a common feature extractor used in document classification provides a TF-IDF score for each word in a document. ClearTK provides an implementation of this called TfidfExtractor, which is used like this:

CleartkExtractor<DocumentAnnotation, Token> countsExtractor = 
   new CleartkExtractor<DocumentAnnotation, Token>(
        Token.class,
        new CoveredTextExtractor<Token>(),
        new CleartkExtractor.Count(new CleartkExtractor.Covered()));

TfidfExtractor<String, DocumentAnnotation> tfidfExtractor = 
   new TfidfExtractor<String, DocumentAnnotation>("TF-IDF", countsExtractor);

tfidfExtractor.load(uri);

In the above code, a feature extractor that counts words/tokens is first instantiated as countsExtractor. A typical feature generated by this feature extractor would have a name such as Count_Covered_fox and a value such as ‘4’ if a document contained the word “fox” four times. Next, a TF-IDF feature extractor, tfidfExtractor, is instantiated with countsExtractor as its sub-extractor. In this example, it is typed with String as its first type parameter because this code is pulled from a document classification example and we will be assigning classification outcomes of type String to each document. The second type parameter is DocumentAnnotation because we will be classifying annotations of this type. Finally, load is called on tfidfExtractor, which loads all of the TF-IDF information for this feature extractor from the provided uri. A typical feature generated by this feature extractor would have a name such as TF-IDF_Count_Covered_fox and a positive double value corresponding to the TF-IDF score for the word “fox”. This feature extractor would be called like this:

DocumentAnnotation doc = (DocumentAnnotation) jCas.getDocumentAnnotationFs();
tfidfExtractor.extract(jCas, doc);

The extract method will return a list of features corresponding to a TF-IDF score for each token annotation in the document.
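
To clarify what such a score represents, here is the arithmetic of one common TF-IDF formulation in plain Java. This is an illustration, not ClearTK's code, and ClearTK's exact weighting scheme may differ:

```java
// Illustrative arithmetic, not ClearTK code: a common TF-IDF
// formulation combining a raw term count with corpus statistics.
public class Tfidf {
    // termCount: occurrences of the word in this document (the "Count" feature)
    // numDocs:   total documents in the training corpus
    // docFreq:   number of documents containing the word at least once
    public static double score(int termCount, int numDocs, int docFreq) {
        // Rare words get a high IDF; words in every document get zero.
        double idf = Math.log((double) numDocs / docFreq);
        return termCount * idf;
    }
}
```

For example, a word appearing 4 times in a document and in 10 of 100 training documents scores 4 × ln(10), while a word that appears in every document scores 0.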

Training

Training the TfidfExtractor requires the following steps:

  1. Instantiate the TfidfExtractor as in the example code above but do not call the load method.
  2. Run the feature extractor on a large number of documents. For each list of features returned create an Instance<String> object and collect these objects.
  3. Call the tfidfExtractor.train method and pass in the collected instances. This method will create an IDFMap object from all of the features containing word counts.
  4. Finally, call the tfidfExtractor.save method to save the IDFMap data that was collected in the train method to the provided path. The provided path now points to a resource that can be loaded via tfidfExtractor.load(uri).
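
Conceptually, the train step builds an IDF map from per-document word counts. The sketch below (plain Java, not ClearTK's implementation; all names are illustrative) shows what that computation looks like:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch, not ClearTK's implementation: the train step
// tallies, for each word, how many documents contain it, and derives
// an IDF value from that document frequency.
public class IdfMapTrainer {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    // Called once per training document with its word-count features.
    public void add(Map<String, Integer> wordCounts) {
        numDocs++;
        for (String word : wordCounts.keySet()) {
            docFreq.merge(word, 1, Integer::sum);
        }
    }

    // After all documents are seen, look up a word's IDF. In ClearTK
    // this data would be written to disk by save() and reloaded by load().
    public double idf(String word) {
        int df = docFreq.getOrDefault(word, 0);
        return df == 0 ? 0.0 : Math.log((double) numDocs / df);
    }
}
```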

In practice, training an extractor is not this straightforward, chiefly because a naive approach would require one pass over the training data to train the feature extractors and a second pass to train the classifier.

We have taken an integrated approach in which both classifier training and feature extraction training can be accomplished with a single pass over the training data by performing the following steps:

  1. An untrained feature extractor is instantiated. It generates features that can later be transformed once the feature extractor has been trained. These features are defined by the class TransformableFeature. In the tfidfExtractor example, the TransformableFeature created contains the features generated by its sub-extractor countsExtractor.
  2. Training data is generated with the untrained feature extractors using the InstanceDataWriter class rather than a machine-learning-library-specific data writer such as LibSvmStringOutcomeDataWriter. This writes out the Instance objects to disk using Java serialization. Again, the instances created will contain features of type TransformableFeature from the untrained feature extractors.
  3. We then load the Instance objects from disk using the code: InstanceStream.loadFromDirectory(someDirectory);
  4. Next, the feature extractors requiring training are trained using the instance data loaded in the previous step. Each feature extractor is trained using only those TransformableFeature features that it generated. The code called is e.g. tfidfExtractor.train(instances);
  5. Next, the trained feature extractors transform any features that were generated by them in the training data. For the tfidfExtractor example, all of the features generated by countsExtractor will be transformed into features that contain TF-IDF scores.
  6. We then write the instances (whose features are now transformed) using a machine learning specific data writer such as LibSvmStringOutcomeDataWriter.
  7. Finally, classifier training is performed on the resulting training data.
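
The transform in step 5 can be sketched as follows in plain Java. This is not ClearTK's classes (the names here are hypothetical); it only illustrates how, after training, raw count features are replaced by TF-IDF-valued features:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Conceptual sketch of step 5, not ClearTK code: once the extractor is
// trained, each feature holding a raw count is replaced by a feature
// holding the corresponding TF-IDF score.
public class CountToTfidf {
    // counts: feature name -> raw count, e.g. "Count_Covered_fox" -> 4
    // idf:    trained inverse-document-frequency lookup by feature name
    public static Map<String, Double> transform(
            Map<String, Integer> counts, ToDoubleFunction<String> idf) {
        Map<String, Double> tfidf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Prefixing the name mirrors "TF-IDF_Count_Covered_fox" above.
            tfidf.put("TF-IDF_" + e.getKey(),
                      e.getValue() * idf.applyAsDouble(e.getKey()));
        }
        return tfidf;
    }
}
```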

A complete example

A complete example of how this all works in code can be found in the class org.cleartk.examples.documentclassification.advanced.DocumentClassificationEvaluation in the examples project. The method train contains code for executing all the steps above.