There are many situations in which a feature extractor requires some training.
For example, a common feature used in document classification is to generate a TF-IDF score for each word in the document.
This requires knowing the term frequencies of words and the distribution of those words across a corpus of documents.
Such information can be acquired by counting up words in a corpus and giving that information to the feature extractor in a training step.
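Concretely, the "training" for TF-IDF amounts to counting documents and document frequencies. The following ClearTK-independent sketch shows the idea; the IdfTrainer class and the log(N/df) weighting are illustrative assumptions, not ClearTK's actual implementation.

```java
import java.util.*;

// Builds an IDF map from a corpus: a single counting pass, done once at training time.
class IdfTrainer {

    // documentFrequencies(corpus).get(word) = number of documents containing word
    static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // a HashSet ensures each word is counted at most once per document
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // idf(word) = log(N / df(word)); words unseen at training time get 0 here.
    static double idf(String word, Map<String, Integer> df, int numDocs) {
        Integer count = df.get(word);
        return count == null ? 0.0 : Math.log((double) numDocs / count);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("the", "quick", "fox"),
            Arrays.asList("the", "lazy", "dog"));
        Map<String, Integer> df = documentFrequencies(corpus);
        System.out.println(idf("the", df, corpus.size())); // in both docs: log(2/2) = 0
        System.out.println(idf("fox", df, corpus.size())); // in one doc:   log(2/1)
    }
}
```

The "training step" is just the counting pass; once the document frequencies are stored, they can be reused by the extractor at classification time.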
Another example is to scale a numeric feature to a normalized value, such that the feature's values have a mean of zero and a standard deviation of one.
This requires knowing the average value of a feature and its standard deviation.
Again, this information can be acquired by examining many values of a given numeric feature and computing the mean and standard deviation in a training step for the feature extractor.
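A minimal sketch of such a training step, independent of ClearTK (the ZScoreNormalizer class here is a hypothetical stand-in, not a ClearTK type):

```java
// "Training" a z-score normalizer: the training step computes the mean and
// standard deviation of a feature's values once; extraction then maps each
// raw value v to (v - mean) / stddev.
class ZScoreNormalizer {
    private final double mean;
    private final double stddev;

    private ZScoreNormalizer(double mean, double stddev) {
        this.mean = mean;
        this.stddev = stddev;
    }

    // The training step: one pass over observed feature values.
    static ZScoreNormalizer train(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / values.length;
        double sqSum = 0.0;
        for (double v : values) sqSum += (v - mean) * (v - mean);
        return new ZScoreNormalizer(mean, Math.sqrt(sqSum / values.length));
    }

    // Applied at feature-extraction time.
    double normalize(double value) {
        return (value - mean) / stddev;
    }

    public static void main(String[] args) {
        ZScoreNormalizer n = train(new double[] {2, 4, 4, 4, 5, 5, 7, 9}); // mean 5, stddev 2
        System.out.println(n.normalize(5.0)); // (5 - 5) / 2 = 0.0
        System.out.println(n.normalize(9.0)); // (9 - 5) / 2 = 2.0
    }
}
```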
ClearTK provides a TrainableExtractor
interface to support feature extractors which require training.
As mentioned above, a common feature extractor used in document classification provides a TF-IDF score for each word in a document.
ClearTK provides an implementation of this called TfidfExtractor
which looks like this:
CleartkExtractor<DocumentAnnotation, Token> countsExtractor =
    new CleartkExtractor<DocumentAnnotation, Token>(
        Token.class,
        new CoveredTextExtractor<Token>(),
        new CleartkExtractor.Count(new CleartkExtractor.Covered()));

TfidfExtractor<String, DocumentAnnotation> tfidfExtractor =
    new TfidfExtractor<String, DocumentAnnotation>("TF-IDF", countsExtractor);
tfidfExtractor.load(uri);
In the above code a feature extractor that counts words/tokens is first instantiated as countsExtractor.
A typical feature generated by this feature extractor would have a name such as Count_Covered_fox
and a value such as ‘4’ if a document contained the word “fox” four times.
Next, a TF-IDF feature extractor is instantiated as tfidfExtractor, with countsExtractor as its sub-extractor.
In this example, it is typed with String as its first type parameter because this code is pulled from a document classification example and we will be assigning classification outcomes of type String to each document.
The second type parameter is DocumentAnnotation because we will be classifying annotations of this type.
Finally, the load method is called on tfidfExtractor, which loads all of the TF-IDF information for this feature extractor from the provided uri.
A typical feature generated by this feature extractor would have a name such as TF-IDF_Count_Covered_fox
and a positive double value corresponding to the TF-IDF score for the word “fox”.
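To make the relationship between the two feature names concrete: the raw Count_Covered_fox feature holds a term count, and the trained extractor combines that count with the stored corpus statistics. The sketch below uses one common weighting, raw count times log(N/df); the exact formula used by ClearTK's TfidfExtractor may differ.

```java
class TfidfSketch {
    // tf-idf = tf * log(N / df): one common weighting scheme, shown for
    // illustration only; ClearTK's TfidfExtractor may compute it differently.
    static double tfidf(int termCount, int numDocs, int docFreq) {
        return termCount * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // "fox" occurs 4 times in this document and in 10 of 1000 training documents.
        System.out.println(tfidf(4, 1000, 10)); // 4 * log(100)
    }
}
```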
This feature extractor would be called like this:
DocumentAnnotation doc = (DocumentAnnotation) jCas.getDocumentAnnotationFs();
tfidfExtractor.extract(jCas, doc);
The extract method will return a list of features corresponding to a TF-IDF score for each token annotation in the document.
Training the TfidfExtractor requires the following steps:

1. For each document in the training data, extract features, add them to an Instance<String> object, and collect these objects.
2. Call the tfidfExtractor.train method and pass in the collected instances. This method will create an IDFMap object from all of the features containing word counts.
3. Call the tfidfExtractor.save method to save the IDFMap data that was collected in the train method to the provided path. The provided path now points to a resource that can be loaded via tfidfExtractor.load(uri).

In practice, training an extractor is not this straightforward for a variety of reasons:
- Collecting all of the Instance objects in memory may cause memory problems. For this reason, the train method actually takes an Iterable<Instance>.
- You will usually want to train a classifier with the same instances used to train the feature extractor. This is supported via TransformableFeature and InstanceDataWriter as described below.
- You may not want to clutter your CleartkAnnotator with code that must consider whether or not a feature extractor has been trained yet.

We have taken an integrated approach in which both classifier training and feature extraction training can be accomplished with a single pass over the training data by performing the following steps:
1. During feature extraction, an extractor that has not yet been trained produces a single feature of type TransformableFeature. In the tfidfExtractor example, the TransformableFeature created contains the features generated by its sub-extractor countsExtractor.
2. Write the training instances to disk using the InstanceDataWriter class rather than a machine-learning-library-specific data writer such as LibSvmStringOutcomeDataWriter. This writes out the instance objects to disk using Java serialization. Again, the instances created will contain features from the untrained feature extractors of type TransformableFeature.
3. Read the Instance objects back from disk using the code: InstanceStream.loadFromDirectory(someDirectory);
4. Train each trainable feature extractor on the loaded instances; each extractor selects the TransformableFeature features which were generated by it. The code called is e.g. tfidfExtractor.train(instances);
5. Transform the features of each instance using the now-trained extractors. In the tfidfExtractor example, all of the features generated by countsExtractor will be transformed into features that contain TF-IDF scores.
6. Write out the transformed instances with a machine-learning-library-specific data writer such as LibSvmStringOutcomeDataWriter.
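The shape of the data at each of these steps can be imitated without ClearTK. The following sketch is purely illustrative: the Feature class, the in-memory "instances", and the log(N/df) formula are stand-ins for ClearTK's actual types, and the serialization round-trip of steps 2-3 is elided.

```java
import java.util.*;

// Illustrative stand-ins for ClearTK types: count features are held as raw
// counts until the extractor is "trained", then rewritten as TF-IDF features.
class IntegratedTrainingSketch {

    static class Feature {
        final String name;
        final double value;
        Feature(String name, double value) { this.name = name; this.value = value; }
    }

    // Helper playing the role of countsExtractor: word -> count for one document.
    static Map<String, Integer> counts(String... words) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : words) m.merge(w, 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        // Step 1: one pass over the corpus produces instances whose features
        // are still raw counts (the role played by TransformableFeature).
        List<Map<String, Integer>> instances = Arrays.asList(
            counts("the", "quick", "fox", "fox"),
            counts("the", "lazy", "dog"));

        // Step 4: "train" the extractor, i.e. build document frequencies.
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> inst : instances)
            for (String w : inst.keySet()) df.merge(w, 1, Integer::sum);

        // Step 5: transform each count feature into a TF-IDF feature.
        int n = instances.size();
        for (Map<String, Integer> inst : instances)
            for (Map.Entry<String, Integer> e : inst.entrySet()) {
                Feature f = new Feature(
                    "TF-IDF_Count_Covered_" + e.getKey(),
                    e.getValue() * Math.log((double) n / df.get(e.getKey())));
                System.out.println(f.name + " = " + f.value);
            }
        // Step 6 would hand the transformed instances to a real data writer.
    }
}
```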
.A complete example of how this all works in code can be found in the class org.cleartk.examples.documentclassification.advanced.DocumentClassificationEvaluation
in the examples project. The method train
contains code for executing all the steps above.