ClearTK-ML

The ClearTK-ML module provides the core APIs for building machine-learning-based annotators in UIMA.

CleartkAnnotator

In ClearTK, machine-learning-based annotators are subclasses of CleartkAnnotator (or, for sequence-labeling tasks, its sibling CleartkSequenceAnnotator). The POS tagger tutorial and the BIO tagging tutorial give detailed examples of creating such annotators. In short, a CleartkAnnotator is just a UIMA annotator (i.e. a subclass of JCasAnnotator_ImplBase) whose process(JCas) method extracts features and passes them to a classifier. CleartkAnnotator and CleartkSequenceAnnotator provide a few members that are available to subclasses: the isTraining() method, which reports whether the annotator is writing training data or classifying new instances; the dataWriter instance variable, used to write training data when isTraining() is true; and the classifier instance variable, used to classify new instances when isTraining() is false.
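
To make the extract-features-then-train-or-classify flow concrete, here is a minimal sketch in plain Java. It deliberately avoids the real ClearTK and UIMA classes so that it runs on its own: SketchDataWriter, SketchClassifier and SketchAnnotator are stand-in names invented for this illustration, and the toy features are assumptions, not anything ClearTK prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in types for illustration only; these are NOT the ClearTK classes.
interface SketchDataWriter {
    void write(List<String> features, String outcome);
}

interface SketchClassifier {
    String classify(List<String> features);
}

class SketchAnnotator {
    private final boolean isTraining;
    private final SketchDataWriter dataWriter; // used only in training mode
    private final SketchClassifier classifier; // used only in classification mode

    SketchAnnotator(boolean isTraining, SketchDataWriter dataWriter, SketchClassifier classifier) {
        this.isTraining = isTraining;
        this.dataWriter = dataWriter;
        this.classifier = classifier;
    }

    // Toy feature extraction: the lowercased token and its length.
    static List<String> extractFeatures(String token) {
        List<String> features = new ArrayList<>();
        features.add("lower=" + token.toLowerCase(Locale.ROOT));
        features.add("length=" + token.length());
        return features;
    }

    // In training mode, records the features with the gold outcome and returns that outcome.
    // In classification mode, ignores goldOutcome and returns the classifier's prediction.
    String process(String token, String goldOutcome) {
        List<String> features = extractFeatures(token);
        if (isTraining) {
            dataWriter.write(features, goldOutcome);
            return goldOutcome;
        }
        return classifier.classify(features);
    }
}
```

The same annotator class serves both phases; only the training flag and the collaborator it is handed differ, which is the pattern the parameters described in this section configure.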

CleartkAnnotator Parameters

As is usual for UIMA, an annotator must be wrapped in an AnalysisEngine before it can be used in a pipeline, and the AnalysisEngine must specify any parameters needed by the annotator it wraps. CleartkAnnotator and CleartkSequenceAnnotator always require PARAM_IS_TRAINING, which selects between writing training data (true) and classifying new instances (false). The other required parameters depend on its value.

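When PARAM_IS_TRAINING is true, the following parameters are required: DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY (the directory where the training data is written) and PARAM_DATA_WRITER_CLASS_NAME of the default data writer factory (the data writer class for your selected classifier).

When PARAM_IS_TRAINING is false, the following parameter is required: GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH (the path to your model.jar file).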

(Side note: You may have noticed that many of these parameters are not directly on the CleartkAnnotator class. That’s because CleartkAnnotator allows you to specify arbitrary factory classes for creating DataWriters or Classifiers via its parameters PARAM_DATA_WRITER_FACTORY_CLASS_NAME and PARAM_CLASSIFIER_FACTORY_CLASS_NAME. The default factory classes are where most of the above required parameters come from. For typical use of ClearTK, the default factory classes will almost always be what you want.)
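
The idea of naming a factory class in a parameter can be sketched in plain Java as follows. This is not ClearTK's implementation: ToyClassifier, ToyClassifierFactory, AlwaysNounFactory and FactoryLoader are hypothetical names, and the sketch simply assumes the factory class has a no-argument constructor and is instantiated by reflection from its class name.

```java
// Stand-in types for illustration only; these are NOT the ClearTK classes.
interface ToyClassifier {
    String classify(String input);
}

interface ToyClassifierFactory {
    ToyClassifier createClassifier();
}

// A trivial factory whose classifier labels everything "NN".
class AlwaysNounFactory implements ToyClassifierFactory {
    public AlwaysNounFactory() {
    }

    @Override
    public ToyClassifier createClassifier() {
        return input -> "NN";
    }
}

class FactoryLoader {
    // Instantiates the factory named by a configuration value (a class name string).
    // Assumes the named class implements ToyClassifierFactory and has a no-arg constructor.
    static ToyClassifierFactory load(String factoryClassName) {
        try {
            Class<?> cls = Class.forName(factoryClassName);
            return cls.asSubclass(ToyClassifierFactory.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot create factory: " + factoryClassName, e);
        }
    }
}
```

Because the annotator only sees the factory interface, swapping in a different factory is a configuration change rather than a code change, and any parameters the factory needs live on the factory class itself.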

CleartkAnnotator AnalysisEngines

Putting all of the above together, a typical pipeline that prepares training data for a classifier will include code that looks like:

AnalysisEngineFactory.createPrimitiveDescription(
    <name-of-your-cleartk-annotator>.class,
    CleartkAnnotator.PARAM_IS_TRAINING,
    true,
    DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
    <your-output-directory-file>,
    DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
    <name-of-your-selected-classifier's-data-writer>.class);

And a pipeline that uses the classifier to classify new instances will typically include code that looks like:

AnalysisEngineFactory.createPrimitiveDescription(
    <name-of-your-cleartk-annotator>.class,
    CleartkAnnotator.PARAM_IS_TRAINING,
    false,
    GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
    <path-to-your-model.jar-file>);
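
Since the two configurations differ only in the training flag and the parameters that follow it, one way to keep them consistent is a small helper that builds each name/value list in one place. The sketch below is plain Java with no ClearTK or uimaFIT dependency: SketchConfig and its string keys are hypothetical placeholders, not the actual values of the ClearTK constants, and in a real pipeline you would pass the ClearTK constants themselves.

```java
class SketchConfig {
    // Hypothetical keys standing in for the ClearTK parameter constants.
    static final String IS_TRAINING = "isTraining";
    static final String OUTPUT_DIRECTORY = "outputDirectory";
    static final String DATA_WRITER_CLASS_NAME = "dataWriterClassName";
    static final String CLASSIFIER_JAR_PATH = "classifierJarPath";

    // Alternating name/value pairs for a pipeline that writes training data.
    static Object[] trainingParams(String outputDirectory, String dataWriterClassName) {
        return new Object[] {
            IS_TRAINING, true,
            OUTPUT_DIRECTORY, outputDirectory,
            DATA_WRITER_CLASS_NAME, dataWriterClassName
        };
    }

    // Alternating name/value pairs for a pipeline that classifies new instances.
    static Object[] classificationParams(String modelJarPath) {
        return new Object[] {
            IS_TRAINING, false,
            CLASSIFIER_JAR_PATH, modelJarPath
        };
    }
}
```

Each returned array has the same alternating name/value shape as the argument lists in the snippets above, so the training flag and its dependent parameters cannot drift apart between the two pipelines.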

For more detailed examples, see the POS tagger tutorial and the BIO tagging tutorial.