Learning a CRF tagger
If you want to identify MWEs in a corpus, you can train a CRF model with CRFsuite and use it to tag your corpus.
CRF stands for conditional random fields.
It is a supervised statistical learning method that is widely used in natural language processing for sequence prediction tasks such as part-of-speech tagging and named entity recognition.
Training an MWE tagger
We have implemented a tagger that uses IOB encoding to segment the MWEs in a text. This means that, instead of annotating a corpus using a lexicon, you need a corpus (manually) annotated with MWEs from which the tagger will be learned. This is the training corpus.
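To make the IOB idea concrete, here is a minimal sketch in Python. It assumes a simplified tag set (B for the first token of an MWE, I for the following tokens, O for tokens outside any MWE) and contiguous MWEs only. The function name `spans_to_iob` and the example sentence are hypothetical, and the exact tag set used by the toolkit may be richer than this.

```python
def spans_to_iob(n_tokens, mwe_spans):
    """Return one IOB tag per token.

    mwe_spans is a list of (start, end) token index pairs, end exclusive.
    Each pair covers one contiguous MWE (a simplifying assumption).
    """
    tags = ["O"] * n_tokens          # by default, tokens are outside any MWE
    for start, end in mwe_spans:
        tags[start] = "B"            # first token of the MWE
        for i in range(start + 1, end):
            tags[i] = "I"            # remaining tokens of the same MWE
    return tags


# Hypothetical example sentence with one MWE covering tokens 1 to 3.
tokens = ["He", "kicked", "the", "bucket", "yesterday"]
tags = spans_to_iob(len(tokens), [(1, 4)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A tagger trained on such data predicts one tag per token, and the MWEs are then read off the B/I sequences.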
Currently, the toolkit only accepts the dimsum format, except that you do not need to provide supersenses. In addition to surface forms, the training corpus can also contain manual or automatic POS tags and lemmas. This information helps the tagger learn to identify MWEs more accurately.
For instance, if you want to learn a CRF tagger using the DIMSUM training set, you can run the command below:
# Training your model on a given corpus:
The model is stored in a file called
You can customize many options to train your CRF model. By default, the training script uses a list of features that you can find in test/CRF/listFeatures.txt. These are inspired by features used in several published articles. However, you can specify your own list of features using the option --pathListFeatures. These features must be in a text file containing feature patterns, one per line. A pattern must respect the syntax below:
- Orthographic features:
digits → Binary feature, 1 if the surface form contains digits, else 0.
hyphen → Binary feature, 1 if the surface form contains a hyphen, else 0.
capitalized → Binary feature, 1 if the surface form starts with an uppercase letter, else 0.
allCapitalized → Binary feature, 1 if the surface form consists only of uppercase letters, else 0.
capitalizedAndBOS → Binary feature, 1 if the surface form starts with an uppercase letter AND is at the beginning of a sentence, else 0.
- N-gram features:
[wlp]\[\-?[0-9]\](\|[wlp]\[\-?[0-9]\])*: This regular expression describes sequences of pipe-separated (|) word surface forms (w), lemmas (l) or parts of speech (p), each followed by its position relative to the current token in square brackets ([i]). Admittedly, it is not very pleasant to read, so here are some examples:
p[-2] → Part of speech of the token located 2 words before the current one.
l[-1]|l[0]|l[1] → Trigram composed of the lemmas of the previous, current and next tokens.
w[-1]|l|w|p → Not very useful, but this 4-gram is a valid n-gram feature.
- Lexical features:
AM_.* → Any feature prefixed with AM_ is accepted if you provide a file with lexical features (see below). For example, if you define a feature named AM_mle, then your lexical features file must have a column named
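The three pattern families above can be checked before training with a short script. The sketch below is not part of the toolkit: the names `classify_pattern` and `parse_ngram` are hypothetical, and it only applies the syntax rules as stated in this section (the five orthographic feature names, the n-gram regular expression, and the AM_ prefix).

```python
import re

# Orthographic feature names listed in this section.
ORTHOGRAPHIC = {"digits", "hyphen", "capitalized",
                "allCapitalized", "capitalizedAndBOS"}

# N-gram pattern: w, l or p followed by a one-digit relative position,
# optionally repeated and separated by pipes.
NGRAM_RE = re.compile(r"[wlp]\[-?[0-9]\](\|[wlp]\[-?[0-9]\])*")

# One element of an n-gram pattern, with the kind and position captured.
ELEMENT_RE = re.compile(r"([wlp])\[(-?[0-9])\]")


def classify_pattern(pattern):
    """Return 'orthographic', 'ngram', 'lexical', or None if invalid."""
    if pattern in ORTHOGRAPHIC:
        return "orthographic"
    if NGRAM_RE.fullmatch(pattern):
        return "ngram"
    if pattern.startswith("AM_"):
        return "lexical"
    return None


def parse_ngram(pattern):
    """Split a valid n-gram pattern into (kind, relative position) pairs."""
    return [(kind, int(pos)) for kind, pos in ELEMENT_RE.findall(pattern)]


for candidate in ["digits", "p[-2]", "l[-1]|l[0]|l[1]", "AM_mle", "x[0]"]:
    print(candidate, "->", classify_pattern(candidate))
```

Running this over each line of a feature list file is a cheap way to catch typos such as a missing position index before launching a training run.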
Sometimes, you want to add external resources to guide tagging decisions. This is typically the case when handcrafted or automatically built MWE dictionaries are available. As shown in scientific papers, these can greatly enhance the quality of MWE identification.
To tell the tagger that you want to use lexical features, you must use the option --pathFileAM. AM stands for association measures, which are one type of lexical feature. We will rename this option in the future.
This file, in TSV format (tab-separated values), must contain a column named ngram with the lemmas of the matching lexical units. Then, you can add columns corresponding to the values of lexical features, such as 1 if you just want a binary on/off feature. If the lexical features are automatically extracted, remember to quantise all real values; otherwise CRFsuite will not know how to handle them.
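As an illustration of quantising real values before writing them to a lexical features file, here is a minimal sketch. Several things in it are assumptions rather than toolkit behaviour: the equal-width binning into five integer bins, the helper names `quantise` and `write_lexical_features`, the made-up scores, and the choice to write the lemmas of each n-gram separated by spaces. The column name AM_mle is borrowed from the example feature name mentioned earlier. Check the example file shipped with the toolkit for the exact layout it expects.

```python
import csv
import io


def quantise(value, n_bins=5, lo=0.0, hi=1.0):
    """Map a real value to an integer bin in 0..n_bins-1 (equal-width bins).

    Values outside [lo, hi] are clipped to the nearest bound.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)
    bin_index = int((clipped - lo) / (hi - lo) * n_bins)
    return min(bin_index, n_bins - 1)   # value == hi falls in the last bin


def write_lexical_features(scores, out, column="AM_mle"):
    """Write a TSV with an 'ngram' column and one quantised feature column."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ngram", column])
    for ngram, score in scores.items():
        writer.writerow([ngram, quantise(score)])


# Made-up association scores, for illustration only.
scores = {"kick the bucket": 0.93, "take place": 0.41, "by and large": 0.07}

buffer = io.StringIO()                  # stands in for an open output file
write_lexical_features(scores, buffer)
print(buffer.getvalue(), end="")
```

The number of bins is a trade-off: too few bins discard information, while too many make each bin value rare in the training data.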
The easiest way to understand lexical features is probably by checking the example file, available in
Annotating/tagging a corpus
Once your model is trained, you can annotate your test corpus. For instance, to annotate the dimsum16.test file, you can run:
# Annotate your corpus
bin/annotate_crf.py test/CRF/input/dimsum16.test > tags.txt
The list of predicted IOB tags will be stored in a file named tags.txt. Unfortunately, we do not yet output other formats for the MWE tags, but this is work in progress. To rebuild the result, you need to paste the tags file together with your input corpus, which you can do with the paste command in the terminal.
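If you prefer to do the merge in Python rather than with paste, the sketch below joins each corpus line with the tag on the same line of the tags file. It assumes that tags.txt contains exactly one line per line of the input corpus, including the blank lines that separate sentences. That alignment is an assumption you should verify on your own output. The function name `paste_columns` and the shortened corpus lines are hypothetical.

```python
def paste_columns(corpus_lines, tag_lines):
    """Append each predicted tag as a new tab-separated column.

    Roughly mimics `paste corpus tags`, except that blank
    sentence-separator lines are kept blank.
    """
    corpus_lines = list(corpus_lines)
    tag_lines = list(tag_lines)
    if len(corpus_lines) != len(tag_lines):
        raise ValueError(
            f"line count mismatch: {len(corpus_lines)} corpus lines, "
            f"{len(tag_lines)} tag lines"
        )
    merged = []
    for corpus_line, tag_line in zip(corpus_lines, tag_lines):
        corpus_line = corpus_line.rstrip("\n")
        tag_line = tag_line.rstrip("\n")
        if not corpus_line and not tag_line:
            merged.append("")                       # sentence boundary
        else:
            merged.append(f"{corpus_line}\t{tag_line}")
    return merged


# Shortened, hypothetical corpus lines (real dimsum lines have more columns).
corpus = ["1\tHe\the\tPRON\n", "2\tgave\tgive\tVERB\n", "3\tup\tup\tADP\n", "\n"]
predicted = ["O\n", "B\n", "I\n", "\n"]

merged = paste_columns(corpus, predicted)
print("\n".join(merged))

# With real files, the same function can be applied to open file handles:
# with open("test/CRF/input/dimsum16.test") as c, open("tags.txt") as t:
#     merged = paste_columns(c, t)
```

The explicit line count check is the main advantage over paste, which silently pads the shorter file.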
By default, the script looks for the model file
test/CRF/CRF.model. If you want to use another model, use
If your file contains gold tags, you can use the option --eval to evaluate the performance of the model.