mwetoolkit

Tools

annotate_crf.py <corpus>
Convert input corpus to lower case.
annotate_mwe.py -c <candidates-file> <corpus>
Takes a corpus and a MWE lexicon and outputs an annotated corpus.
avg_precision.py -f <feats> <candidates>
Reads a MWE lexicon and prints a summary containing
the Average Precision for all real-valued features.
Apply a pattern to corpus and output a lexicon of MWE candidates.
changepos.py <corpus>
Convert corpus POS-tags and output new corpus.
combine_freqs.py <candidates.xml>
Combine the frequencies from multiple sources in the input MWE lexicon
and output a lexicon with a single frequency value per entry.
content_words.py -t <tagset> <input-file>
Remove from corpus all words that are not NOUN/VERB/ADJ/ADV.
MWEs are kept in the corpus, even if they contain stopwords.
Calculate the frequency of each word/entry in a MWE lexicon
according to some corpus.
cycle.py <input-file>
Cycle through all input entries (by default, infinite output).
Tag each word with a DiMSUM sense tag.
eval_automatic.py -r <reference.xml> <candidates>
Annotate correctness of each entry in candidate MWE lexicon based on reference lexicon.
Command-line interface to manually evaluate MWE lexicon entries.
extract_candidates.py <corpus>
Extract MWE lexicon from annotated corpus.
feat_annotatedness.py -c <corpus-file> <candidates>
Read a MWE lexicon and a MWE-annotated corpus file.
Output MWE lexicon with an "annotatedness" score.
feat_association.py <candidates>
Read frequency-annotated MWE lexicon.
Output MWE lexicon with association scores.
feat_compositionality.py -e <embeddings-file> <candidates>
Read a MWE lexicon and a MWE-annotated embeddings file.
Output MWE lexicon with a "compositionality" score.
feat_contrast.py -o <name> <candidates>
Read a MWE lexicon annotated with frequency from multiple sources.
Output MWE lexicon with a "contrast" score.
feat_entropy.py <candidates>
Read a MWE lexicon annotated with <vars> elements (XML format).
Output MWE lexicon with an "entropy" feature.
feat_pattern.py <candidates>
Read MWE lexicon.
Output MWE lexicon with 3 surface features: POS sequence, n-gram length, case.
filter.py <input-file>
Filter entries in an input file.
fix_feature.py -f <feat-name> <candidates>
Correct nonexistent/NaN feature values.
from_rasp.py <corpus.RASP>
Convert input corpus in "RASP" format into another corpus format.
from_treetagger.py <corpus.TreeTagger>
Convert input corpus in "TreeTagger" format to another corpus format.
Filter input file according to pattern.
head.py <input-file>
Print first N entries in input file.
histogram.py <candidates.xml>
Calculate a histogram based on frequencies of MWE lexicon entries.
index.py -i <index> <corpus>
Build set of index files for fast execution of count.py.
kappa.py <file.txt>
Computes annotation agreement scores.
localmaxs.py <corpus>
Extract candidate MWE lexicon from raw corpus (uses the LocalMaxs algorithm).
lowercase.py <corpus>
Convert input corpus to lower case.
matheus_combine.py <candidates-file>
Extract combined frequencies for Matheus Westhelle
Compare annotated corpus vs reference. Output a summary of MWE correctness.
measure_tagging.py -a <attr> -r <reference> <corpus>
Compare tagged corpus vs reference. Output a summary of tag correctness.
Sort MWE lexicon according to specified feature.
split.py <input-file>
Output fixed-size pieces of input data into
files named x001, x002, ... xNNN.
tail.py <input-file>
Print last N entries in input file.
to_csv.py <input-file>
Convert input MWE lexicon to "TSV" file format.
(Most of the time, you can use transform.py instead).
to_ucs.py <input-file>
Convert MWE lexicon into "UCS" format. Non-bigrams are discarded.
(Most of the time, you can use transform.py instead).
train_crf.py <corpus>
Convert input corpus to lower case.
transform.py <input-file>
Apply transformation code to input file.
(Can also be used to convert between file formats).
uniq.py <candidates>
Print each input MWE lexicon entry only once.
(Does not require previous sorting).
view.py <input-file>
Pretty-print file through a pager with syntax highlighting (colors the MWEs).
wc.py <input-file>
Count number of characters, words and entries.
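
The Average Precision summary mentioned for avg_precision.py can be illustrated with a small sketch. This is not the toolkit's own code: it assumes the standard information-retrieval definition (candidates ranked by descending feature value, precision averaged over the ranks of the true MWEs), and the function name and the (value, is_true_mwe) input pairs are hypothetical.

```python
def average_precision(scored_entries):
    """Average Precision of a ranking induced by one real-valued feature.

    scored_entries: iterable of (feature_value, is_true_mwe) pairs.
    This input shape is an assumption made for illustration only.
    """
    # Rank candidates from highest to lowest feature value.
    ranked = sorted(scored_entries, key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            hits += 1
            # Precision at this rank, counted only where a true MWE occurs.
            precision_sum += hits / rank

    # With no true MWEs the score is undefined; return 0.0 by convention.
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    candidates = [(0.9, True), (0.7, False), (0.4, True), (0.1, False)]
    print(average_precision(candidates))
```

A feature that ranks every true MWE above every false candidate scores 1.0 under this definition, so the value can be read as a measure of how well a single feature separates true MWEs from noise.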