This page describes the internals of the mwetoolkit.
If you just want to use the toolkit without implementing extra functionality,
you do not need to read this.
The mwetoolkit is a collection of tools
that can usually be composed through Unix pipes.
Most of the time, a tool has the following usage syntax:
mwetoolkit/bin/toolname.py [OPTIONS] <file-inputs>
The input will come from either the files named on the command line or, if no filenames are explicitly given, from stdin. Output should be sent to stdout. Warnings and errors should be printed to stderr.
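The input and output conventions of a typical tool can be illustrated with a minimal sketch. It is not taken from the toolkit: the function names `open_inputs` and `run`, and the "skip empty lines" behaviour, are hypothetical and only serve to show the fallback to standard input and the separation between regular output and warnings.

```python
import sys


def open_inputs(filenames):
    """Yield readable file objects: the named files, or stdin if none are given."""
    if not filenames:
        yield sys.stdin
        return
    for name in filenames:
        with open(name, encoding="utf-8") as fileobj:
            yield fileobj


def run(filenames, out=sys.stdout, err=sys.stderr):
    """Copy non-empty input lines to `out` and report skipped lines on `err`."""
    skipped = 0
    for fileobj in open_inputs(filenames):
        for line in fileobj:
            if line.strip():
                out.write(line)   # regular output goes to stdout by default
            else:
                skipped += 1
    if skipped:
        # Warnings never mix with the data stream, so pipes stay clean.
        print(f"WARNING: skipped {skipped} empty line(s)", file=err)
    return skipped
```

A script built this way would call `run(sys.argv[1:])`, so that it works both with explicit filenames and at the end of a Unix pipe.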
In order to simplify this architecture, the toolkit has a shared library that can be used by the Python tools.
This library is basically divided between the base package, which defines the internal representation of things in the toolkit, and the filetype package, which converts between the toolkit's accepted filetypes and the internal representation in memory.
The main component inside the toolkit is the Word.
Currently, words are completely mutable (you can set their attributes directly).
FeatureSets are essentially Python dicts, and may in the future be converted into actual dicts when the "features" are simple (as in word frequencies, which always have integer values that can be combined through sum). A Feature is essentially a (key, value) tuple (and might actually be reimplemented as a namedtuple in the near future).
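As a rough illustration of this idea, the sketch below models a feature as a namedtuple and a feature set as a thin wrapper around a dict. The class layout, the `merge` parameter and the feature names `corpus_a` and `corpus_b` are assumptions made for this example and do not reproduce the toolkit's actual classes.

```python
from collections import namedtuple

# A feature as an immutable (name, value) pair.
Feature = namedtuple("Feature", ["name", "value"])


class FeatureSet:
    """Dict-backed collection of features; `merge` combines repeated names."""

    def __init__(self, merge=None):
        self._values = {}
        # By default, a newer value simply replaces the older one.
        self._merge = merge if merge is not None else (lambda old, new: new)

    def add(self, feature):
        if feature.name in self._values:
            old = self._values[feature.name]
            self._values[feature.name] = self._merge(old, feature.value)
        else:
            self._values[feature.name] = feature.value

    def get(self, name, default=None):
        return self._values.get(name, default)

    def __iter__(self):
        return (Feature(name, value) for name, value in self._values.items())


# Word frequencies are integers, so repeated features can be combined by sum.
freqs = FeatureSet(merge=lambda old, new: old + new)
for feature in [Feature("corpus_a", 3), Feature("corpus_b", 1), Feature("corpus_a", 4)]:
    freqs.add(feature)
```

When every value is an integer combined by sum, as here, the wrapper adds little over a plain dict, which is the simplification the paragraph above alludes to.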
N-grams, sentences and MWE candidates contain a list of words as their main attribute. An MWE occurrence is a mapping between a candidate and the internals of a sentence.
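The relationship between these objects can be sketched as follows. The classes are simplified, hypothetical stand-ins (the attribute names `surface`, `lemma`, `pos`, `word_list` and `indexes` are assumptions), meant only to show that an occurrence ties the words of a candidate to positions inside a sentence.

```python
class Word:
    """A mutable word: attributes can be set directly after construction."""

    def __init__(self, surface, lemma, pos):
        self.surface = surface
        self.lemma = lemma
        self.pos = pos


class Ngram:
    """Anything whose main attribute is a list of words."""

    def __init__(self, words):
        self.word_list = list(words)

    def lemmas(self):
        return [word.lemma for word in self.word_list]


class Sentence(Ngram):
    """A full sentence from a corpus."""


class Candidate(Ngram):
    """An MWE candidate, described by its words."""


class Occurrence:
    """Maps each word of a candidate to a word position in a sentence."""

    def __init__(self, candidate, sentence, indexes):
        if len(indexes) != len(candidate.word_list):
            raise ValueError("one sentence index is needed per candidate word")
        self.candidate = candidate
        self.sentence = sentence
        self.indexes = list(indexes)

    def matched_words(self):
        return [self.sentence.word_list[i] for i in self.indexes]


sentence = Sentence([
    Word("He", "he", "PRON"), Word("took", "take", "VERB"), Word("a", "a", "DET"),
    Word("long", "long", "ADJ"), Word("walk", "walk", "NOUN"),
])
candidate = Candidate([Word("take", "take", "VERB"), Word("walk", "walk", "NOUN")])
# The candidate is non-contiguous here: it matches positions 1 and 4.
occurrence = Occurrence(candidate, sentence, [1, 4])
```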
The next figure is a rough diagram of the sequence of events that characterizes a tool in the mwetoolkit that takes a list of corpora as its input files. Notice how the parser delegates the handling of each Sentence object to a chain of InputHandlers, each of which will manipulate the object and pass it on to the next handler through handle_sentence.
The execution always begins with filetype.parse, which receives a list of file objects, wraps them into InputFile instances and passes each one to the parse method of an instance of AbstractParser. This parser then reads the input file and calls the callbacks in the FirstInputHandler. In order to simplify this process for line-based filetypes, the AbstractTxtParser is also provided by the toolkit's filetype library.
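The flow from parser to handler callbacks can be approximated with the sketch below. It is a simplified model under stated assumptions, not the toolkit's code: the callback names `before_file` and `after_file`, the `LineParser` and `RecordingHandler` classes and the `linenum` key are hypothetical, and the wrapping of file objects into InputFile instances is omitted.

```python
import io


class InputHandler:
    """Base handler whose callbacks do nothing by default."""

    def before_file(self, fileobj, info):
        pass

    def handle_sentence(self, sentence, info):
        pass

    def after_file(self, fileobj, info):
        pass


class LineParser:
    """Line-based parser: each non-empty line is one whitespace-split sentence."""

    def parse(self, fileobj, handler):
        info = {"linenum": 0}
        handler.before_file(fileobj, info)
        for linenum, line in enumerate(fileobj, start=1):
            tokens = line.split()
            if not tokens:
                continue  # blank lines carry no sentence
            info["linenum"] = linenum
            handler.handle_sentence(tokens, info)
        handler.after_file(fileobj, info)


def parse(fileobjs, handler, parser=None):
    """Entry point: run the parser over every input file object in order."""
    parser = parser if parser is not None else LineParser()
    for fileobj in fileobjs:
        parser.parse(fileobj, handler)


class RecordingHandler(InputHandler):
    """Records every callback it receives, to make the call order visible."""

    def __init__(self):
        self.events = []

    def before_file(self, fileobj, info):
        self.events.append("before_file")

    def handle_sentence(self, sentence, info):
        self.events.append((info["linenum"], sentence))

    def after_file(self, fileobj, info):
        self.events.append("after_file")


recorder = RecordingHandler()
parse([io.StringIO("the cat\n\nsat down\n")], recorder)
```

After the call, `recorder.events` holds the callbacks in the order in which the parser issued them, with the line number taken from the info dictionary.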
The last InputHandler in the chain is the one that is implemented by a given tool in the toolkit. It may just handle the input by itself (as the CounterHandler does in mwetoolkit/bin/wc.py), or may even delegate to a further InputHandler in the chain, which will usually be a subclass of AbstractPrinter (as HeadPrinterHandler does in mwetoolkit/bin/head.py).
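The two styles of final handler described above (handling the input directly, or delegating to a printer further down the chain) can be sketched as follows. The classes are hypothetical and only loosely modelled on the description of wc.py and head.py; none of the names below other than handle_sentence come from the toolkit.

```python
import io


class ChainedHandler:
    """Handler that forwards each sentence to the next handler, if any."""

    def __init__(self, next_handler=None):
        self.next_handler = next_handler

    def handle_sentence(self, sentence, info):
        if self.next_handler is not None:
            self.next_handler.handle_sentence(sentence, info)


class LowercaseHandler(ChainedHandler):
    """Intermediate handler: transforms the sentence and passes it on."""

    def handle_sentence(self, sentence, info):
        super().handle_sentence([word.lower() for word in sentence], info)


class HeadHandler(ChainedHandler):
    """Tool handler that delegates only the first `limit` sentences."""

    def __init__(self, limit, next_handler):
        super().__init__(next_handler)
        self.limit = limit
        self.seen = 0

    def handle_sentence(self, sentence, info):
        if self.seen < self.limit:
            self.seen += 1
            super().handle_sentence(sentence, info)


class PrinterHandler(ChainedHandler):
    """End of the chain: writes one sentence per line to a stream."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def handle_sentence(self, sentence, info):
        self.out.write(" ".join(sentence) + "\n")


class CountingHandler(ChainedHandler):
    """Tool handler that consumes the input itself and delegates nothing."""

    def __init__(self):
        super().__init__()
        self.sentences = 0
        self.words = 0

    def handle_sentence(self, sentence, info):
        self.sentences += 1
        self.words += len(sentence)


sentences = [["A", "Big", "Deal"], ["Kick", "The", "Bucket"], ["By", "And", "Large"]]
out = io.StringIO()
head_chain = LowercaseHandler(HeadHandler(2, PrinterHandler(out)))
counter = CountingHandler()
for sentence in sentences:
    head_chain.handle_sentence(sentence, {})
    counter.handle_sentence(sentence, {})
```

Because every handler exposes the same `handle_sentence` method, handlers can be reordered or swapped without the parser knowing which tool sits at the end of the chain.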
There are other internal parts of the toolkit that are still not completely stable. For example, the info dictionary that keeps being passed around is not formally defined. In the future, we should define a ContextInfo class and use that to carry around information about the parsing context. (The reason it's currently a dictionary is that we initially had no idea what kind of information we would want to have, and the information was not even always available...)
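One possible shape for such a class is sketched below, purely as an illustration. The fields `filename` and `linenum` are assumptions about what a parsing context might contain, not a description of the keys the info dictionary actually holds. Unknown keys are kept in `extra`, so that information that is only sometimes available is not lost.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ContextInfo:
    """Hypothetical typed replacement for the loosely defined info dict."""

    filename: Optional[str] = None   # None when the information is unavailable
    linenum: Optional[int] = None
    extra: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, info):
        """Build a ContextInfo from a legacy info dictionary."""
        known = {"filename", "linenum"}
        return cls(
            filename=info.get("filename"),
            linenum=info.get("linenum"),
            extra={key: value for key, value in info.items() if key not in known},
        )

    def to_dict(self):
        """Convert back to a plain dict, omitting fields that are unset."""
        result = dict(self.extra)
        if self.filename is not None:
            result["filename"] = self.filename
        if self.linenum is not None:
            result["linenum"] = self.linenum
        return result


context = ContextInfo.from_dict({"linenum": 12, "parser": "txt"})
```

The `from_dict` and `to_dict` helpers would let handlers that still expect a dictionary keep working during a gradual migration.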