For developers

This page describes the internals of the mwetoolkit. If you just want to use the toolkit without implementing extra functionality, you do not need to read this.

General description

The mwetoolkit is a collection of tools that can usually be composed through Unix pipes. Most of the time, a tool has the following usage syntax:

    mwetoolkit/bin/ [OPTIONS] <file-inputs>

The input comes from <file-inputs> or, if no filenames are explicitly given, from stdin. Output should be sent to stdout. Warnings and errors should be printed to stderr.
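
The convention above can be sketched as a small Python skeleton. This is only an illustration: the names open_inputs, run and main are hypothetical and are not part of the toolkit's library, and the "tool" merely copies non-empty lines.

```python
import argparse
import sys


def open_inputs(filenames):
    """Yield one open stream per filename, or stdin if no filename is given."""
    if not filenames:
        yield sys.stdin
        return
    for name in filenames:
        with open(name, encoding="utf-8") as stream:
            yield stream


def run(streams, out, err):
    """Copy non-empty lines to `out`; report empty lines as warnings on `err`."""
    for stream in streams:
        for lineno, line in enumerate(stream, start=1):
            if not line.strip():
                print(f"warning: empty line {lineno}", file=err)
                continue
            out.write(line)


def main(argv=None):
    # Hypothetical command line: [OPTIONS] <file-inputs>
    parser = argparse.ArgumentParser(description="Illustrative pipe-friendly tool")
    parser.add_argument("files", nargs="*", help="input files (default: stdin)")
    args = parser.parse_args(argv)
    run(open_inputs(args.files), sys.stdout, sys.stderr)
```

A real script would call main() under the usual `if __name__ == "__main__":` guard. Because results go to stdout and diagnostics go to stderr, such a tool can sit in the middle of a pipe without its warnings corrupting the data stream.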

To simplify this architecture, the toolkit has a shared library that the Python tools can use: mwetoolkit/bin/lib. This library is divided into the base package, which defines the toolkit's internal representation of the objects it works with, and the filetype package, which converts between the toolkit's accepted filetypes and that in-memory representation.

The base package

The main component inside the toolkit is the Word. Currently, words are completely mutable (you can set their attributes directly).

FeatureSets are essentially Python dicts, and may in the future be converted into plain dicts when the "features" are simple (as with word frequencies, which always have integer values that can be combined by summing). A Feature is essentially a (key, value) tuple (and might be reimplemented as a namedtuple in the near future).
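
As a rough illustration of these ideas, the sketch below models a feature as a namedtuple and a feature set as a dict whose values are combined by summing. The class names SimpleFeatureSet and SimpleWord, and the attribute names surface, lemma, pos and freqs, are assumptions made for this example. They are not the toolkit's actual definitions.

```python
from collections import namedtuple

# Hypothetical stand-in for a feature: a plain (name, value) pair.
Feature = namedtuple("Feature", ["name", "value"])


class SimpleFeatureSet:
    """Illustrative dict-backed feature set; same-named values are summed."""

    def __init__(self):
        self._values = {}

    def add(self, feature):
        # Combine with any existing value of the same name through addition.
        self._values[feature.name] = self._values.get(feature.name, 0) + feature.value

    def merge(self, other):
        for name, value in other._values.items():
            self.add(Feature(name, value))

    def get(self, name, default=0):
        return self._values.get(name, default)


class SimpleWord:
    """Illustrative mutable word: attributes can be reassigned directly."""

    def __init__(self, surface, lemma, pos):
        self.surface = surface
        self.lemma = lemma
        self.pos = pos
        self.freqs = SimpleFeatureSet()


word = SimpleWord("dogs", "dog", "N")
word.freqs.add(Feature("corpus_a", 3))
word.freqs.add(Feature("corpus_a", 4))  # summed with the previous count
word.pos = "NN"  # direct mutation, as described above
```

With integer frequencies, summing is the natural way to combine two counts for the same key, which is why a plain dict would be enough in that case.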

Class diagrams: Word and FeatureSet

N-grams, sentences and MWE candidates contain a list of words as their main attribute. An MWE occurrence is a mapping between a candidate and the internals of a sentence.
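
One way to picture this relationship is the sketch below. It is only a toy model: the Toy* classes, the word_list attribute and the indexes list are hypothetical names chosen for illustration, and words are plain strings here for brevity.

```python
class ToyNgram:
    """Illustrative n-gram: its main attribute is a list of words."""

    def __init__(self, words):
        self.word_list = list(words)

    def __len__(self):
        return len(self.word_list)


class ToySentence(ToyNgram):
    """A sentence is modelled as an n-gram covering a whole line of text."""


class ToyCandidate(ToyNgram):
    """An MWE candidate is modelled as an n-gram of the expression's words."""


class ToyOccurrence:
    """Maps each word of a candidate to a position inside a sentence."""

    def __init__(self, sentence, candidate, indexes):
        if len(indexes) != len(candidate):
            raise ValueError("need exactly one sentence index per candidate word")
        self.sentence = sentence
        self.candidate = candidate
        self.indexes = list(indexes)

    def matched_words(self):
        # Look up the sentence words that the candidate was matched against.
        return [self.sentence.word_list[i] for i in self.indexes]


sentence = ToySentence("he kicked the old bucket".split())
candidate = ToyCandidate(["kicked", "bucket"])
occurrence = ToyOccurrence(sentence, candidate, [1, 4])
```

Storing indexes rather than copies of the words lets an occurrence describe non-contiguous matches, such as the gap between "kicked" and "bucket" in this example.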

Class diagram: Ngram and subclasses

The filetype package

The next figure is a rough diagram of the sequence of events that characterizes a tool in the mwetoolkit that takes a list of corpora as its input files. Notice how the parser delegates the handling of each Sentence object to a chain of InputHandlers, each of which will manipulate the object and pass it on to the next handler through handle_sentence.

Diagram of messages being sent Parser->FirstInputHandler(->ChainedInputHandlers)->InputHandler: loop(before_file, handle_sentence, handle_sentence, ..., after_file), finish

Execution always begins with filetype.parse, which receives a list of file objects, wraps them into InputFile instances and passes each one to the parse method of an instance of AbstractParser. This parser then reads the input file and calls the callbacks in the FirstInputHandler. To simplify this process for line-based filetypes, the toolkit's filetype library also provides the AbstractTxtParser.

Class diagram: AbstractParser and subclasses

The last InputHandler in the chain is the one implemented by a given tool in the toolkit. It may just handle the input by itself (as the CounterHandler does in mwetoolkit/bin/), or may delegate to a further InputHandler in the chain, which will usually be a subclass of AbstractPrinter (as HeadPrinterHandler does in mwetoolkit/bin/).
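
The callback sequence described in this section (before_file, handle_sentence, after_file, finish) can be sketched with a toy parser and a two-handler chain. The callback names come from the text above, but their signatures, the classes ChainedHandler, LowercaseHandler and RecordingHandler, and the function parse_lines are assumptions made for this example, not the toolkit's real API.

```python
class ChainedHandler:
    """Illustrative base handler: forwards every callback to `delegate`, if any."""

    def __init__(self, delegate=None):
        self.delegate = delegate

    def before_file(self, fileobj, info):
        if self.delegate is not None:
            self.delegate.before_file(fileobj, info)

    def handle_sentence(self, sentence, info):
        if self.delegate is not None:
            self.delegate.handle_sentence(sentence, info)

    def after_file(self, fileobj, info):
        if self.delegate is not None:
            self.delegate.after_file(fileobj, info)

    def finish(self, info):
        if self.delegate is not None:
            self.delegate.finish(info)


class LowercaseHandler(ChainedHandler):
    """Middle of the chain: modifies the sentence, then passes it on."""

    def handle_sentence(self, sentence, info):
        super().handle_sentence([word.lower() for word in sentence], info)


class RecordingHandler(ChainedHandler):
    """End of the chain: records every callback it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def before_file(self, fileobj, info):
        self.events.append("before_file")

    def handle_sentence(self, sentence, info):
        self.events.append(sentence)

    def after_file(self, fileobj, info):
        self.events.append("after_file")

    def finish(self, info):
        self.events.append("finish")


def parse_lines(fileobjs, handler):
    """Toy line-based parser: one whitespace-tokenized sentence per line."""
    info = {}  # stands in for the info dictionary passed between callbacks
    for fileobj in fileobjs:
        handler.before_file(fileobj, info)
        for line in fileobj:
            if line.strip():
                handler.handle_sentence(line.split(), info)
        handler.after_file(fileobj, info)
    handler.finish(info)
```

Calling parse_lines(files, LowercaseHandler(RecordingHandler())) sends each sentence through the lowercasing step before it reaches the recording handler. Since every handler exposes the same callbacks, the parser only needs to know about the first handler in the chain.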

Class diagram: InputHandler and subclasses

Other internal parts of the toolkit are still not completely stable. For example, the info dictionary that is passed around between callbacks is not formally defined. In the future, we should define a ContextInfo class and use that to carry information about the parsing context. (It is currently a dictionary because we initially had no idea what kind of information we would want to have, and the information was not even always available...)
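
Purely as an illustration of what such a class could look like, here is a minimal dataclass sketch. Every field name below (filename, parser_name, extra) is a guess made for this example. The toolkit does not define this class, and the real contents of the info dictionary may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ContextInfo:
    """Hypothetical typed replacement for the loosely defined info dict."""

    # Name of the file being parsed; optional because it is not always known.
    filename: Optional[str] = None
    # Label of the parser in use; an assumed field, also optional.
    parser_name: Optional[str] = None
    # Escape hatch for information that has no dedicated field yet.
    extra: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, info):
        """Build a ContextInfo from a legacy dict, keeping unknown keys in `extra`."""
        known = {"filename", "parser_name"}
        return cls(
            filename=info.get("filename"),
            parser_name=info.get("parser_name"),
            extra={key: value for key, value in info.items() if key not in known},
        )


context = ContextInfo.from_dict({"filename": "corpus.xml", "progress": 0.5})
```

Optional fields with a None default reflect the fact that some information is not always available, while the from_dict helper would allow existing dictionary-based code to migrate gradually.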

