File type overview

The mwetoolkit is not tied to a specific parser/tagger format. Instead, it can read data from many different file types, and it provides several tools to convert from external formats (such as "TreeTagger", "RASP", "Palavras", etc.) to one of the accepted input formats. If your data is already encoded in a supported filetype (e.g. "CONLL"), you can use it directly as input to mwetoolkit scripts.

With the exception of the "BinaryIndex" format, all other input/output formats accepted by the toolkit are represented as a single file (possibly streaming from stdin or to stdout), and will be automatically decompressed if necessary (see note 1 below).
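As a rough illustration of what transparent decompression can look like, the sketch below opens a file as text and picks a decompressor by inspecting the first bytes. It covers only gzip and bzip2, two of the compression formats the toolkit lists as supported. The function name open_maybe_compressed is hypothetical, and this is not the toolkit's actual implementation.

```python
import bz2
import gzip

# Leading "magic" bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def open_maybe_compressed(path):
    """Return a UTF-8 text stream for `path` (hypothetical helper).

    The file is decompressed on the fly if it starts with the gzip or
    bzip2 magic bytes; otherwise it is opened as plain text.
    """
    with open(path, "rb") as raw:
        head = raw.read(3)
    if head.startswith(GZIP_MAGIC):
        return gzip.open(path, "rt", encoding="utf-8")
    if head.startswith(BZIP2_MAGIC):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

Detecting the format from the file contents rather than from the extension means a misnamed file is still read correctly.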

Any given input/output to the mwetoolkit must belong to exactly one of the following "categories":

  • Category candidates: A list of MWE candidates.
  • Category corpus: A list of sentences (sequences of words, possibly annotated somehow).
  • Category dict: A list of MWE candidates to be used as part of a gold standard.
  • Category patterns: A list of token-based patterns, to be used in MWE identification.
  • Category embeddings: A list of word-embedding vectors.

For example, a file with filetype "XML" and category "candidates" can store only a list of MWE candidates according to the mwetoolkit-defined XML filetype format, and nothing else. Note that some filetype+category combinations may not be supported — e.g. the "PlainCandidates" filetype can only represent MWE candidates (that is, one cannot have a corpus or a list of patterns using this filetype).

The main menu to the left contains some links to the detailed description of each specific file type. You can also have a look at sample files.

Special characters

All file types supported by the mwetoolkit require that you escape potentially problematic input characters before feeding your input to the toolkit. For example, when you have a tab-delimited "CONLL" input with a tab character that is part of an entry, you must escape it as "${tab}". Also, due to the choice of "$" as the escaping character, you should escape the occurrences of "$" itself as "${dollar}".
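The two escapes described above can be applied with a small helper such as the following sketch. The function names escape and unescape are hypothetical (they are not part of the toolkit), and only the "${tab}" and "${dollar}" escapes mentioned in this section are handled.

```python
def escape(text):
    """Escape "$" and tab characters (hypothetical helper)."""
    # Escape "$" first, so that the "$" introduced by "${tab}"
    # is not itself turned into "${dollar}".
    return text.replace("$", "${dollar}").replace("\t", "${tab}")


def unescape(text):
    """Reverse escape(): restore tabs first, then dollar signs."""
    return text.replace("${tab}", "\t").replace("${dollar}", "$")


print(escape("cost\t$5"))  # prints: cost${tab}${dollar}5
```

The order of the replacements matters: reversing it in either function would corrupt input that already contains the literal text "${tab}".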

Also note that most Unicode control characters are explicitly NOT supported by the toolkit (with the exception of "\t" and "\n"), as some file parsers intentionally fail when they encounter them (e.g. the third-party XML expat parser). If you have those characters in the input data, we suggest you escape them manually (for example, by replacing them with characters from a Unicode Supplementary Plane).

Code comments

All file types supported by the mwetoolkit allow comments, which are ignored by the underlying algorithm and simply passed on unchanged to the output. In most text-based formats (e.g. "CONLL", "PlainCandidates"), you can add a comment by prefixing any line with the special character "#". In this case, the "#" character has to be escaped when it appears in the input, and it will be automatically escaped by the mwetoolkit when generating output. In the XML-based filetype formats, you can add a comment using the standard XML comment format: "<!-- comment -->".

MWETOOLKIT directives

The mwetoolkit supports code directives, which instruct the toolkit with extra information not present in the underlying filetype format. For example, you can add the directive below as a comment in the first line of a "CONLL" file to explicitly indicate to the mwetoolkit that your file has "CONLL" filetype:
# MWETOOLKIT: filetype="CONLL"
If such a comment is absent, the mwetoolkit will attempt to detect the filetype automatically, but not all file formats can be detected, and the heuristics may fail in some corner cases. If you want to explicitly specify the filetype from the command line, some mwetoolkit scripts allow a syntax such as --from=CONLL or --to=CONLL. Additionally, remember that you can always convert to XML and take it from there.
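To make the shape of a directive concrete, here is a minimal sketch that extracts key="value" pairs from a directive comment line. The function parse_directive and both regular expressions are assumptions made for illustration; the toolkit's own parser may accept a different or wider syntax.

```python
import re

# A directive is assumed to be a "#" comment starting with "MWETOOLKIT:".
DIRECTIVE_RE = re.compile(r'^#\s*MWETOOLKIT:\s*(.*)$')
# Each setting is assumed to have the form key="value".
KEYVAL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_directive(line):
    """Return a dict of directive settings, or None for other lines."""
    match = DIRECTIVE_RE.match(line.strip())
    if match is None:
        return None
    return dict(KEYVAL_RE.findall(match.group(1)))


print(parse_directive('# MWETOOLKIT: filetype="CONLL"'))
# prints: {'filetype': 'CONLL'}
```

An ordinary comment such as "# just a note" yields None, so a caller can fall back to automatic detection.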

Syntactic information in corpora

To conform with internal representation, dependency syntax has a special format in the mwetoolkit:

The toolkit supports corpora with syntactic annotations: the <w> element can contain a syn attribute, which holds a list of the syntactic dependencies of the word in the sentence, in the format deptype1:wordnum1;deptype2:wordnum2;...deptypeN:wordnumN, where deptypeN is the type of the dependency, and wordnumN is the number of the word that is the target of the dependency (the first word is 1).

For example, the entry

<w lemma="book" pos="N" syn="dobj:4" />
in an "XML" corpus represents a noun, "book", which is the direct object of the fourth word in the sentence. (The dependency labels will vary depending on the convention used in the corpus.) Because ":" and ";" act as delimiters in the syn attribute, you should manually escape any occurrences of these characters in the synrel/deprel names if they appear in the parser tag-set.
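The syn format can be decoded with a few lines of code. The sketch below splits the attribute value into (deptype, wordnum) pairs; parse_syn is a hypothetical helper written for this page, not a function of the toolkit, and it assumes that dependency names contain no unescaped delimiter characters.

```python
def parse_syn(syn):
    """Split a syn value into a list of (deptype, wordnum) pairs."""
    dependencies = []
    for item in syn.split(";"):
        if not item:
            continue  # tolerate an empty value or a trailing ";"
        deptype, _, wordnum = item.partition(":")
        # Word numbers are 1-based positions in the sentence.
        dependencies.append((deptype, int(wordnum)))
    return dependencies


print(parse_syn("dobj:4"))          # prints: [('dobj', 4)]
print(parse_syn("nsubj:2;dobj:4"))  # prints: [('nsubj', 2), ('dobj', 4)]
```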

Fallback to XML format

The file type "XML" is the only format originally accepted by the mwetoolkit, and is hence the most generic and flexible one. If you have any problem with other input filetypes, you can always try to convert your file into "XML" format and take it from there.

1. Supported formats are gzip, bzip2 and zip (with only 1 file inside). Tar is not supported, as it is an archive format, and we deal with simple files only.
