Annotated Data

It is impossible to build modern NLP applications without annotated data, so we also work regularly on related annotation tasks. The materials we regularly work on include:

  • Treebanks for Uralic languages (Komi, Erzya, Moksha, Skolt Saami)
  • OCR Ground Truth data
  • Speech recognition data
  • Materials annotated for features such as sarcasm

We aim to publish all these resources openly on Zenodo. They usually accompany our papers: each dataset includes references to the papers it belongs to, and it also points to the papers that use it.

We are also interested in building derived and enhanced datasets from materials that have already been published for a different purpose. Open and clear licensing allows and encourages this, and the best new results come from building upon earlier work. This benefits both the creators of the original materials and the contemporary research it helps to advance.


We have been involved in several treebank projects. Universal Dependencies treebanks are useful for a multitude of tasks because they contain lemmas as well as syntactic and morphological annotation.
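To illustrate what this annotation looks like in practice, here is a minimal sketch of reading the CoNLL-U format in which Universal Dependencies treebanks are distributed. It uses only the Python standard library. The function name `parse_conllu` and the sample sentence are our own illustrative assumptions; the sample is written by hand and is not taken from any of our treebanks.

```python
from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(FIELDS, columns)))
    if tokens:
        sentences.append(tokens)  # file did not end with a blank line
    return sentences


# Hand-written illustrative sentence (hypothetical, not from a real treebank).
SAMPLE = "\n".join([
    "# text = Koira haukkuu.",
    "\t".join(["1", "Koira", "koira", "NOUN", "_",
               "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "haukkuu", "haukkua", "VERB", "_",
               "Number=Sing|Person=3|Tense=Pres", "0", "root", "_",
               "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
]) + "\n"

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        # Print surface form, lemma, part of speech, and dependency relation.
        print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

The sketch keeps every column as a string, so multiword-token ranges and empty nodes in the `id` column pass through unchanged. For real work, an established CoNLL-U library is likely a better choice.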

Spanish Sarcasm Data

We have compiled a dataset on sarcasm in Spanish. It is based on two episodes of South Park and two episodes of Archer. Each line uttered by the characters has been transcribed and annotated for sarcasm. Sarcastic sentences carry further annotation based on different theories of sarcasm. The annotations were not crowdsourced on Amazon Mechanical Turk (AMT) but made by a philologist.

[data] [paper]

Finnish morphosyntax

Natural language generation in Finnish is challenging because words are inflected differently depending on case government. Verbs require their objects in a particular case (näen kissan/katson kissaa), and some place names take the external locative cases while others take the internal ones (Tampereella/Helsingissä). Don’t worry, we have datasets for you!

[data – verb complements] [data – locative] [paper]

Finnish semantics

Semantics with Syntactic Relations

SemFi is a semantic resource on words and how they relate to each other. With SemFi, it is possible to find typical adjectives for a dog, or to infer that dogs bark.

[usage] [data] [paper]

Concreteness of Finnish words

Abstractness or concreteness can tell a lot about words and their usage. It can also reveal more about the text type in question. We have compiled a dataset that pairs a large number of Finnish words with a value indicating how concrete each word is.

[data] [paper]
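As a rough sketch of how word-level concreteness values could say something about a text, the snippet below averages the values of the words it finds in a lexicon. Everything here is an assumption made for illustration: the function name `mean_concreteness`, the 1 to 5 scale, and the toy values, which are invented and not taken from the dataset. The input is assumed to be lemmatized, since inflected Finnish forms would otherwise rarely match the lexicon.

```python
from typing import Dict, Iterable, Optional


def mean_concreteness(lemmas: Iterable[str],
                      scores: Dict[str, float]) -> Optional[float]:
    """Average the concreteness of the lemmas covered by the lexicon.

    Returns None when no lemma is found, so that missing coverage is
    not mistaken for a low score.
    """
    found = [scores[lemma.lower()] for lemma in lemmas
             if lemma.lower() in scores]
    if not found:
        return None
    return sum(found) / len(found)


# Invented toy values on an assumed 1-5 scale; NOT taken from the dataset.
TOY_SCORES = {"kissa": 4.8, "pöytä": 4.9, "vapaus": 1.6, "ajatus": 1.9}

if __name__ == "__main__":
    print(mean_concreteness(["kissa", "pöytä"], TOY_SCORES))
    print(mean_concreteness(["vapaus", "ajatus"], TOY_SCORES))
```

A higher average would suggest a more concrete text under these assumptions, but the actual scale and file format should be checked against the published data.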

Computational Creativity Resources

Famous people and their properties

What kind of a person is Adam Sandler? How about Abraham Lincoln? If a computer needs to generate a story about them, it needs to know their character better. Such data is hard to obtain by just mining Wikipedia. That’s when it is time to look at our dataset of famous people and their adjectival properties.

[data] [paper]

Humor Dataset

Humor can take many forms. Our pun dataset consists of movie titles and the puns people have made out of them, such as "Beauty and the Beets".

[data] [paper]

Prosody-annotated Finnish poetry

We have compiled a small dataset with prosody annotations. The data has audio and annotated text.

[data] [paper]

Skolt Sami Cognates

We automatically predicted cognate relations between Skolt Sami and North Sami. These predictions were then automatically filtered and finally read through by linguists. We are releasing the final human-verified list of the cognates found by our method.

[data] [paper]