This page describes ongoing scientific work related to uralicNLP, as reported in various publications. Applying natural language processing to real language data is complex and often involves several connected pipelines, so our studies also address a range of loosely related problems in different parts of these workflows.

This page lists only the publications that have resulted in publicly available data or code.

Non-Standard Data

Finnish Dialect Normalization

Partanen, N., Hämäläinen, M., & Alnajjar, K. (2019). Dialect Text Normalization to Normative Standard Finnish. In The Fifth Workshop on Noisy User-generated Text (W-NUT 2019): Proceedings of the Workshop (pp. 141–146).

[code] [model]

Finnish Dialect Adaptation

Hämäläinen, M., Partanen, N., Alnajjar, K., Rueter, J., & Poibeau, T. (2020). Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity. In Proceedings of the 11th International Conference on Computational Creativity (pp. 204-211).

[code] [model]

Historical English Normalization

Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., & Mäkelä, E. (2019). Revisiting NMT for normalization of early English letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 71–75).

[code] [model]

Unsupervised OCR Post-Correction

Hämäläinen, M., & Hengchen, S. (2019). From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of Recent Advances in Natural Language Processing (pp. 432-437).

[code] [English model]

Swedish Normalization

Hämäläinen, M., Partanen, N., & Alnajjar, K. (2020). Normalization of Different Swedish Dialects Spoken in Finland. In GeoHumanities’20: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 24–27). ACM.

[code] [data]

Endangered Languages

Skolt Sami

Rueter, J., & Hämäläinen, M. (2020). FST Morphology for the Endangered Skolt Sami Language. In Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020) (pp. 250-257).

[models] [code]

Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates with a Character-Based NMT Approach. In Proceedings of the 3rd Workshop on Computational Methods in the Study of Endangered Languages: (Volume 1) Papers (pp. 39-45).


Online Dictionaries

Hämäläinen, M., & Rueter, J. (2018). Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages. In Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts (pp. 967-978).

[code] [service]


Hämäläinen, M. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345.



Hämäläinen, M. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15) [9]

[SemFi] [SemUr]


Partanen, N., Blokland, R., Lim, K., Poibeau, T., & Rießler, M. (2018). The First Komi-Zyrian Universal Dependencies Treebanks. In Second Workshop on Universal Dependencies (UDW 2018) (pp. 126-132).

[data – written] [data – spoken]

Rueter, J., Partanen, N., & Ponomareva, L. (2020). On the questions in developing computational infrastructure for Komi-Permyak. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages (pp. 15-25).


Rueter, J., & Tyers, F. (2018). Towards an open-source universal-dependency treebank for Erzya. In International Workshop for Computational Linguistics of Uralic Languages.


Speech Recognition for Samoyedic Languages

Partanen, N., Hämäläinen, M., & Klooster, T. (2020). Speech Recognition for Endangered and Extinct Samoyedic languages. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation.


Computational Creativity and Figurative Language

Humor Generation

Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a Master-Apprentice Setting: The Case of Movie Title Puns. In Proceedings of the 10th International Conference on Computational Creativity (pp. 266-273).


Poem Generation

Hämäläinen, M., & Alnajjar, K. (2019). Let’s FACE it: Finnish Poetry Generation with Aesthetics and Framing. In 12th International Conference on Natural Language Generation: Proceedings of the Conference (pp. 290-300).

[data] [code]


Alnajjar, K., & Hämäläinen, M. (2019). A Creative Dialog Generator for Fallout 4. In Proceedings of the 14th International Conference on the Foundations of Digital Games [48] New York: ACM. 


Natural Language Generation

Hämäläinen, M., & Rueter, J. (2018). Development of an Open Source Natural Language Generation Tool for Finnish. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages (pp. 51-58).

[code] [data – verb complements] [data – locative]


Hämäläinen, M. K. (2016). Reconocimiento automático del sarcasmo: ¡Esto va a funcionar bien!. University of Helsinki (Master’s thesis)


Knowledge Bases

Alnajjar, K., Hämäläinen, M., Chen, H., & Toivonen, H. (2017). Expanding and Weighting Stereotypical Properties of Human Characters for Linguistic Creativity. In Proceedings of the 8th International Conference on Computational Creativity (ICCC’17) (pp. 25-32).


Prosody in Poetry

Hämäläinen, M., & Rueter, J. (2020). Runonlausunnan prosodia ja sen mallintaminen koneellisesti puhesynteesillä. In Материалы Международного образовательного салона (pp. 5-17). Ижевск: Институт компьютерных исследований.