Ask a librarian

The dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and is intended for machine learning use. The corpus is in Finnish. The data matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) models.

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from the general public together with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs, keeping only documents that have at least 4 subjects.
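
The selection rule above can be sketched as follows. This is only an illustration, not the procedure actually used to build the corpus: the record layout (a dict with a `subjects` list) and the function name `select_documents` are assumptions made for the example.

```python
MIN_SUBJECTS = 4  # threshold stated in the description above


def select_documents(pairs, min_subjects=MIN_SUBJECTS):
    """Keep only question/answer pairs that have at least `min_subjects` subjects.

    `pairs` is assumed to be an iterable of dicts with a "subjects" list;
    this layout is hypothetical and not taken from the dataset itself.
    """
    return [p for p in pairs if len(p.get("subjects", [])) >= min_subjects]


# Hypothetical toy records, for illustration only
pairs = [
    {"id": "a", "subjects": ["s1", "s2", "s3", "s4"]},
    {"id": "b", "subjects": ["s1", "s2"]},
    {"id": "c", "subjects": ["s1", "s2", "s3", "s4", "s5"]},
]
selected = select_documents(pairs)
print([p["id"] for p in selected])
```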

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: random sample subset (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validating (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
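
The time-based split described above can be sketched as below. This is a minimal illustration under stated assumptions: the function name `assign_split` and the use of a `datetime.date` for the question date are hypothetical, and the dataset itself ships the documents already split into directories. The sizes are the ones listed above; maui-train is left out of the sum because it is a subset of train.

```python
from datetime import date


def assign_split(asked_on):
    """Return the split name for a question, based on the year it was asked."""
    if asked_on.year < 2016:
        return "train"
    if asked_on.year == 2016:
        return "validate"
    if asked_on.year == 2017:
        return "test"
    return None  # outside the date ranges described for this corpus


# Directory sizes as listed in the description
EXPECTED_SIZES = {"all": 3150, "train": 2625, "validate": 213, "test": 312}

# The three time-based splits should add up to the full corpus
partition_total = sum(EXPECTED_SIZES[name] for name in ("train", "validate", "test"))
sizes_consistent = partition_total == EXPECTED_SIZES["all"]
```

A check like `sizes_consistent` is a cheap way to confirm after download that no documents were lost or duplicated across the directories.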

The original corpus is available from

The Ask a Librarian service can be found here is responsible for developing and maintaining the service.


Additional Info

Collection: Open Data
Maintainer: CSC – IT Center For Science Ltd.
Last modified: 04.02.2022
Created on: 21.12.2020