Ask a librarian

This dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and intended for machine learning use. The corpus is in Finnish. The data matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) models.
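
For readers unfamiliar with TF-IDF, the following is a minimal sketch of how such a matrix can be built from tokenized text. It uses one common weighting (relative term frequency times log inverse document frequency) and whitespace tokenization; the exact weighting, tokenization, and file format used for the published matrices are not stated here, so the function name tfidf_matrix and the toy documents are illustrative assumptions only.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Assumption: tf = count / document length, idf = log(N / document frequency).
    The published matrices may use a different variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        row = [
            (counts[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ]
        rows.append(row)
    return vocab, rows


# Toy documents (hypothetical, not taken from the corpus).
docs = ["kirja kirjasto laina", "kirja runo", "runo runo laulu"]
vocab, rows = tfidf_matrix(docs)
```

Each row corresponds to one document and each column to one vocabulary term; in an XMTC setting the rows would be paired with the documents' subject labels.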

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from the general public with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs by keeping only documents that have at least 4 subjects.
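
The selection rule above can be sketched as a simple filter. The record layout (a dict with "id" and "subjects" keys) and the sample records are hypothetical; the source does not describe how the original collection is stored.

```python
MIN_SUBJECTS = 4  # threshold stated in the description


def has_enough_subjects(doc, minimum=MIN_SUBJECTS):
    """True if the document carries at least `minimum` distinct subjects."""
    return len(set(doc["subjects"])) >= minimum


# Hypothetical question/answer records, for illustration only.
pairs = [
    {"id": 1, "subjects": ["a", "b", "c", "d"]},
    {"id": 2, "subjects": ["a", "b"]},
    {"id": 3, "subjects": ["a", "b", "c", "d", "e"]},
]
corpus = [doc for doc in pairs if has_enough_subjects(doc)]
kept_ids = [doc["id"] for doc in corpus]
```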

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: random sample subset (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validating (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
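
The directory layout above amounts to a split by question year plus a random subsample of the training set. A minimal sketch follows, assuming each document carries a "year" field; the field names, the helper names split_by_year and sample_subset, and the seed are hypothetical, and the original sampling procedure for maui-train is not documented here.

```python
import random


def assign_split(year):
    """Map a question year to a split name, following the listing above."""
    if year < 2016:
        return "train"
    if year == 2016:
        return "validate"
    if year == 2017:
        return "test"
    return None  # later years do not occur in the described corpus


def split_by_year(docs):
    """Group documents into all/train/validate/test by their 'year' field."""
    splits = {"all": list(docs), "train": [], "validate": [], "test": []}
    for doc in docs:
        name = assign_split(doc["year"])
        if name is not None:
            splits[name].append(doc)
    return splits


def sample_subset(train_docs, k, seed=0):
    """Draw a reproducible random sample of k training documents."""
    return random.Random(seed).sample(train_docs, k)


# Hypothetical records, for illustration only.
years = [2014, 2015, 2016, 2017, 2015]
docs = [{"id": i, "year": y} for i, y in enumerate(years)]
splits = split_by_year(docs)
maui_train = sample_subset(splits["train"], k=2)
```

Splitting by time rather than at random means the validation and test questions are always newer than the training questions, which mirrors how a classifier would be applied to incoming questions.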

The original corpus is available from

Resources (4)

Additional Info

Last modified 02.03.2021
Created on 21.12.2020
