Ask a librarian

This dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and intended for machine learning use. The corpus is in Finnish. The matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) models.
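For readers unfamiliar with TF-IDF, the following is a minimal sketch of how such a matrix can be computed from tokenized documents. It assumes raw term counts and an unsmoothed inverse document frequency, log(N / df); the exact weighting, tokenization and file format used for this dataset are not stated here, so the function name tfidf and the two sample sentences are purely illustrative.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed weighting (not confirmed for this dataset):
    weight = raw term count * log(N / document frequency).
    Returns the sorted vocabulary and one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document.
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    rows = []
    for tokens in docs:
        term_freq = Counter(tokens)
        rows.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return vocab, rows


# Hypothetical two-document example, whitespace-tokenized.
docs = [
    "mistä löydän kirjan".split(),
    "mistä löydän lehden".split(),
]
vocab, rows = tfidf(docs)
```

With this weighting, terms that occur in every document receive a weight of zero, so only the distinguishing terms carry signal.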

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from the general public together with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs, keeping only documents that have at least 4 subjects.

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: random sample subset (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validating (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
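The date-based split described above can be sketched as follows. This is an illustration only: the function name split_for_year is hypothetical, and the rule is inferred from the directory descriptions rather than taken from the tooling that produced the corpus. The sizes are the ones listed above, and the final check confirms that train, validate and test together account for every document in all.

```python
# Split sizes as stated in the directory listing.
SPLIT_SIZES = {"train": 2625, "validate": 213, "test": 312}
TOTAL_DOCUMENTS = 3150  # size of the "all" directory


def split_for_year(year: int) -> str:
    """Return the split a question belongs to, given the year it was asked.

    Assumed rule, inferred from the description:
    before 2016 -> train, 2016 -> validate, 2017 -> test.
    """
    if year < 2016:
        return "train"
    if year == 2016:
        return "validate"
    if year == 2017:
        return "test"
    # The description does not cover later years.
    raise ValueError(f"no split defined for year {year}")


# The three date-based splits partition the full corpus.
splits_cover_all = sum(SPLIT_SIZES.values()) == TOTAL_DOCUMENTS
```

Note that maui-train is a sample drawn from train, so it is not counted separately in the total.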

The original corpus is available from https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja

Resources (4)

Additional Info

Maintainer email
  1. analytics@csc.fi
Links to additional information
  1. https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja
Last modified 02.03.2021
Created on 21.12.2020
