
I am looking for data sets that I could use to perform sequential short text classification (multilabel is OK). By short text I mean ~100 words max. By sequential I mean that the texts should follow each other.

Example: the Dialog State Tracking Challenge 4 data set. In this case, the short text is one dialogue utterance. Each dialogue utterance is classified into one or several speech act categories. The classification is sequential since the utterances in a dialogue form a sequence.
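To make the task format concrete, here is a minimal sketch of how such data could be represented. The utterance texts, the label names, and the `with_history` helper are all invented for illustration; they are not taken from DSTC4 or any other corpus.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Utterance:
    text: str
    labels: Set[str]  # one or several categories (multilabel)


# Hypothetical dialogue: texts and label names are made up for illustration.
dialogue: List[Utterance] = [
    Utterance("Hi, how can I help you?", {"greeting", "question"}),
    Utterance("I am looking for a hotel near the station.", {"inform"}),
    Utterance("Sure, what is your budget?", {"acknowledge", "question"}),
]


def with_history(
    sequence: List[Utterance], n_previous: int = 1
) -> List[Tuple[List[str], str, Set[str]]]:
    """Pair each utterance with the texts of up to n_previous preceding utterances."""
    examples = []
    for i, utterance in enumerate(sequence):
        history = [u.text for u in sequence[max(0, i - n_previous):i]]
        examples.append((history, utterance.text, utterance.labels))
    return examples


# "Short" text in the sense used above: roughly 100 words at most.
MAX_WORDS = 100
assert all(len(u.text.split()) <= MAX_WORDS for u in dialogue)

examples = with_history(dialogue, n_previous=1)
```

Each element of `examples` holds the preceding utterance(s), the current utterance, and its label set, which is the kind of input a sequential classifier would need as opposed to one that treats each text in isolation.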

  • Did you try the NUS SMS Corpus (wing.comp.nus.edu.sg:8080/SMSCorpus), the ENRON Email Dataset (cs.cmu.edu/~./enron), or any corpus used in studies on authorship analysis of e-mails, SMS, or forum messages? Commented Nov 20, 2015 at 6:18
  • @Claude Thanks, good idea. I'll look at it. Commented Nov 27, 2015 at 14:27

1 Answer

Two other datasets for sequential short text classification:

  • SwDA: Switchboard Dialog Act Corpus [Jurafsky et al., 1997]
  • MRDA: ICSI Meeting Recorder Dialog Act Corpus [Janin et al., 2003; Shriberg et al., 2004]

In terms of size, here is an overview:

[Image: overview of the data set sizes]

In case anyone is interested, we presented an overview of the state-of-the-art results on these three datasets in: Ji Young Lee and Franck Dernoncourt. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. NAACL 2016.

We found a few more interesting data sets for sequential short text classification mentioned in the literature, but we could not access them.

Other data sets:


References:

  • [Jurafsky et al., 1997] Dan Jurafsky, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. Institute of Cognitive Science. Technical Report, pages 97–102.
  • [Janin et al., 2003] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al. 2003. The ICSI meeting corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 1, pages I–364. IEEE.
  • [Shriberg et al., 2004] Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. Technical report, DTIC Document.
