Gold Standard Online Debates Summaries:
Salient Sentence Selection Dataset (SSSD)

 

    About this dataset:

    This dataset consists of comments collected from 11 online debate topics. Each debate topic consists of two opposing sides: Agree (Yes) and Disagree (No). It was primarily used in the salient sentence selection process, one of the processes in online debate summarization. Each comment was annotated with 5 judgments, based on a compression rate of 20%. Additionally, one sentence was also annotated from each comment, as we considered that each comment should express a salient piece of information. More information related to this corpus can be found in the paper.

    XML Format

    Online debate comments are stored in the XML file format, as shown in the figure below.



    From the figure, each debate contains several comments (indicated by <comment id = "">). Each comment is split into sentences (indicated by <sentence id = "">).
    In addition, each comment was annotated at a compression rate of 20%, and the annotation is stored in the <annotation> tag. Other useful information that was also collected is the side of each comment (Agree or Disagree with the debate topic) and the number of likes that the comment received.
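
    As a rough illustration of reading this structure, the sketch below parses a small hand-written sample with Python's standard xml.etree.ElementTree module. Only the tag names <comment id="">, <sentence id=""> and <annotation> come from the description above. The root element name (debate), the nesting of <annotation> inside <comment>, the content of <annotation>, and the sample sentences are all assumptions made for this example, so the real file should be checked before reusing it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. The <debate> root, the placement of <annotation>
# inside <comment>, and its content are ASSUMPTIONS, not the real file layout.
SAMPLE_XML = """
<debate>
  <comment id="10">
    <sentence id="1">Placeholder sentence one.</sentence>
    <sentence id="2">Placeholder sentence two.</sentence>
    <annotation>2</annotation>
  </comment>
</debate>
"""


def read_comments(xml_text):
    """Return one dict per <comment>: its id, its sentences, and the raw annotation text."""
    root = ET.fromstring(xml_text.strip())
    comments = []
    for comment in root.iter("comment"):
        # Map each sentence id (kept as a string) to its text.
        sentences = {
            s.get("id"): (s.text or "").strip() for s in comment.iter("sentence")
        }
        # Search at any depth, since the exact nesting is not documented here.
        annotation = comment.find(".//annotation")
        comments.append(
            {
                "id": comment.get("id"),
                "sentences": sentences,
                "annotation": (annotation.text or "").strip()
                if annotation is not None
                else None,
            }
        )
    return comments


comments = read_comments(SAMPLE_XML)
for c in comments:
    print(c["id"], len(c["sentences"]), c["annotation"])
```

    The annotation text is returned as a raw string on purpose, because how the five judgments are encoded inside <annotation> is not described here.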



    CSV Format

    The dataset contains the following attributes:
    1. recordid: the record identification number which uniquely identifies each sentence.
    2. debateid: the identification number of each debate topic.
    3. debatetopicname: the name of a debate topic.
    4. commentid: the identification number of each debate comment.
    5. sentenceid: the identification number of a sentence in each debate comment.
    6. sentence: a sentence in each comment.
    7. side: a side (stance) of a debate comment.
    8. like: the number of votes that support this comment.
    9. annotation: the annotation for a comment.

    The comments in the CSV version are the same as those in the XML format. The difference is how the comments are organized, as shown in the example below. The boundary of each comment is marked by the sentenceid column. For instance, commentid 10 is in debateid DTP03 and contains five sentences (rows 0-4). For this comment, the 5 annotators manually selected sentenceid 2, 2, 2, 2, and 5, respectively.
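
    The row layout described above can be sketched in Python as follows. The column names are the attributes listed earlier. The sample rows, the topic name, the sentence texts, the side and like values, and the empty annotation cells are placeholders, because the encoding of the annotation column is not specified here. The majority vote is shown only as one possible way to aggregate the five judgments and is not necessarily the procedure used in the paper.

```python
import csv
import io
from collections import Counter, defaultdict

# Placeholder rows for commentid 10 in debateid DTP03 (five sentences).
# Sentence texts, topic name, side, and like values are made up; the
# annotation column is left empty because its encoding is not documented here.
SAMPLE_CSV = """recordid,debateid,debatetopicname,commentid,sentenceid,sentence,side,like,annotation
0,DTP03,Placeholder topic,10,1,Placeholder sentence one.,Agree,0,
1,DTP03,Placeholder topic,10,2,Placeholder sentence two.,Agree,0,
2,DTP03,Placeholder topic,10,3,Placeholder sentence three.,Agree,0,
3,DTP03,Placeholder topic,10,4,Placeholder sentence four.,Agree,0,
4,DTP03,Placeholder topic,10,5,Placeholder sentence five.,Agree,0,
"""


def group_by_comment(csv_file):
    """Group sentence rows into comments keyed by (debateid, commentid)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (row["debateid"], row["commentid"])
        grouped[key].append((int(row["sentenceid"]), row["sentence"]))
    return dict(grouped)


def majority_sentence(selected_ids):
    """Return (sentenceid, votes) for the most frequently selected sentence."""
    sentence_id, votes = Counter(selected_ids).most_common(1)[0]
    return sentence_id, votes


comments = group_by_comment(io.StringIO(SAMPLE_CSV))
print(len(comments[("DTP03", "10")]))         # number of sentences in the comment
print(majority_sentence([2, 2, 2, 2, 5]))     # judgments from the example above
```

    For the judgments 2, 2, 2, 2, and 5 from the example, the vote picks sentenceid 2 with four of the five selections.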

    Download this dataset:

    If you use this dataset in your work, please cite our paper below.
    By downloading this dataset, I agree that the data will be used for educational purposes only.

    [DOWNLOAD XML FORMAT]

    [DOWNLOAD CSV FORMAT]

    References:

    Download this paper: [PDF]
    Download BibTeX entry: [BibTex]
    @InProceedings{sanchan20188,
      author    = {Sanchan, Nattapong  and  Aker, Ahmet  and  Bontcheva, Kalina},
      title     = {Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data},
      booktitle = {Computational Linguistics and Intelligent Text Processing},
      editor    = {Gelbukh, Alexander}, 
      year      = {2018},
      address   = {Cham},
      volume    = {10762},
      publisher = {Springer International Publishing},
      pages     = {495--505},
      abstract  = {Usage of online textual media is steadily increasing. Daily, more and more news stories, blog posts and scientific articles are added to the online volumes. These are all freely accessible and have been employed extensively in multiple research areas, e.g. automatic text summarization, information retrieval, information extraction, etc. Meanwhile, online debate forums have recently become popular, but have remained largely unexplored. For this reason, there are no sufficient resources of annotated debate data available for conducting research in this genre. In this paper, we collected and annotated debate data for an automatic summarization task. Similar to extractive gold standard summary generation our data contains sentences worthy to include into a summary. Five human annotators performed this task. Inter-annotator agreement, based on semantic similarity, is 36{\%} for Cohen's kappa and 48{\%} for Krippendorff's alpha. Moreover, we also implement an extractive summarization system for online debates and discuss prominent features for the task of summarizing online debate data automatically.},
      isbn      = {978-3-319-77116-8}
       }
    Rich text bibliography entry (for copy & paste into a word processor):
    Sanchan N., Aker A., Bontcheva K. (2018) Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data. In: Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham