Domain-Focused Summarization of Polarized Debates


Nattapong Sanchan

    Abstract:

    Due to the exponential growth of Internet use, textual content is increasingly published in online media. Every day, more news content, blog posts, and scientific articles are published online, opening opportunities for the text summarization research community to conduct research in those areas. Whilst there are freely accessible repositories for such content, online debates, which have recently become popular, have remained largely unexplored. This thesis addresses the challenge of applying text summarization to online debates. We take the view that the task of summarizing online debates should focus not only on summarization techniques but also on presenting the summaries in formats favored by users. In this thesis, we present how a summarization system is developed to generate online debate summaries in accordance with a designed output called Combination 2, which combines two summaries. The primary objective of the first summary, the Chart Summary, is to visualize the debate summary as a bar chart giving a high-level view. The chart consists of bars conveying clusters of salient sentences, labels giving short descriptions of the bars, and the numbers of salient sentences contributed by the two opposing sides. The other part, the Side-By-Side Summary, is linked to the Chart Summary and shows a more detailed summary of the online debate for the bar clicked by a user. The development of the summarization system is divided into three processes. In the first process, we create a gold standard dataset of online debates. The dataset contains a collection of debate comments that have been subjectively annotated with 5 judgments. We develop a summarization system with key features that help identify salient sentences in the comments. The sentences selected by the system are evaluated against the annotation results. We found that the system outperforms the baseline.
The second process begins with the generation of the Chart Summary from the salient sentences selected by the system. We propose a framework with two branches: one pairs term-based clustering with a term-based labeling method, and the other pairs X-means clustering with the MI labeling strategy. Our evaluation results indicate that the X-means approach is the better alternative for clustering. In the last process, we view the generation of the Side-By-Side Summary as a contradiction detection task. We create two debate entailment datasets derived from the two clustering approaches and annotate them with the Contradiction and Non-Contradiction relations. We develop a classifier and investigate combinations of features that maximize the F1 scores. Based on the proposed features, we discovered that combinations of between two and eight features yield good results.
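
As a rough illustration of the Combination 2 output described above, the following minimal Python sketch shows one way the data behind a Chart Summary and its linked Side-By-Side Summary could be organized. It is not the thesis implementation: the function names, the record fields (`cluster`, `side`, `text`), the side names `for` and `against`, and the example sentences are all assumptions made for illustration, and the clustering and labeling steps are assumed to have already been carried out.

```python
from collections import defaultdict

# Hypothetical input: salient sentences that have already been clustered,
# labeled, and attributed to one of the two opposing sides.
SALIENT = [
    {"cluster": "cost", "side": "for", "text": "It saves money in the long run."},
    {"cluster": "cost", "side": "for", "text": "Running costs are lower."},
    {"cluster": "cost", "side": "against", "text": "The upfront cost is too high."},
    {"cluster": "safety", "side": "against", "text": "It has not been tested enough."},
]


def build_chart_summary(sentences):
    """Count salient sentences per cluster label and per side (one bar per label)."""
    counts = defaultdict(lambda: {"for": 0, "against": 0})
    for s in sentences:
        counts[s["cluster"]][s["side"]] += 1
    return dict(counts)


def side_by_side(sentences, cluster):
    """Return the sentences behind one bar, grouped by side."""
    result = {"for": [], "against": []}
    for s in sentences:
        if s["cluster"] == cluster:
            result[s["side"]].append(s["text"])
    return result


chart = build_chart_summary(SALIENT)
detail = side_by_side(SALIENT, "cost")  # simulates a user clicking the "cost" bar
```

Here `chart` holds the per-side counts that each bar would display, and `detail` holds the sentences that would be shown side by side for the selected bar.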

    Keywords: online debate summarization, text summarization, X-means clustering, term-based clustering, ontology, summary design, summary representation, text mining, information extraction, sentence extraction, semantic similarity, inter-annotator agreement, relaxed inter-annotator agreement

    Download BibTeX entry: [BibTeX]
    @phdthesis{SanchanBA18,
    author = {Nattapong Sanchan},
    title = {Domain-Focused Summarization of Polarized Debates},
    school = {The University of Sheffield, UK},
    year = 2018, month = {5}, note = {http://etheses.whiterose.ac.uk/20878/}
    }
    Rich text bibliography entry (for copy & paste into a word processor):
    Sanchan, N. (2018). Domain-Focused Summarization of Polarized Debates. PhD thesis, The University of Sheffield, UK. http://etheses.whiterose.ac.uk/20878/.