Query-Based Extractive Multi-Document Summarization Using Paraphrasing and Textual Entailment

Document Type: Original Article

Author

Computer Group, Zarnad Industrial and Mining Faculty, Shahid Bahonar University, Kerman, Iran

Abstract

One of the most common problems with computer networks is the sheer volume of information they carry. Searching through textual documents, the most widespread form of information on such networks, and grasping their content is difficult and sometimes impossible. The goal of multi-document text summarization is to produce a summary of a pre-defined length from the input documents while maximizing coverage of their content. This paper presents a new approach to text summarization that is based on paraphrasing and textual entailment relations and formulates summarization as an optimization problem. In this approach, the sentences of the input documents are first clustered according to the paraphrasing relation; the entailment score and the final score are then computed for a fraction of the cluster-head sentences that score best with respect to the user query. Finally, the optimization problem is solved with greedy and dynamic programming approaches, and the final summary is generated from the selected sentences. Evaluation of the proposed system on standard datasets with the ROUGE toolkit shows that it outperforms state-of-the-art systems by at least 2.5% on average.
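The selection step described in the abstract amounts to choosing cluster-head sentences that maximize total score within a fixed summary-length budget. The sketch below illustrates that selection step in Python under simplifying assumptions: the paraphrase clustering and the per-sentence scores (query relevance combined with entailment) are taken as given placeholders, and the greedy and knapsack-style dynamic programming variants stand in for the paper's exact formulation rather than reproducing it.

```python
# Minimal sketch of sentence selection under a length budget, assuming
# paraphrase clusters and per-sentence scores have already been computed.
# The Sentence fields and the scoring itself are placeholders, not the
# paper's exact formulas.

from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    length: int   # length in words, counted against the summary budget
    score: float  # combined query-relevance + entailment score (placeholder)


def greedy_select(candidates: List[Sentence], budget: int) -> List[Sentence]:
    """Greedy variant: repeatedly take the best score-per-word sentence that fits."""
    chosen, remaining = [], budget
    for s in sorted(candidates, key=lambda s: s.score / s.length, reverse=True):
        if s.length <= remaining:
            chosen.append(s)
            remaining -= s.length
    return chosen


def dp_select(candidates: List[Sentence], budget: int) -> List[Sentence]:
    """Dynamic-programming variant: 0/1 knapsack over the length budget."""
    n = len(candidates)
    # best[i][b] = best total score using the first i sentences within budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, s in enumerate(candidates, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # option 1: skip sentence i
            if s.length <= b:                    # option 2: take sentence i
                best[i][b] = max(best[i][b],
                                 best[i - 1][b - s.length] + s.score)
    # Backtrack to recover which sentences were selected.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(candidates[i - 1])
            b -= candidates[i - 1].length
    return list(reversed(chosen))


if __name__ == "__main__":
    # Hypothetical cluster-head sentences with placeholder lengths and scores.
    pool = [Sentence("Sentence A ...", 12, 3.1),
            Sentence("Sentence B ...", 20, 4.0),
            Sentence("Sentence C ...", 9, 2.7)]
    summary = dp_select(pool, budget=25)
    print(" ".join(s.text for s in summary))
```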

Keywords

