Query-Based Extractive Multi-Document Text Summarization Using Paraphrasing and Textual Entailment

Document type: Research article

Author

Computer Department, Zarand Higher Education Complex, Kerman, Iran

Abstract

A common problem with computer networks is the sheer volume of information they hold. Searching for and learning the content of text documents, the most widespread type of information on such networks, is therefore very difficult and sometimes impossible. Multi-document text summarization systems aim to produce a fixed-length summary of the input text documents while covering as much of their content as possible. This paper presents a new method for summarizing text documents that is based on paraphrase and textual entailment relations and formulates the task as an optimization problem. In this method, the sentences of the input documents are first clustered according to the paraphrase relation; the textual entailment score is then computed for the fraction of cluster heads that score highest with respect to the user's query, and the final score of each sentence is derived from it. Finally, the optimization problem is solved with two approaches, greedy and dynamic programming, and the final summary is produced by selecting the best sentences. Results of running the proposed system on standard datasets, evaluated with ROUGE, show that it improves on the performance of the best query-based extractive summarization systems by at least 2.5% on average.

Keywords


Article title [English]

Query-Based Extractive Multi-Document Summarization Using Paraphrasing and Textual Entailment

Author [English]

  • Ali Naserasadi
Computer Group, Zarand Industrial and Mining Faculty, Shahid Bahonar University, Kerman, Iran
Abstract [English]

One of the most common problems with computer networks is the large amount of information they contain. Searching for and learning about the content of textual documents, the most widespread form of information on such networks, is difficult and sometimes impossible. The goal of multi-document text summarization is to produce a summary of predefined length from the input documents while maximizing coverage of their content. This paper presents a new approach to text document summarization that is based on paraphrasing and textual entailment relations and formulates the task as an optimization problem. In this approach, the sentences of the input documents are first clustered according to the paraphrasing relation. An entailment score is then calculated for the fraction of cluster head sentences that score best against the user query, and the final score of each sentence is derived from it. Finally, the optimization problem is solved with greedy and dynamic programming approaches, and the final summary is generated by selecting the best sentences. Results of running the proposed system on standard datasets, evaluated with ROUGE, show that it outperforms state-of-the-art query-based extractive summarization systems by at least 2.5% on average.
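The abstract does not give the exact objective or constraints of the optimization problem. One plausible reading is a knapsack-style selection: each candidate sentence has a final score and a length, and the summary must fit a fixed length budget. The sketch below works under that assumption only. The function names `greedy_select` and `dp_select`, the score-per-word greedy rule, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import List


def greedy_select(scores: List[float], lengths: List[int], budget: int) -> List[int]:
    """Greedy heuristic (assumed): take sentences in descending score-per-word
    order, skipping any sentence that no longer fits the length budget.
    Lengths are assumed to be positive word counts."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] / lengths[i],
                   reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return sorted(chosen)


def dp_select(scores: List[float], lengths: List[int], budget: int) -> List[int]:
    """Dynamic programming (0/1 knapsack): maximise the total score subject to
    total length <= budget. Exact under the assumed formulation."""
    n = len(scores)
    # best[i][b] = best total score using the first i sentences and at most b words
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, s = lengths[i - 1], scores[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip sentence i-1
            if w <= b and best[i - 1][b - w] + s > best[i][b]:
                best[i][b] = best[i - 1][b - w] + s  # option 2: take it
    # Walk back through the table to recover which sentences were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)


# Toy input: hypothetical final scores and sentence lengths in words.
scores = [7.0, 5.0, 5.0]
lengths = [6, 5, 5]
greedy_pick = greedy_select(scores, lengths, budget=10)
dp_pick = dp_select(scores, lengths, budget=10)
```

On this toy input the greedy rule commits to the single highest-ratio sentence and then has no room left, while the dynamic program finds the two shorter sentences whose combined score is higher. This illustrates the usual trade-off between the two approaches: greedy selection is fast but can miss the best combination, whereas dynamic programming is exact for this formulation at a cost proportional to the number of sentences times the budget.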

Keywords [English]

  • Textual Document Summarization
  • Dynamic Programming
  • Textual Entailment

