دانشکده ی مهندسی برق و کامپیوتر، دانشگاه تهران، تهران، ایران
چکیده
تشخیص موضوع یکی از مسائل حوزه ی پردازش زبان طبیعی است که در سال های اخیر همواره مورد توجه بوده و از زوایای متفاوتی مورد پژوهش قرارگرفته است. هدف کلی در این مسئله خوشه بندی اسناد متنی در دسته های مختلف است بهگونهای که اسناد موجود در هر خوشه موضوع یکسانی داشته باشد. بخش قابلتوجهی از راهحلهای ارائهشده برای این مسئله از الگوریتم های خوشه بندی مانند K-means استفاده میکنند. علاوه بر روشهای مبتنی بر خوشهبندی اسناد، در دسته ای از پژوهش ها برای حل مسئله تشخیص موضوع از روش های مدلسازی موضوعی استفادهشده است. در این پژوهش ابتدا حساسیت قابلتوجه الگوریتم K-means به انتخاب مراکز اولیه بهصورت عملی نشان داده میشود و سپس روشی برای انتخاب هوشمندانه مراکز اولیه ارائه میشود که استفاده از آن کیفیت الگوریتم K-means را در مسئلهی تشخیص موضوع ارتقاء میدهد. روش پیشنهادشده برای تشخیص موضوع در این مقاله با بهره گیری از مدل سازی موضوعی (LDA (Latent Dirichlet Allocation، پس از انتخاب هوشمندانه مراکز اولیه، اقدام به خوشه بندی اسناد بر اساس موضوع آنها می کند. در روش ارائهشده فاصله اسناد بر اساس توزیع موضوع حاصل از LDA آنها محاسبهشده است. آزمایش ها نشان می دهند که استفاده از روش ارائهشده باعث بهبود چشمگیر کیفیت تشخیص موضوع نسبت به روش LDA در دو مجموعه از سه مجموعه دادگان مورد آزمایش می شود. همچنین در مقایسه با روش ++K-means برای انتخاب مراکز اولیه، در روش ارائهشدهی ما انتخاب مراکز اولیه در دو مجموعه دادگان همیشه مناسب تر بوده و احتمال بهتر بودن مراکز انتخابی در مجموعه دادگان دیگر مورد آزمایش برابر با 70 درصد است.
[1] J. Allan, “Introduction to topic detection and tracking,” in Topic detection and tracking, Springer, 2002, pp. 1–16.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
[3] M. E. Celebi, H. A. Kingravi, and P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm,” Expert Systems with Applications, vol. 40, no. 1, pp. 200–210, 2013.
[4] Y. Yang, T. Pierce, and J. Carbonell, “A study of retrospective and on-line event detection,” in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 28–36.
[5] J. Weng and B.-S. Lee, “Event Detection in Twitter.,” ICWSM, vol. 11, pp. 401–408, 2011.
[6] X. Zhang, X. Chen, Y. Chen, S. Wang, Z. Li, and J. Xia, “Event detection and popularity prediction in microblogging,” Neurocomputing, vol. 149, pp. 1469–1480, 2015.
[7] W. Zhang, T. Chen, G. Li, J. Pang, Q. Huang, and W. Gao, “Fusing cross-media for topic detection by dense keyword groups,” Neurocomputing, vol. 169, pp. 169–179, 2015.
[8] D. Spina, J. Gonzalo, and E. Amigó, “Learning similarity functions for topic detection in online reputation monitoring,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014, pp. 527–536.
[9] X. Wang and A. McCallum, “Topics over time: a non-Markov continuous-time model of topical trends,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 424–433.
[10] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,” in Proceedings of the first workshop on social media analytics, 2010, pp. 80–88.
[11] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proceedings of the 20th conference on Uncertainty in artificial intelligence, 2004, pp. 487–494.
[12] D. Ramage, S. T. Dumais, and D. J. Liebling, “Characterizing Microblogs with Topic Models.,” ICWSM, vol. 10, pp. 1–1, 2010.
[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, 2009, pp. 248–256.
[14] J.-F. Yeh, Y.-S. Tan, and C.-H. Lee, “Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation,” Neurocomputing, vol. 216, pp. 310–318, 2016.
[15] W. Sriurai, “Improving text categorization by using a topic model,” Advanced Computing, vol. 2, no. 6, p. 21, 2011.
[16] J. MacQueen and others, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967, vol. 1, pp. 281–297.
[17] T. F. Gonzalez, “Clustering to minimize the maximum intercluster distance,” Theoretical Computer Science, vol. 38, pp. 293–306, 1985.
[18] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007, pp. 1027–1035.
[19] T. Onoda, M. Sakai, and S. Yamada, “Careful seeding method based on independent components analysis for k-means clustering,” Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 1, pp. 51–59, 2012.
[20] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] G. N. Lance and W. T. Williams, “A general theory of classificatory sorting strategies II. Clustering systems,” The computer journal, vol. 10, no. 3, pp. 271–277, 1967.
[22] C. C. Aggarwal and C. K. Reddy, Data clustering: algorithms and applications. CRC press, 2013.
[23] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.
[24] D. Cai, X. Wang, and X. He, “Probabilistic dyadic data analysis with local and global consistency,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 105–112.
[25] D. Cai, Q. Mei, J. Han, and C. Zhai, “Modeling hidden topics on document manifold,” in Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 911–920.
[26] D. Cai, X. He, W. V. Zhang, and J. Han, “Regularized locality preserving indexing via spectral regression,” in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007, pp. 741–750.
[27] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624–1637, 2005.
[28] C. Cieri, D. Graff, M. Liberman, N. Martey, S. Strassel, and others, “The TDT-2 text and speech corpus,” in Proceedings of the DARPA Broadcast News workshop, 1999, pp. 57–60.
[29] W. Li, J. Joo, H. Qi, and S.-C. Zhu, “Joint Image-Text News Topic Detection and Tracking with And-Or Graph Representation,” arXiv preprint arXiv:1512.04701, 2015.
[30] D. D. Lewis, “Reuters-21578 text categorization test collection, distribution 1.0,” http://www. research. att. com/∼ lewis/reuters21578. html, 1997.
[31] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.,” DTIC Document, 1996.
[32] C. Wartena and R. Brussee, “Topic detection by clustering keywords,” in 2008 19th International Workshop on Database and Expert Systems Applications, 2008, pp. 54–58.
[33] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, “Learning to classify short and sparse text & web with hidden topics from large-scale data collections,” in Proceedings of the 17th international conference on World Wide Web, 2008, pp. 91–100.
[34] I. Bíró, J. Szabó, and A. A. Benczúr, “Latent dirichlet allocation in web spam filtering,” in Proceedings of the 4th international workshop on Adversarial information retrieval on the web, 2008, pp. 29–32.
[35] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” Computer, vol. 32, no. 8, pp. 68–75, 1999.