انتخاب هوشمندانه مراکز اولیه در الگوریتم خوشه بندی K-means به‌منظور بهبود تشخیص موضوع

آروین, سپهر; ورداسبی, علی; فیلی, هشام; شاکری, آزاده

انتخاب هوشمندانه مراکز اولیه در الگوریتم خوشه بندی K-means به‌منظور بهبود تشخیص موضوع

نویسندگان

سپهر آروین

علی ورداسبی

هشام فیلی

آزاده شاکری

دانشکده ی مهندسی برق و کامپیوتر، دانشگاه تهران، تهران، ایران

چکیده

تشخیص موضوع یکی از مسائل حوزه ی پردازش زبان طبیعی است که در سال های اخیر همواره مورد توجه بوده و از زوایای متفاوتی مورد پژوهش قرارگرفته است. هدف کلی در این مسئله خوشه بندی اسناد متنی در دسته های مختلف است به‌گونه‌ای که اسناد موجود در هر خوشه موضوع یکسانی داشته باشد. بخش قابل‌توجهی از راه‌حل‌های ارائه‌شده برای این مسئله از الگوریتم های خوشه بندی مانند K-means استفاده می‌کنند. علاوه بر روش‎های مبتنی بر خوشه‎بندی اسناد، در دسته ای از پژوهش ها برای حل مسئله تشخیص موضوع از روش های مدلسازی موضوعی استفاده‌شده است.
در این پژوهش ابتدا حساسیت قابل‌توجه الگوریتم K-means به انتخاب مراکز اولیه به‌صورت عملی نشان داده می‌شود و سپس روشی برای انتخاب هوشمندانه مراکز اولیه ارائه می‌شود که استفاده از آن کیفیت الگوریتم K-means را در مسئله‌ی تشخیص موضوع ارتقاء می‌دهد. روش پیشنهادشده برای تشخیص موضوع در این مقاله با بهره گیری از مدل سازی موضوعی (LDA (Latent Dirichlet Allocation، پس از انتخاب هوشمندانه مراکز اولیه، اقدام به خوشه بندی اسناد بر اساس موضوع آنها می کند. در روش ارائه‌شده فاصله اسناد بر اساس توزیع موضوع حاصل از LDA آنها محاسبه‌شده است. آزمایش ها نشان می دهند که استفاده از روش ارائه‌شده باعث بهبود چشمگیر کیفیت تشخیص موضوع نسبت به روش LDA در دو مجموعه از سه مجموعه دادگان مورد آزمایش می شود. همچنین در مقایسه با روش ++K-means برای انتخاب مراکز اولیه، در روش ارائه‌شده‎ی ما انتخاب مراکز اولیه در دو مجموعه دادگان همیشه مناسب تر بوده و احتمال بهتر بودن مراکز انتخابی در مجموعه دادگان دیگر مورد آزمایش برابر با 70 درصد است.

کلیدواژه‌ها

تشخیص موضوع

LDA (Latent Dirichlet Allocation)

خوشه بندی

تعیین مراکز اولیه

K-means

معیار فاصله

Silhouette

[1] J. Allan, “Introduction to topic detection and tracking,” in Topic detection and tracking, Springer, 2002, pp. 1–16.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
[3] M. E. Celebi, H. A. Kingravi, and P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm,” Expert Systems with Applications, vol. 40, no. 1, pp. 200–210, 2013.
[4] Y. Yang, T. Pierce, and J. Carbonell, “A study of retrospective and on-line event detection,” in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 28–36.
[5] J. Weng and B.-S. Lee, “Event Detection in Twitter.,” ICWSM, vol. 11, pp. 401–408, 2011.
[6] X. Zhang, X. Chen, Y. Chen, S. Wang, Z. Li, and J. Xia, “Event detection and popularity prediction in microblogging,” Neurocomputing, vol. 149, pp. 1469–1480, 2015.
[7] W. Zhang, T. Chen, G. Li, J. Pang, Q. Huang, and W. Gao, “Fusing cross-media for topic detection by dense keyword groups,” Neurocomputing, vol. 169, pp. 169–179, 2015.
[8] D. Spina, J. Gonzalo, and E. Amigó, “Learning similarity functions for topic detection in online reputation monitoring,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014, pp. 527–536.
[9] X. Wang and A. McCallum, “Topics over time: a non-Markov continuous-time model of topical trends,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 424–433.
[10] L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,” in Proceedings of the first workshop on social media analytics, 2010, pp. 80–88.
[11] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proceedings of the 20th conference on Uncertainty in artificial intelligence, 2004, pp. 487–494.
[12] D. Ramage, S. T. Dumais, and D. J. Liebling, “Characterizing Microblogs with Topic Models.,” ICWSM, vol. 10, pp. 1–1, 2010.
[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, 2009, pp. 248–256.
[14] J.-F. Yeh, Y.-S. Tan, and C.-H. Lee, “Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation,” Neurocomputing, vol. 216, pp. 310–318, 2016.
[15] W. Sriurai, “Improving text categorization by using a topic model,” Advanced Computing, vol. 2, no. 6, p. 21, 2011.
[16] J. MacQueen and others, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967, vol. 1, pp. 281–297.
[17] T. F. Gonzalez, “Clustering to minimize the maximum intercluster distance,” Theoretical Computer Science, vol. 38, pp. 293–306, 1985.
[18] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007, pp. 1027–1035.
[19] T. Onoda, M. Sakai, and S. Yamada, “Careful seeding method based on independent components analysis for k-means clustering,” Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 1, pp. 51–59, 2012.
[20] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] G. N. Lance and W. T. Williams, “A general theory of classificatory sorting strategies II. Clustering systems,” The computer journal, vol. 10, no. 3, pp. 271–277, 1967.
[22] C. C. Aggarwal and C. K. Reddy, Data clustering: algorithms and applications. CRC press, 2013.
[23] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.
[24] D. Cai, X. Wang, and X. He, “Probabilistic dyadic data analysis with local and global consistency,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 105–112.
[25] D. Cai, Q. Mei, J. Han, and C. Zhai, “Modeling hidden topics on document manifold,” in Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 911–920.
[26] D. Cai, X. He, W. V. Zhang, and J. Han, “Regularized locality preserving indexing via spectral regression,” in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007, pp. 741–750.
[27] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624–1637, 2005.
[28] C. Cieri, D. Graff, M. Liberman, N. Martey, S. Strassel, and others, “The TDT-2 text and speech corpus,” in Proceedings of the DARPA Broadcast News workshop, 1999, pp. 57–60.
[29] W. Li, J. Joo, H. Qi, and S.-C. Zhu, “Joint Image-Text News Topic Detection and Tracking with And-Or Graph Representation,” arXiv preprint arXiv:1512.04701, 2015.
[30] D. D. Lewis, “Reuters-21578 text categorization test collection, distribution 1.0,” http://www. research. att. com/∼ lewis/reuters21578. html, 1997.
[31] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.,” DTIC Document, 1996.
[32] C. Wartena and R. Brussee, “Topic detection by clustering keywords,” in 2008 19th International Workshop on Database and Expert Systems Applications, 2008, pp. 54–58.
[33] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, “Learning to classify short and sparse text & web with hidden topics from large-scale data collections,” in Proceedings of the 17th international conference on World Wide Web, 2008, pp. 91–100.
[34] I. Bíró, J. Szabó, and A. A. Benczúr, “Latent dirichlet allocation in web spam filtering,” in Proceedings of the 4th international workshop on Adversarial information retrieval on the web, 2008, pp. 29–32.
[35] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” Computer, vol. 32, no. 8, pp. 68–75, 1999.

دوره 16، شماره 2
پاییز و زمستان
پاییز 1397

XML

اصل مقاله 548.99 K

تعداد مشاهده مقاله	377
تعداد دریافت فایل اصل مقاله	83

انتخاب هوشمندانه مراکز اولیه در الگوریتم خوشه بندی K-means به‌منظور بهبود تشخیص موضوع

دوره 16، شماره 2پاییز و زمستانپاییز 1397

فایل ها

هم رسانی

ارجاع به این مقاله

آمار

دوره 16، شماره 2
پاییز و زمستان
پاییز 1397