ANALYZING ENSEMBLE LEARNING TECHNIQUES FOR DETECTING REDUNDANT QUESTIONS ON QUORA

Authors

  • B. Yelure, Department of Computer Science & Engg., Government College of Engineering Kolhapur, INDIA
  • S. Pawar, Department of Information Technology, Government College of Engineering Karad, INDIA
  • S. Thorat, Department of Information Technology, Government College of Engineering Karad, INDIA
  • S. Patil, Department of Computer Engineering, DYPCOE, Kolhapur, INDIA
  • A. Patokar, Department of Information Technology, Government College of Engineering Karad, INDIA
  • P. Jawade, Department of Computer Engineering, Government College of Engineering Nagpur, INDIA

DOI:

https://doi.org/10.4314/njt.2026.6109

Keywords:

Machine learning, natural language processing, redundant questions, feature extraction

Abstract

Redundant questions are a major concern for community question-answering platforms such as Quora and Stack Overflow. They reduce the efficiency of information retrieval, hinder effective statistical categorization, and limit the availability of diverse responses for users. This research therefore focuses on identifying duplicate, or redundant, questions by applying techniques from Machine Learning and Natural Language Processing. A dataset of more than 400,000 question pairs retrieved from Quora is preprocessed using tokenization and stop word removal. Bag-of-Words (BoW) feature extraction then transforms the preprocessed question pairs into structured numerical vectors to optimize the performance of the ensemble classifiers. Algorithms including Decision Tree, Random Forest, XGBoost, and AdaBoost are applied to the dataset to detect duplicate questions. Random Forest outperformed the Decision Tree, XGBoost, and AdaBoost classifiers, with an accuracy of 81.69%.
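
The preprocessing and Bag-of-Words steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop word list, the function names, the toy question pairs, and the choice to concatenate the two questions' count vectors into one feature row are all assumptions made for illustration, since the abstract does not specify these details.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; the abstract does not give one).
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "i", "to", "of", "what", "how", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(text):
    """Tokenize the text and drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def build_vocabulary(questions):
    """Map every distinct remaining token in the corpus to a column index."""
    tokens = sorted({tok for q in questions for tok in preprocess(q)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bow_vector(text, vocab):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(preprocess(text))
    return [counts.get(tok, 0) for tok in vocab]


def pair_features(q1, q2, vocab):
    """Concatenate both BoW vectors into one feature row (assumed pairing scheme)."""
    return bow_vector(q1, vocab) + bow_vector(q2, vocab)


# Hypothetical question pairs, not taken from the Quora dataset.
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("How do I learn Python?", "What is the capital of France?"),
]
vocab = build_vocabulary([q for pair in pairs for q in pair])
features = [pair_features(q1, q2, vocab) for q1, q2 in pairs]
```

Each row in `features` is a fixed-length numerical vector, so rows of this kind, together with duplicate/non-duplicate labels, could be passed to a tree-based classifier such as scikit-learn's `RandomForestClassifier`.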

Published

2026-05-13

Section

SI: Advances in Modelling, Simulation, and AI/ML for Multi-Disciplinary Engineering Applications

How to Cite

ANALYZING ENSEMBLE LEARNING TECHNIQUES FOR DETECTING REDUNDANT QUESTIONS ON QUORA. (2026). Nigerian Journal of Technology, 45(S1). https://doi.org/10.4314/njt.2026.6109