Document Clustering using Linear Discriminant Analysis and Support Vector Machine

Authors

  • P Saidesh Kumar, Dr P Vijayapal Reddy

Keywords:

Document clustering, TF-IDF, Hash vectorization, Latent Dirichlet Allocation, Support vector machine.

Abstract

Document clustering, the application of cluster analysis to text documents, is useful in a variety of text mining and information retrieval applications, such as automatically organizing documents, extracting topics from documents, and quickly retrieving or filtering information. This paper presents a document clustering approach that uses Term Frequency-Inverse Document Frequency (TF-IDF) and hash vectorization for text vectorization. TF-IDF captures not only how frequently words occur in the corpus but also how significant they are, so the model carries information about both the most and the least important words. Hash vectorization is fast and requires very little memory. Latent Dirichlet Allocation (LDA), a topic modeling technique, is applied to the extracted features for dimensionality reduction. The reduced features are then classified using a multiclass support vector machine.
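To make the two vectorization schemes named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The whitespace tokenizer, the tf * log(N / df) weighting variant, the CRC32 hash, the bucket count of 16, and the three sample documents are all assumptions chosen for illustration. The LDA and SVM stages are not shown.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stop-word removal.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def hash_vectorize(text, n_features=16):
    """Map token counts into a fixed-length vector with the hashing trick."""
    vec = [0] * n_features
    for token in tokenize(text):
        # CRC32 is stable across runs, unlike Python's built-in hash() on str.
        vec[zlib.crc32(token.encode("utf-8")) % n_features] += 1
    return vec


# Hypothetical toy corpus, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
weights = tfidf(docs)
hashed = [hash_vectorize(d) for d in docs]
```

In this sketch a term that occurs in every document receives weight zero, while a term confined to one document is weighted up, which is how TF-IDF separates significant from insignificant words. The hashed vectors need no stored vocabulary, which is the source of the low memory use, at the cost of occasional collisions between unrelated tokens.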

Published

2023-05-02

How to Cite

P Saidesh Kumar, Dr P Vijayapal Reddy. (2023). Document Clustering using Linear Discriminant Analysis and Support Vector Machine. SJIS-P, 35(1), 1272–1281. Retrieved from http://sjis.scandinavian-iris.org/index.php/sjis/article/view/509

Section

Articles