Document Clustering using Linear Discriminant Analysis and Support Vector Machine
Keywords:
Document clustering, TF-IDF, Hash vectorization, Latent Dirichlet Allocation, Support vector machine.
Abstract
Document clustering, the application of cluster analysis to text documents, is useful in a variety of text mining and information retrieval applications, such as automatically organizing documents, extracting topics from documents, and quickly retrieving or filtering information. This paper presents a document clustering approach that uses Term Frequency-Inverse Document Frequency (TF-IDF) and hash vectorization for text vectorization. TF-IDF captures not only how frequently a word occurs in the corpus but also how significant that word is, so the model carries information about both the most important and the least important words. Hash vectorization is fast and requires very little memory. Latent Dirichlet Allocation (LDA), a topic modeling technique, is applied to the extracted features for dimensionality reduction. The reduced features are then classified using a multiclass support vector machine (SVM).
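
The two vectorization schemes named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the paper's implementation: the whitespace tokenizer, the unsmoothed weighting tf * log(N / df), the CRC32 bucket hash, the bucket count of 16, the toy documents, and the function names `tokenize`, `tfidf_vectors`, and `hashed_vector` are all assumptions made for illustration. The LDA and SVM stages are not shown.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def hashed_vector(text, n_buckets=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_buckets
    for token in tokenize(text):
        # CRC32 is deterministic across runs, unlike Python's built-in str hash.
        vec[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1
    return vec


# Hypothetical toy corpus, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
tfidf = tfidf_vectors(docs)
hashed = [hashed_vector(d) for d in docs]
```

In this sketch a word that appears in every document, such as "the", receives a weight of zero, while a word unique to one document, such as "mat", is weighted above a word shared by two documents, such as "cat". The hashed vectors have a fixed length regardless of vocabulary size, which is where the memory saving comes from; the cost is that distinct words may collide in the same bucket.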