Text Preprocessing

The preprocessing pipeline includes several steps to clean and prepare the text data:

1. Tokenization

Using NLTK's RegexpTokenizer to split text into tokens
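
A minimal sketch of this step follows. The pattern `\w+` (runs of letters, digits, and underscores) and the sample sentence are assumptions for illustration; the pipeline's actual pattern is not stated here.

```python
from nltk.tokenize import RegexpTokenizer

# Assumed pattern: keep runs of word characters, drop punctuation and whitespace.
tokenizer = RegexpTokenizer(r"\w+")

text = "Topic models find 3 themes in text-data!"
tokens = tokenizer.tokenize(text)
print(tokens)
```

With this pattern, punctuation is discarded and hyphenated words are split into separate tokens.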

2. Normalization

Converting to lowercase
Removing numbers
Removing short tokens (length < 2)
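
The three normalization rules could be sketched as below. The function name `normalize` is hypothetical, and "removing numbers" is assumed to mean dropping tokens that consist only of digits; tokens that merely contain a digit are kept.

```python
def normalize(tokens, min_length=2):
    """Lowercase tokens, drop purely numeric ones, drop tokens shorter than min_length."""
    normalized = []
    for token in tokens:
        token = token.lower()
        if token.isnumeric():        # assumption: only all-digit tokens count as numbers
            continue
        if len(token) < min_length:  # length < 2 removes single-character tokens
            continue
        normalized.append(token)
    return normalized


sample = ["Topic", "Models", "3", "a", "2024", "LDA", "x1"]
print(normalize(sample))
```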

3. Stop Word Removal

Using NLTK's English stop words
Removing custom high-frequency words
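
One way to combine the two stop-word sources is sketched here. The helper names and the sample custom words are hypothetical. Loading NLTK's list requires the `stopwords` corpus, which the loader downloads on first use.

```python
def load_english_stop_words():
    """Return NLTK's English stop words as a set (downloads the corpus if missing)."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words("english"))


def remove_stop_words(tokens, stop_words, custom_words=()):
    """Drop tokens found in either the standard list or the custom list."""
    blocked = set(stop_words) | set(custom_words)
    return [token for token in tokens if token not in blocked]


# Usage with a small hand-written list; swap in load_english_stop_words() in practice.
tokens = ["the", "model", "finds", "said", "topics"]
print(remove_stop_words(tokens, {"the", "a"}, custom_words={"said"}))
```

Keeping the filter separate from the loader makes it easy to pass in the custom high-frequency words without touching the NLTK list.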

4. Lemmatization

Using WordNet lemmatizer to reduce words to their base form
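
A sketch of this step, assuming NLTK's `WordNetLemmatizer` with its default part of speech (noun). The helper names are hypothetical, and the `wordnet` corpus must be available, so the builder downloads it on first use.

```python
def build_lemmatizer():
    """Create a WordNet lemmatizer (downloads the WordNet corpus if missing)."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    return WordNetLemmatizer()


def lemmatize_tokens(tokens, lemmatizer):
    """Map each token to its base form using the given lemmatizer."""
    # WordNetLemmatizer.lemmatize treats words as nouns unless a pos argument is given.
    return [lemmatizer.lemmatize(token) for token in tokens]


# Typical use:
#   lemmatizer = build_lemmatizer()
#   lemmatize_tokens(["topics", "documents"], lemmatizer)
```

Because the default part of speech is noun, verb forms such as "running" are left unchanged unless a part-of-speech tag is supplied.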

5. Bigram Detection

Adding meaningful word pairs that frequently occur together
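
The idea can be illustrated with a simple count-based sketch. This is a simplification and not necessarily the method the pipeline uses: it treats any adjacent pair seen at least `min_count` times as a bigram, whereas a statistical scorer would also weigh how often the two words appear apart. The function name, the `min_count` value, and the underscore join are assumptions.

```python
from collections import Counter


def add_bigrams(docs, min_count=2):
    """Append 'word1_word2' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))  # adjacent token pairs

    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    augmented = []
    for doc in docs:
        bigrams = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        augmented.append(doc + bigrams)  # keep the original tokens, add the pairs
    return augmented


docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "topic"],
    ["topic", "model"],
]
print(add_bigrams(docs))
```

Bigrams are appended rather than substituted, so the individual words remain available to the topic model.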

6. High-Frequency Word Filtering

The system automatically identifies and removes high-frequency words that appear across multiple topics (frequency > 10) to improve topic distinctiveness.
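
The rule above leaves some details open, so the following is only a sketch under stated assumptions: "frequency" is taken to mean the total count of a word in the corpus, and "across multiple topics" is taken to mean the word is among the top words of at least two topics. The function names and the `topic_top_words` input (one list of top words per topic) are hypothetical.

```python
from collections import Counter


def find_high_frequency_words(docs, topic_top_words, max_frequency=10, min_topics=2):
    """Return words that are shared by several topics and exceed max_frequency."""
    corpus_counts = Counter(token for doc in docs for token in doc)
    # Count each word once per topic, even if a topic lists it twice.
    topic_counts = Counter(word for words in topic_top_words for word in set(words))
    return {
        word
        for word, n_topics in topic_counts.items()
        if n_topics >= min_topics and corpus_counts[word] > max_frequency
    }


def filter_tokens(docs, words_to_remove):
    """Drop the flagged words from every document."""
    return [[t for t in doc if t not in words_to_remove] for doc in docs]


docs = [["data"] * 6 + ["model"], ["data"] * 5 + ["topic"]]
topic_top_words = [["data", "model"], ["data", "topic"]]

flagged = find_high_frequency_words(docs, topic_top_words)
print(flagged, filter_tokens(docs, flagged))
```

The flagged words can then be added to the custom stop-word list before the topic model is retrained.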
