Skip to content

Text Filtering

TF-IDF Based Filtering

The system employs TF-IDF (Term Frequency-Inverse Document Frequency) filtering to:

  • Remove the top 200 most common terms (potentially too generic)
  • Remove the bottom 50 least common terms (potentially noise)

Dictionary Filtering

Additional filtering is applied using Gensim's Dictionary:

  • Removes terms that appear in less than 2 documents
  • Removes terms that appear in more than 90% of documents