A comprehensive NLP toolkit for multi-language sentiment analysis, toxicity detection, and text detoxification using state-of-the-art language models.
This toolkit provides a set of modular NLP tools designed to:
- Detect languages in text using FastText
- Translate non-English text to English for unified analysis
- Analyze sentiment (positive/negative/mixed) with detailed explanations
- Detect toxic content with contextual understanding
- Detoxify text by rewriting toxic content in a polite, constructive manner
The system is designed to handle batch processing of text datasets with robust error handling, caching, and adaptive prompting to optimize results from language models.
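The exact entry points for these steps live under `src/` (see the project layout below). As a rough, standalone illustration of the first stage (language detection with FastText), the following sketch uses the `fasttext` library directly; the helper function and confidence threshold are illustrative and not taken from the project code.

```python
import fasttext

# Load the 176-language identification model shipped in models/
# (illustrative wrapper; the project handles this in src/language_utils.py)
lang_model = fasttext.load_model("models/lid.176.bin")

def detect_language(text):
    """Return the predicted language code and its confidence."""
    # fastText expects single-line input; labels look like "__label__en"
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

lang, confidence = detect_language("Ceci n'est pas une phrase en anglais.")
if lang != "en" and confidence > 0.5:
    print(f"Detected '{lang}' ({confidence:.2f}); route to the translation step")
```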
```
nlp-sentiment-toxicity/
├── src/                      # Source code
│   ├── __init__.py           # Package initialization
│   ├── main.py               # Entry point
│   ├── config.py             # Configuration settings
│   ├── models.py             # Model loading utilities
│   ├── language_utils.py     # Language detection and translation
│   ├── text_processor.py     # Text processing and batch handling
│   ├── data_processor.py     # Data processing functions
│   ├── analysis.py           # Core analysis functions
│   ├── sentiment.py          # Sentiment analysis module
│   ├── toxicity.py           # Toxicity detection and detoxification
│   ├── processing.py         # Text preprocessing and rule-based processing
│   └── output_parser.py      # Output parsing and fixing
│
├── tests/                    # Test modules
│   ├── __init__.py           # Test package initialization
│   ├── test_config.py        # Test configuration
│   ├── test_sentiment.py     # Sentiment analysis tests
│   ├── test_toxicity.py      # Toxicity analysis tests
│   └── run_all_tests.py      # Test runner
│
├── models/                   # Stores model files
│   └── lid.176.bin           # FastText language identification model
│
├── data/                     # Input datasets
│   ├── multilingual-sentiment-test-solutions.csv
│   └── toxic-test-solutions.csv
│
├── output/                   # Results output directory
│   └── ...                   # Generated result files
│
├── test_output/              # Test results and logs
│   ├── logs/                 # Test log files
│   └── ...                   # Test output files
│
└── README.md                 # Project documentation
```
- Python 3.8+: Core programming language
- PyTorch: Deep learning framework for model inference
- Transformers (Hugging Face): For accessing and using pre-trained models
- FastText: For language detection
- LangChain: For agent-based processing and tool integration
- Pandas: For data manipulation and CSV processing
- Ollama: For accessing lightweight LLMs
- Logging: For comprehensive tracking and debugging
The toolkit uses several specialized models:
- Language Detection: FastText's `lid.176.bin` model (supports 176 languages)
- Translation: `UBC-NLP/toucan-base` (Multilingual MT5-based model)
- Sentiment & Toxicity Analysis: `ibm-granite/granite-3.0-2b-instruct` (Instruction-tuned LLM)
- Agent-based Processing: `llama3.2:1b` via Ollama (Local lightweight LLM)
Each model is used for its specific strengths to create a robust pipeline for text analysis.
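How the Hugging Face models are actually loaded is defined in `src/models.py`; the snippet below is only a hedged sketch of the typical `transformers` loading pattern for the two model IDs listed above, with the device and dtype choices assumed rather than taken from the project.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Translation model (MT5-based encoder-decoder)
trans_tok = AutoTokenizer.from_pretrained("UBC-NLP/toucan-base")
trans_model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/toucan-base").to(device)

# Instruction-tuned LLM used for sentiment and toxicity prompts
llm_tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-instruct")
llm = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-instruct", torch_dtype=torch.float16
).to(device)
```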
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for faster processing)
- Ollama installed locally (for agent-based processing)
1. Clone the repository:

   ```
   git clone https://github.com/yourusername/nlp-sentiment-toxicity.git
   cd nlp-sentiment-toxicity
   ```

2. Create and activate a virtual environment:

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```
   pip install torch pandas tqdm transformers fasttext langchain-ollama
   ```

4. Download required models:

   ```
   # FastText language model needs to be downloaded manually to the models/ directory
   mkdir -p models
   curl -o models/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
   ```

5. Set up Ollama (if not already installed):

   ```
   # Follow instructions at https://ollama.ai/
   # Then pull the required model
   ollama pull llama3.2:1b
   ```
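After step 5, it can be worth confirming that the pulled model is reachable from Python through `langchain-ollama` (the integration listed in the technology stack). The snippet below is a quick sanity check rather than project code; it assumes the Ollama daemon is running locally with its default settings.

```python
from langchain_ollama import ChatOllama

# Assumes Ollama is running locally and llama3.2:1b has already been pulled
llm = ChatOllama(model="llama3.2:1b", temperature=0)
reply = llm.invoke("Reply with the single word OK if you can read this.")
print(reply.content)
```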
To process the datasets:
```
python -m src.main
```
This will:
- Load all required models
- Process sentiment analysis on multilingual data
- Process toxicity analysis and detoxification
- Save results to the `output/` directory
To run individual tests:
```
cd tests                   # Go to the tests directory first
python test_sentiment.py   # For sentiment analysis tests
python test_toxicity.py    # For toxicity analysis tests
```
To run all tests:
```
cd tests                   # Go to the tests directory first
python run_all_tests.py
```
Test results will be saved in the `test_output/` directory with timestamped filenames, and logs will be stored in `test_output/logs/`.
Key configuration settings can be adjusted in `src/config.py` (a hypothetical sketch follows this list), including:
- Model paths
- Input/output file paths
- Processing parameters (retries, delay between requests)
- Device selection (CUDA/CPU)
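For orientation only, here is a hypothetical sketch of what such a configuration module might contain; every name and default value below is an assumption, and `src/config.py` itself is the authoritative reference.

```python
# Hypothetical excerpt; names and defaults are illustrative only
import torch

# Model paths / identifiers
FASTTEXT_MODEL_PATH = "models/lid.176.bin"
TRANSLATION_MODEL = "UBC-NLP/toucan-base"
ANALYSIS_MODEL = "ibm-granite/granite-3.0-2b-instruct"
OLLAMA_MODEL = "llama3.2:1b"

# Input/output file paths
SENTIMENT_INPUT = "data/multilingual-sentiment-test-solutions.csv"
TOXICITY_INPUT = "data/toxic-test-solutions.csv"
OUTPUT_DIR = "output"

# Processing parameters
MAX_RETRIES = 3      # retries per LLM request
REQUEST_DELAY = 1.0  # seconds to wait between requests

# Device selection
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
```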
- Processing large datasets may take significant time
- GPU acceleration is strongly recommended for optimal performance
- The system includes caching to avoid reprocessing identical texts (see the sketch after these notes)
- Adaptive prompting improves reliability with large language models
- Direct function implementations ensure independence between sentiment and toxicity analysis
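The caching note above can be pictured as keying results by a hash of the input text, so identical texts are analyzed only once. The sketch below is a minimal illustration under that assumption, not the project's actual implementation; the `analyze_fn` parameter is a hypothetical stand-in for whichever analysis function is being wrapped.

```python
import hashlib

_cache = {}

def analyze_with_cache(text, analyze_fn):
    """Call analyze_fn(text) once per distinct text, reusing earlier results."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = analyze_fn(text)
    return _cache[key]
```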
- Enhanced Sentiment Analysis: Improved sentiment analysis templates and output parsing to avoid confusion with toxicity analysis (a parsing sketch follows this list)
- Optimized Toxicity Detection: Added direct function implementations for better toxicity detection accuracy
- Improved Detoxification: Using specialized direct functions for more effective text detoxification
- Better Logging: Enhanced logging functionality including test run logs
- Improved Error Handling: Better error handling and fallback mechanisms
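The output parsing mentioned in the first item is handled by `src/output_parser.py`; as a hedged illustration of the general idea (pull a clean label out of free-form LLM output and fall back to a safe default), here is a sketch that is not taken from that file.

```python
import re

def parse_sentiment_label(raw_output, default="mixed"):
    """Pull the first recognised sentiment label out of free-form LLM text."""
    match = re.search(r"\b(positive|negative|mixed)\b", raw_output.lower())
    return match.group(1) if match else default

print(parse_sentiment_label("Sentiment: POSITIVE. The reviewer clearly liked it."))  # -> positive
```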
If you encounter import errors when running tests, make sure to:
1. Run tests from the test directory: Always navigate to the `tests` directory before running test scripts:

   ```
   cd tests
   python test_sentiment.py
   ```

2. Python Module Path: If you're running tests as modules with `-m`, make sure you're in the project root directory:

   ```
   # From the project root
   PYTHONPATH=. python -m tests.test_sentiment         # Linux/Mac
   set PYTHONPATH=. && python -m tests.test_sentiment  # Windows
   ```

3. Missing Modules: If you see errors about missing modules, ensure you've installed all dependencies:

   ```
   pip install -r requirements.txt  # If available
   # or manually install the required packages
   pip install torch pandas tqdm transformers fasttext langchain-ollama
   ```
If you see warnings like:
```
I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results...
```

These are informational messages from TensorFlow about optimizations. The code already contains fixes to suppress them, but if they still appear, you can (a Python sketch of the in-code approach follows this list):

1. Additional Environment Variables: Set these environment variables before running your script:

   ```
   # Linux/Mac
   export TF_ENABLE_ONEDNN_OPTS=0
   export TF_CPP_MIN_LOG_LEVEL=2

   # Windows
   set TF_ENABLE_ONEDNN_OPTS=0
   set TF_CPP_MIN_LOG_LEVEL=2
   ```

2. Run with Python Flag: Use the `-W` flag to ignore warnings:

   ```
   python -W ignore test_sentiment.py
   ```

3. Alternative TensorFlow Installation: Consider installing the CPU-only version of TensorFlow if you're not using its GPU features:

   ```
   pip uninstall tensorflow
   pip install tensorflow-cpu
   ```
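The in-code suppression referred to above is typically done by setting the same environment variables in Python before TensorFlow is imported anywhere in the process; the sketch below shows that common approach and may not match the project's exact fix.

```python
import os

# These must be set before the first TensorFlow import anywhere in the process
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"  # turn off the oneDNN notice
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"   # hide INFO and WARNING log lines

import tensorflow as tf  # noqa: E402  (deliberately imported after the env vars)
```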