Financial sentiment analysis using the FinBERT model to process Reddit discussions and generate market sentiment scores. Transforms raw financial text into structured sentiment data with 1-5 star ratings, optimized for trading insights.
## Overview

- Model: `ProsusAI/finbert`
- Specialization: Financial text sentiment analysis
- Output: 1-5 star rating system
- Performance: 55+ texts/second on CPU
- Memory: Optimized for CPU-only inference
## Sentiment Scale

- 1-2 stars: Bearish sentiment
- 3 stars: Neutral sentiment
- 4-5 stars: Bullish sentiment
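A minimal sketch of how FinBERT's three-class output (positive / negative / neutral) might be mapped onto this star scale, assuming the extra granularity comes from thresholding the model's confidence (`to_stars` and the 0.75 threshold are illustrative, not the project's actual code):

```python
# Hypothetical mapping from FinBERT's label + confidence to 1-5 stars.
# ProsusAI/finbert emits "positive", "negative", or "neutral"; the
# confidence threshold below is an illustrative assumption.

def to_stars(label: str, score: float) -> int:
    """Map a FinBERT label and confidence score to a 1-5 star rating."""
    if label == "neutral":
        return 3
    if label == "positive":
        return 5 if score >= 0.75 else 4  # high-confidence bullish -> 5 stars
    return 1 if score >= 0.75 else 2      # high-confidence bearish -> 1 star

print(to_stars("positive", 0.9))  # 5
print(to_stars("negative", 0.6))  # 2
```

This keeps the raw model confidence usable downstream while still giving a compact ordinal score.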
## Usage

Run the main processing script:

```bash
python process_data.py
```

Test the model:

```bash
python test_finbert.py
```

## Input

- Source: S3 `raw-data/reddit_financial_*.csv`
- Content: Reddit posts and comments with metadata
- Format: CSV with `title`, `content`, `category`, etc.

## Output

- Destination: S3 `processed-data/processed_data.csv`
- Added Columns: `sentiment_label`, `sentiment_score`
- Format: Enhanced CSV with all original data + sentiment
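Downstream consumers can filter the enhanced CSV on the added columns. A hedged sketch (the inline CSV stands in for a file downloaded from S3; column values follow the schema above):

```python
import io

import pandas as pd

# Illustrative sample of the enhanced CSV: original columns plus
# sentiment_label and sentiment_score.
csv_text = """title,content,category,sentiment_label,sentiment_score
TSLA rally,Calls printing,US_STOCKS,positive,5
Rate fears,Hedging now,ECONOMICS,negative,2
"""

df = pd.read_csv(io.StringIO(csv_text))
# Keep only bullish posts (4-5 stars).
bullish = df[df["sentiment_score"] >= 4]
print(len(bullish))  # 1
```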
## Environment Variables

```bash
# AWS (required)
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_DEFAULT_REGION=us-east-1

# S3 (required)
S3_BUCKET_NAME=automated-trading-data-bucket
```

## Installation

```bash
pip install -r requirements.txt
```

Key packages:

- `transformers` - FinBERT model
- `torch` - PyTorch (CPU optimized)
- `pandas` - Data processing
- `boto3` - AWS S3 integration
## Performance

- Speed: 55+ texts/second (CPU-only)
- Batch Size: Optimized for memory efficiency
- Model Size: ~440MB download (cached locally)
- Memory Usage: ~2GB RAM during processing
## Latest Run Results

- Records Processed: 1,156 Reddit posts/comments
- Processing Time: ~20 seconds total
- Sentiment Distribution: Balanced across 1-5 stars
- Success Rate: 100% (no failed analyses)
## Model Details

- Base: BERT-base-uncased
- Fine-tuning: Financial news and reports
- Tokenizer: BERT WordPiece tokenizer
- Max Length: 512 tokens (auto-truncated)
## Processing Pipeline

- Load: Fetch new CSV files from S3
- Combine: Merge title + content for analysis
- Analyze: FinBERT sentiment scoring
- Map: Convert to 1-5 star system
- Save: Upload enhanced data to S3
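The Combine / Analyze / Map steps above can be sketched as a small pandas transform. This is a hedged sketch: `process_frame` and `analyze` are illustrative names rather than the project's actual API, and the S3 load/save steps via `boto3` are omitted:

```python
import pandas as pd

def process_frame(df: pd.DataFrame, analyze) -> pd.DataFrame:
    """Add sentiment columns to a raw Reddit DataFrame.

    `analyze` stands in for the FinBERT call and is expected to return a
    (label, star_rating) tuple for each text.
    """
    out = df.copy()
    # Combine: title + content give the model the full post context.
    text = (out["title"].fillna("") + " " + out["content"].fillna("")).str.strip()
    # Analyze + Map: score each combined text.
    results = [analyze(t) for t in text]
    out["sentiment_label"] = [label for label, _ in results]
    out["sentiment_score"] = [stars for _, stars in results]
    return out

# Example with a stub analyzer in place of the real model:
df = pd.DataFrame({"title": ["TSLA to the moon"], "content": ["Calls printing"]})
print(process_frame(df, lambda t: ("positive", 5)))
```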
## Error Handling

- Empty Text: Assigns neutral (3 stars)
- Model Errors: Logs and continues processing
- S3 Failures: Retries with exponential backoff
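The retry-with-exponential-backoff behaviour could look like this minimal helper (an illustrative sketch; `with_backoff` is not the project's actual function, and the base delay is an assumption):

```python
import random
import time

def with_backoff(fn, retries: int = 4, base: float = 0.5):
    """Call fn(); on failure, wait base * 2**attempt seconds (plus a
    little jitter) and retry, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

A wrapped S3 upload would then be `with_backoff(lambda: s3.upload_file(...))`, so transient network failures are retried with increasing delays instead of aborting the run.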
## Text Processing

- Combines: Post title + content for full context
- Handles: Emojis, URLs, special characters
- Preserves: Original text alongside sentiment
- Confidence: Includes raw model scores
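A minimal sketch of this preparation step, assuming URLs are stripped and whitespace collapsed while emojis are left intact (`prepare_text` is an illustrative name; the tokenizer simply maps symbols it cannot handle to its unknown token):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def prepare_text(title: str, content: str) -> str:
    """Combine title + content and remove URLs before analysis."""
    combined = f"{title} {content}".strip()
    combined = URL_RE.sub("", combined)          # drop links
    return re.sub(r"\s+", " ", combined).strip() # collapse whitespace

print(prepare_text("GME squeeze?", "Source: https://example.com DD inside 🚀"))
# GME squeeze? Source: DD inside 🚀
```

The original text is kept unchanged in the output CSV; only the model input is cleaned.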
## Why FinBERT

- Trained On: Financial news, earnings reports, market commentary
- Understands: Trading terminology, market sentiment
- Optimized For: Investment-related discussions
## Future Improvements

- Batch Processing: Handle larger datasets efficiently
- Monitoring: Add processing time/success metrics
- Validation: Compare sentiment with price movements
- Fine-tuning: Train on Reddit-specific financial data
- Ensemble: Combine multiple sentiment models
- Confidence Filtering: Flag low-confidence predictions
- GPU Support: Optional GPU acceleration for large batches
- Caching: Store model in memory for repeated runs
- Parallel Processing: Multi-threading for CPU optimization
## Typical Sentiment Distribution

- Bullish (4-5 stars): ~35% of posts
- Neutral (3 stars): ~30% of posts
- Bearish (1-2 stars): ~35% of posts
## Sentiment by Category

- CRYPTO: More volatile sentiment swings
- US_STOCKS: Generally more conservative sentiment
- ECONOMICS: Longer-term, macro-focused sentiment
## Limitations

- Long Posts: Truncated at 512 tokens (BERT limit)
- Sarcasm: May misinterpret sarcastic posts
- Context: Limited to individual post context
## Project Structure

```
ai-workbench/
├── models/
│   └── sentiment_analyzer.py   # FinBERT implementation
├── data/
│   └── s3_manager.py           # S3 data handling
├── process_data.py             # Main processing script
├── test_finbert.py             # Model testing
└── requirements.txt            # Dependencies
```
## Pipeline Context

Part of the automated-trading pipeline:

- Previous: `data-harvester` collects raw data
- Next: `insight-dashboard` visualizes sentiment results