Ferret Scan is a sensitive data detection tool that scans files for potential sensitive information such as credit card numbers and passport numbers.
# macOS: Handle security for downloaded binary installation
chmod +x ferret-scan
xattr -d com.apple.quarantine ferret-scan
# System-wide installation (installs to /usr/local/bin/ferret-scan)
sudo scripts/install-system.sh
# Verify installation
ferret-scan --version
# Set up pre-commit integration (see docs/PRE_COMMIT_INTEGRATION.md)
pip install pre-commit
# Create .pre-commit-config.yaml with ferret-scan hook
chmod +x ferret-scan
xattr -d com.apple.quarantine ferret-scan
Only use these commands for executables from trusted sources.
# Run PowerShell as Administrator and install system-wide
.\scripts\install-system-windows.ps1
# Or install for current user only (no admin required)
.\scripts\install-system-windows.ps1 -UserInstall
# Verify installation
ferret-scan --version
# Set up pre-commit integration (see docs/PRE_COMMIT_INTEGRATION.md)
pip install pre-commit
# Create .pre-commit-config.yaml with ferret-scan hook
- 📦 Release Installation - Quick guide for release downloads
- 📖 Complete Installation Guide - Comprehensive installation options
- 🗑️ Uninstallation Guide - Complete removal instructions
- 🔧 Manual Installation: Download from GitLab Releases
- 🏗️ Build from Source:
make build (see Building the Application)
Ferret Scan provides guides for users, developers, and operators:
📖 Complete Documentation Index →
- 🚀 Getting Started: Installation | Configuration Guide | Architecture Overview
- 👥 User Guides: Web UI | Docker | Preprocess-Only | Suppressions
- 🛠 Development: Creating Validators | Testing Guide
- 🚀 Deployment: GitLab Integration | GitLab Security Scanner Setup | CI/CD Setup
Ferret Scan uses a modular architecture with pluggable validators and preprocessors. For detailed architecture documentation and flow diagrams, see docs/architecture-diagram.md.
For application flow and processing diagrams, see docs/ferret-application-flow.md.
For complete documentation, see the Documentation Index.
- Email Address Validation: RFC-compliant email detection with domain validation
- Enhanced Credit Card Detection: Mathematical validation with 15+ card brands, test pattern filtering, XML/HTML support, multiple separator formats (dashes, spaces, none), and improved quoted string handling
- Intellectual Property Detection: Patents, trademarks, copyrights, and trade secrets
- Intelligent SSN Detection: Domain-aware validation with HR/Tax/Healthcare context understanding
- IP Address Detection: IPv4 and IPv6 address identification with network context
- Metadata Analysis: EXIF and document metadata extraction and validation with intelligent file type filtering for improved performance
- Advanced Passport Recognition: Multi-country formats (US, UK, Canada, EU, MRZ) with travel context analysis
- Person Name Detection: Pattern matching with embedded name databases for first/last names, titles, and cultural variations
- Phone Number Recognition: International and domestic formats with country code support
- Social Media Detection: Configurable platform detection for handles, profiles, and usernames
- Sophisticated Secrets Detection: Entropy analysis + 40+ API key patterns (AWS, GitHub, Google Cloud, Stripe, etc.)
- Context-Aware Analysis: Domain and document type understanding (Healthcare, Financial, HR, etc.)
- Environment Detection: Automatic dev/test/production environment recognition with confidence adjustments
- Cross-Validator Signals: Pattern correlation analysis across different validator types
- Confidence Calibration: Historical performance-based confidence scoring
- Pattern Learning: Automatic discovery of new patterns from operational feedback
- Multi-Language Support: Content validation in 9+ languages with locale-specific patterns
- Batch Processing: Optimized parallel processing with worker pools and pattern caching
- Memory Optimization: Efficient resource management with memory pools and garbage collection
- Real-Time Analytics: Performance metrics, accuracy tracking, and predictive insights
- Docker Support: Containerized deployment for easy integration
- CI/CD Integration: Pre-commit hooks and pipeline integration
- GitLab Security Scanner: Native GitLab SAST report format for Security Dashboard integration
- Web UI: Professional web interface with bulk operations and suppression management (use ferret-scan --web)
- Intelligent File Type Filtering: Metadata validator automatically skips plain text files that cannot contain meaningful metadata
- Optimized Processing: 20-30% performance improvement for workloads with many plain text files (.txt, .py, .js, .json, .md, etc.)
- Smart Content Routing: Only processes metadata extraction for files that actually contain metadata (images, documents, audio, video)
- Reduced False Positives: Eliminates false positives from analyzing plain text content as metadata
- Memory Scrubbing: Secure memory handling to minimize sensitive data exposure
- Suppression System: Rule-based filtering to reduce false positives with bulk management
- Confidence Scoring: Multi-factor confidence calculation with context adjustments
- Audit Trail: Comprehensive logging and observability for compliance requirements
Ferret Scan implements memory scrubbing to reduce the exposure of sensitive data in memory:
- SecureString: Sensitive data is stored in controlled byte slices instead of regular Go strings
- Explicit Clearing: Memory is overwritten with zeros using multiple passes after processing
- Automatic Cleanup: All matches are cleared from memory after output formatting
- Reduced Exposure Window: Minimizes the time sensitive data remains in memory
Due to Go language constraints, memory scrubbing provides partial protection:
- ✅ Controlled clearing: Byte slices are explicitly zeroed
- ✅ Multiple overwrites: Data is overwritten multiple times
- ✅ Forced garbage collection: Memory cleanup is triggered after processing
- ❌ String immutability: Go strings create temporary copies during processing
- ❌ Compiler optimizations: May eliminate "dead" memory overwrites
- ❌ No memory locking: Cannot prevent swapping to disk
The implementation provides better security than storing sensitive data in regular strings, but cannot guarantee complete memory protection due to Go's memory model.
Memory scrubbing is automatically enabled and requires no additional configuration. Sensitive data is cleared from memory after each scan completes.
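The idea behind SecureString can be illustrated outside of Go. The following Python sketch is not Ferret Scan's implementation; the class name `ScrubbableBuffer` and its methods are hypothetical. It keeps a secret in a mutable `bytearray` and overwrites it in place, which an immutable string type would not allow.

```python
class ScrubbableBuffer:
    """Hypothetical holder for sensitive bytes that can be zeroed in place."""

    def __init__(self, data: bytes) -> None:
        # bytearray is mutable, so the same buffer can be overwritten later.
        self._buf = bytearray(data)

    def reveal(self) -> bytes:
        # Returns a copy; this copy is NOT scrubbed by clear().
        return bytes(self._buf)

    def clear(self, passes: int = 3) -> None:
        # Overwrite every byte with zeros, repeated for the given number of passes.
        for _ in range(passes):
            for i in range(len(self._buf)):
                self._buf[i] = 0

    def is_cleared(self) -> bool:
        return all(b == 0 for b in self._buf)


secret = ScrubbableBuffer(b"4111 1111 1111 1111")
secret.clear()
print(secret.is_cleared())
```

As with the Go limitations listed above, any copy made through `reveal()` lives outside the buffer and is not cleared, so the sketch narrows the exposure window but does not close it.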
To set up your development environment:
# First, clone the repository
git clone https://code.aws.dev/personal_projects/alias_a/adifabio/Ferret-Scan.git
cd Ferret-Scan
# Make the setup script executable and run it
chmod +x scripts/setup-dev.sh
./scripts/setup-dev.sh
This script will:
- Install required Go tools (like golint)
- Add the Go bin directory to your PATH
- Install other dependencies
The project includes a Makefile to simplify common development tasks:
# Build the application
make build
# Format code
make fmt
# Run linter
make lint
# Run go vet
make vet
# Clean build artifacts
make clean
# Run all checks and build
make all
# Install configuration file
make install-config
If you prefer not to use the Makefile, you can build manually:
go build -ldflags="-s -w" -o ferret-scan cmd/main.go
./ferret-scan --file <path-to-file> [options]
- --file: Path to the input file, directory, or glob pattern (e.g., *.pdf) (required for CLI mode)
- --config: Path to configuration file (YAML)
- --profile: Profile name to use from config file
- --list-profiles: List available profiles in config file
- --format: Output format: "text", "json", "csv", "yaml", "junit", "gitlab-sast" (default: "text")
  - gitlab-sast: GitLab Security Report format for integration with GitLab Security Dashboard and merge request widgets
- --confidence: Confidence levels to display, comma-separated: "high", "medium", "low", or "all" (default: "all")
- --checks: Specific checks to run, comma-separated: "CREDIT_CARD", "EMAIL", "INTELLECTUAL_PROPERTY", "IP_ADDRESS", "METADATA", "PASSPORT", "PERSON_NAME", "PHONE", "SECRETS", "SOCIAL_MEDIA", "SSN", or "all" (default: "all")
  - SOCIAL_MEDIA: Requires configuration - see Social Media Configuration Guide
- --verbose: Display detailed information for each finding (default: false)
- --debug: Enable debug logging to show preprocessing and validation flow
- --output: Path to output file (if not specified, output to stdout)
- --no-color: Disable colored output (useful for logging or non-terminal output)
- --show-match: Display the actual matched text in findings (otherwise shows [HIDDEN])
- --quiet: Suppress progress output (useful for scripts and CI/CD)
- --help: Show help information
- --version: Show version information
- --enable-preprocessors: Enable text extraction from documents (PDF, Office files) (default: true, use --enable-preprocessors=false to disable)
- --preprocess-only: Output preprocessed text and exit (no validation or redaction)
- -p: Short form of --preprocess-only
- --recursive: Recursively scan directories (default: false)
- --enable-redaction: Enable redaction of sensitive data found in documents
- --redaction-output-dir: Directory where redacted files will be stored (default: "./redacted")
- --redaction-strategy: Default redaction strategy: "simple", "format_preserving", or "synthetic" (default: "format_preserving")
- --redaction-audit-log: Path to save redaction audit log file (JSON format for compliance)
- --generate-suppressions: Generate suppression rules for all findings (disabled by default, updates last_seen_at for existing rules)
- --suppression-file: Path to suppression configuration file (default: ".ferret-scan-suppressions.yaml")
- --show-suppressed: Include suppressed findings in output with suppression details (marked as [SUPP] in text format)
- --web: Start web server mode instead of CLI scanning
- --port: Port for web server (default: 8080, only used with --web)
Scan a file and display all findings in text format:
./ferret-scan --file sample.txt
Scan a file and only show high confidence findings:
./ferret-scan --file sample.txt --confidence high
Scan a file and output results as JSON:
./ferret-scan --file sample.txt --format json
Generate GitLab Security Report for CI/CD integration:
./ferret-scan --file . --recursive --format gitlab-sast --output gl-sast-report.json
Scan a file and save results to an output file:
./ferret-scan --file sample.txt --format json --output results.json
Show detailed information for high and medium confidence findings:
./ferret-scan --file sample.txt --confidence high,medium --verbose
Use a configuration file:
./ferret-scan --file sample.txt --config ferret.yaml
Use a specific profile from the configuration file:
./ferret-scan --file sample.txt --config ferret.yaml --profile thorough
Scan for intellectual property (requires configuration):
./ferret-scan --file document.txt --config ferret.yaml --checks INTELLECTUAL_PROPERTY
Scan for social media profiles and handles (requires configuration):
./ferret-scan --file document.txt --checks SOCIAL_MEDIA
Start web server on default port (8080):
./ferret-scan --web
Start web server on custom port:
./ferret-scan --web --port 9000
Note: Social media detection requires configuration. See the Social Media Configuration Guide for setup instructions.
List available profiles in the configuration file:
./ferret-scan --list-profiles --config ferret.yaml
Extract preprocessed text from documents without validation:
./ferret-scan --file document.pdf --preprocess-only
Extract text using short form flag:
./ferret-scan --file document.docx -p
Extract text from multiple files:
./ferret-scan --file documents/ --recursive --preprocess-only
Extract text with verbose output showing processor details:
./ferret-scan --file image.jpg --preprocess-only --verbose
Ferret Scan includes comprehensive pre-configured profiles for different use cases:
# Generate JUnit XML for test result integration
./ferret-scan --file . --recursive --config ferret.yaml --profile ci
# Security-focused scan excluding low confidence matches
./ferret-scan --file . --recursive --config ferret.yaml --profile security-audit
# Fast scan focusing on critical data types only
./ferret-scan --file . --config ferret.yaml --profile quick
# Full analysis with all features and YAML output
./ferret-scan --file document.pdf --config ferret.yaml --profile comprehensive
```bash
# CSV format for spreadsheet analysis
./ferret-scan --file . --recursive --config ferret.yaml --profile csv-export
# JSON API format for programmatic processing
./ferret-scan --file document.txt --config ferret.yaml --profile json-api
# GitLab Security Scanner Integration
./ferret-scan --file . --recursive --config ferret.yaml --profile gitlab-security
```
Quiet mode for scripts and CI/CD:
# Suppress progress output for clean script output
./ferret-scan --file document.txt --quiet
# Combine with other options for automated scanning
./ferret-scan --file "*.pdf" --quiet --format json --output results.json
Detect secrets and API keys:
# Scan for secrets in configuration files
./ferret-scan --file config.json --checks SECRETS
# High confidence secrets only
./ferret-scan --file .env --checks SECRETS --confidence high
# Verbose output with entropy analysis
./ferret-scan --file app.py --checks SECRETS --verbose
View suppressed findings:
# Show what findings were suppressed and why
./ferret-scan --file document.txt --format json --show-suppressed
# Regular scan (suppressed findings not shown)
./ferret-scan --file document.txt --format json
Generate suppression rules for findings to reduce false positives:
# Generate disabled suppression rules for all findings
./ferret-scan --file document.txt --generate-suppressions
# Run again to update last_seen_at timestamps
./ferret-scan --file document.txt --generate-suppressions
# Use custom suppression file
./ferret-scan --file document.txt --suppression-file custom-suppressions.yaml
# Default suppression file location
# ~/.ferret-scan/suppressions.yaml
Web UI Management: The web interface provides comprehensive suppression management:
- View Rules: Browse all suppression rules with file details and pagination
- Bulk Operations: Select multiple rules for enable/disable/delete operations
- Individual Actions: Enable, disable, edit, or remove single rules
- Undo Support: Undo button appears after operations to reverse changes
- New Findings Integration: Add suppressions directly from scan results
- Auto-generation: Rules created during scans with --generate-suppressions
- CLI Compatibility: Suppressions work seamlessly between web UI and command line
The web UI supports efficient bulk operations for managing multiple suppressions:
Bulk Operations:
# Select multiple suppressions using checkboxes
# Available bulk actions:
- Enable Selected: Activate multiple rules at once
- Disable Selected: Deactivate multiple rules at once
- Delete Selected: Permanently remove multiple rules
- Add Selected as Suppressions: Create rules from new scan findings
Undo Functionality:
- Undo button appears after any bulk or individual operation
- Reverses the last action (enable → disable, create → delete, etc.)
- Preserves original enabled/disabled state for deleted rules
- Works for both bulk and individual operations
Selection Features:
- Checkbox selection with "Select All" and "Clear" options
- Separate selection systems for existing rules vs new findings
- Visual indicators show selection count and available actions
For comprehensive suppression documentation, see Suppression System Guide.
The person name validator has been significantly optimized with database-first processing and enhanced accuracy:
What Changed:
- Database-First Processing: Names are checked against embedded databases before pattern matching
- Early Exit Optimization: Non-matching text exits immediately without expensive pattern matching
- Enhanced Technical Context Detection: Automatic confidence penalties for technical terms (API, function, method)
- Comma-Separated Name Support: New patterns for "Last, First" format detection
- Confidence Bug Fixes: Eliminated confidence leakage from zero-confidence matches
User Impact:
- Dramatic Performance Improvement: 98% faster processing with 12x throughput increase
- Better Accuracy: Reduced false positives in technical documentation
- Enhanced Detection: Support for additional name formats and patterns
- Same Interface: No configuration changes required
No Configuration Required: This optimization is automatic and maintains full backward compatibility.
The metadata validator now includes intelligent file type filtering that automatically determines which files can contain meaningful metadata:
What Changed:
- Plain text files (.txt, .py, .js, .json, .md, etc.) are automatically skipped during metadata validation
- Only files that can actually contain metadata (images, documents, audio, video) are processed
- Debug logging now shows file type filtering decisions
User Impact:
- Faster Performance: 20-30% improvement for workloads with many plain text files
- Fewer False Positives: Eliminates false matches from analyzing text content as metadata
- Same Accuracy: Full metadata detection maintained for files that actually contain metadata
- Debug Output: New debug messages show which files are processed vs skipped
No Configuration Required: This optimization is automatic and requires no changes to existing configurations or command-line usage.
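The filtering decision can be pictured as a lookup on the file extension. The sketch below is an illustration only: the function name `should_extract_metadata` is hypothetical, the extension sets are copied from the examples in this README, and the actual validator may decide differently (for example by inspecting file contents).

```python
from pathlib import Path

# Extensions taken from the examples in this README; the real sets may differ.
PLAIN_TEXT_EXTENSIONS = {".txt", ".py", ".js", ".json", ".md"}
METADATA_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp",      # documents
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif", ".webp",  # images
    ".mp3", ".m4a", ".wav", ".flac", ".ogg",                          # audio
    ".mp4", ".mov", ".avi", ".mkv", ".wmv",                           # video
}


def should_extract_metadata(path: str) -> bool:
    """Return True only for file types that can carry embedded metadata."""
    ext = Path(path).suffix.lower()
    if ext in PLAIN_TEXT_EXTENSIONS:
        return False  # skipped: plain text has no embedded metadata
    return ext in METADATA_EXTENSIONS


for name in ["photo.JPG", "notes.md", "report.pdf"]:
    print(name, should_extract_metadata(name))
```

Unknown extensions are treated as "skip" in this sketch; that default is an assumption, not documented behavior.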
- Configuration Guide - YAML configuration and profiles
- Docker Guide - Container deployment
- Web UI Guide - Web interface documentation
- Examples - Code examples and usage samples
- Creating Validators - Developer guide
- Debug Logging - Troubleshooting guide
- Text Extraction Integration - Document processing
- Battle Card - Competitive analysis and positioning
- Changelog - Version history and updates
Ferret Scan is designed for seamless integration into modern development workflows.
Run Ferret Scan in a containerized environment using Docker or Finch:
# Build container image (auto-detects Docker/Finch)
make container-build
# Web UI mode with persistent data (recommended)
./scripts/container-run.sh -p 8080:8080 -v ~/.ferret-scan:/home/ferret/.ferret-scan ferret-scan
# CLI mode - basic scan
./scripts/container-run.sh --rm -v $(pwd):/data ferret-scan ferret-scan --file /data/document.txt
# CLI mode - with persistent configuration and suppressions
./scripts/container-run.sh --rm -v $(pwd):/data -v ~/.ferret-scan:/home/ferret/.ferret-scan ferret-scan ferret-scan --file /data/document.txt
# Or use container runtime directly (Docker/Finch):
# docker run -p 8080:8080 -v ~/.ferret-scan:/home/ferret/.ferret-scan ferret-scan
# finch run -p 8080:8080 -v ~/.ferret-scan:/home/ferret/.ferret-scan ferret-scan
See the Container Guide for detailed usage instructions.
Volume Mapping:
- -v ~/.ferret-scan:/root/.ferret-scan - Persist config and suppressions
- -v $(pwd):/workspace - Mount current directory for file access
- -e FERRET_CONFIG_DIR=/config - Override config directory location
Integrate Ferret Scan directly into your Git workflow using pre-commit hooks:
# Complete team setup in one command
make setup-team
# This configures:
# • Team security policies (.ferret-scan.yaml)
# • Pre-commit hooks (.pre-commit-config.yaml)
# • GitHub Actions workflow (.github/workflows/ferret-scan.yml)
# Commit the configuration to share with your team
git add .ferret-scan.yaml .pre-commit-config.yaml .github/workflows/ferret-scan.yml
git commit -m "Add Ferret Scan team security configuration"
# Each team member runs once:
make setup-developer
# This installs pre-commit hooks and tests the setup
# Commits will now be automatically scanned for sensitive data
Option 1: Python Package
# Install via pip
# PyPI Package Coming Soon!!
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/ferret-scan
    rev: v1.0.0
    hooks:
      - id: ferret-scan
        name: Ferret Scan - Sensitive Data Detection
        files: '\.(txt|py|js|json|yaml|md)$'
# .pre-commit-config.yaml (requires ferret-scan to be installed)
repos:
  - repo: local
    hooks:
      - id: ferret-scan
        name: Ferret Scan - Sensitive Data Detection
        entry: ferret-scan
        language: system
        files: '\.(txt|py|js|json|yaml|md)$'
        # --file goes last so the filename appended by pre-commit becomes its value
        args: ['--quiet', '--file']
Option 3: Direct Binary Integration
# .pre-commit-config.yaml (build from source)
repos:
  - repo: local
    hooks:
      - id: ferret-scan
        name: Ferret Scan - Direct Binary
        entry: go run cmd/main.go --pre-commit-mode
        language: system
        files: '\.(txt|py|js|json|yaml|md)$'
        pass_filenames: true
GitLab CI/CD:
# .gitlab-ci.yml
security-scan:
  stage: security
  image: ferret-scan:latest
  script:
    - ferret-scan --file . --recursive --format json --output scan-results.json
    # Also produce the JUnit report referenced under artifacts:reports
    - ferret-scan --file . --recursive --format junit --output junit-report.xml
  artifacts:
    reports:
      junit: junit-report.xml
    paths:
      - scan-results.json
    expire_in: 1 week
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
Jenkins Pipeline:
pipeline {
    agent any
    stages {
        stage('Security Scan') {
            steps {
                sh 'ferret-scan --file . --recursive --confidence high --format json --output results.json'
                archiveArtifacts artifacts: 'results.json'
            }
        }
    }
}
Ferret Scan includes a web-based interface for easy file scanning through your browser.
# Build and start the web UI
make build
./bin/ferret-scan --web --port 8080
Then open http://localhost:8080 in your browser (or the port you specified).
[Screenshot: Main Interface - File Upload and Configuration]
[Screenshot: Scan Results - Interactive Results Table]
- Display: Version number shown in top navigation bar
- Details: Click version number for detailed build information
- API: Complete version data available via the /health endpoint
- Timestamp: Uses current server timestamp for real-time information
- Single File: Click "Choose Files" and select one file
- Multiple Files: Hold Ctrl/Cmd while selecting files, or drag multiple files onto the upload area
- Real-time Processing: Results appear progressively as each file is scanned
- Visual Progress Bar: Shows completion percentage and current file being processed
- Progress Tracking: Shows current file being processed (e.g., "Scanning file 2 of 5: document.pdf")
- Confidence Levels: Filter results by HIGH, MEDIUM, LOW, or all levels
- Check Types: Select specific validators or run all checks
- Verbose Output: Show detailed information for each finding
- Recursive Scanning: Process directories recursively (when applicable)
- Real-time Updates: Results appear as each file completes processing
- Smart Sorting: Default multi-level sort by confidence (desc), filename (asc), line number (asc)
- Interactive Pagination: Navigate large result sets with clickable page numbers (50/100 per page or all)
- Clickable Statistics: Filter results by confidence level using stat cards
- Suppressed Findings: Click "Suppressed" stat card to view detailed modal of suppressed findings with rule information
- Color Coding: Visual distinction between confidence levels
- Detailed Information: File location, line numbers, confidence scores, and metadata
- Export Options: Download results as CSV or JSON with current display settings
- Error Handling: Individual file errors don't stop processing of other files
- Bulk Operations: Select multiple suppressions using checkboxes for batch enable/disable/delete
- Individual Actions: Quick enable/disable/edit/remove buttons for single rules
- Undo Functionality: Undo button appears after operations to reverse the last change
- New Findings Integration: Scan results show new findings that can be added as suppressions
- Rule Details: Click rule IDs to view complete suppression information
- Status Indicators: Visual ENABLED (green) and DISABLED (red) status badges
- Smart Pagination: Navigate large suppression rule sets efficiently
The web UI supports all file types available in the CLI version:
- Plain text (.txt, .log, .csv, .json, .xml, etc.)
- Source code files (.py, .js, .java, .cpp, etc.)
- Configuration files (.yaml, .ini, .conf, etc.)
Note: The metadata validator automatically skips these file types as they cannot contain meaningful metadata, improving performance by 20-30% for workloads with many plain text files.
- PDF documents (.pdf)
- Microsoft Office (.docx, .xlsx, .pptx)
- OpenDocument (.odt, .ods, .odp)
- JPEG (.jpg, .jpeg) - EXIF metadata, GPS coordinates
- PNG (.png) - Image metadata and properties
- GIF (.gif) - Animation and metadata
- BMP (.bmp) - Basic image metadata
- TIFF (.tiff, .tif) - Comprehensive metadata
- WebP (.webp) - Modern format metadata
Note: The metadata validator automatically processes these file types for metadata extraction.
- MP3 (.mp3) - ID3v1/v2 tags, artist, album, lyrics
- M4A (.m4a) - iTunes metadata, AAC format
- WAV (.wav) - RIFF chunks, broadcast metadata
- FLAC (.flac) - Vorbis comments, lossless metadata
- OGG (.ogg) - Vorbis comments, stream info
Note: The metadata validator automatically processes these file types for metadata extraction.
- MP4 (.mp4) - MP4 atoms, iTunes metadata, codec info
- MOV (.mov) - QuickTime atoms, metadata
- AVI (.avi) - RIFF chunks, stream metadata
- MKV (.mkv) - Matroska elements, tags
- WMV (.wmv) - ASF headers, Windows Media metadata
Note: The metadata validator automatically processes these file types for metadata extraction.
- Local Processing: All scanning happens on your machine
- Temporary Files: Uploaded files are automatically deleted after scanning
- No Data Storage: Results are not saved permanently on the server
- Memory Scrubbing: Sensitive data is cleared from memory after processing
- Upload: contract.pdf
- Confidence: "High & Medium"
- Checks: "All Checks"
- Click "Scan File"
- Select multiple files (Ctrl+click or drag multiple)
- Configure desired settings
- Watch real-time progress as each file is processed
- View accumulated results sorted by severity
- View Suppressions: Click "Suppressions" tab to manage rules
- Bulk Operations: Select multiple rules with checkboxes, then use bulk action buttons
- Individual Actions: Use enable/disable/edit/remove buttons on each rule
- Undo Changes: Click "Undo Last Change" button to reverse operations
- Add from Scan: New findings from scans can be directly added as suppressions
- Default port: 8080
- Auto-increment: If 8080 is busy, tries 8081, 8082, etc.
- Custom port: Set the PORT environment variable
- Maximum upload: 10MB per file
- Multiple files: No limit on total count
- Sequential processing: Files are scanned one at a time for stability
- Smart pagination: Only shows pagination controls when needed (50+ results)
- Progress feedback: Visual progress bar with real-time updates
- Error isolation: Problems with one file don't affect others
- Memory efficient: Results are paginated to handle large datasets
# Build the main binary first
make build
# Then start web UI
./bin/ferret-scan --web --port 8080
The web UI automatically finds an available port. Check the console output for the actual port being used.
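The auto-increment behavior (try 8080, then 8081, 8082, and so on) can be sketched as a simple probe loop. This is a minimal illustration in Python, not the server's actual code; the function names, the `attempts` limit, and the choice to probe on 127.0.0.1 are all assumptions.

```python
import socket
from typing import Callable


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start: int = 8080, attempts: int = 20,
                    is_free: Callable[[int], bool] = port_is_free) -> int:
    """Walk upward from `start` and return the first port reported as free."""
    for port in range(start, start + attempts):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")
```

The `is_free` parameter exists only so the loop can be exercised without opening sockets, for example `first_free_port(8080, is_free=lambda p: p >= 8082)` returns 8082. A probe like this is inherently racy: another process can take the port between the check and the real bind.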
- Check file size limits (10MB for most files, 500MB for audio)
- Try processing files individually if bulk upload fails
# Build main binary
make build
# Start web server
./bin/ferret-scan --web --port 8080
# Or run directly
./bin/ferret-scan --web --port 8080
- Modify internal/web/server.go to add features
- Update HTML template in web/template.html for UI changes
- Adjust file size limits or add new scan options
The web UI exposes a REST API at /scan that accepts multipart form data with the same parameters as the web interface.
- CloudScape Design: AWS Console-style interface with professional styling
- Responsive Layout: Works on desktop and mobile devices
- Interactive Help: Comprehensive help modal with usage tips and examples
- CLI Command Display: Shows equivalent command-line usage based on current settings
- Smart Pagination: Page numbers with Previous/Next navigation
- Sortable Columns: Click any column header to sort results
- Expandable Sections: Collapsible configuration sections for clean interface
Ferret Scan uses a standard directory structure for configuration and data files:
~/.ferret-scan/
├── config.yaml # Main configuration file
└── suppressions.yaml # Suppression rules
Environment Variables:
- FERRET_CONFIG_DIR: Override the base directory (default: ~/.ferret-scan)
Ferret Scan supports YAML configuration files to set default options and create profiles for different scanning scenarios. This allows you to save commonly used settings and quickly switch between different scanning configurations.
For detailed configuration documentation, see Configuration Guide.
| Profile | Purpose | Output Format | Use Case |
|---|---|---|---|
| quick | Fast security check | Text | Development, pre-commit hooks |
| ci | CI/CD integration | JUnit XML | Automated testing pipelines |
| security-audit | Security team scanning | JSON | Compliance, security audits |
| comprehensive | Complete analysis | YAML | Forensic investigation, debugging |
| csv-export | Data analysis | CSV | Spreadsheet analysis, reporting |
| json-api | API integration | JSON | Programmatic processing |
| debug | Troubleshooting | YAML | Validator development, debugging |
| silent | Automation | JSON | Scripts, monitoring systems |
| credit-card | Payment security | Text | PCI compliance |
| passport | Travel documents | Text | Identity verification |
| intellectual-property | IP protection | Text | Corporate security |
The tool looks for configuration files in the following locations (in order of precedence):
- Path specified with the --config flag
- config.yaml in the current directory
- ferret.yaml or ferret.yml in the current directory
- ~/.ferret-scan/config.yaml (standard location)
- .ferret.yaml or .ferret.yml in the user's home directory (legacy)
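The lookup order above amounts to "return the first candidate that exists". The Python sketch below illustrates that order; the function name `find_config` is hypothetical, and how the real tool reacts to a missing --config path (it may report an error rather than fall through) is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[str], cwd: Path, home: Path) -> Optional[Path]:
    """Return the first configuration file that exists, in documented order."""
    candidates = []
    if explicit:
        candidates.append(Path(explicit))       # value of --config, if given
    candidates += [
        cwd / "config.yaml",
        cwd / "ferret.yaml",
        cwd / "ferret.yml",
        home / ".ferret-scan" / "config.yaml",  # standard location
        home / ".ferret.yaml",                  # legacy
        home / ".ferret.yml",                   # legacy
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


print(find_config(None, Path.cwd(), Path.home()))
```

Passing `cwd` and `home` explicitly keeps the function easy to test against temporary directories.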
The configuration file has three main sections:
- defaults: Default settings applied when no profile is specified
- validators: Global validator-specific configurations
- profiles: Named profiles for different scanning scenarios
Example configuration file:
# Default settings applied when no profile is specified
defaults:
  format: text # Output format: text, json, csv, yaml, junit, or gitlab-sast
  confidence_levels: all # Confidence levels to display: high, medium, low, or combinations
  checks: all # Specific checks to run: CREDIT_CARD, EMAIL, INTELLECTUAL_PROPERTY, IP_ADDRESS, METADATA, PASSPORT, PERSON_NAME, PHONE, SECRETS, SOCIAL_MEDIA, SSN, or combinations
  verbose: false # Display detailed information for each finding
  no_color: false # Disable colored output
  recursive: false # Recursively scan directories
# Validator-specific configurations
validators:
  # Intellectual property validator configuration
  intellectual_property:
    # Internal company URL patterns to detect
    internal_urls:
      - "http[s]?:\\/\\/s3\\.amazonaws\\.com"
      - "http[s]?:\\/\\/.*\\.internal\\..*"
      - "http[s]?:\\/\\/.*\\.corp\\..*"
      - "http[s]?:\\/\\/.*-internal\\..*"
    # Custom intellectual property patterns
    intellectual_property_patterns:
      patent: "\\b(US|EP|JP|CN|WO)[ -]?(\\d{1,3}[,.]?\\d{3}[,.]?\\d{3}|\\d{1,3}[,.]?\\d{3}[,.]?\\d{2}[A-Z]\\d?)\\b"
      trademark: "\\b(\\w+\\s*[™®]|\\w+\\s*\\(TM\\)|\\w+\\s*\\(R\\)|\\w+\\s+Trademark|\\w+\\s+Registered\\s+Trademark)\\b"
      copyright: "(©|\\(c\\)|\\(C\\)|Copyright|\\bCopyright\\b)\\s*\\d{4}[-,]?(\\d{4})?\\s+[A-Za-z0-9\\s\\.,]+"
      trade_secret: "\\b(Confidential|Trade\\s+Secret|Proprietary|Company\\s+Confidential|Internal\\s+Use\\s+Only|Restricted|Classified)\\b"
# Profiles for different scanning scenarios
profiles:
  # Quick scan profile - only high confidence matches, minimal output
  quick:
    format: text
    confidence_levels: high
    checks: all
    verbose: false
    no_color: false
    recursive: false
    description: "Quick scan with only high confidence matches"
  # Thorough scan profile - all confidence levels, verbose output, recursive scanning
  thorough:
    format: text
    confidence_levels: all
    checks: all
    verbose: true
    no_color: false
    recursive: true
    description: "Thorough scan with all confidence levels and recursive scanning"
  # Company-specific profile with custom internal URLs and patterns
  company-specific:
    format: text
    confidence_levels: all
    checks: INTELLECTUAL_PROPERTY
    verbose: true
    no_color: false
    recursive: true
    description: "Company-specific intellectual property scan"
    validators:
      intellectual_property:
        # Company-specific internal URL patterns
        internal_urls:
          - "http[s]?:\\/\\/company-wiki\\.internal"
          - "http[s]?:\\/\\/docs\\.company\\.com"
          - "http[s]?:\\/\\/.*\\.company-internal\\.com"
Command line options take precedence over configuration file settings. The order of precedence is:
- Command line options
- Profile settings (if a profile is specified)
- Default settings from the configuration file
- Built-in default values
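
The precedence order above can be pictured as a layered merge in which each higher-priority source overrides only the options it actually sets. The following is a minimal sketch, not Ferret Scan's real configuration loader: the `Settings` type, the `merge` and `resolve` helpers, and the sample values are all hypothetical.

```go
package main

import "fmt"

// Settings is a hypothetical subset of scan options. Pointer fields
// distinguish "not set" (nil) from an explicit value such as false.
type Settings struct {
	ConfidenceLevels *string
	Verbose          *bool
	Recursive        *bool
}

// merge overlays the fields that override sets on top of base.
func merge(base, override Settings) Settings {
	if override.ConfidenceLevels != nil {
		base.ConfidenceLevels = override.ConfidenceLevels
	}
	if override.Verbose != nil {
		base.Verbose = override.Verbose
	}
	if override.Recursive != nil {
		base.Recursive = override.Recursive
	}
	return base
}

// resolve applies layers from lowest to highest priority.
func resolve(layers ...Settings) Settings {
	var out Settings
	for _, layer := range layers {
		out = merge(out, layer)
	}
	return out
}

// ptr returns a pointer to v, which keeps the literals below short.
func ptr[T any](v T) *T { return &v }

func main() {
	// Assumed sample values for each layer, lowest priority first.
	builtin := Settings{ConfidenceLevels: ptr("all"), Verbose: ptr(false), Recursive: ptr(false)}
	fileDefaults := Settings{Verbose: ptr(false)}
	profile := Settings{ConfidenceLevels: ptr("high"), Recursive: ptr(true)}
	cli := Settings{Verbose: ptr(true)}

	final := resolve(builtin, fileDefaults, profile, cli)
	// prints: high true true
	fmt.Println(*final.ConfidenceLevels, *final.Verbose, *final.Recursive)
}
```

Using nil to mean "not set" lets a command line flag that was never passed leave the profile or file value untouched.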
- HIGH (90-100%): Very likely to be sensitive data
- MEDIUM (60-89%): Possibly sensitive data
- LOW (0-59%): Likely not sensitive data or false positive
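
As an illustration of the ranges above, a confidence percentage can be mapped to its level with a simple threshold check. This is a sketch only; `levelFor` is a hypothetical helper, not a function from the Ferret Scan codebase.

```go
package main

import "fmt"

// levelFor maps a confidence percentage (0-100) to the level names
// listed above: HIGH is 90-100, MEDIUM is 60-89, LOW is 0-59.
func levelFor(score float64) string {
	switch {
	case score >= 90:
		return "HIGH"
	case score >= 60:
		return "MEDIUM"
	default:
		return "LOW"
	}
}

func main() {
	for _, s := range []float64{95, 89, 60, 12} {
		fmt.Printf("%5.1f%% -> %s\n", s, levelFor(s))
	}
}
```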
Ferret Scan includes multiple validators for different types of sensitive data:
- Credit Card Validator - Detects credit card numbers from major providers with advanced mathematical validation
- Passport Validator - Detects passport numbers from various countries with contextual analysis
- SSN Validator - Detects Social Security Numbers with domain-aware validation
- IP Address Validator - Detects IP addresses with sensitivity filtering (excludes private, reserved, test ranges)
- Email Validator - Detects email addresses with advanced domain validation
- Phone Validator - Detects phone numbers with international format support
- Secrets Validator - Detects API keys, tokens, passwords, and other secrets using entropy analysis
- Social Media Validator - Detects social media profiles, usernames, and handles across major platforms (LinkedIn, Twitter/X, Facebook, GitHub, Instagram, YouTube, TikTok, etc.)
- Intellectual Property Validator - Detects patents, trademarks, copyrights, and trade secrets
- 🆕 Enhanced Metadata Validator - Preprocessor-aware metadata validation with intelligent file type filtering and type-specific patterns
For details on each validator's capabilities, supported formats, and detection methods, please refer to their individual documentation.
The metadata validator combines intelligent file type filtering with a dual-path routing system and preprocessor-aware validation.
It automatically determines which files can contain meaningful metadata and skips plain text files:
Files Processed for Metadata:
- Images: .jpg, .jpeg, .png, .gif, .tiff, .tif, .bmp, .webp, .heic, .heif, .raw, .cr2, .nef, .arw
- Documents: .pdf, .docx, .doc, .xlsx, .xls, .pptx, .ppt, .odt, .ods, .odp
- Audio: .mp3, .flac, .wav, .ogg, .m4a, .aac, .wma, .opus
- Video: .mp4, .mov, .avi, .mkv, .wmv, .flv, .webm, .m4v, .3gp, .ogv
Files Skipped for Metadata (Performance Optimization):
- Plain Text: .txt, .md, .log, .csv, .json, .xml, .html, .js, .py, .go, .java, .c, .cpp, .h, .sh, .bat, .ps1, .yaml, .yml
- Source Code: All programming language files and configuration files
- Unknown Extensions: Files without extensions or unrecognized file types
Performance Benefits:
- 20-30% faster processing for workloads with many plain text files
- Eliminates false positives from analyzing text content as metadata
- Reduced memory usage and CPU consumption
- Maintains full accuracy for files that actually contain metadata
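
The filtering described above amounts to a lookup on the lowercased file extension. The sketch below shows the idea with a subset of the extensions listed; `metadataExtensions` and `metadataCategory` are hypothetical names, and the real validator may decide differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// metadataExtensions maps a subset of the extensions listed above to a
// metadata category. Anything not in the map is skipped.
var metadataExtensions = map[string]string{
	".jpg": "image", ".jpeg": "image", ".png": "image", ".tiff": "image", ".heic": "image",
	".pdf": "document", ".docx": "document", ".xlsx": "document", ".pptx": "document",
	".mp3": "audio", ".flac": "audio", ".wav": "audio",
	".mp4": "video", ".mov": "video", ".mkv": "video",
}

// metadataCategory returns the category for path and whether the file
// should be processed for metadata at all.
func metadataCategory(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	category, ok := metadataExtensions[ext]
	return category, ok
}

func main() {
	for _, p := range []string{"photo.JPG", "report.pdf", "notes.txt", "Makefile"} {
		if category, ok := metadataCategory(p); ok {
			fmt.Printf("process %s (%s)\n", p, category)
		} else {
			fmt.Printf("skip    %s\n", p)
		}
	}
}
```

Treating unknown and missing extensions as "skip" matches the list above, where files without extensions are not processed for metadata.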
- Image Metadata: EXIF data, GPS coordinates, camera information, creator details
  - File Types: JPG, JPEG, TIFF, TIF, PNG, GIF, BMP, WEBP
  - Enhanced Detection: GPS data (+60% confidence), device info (+40%), creator info (+30%)
- Document Metadata: Author information, document properties, rights data
  - File Types: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP
  - Enhanced Detection: Manager info (+40% confidence), comments (+50%), author info (+30%)
- Audio Metadata: Artist information, contact details, recording data
  - File Types: MP3, FLAC, WAV, M4A
  - Enhanced Detection: Contact info (+50% confidence), management (+40%), artist info (+30%)
- Video Metadata: Location data, device information, production details
  - File Types: MP4, MOV, M4V
  - Enhanced Detection: GPS data (+60% confidence), location info (+50%), device info (+40%)
- Improved Accuracy: 20-30% improvement in precision through targeted validation
- Reduced False Positives: 40-50% reduction through preprocessor-aware patterns
- Enhanced Performance: 5-15% faster processing through intelligent content routing
- Better Debugging: Detailed observability into validation decisions and confidence scoring
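
One way to read the per-field confidence boosts quoted for image metadata (GPS +60%, device +40%, creator +30%) is as additive adjustments to a base score. The sketch below assumes the boosts are summed and clamped to the 0-100 range. That combination rule is an assumption, not documented behavior, and the field keys and `boostedScore` helper are hypothetical.

```go
package main

import "fmt"

// imageBoosts holds the image boost values quoted above. The keys are
// hypothetical labels, not identifiers from the tool.
var imageBoosts = map[string]float64{
	"gps":     60,
	"device":  40,
	"creator": 30,
}

// boostedScore adds the boost for each detected field to base and clamps
// the result to 0-100. Additive combination is an assumption.
func boostedScore(base float64, detected []string) float64 {
	total := base
	for _, field := range detected {
		total += imageBoosts[field] // unknown fields add 0
	}
	if total > 100 {
		total = 100
	}
	if total < 0 {
		total = 0
	}
	return total
}

func main() {
	fmt.Println(boostedScore(0, []string{"creator"}))
	fmt.Println(boostedScore(20, []string{"gps", "device"}))
}
```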
# Enhanced metadata validation with debug output (shows file type filtering decisions)
ferret-scan --file photo.jpg --checks METADATA --debug --verbose
# Scan multiple metadata types with high confidence (automatically skips .txt, .py, .js files)
ferret-scan --file media/ --recursive --checks METADATA --confidence high
# Use enhanced metadata profile with detailed output (shows which files are processed vs skipped)
ferret-scan --config ferret.yaml --profile enhanced-metadata --file documents/
# Example showing file type filtering in action
ferret-scan --file mixed-folder/ --recursive --checks METADATA --debug
# Output will show: "Skipping metadata validation for file.txt (plain text file type)"
# Output will show: "Processing metadata for photo.jpg (image file type)"

All validators implement advanced false positive prevention:
- Zero Confidence Filtering: Automatically excludes matches with 0% confidence scores
- Context-Aware Analysis: Uses surrounding text and keywords to improve accuracy
- Pattern Validation: Mathematical and structural validation for applicable data types
- Sensitivity Filtering: IP Address validator excludes non-identifying addresses (private, reserved, test ranges)
- Test Data Detection: Identifies and filters common test patterns and placeholder data
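
As an example of the mathematical pattern validation mentioned above, the Luhn checksum is the standard check digit scheme for payment card numbers, and it rejects most random digit strings. The sketch below is a generic implementation; treating it as one of the checks Ferret Scan applies is an assumption, and `luhnValid` is a hypothetical helper.

```go
package main

import "fmt"

// luhnValid reports whether the digits in number satisfy the Luhn
// checksum. Spaces and hyphens are ignored; any other non-digit
// character makes the input invalid.
func luhnValid(number string) bool {
	sum, digits := 0, 0
	double := false // every second digit from the right is doubled
	for i := len(number) - 1; i >= 0; i-- {
		c := number[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
		digits++
	}
	return digits > 1 && sum%10 == 0
}

func main() {
	for _, n := range []string{"4111 1111 1111 1111", "4111 1111 1111 1112"} {
		fmt.Printf("%s valid=%t\n", n, luhnValid(n))
	}
}
```

A number that passes the checksum can still be a well-known test value, which is why a checksum is usually paired with context analysis and test data detection.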
Web UI Access: All validators are available through the web interface at http://localhost:8080 after running ferret-scan --web (or specify a custom port with --port <number>).
To add a new validator for detecting other types of sensitive data:
- Create a new package under internal/validators/
- Implement the detector.Validator interface
- Add your validator to the list in cmd/main.go
- Create a README.md in your validator's package directory with:
  - Description of what the validator detects
  - Supported formats or types
  - Detection capabilities and features
  - Confidence scoring methodology
  - Usage examples
  - Implementation details
- Add a link to your validator's README in the main README.md
See the existing validator READMEs for examples of the recommended documentation structure.
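
The steps above can be sketched as a small, self-contained program. The `Match` type and `Validator` interface below are hypothetical stand-ins defined only for this example; the real detector.Validator interface may have different methods and signatures, so check the existing validators before copying this shape. The "EMP-123456" ID format is also made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Match and Validator are hypothetical stand-ins for illustration only.
// The real detector.Validator interface in this repository may differ.
type Match struct {
	Text       string
	Type       string
	Confidence float64 // percentage, 0-100
}

type Validator interface {
	Name() string
	Validate(content string) []Match
}

// employeeIDValidator detects a made-up ID format such as "EMP-123456".
type employeeIDValidator struct {
	pattern *regexp.Regexp
}

func newEmployeeIDValidator() *employeeIDValidator {
	return &employeeIDValidator{pattern: regexp.MustCompile(`\bEMP-\d{6}\b`)}
}

func (v *employeeIDValidator) Name() string { return "EMPLOYEE_ID" }

// Validate returns one match per pattern hit. The score starts at 60 and
// rises to 90 when the word "employee" appears anywhere in the content.
func (v *employeeIDValidator) Validate(content string) []Match {
	confidence := 60.0
	if strings.Contains(strings.ToLower(content), "employee") {
		confidence = 90.0
	}
	var matches []Match
	for _, hit := range v.pattern.FindAllString(content, -1) {
		matches = append(matches, Match{Text: hit, Type: v.Name(), Confidence: confidence})
	}
	return matches
}

func main() {
	var v Validator = newEmployeeIDValidator()
	for _, m := range v.Validate("Employee badge: EMP-004217") {
		fmt.Printf("%s %s %.0f%%\n", m.Type, m.Text, m.Confidence)
	}
}
```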
- Format code with make fmt before committing
- Run make vet and make lint to check for common issues
- Follow Go's standard naming conventions and code organization
For questions, issues, or contributions:
- Developers: Andrea Di Fabio (adifabio@), Lee Myers (mlmyers@)
- Artwork: Original logo artwork by Olivia Myers McMullin
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: Apache-2.0