Skip to content

Fix: Arbitrary File Read Vulnerability + Feat: Security settings #1991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

adithyan-ak
Copy link
Contributor

Arbitrary File Read Vulnerability Fix + Security Enhancements

Summary

This document outlines the security vulnerabilities identified in the previous implementation of image handling within the ImageTextPromptValue class and details the remediation steps taken in the refactored code. The core issues revolved around insufficient input validation and unsafe handling of user-provided strings, leading to Arbitrary File Read (LFI), Server-Side Request Forgery (SSRF), and potential Denial of Service (DoS) vulnerabilities.

Identified Vulnerabilities (Original Code)

  1. Arbitrary File Read (LFI):

    • Via file:// URL and urlopen: The is_valid_url check allowed file:// schemes if a netloc (hostname like localhost) was provided. mimetypes.guess_type could be tricked by appending an image extension in a URL fragment (e.g., #fake.jpg), which urlopen ignores when accessing the file system. This allowed reading arbitrary files accessible to the application user (e.g., file://localhost/etc/passwd#fake.jpg).
    • Via direct open() (Potential): Although the primary exploit used urlopen, the code path involving encode_image_to_base64(item) directly called open(item). If an attacker could bypass the is_valid_url check but still trick mimetypes.guess_type (e.g., using path manipulation if not properly handled before open), this path could also lead to LFI. Null bytes (\0) were blocked by Python's open(), but other techniques might exist depending on the OS and context.
  2. Server-Side Request Forgery (SSRF):

    • Via http:// / https:// URL and urlopen: The code allowed http and https URLs via is_valid_url. If mimetypes.guess_type passed (e.g., URL ending in .jpg), the download_and_encode_image function would use urllib.request.urlopen to fetch the URL. Attackers could provide URLs pointing to internal network services, cloud metadata endpoints (like AWS 169.254.169.254), or external servers, making the application act as a proxy. The fetched content was then base64 encoded and potentially returned, facilitating data exfiltration.
  3. Denial of Service (DoS):

    • Via LFI: Reading excessively large files (e.g., /dev/zero) using the file:// LFI could exhaust memory or CPU.
    • Via SSRF: Fetching extremely large files or URLs that hang indefinitely from remote servers could exhaust memory, bandwidth, or application threads.
  4. Weak Input Validation:

    • Reliance on mimetypes.guess_type based on file extensions in user-controlled strings is fundamentally insecure. It does not verify the actual content.
    • is_valid_url only checked for scheme and netloc presence, not which schemes were safe or allowed.
    • The is_base64 check was basic and could potentially match non-image base64 data.

Remediation Strategy Implemented

The refactored code abandons the flawed is_image logic and implements a secure processing pipeline (_securely_process_item) with the following principles:

  1. Explicit Input Type Handling: The code now explicitly checks for distinct input formats in a specific order:

    • Base64 Data URIs (data:image/...;base64,...)
    • Allowed Remote URLs (http://, https://)
    • (Optional/Discouraged) Allowed Local File Paths
    • Text (Default fallback)
  2. Secure Base64 Handling:

    • Uses a regular expression (DATA_URI_REGEX) to strictly match the expected data:image/...;base64,... format.
    • Includes robust base64 decoding with error handling.
    • Optional: Can be enhanced to use Pillow to verify the decoded data is a valid image.
  3. Secure URL Fetching:

    • Scheme Allowlist: Only URLs with schemes defined in ALLOWED_URL_SCHEMES (default: http, https) are processed. file:// is disallowed by default.
    • Robust HTTP Client: Uses the requests library instead of urllib.request for easier configuration of timeouts (REQUESTS_TIMEOUT_SECONDS) and streaming downloads.
    • Size Limiting: Checks Content-Length header and enforces MAX_DOWNLOAD_SIZE_BYTES during streaming download to prevent DoS.
    • Content Validation: Crucially, downloaded content is validated using the Pillow library (Image.open, img.verify()) to ensure it is actually a valid image file before encoding and returning. This prevents processing malicious non-image files delivered via allowed URLs.
    • Verified MIME Type: The MIME type included in the final data: URI is derived from the actual image format identified by Pillow, not guessed from the URL.
  4. Secure Local File Handling (Optional & Default Disabled):

    • Disabled by Default: ALLOW_LOCAL_FILE_ACCESS is False. Must be explicitly enabled and configured with extreme care.
    • Strict Path Confinement: Requires configuring an ALLOWED_IMAGE_BASE_DIR. User input is treated as relative to this directory.
    • Path Normalization & Validation: Uses os.path.abspath and os.path.commonprefix to rigorously ensure the resolved file path remains within the ALLOWED_IMAGE_BASE_DIR, preventing directory traversal attacks (../).
    • Existence & Type Check: Verifies the path exists and is a file.
    • Size Limiting: Enforces MAX_LOCAL_FILE_SIZE_BYTES using os.path.getsize before reading the file.
    • Content Validation: Uses Pillow to verify the file content is a valid image format.
  5. Removal of Insecure Functions: The original, vulnerable methods (is_image, get_image, is_base64, is_valid_url, encode_image_to_base64, download_and_encode_image) have been removed or replaced by the secure processing logic.

Required Configuration & Dependencies

  • Dependencies: The secure code requires installing requests and Pillow:
    pip install requests Pillow
  • Constants: Review and adjust the security policy constants defined at the beginning of the file:
    • ALLOWED_URL_SCHEMES
    • MAX_DOWNLOAD_SIZE_BYTES
    • REQUESTS_TIMEOUT_SECONDS
  • Local File Access (If Enabled):
    • Set ALLOW_LOCAL_FILE_ACCESS = True.
    • Set ALLOWED_IMAGE_BASE_DIR to the absolute path of the only directory allowed for image loading. Ensure this directory has appropriate permissions and contains no sensitive files.
    • Adjust MAX_LOCAL_FILE_SIZE_BYTES if needed.

Conclusion

The refactored code significantly enhances security by replacing insecure pattern matching and uncontrolled resource fetching with explicit validation, strict policy enforcement (schemes, paths, sizes), and content verification using trusted libraries. This approach mitigates the identified LFI, SSRF, and DoS vulnerabilities. Remember to keep dependencies (requests, Pillow) updated to patch potential vulnerabilities within them.

@adithyan-ak
Copy link
Contributor Author

@jjmachan I have prepared a comprehensive fix for the vulnerability and would appreciate your review. Please let me know if you have any questions. While I have conducted preliminary functional testing, I would value your thorough testing to ensure all features remain intact and nothing breaks. I will update the pull request with the CVE-ID once it is assigned by MITRE, and you can proceed with merging the fix afterward 😃

@jjmachan
Copy link
Member

jjmachan commented Apr 3, 2025

hey @adithyan-ak thanks again man 🙂. I'll do the testing on my end to get this merged in this week so that we can do a release next week along with a couple of PRs we have pending as well.

This PR introduces checks to mitigate Server-Side Request Forgery (SSRF) vulnerabilities when fetching remote images via URLs.

Changes:

- Before fetching http/https URLs in _download_validate_and_encode, the hostname is now resolved to its IP address(es) using socket.getaddrinfo.

- Each resolved IP address is validated using the ipaddress library against a configurable policy (DISALLOWED_IP_CHECKS).

- By default, requests to loopback (127.0.0.1, ::1), private (10.x, 172.16.x, 192.168.x), link-local (169.254.x.x), and reserved IPs are blocked.

- A configuration flag ALLOW_INTERNAL_TARGETS (default: False) exists but disabling this check is strongly discouraged.
@adithyan-ak
Copy link
Contributor Author

Explanation of Changes

SSRF Protection via IP Address Validation:

  • DNS Resolution & IP Check: Before any allowed remote URL (http/https) is fetched, the hostname is resolved to its IP address(es) using socket.getaddrinfo.
  • Disallowed IP Categories: Each resolved IP address is checked against disallowed categories using the ipaddress library. By default (DISALLOWED_IP_CHECKS), requests to loopback (127.0.0.1, ::1), private (10.x.x.x, 172.16.x.x-172.31.x.x, 192.168.x.x), link-local (169.254.x.x), and reserved IP addresses are blocked to prevent SSRF attacks targeting internal networks or cloud metadata services.
  • Configuration: This blocking behavior is enabled by default (ALLOW_INTERNAL_TARGETS = False). Disabling this check is strongly discouraged due to security risks.
  • Logging: Blocked attempts due to disallowed IPs are logged with an error message indicating the hostname and the problematic IP address.

Limitations and Considerations

  • SSRF via Redirects: The current IP address check happens before the initial HTTP request. If allow_redirects=True (the default for requests), a request to an initially safe public URL could be redirected by the remote server to a disallowed internal IP address, bypassing the initial check. Setting allow_redirects=False mitigates this but might break legitimate cross-domain image links. Manually handling redirects and re-checking the Location header's target IP adds complexity.
  • DNS Rebinding Attacks: Advanced attackers controlling a DNS server could potentially return a safe public IP during the application's check and then quickly change the DNS record to a private IP before the actual connection is established by the requests library. Mitigating this robustly at the application layer is complex. The implemented check significantly reduces the most common SSRF risks.
…mpt.py

Added explicit binascii import and fixed the exception handler to use binascii.Error instead of base64.binascii.Error, resolving type checking errors.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@adithyan-ak adithyan-ak marked this pull request as ready for review May 4, 2025 03:13
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L This PR changes 100-499 lines, ignoring generated files.
2 participants