Beyond BLOBs: A Flexible, Database-Integrated Approach to File Storage
How to combine PostgreSQL, PostgREST, and PL/Python to build a secure and abstract file management backend.
In modern application development, managing user-uploaded files presents a classic architectural challenge. The most common approach — storing binary large objects (BLOBs) directly in the database — is simple to implement but often leads to significant performance degradation, bloated backups, and scalability nightmares. On the other hand, storing files on a separate filesystem disconnects them from the database’s transactional and security models, creating a new set of data consistency problems.
What if we could get the best of both worlds? An architecture where the database remains the single source of truth for all metadata and security policies, while delegating the heavy lifting of binary storage to an external, abstract filesystem.
This article explores a proof-of-concept that achieves just that, leveraging the powerful combination of PostgreSQL, PostgREST, and PL/Python to create a secure, flexible, and deeply integrated file storage backend.
Disclaimer
This project is a technical demonstration and is not production-ready. It was developed with the assistance of generative AI tools to explore a specific architectural pattern. The solution has been tested with PostgreSQL 15, using both local OS filesystem storage and Amazon S3 as backends.
The Challenge: Bridging Databases and File Systems
The core problem is one of integration. We want our file metadata — like filenames, descriptions, ownership, and timestamps — to live alongside our other application data in a relational database. This allows us to enforce foreign key constraints, apply complex security rules, and perform transactional operations.
However, storing the file content itself (the bytes) in the same database is inefficient. Databases are optimized for structured data, not for streaming gigabytes of binary content.
The ideal solution would:
- Keep metadata in the database.
- Store binary data on a suitable filesystem (like a local disk, a network share, or a cloud object store like S3).
- Ensure the database orchestrates all file operations, so the application’s API doesn’t need to know where the files are physically stored.
- Enforce security and access control at the database level.
An Integrated Architecture
Our solution is a containerized stack orchestrated by Docker Compose, where each component has a distinct and powerful role. The docker-compose.yml file below defines the entire stack, including the React frontend, the PostgREST API, the custom PostgreSQL image, and the Traefik reverse proxy that directs traffic.
# docker-compose.yml
version: '3.8'

services:
  # PostgreSQL database service, built from a custom Dockerfile to include PL/Python
  postgres:
    build:
      context: .
      dockerfile: postgres/postgres.Dockerfile
    container_name: postgres_db
    restart: always
    environment:
      POSTGRES_DB: file_storage_db
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: your_strong_password
      # The key to our storage abstraction. PL/Python reads this variable.
      # Change this one line to switch from local storage to S3, etc.
      STORAGE_BACKEND_URL: "osfs:///var/storage"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
      - ./storage:/var/storage # Mount local storage directory
    ports:
      - "5433:5432"
    networks:
      - backend_net

  # React Frontend GUI
  gui_service:
    build:
      context: ./gui/react_gui
    container_name: gui_service
    ports:
      - "8082:80"
    networks:
      - backend_net
    labels:
      # Traefik label to route root traffic to the GUI
      - "traefik.http.routers.gui_service.rule=Host(`localhost`)"

  # PostgREST service to expose the database as a RESTful API
  postgrest:
    image: postgrest/postgrest:latest
    container_name: postgrest_api
    restart: always
    environment:
      PGRST_DB_URI: "postgres://admin:your_strong_password@postgres_db:5432/file_storage_db"
      PGRST_DB_SCHEMA: "file_storage" # Expose only the public API schema
      PGRST_JWT_SECRET: "a_very_secret_and_long_jwt_secret_key_that_is_at_least_32_characters"
      PGRST_DB_ANON_ROLE: "authenticator" # Role for handling login requests
    depends_on:
      - postgres
    networks:
      - backend_net
    labels:
      # Traefik label to route API calls to PostgREST
      - "traefik.http.routers.postgrest.rule=Host(`localhost`) && (PathPrefix(`/rpc`) || PathPrefix(`/files`))"

  # Traefik reverse proxy
  traefik:
    image: traefik:v2.10
    container_name: traefik_proxy
    command:
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--entrypoints.web.address=:81"
    ports:
      - "81:81" # Main entry point for all traffic
      - "8080:8080" # Traefik dashboard
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - backend_net

volumes:
  postgres_data:

networks:
  backend_net:

- PostgreSQL as the Core: The database is the brain of the operation. It doesn’t just store metadata; it contains the business logic for authentication, file access, and security, all implemented in SQL and PL/pgSQL.
- PostgREST for the API Layer: PostgREST is a remarkable tool that takes a PostgreSQL schema and instantly turns it into a secure, RESTful API. All our API endpoints, from /rpc/login to /rpc/store_file, are simply database functions that PostgREST exposes over HTTP.
- PL/Python and pyfilesystem: This is the secret sauce. By enabling the PL/Python procedural language in PostgreSQL, we can execute Python code from within our SQL functions. We use the versatile pyfilesystem library, which provides a simple, universal API for interacting with dozens of different filesystems. A single environment variable (STORAGE_BACKEND_URL) can switch the entire application's storage from a local folder (osfs:///var/storage) to an S3 bucket (s3://your-bucket) without changing a single line of SQL or API code (a short sketch of this follows the list).
- Docker and Traefik: Docker containerizes our services for portability, while Traefik acts as a reverse proxy, routing requests from the user’s browser to either the React GUI or the PostgREST API based on the URL path.
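To make that switch concrete, here is a minimal, hypothetical helper in the same style as the project's internal functions (it is not part of the repository). It assumes the pyfilesystem (fs) package is installed in the PostgreSQL image, plus fs-s3fs and AWS credentials whenever an s3:// URL is configured:

-- Hypothetical helper (not in the project): report which storage backend
-- the current STORAGE_BACKEND_URL points to, without writing any data.
CREATE OR REPLACE FUNCTION file_storage.internal_storage_info()
RETURNS TEXT AS $$
import os
from fs import open_fs

storage_url = os.environ.get('STORAGE_BACKEND_URL', 'osfs:///var/storage')
# open_fs() resolves the URL to a concrete filesystem (local disk, S3, FTP, ...).
with open_fs(storage_url) as backend_fs:
    return '{} -> {}'.format(storage_url, backend_fs)
$$ LANGUAGE plpython3u SECURITY DEFINER;

-- Usage: SELECT file_storage.internal_storage_info();
-- Switching STORAGE_BACKEND_URL to "s3://your-bucket" in docker-compose.yml
-- changes the result (and all file I/O) with no SQL changes.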
The Database as the Brain
In this model, the database is not a passive data store. It’s an active participant in the application logic. The login function, for example, verifies credentials and generates a signed JWT directly within the database.
-- file_storage.login()
-- This function is called by unauthenticated users. It checks credentials
-- and returns a signed JWT if they are valid.
CREATE OR REPLACE FUNCTION file_storage.login(email TEXT, password TEXT, OUT token text) AS $$
DECLARE
user_record file_storage.users;
BEGIN
-- Check email and password against the users table
SELECT * INTO user_record FROM file_storage.users WHERE users.email = login.email;
IF user_record.id IS NOT NULL AND user_record.password_hash = crypt(password, user_record.password_hash) THEN
-- If valid, call the internal sign function to create the token.
-- The secret here MUST match the PGRST_JWT_SECRET in docker-compose.yml
SELECT jwt_internal.sign(
json_build_object(
'role', user_record.role,
'user_id', user_record.id,
'username', user_record.username,
'exp', extract(epoch from now() + interval '1 day')
),
'a_very_secret_and_long_jwt_secret_key_that_is_at_least_32_characters'
) INTO token;
ELSE
RAISE invalid_password USING MESSAGE = 'Invalid email or password';
END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

When a user uploads a file, the request hits a PostgREST endpoint that calls a PostgreSQL function, file_storage.store_file(). This function acts as a secure wrapper, validating the request before delegating the actual file I/O to a separate, internal PL/Python function.
-- file_storage.store_file()
-- This is the public-facing RPC endpoint for uploading files.
CREATE OR REPLACE FUNCTION file_storage.store_file(p_file_data BYTEA)
RETURNS UUID AS $$
DECLARE
v_user_id UUID;
v_username TEXT;
v_user_provided_path TEXT;
v_description TEXT;
v_storage_path TEXT;
v_file_id UUID;
BEGIN
-- 1. Extract user info from the JWT, and the file path and optional description
--    from the request headers (the description header is an assumption here;
--    adapt it to whatever the client actually sends).
SELECT
current_setting('request.jwt.claims', true)::jsonb ->> 'user_id',
current_setting('request.jwt.claims', true)::jsonb ->> 'username',
current_setting('request.headers', true)::jsonb ->> 'file-path',
current_setting('request.headers', true)::jsonb ->> 'description'
INTO v_user_id, v_username, v_user_provided_path, v_description;
-- 2. Construct the actual, secure storage path.
v_storage_path := v_username || '/' || v_user_provided_path;
-- 3. Call the internal Python function to write the file to the storage backend.
PERFORM file_storage.internal_write_to_storage(v_storage_path, p_file_data);
-- 4. Upsert the file's metadata into the database.
INSERT INTO file_storage.files (user_id, file_path, description)
VALUES (v_user_id, v_user_provided_path, v_description)
ON CONFLICT (user_id, file_path) DO UPDATE SET updated_at = NOW()
RETURNING id INTO v_file_id;
RETURN v_file_id;
END;
$$ LANGUAGE plpgsql;

The magic happens in the internal_write_to_storage function, which is written in PL/Python. It's completely unaware of users, roles, or JWTs; its only job is to write bytes to a path using the abstract pyfilesystem library.
-- file_storage.internal_write_to_storage()
-- This internal PL/Python function handles the actual file I/O.
-- It is never exposed to the API.
CREATE OR REPLACE FUNCTION file_storage.internal_write_to_storage(p_storage_path TEXT, p_file_data BYTEA)
RETURNS void AS $$
import os
from fs import open_fs
# Read the storage backend URL from the environment variable set in docker-compose
storage_url = os.environ.get('STORAGE_BACKEND_URL', 'osfs:///var/storage')
# open_fs() from pyfilesystem creates a filesystem object, be it local, S3, FTP, etc.
with open_fs(storage_url) as home_fs:
    # Ensure the target directory exists
    dir_path = os.path.dirname(p_storage_path)
    if dir_path:
        home_fs.makedirs(dir_path, recreate=True)
    # Write the binary data to the specified path
    home_fs.writebytes(p_storage_path, p_file_data)
$$ LANGUAGE plpython3u SECURITY DEFINER;

Every step is controlled from within the database, and access is enforced by PostgreSQL’s robust Row-Level Security (RLS) policies, ensuring users can only ever see or retrieve their own files.
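To illustrate that last point, here is a sketch of what the metadata table and an owner-only policy might look like. The column list is inferred from the functions above, and the project's actual DDL and policies may differ; the ownership check reuses the same user_id claim that store_file() reads from the JWT:

-- Reconstructed sketch of the metadata table (columns inferred from the
-- functions above; the project's real DDL may differ).
CREATE TABLE IF NOT EXISTS file_storage.files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES file_storage.users (id),
    file_path TEXT NOT NULL,
    description TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (user_id, file_path)  -- target of the ON CONFLICT clause in store_file()
);

-- Owner-only row-level security: each user sees and modifies only their own rows.
ALTER TABLE file_storage.files ENABLE ROW LEVEL SECURITY;

CREATE POLICY files_owner_only ON file_storage.files
FOR ALL
USING (user_id = (current_setting('request.jwt.claims', true)::jsonb ->> 'user_id')::uuid)
WITH CHECK (user_id = (current_setting('request.jwt.claims', true)::jsonb ->> 'user_id')::uuid);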
Conclusion and Future Directions
This architecture demonstrates a powerful pattern for building flexible and secure systems. By leveraging the extensibility of PostgreSQL, we can create a backend where the storage layer is a completely abstract and interchangeable component, all while maintaining the transactional integrity and security of a database-centric design.
While this project serves as a successful proof-of-concept, several improvements would be necessary to make it truly production-ready.
1. Content-Addressable Storage
The current method of storing files based on a user-provided path prepended with a username is effective but has limitations. A more robust and scalable approach is content-addressable storage.
- How it works: Instead of using a filename, we would calculate a cryptographic hash (e.g., SHA-256) of the file’s content upon upload. This hash becomes the file’s unique identifier.
- Benefits:
- Automatic Deduplication: If ten users upload the exact same file, only one copy is stored. The database simply creates ten metadata records all pointing to the same hash. This saves significant storage space.
- Data Integrity: We can verify the integrity of a downloaded file at any time by hashing its content and comparing it to the stored hash.
- Versioning: This model makes file versioning trivial. Updating a file simply means uploading new content, which generates a new hash and a new metadata record, while the old version remains untouched.
- Implementation Example: To avoid having millions of files in a single directory, the hash itself can be used to create a nested folder structure. For example, a file with the hash c85320d9ddb90c13f4a215f1f0a87b531ab33310 would be stored at the path /c8/53/20/d9/.../10 (sketched below).
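A minimal sketch of such a helper (not part of the project) could derive the path in SQL. It assumes the pgcrypto extension, which the schema already relies on for crypt(), and uses the first four byte pairs of the digest as directory levels with the full hash as the file name:

-- Hypothetical sketch: derive a content-addressed storage path from file bytes.
-- Requires the pgcrypto extension for digest().
CREATE OR REPLACE FUNCTION file_storage.content_address(p_file_data BYTEA)
RETURNS TEXT AS $$
DECLARE
    v_hash TEXT := encode(digest(p_file_data, 'sha256'), 'hex');
BEGIN
    -- e.g. 'c85320d9...' becomes 'c8/53/20/d9/c85320d9...'
    RETURN substr(v_hash, 1, 2) || '/' || substr(v_hash, 3, 2) || '/'
        || substr(v_hash, 5, 2) || '/' || substr(v_hash, 7, 2) || '/' || v_hash;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Usage: store_file() would call this instead of building a username-based path
-- and record the returned value in the metadata row.
-- SELECT file_storage.content_address('hello world'::bytea);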
2. Enhanced Security and Configuration
- JWT Secret Management: The JWT secret is currently hardcoded in the database login function for simplicity. In production, this is a major security risk. The secret should be removed from the database code and managed exclusively by PostgREST, which can load it securely from an environment variable or a secrets management service (one partial mitigation is sketched below).
- Fine-Grained Permissions: The permissions could be further refined, for instance, by creating roles that can only read or write to specific paths within a user’s storage.
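One partial mitigation, sketched here as an assumption rather than the project's actual approach, is to move the literal out of the function body into a custom database setting and read it with current_setting(); fully delegating JWT handling to PostgREST remains the better end state:

-- Hypothetical sketch: keep the signing secret in a database setting instead of
-- hardcoding it in login(). Set once at deployment time (value from a secrets manager):
ALTER DATABASE file_storage_db SET app.jwt_secret = 'a_very_secret_and_long_jwt_secret_key_that_is_at_least_32_characters';

-- login() would then replace the hardcoded literal with:
--   current_setting('app.jwt_secret')
-- which must still match PGRST_JWT_SECRET in docker-compose.yml.
SELECT current_setting('app.jwt_secret');  -- verify the setting is visible to new sessions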
3. Scalability and Reliability
- Asynchronous Uploads: For very large files, the synchronous HTTP request could time out. A production system could implement an asynchronous flow where the file is first uploaded to a temporary location, and a background worker process handles the hashing, transfer to permanent storage, and database metadata update (a rough sketch follows this list).
- Robust Error Handling and Logging: More comprehensive logging and more specific error handling would be crucial for debugging issues in a live environment.
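A rough sketch of the database side of that flow, using a hypothetical staging table and a NOTIFY channel that a background worker would listen on (none of these objects exist in the project):

-- Hypothetical staging table and enqueue function for asynchronous uploads.
CREATE TABLE IF NOT EXISTS file_storage.upload_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,
    file_path TEXT NOT NULL,
    file_data BYTEA NOT NULL,              -- temporary copy, deleted once processed
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION file_storage.enqueue_upload(
    p_user_id UUID, p_file_path TEXT, p_file_data BYTEA)
RETURNS UUID AS $$
DECLARE
    v_job_id UUID;
BEGIN
    INSERT INTO file_storage.upload_jobs (user_id, file_path, file_data)
    VALUES (p_user_id, p_file_path, p_file_data)
    RETURNING id INTO v_job_id;
    NOTIFY upload_jobs;  -- wake a worker that hashes, moves to storage, and records metadata
    RETURN v_job_id;
END;
$$ LANGUAGE plpgsql;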
By building on this foundation and incorporating these improvements, it’s possible to create a highly reliable, secure, and exceptionally flexible file management system that truly combines the best of the database and the filesystem worlds.
The code inspiring this article can be found at: https://github.com/desmoteo/postgrest_fs
