Plenful San Francisco, CA

Site Reliability Engineer

Posted 1 day ago

About Plenful

Plenful is on a mission to move pharmacy forward through intelligent automation. We build AI-powered software that eliminates administrative burden, strengthens compliance, and unlocks revenue across critical pharmacy workflows, solving one of the biggest challenges in healthcare today: delayed patient care.

Built by a passionate team of former healthcare operators and world-class AI technologists, Plenful combines deep domain expertise with enterprise-grade technology to automate complex workflows across intake authorization, 340B program optimization, and pharmacy revenue reconciliation. Our AI platform is trusted by 95+ leading healthcare organizations to power smarter, faster, and more resilient pharmacy operations.

Backed by leading investors including Notable Capital, Bessemer Venture Partners, and TQ Ventures, Plenful is building the institutional memory for healthcare and powering the most complex, highest-ROI healthcare workflows. We’re actively hiring as we continue to scale.

About The Role

We’re hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of Plenful’s production systems as we continue to grow.

This role centers on operating real systems at scale. Beyond building infrastructure, it requires a deep understanding of how that infrastructure behaves under load, fails in production, and recovers. You’ll define reliability standards, own production health, and build the feedback loops that make our systems more resilient over time.

You’ll work closely with backend, data, and ML engineers to ensure our platform is highly available, measurable, and continuously improving. This includes everything from incident response and performance debugging to SLO design and system-level optimization.

What You’ll Do

Reliability Engineering & System Ownership

Define and implement SLIs, SLOs, and error budgets across core services
Own production system health, including uptime, latency, and availability targets
Continuously improve system resilience through proactive reliability work
Identify and mitigate single points of failure across distributed systems

Production Operations & Incident Response

Participate in and improve on-call rotations and incident response processes
Lead incident triage, mitigation, and resolution in real time
Conduct blameless postmortems and ensure follow-through on action items
Build tooling and automation to reduce MTTR (Mean Time to Recovery)

Observability & System Insight

Design and evolve observability systems across metrics, logs, and distributed tracing (OpenTelemetry), with tooling including Datadog, CloudWatch, Grafana, and Sentry
Improve signal quality to reduce noise and alert fatigue
Develop dashboards and alerts that reflect real system health and user impact
Use observability data to drive performance and reliability improvements

Performance & Scalability

Analyze system performance under load and identify bottlenecks
Optimize latency, throughput, and resource utilization across serverless systems (AWS Lambda), containerized services (ECS), and data systems (Aurora Postgres, ClickHouse)
Partner with engineering teams to improve system efficiency and scaling behavior

Automation & Reliability Tooling

Build automation to eliminate repetitive operational work
Improve deployment safety through reliability checks and safeguards
Contribute to CI/CD pipelines (GitHub Actions) with a focus on system stability
Develop tools for incident response, debugging, and capacity planning

Security, Compliance & Operational Maturity

Partner with security and compliance to ensure systems meet operational standards
Support audit readiness and reliability-related compliance requirements (Vanta)
Integrate monitoring and alerting into security and SIEM workflows
Help mature operational practices across the engineering team

Environment & Technical Context

You’ll work across a modern distributed stack:

Cloud: AWS (ECS, Lambda, RDS Aurora Postgres, CloudWatch)
Infrastructure: Terraform, Ansible, Linux
CI/CD: GitHub Actions
Observability: Datadog, Grafana, CloudWatch, OpenTelemetry, Sentry, pganalyze
Data Systems: Postgres, ClickHouse
Security & Compliance: Vanta, SIEM tooling
Product & Analytics: Amplitude
ML/Platform Infra: TrueFoundry

What Success Looks Like

Clear, enforced SLOs and error budgets across critical systems
Incidents are rare, well managed, and decreasing over time
Engineers have high-confidence signals about system health
Alerts are actionable, not noisy
Systems scale predictably under load without degradation
Postmortems lead to real, measurable improvements
Reliability is treated as a shared engineering responsibility, not a reactive function

Ideal Background

Must Have

5+ years in Site Reliability Engineering, SRE-adjacent roles, or production infrastructure
Strong experience operating and debugging distributed systems in production
Hands-on experience with observability tooling (Datadog, Grafana, OpenTelemetry, etc.), incident response and on-call practices, and performance and reliability debugging
Experience defining and working with SLOs / SLIs / error budgets
Familiarity with AWS environments, serverless and container-based architectures, and Postgres or similar relational databases
Ability to write code/scripts (Python, Bash, etc.) for automation and tooling
Strong systems thinking and ability to reason about failure modes

Nice to Have

Experience in high-growth or high-scale environments
Background in regulated industries (healthcare, fintech)
Experience with ClickHouse or analytical systems at scale
Familiarity with chaos engineering or load testing frameworks
Exposure to ML infrastructure or data platforms

Plenful Perks

Comprehensive Benefits Package: Enjoy unlimited PTO, fully covered health insurance (medical, dental, and vision), meal stipend, health & wellness stipend, 401(k) matching, and stock options.
Mission-Driven, World-Class Team: Join an exceptional group of professionals aligned around a meaningful mission and committed to making an impact.
Opportunities for Growth: Strengthen your expertise through collaboration with experienced, high-performing leaders across the organization.
Flexible Work Environment: Employees based in the Bay Area enjoy two days per week in a brand-new downtown San Francisco office. Employees based in other cities enjoy a fully remote work environment with the ability to travel for collaboration.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Hospitals and Health Care
