health

A health check library for Go services

go get github.com/schigh/health/v2

When To Use This

This library is designed for Go services running in Kubernetes with multiple external dependencies. It is most valuable when you need readiness separate from liveness (your pod is alive but a dependency is down), when you have startup sequencing requirements, or when you run multiple services and want structured visibility into what depends on what.

If your service is stateless with no external dependencies, a simple handler returning 200 is sufficient. Where this library earns its place is in multi-service environments where "what's broken and why" matters more than "is the process alive."

Why This Library

Most Go health check libraries are either unmaintained, carry heavy dependency trees, or require you to write every checker from scratch. This library takes a different approach: zero external dependencies in the core, batteries included, and designed for Kubernetes from the ground up.

|                   | health/v2                          | heptiolabs      | alexliesenfeld | InVisionApp |
|-------------------|------------------------------------|-----------------|----------------|-------------|
| External deps     | 0                                  | 2               | 3              | 5           |
| K8s probes        | live, ready, startup               | live, ready     | live, ready    | live, ready |
| Degraded state    | yes                                | no              | no             | no          |
| Built-in checkers | HTTP, TCP, DNS, Redis, DB, command | HTTP, TCP, DNS  | none           | none        |
| Dependency graphs | yes                                | no              | no             | no          |
| OTel / Prometheus | both                               | Prometheus only | no             | no          |
| Maintained        | active                             | archived        | active         | archived    |

Core Concepts

The library is built around three interfaces: Manager, Checker, and Reporter. Understanding how they interact is essential to using the library effectively.


Manager

The Manager is the central orchestrator. It controls the lifecycle of every checker and reporter in the system. Calling mgr.Run(ctx) starts all reporters, dispatches all checkers on their configured schedules, and begins evaluating aggregate fitness.

The manager is responsible for:

  • Dispatching checks (one-time or at intervals, with optional delays)
  • Processing check results through a buffered channel (checkFunnel)
  • Evaluating aggregate fitness: liveness, readiness, and startup
  • Notifying all reporters when state changes
  • Graceful shutdown of all checkers and reporters

mgr := std.Manager{Logger: health.DefaultLogger()}
mgr.AddCheck("postgres", checker, opts...)
mgr.AddReporter("http", reporter)
errChan := mgr.Run(ctx)  // starts everything
mgr.Stop(ctx)             // graceful shutdown

Checker

A Checker performs an individual health check and returns a *CheckResult. That's the entire interface: one method.

type Checker interface {
    Check(context.Context) *CheckResult
}

The manager calls this method on the schedule you configure. A checker has no knowledge of scheduling, reporters, or fitness evaluation. Its sole responsibility is to answer one question: is this dependency healthy right now?

Checkers return a CheckResult with:

  • Status: Healthy, Degraded, or Unhealthy
  • Duration: how long the check took
  • Error: what went wrong (if anything)
  • Metadata: arbitrary key-value pairs for observability

The manager overrides certain fields after the checker returns (Name, impact flags, Group, ComponentType, DependsOn) from the registered options. This means a checker never needs to know how it's configured in the manager.


Reporter

A Reporter exposes health state to the outside world. Some are pull-based (HTTP server waits for requests), some are push-based (stdout prints on every update, OTel emits metrics).

type Reporter interface {
    Run(context.Context) error
    Stop(context.Context) error
    SetLiveness(context.Context, bool)
    SetReadiness(context.Context, bool)
    SetStartup(context.Context, bool)
    UpdateHealthChecks(context.Context, map[string]*CheckResult)
}

The manager calls SetLiveness, SetReadiness, and SetStartup whenever the aggregate state changes. It calls UpdateHealthChecks after every individual check completes, passing the latest result. Reporters are free to store, cache, or forward this data however they want.

A manager can have multiple reporters simultaneously (e.g., HTTP for K8s probes + Prometheus for metrics + stdout for local dev).

How They Fit Together

flowchart LR
    subgraph reg ["You Register"]
        AC1["AddCheck(...)"]
        AC2["AddCheck(...)"]
        AC3["AddCheck(...)"]
        AR1["AddReporter(...)"]
        AR2["AddReporter(...)"]
    end
    subgraph mgr ["Manager"]
        D["Dispatch on Schedule"]
        C["Collect Results"]
        E["Evaluate Fitness"]
        N["Notify on State Change"]
        D --> C --> E --> N
    end
    subgraph out ["Reporters Expose"]
        L["/livez"]
        R["/readyz"]
        H["/healthz"]
        W["/.well-known/health"]
        O["OTel Metrics"]
        P["Prometheus /metrics"]
        G["gRPC Health.Check()"]
    end
    AC1 --> D
    AC2 --> D
    AC3 --> D
    AR1 --> N
    AR2 --> N
    N --> L & R & H & W & O & P & G
    

CheckResult

The CheckResult is the data type that flows through the entire system. Every checker produces one, the manager enriches it, and every reporter consumes it.

type CheckResult struct {
    Name             string            // set by manager from registered name
    Status           Status            // Healthy, Degraded, or Unhealthy
    AffectsLiveness  bool              // set by manager from WithLivenessImpact()
    AffectsReadiness bool              // set by manager from WithReadinessImpact()
    AffectsStartup   bool              // set by manager from WithStartupImpact()
    Group            string            // set by manager from WithGroup()
    ComponentType    string            // set by manager from WithComponentType()
    DependsOn        []string          // set by manager from WithDependsOn()
    Error            error             // set by checker
    ErrorSince       time.Time         // set by checker
    Duration         time.Duration     // set by checker
    Metadata         map[string]string // set by checker
    Timestamp        time.Time         // set by checker
}

Fitness Evaluation

After every check result, the manager runs evaluateFitness. This is the logic that determines whether the service is live, ready, and started: startup-impacting checks must all pass before liveness and readiness are evaluated, a failing liveness-impacting check also fails readiness, and a service is never ready while it is not live.
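The exact evaluation is internal to the manager, but the rules it applies can be sketched as a pure function. The CheckResult fields below are pared-down stand-ins, not the library's real type:

```go
package main

// CheckResult is a simplified stand-in used only to make the rules concrete.
type CheckResult struct {
	Unhealthy        bool // StatusUnhealthy; Degraded does not fail probes
	AffectsLiveness  bool
	AffectsReadiness bool
	AffectsStartup   bool
}

// evaluateFitness applies the documented rules: startup gates everything
// (and once complete is never re-evaluated, modeled by startupDone),
// and a service can never be ready while it is not live.
func evaluateFitness(results []CheckResult, startupDone bool) (live, ready, started bool) {
	started = startupDone
	if !started {
		started = true
		for _, r := range results {
			if r.AffectsStartup && r.Unhealthy {
				started = false
			}
		}
	}
	if !started {
		return false, false, false // startup gate: nothing else is evaluated
	}
	live, ready = true, true
	for _, r := range results {
		if r.Unhealthy && r.AffectsLiveness {
			live = false
		}
		if r.Unhealthy && r.AffectsReadiness {
			ready = false
		}
	}
	if !live {
		ready = false // a liveness failure implies a readiness failure
	}
	return live, ready, started
}
```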

Concurrency Model

Each checker gets its own dispatch goroutine. Results flow through a buffered channel (checkFunnel, sized to the number of checkers) into a single processing goroutine that runs processHealthCheck and evaluateFitness sequentially. This means fitness evaluation is never concurrent with itself, eliminating race conditions in state transitions.

Checkers that panic are recovered by safeCheck() and reported as Unhealthy. Nil results are caught with a nil guard. The manager never crashes from a misbehaving checker.
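The funnel pattern is easy to see in miniature. The sketch below is illustrative, not the manager's actual code: many producer goroutines, one buffered channel, one consumer that processes results strictly in sequence:

```go
package main

import "sync"

// funnel runs one producer goroutine per checker, feeding a buffered
// channel drained by a single processing goroutine — so process() never
// races with itself, just like processHealthCheck/evaluateFitness.
func funnel(numCheckers int, produce func(i int) string, process func(string)) {
	ch := make(chan string, numCheckers) // buffered like checkFunnel
	done := make(chan struct{})
	go func() { // the single processing goroutine
		for r := range ch {
			process(r) // sequential: no concurrent fitness evaluation
		}
		close(done)
	}()
	var wg sync.WaitGroup
	for i := 0; i < numCheckers; i++ {
		wg.Add(1)
		go func(i int) { // one dispatch goroutine per checker
			defer wg.Done()
			ch <- produce(i)
		}(i)
	}
	wg.Wait()
	close(ch)
	<-done // wait for the consumer to drain everything
}
```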

Quick Start

This example registers an HTTP dependency check, a TCP check against Postgres, and an HTTP reporter with BasicAuth middleware.

package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"
    "time"

    "github.com/schigh/health/v2"
    "github.com/schigh/health/v2/manager/std"
    "github.com/schigh/health/v2/checker/http"
    "github.com/schigh/health/v2/checker/tcp"
    "github.com/schigh/health/v2/reporter/httpserver"
)

func main() {
    ctx, cancel := signal.NotifyContext(context.Background(),
        syscall.SIGINT, syscall.SIGTERM)
    defer cancel()

    mgr := std.Manager{Logger: health.DefaultLogger()}

    // HTTP dependency check
    if err := mgr.AddCheck("payments-api",
        http.NewChecker("payments", "https://payments.internal/health"),
        health.WithCheckFrequency(health.CheckAtInterval, 10*time.Second, 0),
        health.WithLivenessImpact(),
        health.WithReadinessImpact(),
        health.WithGroup("external"),
        health.WithComponentType("http"),
    ); err != nil {
        log.Fatal(err)
    }

    // Postgres via TCP
    if err := mgr.AddCheck("postgres",
        tcp.NewChecker("postgres", "localhost:5432"),
        health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
        health.WithLivenessImpact(),
        health.WithReadinessImpact(),
        health.WithStartupImpact(),
        health.WithGroup("database"),
        health.WithComponentType("datastore"),
    ); err != nil {
        log.Fatal(err)
    }

    // HTTP reporter with BasicAuth
    if err := mgr.AddReporter("http", httpserver.New(
        httpserver.WithPort(8181),
        httpserver.WithServiceName("orders-api"),
        httpserver.WithMiddleware(httpserver.BasicAuth("admin", "secret")),
    )); err != nil {
        log.Fatal(err)
    }

    errChan := mgr.Run(ctx)
    select {
    case err := <-errChan:
        log.Fatalf("manager error: %v", err)
    case <-ctx.Done():
        if err := mgr.Stop(ctx); err != nil {
            log.Printf("stop error: %v", err)
        }
    }
}

Built-in Checkers

All built-in checkers use only the Go standard library and have no external dependencies. Each returns a *health.CheckResult populated with Duration, Timestamp, and Metadata.

checker/http

Checks that an HTTP endpoint returns the expected status code.

c := http.NewChecker("api", "https://api.example.com/health",
    http.WithTimeout(3*time.Second),
    http.WithExpectedStatus(200),
    http.WithMethod("HEAD"),
)
Options: WithTimeout, WithExpectedStatus, WithMethod, WithClient

checker/tcp

Checks that a TCP port is accepting connections.

c := tcp.NewChecker("postgres", "localhost:5432",
    tcp.WithTimeout(2*time.Second),
)
Options: WithTimeout

checker/dns

Checks that a hostname resolves to at least one address.

c := dns.NewChecker("coredns", "kubernetes.default.svc",
    dns.WithTimeout(2*time.Second),
)
Options: WithTimeout, WithResolver

checker/redis

PING via raw RESP protocol. Zero dependency on go-redis.

c := redis.NewChecker("cache", "localhost:6379",
    redis.WithTimeout(time.Second),
    redis.WithPassword("secret"),
)
Options: WithTimeout, WithPassword
Supports standalone Redis with optional legacy AUTH. Redis Cluster and ACL-only not supported.
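The zero-dependency claim works because RESP is simple enough to speak by hand. A sketch of the wire format such a checker would send over a plain TCP connection (illustrative, not the checker's actual source):

```go
package main

import (
	"fmt"
	"strings"
)

// respCommand encodes a Redis command as a RESP array of bulk strings —
// the format a raw-protocol checker writes to the socket.
func respCommand(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

// pongOK reports whether a raw reply line is the simple-string +PONG.
func pongOK(reply string) bool {
	return strings.HasPrefix(reply, "+PONG")
}
```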

checker/db

Pings a SQL database via the CtxPinger interface (*sql.DB satisfies this).

c := db.NewChecker("postgres", sqlDB,
    db.WithTimeout(3*time.Second),
)
Options: WithTimeout

checker/command

Runs any func(context.Context) error, covering dependencies that have no dedicated checker. Panics are recovered.

c := command.NewChecker("s3", func(ctx context.Context) error {
    _, err := s3Client.HeadBucket(ctx,
        &s3.HeadBucketInput{Bucket: &bucket})
    return err
})

Custom Checkers

Implement the Checker interface or use CheckerFunc for one-offs:

type Checker interface {
    Check(context.Context) *CheckResult
}

// Functional shortcut
health.CheckerFunc(func(ctx context.Context) *health.CheckResult {
    return &health.CheckResult{
        Name:   "custom",
        Status: health.StatusHealthy,
    }
})

Check Options

Every check is configured using the functional options pattern. Each option is self-documenting at the call site.

| Option | What it does |
|--------|--------------|
| WithCheckFrequency(freq, interval, delay) | Set check schedule: CheckOnce, CheckAtInterval, CheckAfter |
| WithLivenessImpact() | Failing check kills liveness (and readiness) |
| WithReadinessImpact() | Failing check kills readiness only |
| WithStartupImpact() | Check must pass before liveness/readiness are evaluated |
| WithGroup("database") | Logical group for filtering and display |
| WithComponentType("datastore") | Type hint for observability tools |
| WithDependsOn("http://svc:8181") | Declare dependency for graph discovery |

Health Status

StatusHealthy

Check is passing. All good.

StatusDegraded

Check is passing with warnings. Does not fail liveness or readiness probes. Reported to observers.

StatusUnhealthy

Check is failing. Affects liveness/readiness based on impact options.

Reporters

Reporters receive health state from the manager and expose it to external observers. The core module includes three built-in reporters with no external dependencies. Three additional reporters are available as separate Go modules.

HTTP Server (core)

Runs an HTTP server with liveness, readiness, startup, and discovery endpoints.

reporter := httpserver.New(
    httpserver.WithPort(8181),
    httpserver.WithServiceName("my-api"),
)
// Endpoints:
// /livez     → 200 or 503
// /readyz    → 200 or 503
// /healthz   → 200 or 503
// /.well-known/health → manifest JSON

gRPC (separate module)

Standard grpc.health.v1.Health protocol. Check + Watch.

// go get github.com/schigh/health/v2/reporter/grpc
reporter := grpc.NewReporter(grpc.Config{
    Addr: "0.0.0.0:8182",
})

OpenTelemetry (separate module)

Emits metrics via OTel API: check status, duration, executions, liveness/readiness/startup.

// go get github.com/schigh/health/v2/reporter/otel
reporter, _ := otel.NewReporter(otel.Config{
    MeterProvider: provider,
})

Prometheus (separate module)

Exposes metrics for Prometheus scraping. Configurable namespace.

// go get github.com/schigh/health/v2/reporter/prometheus
reporter := prometheus.NewReporter(prometheus.Config{
    Namespace: "myapp",
})
http.Handle("/metrics", reporter.Handler())

stdout (core)

ASCII table output. Great for local development.

mgr.AddReporter("stdout", &stdout.Reporter{})

test (core)

Instrumented reporter for unit tests. Tracks state changes, toggles, update counts.

rpt := &test.Reporter{}
mgr.AddReporter("test", rpt)
// ... run checks ...
report := rpt.Report()
fmt.Println(report.NumLivenessStateChanges)

Caching

Any checker can be wrapped with TTL-based caching to reduce load on expensive dependencies. During a cache refresh, stale results are served to concurrent callers, preventing thundering herd behavior.

// Cache Redis check results for 30 seconds
cached := health.WithCache(
    redis.NewChecker("redis", "localhost:6379"),
    30*time.Second,
)
mgr.AddCheck("redis", cached,
    health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
    health.WithReadinessImpact(),
)
How it works: First call executes synchronously. Subsequent calls return the cached result if within TTL. When TTL expires, one goroutine refreshes while others get the stale result. Double-checked locking with RWMutex.

Middleware

The HTTP reporter supports a middleware chain. The first middleware in the list is the first to see the request.

reporter := httpserver.New(
    httpserver.WithPort(8181),
    httpserver.WithMiddleware(
        httpserver.BasicAuth("admin", "secret"),
        myLoggingMiddleware,
    ),
)

// Custom middleware
func myLoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        log.Printf("health check: %s", r.URL.Path)
        next.ServeHTTP(w, r)
    })
}
Built-in: BasicAuth(user, pass) uses constant-time comparison. Recover middleware is always outermost (catches panics).

Service Discovery

Every service using this library can expose a /.well-known/health manifest describing its health checks and their dependencies. Other services can fetch these manifests and build transitive dependency graphs without any additional infrastructure.

The Manifest

// GET /.well-known/health
{
  "service": "orders-api",
  "version": "1.2.3",
  "status": "pass",
  "checks": [
    {
      "name": "postgres",
      "status": "healthy",
      "group": "database",
      "componentType": "datastore",
      "duration": "1.2ms"
    },
    {
      "name": "payments",
      "status": "healthy",
      "dependsOn": ["http://payments:8181"]
    }
  ],
  "timestamp": "2026-03-28T20:00:00Z"
}

Walking the Graph

import "github.com/schigh/health/v2/discovery"

// Fetch one service's manifest
manifest, _ := discovery.FetchManifest(ctx, "http://orders:8181")

// Walk the full dependency graph (BFS)
graph, _ := discovery.DiscoverGraph(ctx, "http://api-gateway:8181",
    discovery.WithMaxDepth(5),
    discovery.WithTimeout(3*time.Second),
)

// Render as Mermaid
fmt.Println(graph.Mermaid())

// Render as Graphviz DOT
fmt.Println(graph.DOT())

How It Works

flowchart LR
    A["orders-api
/.well-known/health"]
    B["payments
/.well-known/health"]
    C["stripe-gw
/.well-known/health"]
    A -- DependsOn --> B -- DependsOn --> C
    DG["DiscoverGraph()"] -.->|"1. fetch manifest"| A
    DG -.->|"2. follow DependsOn"| B
    DG -.->|"3. follow DependsOn"| C
    style DG fill:#6c5ce7,color:#fff,stroke:#6c5ce7
    style A fill:#4caf50,color:#fff,stroke:#4caf50
    style B fill:#4caf50,color:#fff,stroke:#4caf50
    style C fill:#4caf50,color:#fff,stroke:#4caf50
Convention: /.well-known/ follows RFC 8615 for machine-discoverable service metadata. Same convention as OpenID Connect, ACME, and security.txt. Unreachable nodes are recorded as "unknown" without failing the graph. Max depth (default 10) prevents cycles.

Kubernetes

The library supports all three Kubernetes probe types. The following manifest snippet can be added directly to your deployment configuration.

# Liveness: is the process alive?
livenessProbe:
  httpGet:
    path: /livez
    port: 8181
  initialDelaySeconds: 5
  periodSeconds: 10

# Readiness: can it serve traffic?
readinessProbe:
  httpGet:
    path: /readyz
    port: 8181
  initialDelaySeconds: 5
  periodSeconds: 10

# Startup: has it finished initializing?
startupProbe:
  httpGet:
    path: /healthz
    port: 8181
  failureThreshold: 30
  periodSeconds: 2

Startup Probe Flow

Checks with WithStartupImpact() must all pass before liveness and readiness are evaluated. Once startup completes, it's not re-evaluated. This prevents K8s from killing your pod while it's still loading data or warming caches.

mgr.AddCheck("cache-warm",
    command.NewChecker("cache", warmCache),
    health.WithStartupImpact(),
    health.WithCheckFrequency(health.CheckAtInterval, 2*time.Second, 0),
)

Individual Health Checks

Following the Kubernetes API health check convention, you can query individual checks by name and get verbose output.

# Individual check by name
curl http://localhost:8181/livez/postgres
# [+]postgres ok      (200)

curl http://localhost:8181/readyz/redis
# [-]redis failed: connection refused    (503)

# Verbose: list all checks with status
curl http://localhost:8181/livez?verbose
# [+]postgres ok
# [-]redis failed: connection refused

# Exclude a check from evaluation
curl "http://localhost:8181/livez?verbose&exclude=redis"
# [+]postgres ok      (200, redis excluded)
Same pattern for all probes: /livez/{name}, /readyz/{name}, /healthz/{name}. Unknown check names return 404.

Internals

This section describes the internal data flow and key design decisions for those who want to understand how the library works under the hood.

flowchart TD
    AC["AddCheck(name, checker, opts...)"] --> CM["checkers map"]
    CM --> DI["dispatchIntervalCheck
goroutine per check"]
    CM --> DO["dispatchOneTimeCheck
goroutine per check"]
    DI --> SC["safeCheck()
panic recovery"]
    DO --> SC
    SC --> AO["applyCheckOptions()
set Name, Group, Impact flags"]
    AO --> CF["checkFunnel
buffered channel"]
    CF --> PH["processHealthCheck
single goroutine"]
    PH -->|"1. nil guard
2. store result
3. first-run gate
4. update reporters"| EF["evaluateFitness"]
    EF -->|"startup gate
liveness AND
readiness AND
can't be ready if not live"| SS["setLive / setReady / setStartup
atomic swap + fan out"]
    SS --> HTTP["/livez /readyz /healthz
/.well-known/health"]
    SS --> GRPC["gRPC
Check() / Watch()"]
    SS --> OTEL["OTel
gauges, histograms, counters"]
    SS --> PROM["Prometheus
/metrics"]
    style AC fill:#6c5ce7,color:#fff,stroke:none
    style CF fill:#ff9800,color:#fff,stroke:none
    style PH fill:#6c5ce7,color:#fff,stroke:none
    style EF fill:#6c5ce7,color:#fff,stroke:none
    style HTTP fill:#4caf50,color:#fff,stroke:none
    style GRPC fill:#4caf50,color:#fff,stroke:none
    style OTEL fill:#4caf50,color:#fff,stroke:none
    style PROM fill:#4caf50,color:#fff,stroke:none

Key Design Decisions

E2E Testing

The library includes a full end-to-end test suite that deploys three microservices to a Kind (Kubernetes-in-Docker) cluster with real Postgres and Redis infrastructure.

flowchart LR
    GW["Gateway
:8181"]
    ORD["Orders
:8182"]
    PAY["Payments
:8183"]
    PG[("Postgres")]
    RD[("Redis")]
    GW -- HTTP --> ORD -- HTTP --> PAY
    ORD -- TCP --> PG
    ORD -- RESP --> RD
    PAY -- TCP --> PG
    style GW fill:#6c5ce7,color:#fff,stroke:none
    style ORD fill:#6c5ce7,color:#fff,stroke:none
    style PAY fill:#6c5ce7,color:#fff,stroke:none
    style PG fill:#ff9800,color:#fff,stroke:none
    style RD fill:#ff9800,color:#fff,stroke:none
| Test | What it proves |
|------|----------------|
| TestProbesHealthy | All 3 K8s probes return 200 on all 3 services |
| TestSelfDescribingJSON | JSON includes group, componentType, duration, lastCheck |
| TestDiscoveryManifest | All 3 services have correct manifest with check counts |
| TestDiscoveryGraph | Dependency chain gateway → orders → payments verified |
| TestRedisFailure | Redis down: readiness=503, liveness still 200 |
| TestCascadingFailure | Postgres down cascades through all 3 services + recovery |
| TestStartupSequencing | Pod stays not-ready until startup dep available |
| TestManifestStatus | Manifest status changes during outage, shows error details |

# Run the full E2E suite (requires Docker + Kind)
make e2e

# Or step by step for debugging
make e2e-cluster   # create Kind cluster
make e2e-build     # build Docker images
make e2e-deploy    # deploy to K8s
make e2e-test      # run tests
make e2e-teardown  # delete cluster

Installation

# Core (zero dependencies)
go get github.com/schigh/health/v2

# gRPC reporter
go get github.com/schigh/health/v2/reporter/grpc

# OpenTelemetry reporter
go get github.com/schigh/health/v2/reporter/otel

# Prometheus reporter
go get github.com/schigh/health/v2/reporter/prometheus

Requires Go 1.22+.