health

A health check library for Go services

go get github.com/schigh/health/v2

When To Use This

This library is designed for Go services running in Kubernetes with multiple external dependencies. It is most valuable when you need readiness separate from liveness (your pod is alive but a dependency is down), when you have startup sequencing requirements, or when you run multiple services and want structured visibility into what depends on what.

If your service is stateless with no external dependencies, a simple handler returning 200 is sufficient. Where this library earns its place is in multi-service environments where "what's broken and why" matters more than "is the process alive."

Why This Library

Most Go health check libraries are either unmaintained, carry heavy dependency trees, or require you to write every checker from scratch. This library takes a different approach: zero external dependencies in the core, batteries included, and designed for Kubernetes from the ground up.

|                   | health/v2                          | heptiolabs      | alexliesenfeld | InVisionApp |
|-------------------|------------------------------------|-----------------|----------------|-------------|
| External deps     | 0                                  | 2               | 3              | 5           |
| K8s probes        | live, ready, startup               | live, ready     | live, ready    | live, ready |
| Degraded state    | yes                                | no              | no             | no          |
| Built-in checkers | HTTP, TCP, DNS, Redis, DB, command | HTTP, TCP, DNS  | none           | none        |
| Dependency graphs | yes                                | no              | no             | no          |
| OTel / Prometheus | both                               | Prometheus only | no             | no          |
| Maintained        | active                             | archived        | active         | archived    |

Core Concepts

The library is built around three interfaces: Manager, Checker, and Reporter. Understanding how they interact is essential to using the library effectively.


Manager

The Manager is the central orchestrator. It controls the lifecycle of every checker and reporter in the system. Calling mgr.Run(ctx) starts all reporters, dispatches all checkers on their configured schedules, and begins evaluating aggregate fitness.

The manager is responsible for:

  • Dispatching checks (one-time or at intervals, with optional delays)
  • Processing check results through a buffered channel (checkFunnel)
  • Evaluating aggregate fitness: liveness, readiness, and startup
  • Notifying all reporters when state changes
  • Graceful shutdown of all checkers and reporters

mgr := std.Manager{Logger: health.DefaultLogger()}
mgr.AddCheck("postgres", checker, opts...)
mgr.AddReporter("http", reporter)
errChan := mgr.Run(ctx)  // starts everything
mgr.Stop(ctx)             // graceful shutdown

Checker

A Checker performs an individual health check and returns a *CheckResult. That's the entire interface: one method.

type Checker interface {
    Check(context.Context) *CheckResult
}

The manager calls this method on the schedule you configure. A checker has no knowledge of scheduling, reporters, or fitness evaluation. Its sole responsibility is to answer one question: is this dependency healthy right now?

Checkers return a CheckResult with:

  • Status: Healthy, Degraded, or Unhealthy
  • Duration: how long the check took
  • Error: what went wrong (if anything)
  • Metadata: arbitrary key-value pairs for observability

The manager overrides certain fields after the checker returns (Name, impact flags, Group, ComponentType, DependsOn) from the registered options. This means a checker never needs to know how it's configured in the manager.


Reporter

A Reporter exposes health state to the outside world. Some are pull-based (HTTP server waits for requests), some are push-based (stdout prints on every update, OTel emits metrics).

type Reporter interface {
    Run(context.Context) error
    Stop(context.Context) error
    SetLiveness(context.Context, bool)
    SetReadiness(context.Context, bool)
    SetStartup(context.Context, bool)
    UpdateHealthChecks(context.Context, map[string]*CheckResult)
}

The manager calls SetLiveness, SetReadiness, and SetStartup whenever the aggregate state changes. It calls UpdateHealthChecks after every individual check completes, passing the latest result. Reporters are free to store, cache, or forward this data however they want.

A manager can have multiple reporters simultaneously (e.g., HTTP for K8s probes + Prometheus for metrics + stdout for local dev).

How They Fit Together

flowchart LR
    subgraph reg ["You Register"]
        AC1["AddCheck(...)"]
        AC2["AddCheck(...)"]
        AC3["AddCheck(...)"]
        AR1["AddReporter(...)"]
        AR2["AddReporter(...)"]
    end
    subgraph mgr ["Manager"]
        D["Dispatch on Schedule"]
        C["Collect Results"]
        E["Evaluate Fitness"]
        N["Notify on State Change"]
        D --> C --> E --> N
    end
    subgraph out ["Reporters Expose"]
        L["/livez"]
        R["/readyz"]
        H["/healthz"]
        W["/.well-known/health"]
        O["OTel Metrics"]
        P["Prometheus /metrics"]
        G["gRPC Health.Check()"]
    end
    AC1 --> D
    AC2 --> D
    AC3 --> D
    AR1 --> N
    AR2 --> N
    N --> L & R & H & W & O & P & G
    

CheckResult

The CheckResult is the data type that flows through the entire system. Every checker produces one, the manager enriches it, and every reporter consumes it.

type CheckResult struct {
    Name             string            // set by manager from registered name
    Status           Status            // Healthy, Degraded, or Unhealthy
    AffectsLiveness  bool              // set by manager from WithLivenessImpact()
    AffectsReadiness bool              // set by manager from WithReadinessImpact()
    AffectsStartup   bool              // set by manager from WithStartupImpact()
    Group            string            // set by manager from WithGroup()
    ComponentType    string            // set by manager from WithComponentType()
    DependsOn        []string          // set by manager from WithDependsOn()
    Error            error             // set by checker
    ErrorSince       time.Time         // set by checker
    Duration         time.Duration     // set by checker
    Metadata         map[string]string // set by checker
    Timestamp        time.Time         // set by checker
}

Fitness Evaluation

After every check result, the manager runs evaluateFitness. This is the logic that determines whether the service is live, ready, and started: startup-impacting checks must all pass before liveness and readiness are evaluated, a failing liveness-impacting check also fails readiness, and a service is never ready while it is not live.
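The exact evaluation is internal to the manager, but the rules it applies can be sketched as a pure function. The CheckResult fields below are pared-down stand-ins, not the library's real type:

```go
package main

// CheckResult is a simplified stand-in used only to make the rules concrete.
type CheckResult struct {
	Unhealthy        bool // StatusUnhealthy; Degraded does not fail probes
	AffectsLiveness  bool
	AffectsReadiness bool
	AffectsStartup   bool
}

// evaluateFitness applies the documented rules: startup gates everything
// (and once complete is never re-evaluated, modeled by startupDone),
// and a service can never be ready while it is not live.
func evaluateFitness(results []CheckResult, startupDone bool) (live, ready, started bool) {
	started = startupDone
	if !started {
		started = true
		for _, r := range results {
			if r.AffectsStartup && r.Unhealthy {
				started = false
			}
		}
	}
	if !started {
		return false, false, false // startup gate: nothing else is evaluated
	}
	live, ready = true, true
	for _, r := range results {
		if r.Unhealthy && r.AffectsLiveness {
			live = false
		}
		if r.Unhealthy && r.AffectsReadiness {
			ready = false
		}
	}
	if !live {
		ready = false // a liveness failure implies a readiness failure
	}
	return live, ready, started
}
```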

Concurrency Model

Each checker gets its own dispatch goroutine. Results flow through a buffered channel (checkFunnel, sized to the number of checkers) into a single processing goroutine that runs processHealthCheck and evaluateFitness sequentially. This means fitness evaluation is never concurrent with itself, eliminating race conditions in state transitions.

Checkers that panic are recovered by safeCheck() and reported as Unhealthy. Nil results are caught with a nil guard. The manager never crashes from a misbehaving checker.
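The funnel pattern is easy to see in miniature. The sketch below is illustrative, not the manager's actual code: many producer goroutines, one buffered channel, one consumer that processes results strictly in sequence:

```go
package main

import "sync"

// funnel runs one producer goroutine per checker, feeding a buffered
// channel drained by a single processing goroutine — so process() never
// races with itself, just like processHealthCheck/evaluateFitness.
func funnel(numCheckers int, produce func(i int) string, process func(string)) {
	ch := make(chan string, numCheckers) // buffered like checkFunnel
	done := make(chan struct{})
	go func() { // the single processing goroutine
		for r := range ch {
			process(r) // sequential: no concurrent fitness evaluation
		}
		close(done)
	}()
	var wg sync.WaitGroup
	for i := 0; i < numCheckers; i++ {
		wg.Add(1)
		go func(i int) { // one dispatch goroutine per checker
			defer wg.Done()
			ch <- produce(i)
		}(i)
	}
	wg.Wait()
	close(ch)
	<-done // wait for the consumer to drain everything
}
```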

Quick Start

This example registers an HTTP dependency check, a TCP check against Postgres, and an HTTP reporter with BasicAuth middleware.

package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"
    "time"

    "github.com/schigh/health/v2"
    "github.com/schigh/health/v2/manager/std"
    "github.com/schigh/health/v2/checker/http"
    "github.com/schigh/health/v2/checker/tcp"
    "github.com/schigh/health/v2/reporter/httpserver"
)

func main() {
    ctx, cancel := signal.NotifyContext(context.Background(),
        syscall.SIGINT, syscall.SIGTERM)
    defer cancel()

    mgr := std.Manager{Logger: health.DefaultLogger()}

    // HTTP dependency check
    if err := mgr.AddCheck("payments-api",
        http.NewChecker("payments", "https://payments.internal/health"),
        health.WithCheckFrequency(health.CheckAtInterval, 10*time.Second, 0),
        health.WithLivenessImpact(),
        health.WithReadinessImpact(),
        health.WithGroup("external"),
        health.WithComponentType("http"),
    ); err != nil {
        log.Fatal(err)
    }

    // Postgres via TCP
    if err := mgr.AddCheck("postgres",
        tcp.NewChecker("postgres", "localhost:5432"),
        health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
        health.WithLivenessImpact(),
        health.WithReadinessImpact(),
        health.WithStartupImpact(),
        health.WithGroup("database"),
        health.WithComponentType("datastore"),
    ); err != nil {
        log.Fatal(err)
    }

    // HTTP reporter with BasicAuth
    if err := mgr.AddReporter("http", httpserver.New(
        httpserver.WithPort(8181),
        httpserver.WithServiceName("orders-api"),
        httpserver.WithMiddleware(httpserver.BasicAuth("admin", "secret")),
    )); err != nil {
        log.Fatal(err)
    }

    errChan := mgr.Run(ctx)
    select {
    case err := <-errChan:
        log.Fatalf("manager error: %v", err)
    case <-ctx.Done():
        if err := mgr.Stop(ctx); err != nil {
            log.Printf("stop error: %v", err)
        }
    }
}

Built-in Checkers

All built-in checkers use only the Go standard library and have no external dependencies. Each returns a *health.CheckResult populated with Duration, Timestamp, and Metadata.

checker/http

Checks that an HTTP endpoint returns the expected status code.

c := http.NewChecker("api", "https://api.example.com/health",
    http.WithTimeout(3*time.Second),
    http.WithExpectedStatus(200),
    http.WithMethod("HEAD"),
)
Options: WithTimeout, WithExpectedStatus, WithMethod, WithClient

checker/tcp

Checks that a TCP port is accepting connections.

c := tcp.NewChecker("postgres", "localhost:5432",
    tcp.WithTimeout(2*time.Second),
)
Options: WithTimeout

checker/dns

Checks that a hostname resolves to at least one address.

c := dns.NewChecker("coredns", "kubernetes.default.svc",
    dns.WithTimeout(2*time.Second),
)
Options: WithTimeout, WithResolver

checker/redis

PING via raw RESP protocol. Zero dependency on go-redis.

c := redis.NewChecker("cache", "localhost:6379",
    redis.WithTimeout(time.Second),
    redis.WithPassword("secret"),
)
Options: WithTimeout, WithPassword
Supports standalone Redis with optional legacy AUTH. Redis Cluster and ACL-only not supported.
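The zero-dependency claim works because RESP is simple enough to speak by hand. A sketch of the wire format such a checker would send over a plain TCP connection (illustrative, not the checker's actual source):

```go
package main

import (
	"fmt"
	"strings"
)

// respCommand encodes a Redis command as a RESP array of bulk strings —
// the format a raw-protocol checker writes to the socket.
func respCommand(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

// pongOK reports whether a raw reply line is the simple-string +PONG.
func pongOK(reply string) bool {
	return strings.HasPrefix(reply, "+PONG")
}
```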

checker/db

Pings a SQL database via the CtxPinger interface (*sql.DB satisfies this).

c := db.NewChecker("postgres", sqlDB,
    db.WithTimeout(3*time.Second),
)
Options: WithTimeout

checker/command

Runs any func(context.Context) error, covering dependencies that have no dedicated checker. Panics are recovered.

c := command.NewChecker("s3", func(ctx context.Context) error {
    _, err := s3Client.HeadBucket(ctx,
        &s3.HeadBucketInput{Bucket: &bucket})
    return err
})

Custom Checkers

Implement the Checker interface or use CheckerFunc for one-offs:

type Checker interface {
    Check(context.Context) *CheckResult
}

// Functional shortcut
health.CheckerFunc(func(ctx context.Context) *health.CheckResult {
    return &health.CheckResult{
        Name:   "custom",
        Status: health.StatusHealthy,
    }
})

Check Options

Every check is configured using the functional options pattern. Each option is self-documenting at the call site.

| Option | What it does |
|--------|--------------|
| WithCheckFrequency(freq, interval, delay) | Set check schedule: CheckOnce, CheckAtInterval, CheckAfter |
| WithLivenessImpact() | Failing check kills liveness (and readiness) |
| WithReadinessImpact() | Failing check kills readiness only |
| WithStartupImpact() | Check must pass before liveness/readiness are evaluated |
| WithGroup("database") | Logical group for filtering and display |
| WithComponentType("datastore") | Type hint for observability tools |
| WithDependsOn("http://svc:8181") | Declare dependency for graph discovery |

Health Status

StatusHealthy

Check is passing. All good.

StatusDegraded

Check is passing with warnings. Does not fail liveness or readiness probes. Reported to observers.

StatusUnhealthy

Check is failing. Affects liveness/readiness based on impact options.

Reporters

Reporters receive health state from the manager and expose it to external observers. The core module includes three built-in reporters with no external dependencies. Three additional reporters are available as separate Go modules.

HTTP Server (core)

Runs an HTTP server with liveness, readiness, startup, and discovery endpoints.

reporter := httpserver.New(
    httpserver.WithPort(8181),
    httpserver.WithServiceName("my-api"),
)
// Endpoints:
// /livez     → 200 or 503
// /readyz    → 200 or 503
// /healthz   → 200 or 503
// /.well-known/health → manifest JSON

gRPC (separate module)

Standard grpc.health.v1.Health protocol. Check + Watch.

// go get github.com/schigh/health/v2/reporter/grpc
reporter := grpc.NewReporter(grpc.Config{
    Addr: "0.0.0.0:8182",
})

OpenTelemetry (separate module)

Emits metrics via OTel API: check status, duration, executions, liveness/readiness/startup.

// go get github.com/schigh/health/v2/reporter/otel
reporter, _ := otel.NewReporter(otel.Config{
    MeterProvider: provider,
})

Prometheus (separate module)

Exposes metrics for Prometheus scraping. Configurable namespace.

// go get github.com/schigh/health/v2/reporter/prometheus
reporter := prometheus.NewReporter(prometheus.Config{
    Namespace: "myapp",
})
http.Handle("/metrics", reporter.Handler())

stdout (core)

ASCII table output. Great for local development.

mgr.AddReporter("stdout", &stdout.Reporter{})

test (core)

Instrumented reporter for unit tests. Tracks state changes, toggles, update counts.

rpt := &test.Reporter{}
mgr.AddReporter("test", rpt)
// ... run checks ...
report := rpt.Report()
fmt.Println(report.NumLivenessStateChanges)

Caching

Any checker can be wrapped with TTL-based caching to reduce load on expensive dependencies. During a cache refresh, stale results are served to concurrent callers, preventing thundering herd behavior.

// Cache Redis check results for 30 seconds
cached := health.WithCache(
    redis.NewChecker("redis", "localhost:6379"),
    30*time.Second,
)
mgr.AddCheck("redis", cached,
    health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
    health.WithReadinessImpact(),
)
How it works: First call executes synchronously. Subsequent calls return the cached result if within TTL. When TTL expires, one goroutine refreshes while others get the stale result. Double-checked locking with RWMutex.

Middleware

The HTTP reporter supports a middleware chain. The first middleware in the list is the first to see the request.

reporter := httpserver.New(
    httpserver.WithPort(8181),
    httpserver.WithMiddleware(
        httpserver.BasicAuth("admin", "secret"),
        myLoggingMiddleware,
    ),
)

// Custom middleware
func myLoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        log.Printf("health check: %s", r.URL.Path)
        next.ServeHTTP(w, r)
    })
}
Built-in: BasicAuth(user, pass) uses constant-time comparison. Recover middleware is always outermost (catches panics).

Service Discovery

Every service using this library can expose a /.well-known/health manifest describing its health checks and their dependencies. Other services can fetch these manifests and build transitive dependency graphs without any additional infrastructure.

The Manifest

// GET /.well-known/health
{
  "service": "orders-api",
  "version": "1.2.3",
  "status": "pass",
  "checks": [
    {
      "name": "postgres",
      "status": "healthy",
      "group": "database",
      "componentType": "datastore",
      "duration": "1.2ms"
    },
    {
      "name": "payments",
      "status": "healthy",
      "dependsOn": ["http://payments:8181"]
    }
  ],
  "timestamp": "2026-03-28T20:00:00Z"
}

Walking the Graph

import "github.com/schigh/health/v2/discovery"

// Fetch one service's manifest
manifest, _ := discovery.FetchManifest(ctx, "http://orders:8181")

// Walk the full dependency graph (BFS)
graph, _ := discovery.DiscoverGraph(ctx, "http://api-gateway:8181",
    discovery.WithMaxDepth(5),
    discovery.WithTimeout(3*time.Second),
)

// Render as Mermaid
fmt.Println(graph.Mermaid())

// Render as Graphviz DOT
fmt.Println(graph.DOT())

How It Works

flowchart LR
    A["orders-api
/.well-known/health"]
    B["payments
/.well-known/health"]
    C["stripe-gw
/.well-known/health"]
    A -- DependsOn --> B -- DependsOn --> C
    DG["DiscoverGraph()"] -.->|"1. fetch manifest"| A
    DG -.->|"2. follow DependsOn"| B
    DG -.->|"3. follow DependsOn"| C
    style DG fill:#6c5ce7,color:#fff,stroke:#6c5ce7
    style A fill:#4caf50,color:#fff,stroke:#4caf50
    style B fill:#4caf50,color:#fff,stroke:#4caf50
    style C fill:#4caf50,color:#fff,stroke:#4caf50
Convention: /.well-known/ follows RFC 8615 for machine-discoverable service metadata. Same convention as OpenID Connect, ACME, and security.txt. Unreachable nodes are recorded as "unknown" without failing the graph. Max depth (default 10) prevents cycles.

Kubernetes

The library supports all three Kubernetes probe types. The following manifest snippet can be added directly to your deployment configuration.

# Liveness: is the process alive?
livenessProbe:
  httpGet:
    path: /livez
    port: 8181
  initialDelaySeconds: 5
  periodSeconds: 10

# Readiness: can it serve traffic?
readinessProbe:
  httpGet:
    path: /readyz
    port: 8181
  initialDelaySeconds: 5
  periodSeconds: 10

# Startup: has it finished initializing?
startupProbe:
  httpGet:
    path: /healthz
    port: 8181
  failureThreshold: 30
  periodSeconds: 2

Startup Probe Flow

Checks with WithStartupImpact() must all pass before liveness and readiness are evaluated. Once startup completes, it's not re-evaluated. This prevents K8s from killing your pod while it's still loading data or warming caches.

mgr.AddCheck("cache-warm",
    command.NewChecker("cache", warmCache),
    health.WithStartupImpact(),
    health.WithCheckFrequency(health.CheckAtInterval, 2*time.Second, 0),
)

Individual Health Checks

Following the Kubernetes API health check convention, you can query individual checks by name and get verbose output.

# Individual check by name
curl http://localhost:8181/livez/postgres
# [+]postgres ok      (200)

curl http://localhost:8181/readyz/redis
# [-]redis failed: connection refused    (503)

# Verbose: list all checks with status
curl http://localhost:8181/livez?verbose
# [+]postgres ok
# [-]redis failed: connection refused

# Exclude a check from evaluation
curl "http://localhost:8181/livez?verbose&exclude=redis"
# [+]postgres ok      (200, redis excluded)
Same pattern for all probes: /livez/{name}, /readyz/{name}, /healthz/{name}. Unknown check names return 404.

Internals

This section describes the internal data flow and key design decisions for those who want to understand how the library works under the hood.

flowchart TD
    AC["AddCheck(name, checker, opts...)"] --> CM["checkers map"]
    CM --> DI["dispatchIntervalCheck
goroutine per check"]
    CM --> DO["dispatchOneTimeCheck
goroutine per check"]
    DI --> SC["safeCheck()
panic recovery"]
    DO --> SC
    SC --> AO["applyCheckOptions()
set Name, Group, Impact flags"]
    AO --> CF["checkFunnel
buffered channel"]
    CF --> PH["processHealthCheck
single goroutine"]
    PH -->|"1. nil guard
2. store result
3. first-run gate
4. update reporters"| EF["evaluateFitness"]
    EF -->|"startup gate
liveness AND
readiness AND
can't be ready if not live"| SS["setLive / setReady / setStartup
atomic swap + fan out"]
    SS --> HTTP["/livez /readyz /healthz
/.well-known/health"]
    SS --> GRPC["gRPC
Check() / Watch()"]
    SS --> OTEL["OTel
gauges, histograms, counters"]
    SS --> PROM["Prometheus
/metrics"]
    style AC fill:#6c5ce7,color:#fff,stroke:none
    style CF fill:#ff9800,color:#fff,stroke:none
    style PH fill:#6c5ce7,color:#fff,stroke:none
    style EF fill:#6c5ce7,color:#fff,stroke:none
    style HTTP fill:#4caf50,color:#fff,stroke:none
    style GRPC fill:#4caf50,color:#fff,stroke:none
    style OTEL fill:#4caf50,color:#fff,stroke:none
    style PROM fill:#4caf50,color:#fff,stroke:none

Key Design Decisions

E2E Testing

The library includes a full end-to-end test suite that deploys three microservices to a Kind (Kubernetes-in-Docker) cluster with real Postgres and Redis infrastructure.

flowchart LR
    GW["Gateway
:8181"]
    ORD["Orders
:8182"]
    PAY["Payments
:8183"]
    PG[("Postgres")]
    RD[("Redis")]
    GW -- HTTP --> ORD -- HTTP --> PAY
    ORD -- TCP --> PG
    ORD -- RESP --> RD
    PAY -- TCP --> PG
    style GW fill:#6c5ce7,color:#fff,stroke:none
    style ORD fill:#6c5ce7,color:#fff,stroke:none
    style PAY fill:#6c5ce7,color:#fff,stroke:none
    style PG fill:#ff9800,color:#fff,stroke:none
    style RD fill:#ff9800,color:#fff,stroke:none
| Test | What it proves |
|------|----------------|
| TestProbesHealthy | All 3 K8s probes return 200 on all 3 services |
| TestSelfDescribingJSON | JSON includes group, componentType, duration, lastCheck |
| TestDiscoveryManifest | All 3 services have correct manifest with check counts |
| TestDiscoveryGraph | Dependency chain gateway → orders → payments verified |
| TestRedisFailure | Redis down: readiness=503, liveness still 200 |
| TestCascadingFailure | Postgres down cascades through all 3 services + recovery |
| TestStartupSequencing | Pod stays not-ready until startup dep available |
| TestManifestStatus | Manifest status changes during outage, shows error details |

# Run the full E2E suite (requires Docker + Kind)
make e2e

# Or step by step for debugging
make e2e-cluster   # create Kind cluster
make e2e-build     # build Docker images
make e2e-deploy    # deploy to K8s
make e2e-test      # run tests
make e2e-teardown  # delete cluster

Installation

# Core (zero dependencies)
go get github.com/schigh/health/v2

# gRPC reporter
go get github.com/schigh/health/v2/reporter/grpc

# OpenTelemetry reporter
go get github.com/schigh/health/v2/reporter/otel

# Prometheus reporter
go get github.com/schigh/health/v2/reporter/prometheus

Requires Go 1.22+.