A health check library for Go services
go get github.com/schigh/health/v2
This library is designed for Go services running in Kubernetes with multiple external dependencies. It is most valuable when you need readiness separate from liveness (your pod is alive but a dependency is down), when you have startup sequencing requirements, or when you run multiple services and want structured visibility into what depends on what.
If your service is stateless with no external dependencies, a simple handler returning 200 is sufficient. Where this library earns its place is in multi-service environments where "what's broken and why" matters more than "is the process alive."
Most Go health check libraries are either unmaintained, carry heavy dependency trees, or require you to write every checker from scratch. This library takes a different approach: zero external dependencies in the core, batteries included, and designed for Kubernetes from the ground up.
| | health/v2 | heptiolabs | alexliesenfeld | InVisionApp |
|---|---|---|---|---|
| External deps | 0 | 2 | 3 | 5 |
| K8s probes | live, ready, startup | live, ready | live, ready | live, ready |
| Degraded state | yes | no | no | no |
| Built-in checkers | HTTP, TCP, DNS, Redis, DB, command | HTTP, TCP, DNS | none | none |
| Dependency graphs | yes | no | no | no |
| OTel / Prometheus | both | Prometheus only | no | no |
| Maintained | active | archived | active | archived |
The library is built around three interfaces: Manager, Checker, and Reporter. Understanding how they interact is essential to using the library effectively.
The Manager is the central orchestrator. It controls the lifecycle of every checker and reporter in the system. Calling mgr.Run(ctx) starts all reporters, dispatches all checkers on their configured schedules, and begins evaluating aggregate fitness.
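The manager is responsible for dispatching each check on its schedule, collecting results (through the checkFunnel channel), evaluating aggregate fitness, and notifying reporters when state changes.
mgr := std.Manager{Logger: health.DefaultLogger()}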
mgr.AddCheck("postgres", checker, opts...)
mgr.AddReporter("http", reporter)
errChan := mgr.Run(ctx) // starts everything
mgr.Stop(ctx) // graceful shutdown
A Checker performs an individual health check and returns a *CheckResult. That's the entire interface: one method.
type Checker interface {
Check(context.Context) *CheckResult
}
The manager calls this method on the schedule you configure. A checker has no knowledge of scheduling, reporters, or fitness evaluation. Its sole responsibility is to answer one question: is this dependency healthy right now?
Checkers return a CheckResult with the fields they own: Status, Error, ErrorSince, Duration, Metadata, and Timestamp.
The manager overrides certain fields after the checker returns (Name, impact flags, Group, ComponentType, DependsOn) from the registered options. This means a checker never needs to know how it's configured in the manager.
A Reporter exposes health state to the outside world. Some are pull-based (HTTP server waits for requests), some are push-based (stdout prints on every update, OTel emits metrics).
type Reporter interface {
Run(context.Context) error
Stop(context.Context) error
SetLiveness(context.Context, bool)
SetReadiness(context.Context, bool)
SetStartup(context.Context, bool)
UpdateHealthChecks(context.Context, map[string]*CheckResult)
}
The manager calls SetLiveness, SetReadiness, and SetStartup whenever the aggregate state changes. It calls UpdateHealthChecks after every individual check completes, passing the latest result. Reporters are free to store, cache, or forward this data however they want.
A manager can have multiple reporters simultaneously (e.g., HTTP for K8s probes + Prometheus for metrics + stdout for local dev).
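To make the Reporter contract concrete, here is a minimal sketch of an in-memory reporter. The `CheckResult` and `Reporter` types are trimmed local stand-ins so the example compiles on its own, and `memoryReporter`, `Summary`, and `demo` are hypothetical names, not part of the library.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// CheckResult is a trimmed local stand-in for the library's type.
type CheckResult struct {
	Name   string
	Status string
}

// Reporter mirrors the six-method interface shown above.
type Reporter interface {
	Run(context.Context) error
	Stop(context.Context) error
	SetLiveness(context.Context, bool)
	SetReadiness(context.Context, bool)
	SetStartup(context.Context, bool)
	UpdateHealthChecks(context.Context, map[string]*CheckResult)
}

// memoryReporter (hypothetical) stores the latest state behind a mutex,
// because the manager and any readers may run on different goroutines.
type memoryReporter struct {
	mu                   sync.RWMutex
	live, ready, started bool
	checks               map[string]*CheckResult
}

func (r *memoryReporter) Run(context.Context) error  { return nil }
func (r *memoryReporter) Stop(context.Context) error { return nil }

func (r *memoryReporter) SetLiveness(_ context.Context, v bool) {
	r.mu.Lock()
	r.live = v
	r.mu.Unlock()
}

func (r *memoryReporter) SetReadiness(_ context.Context, v bool) {
	r.mu.Lock()
	r.ready = v
	r.mu.Unlock()
}

func (r *memoryReporter) SetStartup(_ context.Context, v bool) {
	r.mu.Lock()
	r.started = v
	r.mu.Unlock()
}

// UpdateHealthChecks merges the latest results into the stored set.
func (r *memoryReporter) UpdateHealthChecks(_ context.Context, m map[string]*CheckResult) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.checks == nil {
		r.checks = make(map[string]*CheckResult)
	}
	for name, res := range m {
		r.checks[name] = res
	}
}

// Summary renders the stored state as one line.
func (r *memoryReporter) Summary() string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return fmt.Sprintf("live=%t ready=%t started=%t checks=%d",
		r.live, r.ready, r.started, len(r.checks))
}

// Compile-time proof that memoryReporter satisfies Reporter.
var _ Reporter = (*memoryReporter)(nil)

// demo feeds the reporter the calls a manager would make.
func demo() string {
	ctx := context.Background()
	r := &memoryReporter{}
	r.SetLiveness(ctx, true)
	r.UpdateHealthChecks(ctx, map[string]*CheckResult{
		"postgres": {Name: "postgres", Status: "healthy"},
	})
	return r.Summary()
}

func main() {
	fmt.Println(demo())
}
```

A pull-based reporter would read this stored state when a request arrives; a push-based one would forward it from inside the setter methods.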
flowchart LR
subgraph reg ["You Register"]
AC1["AddCheck(...)"]
AC2["AddCheck(...)"]
AC3["AddCheck(...)"]
AR1["AddReporter(...)"]
AR2["AddReporter(...)"]
end
subgraph mgr ["Manager"]
D["Dispatch on Schedule"]
C["Collect Results"]
E["Evaluate Fitness"]
N["Notify on State Change"]
D --> C --> E --> N
end
subgraph out ["Reporters Expose"]
L["/livez"]
R["/readyz"]
H["/healthz"]
W["/.well-known/health"]
O["OTel Metrics"]
P["Prometheus /metrics"]
G["gRPC Health.Check()"]
end
AC1 --> D
AC2 --> D
AC3 --> D
AR1 --> N
AR2 --> N
N --> L & R & H & W & O & P & G
The CheckResult is the data type that flows through the entire system. Every checker produces one, the manager enriches it, and every reporter consumes it.
type CheckResult struct {
Name string // set by manager from registered name
Status Status // Healthy, Degraded, or Unhealthy
AffectsLiveness bool // set by manager from WithLivenessImpact()
AffectsReadiness bool // set by manager from WithReadinessImpact()
AffectsStartup bool // set by manager from WithStartupImpact()
Group string // set by manager from WithGroup()
ComponentType string // set by manager from WithComponentType()
DependsOn []string // set by manager from WithDependsOn()
Error error // set by checker
ErrorSince time.Time // set by checker
Duration time.Duration // set by checker
Metadata map[string]string // set by checker
Timestamp time.Time // set by checker
}
After every check result, the manager runs evaluateFitness. This is the logic that determines whether the service is live, ready, and started:
If any check registered with WithLivenessImpact() is Unhealthy, liveness is false (and readiness is also false, because you can't be ready if you're not live).
If any check registered with WithReadinessImpact() is Unhealthy, readiness is false (but liveness is unaffected).
Each checker gets its own dispatch goroutine. Results flow through a buffered channel (checkFunnel, sized to the number of checkers) into a single processing goroutine that runs processHealthCheck and evaluateFitness sequentially. Because fitness evaluation never runs concurrently with itself, state transitions are free of race conditions.
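The liveness and readiness rules above can be sketched as a pure function. This is an illustration of the stated rules only, not the library's evaluateFitness; the `result` type and `evaluate` function are hypothetical, and the startup gate is left out.

```go
package main

import "fmt"

// Status is a local stand-in for the library's three-state status.
type Status int

const (
	Healthy Status = iota
	Degraded
	Unhealthy
)

// result carries only the fields the fitness rules need.
type result struct {
	Status           Status
	AffectsLiveness  bool
	AffectsReadiness bool
}

// evaluate derives aggregate liveness and readiness from the latest results.
func evaluate(results []result) (live, ready bool) {
	live, ready = true, true
	for _, r := range results {
		if r.Status != Unhealthy {
			continue // Healthy and Degraded never fail a probe
		}
		if r.AffectsLiveness {
			live = false
		}
		if r.AffectsReadiness {
			ready = false
		}
	}
	if !live {
		ready = false // can't be ready if not live
	}
	return live, ready
}

func main() {
	live, ready := evaluate([]result{
		{Status: Unhealthy, AffectsReadiness: true},
		{Status: Degraded, AffectsLiveness: true},
	})
	fmt.Println(live, ready) // true false
}
```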
Checkers that panic are recovered by safeCheck() and reported as Unhealthy. Nil results are caught with a nil guard. The manager never crashes from a misbehaving checker.
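The funnel pattern and the panic recovery described above might look roughly like the following. It is a one-shot sketch under simplifying assumptions (no scheduling, no reporters); `result`, `checkFn`, and `runOnce` are hypothetical names, and skipping nil results is one possible reading of "nil guard", not necessarily what the library does.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a trimmed stand-in for CheckResult.
type result struct {
	Name      string
	Unhealthy bool
	Err       error
}

type checkFn func() *result

// safeCheck runs fn and converts a panic into an unhealthy result.
func safeCheck(name string, fn checkFn) (res *result) {
	defer func() {
		if p := recover(); p != nil {
			res = &result{Name: name, Unhealthy: true, Err: fmt.Errorf("panic: %v", p)}
		}
	}()
	res = fn()
	if res != nil {
		res.Name = name // the manager, not the checker, owns the name
	}
	return res
}

// runOnce gives every checker its own goroutine, funnels the results
// through a buffered channel, and processes them in a single loop.
func runOnce(checks map[string]checkFn) map[string]bool {
	funnel := make(chan *result, len(checks))
	var wg sync.WaitGroup
	for name, fn := range checks {
		wg.Add(1)
		go func(name string, fn checkFn) {
			defer wg.Done()
			funnel <- safeCheck(name, fn)
		}(name, fn)
	}
	go func() {
		wg.Wait()
		close(funnel)
	}()

	// Single consumer: state is only ever touched from this loop.
	unhealthy := make(map[string]bool)
	for res := range funnel {
		if res == nil {
			continue // nil guard (assumed behavior: skip the result)
		}
		unhealthy[res.Name] = res.Unhealthy
	}
	return unhealthy
}

func main() {
	got := runOnce(map[string]checkFn{
		"ok":   func() *result { return &result{} },
		"boom": func() *result { panic("bad checker") },
		"nil":  func() *result { return nil },
	})
	fmt.Println(len(got), got["ok"], got["boom"]) // 2 false true
}
```

Because only the consumer loop writes to the map, no lock is needed around the aggregated state.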
This example registers an HTTP dependency check, a TCP check against Postgres, and an HTTP reporter with BasicAuth middleware.
package main
import (
"context"
"log"
"os/signal"
"syscall"
"time"
"github.com/schigh/health/v2"
"github.com/schigh/health/v2/manager/std"
"github.com/schigh/health/v2/checker/http"
"github.com/schigh/health/v2/checker/tcp"
"github.com/schigh/health/v2/reporter/httpserver"
)
func main() {
ctx, cancel := signal.NotifyContext(context.Background(),
syscall.SIGINT, syscall.SIGTERM)
defer cancel()
mgr := std.Manager{Logger: health.DefaultLogger()}
// HTTP dependency check
if err := mgr.AddCheck("payments-api",
http.NewChecker("payments", "https://payments.internal/health"),
health.WithCheckFrequency(health.CheckAtInterval, 10*time.Second, 0),
health.WithLivenessImpact(),
health.WithReadinessImpact(),
health.WithGroup("external"),
health.WithComponentType("http"),
); err != nil {
log.Fatal(err)
}
// Postgres via TCP
if err := mgr.AddCheck("postgres",
tcp.NewChecker("postgres", "localhost:5432"),
health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
health.WithLivenessImpact(),
health.WithReadinessImpact(),
health.WithStartupImpact(),
health.WithGroup("database"),
health.WithComponentType("datastore"),
); err != nil {
log.Fatal(err)
}
// HTTP reporter with BasicAuth
if err := mgr.AddReporter("http", httpserver.New(
httpserver.WithPort(8181),
httpserver.WithServiceName("orders-api"),
httpserver.WithMiddleware(httpserver.BasicAuth("admin", "secret")),
)); err != nil {
log.Fatal(err)
}
errChan := mgr.Run(ctx)
select {
case err := <-errChan:
log.Fatalf("manager error: %v", err)
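case <-ctx.Done():
// ctx is already cancelled here, so give Stop a fresh context with a deadline.
stopCtx, stopCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer stopCancel()
if err := mgr.Stop(stopCtx); err != nil {
log.Printf("stop error: %v", err)
}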
}
}
All built-in checkers use only the Go standard library and have no external dependencies. Each returns a *health.CheckResult populated with Duration, Timestamp, and Metadata.
Checks that an HTTP endpoint returns the expected status code.
c := http.NewChecker("api", "https://api.example.com/health",
http.WithTimeout(3*time.Second),
http.WithExpectedStatus(200),
http.WithMethod("HEAD"),
)
Options: WithTimeout, WithExpectedStatus, WithMethod, WithClient
Checks that a TCP port is accepting connections.
c := tcp.NewChecker("postgres", "localhost:5432",
tcp.WithTimeout(2*time.Second),
)
Options: WithTimeout
Checks that a hostname resolves to at least one address.
c := dns.NewChecker("coredns", "kubernetes.default.svc",
dns.WithTimeout(2*time.Second),
)
Options: WithTimeout, WithResolver
PING via raw RESP protocol. Zero dependency on go-redis.
c := redis.NewChecker("cache", "localhost:6379",
redis.WithTimeout(time.Second),
redis.WithPassword("secret"),
)
Options: WithTimeout, WithPassword
Pings a SQL database via the CtxPinger interface (*sql.DB satisfies this).
c := db.NewChecker("postgres", sqlDB,
db.WithTimeout(3*time.Second),
)
Options: WithTimeout
Run any func(ctx) error. Covers every dependency without a dedicated checker. Panics are recovered.
c := command.NewChecker("s3", func(ctx context.Context) error {
_, err := s3Client.HeadBucket(ctx,
&s3.HeadBucketInput{Bucket: &bucket})
return err
})
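As a rough picture of what wrapping a func(ctx) error involves, the sketch below times the call, maps the error to a status, and recovers a panic. The `CheckResult` type is a trimmed local stand-in and `runCommand` is a hypothetical helper; the library's command checker may be structured differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// CheckResult is a trimmed local stand-in for the library's type.
type CheckResult struct {
	Healthy   bool
	Error     error
	Duration  time.Duration
	Timestamp time.Time
}

// runCommand (hypothetical) runs fn once and turns its outcome into a result.
func runCommand(ctx context.Context, fn func(context.Context) error) (res *CheckResult) {
	start := time.Now()
	res = &CheckResult{Timestamp: start}
	defer func() {
		// A panic is treated like any other failure.
		if p := recover(); p != nil {
			res.Error = fmt.Errorf("check panicked: %v", p)
		}
		res.Duration = time.Since(start)
		res.Healthy = res.Error == nil
	}()
	res.Error = fn(ctx)
	return res
}

// demo runs a passing, a failing, and a panicking function.
func demo() []bool {
	ctx := context.Background()
	fns := []func(context.Context) error{
		func(context.Context) error { return nil },
		func(context.Context) error { return errors.New("bucket not found") },
		func(context.Context) error { panic("nil client") },
	}
	out := make([]bool, 0, len(fns))
	for _, fn := range fns {
		out = append(out, runCommand(ctx, fn).Healthy)
	}
	return out
}

func main() {
	fmt.Println(demo()) // [true false false]
}
```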
Implement the Checker interface or use CheckerFunc for one-offs:
type Checker interface {
Check(context.Context) *CheckResult
}
// Functional shortcut
health.CheckerFunc(func(ctx context.Context) *health.CheckResult {
return &health.CheckResult{
Name: "custom",
Status: health.StatusHealthy,
}
})
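The functional shortcut works because a function type can carry a method. The sketch below shows how such an adapter is typically defined, in the same style as net/http's HandlerFunc. The types are local stand-ins, and the library's own CheckerFunc definition is assumed, not quoted.

```go
package main

import (
	"context"
	"fmt"
)

// CheckResult is a trimmed local stand-in for the library's type.
type CheckResult struct {
	Name   string
	Status string
}

// Checker mirrors the one-method interface shown above.
type Checker interface {
	Check(context.Context) *CheckResult
}

// CheckerFunc adapts a plain function to the Checker interface.
type CheckerFunc func(context.Context) *CheckResult

// Check calls the underlying function.
func (f CheckerFunc) Check(ctx context.Context) *CheckResult { return f(ctx) }

// alwaysHealthy is a hypothetical one-off check.
func alwaysHealthy(context.Context) *CheckResult {
	return &CheckResult{Name: "custom", Status: "healthy"}
}

// statusOf accepts anything that satisfies Checker.
func statusOf(c Checker) string {
	return c.Check(context.Background()).Status
}

func main() {
	fmt.Println(statusOf(CheckerFunc(alwaysHealthy))) // healthy
}
```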
Every check is configured using the functional options pattern. Each option is self-documenting at the call site.
Healthy: Check is passing. All good.
Degraded: Check is passing with warnings. Does not fail liveness or readiness probes. Reported to observers.
Unhealthy: Check is failing. Affects liveness/readiness based on impact options.
Reporters receive health state from the manager and expose it to external observers. The core module includes three built-in reporters with no external dependencies. Three additional reporters are available as separate Go modules.
Runs an HTTP server with liveness, readiness, startup, and discovery endpoints.
reporter := httpserver.New(
httpserver.WithPort(8181),
httpserver.WithServiceName("my-api"),
)
// Endpoints:
// /livez → 200 or 503
// /readyz → 200 or 503
// /healthz → 200 or 503
// /.well-known/health → manifest JSON
Standard grpc.health.v1.Health protocol. Check + Watch.
// go get github.com/schigh/health/v2/reporter/grpc
reporter := grpc.NewReporter(grpc.Config{
Addr: "0.0.0.0:8182",
})
Emits metrics via OTel API: check status, duration, executions, liveness/readiness/startup.
// go get github.com/schigh/health/v2/reporter/otel
reporter, _ := otel.NewReporter(otel.Config{
MeterProvider: provider,
})
Exposes metrics for Prometheus scraping. Configurable namespace.
// go get github.com/schigh/health/v2/reporter/prometheus
reporter := prometheus.NewReporter(prometheus.Config{
Namespace: "myapp",
})
http.Handle("/metrics", reporter.Handler())
ASCII table output. Great for local development.
mgr.AddReporter("stdout", &stdout.Reporter{})
Instrumented reporter for unit tests. Tracks state changes, toggles, update counts.
rpt := &test.Reporter{}
mgr.AddReporter("test", rpt)
// ... run checks ...
report := rpt.Report()
fmt.Println(report.NumLivenessStateChanges)
Any checker can be wrapped with TTL-based caching to reduce load on expensive dependencies. During a cache refresh, stale results are served to concurrent callers, preventing thundering herd behavior.
// Cache Redis check results for 30 seconds
cached := health.WithCache(
redis.NewChecker("redis", "localhost:6379"),
30*time.Second,
)
mgr.AddCheck("redis", cached,
health.WithCheckFrequency(health.CheckAtInterval, 5*time.Second, 0),
health.WithReadinessImpact(),
)
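One way to get "serve stale during refresh" behavior is sketched below. It is not the library's WithCache implementation; the `cached` type, its injectable clock, and the string result are all assumptions made to keep the example small and deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cached (hypothetical) wraps an expensive fetch with a TTL.
type cached struct {
	mu         sync.Mutex
	ttl        time.Duration
	now        func() time.Time // injectable clock, for testing
	fetch      func() string    // the expensive check; assumed not to panic
	value      string
	fetchedAt  time.Time
	hasValue   bool
	refreshing bool
}

// Get returns a fresh value when one exists. If the value has expired and
// another caller is already refreshing, the stale value is returned instead
// of starting a second fetch.
func (c *cached) Get() string {
	c.mu.Lock()
	fresh := c.hasValue && c.now().Sub(c.fetchedAt) < c.ttl
	if fresh || (c.hasValue && c.refreshing) {
		v := c.value
		c.mu.Unlock()
		return v
	}
	// Cold start or expired with no refresh in flight: this caller fetches.
	// On a cold start, concurrent callers may each fetch once.
	c.refreshing = true
	c.mu.Unlock()

	v := c.fetch() // run outside the lock so readers are not blocked

	c.mu.Lock()
	c.value, c.fetchedAt, c.hasValue, c.refreshing = v, c.now(), true, false
	c.mu.Unlock()
	return v
}

// demo drives the cache with a fake clock.
func demo() []string {
	clock := time.Unix(0, 0)
	calls := 0
	c := &cached{
		ttl: 30 * time.Second,
		now: func() time.Time { return clock },
		fetch: func() string {
			calls++
			return fmt.Sprintf("result-%d", calls)
		},
	}
	out := []string{c.Get()} // miss
	clock = clock.Add(10 * time.Second)
	out = append(out, c.Get()) // within TTL
	clock = clock.Add(30 * time.Second)
	out = append(out, c.Get()) // expired
	return out
}

func main() {
	fmt.Println(demo()) // [result-1 result-1 result-2]
}
```

Note that in the registration above the check runs every 5 seconds against a 30-second TTL, so most scheduled runs would be served from the cache.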
The HTTP reporter supports a middleware chain. The first middleware in the list is the first to see the request.
reporter := httpserver.New(
httpserver.WithPort(8181),
httpserver.WithMiddleware(
httpserver.BasicAuth("admin", "secret"),
myLoggingMiddleware,
),
)
// Custom middleware
func myLoggingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
log.Printf("health check: %s", r.URL.Path)
next.ServeHTTP(w, r)
})
}
BasicAuth(user, pass) uses constant-time comparison. Recover middleware is always outermost (catches panics).
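The ordering rule and the constant-time comparison can be illustrated with the standard library alone. In this sketch `chain`, `basicAuth`, `tag`, and `probe` are hypothetical helpers, not the library's httpserver package.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that mws[0] is outermost and sees the request first.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// basicAuth rejects requests whose credentials do not match.
// subtle.ConstantTimeCompare avoids leaking how many leading bytes matched;
// it still returns early when the lengths differ.
func basicAuth(user, pass string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			u, p, ok := r.BasicAuth()
			userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
			passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
			if !ok || !userOK || !passOK {
				w.Header().Set("WWW-Authenticate", `Basic realm="health"`)
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// tag records the order in which middlewares see the request.
func tag(name string, seen *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*seen = append(*seen, name)
			next.ServeHTTP(w, r)
		})
	}
}

// probe sends one request through first -> basicAuth -> second -> handler.
func probe(withCreds bool) (int, []string) {
	var seen []string
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := chain(ok, tag("first", &seen), basicAuth("admin", "secret"), tag("second", &seen))
	req := httptest.NewRequest(http.MethodGet, "/livez", nil)
	if withCreds {
		req.SetBasicAuth("admin", "secret")
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, seen
}

func main() {
	fmt.Println(probe(true))  // 200 [first second]
	fmt.Println(probe(false)) // 401 [first]
}
```

Without credentials the request never reaches the second middleware, which is why authentication usually belongs early in the list.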
Every service using this library can expose a /.well-known/health manifest describing its health checks and their dependencies. Other services can fetch these manifests and build transitive dependency graphs without any additional infrastructure.
// GET /.well-known/health
{
"service": "orders-api",
"version": "1.2.3",
"status": "pass",
"checks": [
{
"name": "postgres",
"status": "healthy",
"group": "database",
"componentType": "datastore",
"duration": "1.2ms"
},
{
"name": "payments",
"status": "healthy",
"dependsOn": ["http://payments:8181"]
}
],
"timestamp": "2026-03-28T20:00:00Z"
}
import "github.com/schigh/health/v2/discovery"
// Fetch one service's manifest
manifest, _ := discovery.FetchManifest(ctx, "http://orders:8181")
// Walk the full dependency graph (BFS)
graph, _ := discovery.DiscoverGraph(ctx, "http://api-gateway:8181",
discovery.WithMaxDepth(5),
discovery.WithTimeout(3*time.Second),
)
// Render as Mermaid
fmt.Println(graph.Mermaid())
// Render as Graphviz DOT
fmt.Println(graph.DOT())
flowchart LR
A["orders-api<br/>/.well-known/health"]
B["payments<br/>/.well-known/health"]
C["stripe-gw<br/>/.well-known/health"]
A -- DependsOn --> B -- DependsOn --> C
DG["DiscoverGraph()"] -.->|"1. fetch manifest"| A
DG -.->|"2. follow DependsOn"| B
DG -.->|"3. follow DependsOn"| C
style DG fill:#6c5ce7,color:#fff,stroke:#6c5ce7
style A fill:#4caf50,color:#fff,stroke:#4caf50
style B fill:#4caf50,color:#fff,stroke:#4caf50
style C fill:#4caf50,color:#fff,stroke:#4caf50
/.well-known/ follows RFC 8615 for machine-discoverable service metadata, the same convention used by OpenID Connect, ACME, and security.txt. Unreachable nodes are recorded as "unknown" without failing the graph. Max depth (default 10) bounds the traversal, so a dependency cycle cannot make it run forever.
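The traversal can be pictured as a breadth-first walk with a depth limit. The sketch below runs over an in-memory map instead of HTTP; `fetchFn`, `node`, `discover`, and `demoFetch` are hypothetical, and the visited set is an addition of this sketch, not something the library is documented to use.

```go
package main

import "fmt"

// fetchFn stands in for fetching a manifest: it returns the DependsOn
// targets of a service and whether the service was reachable.
type fetchFn func(url string) (deps []string, ok bool)

// node is one discovered service.
type node struct {
	URL    string
	Status string // "ok" or "unknown"
	Depth  int
}

// discover walks the dependency graph breadth-first from root.
func discover(root string, maxDepth int, fetch fetchFn) []node {
	type item struct {
		url   string
		depth int
	}
	visited := map[string]bool{root: true}
	queue := []item{{root, 0}}
	var out []node
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]

		deps, ok := fetch(cur.url)
		if !ok {
			// Unreachable: record it and keep going.
			out = append(out, node{cur.url, "unknown", cur.depth})
			continue
		}
		out = append(out, node{cur.url, "ok", cur.depth})
		if cur.depth >= maxDepth {
			continue // depth limit reached, do not expand further
		}
		for _, d := range deps {
			if !visited[d] {
				visited[d] = true
				queue = append(queue, item{d, cur.depth + 1})
			}
		}
	}
	return out
}

// demoFetch models gateway -> orders -> payments -> stripe-gw, with a
// cycle back to gateway and an unreachable stripe-gw.
func demoFetch(url string) ([]string, bool) {
	graph := map[string][]string{
		"gateway":  {"orders"},
		"orders":   {"payments", "gateway"},
		"payments": {"stripe-gw"},
	}
	deps, ok := graph[url]
	return deps, ok
}

func main() {
	for _, n := range discover("gateway", 5, demoFetch) {
		fmt.Println(n.Depth, n.URL, n.Status)
	}
}
```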
The library supports all three Kubernetes probe types. The following manifest snippet can be added directly to your deployment configuration.
# Liveness: is the process alive?
livenessProbe:
httpGet:
path: /livez
port: 8181
initialDelaySeconds: 5
periodSeconds: 10
# Readiness: can it serve traffic?
readinessProbe:
httpGet:
path: /readyz
port: 8181
initialDelaySeconds: 5
periodSeconds: 10
# Startup: has it finished initializing?
startupProbe:
httpGet:
path: /healthz
port: 8181
failureThreshold: 30
periodSeconds: 2
Checks with WithStartupImpact() must all pass before liveness and readiness are evaluated. Once startup completes, it's not re-evaluated. This prevents K8s from killing your pod while it's still loading data or warming caches.
mgr.AddCheck("cache-warm",
command.NewChecker("cache", warmCache),
health.WithStartupImpact(),
health.WithCheckFrequency(health.CheckAtInterval, 2*time.Second, 0),
)
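The "not re-evaluated" behavior amounts to a one-way latch. A minimal sketch, assuming the gate is only touched from the single processing goroutine (the `gate` type is hypothetical):

```go
package main

import "fmt"

// gate (hypothetical) latches once every startup check has passed.
type gate struct {
	started bool
}

// observe takes the current pass/fail state of the startup checks and
// reports whether liveness and readiness should be evaluated yet.
func (g *gate) observe(startupChecksPassing bool) bool {
	if !g.started && startupChecksPassing {
		g.started = true // latch: later failures do not reopen the gate
	}
	return g.started
}

func main() {
	g := &gate{}
	fmt.Println(g.observe(false)) // false: still starting
	fmt.Println(g.observe(true))  // true: startup complete
	fmt.Println(g.observe(false)) // true: startup is not re-evaluated
}
```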
Following the Kubernetes API health check convention, you can query individual checks by name and get verbose output.
# Individual check by name
curl http://localhost:8181/livez/postgres
# [+]postgres ok (200)
curl http://localhost:8181/readyz/redis
# [-]redis failed: connection refused (503)
# Verbose: list all checks with status
curl "http://localhost:8181/livez?verbose"
# [+]postgres ok
# [-]redis failed: connection refused
# Exclude a check from evaluation
curl "http://localhost:8181/livez?verbose&exclude=redis"
# [+]postgres ok (200, redis excluded)
Individual checks are available at /livez/{name}, /readyz/{name}, and /healthz/{name}. Unknown check names return 404.
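The routing behind these endpoints could be sketched with net/http as follows. This reproduces the behavior shown in the curl examples for /livez only; the `livez` handler, the `get` helper, and the map of check name to error message are assumptions, not the library's implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// livez serves /livez and /livez/{name}. checks maps a check name to its
// error message; an empty message means the check is passing.
func livez(checks map[string]string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		name := strings.TrimPrefix(strings.TrimPrefix(r.URL.Path, "/livez"), "/")

		var names []string
		if name != "" {
			// Single check by name.
			if _, ok := checks[name]; !ok {
				http.NotFound(w, r)
				return
			}
			names = []string{name}
		} else {
			// Aggregate, minus anything listed in ?exclude=.
			excluded := map[string]bool{}
			for _, n := range q["exclude"] {
				excluded[n] = true
			}
			for n := range checks {
				if !excluded[n] {
					names = append(names, n)
				}
			}
			sort.Strings(names)
		}

		code := http.StatusOK
		var body strings.Builder
		for _, n := range names {
			if msg := checks[n]; msg != "" {
				code = http.StatusServiceUnavailable
				fmt.Fprintf(&body, "[-]%s failed: %s\n", n, msg)
			} else {
				fmt.Fprintf(&body, "[+]%s ok\n", n)
			}
		}
		w.WriteHeader(code)
		if _, verbose := q["verbose"]; verbose || name != "" {
			fmt.Fprint(w, body.String())
		}
	})
}

// get issues one in-memory request against a fixed set of checks.
func get(target string) (int, string) {
	h := livez(map[string]string{"postgres": "", "redis": "connection refused"})
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := get("/livez?verbose")
	fmt.Println(code)
	fmt.Print(body)
}
```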
This section describes the internal data flow and key design decisions for those who want to understand how the library works under the hood.
flowchart TD
AC["AddCheck(name, checker, opts...)"] --> CM["checkers map"]
CM --> DI["dispatchIntervalCheck<br/>goroutine per check"]
CM --> DO["dispatchOneTimeCheck<br/>goroutine per check"]
DI --> SC["safeCheck()<br/>panic recovery"]
DO --> SC
SC --> AO["applyCheckOptions()<br/>set Name, Group, Impact flags"]
AO --> CF["checkFunnel<br/>buffered channel"]
CF --> PH["processHealthCheck<br/>single goroutine"]
PH --> |"1. nil guard<br/>2. store result<br/>3. first-run gate<br/>4. update reporters"| EF["evaluateFitness"]
EF --> |"startup gate<br/>liveness AND<br/>readiness AND<br/>can't be ready if not live"| SS["setLive / setReady / setStartup<br/>atomic swap + fan out"]
SS --> HTTP["/livez /readyz /healthz<br/>/.well-known/health"]
SS --> GRPC["gRPC<br/>Check() / Watch()"]
SS --> OTEL["OTel<br/>gauges, histograms, counters"]
SS --> PROM["Prometheus<br/>/metrics"]
style AC fill:#6c5ce7,color:#fff,stroke:none
style CF fill:#ff9800,color:#fff,stroke:none
style PH fill:#6c5ce7,color:#fff,stroke:none
style EF fill:#6c5ce7,color:#fff,stroke:none
style HTTP fill:#4caf50,color:#fff,stroke:none
style GRPC fill:#4caf50,color:#fff,stroke:none
style OTEL fill:#4caf50,color:#fff,stroke:none
style PROM fill:#4caf50,color:#fff,stroke:none
Heavy reporters ship with their own go.mod files. You only pay for what you import.
safeCheck() recovers panicking checkers. cacheHealthChecks() recovers serialization panics. Recover middleware catches HTTP handler panics. The library never takes down your service.
The /.well-known/health manifest makes every service discoverable.
The core module (github.com/schigh/health/v2) has everything you need for basic use. Reporters with heavy dependencies are separate Go modules with their own go.mod:
reporter/grpc depends on google.golang.org/grpc
reporter/otel depends on go.opentelemetry.io/otel
reporter/prometheus depends on github.com/prometheus/client_golang
The library includes a full end-to-end test suite that deploys three microservices to a Kind (Kubernetes-in-Docker) cluster with real Postgres and Redis infrastructure.
flowchart LR
GW["Gateway<br/>:8181"]
ORD["Orders<br/>:8182"]
PAY["Payments<br/>:8183"]
PG[("Postgres")]
RD[("Redis")]
GW -- HTTP --> ORD -- HTTP --> PAY
ORD -- TCP --> PG
ORD -- RESP --> RD
PAY -- TCP --> PG
style GW fill:#6c5ce7,color:#fff,stroke:none
style ORD fill:#6c5ce7,color:#fff,stroke:none
style PAY fill:#6c5ce7,color:#fff,stroke:none
style PG fill:#ff9800,color:#fff,stroke:none
style RD fill:#ff9800,color:#fff,stroke:none
# Run the full E2E suite (requires Docker + Kind)
make e2e
# Or step by step for debugging
make e2e-cluster # create Kind cluster
make e2e-build # build Docker images
make e2e-deploy # deploy to K8s
make e2e-test # run tests
make e2e-teardown # delete cluster
# Core (zero dependencies)
go get github.com/schigh/health/v2
# gRPC reporter
go get github.com/schigh/health/v2/reporter/grpc
# OpenTelemetry reporter
go get github.com/schigh/health/v2/reporter/otel
# Prometheus reporter
go get github.com/schigh/health/v2/reporter/prometheus
Requires Go 1.22+.