Monitoring

This guide covers setting up monitoring and alerting for UAPK Gateway.

Metrics Overview

UAPK Gateway exposes Prometheus-compatible metrics:

Metric | Type | Description
gateway_requests_total | Counter | Total requests by endpoint
gateway_request_duration_seconds | Histogram | Request latency
gateway_decisions_total | Counter | Decisions by type (allow/deny/escalate)
gateway_active_agents | Gauge | Currently active agents
gateway_pending_approvals | Gauge | Pending approval count
gateway_chain_verification | Gauge | Last verification status
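
To confirm metrics are being exported before pointing Prometheus at the gateway, you can scrape the endpoint by hand. This uses port 8000 and the /metrics path configured below; the sample values are purely illustrative.

# Fetch the raw exposition text and filter for gateway metrics
curl -s http://localhost:8000/metrics | grep '^gateway_'

# Illustrative output:
# gateway_requests_total{endpoint="/api/v1/gateway/health",status="2xx"} 1042
# gateway_pending_approvals 3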

Prometheus Setup

docker-compose.yml

version: '3.8'

services:
  gateway:
    image: ghcr.io/uapk/gateway:latest
    environment:
      METRICS_ENABLED: "true"
    ports:
      - "8000:8000"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}

volumes:
  prometheus_data:
  grafana_data:

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'gateway'
    static_configs:
      - targets: ['gateway:8000']
    metrics_path: /metrics

  - job_name: 'caddy'
    static_configs:
      - targets: ['caddy:9180']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/alerts/*.yml
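
Before starting or reloading Prometheus, the configuration can be validated with promtool, which ships in the prom/prometheus image:

# Validate prometheus.yml (and the rule files it references) inside the container
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml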

Grafana Dashboards

Gateway Overview Dashboard

{
  "dashboard": {
    "title": "UAPK Gateway Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(gateway_requests_total[5m])",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "title": "Decision Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (decision) (gateway_decisions_total)",
            "legendFormat": "{{decision}}"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(gateway_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Pending Approvals",
        "type": "stat",
        "targets": [
          {
            "expr": "gateway_pending_approvals"
          }
        ]
      }
    ]
  }
}
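
Grafana only loads dashboard JSON from the mounted ./grafana/dashboards directory if a provisioning provider file sits alongside it. A minimal provider sketch (the provider name is arbitrary; the path matches the mount in the compose file above):

apiVersion: 1

providers:
  - name: 'uapk-dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards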

Alerting Rules

alerts.yml

groups:
  - name: gateway
    rules:
      - alert: GatewayDown
        expr: up{job="gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway is down"
          description: "Gateway has been down for more than 1 minute"
      - alert: HighErrorRate
        expr: sum(rate(gateway_requests_total{status="5xx"}[5m])) / sum(rate(gateway_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p99 latency is {{ $value }}s"

      - alert: PendingApprovalsBacklog
        expr: gateway_pending_approvals > 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Pending approvals backlog"
          description: "{{ $value }} approvals pending for over 1 hour"

      - alert: ChainVerificationFailed
        expr: gateway_chain_verification == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Log chain verification failed"
          description: "Audit log chain integrity check failed"

      - alert: HighDenyRate
        expr: sum(rate(gateway_decisions_total{decision="deny"}[1h])) / sum(rate(gateway_decisions_total[1h])) > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "High deny rate"
          description: "{{ $value | humanizePercentage }} of requests denied"

  - name: database
    rules:
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"

      - alert: HighConnectionCount
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection count"

Alertmanager Configuration

alertmanager.yml

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@yourdomain.com'

route:
  receiver: 'default'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts-critical'

  - name: 'warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
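
amtool, which is bundled with Alertmanager, can validate this file. The container name and mount path below are assumptions; adjust them to however Alertmanager is deployed in your stack:

# Validate routing tree and receiver configuration
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml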

Health Checks

Gateway Health Endpoint

# Basic health check
curl http://localhost:8000/api/v1/gateway/health

# Response
{
  "status": "healthy",
  "version": "0.1.0",
  "database": "connected",
  "timestamp": "2024-12-14T10:00:00Z"
}

Docker Health Checks

services:
  gateway:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/gateway/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
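
To confirm Docker is actually marking the container healthy, inspect the derived health state:

# Health column in the service listing
docker compose ps gateway

# Full healthcheck history (recent probes, exit codes, output)
docker inspect --format '{{json .State.Health}}' $(docker compose ps -q gateway)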

External Health Monitoring

#!/bin/bash
# Uptime monitoring script

GATEWAY_URL="https://gateway.yourdomain.com"
SLACK_WEBHOOK="https://hooks.slack.com/services/..."

if ! curl -sf "$GATEWAY_URL/api/v1/gateway/health" > /dev/null; then
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d '{"text": "⚠️ UAPK Gateway health check failed!"}'
fi
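
The script can be scheduled with cron; the installation path below is only an example:

# Run the external health check every 5 minutes (example path)
*/5 * * * * /opt/uapk/health-check.sh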

Log Aggregation

Loki Setup

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml

volumes:
  loki_data:

promtail-config.yml

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: gateway
    static_configs:
      - targets:
          - localhost
        labels:
          job: gateway
          __path__: /var/log/gateway/*.log

  - job_name: caddy
    static_configs:
      - targets:
          - localhost
        labels:
          job: caddy
          __path__: /var/log/caddy/*.log
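
Once logs are flowing, they can be queried from Grafana's Explore view (with Loki added as a data source) or with logcli using LogQL. These queries use the job labels defined above; matching on the literal string "error" is an assumption about the gateway's log format:

# Gateway log lines containing "error"
{job="gateway"} |= "error"

# Per-second rate of matching lines over 5 minutes, useful as a panel query
rate({job="gateway"} |= "error" [5m])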

Key Metrics to Monitor

Metric | Alert Threshold | Description
Request latency p99 | > 1s | User experience
Error rate | > 1% | Service health
Pending approvals | > 10 for 1h | Operator attention needed
Chain verification | Failed | Critical integrity
Database connections | > 80% | Capacity
CPU usage | > 80% | Performance
Memory usage | > 85% | Stability
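
The CPU, memory, and connection rows assume host- and database-level exporters. If node_exporter and postgres_exporter are scraped (only the latter appears in prometheus.yml above), commonly used expressions for those thresholds are:

# CPU usage (%) per instance via node_exporter
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%) per instance via node_exporter
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Connection usage as a fraction of max_connections via postgres_exporter
# (assumes a single PostgreSQL instance)
sum(pg_stat_activity_count) / sum(pg_settings_max_connections)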

Runbook

High Error Rate

  1. Check gateway logs: docker compose logs gateway (see the query sketch after this list)
  2. Check database connectivity: docker compose exec db pg_isready
  3. Review recent deployments
  4. Check external service dependencies
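
To see which endpoints are producing the errors (complementing the log check in step 1), break the request counter out by its endpoint and status labels:

# 5xx request rate per endpoint over the last 5 minutes
sum by (endpoint) (rate(gateway_requests_total{status="5xx"}[5m]))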

High Latency

  1. Check database query times
  2. Review connection pool usage
  3. Check for resource contention
  4. Scale if necessary

Chain Verification Failed

  1. Treat as critical and follow the incident response process
  2. Export affected logs immediately
  3. Investigate potential tampering
  4. Contact security team