aws-druid-infra

AWS CDK application written in Java that provisions an Apache Druid deployment on Amazon EKS (Elastic Kubernetes Service) with integrated AWS managed services for real-time OLAP analytics at scale.



Overview

This CDK application provisions a production-ready Apache Druid deployment on Amazon EKS with fully integrated AWS managed services. Druid is a high-performance, real-time analytics database designed for workflows where fast queries and ingest are critical. The architecture follows EKS Best Practices and Analytics Lens recommendations.

Key Features

| Feature | Description | Reference |
|---|---|---|
| EKS Cluster | Managed Kubernetes control plane with RBAC configuration | EKS User Guide |
| AWS Managed Addons | VPC CNI, EBS CSI, CoreDNS, Kube Proxy, Pod Identity Agent, CloudWatch Container Insights | EKS Add-ons |
| Helm Chart Addons | cert-manager, AWS Load Balancer Controller, Karpenter, CSI Secrets Store | Helm |
| Apache Druid | Real-time OLAP database with sub-second query latency | Druid Documentation |
| RDS PostgreSQL | Managed database for Druid metadata storage | Amazon RDS |
| S3 Deep Storage | Scalable object storage for Druid segments | S3 Deep Storage |
| MSK (Kafka) | Managed streaming for real-time data ingestion | Amazon MSK |
| Grafana Cloud Integration | Full observability stack with metrics, logs, and traces | Grafana Cloud |
| Managed Node Groups | Bottlerocket AMIs for enhanced security | Managed Node Groups |

Architecture

System Overview

flowchart TB
    subgraph "Data Sources"
        KAFKA[MSK Kafka]
        S3IN[S3 Ingestion]
        BATCH[Batch Files]
    end

    subgraph "EKS Cluster"
        subgraph "Druid Cluster"
            COORD[Coordinator]
            OVER[Overlord]
            BROKER[Broker]
            ROUTER[Router]
            HIST[Historical]
            MM[MiddleManager]
        end
    end

    subgraph "AWS Managed Services"
        RDS[(RDS PostgreSQL)]
        S3DEEP[S3 Deep Storage]
        S3MSQ[S3 MSQ Storage]
    end

    subgraph "Query Clients"
        CONSOLE[Web Console]
        JDBC[JDBC Clients]
        API[REST API]
    end

    KAFKA --> MM
    S3IN --> MM
    BATCH --> MM

    MM --> OVER
    OVER --> COORD
    COORD --> RDS
    MM --> S3DEEP
    HIST --> S3DEEP

    CONSOLE --> ROUTER
    JDBC --> BROKER
    API --> BROKER
    ROUTER --> BROKER
    BROKER --> HIST

Data Ingestion Flow

sequenceDiagram
    participant Source as Data Source
    participant Kafka as MSK Kafka
    participant MM as MiddleManager
    participant Overlord
    participant S3 as S3 Deep Storage
    participant Historical
    participant Broker

    Source->>Kafka: Publish Events
    Kafka->>MM: Consume Batch

    MM->>MM: Parse & Index
    MM->>Overlord: Report Progress
    MM->>S3: Push Segment

    S3->>Historical: Load Segment
    Historical->>Historical: Cache in Memory

    Note over Broker,Historical: Query Path
    Broker->>Historical: Query Segment
    Historical-->>Broker: Results

Stack Structure

The Druid infrastructure uses a layered architecture with CloudFormation nested stacks:

flowchart TB
    subgraph "DeploymentStack (main)"
        MAIN[Main Stack]
    end

    subgraph "Nested Stacks"
        VPC[VpcNestedStack]
        EKS[EksNestedStack]
        SETUP[DruidSetupNestedStack]
        DRUID[DruidNestedStack]
    end

    MAIN --> VPC
    MAIN --> EKS
    MAIN --> SETUP
    MAIN --> DRUID

    EKS -.->|depends on| VPC
    SETUP -.->|depends on| VPC
    DRUID -.->|depends on| EKS
    DRUID -.->|depends on| SETUP

Dependency Chain:

  1. VPC is created first (network foundation)
  2. EKS cluster is provisioned (independent of Druid setup)
  3. Druid setup creates supporting resources (RDS, S3, MSK) that depend on VPC
  4. Druid Helm chart is deployed after both EKS and setup are ready

Apache Druid Components

flowchart LR
    subgraph "Master Nodes"
        COORD[Coordinator<br/>Segment Management]
        OVER[Overlord<br/>Task Management]
    end

    subgraph "Query Nodes"
        BROKER[Broker<br/>Query Routing]
        ROUTER[Router<br/>API Gateway]
    end

    subgraph "Data Nodes"
        HIST[Historical<br/>Segment Storage]
        MM[MiddleManager<br/>Ingestion Tasks]
    end

    ROUTER --> BROKER
    ROUTER --> COORD
    ROUTER --> OVER
    BROKER --> HIST
    OVER --> MM
    COORD --> HIST

Apache Druid consists of several specialized node types:

| Node Type | Purpose | Reference |
|---|---|---|
| Coordinator | Manages data availability and segment distribution | Coordinator Process |
| Overlord | Controls data ingestion workload assignment | Overlord Process |
| Broker | Handles queries from external clients | Broker Process |
| Router | Routes requests to Brokers, Coordinators, and Overlords | Router Process |
| Historical | Stores and queries historical data segments | Historical Process |
| MiddleManager | Executes submitted ingestion tasks | MiddleManager Process |

AWS Service Integration

| Service | Druid Component | Purpose | Reference |
|---|---|---|---|
| RDS PostgreSQL | Metadata Storage | Stores segment metadata, rules, and configuration | Metadata Storage |
| S3 | Deep Storage | Long-term segment storage for Historical nodes | Deep Storage |
| S3 | Multi-Stage Query | Intermediate storage for MSQ engine | MSQ |
| MSK (Kafka) | Real-time Ingestion | Streaming data source for Druid supervisors | Kafka Ingestion |
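In Druid's runtime configuration these integrations surface as properties along the following lines. This is an illustrative fragment only: the actual values are injected from this deployment's Mustache templates, and the endpoint and bucket names shown are placeholders.

```properties
# Metadata storage backed by RDS PostgreSQL
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://<rds-endpoint>:5432/druid
druid.metadata.storage.connector.user=druid

# Deep storage backed by S3
druid.storage.type=s3
druid.storage.bucket=<deep-storage-bucket>
druid.storage.baseKey=druid/segments
```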

Observability Stack

The cluster integrates with Grafana Cloud for comprehensive observability:

| Component | Purpose | Reference |
|---|---|---|
| Prometheus | Druid and Kubernetes metrics collection | Grafana Mimir |
| Loki | Log aggregation from all Druid processes | Grafana Loki |
| Tempo | Distributed tracing for query analysis | Grafana Tempo |
| Pyroscope | Continuous profiling for performance optimization | Grafana Pyroscope |
| OpenTelemetry Collector | Telemetry data collection and export | OpenTelemetry |

Platform Integration

When deployed through the Fastish platform, this infrastructure integrates with internal platform services:

| Platform Component | Integration Point | Purpose |
|---|---|---|
| Orchestrator | Release pipeline automation | Automated CDK synthesis and deployment via CodePipeline |
| Portal | Subscriber management | Tenant provisioning, cluster access control |
| Network | Shared VPC infrastructure | Cross-stack connectivity for platform services |
| Reporting | Usage metering | Pipeline execution tracking and cost attribution |

These integrations are managed automatically when deploying via the platform's release workflows.


Prerequisites

| Requirement | Version | Installation |
|---|---|---|
| Java | 21+ | SDKMAN |
| Maven | 3.8+ | Maven Download |
| AWS CLI | 2.x | AWS CLI Install |
| AWS CDK CLI | 2.221.0+ | CDK Getting Started |
| kubectl | 1.28+ | kubectl Install |
| Helm | 3.x | Helm Install |
| Docker | Latest | Docker Install |
| GitHub CLI | Latest | GitHub CLI |
| Grafana Cloud Account | - | Grafana Cloud |

AWS CDK Bootstrap:

cdk bootstrap aws://<account-id>/<region>

Replace <account-id> with your AWS account ID and <region> with your desired AWS region (e.g., us-west-2). This sets up necessary resources for CDK deployments including an S3 bucket for assets and CloudFormation execution roles. See: CDK Bootstrapping | Bootstrap CLI Reference


Deployment

Step 1: Clone Repositories

gh repo clone fast-ish/cdk-common
gh repo clone fast-ish/aws-druid-infra

Step 2: Build Projects

mvn -f cdk-common/pom.xml clean install
mvn -f aws-druid-infra/pom.xml clean install

Step 3: Prepare Apache Druid Artifacts (Optional)

If using custom Druid images or Helm charts, prepare the artifacts in Amazon ECR:

Docker Image

# Authenticate to ECR
aws ecr get-login-password --region <region> | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# Create repository
aws ecr create-repository \
  --repository-name fasti.sh/v1/docker/druid \
  --region <region> \
  --image-scanning-configuration scanOnPush=true

# Build and push image
docker buildx build --provenance=false --platform linux/amd64 -f Dockerfile.druid \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:$(date +'%Y%m%d') \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:v1 \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:latest \
  --push .

See: ECR User Guide

Helm Chart

# Authenticate Helm to ECR
aws ecr get-login-password --region <region> | \
  helm registry login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# Create repository for Helm charts
aws ecr create-repository \
  --repository-name fasti.sh/v1/helm/druid \
  --region <region> \
  --image-scanning-configuration scanOnPush=true \
  --encryption-configuration encryptionType=AES256

# Package and push chart
helm package ./helm/chart/druid
helm push druid-<version>.tgz oci://<account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/helm

See: Helm OCI Support

Update Artifact References

Update the Docker image reference in src/main/resources/prototype/v1/druid/values.mustache:

| Parameter | Description | Example |
|---|---|---|
| image.repository | ECR repository for Druid Docker image | 000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/docker/druid |
| image.tag | Tag of the Druid Docker image | v1, latest, or date tag |
| image.pullPolicy | Pull policy for the Docker image | IfNotPresent |
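Rendered into the chart values, these parameters take roughly the following shape. This is an illustrative YAML sketch; the exact keys come from the Druid Helm chart in use.

```yaml
image:
  repository: 000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/docker/druid
  tag: v1
  pullPolicy: IfNotPresent
```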

Update the Helm chart reference in src/main/resources/prototype/v1/conf.mustache:

| Parameter | Description | Example |
|---|---|---|
| chart.repository | ECR repository for Druid Helm chart | oci://000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/helm |
| chart.name | Name of the Druid Helm chart | druid |
| chart.version | Version of the Druid Helm chart | 34.0.0 |

Step 4: Configure Deployment

Create aws-druid-infra/cdk.context.json from aws-druid-infra/cdk.context.template.json:

Required Configuration Parameters:

| Parameter | Description | Example |
|---|---|---|
| :account | AWS account ID (12-digit number) | 123456789012 |
| :region | AWS region for deployment | us-west-2 |
| :domain | Registered domain name (optional) | example.com |
| :environment | Environment name (do not change) | prototype |
| :version | Resource version identifier | v1 |

Notes:

  • :environment and :version map to resource files at aws-druid-infra/src/main/resources/prototype/v1
  • These values determine which configuration templates are loaded during CDK synthesis
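Putting the parameters together, a minimal cdk.context.json might look like the following. Values are placeholders; see cdk.context.template.json for the authoritative key set.

```json
{
  ":account": "123456789012",
  ":region": "us-west-2",
  ":domain": "example.com",
  ":environment": "prototype",
  ":version": "v1"
}
```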

Step 5: Configure Grafana Cloud

Add Grafana Cloud configuration for observability:

{
  "hosted:eks:grafana:instanceId": "000000",
  "hosted:eks:grafana:key": "glc_xyz",
  "hosted:eks:grafana:lokiHost": "https://logs-prod-000.grafana.net",
  "hosted:eks:grafana:lokiUsername": "000000",
  "hosted:eks:grafana:prometheusHost": "https://prometheus-prod-000-prod-us-west-0.grafana.net",
  "hosted:eks:grafana:prometheusUsername": "0000000",
  "hosted:eks:grafana:tempoHost": "https://tempo-prod-000-prod-us-west-0.grafana.net/tempo",
  "hosted:eks:grafana:tempoUsername": "000000",
  "hosted:eks:grafana:pyroscopeHost": "https://profiles-prod-000.grafana.net:443"
}

Grafana Cloud Setup:

  1. Create Account: Sign up at grafana.com
  2. Create Stack: Navigate to your stack settings
  3. Generate API Key: Create key with required permissions

| Parameter | Location | Description |
|---|---|---|
| instanceId | Stack details page | Unique identifier for your Grafana instance |
| key | API keys section | API key with all permissions (starts with glc_) |
| lokiHost | Logs > Data Sources > Loki | Endpoint URL for logs |
| lokiUsername | Logs > Data Sources > Loki | Account identifier for Loki |
| prometheusHost | Metrics > Data Sources > Prometheus | Endpoint URL for metrics |
| prometheusUsername | Metrics > Data Sources > Prometheus | Account identifier for Prometheus |
| tempoHost | Traces > Data Sources > Tempo | Endpoint URL for traces |
| tempoUsername | Traces > Data Sources > Tempo | Account identifier for Tempo |
| pyroscopeHost | Profiles > Connect a Data Source | Endpoint URL for profiling |

Required API Key Permissions:

| Permission | Access | Purpose |
|---|---|---|
| metrics | Read/Write | Prometheus metrics ingestion |
| logs | Read/Write | Loki log ingestion |
| traces | Read/Write | Tempo trace ingestion |
| profiles | Read/Write | Pyroscope profiling data |
| alerts | Read/Write | Alerting configuration |
| rules | Read/Write | Recording and alerting rules |

See: Grafana Cloud Kubernetes Monitoring

Step 6: Configure Cluster Access

Add IAM role mappings in cdk.context.json for EKS access entries:

{
  "hosted:eks:administrators": [
    {
      "username": "administrator",
      "role": "arn:aws:iam::000000000000:role/AWSReservedSSO_AdministratorAccess_abc",
      "email": "admin@example.com"
    }
  ],
  "hosted:eks:users": [
    {
      "username": "user",
      "role": "arn:aws:iam::000000000000:role/AWSReservedSSO_DeveloperAccess_abc",
      "email": "user@example.com"
    }
  ]
}

| Parameter | Description | Reference |
|---|---|---|
| administrators | IAM roles with full cluster admin access | Cluster Admin |
| users | IAM roles with read-only cluster access | RBAC Authorization |
| username | Identifier for the user in Kubernetes RBAC | User Mapping |
| role | AWS IAM role ARN (typically from AWS IAM Identity Center) | IAM Roles |
| email | For identification and traceability | - |

Step 7: Deploy Infrastructure

cd aws-druid-infra

# Preview changes
cdk synth

# Deploy all stacks
cdk deploy

See: CDK Deploy Command | CDK Synth Command

What Gets Deployed:

| Resource Type | Count | Description | Reference |
|---|---|---|---|
| CloudFormation Stacks | 5 | 1 main + 4 nested stacks | Nested Stacks |
| VPC | 1 | Multi-AZ with public/private subnets | VPC Documentation |
| EKS Cluster | 1 | Kubernetes 1.28+ control plane | EKS Clusters |
| RDS PostgreSQL | 1 | Druid metadata database | RDS PostgreSQL |
| S3 Buckets | 2 | Deep storage + MSQ intermediate | S3 User Guide |
| MSK Cluster | 1 | Kafka for real-time ingestion | Amazon MSK |
| Managed Node Groups | 1+ | Bottlerocket-based worker nodes | Managed Node Groups |
| Druid Deployment | 1 | All Druid node types via Helm | Druid Helm Chart |

Step 8: Access the Cluster

# Update kubeconfig
aws eks update-kubeconfig --name <cluster-name> --region <region>

# Verify cluster connectivity
kubectl get nodes
kubectl get pods -A

# Check Druid pods
kubectl get pods -n druid

See: Connecting to EKS


Configuration Reference

CDK Context Variables

The build process uses Mustache templating to inject context variables into configuration files. See cdk-common for the complete build process documentation.

| Variable | Type | Description |
|---|---|---|
| {{account}} | String | AWS account ID |
| {{region}} | String | AWS region |
| {{environment}} | String | Environment name |
| {{version}} | String | Resource version |
| {{hosted:id}} | String | Unique deployment identifier |
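The substitution itself is plain string replacement. As a rough local sketch of what synthesis does with these variables (the real rendering is handled by cdk-common, and the file name below is hypothetical):

```shell
# Create a toy template with two Mustache-style variables
cat > conf.snippet <<'EOF'
account: "{{account}}"
region: "{{region}}"
EOF

# Substitute values the way the build injects CDK context during synthesis
sed -e 's/{{account}}/123456789012/' -e 's/{{region}}/us-west-2/' conf.snippet
# prints:
#   account: "123456789012"
#   region: "us-west-2"
```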

Template Structure

src/main/resources/
└── prototype/
    └── v1/
        ├── conf.mustache           # Main configuration
        ├── eks/
        │   ├── cluster.mustache    # EKS cluster configuration
        │   ├── addons.mustache     # Managed addons
        │   └── nodegroups.mustache # Node group configuration
        ├── druid/
        │   ├── values.mustache     # Druid Helm chart values
        │   ├── rds.mustache        # RDS configuration
        │   ├── s3.mustache         # S3 bucket configuration
        │   └── msk.mustache        # MSK cluster configuration
        ├── helm/
        │   ├── karpenter.mustache  # Karpenter values
        │   └── monitoring.mustache # Grafana stack values
        └── iam/
            └── roles.mustache      # IAM role definitions

Druid Operations

Accessing the Druid Console

The Druid Router provides a web console for administration:

# Port-forward to Druid Router
kubectl port-forward svc/druid-router 8888:8888 -n druid

# Access console at http://localhost:8888

See: Druid Web Console

Ingesting Data from Kafka

Create a Kafka ingestion supervisor to stream data:

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["dimension1", "dimension2"]
      }
    },
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "<msk-bootstrap-servers>"
      },
      "topic": "my-topic"
    }
  }
}

See: Kafka Ingestion

Querying Data

Druid supports multiple query languages:

| Method | Description | Reference |
|---|---|---|
| Druid SQL | SQL-compatible queries via Broker | Druid SQL |
| Native Queries | JSON-based query format | Native Queries |
| JDBC | Standard JDBC driver connectivity | JDBC Driver |
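As a concrete example, a Druid SQL query is a small JSON payload POSTed to /druid/v2/sql on the Router or Broker. The datasource and dimension names here follow the hypothetical ingestion example above:

```json
{
  "query": "SELECT dimension1, COUNT(*) AS event_count FROM \"my-datasource\" WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR GROUP BY dimension1 ORDER BY event_count DESC"
}
```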

Security Considerations

| Aspect | Implementation | Reference |
|---|---|---|
| Node AMI | Bottlerocket for minimal attack surface | Bottlerocket |
| Pod Identity | IAM roles for service accounts | Pod Identity |
| Network Policies | VPC CNI for pod-level network isolation | Network Policies |
| Secrets Management | CSI Secrets Store with AWS Secrets Manager | Secrets Store |
| RDS Encryption | Encryption at rest with KMS | RDS Encryption |
| S3 Encryption | Server-side encryption (SSE-S3) | S3 Encryption |
| MSK Encryption | TLS in transit, KMS at rest | MSK Encryption |

See: EKS Best Practices Guide - Security


Troubleshooting

Quick Diagnostics

# Check CDK synthesis
cdk synth --quiet 2>&1 | head -20

# Verify CloudFormation stack status
aws cloudformation describe-stacks --stack-name <stack-name> \
  --query 'Stacks[0].StackStatus'

# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.status'

# Verify Druid pods
kubectl get pods -n druid
kubectl describe pod <druid-pod> -n druid

# Check Druid logs
kubectl logs -l app=druid-coordinator -n druid --tail=50
kubectl logs -l app=druid-broker -n druid --tail=50

# Verify RDS connectivity
kubectl run pg-test --rm -it --image=postgres:15 -- \
  psql -h <rds-endpoint> -U druid -d druid -c "SELECT 1"

# Check MSK cluster status
aws kafka describe-cluster --cluster-arn <msk-arn> \
  --query 'ClusterInfo.State'

# Test Kafka connectivity from pod
kubectl exec -it <druid-middlemanager-pod> -n druid -- \
  kafka-broker-api-versions --bootstrap-server <msk-bootstrap>:9092

Common Issues

| Issue | Symptom | Resolution |
|---|---|---|
| RDS connection timeout | Druid coordinator fails to start | Verify security group allows port 5432 from EKS nodes |
| MSK authentication failure | MiddleManager ingestion errors | Check IAM role permissions for MSK access |
| S3 deep storage errors | Segment handoff failures | Verify S3 bucket policy and IAM permissions |
| Druid OOM | Historical/MiddleManager pod restarts | Increase memory limits in values.yaml |
| Grafana no metrics | Empty dashboards | Verify Grafana Cloud credentials in cdk.context.json |
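For the Druid OOM case, the memory knobs live in the chart values, and the pod limit and JVM sizing need to move together. An illustrative fragment (exact keys depend on the chart version in use):

```yaml
historical:
  resources:
    limits:
      memory: 16Gi
  # Heap plus direct memory must fit within the pod memory limit
  javaOpts: "-Xms8g -Xmx8g -XX:MaxDirectMemorySize=6g"
```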

For detailed troubleshooting procedures, see the Troubleshooting Guide.


Related Documentation

Platform Documentation

| Resource | Description |
|---|---|
| Fastish Documentation | Platform documentation home |
| cdk-common | Shared CDK constructs library |
| Troubleshooting Guide | Common issues and solutions |
| Validation Guide | Deployment validation procedures |
| Upgrade Guide | Upgrade and rollback procedures |
| Capacity Planning | Sizing and cost guidance |
| IAM Permissions | Minimum required permissions |
| Network Requirements | CIDR, ports, and security groups |
| Glossary | Platform terminology |
| Changelog | Version history |

AWS Documentation

| Resource | Description |
|---|---|
| EKS User Guide | Official EKS documentation |
| EKS Best Practices | AWS EKS best practices guide |
| Analytics Lens | Analytics architecture guidance |
| Amazon MSK Developer Guide | MSK documentation |
| MSK Best Practices | MSK configuration guidance |
| Amazon RDS User Guide | RDS documentation |
| S3 Best Practices | S3 performance optimization |

Apache Druid Documentation

| Resource | Description |
|---|---|
| Apache Druid Documentation | Official Druid documentation |
| Druid Architecture | Druid design and components |
| Druid Tuning Guide | Performance optimization |

Observability

| Resource | Description |
|---|---|
| Grafana Cloud Docs | Grafana Cloud documentation |
| OpenTelemetry Documentation | Telemetry collection framework |

License

MIT License

