aws-druid-infra

AWS CDK application written in Java that provisions an Apache Druid deployment on Amazon EKS (Elastic Kubernetes Service) with integrated AWS managed services for real-time OLAP analytics at scale.



Overview

This CDK application provisions a production-ready Apache Druid deployment on Amazon EKS with fully integrated AWS managed services. Druid is a high-performance, real-time analytics database designed for workflows where fast queries and ingest are critical. The architecture follows EKS Best Practices and Analytics Lens recommendations.

Key Features

| Feature | Description | Reference |
|---|---|---|
| EKS Cluster | Managed Kubernetes control plane with RBAC configuration | EKS User Guide |
| AWS Managed Addons | VPC CNI, EBS CSI, CoreDNS, Kube Proxy, Pod Identity Agent, CloudWatch Container Insights | EKS Add-ons |
| Helm Chart Addons | cert-manager, AWS Load Balancer Controller, Karpenter, CSI Secrets Store | Helm |
| Apache Druid | Real-time OLAP database with sub-second query latency | Druid Documentation |
| RDS PostgreSQL | Managed database for Druid metadata storage | Amazon RDS |
| S3 Deep Storage | Scalable object storage for Druid segments | S3 Deep Storage |
| MSK (Kafka) | Managed streaming for real-time data ingestion | Amazon MSK |
| Grafana Cloud Integration | Full observability stack with metrics, logs, and traces | Grafana Cloud |
| Managed Node Groups | Bottlerocket AMIs for enhanced security | Managed Node Groups |

Architecture

System Overview

flowchart TB
    subgraph "Data Sources"
        KAFKA[MSK Kafka]
        S3IN[S3 Ingestion]
        BATCH[Batch Files]
    end

    subgraph "EKS Cluster"
        subgraph "Druid Cluster"
            COORD[Coordinator]
            OVER[Overlord]
            BROKER[Broker]
            ROUTER[Router]
            HIST[Historical]
            MM[MiddleManager]
        end
    end

    subgraph "AWS Managed Services"
        RDS[(RDS PostgreSQL)]
        S3DEEP[S3 Deep Storage]
        S3MSQ[S3 MSQ Storage]
    end

    subgraph "Query Clients"
        CONSOLE[Web Console]
        JDBC[JDBC Clients]
        API[REST API]
    end

    KAFKA --> MM
    S3IN --> MM
    BATCH --> MM

    MM --> OVER
    OVER --> COORD
    COORD --> RDS
    MM --> S3DEEP
    HIST --> S3DEEP

    CONSOLE --> ROUTER
    JDBC --> BROKER
    API --> BROKER
    ROUTER --> BROKER
    BROKER --> HIST

Data Ingestion Flow

sequenceDiagram
    participant Source as Data Source
    participant Kafka as MSK Kafka
    participant MM as MiddleManager
    participant Overlord
    participant S3 as S3 Deep Storage
    participant Historical
    participant Broker

    Source->>Kafka: Publish Events
    Kafka->>MM: Consume Batch

    MM->>MM: Parse & Index
    MM->>Overlord: Report Progress
    MM->>S3: Push Segment

    S3->>Historical: Load Segment
    Historical->>Historical: Cache in Memory

    Note over Broker,Historical: Query Path
    Broker->>Historical: Query Segment
    Historical-->>Broker: Results

Stack Structure

The Druid infrastructure uses a layered architecture with CloudFormation nested stacks:

flowchart TB
    subgraph "DeploymentStack (main)"
        MAIN[Main Stack]
    end

    subgraph "Nested Stacks"
        VPC[VpcNestedStack]
        EKS[EksNestedStack]
        SETUP[DruidSetupNestedStack]
        DRUID[DruidNestedStack]
    end

    MAIN --> VPC
    MAIN --> EKS
    MAIN --> SETUP
    MAIN --> DRUID

    EKS -.->|depends on| VPC
    SETUP -.->|depends on| VPC
    DRUID -.->|depends on| EKS
    DRUID -.->|depends on| SETUP

Dependency Chain:

  1. VPC is created first (network foundation)
  2. EKS cluster is provisioned (independent of Druid setup)
  3. Druid setup creates supporting resources (RDS, S3, MSK) that depend on VPC
  4. Druid Helm chart is deployed after both EKS and setup are ready

Apache Druid Components

flowchart LR
    subgraph "Master Nodes"
        COORD[Coordinator<br/>Segment Management]
        OVER[Overlord<br/>Task Management]
    end

    subgraph "Query Nodes"
        BROKER[Broker<br/>Query Routing]
        ROUTER[Router<br/>API Gateway]
    end

    subgraph "Data Nodes"
        HIST[Historical<br/>Segment Storage]
        MM[MiddleManager<br/>Ingestion Tasks]
    end

    ROUTER --> BROKER
    ROUTER --> COORD
    ROUTER --> OVER
    BROKER --> HIST
    OVER --> MM
    COORD --> HIST

Apache Druid consists of several specialized node types:

| Node Type | Purpose | Reference |
|---|---|---|
| Coordinator | Manages data availability and segment distribution | Coordinator Process |
| Overlord | Controls data ingestion workload assignment | Overlord Process |
| Broker | Handles queries from external clients | Broker Process |
| Router | Routes requests to Brokers, Coordinators, and Overlords | Router Process |
| Historical | Stores and queries historical data segments | Historical Process |
| MiddleManager | Executes submitted ingestion tasks | MiddleManager Process |

AWS Service Integration

| Service | Druid Component | Purpose | Reference |
|---|---|---|---|
| RDS PostgreSQL | Metadata Storage | Stores segment metadata, rules, and configuration | Metadata Storage |
| S3 | Deep Storage | Long-term segment storage for Historical nodes | Deep Storage |
| S3 | Multi-Stage Query | Intermediate storage for MSQ engine | MSQ |
| MSK (Kafka) | Real-time Ingestion | Streaming data source for Druid supervisors | Kafka Ingestion |
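In Druid's runtime configuration these integrations surface as properties along the following lines. This is an illustrative fragment only: the actual values are injected from this deployment's Mustache templates, and the endpoint and bucket names shown are placeholders.

```properties
# Metadata storage backed by RDS PostgreSQL
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://<rds-endpoint>:5432/druid
druid.metadata.storage.connector.user=druid

# Deep storage backed by S3
druid.storage.type=s3
druid.storage.bucket=<deep-storage-bucket>
druid.storage.baseKey=druid/segments
```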

Observability Stack

The cluster integrates with Grafana Cloud for comprehensive observability:

| Component | Purpose | Reference |
|---|---|---|
| Prometheus | Druid and Kubernetes metrics collection | Grafana Mimir |
| Loki | Log aggregation from all Druid processes | Grafana Loki |
| Tempo | Distributed tracing for query analysis | Grafana Tempo |
| Pyroscope | Continuous profiling for performance optimization | Grafana Pyroscope |
| OpenTelemetry Collector | Telemetry data collection and export | OpenTelemetry |

Platform Integration

When deployed through the Fastish platform, this infrastructure integrates with internal platform services:

| Platform Component | Integration Point | Purpose |
|---|---|---|
| Orchestrator | Release pipeline automation | Automated CDK synthesis and deployment via CodePipeline |
| Portal | Subscriber management | Tenant provisioning, cluster access control |
| Network | Shared VPC infrastructure | Cross-stack connectivity for platform services |
| Reporting | Usage metering | Pipeline execution tracking and cost attribution |

These integrations are managed automatically when deploying via the platform's release workflows.


Prerequisites

| Requirement | Version | Installation |
|---|---|---|
| Java | 21+ | SDKMAN |
| Maven | 3.8+ | Maven Download |
| AWS CLI | 2.x | AWS CLI Install |
| AWS CDK CLI | 2.221.0+ | CDK Getting Started |
| kubectl | 1.28+ | kubectl Install |
| Helm | 3.x | Helm Install |
| Docker | Latest | Docker Install |
| GitHub CLI | Latest | GitHub CLI |
| Grafana Cloud Account | - | Grafana Cloud |

AWS CDK Bootstrap:

cdk bootstrap aws://<account-id>/<region>

Replace <account-id> with your AWS account ID and <region> with your desired AWS region (e.g., us-west-2). This sets up necessary resources for CDK deployments including an S3 bucket for assets and CloudFormation execution roles. See: CDK Bootstrapping | Bootstrap CLI Reference


Deployment

Step 1: Clone Repositories

gh repo clone fast-ish/cdk-common
gh repo clone fast-ish/aws-druid-infra

Step 2: Build Projects

mvn -f cdk-common/pom.xml clean install
mvn -f aws-druid-infra/pom.xml clean install

Step 3: Prepare Apache Druid Artifacts (Optional)

If using custom Druid images or Helm charts, prepare the artifacts in Amazon ECR:

Docker Image

# Authenticate to ECR
aws ecr get-login-password --region <region> | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# Create repository
aws ecr create-repository \
  --repository-name fasti.sh/v1/docker/druid \
  --region <region> \
  --image-scanning-configuration scanOnPush=true

# Build and push image
docker buildx build --provenance=false --platform linux/amd64 -f Dockerfile.druid \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:$(date +'%Y%m%d') \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:v1 \
  -t <account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/docker/druid:latest \
  --push .

See: ECR User Guide

Helm Chart

# Authenticate Helm to ECR
aws ecr get-login-password --region <region> | \
  helm registry login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# Create repository for Helm charts
aws ecr create-repository \
  --repository-name fasti.sh/v1/helm/druid \
  --region <region> \
  --image-scanning-configuration scanOnPush=true \
  --encryption-configuration encryptionType=AES256

# Package and push chart
helm package ./helm/chart/druid
helm push druid-<version>.tgz oci://<account-id>.dkr.ecr.<region>.amazonaws.com/fasti.sh/v1/helm

See: Helm OCI Support

Update Artifact References

Update the Docker image reference in src/main/resources/prototype/v1/druid/values.mustache:

| Parameter | Description | Example |
|---|---|---|
| image.repository | ECR repository for Druid Docker image | 000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/docker/druid |
| image.tag | Tag of the Druid Docker image | v1, latest, or date tag |
| image.pullPolicy | Pull policy for the Docker image | IfNotPresent |
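Rendered into the chart values, these parameters take roughly the following shape. This is an illustrative YAML sketch; the exact keys come from the Druid Helm chart in use.

```yaml
image:
  repository: 000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/docker/druid
  tag: v1
  pullPolicy: IfNotPresent
```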

Update the Helm chart reference in src/main/resources/prototype/v1/conf.mustache:

| Parameter | Description | Example |
|---|---|---|
| chart.repository | ECR repository for Druid Helm chart | oci://000000000000.dkr.ecr.us-west-2.amazonaws.com/fasti.sh/v1/helm |
| chart.name | Name of the Druid Helm chart | druid |
| chart.version | Version of the Druid Helm chart | 34.0.0 |

Step 4: Configure Deployment

Create aws-druid-infra/cdk.context.json from aws-druid-infra/cdk.context.template.json:

Required Configuration Parameters:

| Parameter | Description | Example |
|---|---|---|
| :account | AWS account ID (12-digit number) | 123456789012 |
| :region | AWS region for deployment | us-west-2 |
| :domain | Registered domain name (optional) | example.com |
| :environment | Environment name (do not change) | prototype |
| :version | Resource version identifier | v1 |

Notes:

  • :environment and :version map to resource files at aws-druid-infra/src/main/resources/prototype/v1
  • These values determine which configuration templates are loaded during CDK synthesis
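Putting the parameters together, a minimal cdk.context.json might look like the following. Values are placeholders; see cdk.context.template.json for the authoritative key set.

```json
{
  ":account": "123456789012",
  ":region": "us-west-2",
  ":domain": "example.com",
  ":environment": "prototype",
  ":version": "v1"
}
```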

Step 5: Configure Grafana Cloud

Add Grafana Cloud configuration for observability:

{
  "hosted:eks:grafana:instanceId": "000000",
  "hosted:eks:grafana:key": "glc_xyz",
  "hosted:eks:grafana:lokiHost": "https://logs-prod-000.grafana.net",
  "hosted:eks:grafana:lokiUsername": "000000",
  "hosted:eks:grafana:prometheusHost": "https://prometheus-prod-000-prod-us-west-0.grafana.net",
  "hosted:eks:grafana:prometheusUsername": "0000000",
  "hosted:eks:grafana:tempoHost": "https://tempo-prod-000-prod-us-west-0.grafana.net/tempo",
  "hosted:eks:grafana:tempoUsername": "000000",
  "hosted:eks:grafana:pyroscopeHost": "https://profiles-prod-000.grafana.net:443"
}

Grafana Cloud Setup:

  1. Create Account: Sign up at grafana.com
  2. Create Stack: Navigate to your stack settings
  3. Generate API Key: Create key with required permissions

| Parameter | Location | Description |
|---|---|---|
| instanceId | Stack details page | Unique identifier for your Grafana instance |
| key | API keys section | API key with all permissions (starts with glc_) |
| lokiHost | Logs > Data Sources > Loki | Endpoint URL for logs |
| lokiUsername | Logs > Data Sources > Loki | Account identifier for Loki |
| prometheusHost | Metrics > Data Sources > Prometheus | Endpoint URL for metrics |
| prometheusUsername | Metrics > Data Sources > Prometheus | Account identifier for Prometheus |
| tempoHost | Traces > Data Sources > Tempo | Endpoint URL for traces |
| tempoUsername | Traces > Data Sources > Tempo | Account identifier for Tempo |
| pyroscopeHost | Profiles > Connect a Data Source | Endpoint URL for profiling |

Required API Key Permissions:

| Permission | Access | Purpose |
|---|---|---|
| metrics | Read/Write | Prometheus metrics ingestion |
| logs | Read/Write | Loki log ingestion |
| traces | Read/Write | Tempo trace ingestion |
| profiles | Read/Write | Pyroscope profiling data |
| alerts | Read/Write | Alerting configuration |
| rules | Read/Write | Recording and alerting rules |

See: Grafana Cloud Kubernetes Monitoring

Step 6: Configure Cluster Access

Add IAM role mappings in cdk.context.json for EKS access entries:

{
  "hosted:eks:administrators": [
    {
      "username": "administrator",
      "role": "arn:aws:iam::000000000000:role/AWSReservedSSO_AdministratorAccess_abc",
      "email": "admin@example.com"
    }
  ],
  "hosted:eks:users": [
    {
      "username": "user",
      "role": "arn:aws:iam::000000000000:role/AWSReservedSSO_DeveloperAccess_abc",
      "email": "user@example.com"
    }
  ]
}

| Parameter | Description | Reference |
|---|---|---|
| administrators | IAM roles with full cluster admin access | Cluster Admin |
| users | IAM roles with read-only cluster access | RBAC Authorization |
| username | Identifier for the user in Kubernetes RBAC | User Mapping |
| role | AWS IAM role ARN (typically from AWS IAM Identity Center) | IAM Roles |
| email | For identification and traceability | - |

Step 7: Deploy Infrastructure

cd aws-druid-infra

# Preview changes
cdk synth

# Deploy all stacks
cdk deploy

See: CDK Deploy Command | CDK Synth Command

What Gets Deployed:

| Resource Type | Count | Description | Reference |
|---|---|---|---|
| CloudFormation Stacks | 5 | 1 main + 4 nested stacks | Nested Stacks |
| VPC | 1 | Multi-AZ with public/private subnets | VPC Documentation |
| EKS Cluster | 1 | Kubernetes 1.28+ control plane | EKS Clusters |
| RDS PostgreSQL | 1 | Druid metadata database | RDS PostgreSQL |
| S3 Buckets | 2 | Deep storage + MSQ intermediate | S3 User Guide |
| MSK Cluster | 1 | Kafka for real-time ingestion | Amazon MSK |
| Managed Node Groups | 1+ | Bottlerocket-based worker nodes | Managed Node Groups |
| Druid Deployment | 1 | All Druid node types via Helm | Druid Helm Chart |

Step 8: Access the Cluster

# Update kubeconfig
aws eks update-kubeconfig --name <cluster-name> --region <region>

# Verify cluster connectivity
kubectl get nodes
kubectl get pods -A

# Check Druid pods
kubectl get pods -n druid

See: Connecting to EKS


Configuration Reference

CDK Context Variables

The build process uses Mustache templating to inject context variables into configuration files. See cdk-common for the complete build process documentation.

| Variable | Type | Description |
|---|---|---|
| {{account}} | String | AWS account ID |
| {{region}} | String | AWS region |
| {{environment}} | String | Environment name |
| {{version}} | String | Resource version |
| {{hosted:id}} | String | Unique deployment identifier |
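The substitution itself is plain string replacement. As a rough local sketch of what synthesis does with these variables (the real rendering is handled by cdk-common, and the file name below is hypothetical):

```shell
# Create a toy template with two Mustache-style variables
cat > conf.snippet <<'EOF'
account: "{{account}}"
region: "{{region}}"
EOF

# Substitute values the way the build injects CDK context during synthesis
sed -e 's/{{account}}/123456789012/' -e 's/{{region}}/us-west-2/' conf.snippet
# prints:
#   account: "123456789012"
#   region: "us-west-2"
```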

Template Structure

src/main/resources/
└── prototype/
    └── v1/
        ├── conf.mustache           # Main configuration
        ├── eks/
        │   ├── cluster.mustache    # EKS cluster configuration
        │   ├── addons.mustache     # Managed addons
        │   └── nodegroups.mustache # Node group configuration
        ├── druid/
        │   ├── values.mustache     # Druid Helm chart values
        │   ├── rds.mustache        # RDS configuration
        │   ├── s3.mustache         # S3 bucket configuration
        │   └── msk.mustache        # MSK cluster configuration
        ├── helm/
        │   ├── karpenter.mustache  # Karpenter values
        │   └── monitoring.mustache # Grafana stack values
        └── iam/
            └── roles.mustache      # IAM role definitions

Druid Operations

Accessing the Druid Console

The Druid Router provides a web console for administration:

# Port-forward to Druid Router
kubectl port-forward svc/druid-router 8888:8888 -n druid

# Access console at http://localhost:8888

See: Druid Web Console

Ingesting Data from Kafka

Create a Kafka ingestion supervisor to stream data:

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["dimension1", "dimension2"]
      }
    },
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "<msk-bootstrap-servers>"
      },
      "topic": "my-topic"
    }
  }
}

See: Kafka Ingestion

Querying Data

Druid supports multiple query languages:

| Method | Description | Reference |
|---|---|---|
| Druid SQL | SQL-compatible queries via Broker | Druid SQL |
| Native Queries | JSON-based query format | Native Queries |
| JDBC | Standard JDBC driver connectivity | JDBC Driver |
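As a concrete example, a Druid SQL query is a small JSON payload POSTed to /druid/v2/sql on the Router or Broker. The datasource and dimension names here follow the hypothetical ingestion example above:

```json
{
  "query": "SELECT dimension1, COUNT(*) AS event_count FROM \"my-datasource\" WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR GROUP BY dimension1 ORDER BY event_count DESC"
}
```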

Security Considerations

| Aspect | Implementation | Reference |
|---|---|---|
| Node AMI | Bottlerocket for minimal attack surface | Bottlerocket |
| Pod Identity | IAM roles for service accounts | Pod Identity |
| Network Policies | VPC CNI for pod-level network isolation | Network Policies |
| Secrets Management | CSI Secrets Store with AWS Secrets Manager | Secrets Store |
| RDS Encryption | Encryption at rest with KMS | RDS Encryption |
| S3 Encryption | Server-side encryption (SSE-S3) | S3 Encryption |
| MSK Encryption | TLS in transit, KMS at rest | MSK Encryption |

See: EKS Best Practices Guide - Security


Troubleshooting

Quick Diagnostics

# Check CDK synthesis
cdk synth --quiet 2>&1 | head -20

# Verify CloudFormation stack status
aws cloudformation describe-stacks --stack-name <stack-name> \
  --query 'Stacks[0].StackStatus'

# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.status'

# Verify Druid pods
kubectl get pods -n druid
kubectl describe pod <druid-pod> -n druid

# Check Druid logs
kubectl logs -l app=druid-coordinator -n druid --tail=50
kubectl logs -l app=druid-broker -n druid --tail=50

# Verify RDS connectivity
kubectl run pg-test --rm -it --image=postgres:15 -- \
  psql -h <rds-endpoint> -U druid -d druid -c "SELECT 1"

# Check MSK cluster status
aws kafka describe-cluster --cluster-arn <msk-arn> \
  --query 'ClusterInfo.State'

# Test Kafka connectivity from pod
kubectl exec -it <druid-middlemanager-pod> -n druid -- \
  kafka-broker-api-versions --bootstrap-server <msk-bootstrap>:9092

Common Issues

| Issue | Symptom | Resolution |
|---|---|---|
| RDS connection timeout | Druid coordinator fails to start | Verify security group allows port 5432 from EKS nodes |
| MSK authentication failure | MiddleManager ingestion errors | Check IAM role permissions for MSK access |
| S3 deep storage errors | Segment handoff failures | Verify S3 bucket policy and IAM permissions |
| Druid OOM | Historical/MiddleManager pod restarts | Increase memory limits in values.yaml |
| Grafana no metrics | Empty dashboards | Verify Grafana Cloud credentials in cdk.context.json |
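For the Druid OOM case, the memory knobs live in the chart values, and the pod limit and JVM sizing need to move together. An illustrative fragment (exact keys depend on the chart version in use):

```yaml
historical:
  resources:
    limits:
      memory: 16Gi
  # Heap plus direct memory must fit within the pod memory limit
  javaOpts: "-Xms8g -Xmx8g -XX:MaxDirectMemorySize=6g"
```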

For detailed troubleshooting procedures, see the Troubleshooting Guide.


Related Documentation

Platform Documentation

| Resource | Description |
|---|---|
| Fastish Documentation | Platform documentation home |
| cdk-common | Shared CDK constructs library |
| Troubleshooting Guide | Common issues and solutions |
| Validation Guide | Deployment validation procedures |
| Upgrade Guide | Upgrade and rollback procedures |
| Capacity Planning | Sizing and cost guidance |
| IAM Permissions | Minimum required permissions |
| Network Requirements | CIDR, ports, and security groups |
| Glossary | Platform terminology |
| Changelog | Version history |

AWS Documentation

| Resource | Description |
|---|---|
| EKS User Guide | Official EKS documentation |
| EKS Best Practices | AWS EKS best practices guide |
| Analytics Lens | Analytics architecture guidance |
| Amazon MSK Developer Guide | MSK documentation |
| MSK Best Practices | MSK configuration guidance |
| Amazon RDS User Guide | RDS documentation |
| S3 Best Practices | S3 performance optimization |

Apache Druid Documentation

| Resource | Description |
|---|---|
| Apache Druid Documentation | Official Druid documentation |
| Druid Architecture | Druid design and components |
| Druid Tuning Guide | Performance optimization |

Observability

| Resource | Description |
|---|---|
| Grafana Cloud Docs | Grafana Cloud documentation |
| OpenTelemetry Documentation | Telemetry collection framework |

License

MIT License

