According to the CNCF 2019 survey, more than 84% of respondents use containers in production, and more than 78% use Kubernetes in production. We also know that 84% of all Kubernetes workloads in the public cloud run on AWS. Since the commercial launch of Amazon Elastic Kubernetes Service (EKS) in June 2018, there has been significant adoption of EKS for running Kubernetes workloads on AWS. And now, with AWS Fargate on Amazon EKS, launched at re:Invent 2019, adoption of EKS for Kubernetes workloads is likely to grow even further.

However, best practices for successfully running secure, reliable, efficient, and cost-effective Kubernetes workloads in production on EKS, and solutions for implementing those best practices, are not readily available in one place. The relevant content is spread across a large number of blog posts, white papers, and vendor publications, and most of it is not specific to EKS.

This blog is an attempt to provide that information in a concise, easy-to-use format. To ensure completeness of coverage, the information is organized into five categories.


Operations

The Operations category focuses on running and monitoring systems.


Checklist Items

Best Practices

Solution Options (AWS)

Solution Options (Partners/Open Source)


Kubernetes Version Updates and Patches

·       Implement a documented and operational update and upgrade program

·       Pre-upgrade checks:

o   Kubernetes release notes

o   EKS platform version notes

o   Control plane and API server compatibility

o   Other plug-in versions

·       Test, update, and upgrade:

o   Control plane (EKS)

o   Worker nodes

o   VPC CNI plugin

·       EKS Managed Nodegroups

·       EKS on Fargate

·       eksctl

·       Terraform
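As one illustration of managing versions through code with eksctl, a minimal ClusterConfig might look like the following sketch (cluster name, region, and sizes are placeholders):

```yaml
# cluster.yaml -- minimal eksctl ClusterConfig (illustrative; names are placeholders)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: prod-cluster        # hypothetical cluster name
  region: us-west-2
  version: "1.14"           # pin the Kubernetes version; bump deliberately during upgrades

managedNodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 3
```

During an upgrade, bump `version` and upgrade the control plane first, then the worker nodegroups, then the VPC CNI plugin.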



CI/CD

·       Build immutable images

·       Use ConfigMaps, instead of storing configuration information in images

·       AWS CDK

·       AWS CodePipeline

·       Argo Flux

·       Jenkins X

·       Spinnaker

·       CircleCI

·       Gitlab CI/CD
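To illustrate the ConfigMap practice above, here is a minimal sketch (image name, tag, and keys are hypothetical):

```yaml
# Keep configuration in a ConfigMap rather than baking it into the image
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"          # illustrative configuration key
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-registry/app:1.0.0   # immutable, versioned tag -- never "latest"
      envFrom:
        - configMapRef:
            name: app-config         # configuration injected at deploy time
```

The same image can then be promoted unchanged through dev, staging, and production, with only the ConfigMap varying per environment.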



Monitoring, Logging, and Alerting

·       Enable a high-level view into your running clusters

·       Configure timely incident alerts for when something goes wrong

·       Define and deploy processes and tools to act on incident alerts

·       Auditing and Logging: Amazon CloudWatch Logs for the EKS control plane, AWS CloudTrail for the EKS API

·       Tracing: AWS X-Ray

·       Monitoring & Alerting: CloudWatch metrics, events & alarms

·       Observability: CloudWatch ServiceLens

·       Analytics: CloudWatch Container Insights, CloudWatch ServiceLens 

·       Automation: AWS Lambda, Amazon EventBridge, Auto Scaling Groups (ASG)

·       Logging: Datadog, EFK (Elasticsearch, Fluentd, Kibana)

·       Monitoring and Alerting: Prometheus, Alertmanager, PagerDuty
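As a sketch of the Prometheus/Alertmanager option, a rule like the following (assuming Prometheus with kube-state-metrics is installed, and Alertmanager routes `severity: page` to PagerDuty) could alert on crash-looping pods:

```yaml
# Illustrative Prometheus alerting rule; thresholds and labels are placeholders
groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 10m
        labels:
          severity: page            # Alertmanager can route this to PagerDuty
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"
```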


Service Discovery

·       Run CoreDNS on each worker node

·       To discover services running outside the cluster:

o   Use Kubernetes “service” object without pod selector

o   Use Kubernetes “ExternalName” service type

·       AWS Cloud Map

·       AWS App Mesh

·       Amazon Route 53
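The two out-of-cluster discovery patterns above can be sketched as follows (the IP address and DNS name are placeholders):

```yaml
# 1) A Service with no pod selector, backed by a manually managed Endpoints object
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  ports:
    - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db          # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.12.34       # placeholder IP of the external backend
    ports:
      - port: 5432
---
# 2) An ExternalName Service that resolves to an external DNS name
apiVersion: v1
kind: Service
metadata:
  name: external-api
spec:
  type: ExternalName
  externalName: api.example.com   # placeholder hostname
```

In both cases, pods address the external dependency by the in-cluster Service name, so the backing location can change without redeploying workloads.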



Kubernetes Namespace

·       Use namespaces for easier resource management.

·       Define and enforce a namespace naming convention




Service Mesh

Service Mesh benefits:

·       Standardizes how your services communicate

·       Provides end-to-end visibility

·       Provides end-to-end security

·       Enables high availability

·       Allows monitoring and dynamically controlling communication between services

·       Makes it easier to deploy new versions of your services

AWS App Mesh



Single or Multiple Clusters

·       A separate cluster per environment (dev, staging, prod, etc.)

·       Start with a single production cluster

·       Explore multiple clusters to support specific requirements:

o   security or compliance requirements to isolate certain workloads

o   extremely high variability in scaling and network load requirements between workloads

o   customer geographic distribution requiring clusters in different regions


Security

The Security category focuses on protecting information and systems via risk assessments and mitigation strategies.


Checklist Items

Best Practices

Solution Options (AWS)

Solution Options (Partners/Open Source)


Secrets Management

Mount Secrets as volumes, not environment variables

·       AWS Secrets Manager

·       AWS SSM Parameter Store

·       Envelope encryption of secrets with KMS

·       Hashicorp Vault

·       Bitnami sealed secrets
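A minimal sketch of mounting a Secret as a volume rather than exposing it through environment variables (the Secret and image names are hypothetical):

```yaml
# Mount a Secret as a read-only volume; the files appear under /etc/secrets
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-registry/app:1.0.0     # placeholder image
      volumeMounts:
        - name: db-creds
          mountPath: /etc/secrets
          readOnly: true
  volumes:
    - name: db-creds
      secret:
        secretName: db-credentials     # hypothetical Secret created separately
```

Volume-mounted secrets are not visible in `kubectl describe pod` output or inherited by child processes the way environment variables are.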


Container Runtime Security

·       Prevent containers from running as root (by default, all processes in a container run as the root user, uid 0)

·       Disallow privileged containers

·       Disallow adding new capabilities. Ensure that application pods cannot add new capabilities at runtime.

·       Disallow changes to kernel parameters

·       Disallow use of bind mounts (hostPath volumes)

·       Disallow access to the Docker socket bind mount

·       Disallow use of the host network and ports (which allows potential snooping of network traffic across application pods)

·       Use a read-only root filesystem in containers

EKS on Fargate (for VM isolation at pod level)

Pod Security Policy (PSP)
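Several of the practices above can be enforced directly in a pod's securityContext; a hedged sketch (the image name is a placeholder):

```yaml
# A pod applying the hardening practices listed above (illustrative values)
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  hostNetwork: false                   # do not share the node's network namespace
  containers:
    - name: app
      image: my-registry/app:1.0.0     # placeholder image
      securityContext:
        runAsNonRoot: true             # refuse to start as uid 0
        runAsUser: 1000
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true   # writes go only to explicit volumes
        capabilities:
          drop: ["ALL"]                # no added kernel capabilities
```

A Pod Security Policy can enforce these settings cluster-wide rather than relying on each pod spec.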


Pod communications control

Enable Kubernetes “network policies” to prevent unauthorized access, improve security, and segregate namespaces.

VPC CNI + Calico

Tigera Calico Enterprise
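A common network policy pattern is default-deny plus explicit allows; a sketch assuming a hypothetical `team-a` namespace with `frontend`/`backend` labels:

```yaml
# Default-deny all ingress traffic within the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a          # hypothetical namespace
spec:
  podSelector: {}            # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow only the required traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only frontend pods may reach backend pods
```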


Kubernetes RBAC

·       Disable auto-mounting of the default ServiceAccount

·       Set RBAC policies to the least amount of privilege necessary

·       Keep RBAC policies granular, not shared

·       Avoid using wildcards in “roles” and “clusterroles”

·       Configure IAM users/groups mapping to Kubernetes RBAC roles

·       Configure IAM roles for service accounts, if a pod needs access to AWS resources.
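A sketch of a granular, wildcard-free Role and its binding (the namespace and group names are hypothetical; the group would be mapped from IAM via the aws-auth ConfigMap):

```yaml
# Least-privilege, read-only access to pods in one namespace -- no wildcards
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-devs                 # hypothetical group mapped from IAM
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```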



Cluster security benchmark

Ensure the cluster passes the CIS Kubernetes Benchmark tests


·       Twistlock

·       Aqua Security



Container Image Security

·       Secure credentials for CI to push, and for the cluster to pull, images

·       Automate the scanning of vulnerabilities in your container images, implemented at the CI stage of your pipeline.

·       Allow deploying containers only from known registries

Amazon ECR:

·       Use ECR Image Scanning

·       Use ECR PrivateLink endpoint policies for fine-grained, IAM-based access control

·       Use Open Policy Agent (OPA)

·       Partner Solution for OPA:


Reliability

The Reliability category focuses on recovery from infrastructure or service disruptions, and on dynamically adjusting resources to meet demand.


Checklist Items

Best Practices

Solution Options (AWS)

Solution Options (Partners/Open Source)


Disaster Recovery

·       Practice “infrastructure as code” with fully automated CI/CD pipelines for easier cluster installs and upgrades.

·       Use GitOps practices and implementation to recreate a cluster from Git

·       Amazon EKS (multi-master and multi-AZ)

·       Amazon EBS and Amazon EFS as “persistent volumes” for stateful applications.

·       Amazon S3, Amazon DynamoDB, and Amazon RDS for external data storage

·       Amazon ElastiCache for Redis for session data storage, and for in-memory cache

Weave Cloud


High Availability

·       Create worker nodes in Multi-AZ

·       Deploy pods on multiple nodes: set anti-affinity rules

·       Deploy pods in Multi-AZ

·       Enable NLB (Network Load Balancer) in Multi-AZ, and/or enable cross-zone load balancing.

·       Configure Auto Scaling Groups (ASG) per AZ
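The anti-affinity practice above can be sketched as follows (labels and image are placeholders; the zone topology key shown is the one commonly used on Kubernetes versions of this era):

```yaml
# Spread replicas across AZs with required pod anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              # schedule at most one replica per zone
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
        - name: web
          image: my-registry/web:1.0.0   # placeholder image
```

Using `topologyKey: kubernetes.io/hostname` instead spreads replicas across nodes rather than AZs.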




Autoscaling

·       Use the Horizontal Pod Autoscaler (HPA) for apps with variable usage patterns

·       Use the Cluster Autoscaler (CA) for varying workloads

·       For stateful applications using EBS-backed storage, configure multiple node groups, each scoped to a single AZ. In addition, enable the Cluster Autoscaler's --balance-similar-node-groups feature

·       AWS Fargate for Amazon EKS (for fully managed autoscaling)

·       ASG to enable EKS autoscaling by CA
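A minimal HPA manifest as a sketch (the target Deployment and thresholds are illustrative; the target's containers must declare CPU requests for utilization-based scaling to work):

```yaml
# Scale a Deployment between 2 and 10 replicas on average CPU utilization
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```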



Pod IP address inventory and ENI Management

·       Size subnets appropriately to have sufficient addresses for pods

·       Select worker node instance sizes to support the expected number of pods; pod density per node is limited by the number of ENIs that can be attached to the instance

·       The number of pods running in a cluster may also be limited by the number of VPC secondary CIDR addresses available

Assign Secondary CIDR ranges (non-RFC 1918 addresses) to VPC, if needed



Graceful pod shutdown

Implement a lifecycle hook in the pod spec so a pod doesn't shut down immediately on SIGTERM, but gracefully terminates connections
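One common implementation is a preStop hook plus a longer termination grace period; a sketch with illustrative values:

```yaml
# Give the pod time to drain connections before it receives SIGKILL
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 60    # default is 30s; tune per application
  containers:
    - name: app
      image: my-registry/app:1.0.0     # placeholder image
      lifecycle:
        preStop:
          exec:
            # brief pause so the endpoint is removed from Services
            # before the container starts shutting down
            command: ["sh", "-c", "sleep 10"]
```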




Health checks

Set appropriate readiness probe and liveness probe values for containers
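A sketch of readiness and liveness probes for an HTTP service (paths, port, and timings are placeholders to tune per application):

```yaml
# Readiness gates traffic; liveness restarts a wedged container
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-registry/app:1.0.0     # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz/ready         # hypothetical readiness endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz/live          # hypothetical liveness endpoint
          port: 8080
        initialDelaySeconds: 15        # give the app time to start before restarting it
        periodSeconds: 20
```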





Performance

The Performance category focuses on using computing resources efficiently to meet system requirements.


Checklist Items

Best Practices

Solution Options (AWS)

Solution Options (Partners/Open Source)


Fine-tuning cluster performance

·       Use optimized base images

·       Tune scaling target for HPA

·       Cluster Autoscaler tuning: Adjust the min/max size of a node group directly in ASG

Scaling Kubernetes deployments with Amazon CloudWatch metrics



External Access and Traffic Routing

Use ingress controller

·       ALB ingress controller

·       Integrate ALB ingress controller with AWS App Mesh for standardized east-west and north-south service communication

·       NLB with NGINX ingress controller

·       NGINX

·       Traefik

·       HAProxy Ingress

·       Istio Ingress

·       Gloo
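For the ALB ingress controller option, a minimal Ingress might look like this sketch (annotation values are illustrative; `target-type: ip` routes directly to pod IPs and is required with Fargate):

```yaml
# Internet-facing ALB created and managed by the ALB ingress controller
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: web          # hypothetical backing Service
              servicePort: 80
```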


Resource requests and limits

·       Set memory limits and requests for all containers

·       Set CPU limits after determining correct settings for your container.

·       Use a LimitRange object to define the standard size for a container deployed in the current namespace

·       Use vertical pod autoscaler (VPA) in recommendation-mode to get the right resource (CPU/memory) requests and limits
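A sketch of a LimitRange that gives containers in a namespace default requests and limits (namespace and values are illustrative):

```yaml
# Defaults applied to any container that omits its own requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a            # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # used when a container specifies no requests
        cpu: 100m
        memory: 128Mi
      default:                 # used when a container specifies no limits
        cpu: 500m
        memory: 256Mi
```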




Windows pods and containers on Windows worker nodes

Since Windows worker nodes support only one ENI per node, which limits the number of pods that can run on each node, select the EC2 instance type based on your workload needs

·       Use Auto Scaling Group (ASG) for Windows worker nodes for scalability

·       Run EKS Windows containers with group Managed Service Accounts (gMSA) for authentication and authorization.
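Windows pods are typically steered onto Windows worker nodes with a nodeSelector; a sketch (the image shown is a sample public Windows image):

```yaml
# Ensure this pod lands only on Windows nodes
apiVersion: v1
kind: Pod
metadata:
  name: win-app
spec:
  nodeSelector:
    kubernetes.io/os: windows    # older clusters may use beta.kubernetes.io/os
  containers:
    - name: win-app
      image: mcr.microsoft.com/windows/servercore/iis
```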


Cost Efficiency

The Cost Efficiency category focuses on running systems to deliver business value at the lowest price point. A blog that nicely covers this category for EKS was recently published, so I am only summarizing its findings here.


Checklist Items

Best Practices

Solution Options (AWS)

Solution Options (Partners/Open Source)


Lowering costs

·       Identify actual CPU utilization by pods to set CPU request values

·       Use “Vertical Pod Autoscaler” in recommendation mode

·       Shutdown or scale down cluster at off-peak times

·       Spot Instances for EKS worker nodes

·       Compute Savings Plans

·       kube-resource-report

·       kube-downscaler

·       Vertical pod autoscaler


In this blog post, I have compiled best practices for EKS and paired them with implementation solutions provided by AWS, AWS Partners, and open source projects.