EKS Troubleshooting
- Pods are terminated, leaving stuck runs
- Multi-Attach error for volume
- DataMasque fails to open with a 'Disallowed host' error
1. Pods are terminated during a masking run
Problem
Pods are terminated during a masking run, leaving some runs stuck.
Solution
Generally, if pods are terminated, they will restart automatically after being rescheduled by EKS.
In some circumstances, if a run is in progress and the masque-agent pod restarts, the run may become stuck in a Running or Cancelling state.
To fix stuck runs:
If the run still appears to be Running, cancel it using the DataMasque web UI. It should move to a Cancelling state, and then to Cancelled within a couple of minutes.
If the run stays in the Cancelling state for more than five minutes, restart the masque-0 pod by deleting it using kubectl:
kubectl delete pod masque-0
The pod will automatically be rescheduled by EKS and will clear any Cancelling runs when it starts.
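To confirm the pod has been rescheduled, you can watch its status until it returns to Running (this sketch assumes the DataMasque pods run in your current default namespace):
kubectl get pod masque-0 --watch   # assumes default namespace; add -n <namespace> otherwise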
2. Multi-Attach error for volume
Problem
EC2 nodes are terminated. When new nodes are created, pods do not start.
Solution
If a pod fails to start (it is stuck in ContainerCreating status), use kubectl to describe the pod in question. For example, to describe the admin-db-0 pod:
$ kubectl describe pods admin-db-0
You should see a reason for the pod not starting.
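The reason usually appears in the Events section at the end of the describe output. You can also list recent events for the pod directly (the pod name here matches the example above; adjust it for the pod you are inspecting):
kubectl get events --field-selector involvedObject.name=admin-db-0   # example pod name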
If the error is similar to this:
Multi-Attach error for volume "pvc-<uuid>"
Volume is already exclusively attached to one node and can't be attached to another node.
Then the EBS volume is still attached to the terminated node. This error usually resolves itself within ten minutes, and the pod will start automatically.
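If you want to check whether the stale attachment has been released, you can list the cluster's VolumeAttachment objects and look for the volume named in the error (the exact output depends on the EBS CSI driver in use):
kubectl get volumeattachments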
3. DataMasque fails to open with a 'Disallowed host' error
Problem
EC2 nodes are terminated. When new nodes are created they have a new IP address, which causes a Disallowed host error when accessing the DataMasque web UI.
Solution
These commands are to be run on a machine with kubectl installed and configured to use the EKS cluster that needs updating.
Information on configuring kubectl to use a specific EKS cluster can be found in the AWS creating/updating kubeconfig documentation.
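For example, if the AWS CLI is installed, the kubeconfig can be updated as follows (the region and cluster name below are placeholders; substitute your own):
aws eks update-kubeconfig --region us-east-1 --name my-datamasque-cluster   # placeholder region and cluster name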
Run the following command to reset the allowed hosts:
kubectl exec -it masque-0 -- bash -c 'python3 reset_allowed_hosts.py'
Visit the target EKS IP address to log in to DataMasque, then navigate to the Settings page and change the allowed hosts to the current IP address.
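If you are unsure of the cluster's new IP address, you can list the nodes or the service exposing the web UI (depending on how DataMasque is exposed in your environment) and note the external address:
kubectl get nodes -o wide   # shows node internal/external IPs
kubectl get svc -o wide     # shows service external IPs, if a LoadBalancer is used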