Troubleshooting GitLab Cloud Native chart deployments

Using kubeSOS to troubleshoot GitLab Cloud Native chart deployments

What is kubeSOS

kubeSOS is a tool that uses kubectl and helm to retrieve GitLab cluster configuration and logs from GitLab Cloud Native chart deployments. This information is archived into a compressed tar file that can be shared with the support team to help troubleshoot GitLab deployments.

Requirements

  • kubectl client v1.16+
  • helm 3.3.1+

Usage

You can either download the script:

wget https://gitlab.com/gitlab-com/support/toolbox/kubesos/-/raw/master/kubeSOS.sh

chmod +x kubeSOS.sh
./kubeSOS.sh [flags]

Or use curl:

curl https://gitlab.com/gitlab-com/support/toolbox/kubesos/raw/master/kubeSOS.sh | bash -s -- [flags]

Flags           Description                                                                                     Required  Default
-n              namespace                                                                                       No        “default”
-r              helm chart release                                                                              No        “gitlab”
-l app          application label to match for logs (can be used multiple times)                                No
-L              select apps for logs interactively                                                              No        n/a
-s time         Only return logs newer than a relative duration like 5s, 2m, or 3h                              No        0=all logs
-t time_stamp   Only return logs after a specific date (RFC3339)                                                No        all logs
-m maxlines     Override the default maximum lines output per log (-1 = no limit)                               No        10000
-p              Prepend log entries with pod and container names                                                No        n/a
-w log_timeout  Log generation wait time (seconds). Increase this if log collection does not complete in time   No        60

Data will be archived to kubesos-<timestamp>.tar.gz

Extracting the archive

Use the tar utility to extract the data into a folder:

tar -zxvf kubesos-<timestamp>.tar.gz
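
When several archives are being compared, it can help to extract each one into its own directory. The following is a minimal sketch; the function name extract_kubesos and the choice of destination directory are assumptions for illustration, and whether the archive already contains a top-level folder is not stated here.

```shell
#!/usr/bin/env bash
# Extract a kubeSOS archive into its own directory so that several
# archives can be inspected side by side.
extract_kubesos() {
  local archive="$1"
  local dest="${2:-${archive%.tar.gz}}"   # default: archive name minus .tar.gz
  mkdir -p "$dest" && tar -zxf "$archive" -C "$dest"
}

# Usage:
#   extract_kubesos kubesos-<timestamp>.tar.gz
```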

Troubleshoot a GitLab installation

There are two main areas to check when troubleshooting a cloud native application like GitLab:

  • Cluster setup: We will assume that the cluster is set up correctly according to our recommendations and that enough resources have been allocated to the nodes. We will look at a few commands that help confirm this.

  • Application Failures: This will be our primary area of focus and we will be trying to identify why GitLab is not working or not behaving correctly.

Cluster setup

We recommend a cluster with 8 vCPU and 30 GB of RAM, so first check that the nodes have enough resources. Use Unix commands such as top and free to confirm this.

Check that the nodes are registered correctly: verify that all of the nodes you expect to see are present and that they are all in the Ready state.

kubectl get nodes
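
On larger clusters, the node list can be filtered rather than read by eye. The sketch below assumes the default kubectl get nodes column layout (NAME first, STATUS second); the function name not_ready_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes` output on stdin; the header row is skipped.
# Combined states such as "Ready,SchedulingDisabled" are also reported.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes | not_ready_nodes
```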

To get detailed information about the overall health of your cluster, use the following command:

kubectl cluster-info dump

To delve deeper into troubleshooting the cluster, see Troubleshoot Clusters, which describes the logs to look into.

GitLab Requirements

In order to deploy GitLab on Kubernetes, ensure the setup meets the documented requirements.

Checking kubeSOS output

kubectl-check

To check the version of kubectl installed:

% more kubectl-check

Helm version

% more helm-version

Debugging Pods

Check the current state of the pods in the get_pods file. All pods should be Running or Completed.

% more get_pods
NAME                                                   READY   STATUS      RESTARTS   AGE
gitlab-gitaly-0                                        1/1     Running     0          12m
gitlab-gitlab-exporter-586ccff5fb-6g67t                1/1     Running     0          12m
gitlab-gitlab-grafana-app-6bf7db585b-75fgt             2/2     Running     0          12m
gitlab-gitlab-shell-7547c6457-54xm8                    1/1     Running     0          12m
gitlab-gitlab-shell-7547c6457-lmfb2                    1/1     Running     0          12m
gitlab-migrations-76-gq7pq                             0/1     Completed   0          12m
gitlab-nginx-ingress-controller-786f5f5ddd-ktwh5       1/1     Running     0          12m
gitlab-nginx-ingress-controller-786f5f5ddd-nqk2w       1/1     Running     0          12m
gitlab-nginx-ingress-default-backend-7ff88b95f-rv6jt   1/1     Running     0          12m
gitlab-prometheus-server-5b47c879b4-g9xvb              2/2     Running     0          12m
gitlab-redis-master-0                                  2/2     Running     0          12m
gitlab-registry-7f874b748d-bxtnh                       1/1     Running     0          12m
gitlab-registry-7f874b748d-rs246                       1/1     Running     0          12m
gitlab-sidekiq-all-in-1-v1-95cbfd5f7-546sq             1/1     Running     0          12m
gitlab-task-runner-6f7dcc8897-psscr                    1/1     Running     0          12m
gitlab-webservice-default-659fdddb9b-cmrbd             2/2     Running     0          12m
gitlab-webservice-default-659fdddb9b-hhctp             2/2     Running     0          12m

Any pod in Pending status indicates a possible problem, which you can confirm by checking the recent events in the describe_pods file. A pod stuck in Pending cannot be scheduled onto a node, possibly because the cluster lacks resources such as CPU or memory. For more information, see Debugging Pods.
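
The check described above can be sketched as a small filter over the get_pods file. This assumes the file has one header row and the same columns as the listing shown earlier (STATUS in the third column); the function name unhealthy_pods is hypothetical.

```shell
#!/usr/bin/env bash
# List pods whose STATUS is neither Running nor Completed.
# Argument: path to the get_pods file (defaults to ./get_pods).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }' "${1:-get_pods}"
}

# Usage, from inside the extracted archive:
#   unhealthy_pods get_pods
```

Any pod name printed here is a good starting point for a search in describe_pods.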

Services

For services, the main thing to confirm is that the load balancer has been assigned an external IP and is not in a pending state.

% more get_services | grep gitlab-nginx-ingress-controller | grep LoadBalancer

gitlab-nginx-ingress-controller LoadBalancer   172.20.34.155   36.0.0.25   80:32220/TCP,443:30038/TCP,22:30963/TCP   45d

Or, in AWS:

% more get_services | grep gitlab-nginx-ingress-controller | grep LoadBalancer

gitlab-nginx-ingress-controller LoadBalancer   172.20.164.155   a48.eu-west-5.elb.amazonaws.com   80:32220/TCP,443:30038/TCP,22:30963/TCP   45d
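
A load balancer that has not yet been assigned an address shows <pending> in the EXTERNAL-IP column of kubectl output. The sketch below looks for that case in get_services, assuming the file uses the default kubectl get services column layout; the function name pending_load_balancers is hypothetical.

```shell
#!/usr/bin/env bash
# Print LoadBalancer services whose EXTERNAL-IP column is still "<pending>".
# Argument: path to the get_services file (defaults to ./get_services).
pending_load_balancers() {
  awk '$2 == "LoadBalancer" && $4 == "<pending>" { print $1 }' "${1:-get_services}"
}

# Usage, from inside the extracted archive:
#   pending_load_balancers get_services
```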

Further checks involve confirming that all the services have been assigned an endpoint:

% more endpoints
NAME                                      ENDPOINTS                                           AGE
gitlab-cert-manager                       10.16.4.130:9402                                    23h
gitlab-gitaly                             10.16.4.164:8075                                    23h
gitlab-gitlab-exporter                    10.16.4.155:9168                                    23h
gitlab-gitlab-pages                       10.16.4.156:8090                                    23h
gitlab-gitlab-shell                       10.16.2.36:2222,10.16.4.162:2222                    23h
gitlab-grafana-app                        10.16.4.131:3000                                    23h
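
A service with no backing pods shows <none> in the ENDPOINTS column. As a minimal sketch, assuming the endpoints file has the header row and columns shown above, such services can be listed as follows; the function name services_without_endpoints is hypothetical.

```shell
#!/usr/bin/env bash
# Print services whose ENDPOINTS column is "<none>".
# Argument: path to the endpoints file (defaults to ./endpoints).
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }' "${1:-endpoints}"
}

# Usage, from inside the extracted archive:
#   services_without_endpoints endpoints
```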

Ingress

Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource. Confirm that the hosts and address are configured correctly and that the ingress has been assigned an IP.

% kubectl get ingress
% more describe_ingress

Deployments

To quickly confirm which applications are set up, check the following file:

% more get_deployments
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
gitlab-cainjector                      1/1     1            1           9d
gitlab-cert-manager                    1/1     1            1           9d
gitlab-gitlab-exporter                 1/1     1            1           9d
gitlab-gitlab-pages                    1/1     1            1           9d
gitlab-gitlab-runner                   1/1     1            1           9d
gitlab-gitlab-shell                    2/2     2            2           9d
gitlab-grafana-app                     1/1     1            1           9d
gitlab-kas                             2/2     2            2           9d
gitlab-minio                           1/1     1            1           9d
gitlab-nginx-ingress-controller        2/2     2            2           9d
gitlab-nginx-ingress-default-backend   1/1     1            1           9d
gitlab-prometheus-server               1/1     1            1           9d
gitlab-registry                        2/2     2            2           9d
gitlab-sidekiq-all-in-1-v1             1/1     1            1           9d
gitlab-task-runner                     1/1     1            1           9d
gitlab-webservice-default              2/2     2            2           9d

If any of the deployments are not ready, use the describe_deployments file to check the reason for the failure. It is also worth checking for errors in the describe_pods file.

% more describe_deployments
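
To spot deployments that are not ready without reading the whole listing, the READY column of get_deployments can be split into its ready and desired counts. The sketch below assumes one header row and READY as the second column, as in the listing above; the function name unready_deployments is hypothetical.

```shell
#!/usr/bin/env bash
# Print deployments whose READY column (ready/desired) does not match,
# for example "1/2".
# Argument: path to the get_deployments file (defaults to ./get_deployments).
unready_deployments() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }' "${1:-get_deployments}"
}

# Usage, from inside the extracted archive:
#   unready_deployments get_deployments
```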

Persistent Volumes and Claims

GitLab uses persistent volumes to store data, so if any of the pods are in Pending status, check that the volumes exist and that their status is Bound. Confirm the amount of space allocated to each and, if required, allocate more.

% more get_pvc
NAME                            STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-gitlab-postgresql-0        Bound    pvc-44c3643e-  8Gi        RWO            standard       9d
gitlab-minio                    Bound    pvc-8a739402-  10Gi       RWO            standard       9d
gitlab-prometheus-server        Bound    pvc-29fc7b9d-  8Gi        RWO            standard       9d
redis-data-gitlab-redis-master  Bound    pvc-b9b67a9d-  8Gi        RWO            standard       9d
repo-data-gitlab-gitaly-0       Bound    pvc-af7ca188-  50Gi       RWO            standard       9d
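
The Bound check can be sketched as a filter over get_pvc. This assumes one header row with NAME and STATUS as the first two columns, as in the listing above; the function name unbound_pvcs is hypothetical.

```shell
#!/usr/bin/env bash
# Print persistent volume claims whose STATUS is not "Bound".
# Argument: path to the get_pvc file (defaults to ./get_pvc).
unbound_pvcs() {
  awk 'NR > 1 && $2 != "Bound" { print $1, $2 }' "${1:-get_pvc}"
}

# Usage, from inside the extracted archive:
#   unbound_pvcs get_pvc
```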

User supplied values

Where there is more than one helm revision (helm history <release>), we capture the user_supplied_values.yaml and all_values.yaml for each revision. This is useful for comparing changes that were applied between revisions. For example:

% diff user_supplied_values_rev_7.yaml user_supplied_values_rev_8.yaml
3,4d2
< certmanager-issuer:
<   email: gladmin@example.com
33a32,33
>   ingress:
>     configureCertmanager: false

The above indicates a change was made to CertManager configuration between revisions 7 and 8.

If the YAML files are not present, it is likely that kubeSOS was not run against the correct namespace or release. Running helm list -A shows all releases deployed with helm. Be sure to run kubeSOS.sh with the appropriate -n <namespace> and -r <release> options.
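
When an archive contains many revisions, comparing each one against its predecessor can be scripted. The following is a sketch that assumes the user_supplied_values_rev_<N>.yaml naming shown in the example above and a sort that supports -V (version sort, available in GNU coreutils); the function name diff_revisions is hypothetical.

```shell
#!/usr/bin/env bash
# Diff each user_supplied_values revision against the previous one.
# Argument: directory holding the extracted files (defaults to .).
diff_revisions() {
  local dir="${1:-.}"
  # sort -V orders rev_10 after rev_9 (plain sort would not).
  printf '%s\n' "$dir"/user_supplied_values_rev_*.yaml | sort -V |
  {
    prev=""
    while IFS= read -r cur; do
      if [ -n "$prev" ]; then
        echo "== $(basename "$prev") -> $(basename "$cur")"
        # diff exits with status 1 when files differ; that is expected here.
        diff "$prev" "$cur" || true
      fi
      prev="$cur"
    done
  }
}

# Usage, from inside the extracted archive:
#   diff_revisions .
```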

Application logs

Finally, kubeSOS.sh collects the application logs, which can be used to debug specific application issues.

Logging is more limited in a Kubernetes environment. Note the following:

  • By default, a container’s current log is limited to a size of 10 MB, at which point it is rotated.
  • Although Kubernetes rotates logs, rotated logs cannot be retrieved remotely with kubectl logs. Direct access to the node is required (see Additional logs).
  • Kubernetes retains the log of a failed container, but only for the previous instance of the container.

kubeSOS.sh only obtains logs from pods/containers that are currently running (or completed, for init containers). If a log is not present and was not intentionally filtered out, it is likely that the pod was not active when kubeSOS.sh was run. Check the get_pods file to see which pods were active. Note also that empty log files are not added to the archive.

Logs are captured for each container. Many pods run more than one container. For example, webservice could return five logs:

-rw-r--r--@  1 chriss  staff  6022116 27 Jul 15:06 webservice_webservice.log
-rw-r--r--@  1 chriss  staff  4226942 27 Jul 15:06 webservice_gitlab-workhorse.log
-rw-r--r--@  1 chriss  staff     1354 27 Jul 15:06 webservice_dependencies.log
-rw-r--r--@  1 chriss  staff     5710 27 Jul 15:06 webservice_configure.log
-rw-r--r--@  1 chriss  staff      321 27 Jul 15:06 webservice_certificates.log

Log files are named <application name>_<container name>.log. The application name is determined from the app metadata label assigned to pods. If a container fails, its log is retained. kubeSOS.sh retrieves this log with the kubectl logs --previous option and saves it as *_previous.log.
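
Because logs from failed containers are identified by the *_previous.log suffix, listing those files is a quick way to see which containers have restarted. The sketch below assumes only that suffix; the function name previous_logs is hypothetical.

```shell
#!/usr/bin/env bash
# List logs retrieved from the previous (failed) instance of a container.
# Argument: directory holding the extracted archive (defaults to .).
previous_logs() {
  find "${1:-.}" -type f -name '*_previous.log' | sort
}

# Usage, from inside the extracted archive:
#   previous_logs .
```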

Additional logs

As mentioned, kubectl logs is limited in the logs it can retrieve. Additional logs exist on the worker nodes hosting the containers. These logs are usually found in /var/log/containers on the host node.

Last modified June 27, 2024: Fix various vale errors (46417d02)