Added 301 case for health probes
The IT website suffered intermittent downtime over the past few days (02/09 to 09/09), which was initially attributed to an upstream bug (bug link).
Although the instance was indeed affected by that bug, the real culprit for the downtime was the livenessProbe, which failed on a 301 response. The 301 was caused by changes made by Catharine Noble (confirmed by her) on Friday (02/09/2022) that made the default endpoint permanently redirect to /welcome.
This MR changes the probe to accept 301 as a valid code returned by the base path (/).
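As a rough illustration of the rule described above, the set of accepted codes could be expressed with a `case` statement instead of a chain of `-ne` comparisons. This is only a sketch: `is_acceptable_code` is a hypothetical helper name, not something taken from the probe script, and the list of codes mirrors the one in the diff as first submitted.

```bash
#!/usr/bin/env bash
# Sketch only: is_acceptable_code is a hypothetical helper, not part of the real probe script.
# Returns 0 when the HTTP status code is one the probe should tolerate, 1 otherwise.
is_acceptable_code() {
    case "$1" in
        # 200: normally working base URL
        # 301: homepage permanently redirected away from the default path
        # 302: temporary redirection
        # 403: fully private website
        # 503: high load
        200|301|302|403|503) return 0 ;;
        *) return 1 ;;
    esac
}
```

A probe could then call `is_acceptable_code "${HTTP_CODE_BASE}" || exit 1`, so adding or removing a code touches a single line.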
This MR will also solve the problem reported in the OTG permanently.
NB: The IT website has the livenessProbe disabled until this MR is accepted and the changes are propagated to the cluster. This means the annotation that blocks updates from the Operator must be removed after propagating the changes to production.
Activity
      # Expected responses:
      # 200: normally working base URL
    + # 301: Moved Permanently, this happens when the website's homepage is not on the default path, `/`
      # 302: redirection (NOTE: not sure if there's a legitimate case to expect this)
      # 403: fully private websites give this response
      # 503: high load
    - if [[ "${HTTP_CODE_BASE}" -ne "200" && "${HTTP_CODE_BASE}" -ne "302" && "${HTTP_CODE_BASE}" -ne "403" && "${HTTP_CODE_BASE}" -ne "503" ]]; then
    + if [[ "${HTTP_CODE_BASE}" -ne "200" && "${HTTP_CODE_BASE}" -ne "301" && "${HTTP_CODE_BASE}" -ne "302" && "${HTTP_CODE_BASE}" -ne "403" && "${HTTP_CODE_BASE}" -ne "503" ]]; then

Is 503 a valid code?
Side note: All these codes could probably be limited to 200 if all websites had a configured route, /ping, to perform health checks on.
(Edited by Carina Antunes)
503 shouldn't be a valid code.
For the side note, I fully agree; moreover, I see a strong candidate for this, here.
I've updated the MR to reflect the suggestion.
That's correct, the goal here is to test that php-fpm and nginx are running.
The second test requires that the website works properly, and hence that PHP is working in a non-error state (otherwise there is no redirection to the Auth endpoint).
It might actually make sense to just use /user/login as enough logic to say the website is working.
added 1 commit
- 0b2024fd - Updated probe endpoint to <SITE>/_site/_php-fpm-status
added 1 commit
- 0799525e - Added check on number of active PHP processes
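
A check of the kind this commit describes might look like the sketch below. It assumes the count of active processes has already been read from the status endpoint (the diff does this with `curl` and `jq`). `has_active_php_processes` is a hypothetical name, and treating an empty or non-numeric value as a failure is an assumption of this sketch, not something stated in the MR.

```bash
#!/usr/bin/env bash
# Sketch only: has_active_php_processes is a hypothetical helper.
# Succeeds when the given value is a positive integer, i.e. at least one
# PHP-FPM process is reported as active; fails on 0, empty, or non-numeric input
# (for example "null" when jq cannot find the field).
has_active_php_processes() {
    local count="$1"
    [[ "$count" =~ ^[0-9]+$ ]] || return 1
    # Force base 10 so a value with leading zeros is not parsed as octal.
    (( 10#$count > 0 ))
}
```

In the probe this could be used as `has_active_php_processes "${ACTIVE_PHP_PROCESSES}" || exit 1`, so that a failed or empty response from the endpoint also fails the probe.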
    - # 503: high load
    - if [[ "${HTTP_CODE_BASE}" -ne "200" && "${HTTP_CODE_BASE}" -ne "302" && "${HTTP_CODE_BASE}" -ne "403" && "${HTTP_CODE_BASE}" -ne "503" ]]; then
    + # 200: php_fpm is reporting it's status, therefore should be working as expected
    + if [[ "${HTTP_CODE_BASE}" -ne "200" ]]; then
        echo "Probe failed" >> $FILE
    -   echo "Probe failed. Endpoint / responds with code: $HTTP_CODE_BASE"
    +   echo "Probe failed. Endpoint / responds with code: $HTTP_CODE_BASE" >> $FILE
    +   echo "PHP-FPM Output" $(curl localhost:8080/_site/_php-fpm-status --silent --insecure) >> $FILE
    +   exit 1
    + fi
    +
    + # We can retrieve the number of active PHP processes from the endpoint,
    + # This is a variable described here: https://www.php.net/manual/en/fpm.status.php
    + # If the value is '0', that means there will be no processes processing requests
    + # In such cases the probe will fail and force a restart of the container
    + ACTIVE_PHP_PROCESSES=$(curl --max-time 200 --silent --fail --insecure localhost:8080/_site/_php-fpm-status?json | jq -r '."active processes"')

Handled the same issue with a different approach, !162 (merged).
Although monitoring /_site/_php-fpm-status could be considered more accurate for deciding when to restart the container (as a failure would mean php-fpm is not running), restarting on 50x codes is reasonable too.
Closing this MR.
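
For comparison, the "restart on 50x" rule mentioned here comes down to a single pattern match on the status code. The sketch below is not the code from !162, which is not shown in this thread, and `is_server_error` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: is_server_error is a hypothetical helper, not taken from !162.
# Succeeds when the argument is a three-digit HTTP status code in the 5xx range.
is_server_error() {
    [[ "$1" =~ ^5[0-9][0-9]$ ]]
}
```

A probe built on this would fail only when `is_server_error "${HTTP_CODE_BASE}"` succeeds, so redirects (301, 302) and private sites (403) pass without being listed individually. Note that curl reports `000` when no HTTP response was received at all, which this sketch does not treat as a server error; a real probe would probably want to handle that case separately.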