Symptoms
We had issues with one of our bigger sites where visitors were experiencing page request timeouts and getting 50x server errors on requests.
On investigation in cPanel resources, I could see that something was consuming all the server memory on our shared hosting.
The Resource usage snapshots in cPanel showed quite a few lsphp processes but no other details.
Investigation and action
We are on shared hosting, and I’m glad that our plan gives us access to a web-based console, so we can do some investigation here. As it’s a shared host, not all commands are available, but enough are to get us started.
The Environment
We are using the latest WP version, v6.2.2 at the time of writing.
We are on a high tier shared hosting plan.
The server is running Red Hat with the LiteSpeed web server and cache, all on PHP 8.1.
Using ps to check process memory
This cPanel article has more details on using ps to diagnose memory issues.
host [~]# ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mysite1+ 1738461 0.0 0.0 113648 3500 pts/3 S 13:29 0:00 /bin/bash -l
mysite1+ 1797259 0.0 0.0 153688 3804 pts/3 R+ 13:52 0:00 \_ ps faux
mysite1+ 1708794 0.0 0.0 642188 37128 ? S 13:17 0:00 lsphp
mysite1+ 1796691 34.2 0.0 731824 235064 ? Ss 13:52 0:04 \_ lsphp
mysite1+ 1797215 97.0 0.0 690600 193992 ? Ss 13:52 0:00 \_ lsphp
mysite1+ 1198478 0.0 0.0 809272 239596 ? Ss 12:32 0:01 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1194242 0.0 0.0 670080 140948 ? Ss 12:30 0:00 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1190066 0.0 0.0 760440 190124 ? Ss 12:29 0:01 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1189976 0.0 0.0 809120 238672 ? Ss 12:29 0:01 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1189972 0.0 0.0 809160 239504 ? Ss 12:29 0:01 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1188974 0.0 0.0 680564 170444 ? Ss 12:28 0:00 lsphp:/home/hostmysite/public_html/wp-login.php
....
mysite1+ 1076379 0.0 0.0 726676 232240 ? Ss 11:44 0:03 lsphp:stmysite/public_html/wp-admin/admin-ajax.php
mysite1+ 1062448 0.0 0.0 721420 219728 ? Ss 11:39 0:01 lsphp:/home/hostmysite/public_html/wp-login.php
mysite1+ 1061020 0.0 0.0 798660 234316 ? Ss 11:38 0:04 lsphp:/home/hostmysite/public_html/wp-login.php
mysite1+ 1060387 0.0 0.0 796708 231800 ? Ss 11:38 0:03 lsphp:/home/hostmysite/public_html/wp-login.php
mysite1+ 1058101 0.0 0.0 771100 201200 ? Ss 11:37 0:01 lsphp:/home/hostmysite/public_html/index.php
mysite1+ 1054598 0.0 0.0 800616 236060 ? Ss 11:36 0:03 lsphp:/home/hostmysite/public_html/index.php
mysite1+ 1049924 0.0 0.0 792220 226172 ? Ss 11:34 0:04 lsphp:/home/hostmysite/public_html/index.php
We can see that few processes are actually using any %CPU, and most of them are sleeping (see the STAT column: S = sleeping). The %MEM column shows 0.0 for every process, but that figure is the process's resident memory as a percentage of the machine's total physical RAM, so on a large shared server it presumably just rounds down to zero. The RSS column (resident set size: the physical memory the process currently holds, reported in kilobytes) tells the real story: 200 MB or more for most of those processes. That’s where the memory has gone.
Now, from the START time we can also see that many of these processes were started while the server was experiencing problems. The issues eased off at about 12:30, when things started operating more normally. So we have a set of processes started recently (13:xx) and a set of old processes started around 12:30 or before. I’m guessing the old ones are stuck.
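To put a number on how much memory the lsphp processes hold between them, the RSS column can be added up. The following is a minimal sketch, not something from the original investigation: `sum_lsphp_rss` is a hypothetical helper name, and it assumes a procps-style `ps` that accepts repeated `-o` options to print header-less `pid rss args` columns.

```shell
#!/bin/sh
# Hypothetical helper: total the resident memory of lsphp processes.
# Reads "pid rss args" lines on stdin, where rss is in KiB (as ps reports it).
sum_lsphp_rss() {
    awk '$3 ~ /^lsphp/ { kb += $2; n++ }
         END { printf "%d processes, %.1f MiB resident\n", n, kb / 1024 }'
}

# On a live host (assumes procps ps), feed it your own processes:
#   ps -u "$(id -un)" -o pid= -o rss= -o args= | sum_lsphp_rss
```

Matching on the start of the command keeps the shell and the `ps` process itself out of the total.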
So, from the console, I killed the processes with a start time of 12:30 or earlier.
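Picking the old processes out by eye works for a handful, but the selection can also be scripted. This is a rough sketch under assumptions: `old_lsphp_pids` is a hypothetical helper, and it relies on the `etimes` output field (elapsed seconds since the process started), which procps-ng provides but other `ps` implementations may not. It only prints PIDs so that the list can be reviewed before anything is killed.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of lsphp processes older than $1 seconds.
# Reads "pid etimes args" lines on stdin (etimes = elapsed seconds).
old_lsphp_pids() {
    awk -v min_age="$1" '$3 ~ /^lsphp/ && $2 + 0 > min_age + 0 { print $1 }'
}

# On a live host (assumes procps-ng ps), list lsphp processes older than one hour:
#   ps -u "$(id -un)" -o pid= -o etimes= -o args= | old_lsphp_pids 3600
```

Note that an age filter would also catch a long-lived parent lsphp process if it is older than the cutoff, so check the output against the full `ps` listing first.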
Killing processes on your server can seriously mess things up if you don't know what you are doing. So make sure you understand what you are killing, and the consequences of that, before you do it.
host [~]# kill 1049924
By default, the kill command sends a TERM signal to the process, allowing it to shut down cleanly. See the kill command for more aggressive signals to send.
You can see in the resource graph above that I killed the processes at about 14:00, freeing up memory and taking usage from 2.7 GB down to 600 MB, which is more normal.
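If a process ignores TERM, the usual next step is to wait a little and then send KILL, which cannot be caught. The sketch below shows that escalation; `stop_pid` and its default five-second grace period are illustrative choices, not what was run during this incident, where a plain `kill` was enough.

```shell
#!/bin/sh
# Hypothetical helper: ask a process to exit, then force it if it does not.
# Usage: stop_pid PID [GRACE_SECONDS]
stop_pid() {
    pid=$1
    grace=${2:-5}
    # Polite request first; if this fails the process is already gone or not ours.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still running after the grace period: KILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null || true
}
```

Giving TERM a chance first lets PHP release locks and close connections where it can; KILL skips all of that, so it is best kept as a last resort.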
What caused this?
Spoiler alert – I could not pinpoint anything!
Potential culprits
WP scheduled jobs: I checked the timings of the scheduled jobs using the WP Crontrol plugin. No jobs really lined up with the timings of the issues. I did find a couple of jobs from an old plugin that was deleted a long time ago, so I did get to remove/disable them.
Errors in the WP logs: Nothing out of the ordinary here.
Hosting server syslog logs: sadly, I don’t seem to have access to these, but I will be contacting the hosting service to see if I can get access.
Apache Access Logs: For historic log viewing you can use the Awstats app in cPanel. Unfortunately, this only has logs from the previous day back. I downloaded today’s logs and used the fairly crippled, but usable, free version of Http Logs Viewer to check if there was any odd activity leading up to the incident. From here I could see clearly when the server started experiencing issues and returning 50x errors in response to requests. There was some unusual activity at about 06:00, but still nothing that lined up with the incident timings.
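The same check for when the 50x errors started can be done from the console instead of a log viewer. This is a sketch using awk rather than the tool mentioned above, and it assumes the downloaded access log is in the common "combined" format, where the fourth whitespace-separated field is the timestamp and the ninth is the status code; `count_5xx_per_hour` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count 5xx responses per hour in a combined-format access log.
# Assumes lines like:
#   1.2.3.4 - - [20/Jun/2023:11:36:01 +0100] "GET / HTTP/1.1" 503 0 "-" "agent"
count_5xx_per_hour() {
    awk '$9 ~ /^5[0-9][0-9]$/ { hits[substr($4, 2, 14)]++ }   # key: dd/Mon/yyyy:hh
         END { for (h in hits) print h, hits[h] }' | sort
}

# Usage, with the path to the downloaded log as a placeholder:
#   count_5xx_per_hour < /path/to/access.log
```

Changing the `substr` length from 14 to 17 groups by minute instead of by hour, which helps narrow down the exact moment the errors began.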
I think I will be installing the Cerber Security plugin on this site. I have been using it on some smaller sites recently and it appears to add a decent layer of protection for day to day probing from bots and reduce a lot of contact form spam.
A note on time zones: while investigating and cross checking logs I noticed that different environments are logging using different time zones. cPanel logs resource usage in UTC and the WP logs use the local time zone, UTC+1 for me at the moment. So be aware when trying to correlate events across logs and environments that the timestamps might be +/- an hour or more.
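One way to avoid off-by-an-hour mistakes is to convert every timestamp to UTC before comparing. A minimal sketch, assuming GNU date (its `-d` option is not portable to every system); `to_utc` is a hypothetical helper and the date used is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a timestamp that carries its own UTC offset to UTC.
# Requires GNU date for the -d option.
to_utc() {
    date -u -d "$1" '+%Y-%m-%d %H:%M UTC'
}

# A WP log entry stamped 12:30 at UTC+1 (illustrative date):
to_utc '2023-06-20 12:30 +0100'   # prints 2023-06-20 11:30 UTC
```

With everything in UTC, the WP log entries can be laid directly alongside the cPanel resource usage figures.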
Conclusion
Well, nothing conclusive I’m afraid. I now know how to diagnose memory issues more effectively and solve them in this scenario.
As to the cause of the issue, I’m still in the dark. I will keep an eye on things to see if it happens again.