My Omega Node (running on a Hetzner VPS, CPX11: 2 vCPU cores, 2 GB RAM) frequently becomes unresponsive for short intervals. It can be just 30 seconds, or it can be 15-30 minutes. In the Hetzner dashboard I can see that these downtimes coincide with 200% CPU usage (200% presumably meaning both cores at full load). It's best visible in live mode.
These downtimes are too short to get noticed by the Python script (IIRC it takes 1 hour for the node's status page to show "connecting" or "offline"), but a port monitor on port 30303 (like UptimeRobot) reveals them faster.
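For reference, such a monitor essentially just checks whether the port accepts connections; a minimal manual equivalent looks roughly like this (sketch only, <node-ip> is a placeholder for the node's address):

nc -z -w 5 <node-ip> 30303 && echo "port 30303 open" || echo "port 30303 unreachable"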
One interesting detail is that the Apollo Node does not have these downtimes, although it’s running on the same VPS tier (Hetzner CPX11).
What causes these periods of 200% cpu usage? And why just the Omega node but not the Apollo node?
What I also wonder is whether Ambrosus has performed any serious stress tests on the network. Upgrading the VPS tier wouldn't be a problem at all, but would node operators be ready to do so if necessary? Would they react in a reasonable time, or could the network be affected by it?
Or is it just normal behaviour of an Omega node and I shouldn't think about it anymore?
Any ideas or clues?
#omega
Hetzner Dashboard during unresponsive period of the Omega node:
I initially thought there were no unexpected downtimes while hosting the node on my own hardware (a VM with 6 cores and 6 GB RAM), but I'm starting to remember unexpected downtimes that I blamed on the internet provider. I'm not sure though, and human memory is a tricky thing sometimes.
So, unfortunately I can't make a clear statement about whether it has anything to do with the number of available CPU cores or not.
OK, that's interesting. After a chat in the Telegram group I found some more clues. A user on Telegram showed a screenshot of his Omega node's Docker container list. All of his containers have been running since their creation date:
Looking at my own container list with docker ps, I see that the containers "atlas_worker" and "parity" were restarted 7 hours ago. The last unresponsive period was 13 hours ago, though.
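If anyone wants to check their own node the same way, the per-container uptime can be listed with the standard Docker CLI (no assumptions beyond a stock Docker install):

docker ps --format "table {{.Names}}\t{{.Status}}"

The Status column shows how long each container has been up, so a recent restart stands out immediately.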
Looking at syslog with grep parity /var/log/syslog (or syslog.1), parity seems to get killed repeatedly due to "Out of memory". It also shows that the last unresponsive period ended just seconds after parity was killed.
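The OOM kills can be double-checked with something like this (a rough sketch; the exact log wording can differ between kernel versions):

grep -i "out of memory" /var/log/syslog /var/log/syslog.1
dmesg -T | grep -i -E "out of memory|killed process"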
I also realized that there are constant brute-force login attempts against the root user. I put "turn off password login" on my to-do list, but I think it's unrelated. Anyway, if somebody's interested, you can check it with:
cat /var/log/auth.log
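Or, to get a quick count of the failed attempts instead of scrolling through the whole file:

grep -c "Failed password" /var/log/auth.log
grep -c "Invalid user" /var/log/auth.log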
Now I'm puzzled as to why parity runs out of memory on my machine but not on another machine.
But it doesn't look that clear to me, since the Apollo node also runs parity 2.7.2 and has not had this issue, not once so far.
Then again, the Apollo node only runs 2 containers while the Atlas nodes run 5.
What also caught my attention is that CPU usage seems quite low and then all of a sudden spikes to 200% until the parity container crashes. That's odd…
I think it has to do with the Atlas nodes having to resolve/calculate a challenge, while the Apollos don't.
OK, that could be an explanation. Resolving/calculating a challenge might also fit the 200% CPU usage right before parity crashes.
But why would it behave like that on my Omega node, while on @xfik31's Omega node parity has never crashed since he set it up 2 weeks ago?
@xfik31
I will secure the SSH login as soon as I find the time for it. I don't think it's the cause of this problem, but who knows if it makes any difference, and I planned to do it anyway for obvious reasons.
Indeed, if you are seeing thousands of entries from the sshd daemon in your auth.log, scripts/bots are trying to brute-force your account.
I have also limited the number of failed logins, disabled the banner, etc.; basically "implement best practices for an internet-exposed SSH server".
You can also set up UFW, which will limit your "internet fingerprint". The specific ports that are needed are explained on amb.wiki, but you have to take care when setting it up, since you can lock yourself out.
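A minimal UFW sketch along those lines (assuming a stock Ubuntu/Debian setup; port 30303 is the P2P port monitored above, but check amb.wiki for the complete list of ports your node actually needs, and allow SSH before enabling so you don't lock yourself out):

sudo ufw allow OpenSSH      # or your custom SSH port, e.g. sudo ufw allow 2222/tcp
sudo ufw allow 30303/tcp    # node P2P port
sudo ufw allow 30303/udp
sudo ufw enable
sudo ufw status verbose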
30GB of RAM
OK, that might be a possible reason why your Omega node doesn't kill the parity container due to "Out of memory".
Maybe I could set up some monitoring on random Atlas nodes for a week or two to get a better picture of whether short downtimes are a common issue or something specific to my node.
Oh, and I will check the syslog of the setup on my own hardware and post the results here, together with the exact specs of the VM (I think it was 6 GB RAM).
I'm starting to question my conclusion. To verify my guess I set up 30 monitors on 30 random Atlas nodes (10 for each tier). So far only one Zeta node has had a downtime, of 3 minutes. All others are either 100% up or 100% down (8 of 30 seem to be down, or at least the port monitor on port 30303 shows them as down).
There might be more to it, but it's probably better to wait for 1 or 2 weeks of monitoring data before jumping to another conclusion.
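For anyone who wants to run a similar check without a hosted monitoring service, a crude do-it-yourself sketch (the IPs are placeholders and the interval is arbitrary):

# probe port 30303 on a list of nodes every 5 minutes and log failures
while true; do
  for host in 1.2.3.4 5.6.7.8; do
    nc -z -w 5 "$host" 30303 || echo "$(date -Is) $host port 30303 unreachable" >> downtime.log
  done
  sleep 300
done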
I'll secure the SSH login now. Let's see if it makes a difference.
A short update:
I changed the SSH port, disabled root login and installed fail2ban, but there was no change at all.
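For completeness, this roughly boils down to the following (the port number is just an example; nothing here is node-specific):

# in /etc/ssh/sshd_config: set e.g. "Port 2222", "PermitRootLogin no", "PasswordAuthentication no"
sudo systemctl restart ssh       # or sshd, depending on the distro
sudo apt install fail2ban        # the default config usually already watches sshd on Debian/Ubuntu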
The monitors I set up show that other nodes seem to have the same problem, although none has as many incidents as my own Omega node.
I also set up a public dashboard for the monitors.
To test whether the system specs are too low, I upgraded the VPS to a Hetzner CPX21 (3 cores, 4 GB RAM) today (01.03.2022).
Additionally, I will set up a new Contabo VPS (Cloud VPS S, 4 vCPU cores, 8 GB RAM, 50 GB NVMe) for another Omega node, as another way to compare the influence of system specs and VPS provider.
I intend to post an update in about 2 weeks. Until then, anyone interested can take a look at the public dashboard of the random node monitors for a live view of the random downtimes.
Two weeks after the spec upgrade to 3 cores/4 GB RAM, not a single downtime has been registered. Before the upgrade I had an average of 3 downtimes a day.
My conclusion is therefore that the minimum requirements mentioned on amb.wiki are insufficient for running an Atlas node.
As pointed out by @Jeromehere, this might be a side effect of parity version 2.7.2, as he described in this GitHub issue.
So, the solution was to increase the available RAM of the VPS.
About 30% of the random Atlas nodes on the previously mentioned monitoring page have frequent downtimes.
Since the minimum system requirements on amb.wiki are too low for Atlas nodes, we should update amb.wiki so that new nodes don't run into this issue. Even better would be for the developers to switch to a better version of parity so that nodes can run on 2 cores/2 GB RAM again, but since this has been an open issue since late 2020, I guess it's not considered a priority.
Any thoughts on this? Increase the minimum system requirements on amb.wiki? Wait for the developers to switch to a better parity version?
I'm always right. Well, almost. There is finally some traction on this topic, at least on the testnet.
Hopefully this means the bug won't make it to its 2-year anniversary!