Omega Node: frequent downtime / unresponsiveness

Hello Forum

My Omega node (running on a Hetzner CPX11 VPS: 2 vCPU, 2 GB RAM) frequently becomes unresponsive for short intervals. It can be just 30 seconds, or it can be 15-30 minutes. In the Hetzner dashboard I can see that these downtimes coincide with 200% CPU usage, which presumably means both cores at full load (it’s best visible in live mode).
These downtimes are too short to be noticed by the Python script (IIRC it takes 1 hour before the node’s status page reports “connecting” or “offline”), but a port monitor on port 30303 (like UptimeRobot) reveals them faster.
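For anyone who wants to check manually instead of through UptimeRobot, a simple TCP check against the p2p port works too; a rough sketch (replace <node-ip> with the node’s address):

# one-off reachability test of port 30303 (5 second timeout)
nc -vz -w 5 <node-ip> 30303

# or poll every 60 seconds and log failures with a timestamp
while true; do nc -z -w 5 <node-ip> 30303 || echo "$(date -Is) port 30303 unreachable"; sleep 60; done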

One interesting detail is that the Apollo Node does not have these downtimes, although it’s running on the same VPS tier (Hetzner CPX11).

What causes these periods of 200% CPU usage? And why only the Omega node and not the Apollo node?

What I also wonder is whether Ambrosus has performed any serious stress tests on the network. Upgrading the VPS tier wouldn’t be a problem at all, but will node operators be ready to do so if necessary? Will they react in a reasonable time, or could the network be affected by it?
Or is this just normal behaviour for an Omega node and I shouldn’t worry about it?

Any ideas or clues?

#omega

Hetzner dashboard during an unresponsive period of the Omega node:


End of an unresponsive period of the Omega node:

Export of UptimeRobot’s port 30303 monitoring log for the Omega node.

I initially thought there were no unexpected downtimes while I was hosting the node on my own hardware (a VM with 6 cores and 6 GB RAM), but I’m starting to remember unexpected downtimes that I blamed on the internet provider at the time. I’m not sure though, and human memory is a tricky thing sometimes :slight_smile:
So unfortunately I can’t say for sure whether it has anything to do with the number of available CPU cores or not.

OK, that’s interesting. After a chat in the Telegram group I found some more clues. A user on Telegram showed a screenshot of his Omega node’s Docker container list. All of his containers have been running since their creation date:


Looking at my own container list with docker ps, I see that the containers “atlas_worker” and “parity” got restarted 7 hours ago. The last unresponsive period was 13 hours ago, though.
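For reference, this is roughly how I’m checking it (the RestartCount line is just an extra way to see whether Docker itself restarted a container):

# container names with their uptime / restart status
docker ps --format 'table {{.Names}}\t{{.Status}}'

# per-container restart counter (0 = running since creation)
docker inspect --format '{{.Name}} restarts: {{.RestartCount}}' $(docker ps -q)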

Looking at syslog with grep parity /var/log/syslog (or syslog.1), parity seems to get killed repeatedly due to “Out of memory”. It also shows that the last unresponsive period ended just seconds after parity got killed.
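In case anyone wants to reproduce this, the filtering I used boils down to roughly this (paths as on Ubuntu; dmesg shows the same events from the kernel side):

# OOM kill messages mentioning parity, with timestamps
sudo grep -i "out of memory" /var/log/syslog /var/log/syslog.1 | grep -i parity

# kernel view of the same events
sudo dmesg -T | grep -i "killed process"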

I also noticed that there are constant brute-force login attempts against the root user. I’ve put “turn off password login” on my todo list, but I think it’s unrelated. Anyway, if anybody’s interested, you can check it with:

cat /var/log/auth.log
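To get an idea of the scale instead of scrolling through the whole file, a rough filter like this should work (assuming the standard Debian/Ubuntu auth.log line format):

# number of failed password attempts
sudo grep -c "Failed password" /var/log/auth.log

# most active source IPs (field position assumes the usual "Failed password ... from <ip> port <n> ssh2" lines)
sudo grep "Failed password" /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head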

Now I’m puzzled as to why parity runs out of memory on my machine but not on another machine.

1 Like

My best guess is that this is a side effect of the known issues with parity 2.7.2.

2 Likes

Hello, this is johnsmith31 from the TG group, the one who showed the “docker ps” screenshot.

For the SSH login, I have secured it as follows (a rough sketch of the settings follows after the list):

  • change the SSH port
  • deny root login
  • harden the sshd daemon
  • install fail2ban
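Roughly, the relevant settings look like this; just a sketch, not a copy-paste recipe, and the port number is only an example:

# /etc/ssh/sshd_config (excerpt)
Port 2222                    # example non-default port
PermitRootLogin no
PasswordAuthentication no    # key-based login only
MaxAuthTries 3

# apply the config and add fail2ban with its default sshd jail
sudo systemctl restart ssh   # the service is called sshd on some distros
sudo apt install fail2ban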
2 Likes

Thanks for the link. That’s interesting.

But it doesn’t seem that clear-cut to me, since the Apollo node runs parity 2.7.2 too and has not had this issue, not even once so far.
Then again, the Apollo node runs just 2 containers while the Atlas nodes run 5.

What also caught my attention is that CPU usage seems to be quite low and then all of a sudden it spikes to 200% until the parity container crashes. That’s odd…
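One way to catch the spike as it happens would be to watch the parity container live (container name as shown by docker ps); a rough sketch:

# live CPU / memory usage of the parity container
docker stats parity

# or log a snapshot every 30 seconds to correlate with the Hetzner graph later
while true; do echo "$(date -Is) $(docker stats --no-stream --format '{{.CPUPerc}} {{.MemUsage}}' parity)"; sleep 30; done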

1 Like

I think it has to do with the Atlas nodes having to resolve/calculate challenges, while the Apollos don’t.

1 Like

I think it has to do with the Atlas nodes having to resolve/calculate challenges, while the Apollos don’t.

OK, that could be an explanation. Resolving/calculating a challenge might also fit the 200% CPU usage until parity crashes.

But why would it behave like that on my Omega node, while on the Omega node of @xfik31 parity has never crashed since he set it up 2 weeks ago?

@xfik31
I will secure the SSH login as soon as I find the time. I don’t think that this is the cause of the problem, but who knows whether it makes a difference. And I planned to do it anyway, for obvious reasons.

1 Like

Wait, what are the specs of your VPS?

I got the VPS from Contabo.

Indeed, if you are seeing thousands of entries in your auth.log from the sshd daemon, scripts/bots are trying to brute-force your account.

I have also limited the number of failed logins, disabled the banner, etc.; basically “implement best practices for an SSH-exposed server”.

You can also set up UFW, which will limit your “internet footprint”. The specific ports that are needed are explained on the amb.wiki, but you have to take care when setting it up, since you can lock yourself out :slight_smile:
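As a rough sketch (the SSH port and node port below are examples; check amb.wiki for the full list for your node type before enabling):

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 2222/tcp      # your SSH port (example)
sudo ufw allow 30303         # parity p2p, TCP and UDP
# ...plus whatever other ports amb.wiki lists for your node type
sudo ufw enable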

regards

1 Like

But which one?

  • VPS S (4 vCPU cores, 8 GB RAM, 50 GB NVMe or 200 GB SSD, 32 TB traffic, unlimited incoming)
  • VPS 300 (2 vCPU cores, 4 GB RAM, 300 GB HDD, 100 Mbit/s port)

I guess your VPS is much better than the ones I use. But how much better? What’s the difference?

My VPS specs:

  • Hetzner CPX11 (2 vCPU, 2 GB RAM, 40 GB SSD, 20 TB Traffic)

I have the L model :slight_smile:

because I’m running another node on it that needs lots of SSD (a pop.network masternode).

regards

1 Like

30GB of RAM :rofl: :rofl: :rofl:
OK, that might well be the reason why your Omega node doesn’t kill the parity container due to “Out of memory” :stuck_out_tongue_winking_eye:

Maybe I could set up some monitoring on random Atlas nodes for a week or two, to get a better picture of whether short downtimes are a common issue or something specific to my node.
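If anyone wants to replicate that without UptimeRobot, a crude monitor could look like this (nodes.txt is a hypothetical file with one node IP per line):

#!/bin/bash
# poll port 30303 on each listed node every 5 minutes and log the unreachable ones
while true; do
  while read -r ip; do
    nc -z -w 5 "$ip" 30303 || echo "$(date -Is) $ip DOWN" >> downtime.log
  done < nodes.txt
  sleep 300
done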

Oh, and I will check the syslog of the setup on my own hardware and post the results here, together with the exact specs of the VM (I think it was 6 GB RAM).

1 Like

I just checked the syslogs of the VM where I ran the Omega node before migrating to the VPS.
Specs of the VM:
  • 6 cores
  • 5 GB RAM

I used these commands without getting any matches:

sudo zgrep parity /var/log/syslog*
sudo zgrep "Out of memory" /var/log/syslog*

There is not a single “Out of memory” message. (The logs cover the last 6 days of running the node.)
Going back to the running node on the VPS, I used

sudo zgrep "Out of memory" /var/log/syslog* | sort -h -k 2 | nl

to find out that parity got restarted 21 times within the last week, which makes an average of about 3 times per day, or once every 8 hours.
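A per-day breakdown can be pulled out the same way (assuming the classic “Mar 1 12:34:56” syslog timestamp format):

# count "Out of memory" events per day
sudo zgrep -h "Out of memory" /var/log/syslog* | awk '{print $1, $2}' | sort | uniq -c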

To me it looks like 2 cores with 2 GB RAM are not enough to run an Omega node (I can’t tell for the other Atlas tiers).

I’m starting to question my conclusion. To verify my guess I set up 30 monitors on 30 random Atlas nodes (10 for each tier). So far only one Zeta node has had a downtime, of 3 minutes. All the others are either 100% up or 100% down. (8 of 30 seem to be down, or at least the port monitor on port 30303 shows them as down.)

There might be more to it. But I’d better wait for the 1 or 2 weeks of monitoring data before jumping to another conclusion :slight_smile:

I’ll secure the SSH login now. Let’s see if this makes a difference.

A short update:
I changed the SSH port, disabled root login and installed fail2ban. But there was no change at all.
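For completeness, the fail2ban part is just the stock sshd jail with slightly stricter values; roughly like this (my own choices, not defaults):

# /etc/fail2ban/jail.local (excerpt)
[sshd]
enabled  = true
port     = 2222      # the changed SSH port (example)
maxretry = 3
bantime  = 3600      # seconds

# reload fail2ban afterwards
sudo systemctl restart fail2ban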

The monitors I set up show that other nodes seem to have the same problem, although none have as many incidents as my own Omega node.
I also set up a public dashboard for the monitors.

To test whether the system specs are too low, I upgraded the VPS to a Hetzner CPX21 (3 vCPU, 4 GB RAM) today (01.03.2022).
Additionally, I will set up a new Contabo VPS (Cloud VPS S: 4 vCPU cores, 8 GB RAM, 50 GB NVMe) for another Omega node, to have one more way to compare the influence of system specs and VPS provider.

I intend to post an update in about 2 weeks. Until then, anyone interested can take a look at the public dashboard of the random node monitors to get a live view of random downtimes.

1 Like

Guess you were right then :slight_smile:

Two weeks after the upgrade to 3 cores / 4 GB RAM, not a single downtime has been registered. Before the upgrade I had an average of 3 downtimes a day.
My conclusion is therefore that the minimum requirements mentioned on amb.wiki are insufficient for running an Atlas node.

As pointed out by @Jerome here, this might be a side effect of parity version 2.7.2, as he described in this GitHub issue.

So the solution was to increase the available RAM of the VPS.
About 30% of the random Atlas nodes on the previously mentioned monitoring page have frequent downtimes.
Since the minimum system requirements on amb.wiki are too low for Atlas nodes, we should update amb.wiki so that new nodes don’t run into this issue. Even better would be if the developers switched to a better parity version so that nodes could run on 2 cores / 2 GB RAM again, but since this has been an open issue since late 2020, I guess it’s not considered a priority.

Any thoughts on this? Increase the minimum system requirements on amb.wiki? Wait for the developers to switch to a better parity version?

1 Like

I’m always right :slight_smile: Well, almost. There is some traction on this topic at last, in testnet.
Hopefully this will spare us the 2-year anniversary of this bug!

2 Likes