Kubernetes & IP changes

I’m currently putting together a server network using Waterfall & Paper running in a Kubernetes cluster. I have a set of servers running in a StatefulSet, where each server can be referenced via its DNS name, like this:

  • mc-server-0.mc-server
  • mc-server-1.mc-server

One thing I’m noticing is that, because of the way Kubernetes works, the IP addresses of the individual pods in the StatefulSet change when a pod is destroyed and restarted. Within the cluster, the DNS record is quickly updated to point to the new IP, so everything should reconnect just fine.

What I’m finding is that Waterfall is unable to connect to the server as soon as the server is restarted. I suspect this is because the IP address of the server has changed and Waterfall is still trying to connect to the old IP. Restarting Waterfall fixes the issue, and it connects normally.

How does Waterfall cache & resolve DNS hostnames? I need it to check with every new connection opened, but it’s not working as I would expect.

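I found the solution to this problem. Java can cache DNS lookups forever (which I think is a bit stupid). That is the documented default when a security manager is installed; otherwise the TTL is implementation-specific (30 seconds in OpenJDK).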

Following these steps allowed me to set DNS TTL to something more sensible, like 10 seconds. https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html

Essentially, just open this file: $JAVA_HOME/jre/lib/security/java.security (on Java 9 and later it is $JAVA_HOME/conf/security/java.security)
And put in this line: networkaddress.cache.ttl=10

Now cached lookups expire after 10 seconds, so the JVM resolves the hostname again instead of reusing the old IP indefinitely.
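For anyone who controls the startup code of a Java process, the same security property can also be set programmatically instead of editing java.security. The sketch below is only an illustration: the class and method names are made up, and it assumes the call happens before the JVM performs its first hostname lookup, since the cache policy is read once. It will not help with an unmodified Waterfall jar, where the file edit remains the practical route.

```java
import java.security.Security;

public class DnsTtlConfig {

    /**
     * Sets the JVM-wide DNS cache TTLs, in seconds.
     * Hypothetical helper: call it at the very start of main(),
     * before any InetAddress lookup has happened.
     */
    public static void configureDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        // TTL for successful lookups (same key as in java.security).
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // TTL for failed lookups, so a briefly missing pod record is retried soon.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        configureDnsCache(10, 0);
        System.out.println("networkaddress.cache.ttl="
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The negative TTL is included because a restarting pod can be missing from DNS for a moment, and a cached failure would delay reconnection just like a cached stale IP.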

You’ll likely want to disable the async DNS resolver in waterfall.yml.

Thanks, is that use_netty_dns_resolver? What exactly does it do?

Alternatively, don’t point Waterfall directly at the StatefulSet pods. Create a Kubernetes Service that selects the pods you want, and give Waterfall the Service DNS name. The Service’s cluster IP will not change for the lifetime of the Service, even as the pods underneath change.

Thanks - I’ve been trying to avoid this as it adds unnecessary extra services, but for the life of me I can’t seem to get DNS resolution working reliably. The fix I detailed seems to work 50% of the time, and then other times it just can’t get a lock.

I’ve tried with both the default Java resolver with TTL set to 0 and with the netty resolver, and the results seem to be the same. While for now I might have to accept defeat and create individual services for each pod in the set, it would be great if anyone has more insight on DNS caching, TTLs and resolution within Waterfall.

I’ve run several tests and verified that the DNS lookup still works but the IP address is different, and every time Waterfall is still attempting to route traffic to the old IP address.
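One way this symptom can arise in Java, independent of any DNS TTL: an InetSocketAddress built from a hostname is resolved once at construction and keeps that IP for as long as the object is reused. I have not checked whether Waterfall stores its backend addresses this way, so treat this as a hypothesis to test rather than a diagnosis. The sketch below uses hypothetical class and method names and only standard java.net calls.

```java
import java.net.InetSocketAddress;

public class ResolvedAddressDemo {

    /** Resolves the host immediately; the resulting IP is fixed inside the object. */
    public static InetSocketAddress resolveOnce(String host, int port) {
        return new InetSocketAddress(host, port);
    }

    /** Stores only hostname and port; no DNS lookup is performed here. */
    public static InetSocketAddress unresolved(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Builds a freshly resolved address from a stored one, e.g. per connection attempt. */
    public static InetSocketAddress resolveNow(InetSocketAddress stored) {
        return new InetSocketAddress(stored.getHostString(), stored.getPort());
    }

    public static void main(String[] args) {
        // Resolved at construction: later DNS changes are never seen by this object.
        InetSocketAddress pinned = resolveOnce("localhost", 25565);
        System.out.println("pinned is resolved: " + !pinned.isUnresolved());

        // Unresolved: carries the name only, so it can be re-resolved on demand.
        InetSocketAddress lazy = unresolved("mc-server-0.mc-server", 25565);
        System.out.println("lazy is unresolved: " + lazy.isUnresolved());
    }
}
```

If a proxy holds on to an object like `pinned` for each backend, lowering networkaddress.cache.ttl would have no effect until the proxy is restarted, which matches the behaviour described above. A stable Service IP sidesteps the issue either way.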