529 Members
English-language channel; off-topic → #yggdrasil-community:matrix.org; info and downloads: https://yggdrasil-network.github.io; code and issues: https://github.com/yggdrasil-network/yggdrasil-go; atom feed of releases: https://github.com/yggdrasil-network/yggdrasil-go/releases.atom
103 Servers

Timestamp Message
21 Jan 2020
19:00:21 @Arceliar:matrix.org Arceliar: neilalexander: my best guess is still that it's flooding the network with dht lookups faster than the network can handle
19:01:46 @Arceliar:matrix.org Arceliar: a lot of requests appearing at once probably spawns a lot of goroutines in the api, so they may send a lot of requests to the switch before the go runtime changes to running the switch, and then the flood of messages overflows a switch queue. or something like that. so packets could drop on your local machine without ever reaching the network
20:07:57 @neilalexander:matrix.org neilalexander: Arceliar: The switch queue buffers are quite large though?
20:08:53 @Arceliar:matrix.org Arceliar: it all depends on when the go runtime decides to switch between goroutines
20:11:03 @Arceliar:matrix.org Arceliar: things could also get lost in the network, but it seems more likely that the bottleneck would happen on the local end (or maybe an immediate peer: a dropped packet could head-of-line-block a lot of requests that then all appear in the node's switch much faster than the time it takes to send 1 packet)
20:11:47 @neilalexander:matrix.org neilalexander: Yeah possibly
20:13:11 @Arceliar:matrix.org Arceliar: possible quick way to test: when you launch a goroutine that makes a request, have it start by time.Sleeping for a random amount of time, on the order of maybe 1 second or so
20:13:23 @Arceliar:matrix.org Arceliar: the delay and randomness should prevent congestion
20:14:03 @Arceliar:matrix.org Arceliar: and hopefully not slow things down too much
20:15:04 @Arceliar:matrix.org Arceliar: if that works, then we know the problem is probably related to sending too many requests too quickly, and we can think about other ways to fix it (either in the crawler or in the API, maybe ygg should limit the number of concurrent API requests or something)

I added a 50ms sleep whilst holding the mutex, so that it should make the requests more evenly spaced, and got worse results amazingly:

The crawl took 1m17.18349461s seconds
736 nodes were processed
440 nodes were found
296 nodes were not found
0 nodes responded with nodeinfo
440 nodes did not respond with nodeinfo
20:16:16 @neilalexander:matrix.org neilalexander: (The nodeinfo bit there is disabled so you can ignore those numbers)
20:16:46 @Arceliar:matrix.org Arceliar: i'm not sure where the mutex is / what it's doing exactly, but if it works the way i would guess, then that seems to suggest it's not the congestion
20:17:16 @neilalexander:matrix.org neilalexander: The mutex covers a shared map so every goroutine contends at least on that mutex, so it's a good way to slow down the pings
20:17:39 @Arceliar:matrix.org Arceliar: so the other places things could go wrong are in the API code (if the timeout is too short) or the dht code (if the callback is somehow getting removed early)
20:19:01 @neilalexander:matrix.org neilalexander: The timeout is 5 seconds I think, and I don't think the callbacks are removed early as I have checked that already when I refactored to actors
20:19:42 @neilalexander:matrix.org neilalexander: The crawler also has a specific check not to start a DHT request for a node if one is already in progress for the same node, too
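
[Editor's sketch] A check of the kind just described ("don't start a request for a node if one is already in progress") is commonly built as a mutex-guarded set of pending keys. The following is a sketch under that assumption; the `inFlight` type and its methods are hypothetical and not taken from the crawler.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight tracks which node keys currently have a request outstanding.
// (Hypothetical type for illustration.)
type inFlight struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

func newInFlight() *inFlight {
	return &inFlight{pending: make(map[string]struct{})}
}

// tryStart marks key as pending and reports true, unless a request for key
// is already pending, in which case it reports false and changes nothing.
func (f *inFlight) tryStart(key string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, busy := f.pending[key]; busy {
		return false
	}
	f.pending[key] = struct{}{}
	return true
}

// finish clears the pending mark so a later request for key may start.
func (f *inFlight) finish(key string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.pending, key)
}

func main() {
	f := newInFlight()
	fmt.Println(f.tryStart("node-a")) // first request may start
	fmt.Println(f.tryStart("node-a")) // duplicate is refused while pending
	f.finish("node-a")
	fmt.Println(f.tryStart("node-a")) // allowed again after completion
}
```

Doing the check and the insert under one lock acquisition avoids the race where two goroutines both see "not pending" and both start a request.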
20:21:28 @Arceliar:matrix.org Arceliar: if some nodes know old/wrong coords, and that's how you first hear about a node, would that prevent you from checking at the right coords?
20:24:24 @neilalexander:matrix.org neilalexander: The crawler does allow 5 attempts to ping a node
20:24:40 @neilalexander:matrix.org neilalexander: If it gets a successful response, it won't retry, but if it doesn't, it'll allow 4 more attempts
20:25:00 @Arceliar:matrix.org Arceliar: if a ping is already sent (to the wrong coords), i'm wondering how it handles hearing about the node again with different coords
20:25:12 @Arceliar:matrix.org Arceliar: if it caches the info somewhere or if it just drops it
20:25:14 @neilalexander:matrix.org neilalexander: It always uses the coords from the most recent rumour
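
[Editor's sketch] One way to read the behaviour described above: up to 5 attempts per node, stopping at the first success, with each attempt using whatever coords were heard most recently. The sketch below assumes exactly that; `rumours`, `pingWithRetry`, and the use of a plain string for coords are hypothetical simplifications, not the crawler's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// rumours remembers the most recently heard coords for each node key.
// (Hypothetical type; coords are a plain string here for brevity.)
type rumours struct {
	mu     sync.Mutex
	coords map[string]string
}

func newRumours() *rumours {
	return &rumours{coords: make(map[string]string)}
}

// hear overwrites any older coords for key with the newest rumour.
func (r *rumours) hear(key, coords string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.coords[key] = coords
}

// latest returns the most recent coords for key, if any were heard.
func (r *rumours) latest(key string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.coords[key]
	return c, ok
}

// pingWithRetry tries up to maxAttempts pings, re-reading the latest coords
// before every attempt and stopping at the first success. It returns the
// number of pings actually sent and whether one succeeded.
func pingWithRetry(r *rumours, key string, maxAttempts int, ping func(coords string) bool) (int, bool) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		coords, known := r.latest(key)
		if !known {
			return attempt - 1, false // nothing to ping
		}
		if ping(coords) {
			return attempt, true
		}
	}
	return maxAttempts, false
}

func main() {
	r := newRumours()
	r.hear("node-a", "[1 2]") // stale coords heard first
	attempts, ok := pingWithRetry(r, "node-a", 5, func(coords string) bool {
		if coords == "[1 2]" {
			r.hear("node-a", "[1 3]") // a newer rumour arrives mid-crawl
			return false
		}
		return coords == "[1 3]"
	})
	fmt.Println(attempts, ok)
}
```

Under this reading, a ping already sent to stale coords simply fails or times out, and the next attempt picks up the newer coords.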
20:28:22 @neilalexander:matrix.org neilalexander: I've added a random sleep of 1 to 500ms for each call now, trying that
20:42:52 @Arceliar:matrix.org Arceliar: that's without the mutex? (random sleeps while holding the mutex probably doesn't accomplish much)
21:57:44 @neilalexander:matrix.org neilalexander: That's with the mutex - my theory being that if the sleep is when the mutex is held, it stops other goroutines from flooding more DHT pings
21:57:50 @neilalexander:matrix.org neilalexander: So it should be pretty much one-at-a-time
