More reliable Tor experiments #78
A small comment for some precision: it is not the absolute divergence that is the problem, but rather the fact that the offset is not the same for each relay class.
Sorry for the delay! This is a timely issue to bring up, because I am currently rewriting the network generation procedure into a new, easier-to-use tool. Thank you! I don't have an answer yet, but let me think about a new sampling algorithm and get back to you for feedback.
Also, I should note that the new tool will generate networks based on a much larger set of consensus files (like, all consensuses from a given month) rather than a single consensus file. I think this will allow us to do a better job of creating a ShadowTor network that statistically looks like a Tor network even if it does not precisely match a single consensus.
Cool! Also, while we're on the subject of improvements to generate.py: there is a small issue I noticed recently. I guess we probably want to replace the wiki instruction, which is "wget https://metrics.torproject.org/userstats-relay-country.csv", with something along these lines: "wget https://metrics.torproject.org/userstats-relay-country.csv?start=yyyy-mm-dd&end=yyyy-mm-dd&events=off", explaining that the user needs to find an appropriate date range of at least 10 days close to the period they are simulating.
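For illustration, fetching the CSV for a chosen date range could look something like this (a minimal sketch; the dates below are placeholders, not a recommendation):

```python
import urllib.request

# Placeholder date range: pick a window of at least ~10 days close to
# the period being simulated.
start, end = "2018-01-01", "2018-01-14"
url = ("https://metrics.torproject.org/userstats-relay-country.csv"
       "?start={}&end={}&events=off".format(start, end))

# Download the user statistics CSV that generate.py expects.
urllib.request.urlretrieve(url, "userstats-relay-country.csv")
```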
I don't think the code agrees with your statement. The parsing function starts at index -1 (the last line in the file, i.e., the most recent date) and works backwards until it reaches 10 full days. See shadow-plugin-tor/tools/generate.py, line 694 at 133724d.
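Roughly, the logic there amounts to something like the following (an illustrative sketch only, not the actual generate.py code; the CSV layout is assumed):

```python
import csv

def last_ten_days(path):
    # Collect the rows for the 10 most recent dates, walking backwards
    # from the end of the file (the most recent entries are at the bottom).
    with open(path) as f:
        rows = [r for r in csv.reader(f)
                if r and r[0] and not r[0].startswith("#") and r[0] != "date"]

    days, seen = [], set()
    for row in reversed(rows):          # start at index -1, go backwards
        date = row[0]
        if date not in seen and len(seen) == 10:
            break                       # we already have 10 full days
        seen.add(date)
        days.append(row)
    return days
```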
Is there a bug, and did you verify that running the code does in fact choose the first 10 days in the file instead of the last 10 days? In any case, allowing the user to specify the date range seems useful :)
Whoops, I totally misunderstood the code. You're right, it goes backwards.
Yep x)
I am unable to reproduce your results. I reimplemented the existing relay sampling strategy by splitting relays into classes and then doing the bucket-median thing on each class to get a smaller set of relays for each class. Then I tried to reproduce your "divergence" metric by computing the following:
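Roughly like this (a simplified sketch of the computation described below, not the exact code; representing relays as dicts with "class" and "weight" keys is an assumption for illustration):

```python
def weight_fractions(relays):
    # Map each relay class (guard, exit, guardexit, middle) to the fraction
    # of the total consensus weight held by relays in that class.
    total = sum(r["weight"] for r in relays)
    fractions = {}
    for r in relays:
        fractions[r["class"]] = fractions.get(r["class"], 0.0) + r["weight"]
    return {c: w / total for c, w in fractions.items()}

def weight_divergence(full_relays, sampled_relays):
    # Absolute difference, per class, between the weight fraction in the
    # sampled-down network and in the full network.
    full = weight_fractions(full_relays)
    sampled = weight_fractions(sampled_relays)
    return {c: abs(sampled.get(c, 0.0) - full[c]) for c in full}
```

With an ideal sampling, the per-class divergence would be close to zero.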
This gives us the fraction of bandwidth weight that we expect in each relay class set, for both the full network and the sampled-down network. So, for example, in an ideal world we expect about 33% for each of the middle, exit, and guard class sets. Then I took the absolute value of the difference between the sampled set and the full set. So if the sampled guards had only 20% of the bandwidth weight in the scaled network, but they had 33% in the full network, the bandwidth weight divergence is 13%. I found that the maximum bandwidth weight divergence across all relay classes for all network sizes [100, 6400] was about 4%. I'm not sure how you came up with 80%-100% bandwidth weight divergence in some cases.

I thought I might be confused by observed bandwidth vs. bandwidth weight, so I also computed the advertised bandwidth divergence in a similar fashion, where I am still sampling relays based on the bandwidth weight, but then I am computing the divergence based on the divergence in advertised bandwidth (i.e., advertised bandwidth is our best guess of the bandwidth capacity of the relay). I found that the maximum advertised bandwidth divergence across all relay classes for all network sizes [100, 6400] was 4.3%.

Which of the above two divergence metrics are you computing? Could you please describe your method in a way that would allow me to reproduce it? Feel free to post code ;) Or maybe there is a bug in the old generate script and my new implementation fixed it?
To be clear on the nomenclature: I usually call "bandwidth-weights" the weights at the bottom of the consensus file, "bandwidth" or "consensus weight" the value on each relay's bandwidth line in the consensus, and "advertised bandwidth" the bandwidth each relay reports.
I think we're not computing the same thing, but it looks like your method makes more sense. What I computed was the sum of consensus weights in the sampled class divided by the expected sum of consensus weights for the same class (that is, if we want one half of the original network, we should expect to obtain one half of the total consensus weight for a given relay class after sampling, and I was trying to plot how far we are from this expectation). Maybe the code can help to clear this up. Here is the script I run with that commit to generate all results fast:
And the Python script to plot the printed metrics:
It is also totally possible that there is a bug (or logic flaw) somewhere in what I am doing.
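In code terms, the metric I computed boils down to something like this (a simplified illustration, not the actual scripts referenced above; the relay dicts and the `scale` parameter are assumptions for the sketch):

```python
def class_weight_ratio(full_relays, sampled_relays, scale):
    # For each relay class, compare the consensus weight kept after sampling
    # against the weight we would expect if the class were scaled down by
    # `scale` (e.g., scale=0.5 when sampling half of the original network).
    def class_totals(relays):
        totals = {}
        for r in relays:
            totals[r["class"]] = totals.get(r["class"], 0.0) + r["weight"]
        return totals

    full = class_totals(full_relays)
    sampled = class_totals(sampled_relays)
    return {c: sampled.get(c, 0.0) / (scale * full[c]) for c in full}
```

A ratio of 1.0 means the sampled class kept exactly its expected share of consensus weight; the divergence I plotted is how far each class drifts from that expectation.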
Right. I like my method better 😉 I think we should focus on the "consensus weight" (the "w Bandwidth=x" line in the consensus). Also, I believe that we need to normalize the consensus weights as I did in my approach, because the range of the absolute consensus weights is not meaningful. My method works on the normalized consensus weights, which represent the probability of selection available to each relay class. I do have some ideas for reducing the 4% consensus weight divergence that I computed in my approach. I'm currently trying to decide whether the added complexity of the algorithm for reducing the consensus weight divergence (and the overhead associated with implementing it) is worth the improvement we would gain from the more accurate weights.
Hello,
I've been thinking about improving Shadow's reliability for small-sized experimentation, so here are some insights. Let me just describe the issue first.
When sampling the consensus, the generate.py script sorts the relays, splits them into bins, and takes the median relay of each bin. It applies this logic per relay class (i.e., guards, exits, guardexits, and middles) and determines the number of sampled relays by taking the same proportion for each relay class. This choice implies that the total bandwidth proportion of each relay class is not preserved, while this proportion has a huge impact on how vanilla Tor selects paths, because it is part of how the bandwidth-weights are computed (one of the last lines of the consensus). If the proportion of total bandwidth between classes stayed the same, then the bandwidth-weights would be identical for the scaled-down consensus. This is what I suggest we fix.
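For reference, here is a rough sketch of that per-class bucket-median sampling (a simplification of what generate.py does; the relay representation is assumed):

```python
def sample_class(relays, fraction):
    # Sort the relays of one class by consensus weight, split the sorted
    # list into n roughly equal bins (n = fraction * len(relays)), and
    # keep the median relay of each bin.
    relays = sorted(relays, key=lambda r: r["weight"])
    n = max(1, int(len(relays) * fraction))
    bin_size = len(relays) / float(n)
    sampled = []
    for i in range(n):
        lo = int(i * bin_size)
        hi = int((i + 1) * bin_size)
        sampled.append(relays[(lo + hi) // 2])  # median relay of the bin
    return sampled

# The same fraction is applied to each class independently, e.g.:
# scaled = {c: sample_class(relays_by_class[c], 0.01) for c in relays_by_class}
```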
You can see in the following image the divergence of the total bandwidth of each relay class from its expected value, for different scaled-down networks built from the same consensus.
An interesting result of this experiment is that the divergence does not necessarily get better as we increase the sample size.
I think there are several possibilities to improve on the current sampling:
Also, I am inclined to believe that the sampling strategy should be linked to the type of experiment we intend to run. E.g., for location-aware path selection, we might want to ensure that we keep relays from many locations. And this does not look trivial to me :)