More reliable Tor experiments #78
A small comment for some precision: it is not the absolute divergence that is the problem, but rather the fact that the offset is not the same for each relay class.
Sorry for the delay! This is a timely issue to bring up, because I am currently rewriting the network generation procedure into a new, easier-to-use tool. Thank you! I don't have an answer yet, but let me think about a new sampling algorithm and get back to you for feedback.
Also, I should note that the new tool will generate networks based on a much larger set of consensus files (like, all consensuses from a given month) rather than a single consensus file. I think this will allow us to do a better job of creating a ShadowTor network that statistically looks like a Tor network even if it does not precisely match a single consensus.
Cool! Also, while we're on the subject of improvements to generate.py: there is a small issue I noticed recently. I guess we probably want to replace the wiki instruction, which is "wget https://metrics.torproject.org/userstats-relay-country.csv", with something along these lines: "wget https://metrics.torproject.org/userstats-relay-country.csv?start=yyyy-mm-dd&end=yyyy-mm-dd&events=off", explaining that the user needs to find an appropriate date range of at least 10 days close to the period they are simulating.
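For illustration, fetching the CSV for a chosen date range could look something like this (a minimal sketch; the dates below are placeholders, not a recommendation):

```python
import urllib.request

# Placeholder date range: pick a window of at least ~10 days close to
# the period being simulated.
start, end = "2018-01-01", "2018-01-14"
url = ("https://metrics.torproject.org/userstats-relay-country.csv"
       "?start={}&end={}&events=off".format(start, end))

# Download the user statistics CSV that generate.py expects.
urllib.request.urlretrieve(url, "userstats-relay-country.csv")
```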
I don't think the code agrees with your statement. The parsing function starts at index -1 (the last line in the file, i.e., the most recent date) and works backwards until it reaches 10 full days. See shadow-plugin-tor/tools/generate.py, line 694 at 133724d.
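Roughly, the logic there amounts to something like the following (an illustrative sketch only, not the actual generate.py code; the CSV layout is assumed):

```python
import csv

def last_ten_days(path):
    # Collect the rows for the 10 most recent dates, walking backwards
    # from the end of the file (the most recent entries are at the bottom).
    with open(path) as f:
        rows = [r for r in csv.reader(f)
                if r and r[0] and not r[0].startswith("#") and r[0] != "date"]

    days, seen = [], set()
    for row in reversed(rows):          # start at index -1, go backwards
        date = row[0]
        if date not in seen and len(seen) == 10:
            break                       # we already have 10 full days
        seen.add(date)
        days.append(row)
    return days
```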
Is there a bug, and did you verify that running the code does in fact choose the first 10 days in the file instead of the last 10 days? In any case, allowing the user to specify the date range seems useful :)
Whoops, I totally misunderstood the code. You're right, it goes backwards.
Yep x)
I am unable to reproduce your results. I reimplemented the existing relay sampling strategy by splitting relays into classes and then doing the bucket-median thing on each class to get a smaller set of relays for each class. Then I tried to reproduce your "divergence" metric by computing the following:
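Roughly like this (a simplified sketch of the computation described below, not the exact code; representing relays as dicts with "class" and "weight" keys is an assumption for illustration):

```python
def weight_fractions(relays):
    # Map each relay class (guard, exit, guardexit, middle) to the fraction
    # of the total consensus weight held by relays in that class.
    total = sum(r["weight"] for r in relays)
    fractions = {}
    for r in relays:
        fractions[r["class"]] = fractions.get(r["class"], 0.0) + r["weight"]
    return {c: w / total for c, w in fractions.items()}

def weight_divergence(full_relays, sampled_relays):
    # Absolute difference, per class, between the weight fraction in the
    # sampled-down network and in the full network.
    full = weight_fractions(full_relays)
    sampled = weight_fractions(sampled_relays)
    return {c: abs(sampled.get(c, 0.0) - full[c]) for c in full}
```

With an ideal sampling, the per-class divergence would be close to zero.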
This gives us the fraction of bandwidth weight that we expect in each relay class set, for both the full network and the sampled-down network. So, for example, in an ideal world we expect about 33% for each of the middle, exit, and guard class sets. Then I took the absolute value of the difference between the sampled set and the full set. So if the sampled guards had only 20% of the bandwidth weight in the scaled network, but they had 33% in the full network, the bandwidth weight divergence is 13%. I found that the maximum bandwidth weight divergence across all relay classes for all network sizes [100, 6400] was about 4%. I'm not sure how you came up with 80%-100% bandwidth weight divergence in some cases.

I thought I might be confused by observed bandwidth vs. bandwidth weight, so I also computed the advertised bandwidth divergence in a similar fashion, where I am still sampling relays based on the bandwidth weight, but then I am computing the divergence based on the divergence in advertised bandwidth (i.e., advertised bandwidth is our best guess of the bandwidth capacity of the relay). I found that the maximum advertised bandwidth divergence across all relay classes for all network sizes [100, 6400] was 4.3%.

Which of the above two divergence metrics are you computing? Could you please describe your method in a way that would allow me to reproduce it? Feel free to post code ;) Or maybe there is a bug in the old generate script and my new implementation fixed it?
To be clear on the nomenclature: I usually call "bandwidth-weights" the weights at the bottom of the consensus file, "bandwidth" or "consensus weight" the value on each relay's bandwidth line in the consensus, and "advertised bandwidth" the bandwidth each relay reports.
I think we're not computing the same thing, but it looks like your method makes more sense. What I computed was the sum of consensus weights in the sampled class divided by the expected sum of consensus weights for the same class (that is, if we want one half of the original network, we should expect to obtain one half of the total consensus weight for a given relay class after sampling, and I was trying to plot how far we are from this expectation). Maybe the code can help to clear this up. Here is the script I run with that commit to generate all results fast:
And the Python script to plot the printed metrics:
It is also totally possible that there is a bug (or logic flaw) somewhere in what I am doing.
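In code terms, the metric I computed boils down to something like this (a simplified illustration, not the actual scripts referenced above; the relay dicts and the `scale` parameter are assumptions for the sketch):

```python
def class_weight_ratio(full_relays, sampled_relays, scale):
    # For each relay class, compare the consensus weight kept after sampling
    # against the weight we would expect if the class were scaled down by
    # `scale` (e.g., scale=0.5 when sampling half of the original network).
    def class_totals(relays):
        totals = {}
        for r in relays:
            totals[r["class"]] = totals.get(r["class"], 0.0) + r["weight"]
        return totals

    full = class_totals(full_relays)
    sampled = class_totals(sampled_relays)
    return {c: sampled.get(c, 0.0) / (scale * full[c]) for c in full}
```

A ratio of 1.0 means the sampled class kept exactly its expected share of consensus weight; the divergence I plotted is how far each class drifts from that expectation.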
Right. I like my method better 😉 I think we should focus on the "consensus weight" (the "w Bandwidth=x" line in the consensus). Also, I believe that we need to normalize the consensus weights as I did in my approach, because the range of the absolute consensus weights is not meaningful. My method works on the normalized consensus weights, which represent the probability of selection available to each relay class. I do have some ideas for reducing the 4% consensus weight divergence that I computed in my approach. I'm currently trying to decide whether the added complexity of the algorithm for reducing the consensus weight divergence (and the overhead associated with implementing it) is worth the improvement we would gain from the more accurate weights.
Hello,
I've been thinking about improving Shadow's reliability for small-sized experimentation, so here are some insights. Let me just describe the issue first.
When sampling the consensus, the generate.py script sorts the relays, splits them into bins, and takes the median relay of each bin. It applies this logic per relay class (i.e., guards, exits, guardexits, and middles) and determines the number of sampled relays by taking the same proportion for each relay class. This choice implies that the total bandwidth proportion of each relay class is not preserved, while this proportion has a huge impact on how vanilla Tor selects paths, because it is part of how the bandwidth-weights are computed (one of the last lines of the consensus). If the proportion of total bandwidth between classes stayed the same, then the bandwidth-weights would be identical for the scaled-down consensus. This is what I suggest we fix.
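For reference, here is a rough sketch of that per-class bucket-median sampling (a simplification of what generate.py does; the relay representation is assumed):

```python
def sample_class(relays, fraction):
    # Sort the relays of one class by consensus weight, split the sorted
    # list into n roughly equal bins (n = fraction * len(relays)), and
    # keep the median relay of each bin.
    relays = sorted(relays, key=lambda r: r["weight"])
    n = max(1, int(len(relays) * fraction))
    bin_size = len(relays) / float(n)
    sampled = []
    for i in range(n):
        lo = int(i * bin_size)
        hi = int((i + 1) * bin_size)
        sampled.append(relays[(lo + hi) // 2])  # median relay of the bin
    return sampled

# The same fraction is applied to each class independently, e.g.:
# scaled = {c: sample_class(relays_by_class[c], 0.01) for c in relays_by_class}
```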
You can see in the following image the divergence of the total bandwidth of each relay class from its expected value, for different scaled-down networks built from the same consensus.
An interesting result of this experiment is that the divergence does not necessarily get better as we increase the sample size.
I think there are several possibilities to improve on the current sampling:
Also, I am inclined to believe that the sampling strategy should be linked to the type of experiment we intend to run. E.g., for location-aware path selection, we might want to ensure that we keep relays from many locations. And this does not look trivial to me :)