
Timestamp range boundaries are incorrectly checked #31

Open
mikolaj-kow opened this issue Mar 25, 2023 · 11 comments

Comments

@mikolaj-kow

Greetings,

In the broker backend documentation, the timestamp is defined as the time of the beginning of the data dump, usually aligned on the full five-minute marks.

For example, I want to request 15 minutes of data, so I expect to receive three BrokerItems. However, the broker produces four BrokerItems.

broker = bgpkit.Broker()
items = broker.query(ts_start="2014-04-01 00:00:00", ts_end="2014-04-01 00:14:59", project="riperis", data_type="update", print_url=True, collector_id="rrc00")
[BrokerItem(ts_start='2014-03-31T23:55:00', ts_end='2014-04-01T00:00:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.03/updates.20140331.2355.gz', rough_size=155648, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:00:00', ts_end='2014-04-01T00:05:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0000.gz', rough_size=129024, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:05:00', ts_end='2014-04-01T00:10:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0005.gz', rough_size=145408, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:10:00', ts_end='2014-04-01T00:15:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0010.gz', rough_size=174080, exact_size=0)]

This results in

elems = bgpkit.Parser(url=items[0].url).parse_all()
print(datetime.utcfromtimestamp(elems[0]["timestamp"]))
>>>2014-03-31 23:55:00

If I try this:

broker = bgpkit.Broker()
items = broker.query(ts_start="2014-04-01 00:00:01", ts_end="2014-04-01 00:15:00", project="riperis", data_type="update", print_url=True, collector_id="rrc00")
[BrokerItem(ts_start='2014-04-01T00:00:00', ts_end='2014-04-01T00:05:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0000.gz', rough_size=129024, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:05:00', ts_end='2014-04-01T00:10:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0005.gz', rough_size=145408, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:10:00', ts_end='2014-04-01T00:15:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0010.gz', rough_size=174080, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:15:00', ts_end='2014-04-01T00:20:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0015.gz', rough_size=121856, exact_size=0)]

I get this:

elems = bgpkit.Parser(url=items[-1].url).parse_all()
print(datetime.utcfromtimestamp(elems[-1]["timestamp"]))
>>>2014-04-01 00:20:00

So it seems that the timespans I defined are already shorter than 15 minutes:

ts_start='2014-04-01T00:00:01'
ts_end  ='2014-04-01T00:15:00'
datetime.fromisoformat(ts_end) - datetime.fromisoformat(ts_start)
>>> datetime.timedelta(seconds=899)

An easy workaround is using ts_start="2014-04-01 00:00:01", ts_end="2014-04-01 00:14:59"
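The workaround above can be wrapped in a small helper that always ends the query window one second before the next five-minute mark (a sketch only; clamp_range is a hypothetical name, not part of the bgpkit API):

```python
from datetime import datetime, timedelta

def clamp_range(ts_start: str, minutes: int) -> tuple[str, str]:
    """Build a broker query range that ends one second before the
    next five-minute mark, so the file starting at the boundary
    is excluded (hypothetical helper, not part of bgpkit)."""
    start = datetime.fromisoformat(ts_start)
    end = start + timedelta(minutes=minutes) - timedelta(seconds=1)
    return start.isoformat(sep=" "), end.isoformat(sep=" ")

print(clamp_range("2014-04-01 00:00:00", 15))
# -> ('2014-04-01 00:00:00', '2014-04-01 00:14:59')
```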

Best regards

@digizeph
Member

Hi @mikolaj-kow, thanks for bringing this up. I am aware of this issue but ended up keeping the current behavior. The main reason is that MRT files sometimes include messages at the exact end timestamp. In your example, the previous file was included because it might contain messages announced at the exact hour mark.

I am open to changing the default behavior so that a file is not included when only its end time overlaps with the start time of the filter, and adding an additional flag to restore the current behavior.

@mikolaj-kow
Author

Hi,
It's a tough one :)
I get the idea behind returning more results than omitting some.
However, typically only one side of a range is inclusive ("greater than or equal to" at the start and strictly "less than" at the end, or vice versa). That way, one can define contiguous ranges without overlaps or gaps.
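The half-open convention suggested here can be illustrated with a minimal sketch (plain Python, unrelated to the broker's actual implementation): with [start, end) windows, every timestamp, including the shared boundaries, falls into exactly one window.

```python
def in_window(t: int, start: int, end: int) -> bool:
    """Half-open interval test: start is inclusive, end is exclusive."""
    return start <= t < end

# Three consecutive five-minute windows (in seconds).
windows = [(0, 300), (300, 600), (600, 900)]

# Every timestamp, including the shared boundaries at 300 and 600,
# belongs to exactly one window: no overlaps, no gaps.
for t in (0, 300, 600, 899):
    assert sum(in_window(t, s, e) for s, e in windows) == 1
```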

I tried to apply a Parser timestamp filter to work around this Broker behaviour and get a precise time range. This should be possible, however in Python it fails with
ValueError: Error: unknown filter type: ts_start and ValueError: Error: unknown filter type: ts_end errors. Can you have a look?

@digizeph
Member

digizeph commented Apr 9, 2023

@mikolaj-kow it seems there are some inconsistencies in the naming of the filters. Could you try start_ts and end_ts instead? I will make sure both versions are accepted in the next release.

@mikolaj-kow
Author

I was able to successfully use the following dictionary as a Parser filter: {'start_ts': '1396364401.0', 'end_ts': '1396450801.0'}
Notice that the f64 values are cast to str. When I try to use values from datetime.timestamp() directly, I get
TypeError: argument 'filters': 'float' object cannot be converted to 'PyString'
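For reference, a minimal sketch of producing those string-typed filter values from datetime objects (the start_ts/end_ts field names and example values are taken from the comments above):

```python
from datetime import datetime, timezone

# Parser filter values must be strings, so cast the float timestamps.
# The start_ts/end_ts names and values mirror the comment above.
start = datetime(2014, 4, 1, 15, 0, 1, tzinfo=timezone.utc)
end = datetime(2014, 4, 2, 15, 0, 1, tzinfo=timezone.utc)
filters = {"start_ts": str(start.timestamp()), "end_ts": str(end.timestamp())}
print(filters)  # {'start_ts': '1396364401.0', 'end_ts': '1396450801.0'}
```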

@digizeph
Member

Good to hear! Yes, the string conversion is expected. The timestamp parameters accept both unix timestamps and RFC 3339 time strings (e.g. 2020-01-02T00:00:00Z), and thus need to be strings.

@mikolaj-kow
Author

Getting kind of offtopic ;) but I wanted to let you know that I had no luck with any of these:
ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00
ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00Z
ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00+00:00

The last one is the ISO format from dt.isoformat(), which AFAIK should be compliant with RFC 3339.
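Since the Zulu-suffixed form was later reported to parse while the +00:00 offset form was not, one workaround is to normalize isoformat() output to Zulu notation before passing it (to_zulu is a hypothetical helper, not part of bgpkit):

```python
from datetime import datetime, timezone

def to_zulu(dt: datetime) -> str:
    """Render an aware datetime in RFC 3339 Zulu notation
    ('Z' suffix instead of '+00:00'); hypothetical helper."""
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

print(to_zulu(datetime(2020, 1, 2, tzinfo=timezone.utc)))  # 2020-01-02T00:00:00Z
```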

@digizeph
Member

This is a great catch! I will attempt to patch these later today.

@mikolaj-kow
Author

I've tested Broker.query() as well, for a more complete picture.

For timestamps, int and str(int) work fine, but float and str(float) do not; however, both worked with the Parser. So it depends on whether you wish to support microsecond resolution.

RFC/ISO-formatted datetimes without a timezone, or with Zulu notation, work without issues, even with microseconds.

RFC/ISO-formatted datetimes with a negative offset like 2020-01-02T00:00:00-00:00 work fine, but a positive offset like 2020-01-02T00:00:00+00:00 or 2020-01-02T00:00:00.123000+00:00 does not work.

File ~/.pyenv/versions/bgpcompute-3.11.2/lib/python3.11/site-packages/bgpkit/bgpkit_broker.py:63, in Broker.query(self, ts_start, ts_end, collector_id, project, data_type, print_url)
     61 res = requests.get(api_url).json()
     62 while res:
---> 63     if res["count"] > 0:
     64         data_items.extend([BrokerItem(**i) for i in res["data"]])
     66         if res["count"] < res["page_size"]:

TypeError: '>' not supported between instances of 'NoneType' and 'int'
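The traceback shows that res["count"] can be None when the API returns an error payload instead of results. A None-safe guard might look like the following sketch (page_has_data is hypothetical and only mirrors the payload shape seen in this thread, not the actual bgpkit_broker code):

```python
def page_has_data(res: dict) -> bool:
    """Treat a None 'count' (error payload) safely instead of
    comparing None to an int; mirrors the payload shape only."""
    if res.get("error"):
        raise ValueError(f"broker error: {res['error']}")
    return (res.get("count") or 0) > 0

ok = {"count": 3, "data": [1, 2, 3], "error": None}
print(page_has_data(ok))  # True
```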

@digizeph
Member

Hi @mikolaj-kow, thank you for your efforts! Are you using the default broker instance or a self-hosted one? I've updated the default instance to allow float-number search. However, I cannot reproduce the timezone-related issues you mentioned. (See the updated test here: https://github.com/bgpkit/pybgpkit/blob/7f4e6cd2d654c505821554daf720d10a9415fce8/bgpkit/test_integration.py#L28)

@mikolaj-kow
Author

You're welcome. I'm happy I can help a bit.

I'm using a self-hosted Docker instance due to the considerable number of queries I run. The default broker appears to work just fine for the same query.

≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000+00:00&collector_id=rrc00"
{"count":null,"page":null,"page_size":null,"error":"failed to parse ts_end time string: Expected an ISO 8601-like string, but was given '2022-02-02T00:20:00.123000 00:00'. Try passing in a format string to resolve this.","data":null}
≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00"
{"count":null,"page":null,"page_size":null,"error":"failed to parse ts_end time string: Expected an ISO 8601-like string, but was given '2022-02-02T00:20:00 00:00'. Try passing in a format string to resolve this.","data":null}
≻ curl "https://api.bgpkit.com/broker/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00"
{"count":7, ...

However, if I escape the + sign with %2B, I get the expected output:

≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000%2B00:00&collector_id=rrc00"
{"count":7 ...

The broker-api container logs:

INFO:     172.18.0.1:44844 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00 HTTP/1.1" 200 OK
INFO:     172.18.0.1:47670 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000+00:00&collector_id=rrc00 HTTP/1.1" 200 OK
INFO:     172.18.0.1:42460 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000%2B00:00&collector_id=rrc00 HTTP/1.1" 200 OK
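The root cause visible in these logs is URL decoding: an unescaped + in a query string decodes as a space. When building such URLs from Python, urllib.parse.urlencode percent-encodes the offset automatically (a sketch using the parameters from the curl commands above):

```python
from urllib.parse import urlencode

# urlencode percent-encodes ':' and '+' so the timezone offset
# survives URL decoding instead of turning into a space.
params = {
    "ts_start": "2022-02-02T00:00:00-00:00",
    "ts_end": "2022-02-02T00:20:00+00:00",
    "collector_id": "rrc00",
}
query = urlencode(params)
print(query)
# ts_start=2022-02-02T00%3A00%3A00-00%3A00&ts_end=2022-02-02T00%3A20%3A00%2B00%3A00&collector_id=rrc00
```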

@digizeph
Member

@mikolaj-kow Thank you for providing us with the detailed logs!

Currently, there is a tech-stack discrepancy between our hosted version and the docker version. Our hosted version has been updated to a serverless JavaScript-based API, which can handle concurrent requests more efficiently and integrate seamlessly with our backend Postgres instance. On the other hand, the docker version remains a Python-based API, and there are some differences in the parameter checking.

We are actively working on closing this gap by incorporating a PostgREST instance into the Docker setup, which will enable querying via HTTP and allow us to reuse the same JavaScript API for the self-hosted version. We anticipate that this upgrade will be completed in the coming weeks.

In addition, we have made significant improvements to our hosting infrastructure, and the default broker is now fully equipped to handle large numbers of requests. Please feel free to utilize it for your purposes. If you have any further concerns or require customization, please do not hesitate to reach out to us at [email protected].
