Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Failed to deserialize a token: Timeout was reached" - how to debug? #128

Open
steve8x8 opened this issue Sep 6, 2023 · 14 comments
Open

Comments

@steve8x8
Copy link

steve8x8 commented Sep 6, 2023

# scitokens-verify `cat /run/user/1902/bt_u1902`
Failed to deserialize a token: Timeout was reached

How to debug this? Debian Bullseye, libscitokens0 version 1.0.2

There's another machine with the same SciTokens setup that returns the correct response (same file content)

$ scitokens-verify `cat /run/user/1902/bt_u1902`
Token deserialization successful.

but I can't find the difference :/

@jbasney
Copy link
Member

jbasney commented Sep 6, 2023

In #114 the "Timeout was reached" error was caused when trying to retrieve the issuer public key, so the first thing I'd check is that curl can get your issuer's well-known endpoint from the machine in question. For example:

$ curl https://demo.scitokens.org/.well-known/openid-configuration
{"claims_supported":["aud","exp","iat","iss","sub"],"device_authorization_endpoint":"https://demo.scitokens.org/oauth2/device_authorization","grant_types_supported":["authorization_code","refresh_token","urn:ietf:params:oauth:grant-type:token-exchange","urn:ietf:params:oauth:grant-type:device_code"],"id_token_signing_alg_values_supported":["RS256","RS384","RS512"],"issuer":"https://demo.scitokens.org","jwks_uri":"https://demo.scitokens.org/oauth2/certs","registration_endpoint":"https://demo.scitokens.org/oauth2/oidc-cm","response_modes_supported":["query","fragment","form_post"],"response_types_supported":["code","id_token"],"scopes_supported":["read:/","write:/"],"subject_types_supported":["public"],"token_endpoint":"https://demo.scitokens.org/oauth2/token"}

@steve8x8
Copy link
Author

steve8x8 commented Sep 7, 2023

Using "iss": "https://cilogon.org/igwn" from htdecodetoken output, I can connect, both curl https://cilogon.org/.well-known/openid-configuration and curl https://cilogon.org/igwn/.well-known/openid-configuration return some JSON code.

What's the second thing to check? (The token file is 1048 bytes in size, if that matters.) I'm about to run Wireshark on that machine to get a grip on what's being sent where (and meant to be received).

@jbasney
Copy link
Member

jbasney commented Sep 7, 2023

I got a (large) token from https://cilogon.org/igwn on my Ubuntu 22.04.3 LTS VM to see if I could reproduce the problem:

$ apt-show-versions libscitokens0
libscitokens0:amd64/unknown 1.0.2-1+ubuntu22.04.0 uptodate
$ htdecodetoken | grep iss
  "iss": "https://cilogon.org/igwn",
$ wc ${XDG_RUNTIME_DIR:-/tmp}/bt_u`id -u`
   1    1 1220 /run/user/1000//bt_u1000
$ rm -rf .cache/scitokens
$ strace -o scitokens-trace.txt scitokens-verify `cat /run/user/1000/bt_u1000`
Token deserialization successful.

The strace output shows the keycache getting populated:

$ grep cache scitokens-trace.txt | head -5
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mkdir("/home/jbasney/.cache", 0700)     = -1 EEXIST (File exists)
mkdir("/home/jbasney/.cache/scitokens", 0700) = 0
newfstatat(AT_FDCWD, "/home/jbasney/.cache/scitokens/scitokens_cpp.sqllite", 0x7ffeb24c35d0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/jbasney/.cache/scitokens/scitokens_cpp.sqllite", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 3

I see the issuer keys in the keycache:

$ sqlite3 ~/.cache/scitokens/scitokens_cpp.sqllite
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> select * from keycache;
https://cilogon.org/igwn|{"expires":1694442612,"jwks":{"keys":[{"alg":"RS256","e":"AQAB","kid":"2FEC35C60764D27594EBEC4D98824CD8","kty":"RSA","n":"AKtyEMIR2_ymESOtrak2QJfdQvVTip-QoC_fIgorqO47p_iBkkrvwRdyQoi_MyNi-5nLQ1pPScaL3niLRvgzx11jY7GoIil2gMN8NhY6dX94dCQC3Qsvo5C3tvla1SkfEGO8y52z8biReVQZy8_Hdy6l__IGveR6aQmC1tKecf1e-kJROS46yA9qpsNEtkREdwZAD9tmrDgddbfXAcCUtfdc62dEZTryuDcJGoXEaWBAobVIGIE5e7IYjsiZz_KtIOgaDqoguXeyttZB1-g2ok8PT9S3Xdt3gP4Wy3PE_qO58UCAv5YejG6Uau8XyxKfxGQrPnXERVJl3PQF1NCJj-8","use":"sig"},{"alg":"RS384","e":"AQAB","kid":"680182D6A0BF3BFF98E7DC211991EA44","kty":"RSA","n":"ALVOmsYrvRb8jzjyS99p2J22bP7Tvv9FGNzW62xDLKy77rVud0zMAUh7RJwUROmAoRoJ1soDIDXJH3-F4V_pZpI80_nGalXjX2SSbeNFtQVSilXdlS-CPwBE59A4lueDTvIg4925pI5PR50ShWQHMXnXU4xM5iFs9V0wtZsbKvs8Flsu2va_pnbhOb_3otEaz68iI2SI8srP-cz5e18l5SI-xtGie_IOWYkVN9Bs2O0ELcpI2hMT86m_dXJGUy5rMRpk6gTLGh3kis4wZN4OLrLvO_qTo3BU9I54nvCwiaQVWJwpZAq4Vf5jwXdrSCNnHLYOiPB1jix-Yor4JUJP8lk","use":"sig"},{"alg":"RS512","e":"AQAB","kid":"90387F85055ECED268E374A28414A056","kty":"RSA","n":"AItqzHE1vr_intX7uYEz5wf7Ppn2FSJUdhDn-GNu4Pg6MWUOB9JV89M90a2v9xHD-mHLhVgiqiR5VBz7o6Oo75x5zREB_UauYO3P5TnJEREXAMGc_ZWQiXhHr9LrShUYBcEP2DB1xpibXnq2m6pCvK9bSCVxyMZFtTHKGes7cQV6KTUd5S4ujAa7EQgccf6WXTUu43h1_VRN9NxJcFgkD8C4JKJPz1jYmNP7neL3qQmVIBRurA_gMkXP2ipT--7pN3PsYMKdE6XhNhbbCUJMrNoX7cT-_urvLXYAjHfplch3VrRANgr2auCQ2DtOJb9bztaTdB0A14Imdvujbi1ISSc","use":"sig"}]},"next_update":1694097612}

I'm not sure what else to check. I think scitokens-verify needs a verbose mode.

@maarten-litmaath
Copy link

Hi all,
could there be a timeout in populating the key cache? Does it depend on acquiring a lock? If so, does the directory on the failing host happen to be mounted through NFS?

@bbockelm
Copy link
Contributor

bbockelm commented Sep 7, 2023

Hi,

Don't forget there are two different downloads: one for the metadata, the second for the public key. Right now, those seem to point to the same IP address (but maybe there's an implementation detail I can't see?). @steve8x8 - can you try both of these commands:

curl -v https://cilogon.org/igwn/.well-known/openid-configuration
curl -v https://cilogon.org:443/oauth2/certs/igwn

(if there's nothing private in the output, attaching them as a file would be useful)

Now, if the networking setup looks clear, I know that @djw8605 was merging some PRs around bugs in the latest version... maybe there's some undefined behavior triggering on one box but not the other?

Brian

@DrDaveD
Copy link
Contributor

DrDaveD commented Sep 7, 2023

I have looked a little into another timeout that @mambelli was experiencing. In his case he was using a strange test network setup that ended up not being able to get a response from the DNS the first couple of tries, each of which timed out after 2 seconds. We found out that the scitokens timeout was only 4 seconds, which @djw8605 confirmed. That issue was worked around by updating /etc/resolv.conf to prioritize a different DNS which could respond right away.

@steve8x8
Copy link
Author

steve8x8 commented Sep 8, 2023

Wow, what a lot of responses ;-) Let me work my way through them.

@jbasney I've kept straces from a successful and a failed attempt (different hosts), and the failed one diverts right after closing the keycache database file, where I run into a long SIGPIPE loop:

socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, NULL, 8) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [5, 6]) = 0
mmap(NULL, 2101248, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x14cf91d58000
mprotect(0x14cf91d59000, 2097152, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x14cf91f57d30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1781962], tls=0x14cf91f58700, child_tidptr=0x14cf91f589d0) = 1781962
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, 8) = 0

I was wondering which pipe would be involved here, but perhaps that's a red herring.

@maarten-litmaath This seems to happen after accessing the keycache - which (of course, like all cluster users' homes) is on NFS.

@bbockelm Both commands seem to succeed:

$ curl -v https://cilogon.org/igwn/.well-known/openid-configuration
*   Trying 3.14.54.175:443...
* Connected to cilogon.org (3.14.54.175) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Illinois; O=University of Illinois; CN=cilogon.org
*  start date: Apr  3 00:00:00 2023 GMT
*  expire date: May  3 23:59:59 2024 GMT
*  subjectAltName: host "cilogon.org" matched cert's "cilogon.org"
*  issuer: C=US; ST=MI; L=Ann Arbor; O=Internet2; OU=InCommon; CN=InCommon RSA Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55f02f0f81e0)
> GET /igwn/.well-known/openid-configuration HTTP/2
> Host: cilogon.org
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200 
< content-type: application/json;charset=UTF-8
< date: Fri, 08 Sep 2023 10:48:06 GMT
< set-cookie: _cilogon_org_oauth2=4eb32aa3bdcdaa28; Path=/; HttpOnly; Secure
< strict-transport-security: max-age=63072000
< x-content-type-options: nosniff
< x-frame-options: DENY
< content-length: 1510
< 
{
 "authorization_endpoint": "https://cilogon.org/authorize",
 "registration_endpoint": "https://cilogon.org:443/oauth2/oidc-cm",
 "jwks_uri": "https://cilogon.org:443/oauth2/certs/igwn",
 "issuer": "https://cilogon.org/igwn",
 "token_endpoint": "https://cilogon.org:443/oauth2/token",
 "userinfo_endpoint": "https://cilogon.org:443/oauth2/userinfo",
 "introspection_endpoint": "https://cilogon.org:443/oauth2/introspect",
 "revocation_endpoint": "https://cilogon.org:443/oauth2/revoke",
 "revocation_endpoint_auth_methods_supported": ["client_secret_basic"],
 "device_authorization_endpoint": "https://cilogon.org/oauth2/device_authorization",
 "token_endpoint_auth_methods_supported":  [
  "client_secret_post",
  "client_secret_basic",
  "private_key_jwt"
 ],
 "subject_types_supported": ["public"],
 "code_challenge_method_supported":  [
  "plain",
  "S256"
 ],
 "scopes_supported":  [
  "org.cilogon.userinfo",
  "openid",
  "profile",
  "email",
  "offline_access"
 ],
 "response_types_supported":  [
  "code",
  "id_token"
 ],
 "grant_types_supported":  [
  "web",
  "urn:oa4mp:params:oauth:grant-type:token-info",
  "urn:ietf:params:oauth:grant-type:token-exchange",
  "refresh_token",
  "authorization_code",
  "urn:ietf:params:oauth:grant-type:device_code"
 ],
 "response_modes_supported":  [
  "query",
  "fragment",
  "form_post"
 ],
 "claims_supported":  [
  "sub",
  "aud",
  "iss",
  "exp",
  "iat",
  "email"
 ],
 "id_token_signing_alg_values_supported":  [
  "RS256",
  "RS384",
  "RS512"
 ]
* Connection #0 to host cilogon.org left intact
}

and

$ curl -v https://cilogon.org:443/oauth2/certs/igwn
*   Trying 3.16.221.198:443...
* Connected to cilogon.org (3.16.221.198) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Illinois; O=University of Illinois; CN=cilogon.org
*  start date: Apr  3 00:00:00 2023 GMT
*  expire date: May  3 23:59:59 2024 GMT
*  subjectAltName: host "cilogon.org" matched cert's "cilogon.org"
*  issuer: C=US; ST=MI; L=Ann Arbor; O=Internet2; OU=InCommon; CN=InCommon RSA Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e2786b01e0)
> GET /oauth2/certs/igwn HTTP/2
> Host: cilogon.org
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200 
< content-type: application/json;charset=UTF-8
< date: Fri, 08 Sep 2023 10:48:44 GMT
< set-cookie: _cilogon_org_oauth2=4eb32aa3bdcdaa28; Path=/; HttpOnly; Secure
< strict-transport-security: max-age=63072000
< x-content-type-options: nosniff
< x-frame-options: DENY
< 
{"keys": [
  {
  "n": "AKtyEMIR2_ymESOtrak2QJfdQvVTip-QoC_fIgorqO47p_iBkkrvwRdyQoi_MyNi-5nLQ1pPScaL3niLRvgzx11jY7GoIil2gMN8NhY6dX94dCQC3Qsvo5C3tvla1SkfEGO8y52z8biReVQZy8_Hdy6l__IGveR6aQmC1tKecf1e-kJROS46yA9qpsNEtkREdwZAD9tmrDgddbfXAcCUtfdc62dEZTryuDcJGoXEaWBAobVIGIE5e7IYjsiZz_KtIOgaDqoguXeyttZB1-g2ok8PT9S3Xdt3gP4Wy3PE_qO58UCAv5YejG6Uau8XyxKfxGQrPnXERVJl3PQF1NCJj-8",
  "e": "AQAB",
  "alg": "RS256",
  "kid": "2FEC35C60764D27594EBEC4D98824CD8",
  "use": "sig",
  "kty": "RSA"
 },
  {
  "n": "ALVOmsYrvRb8jzjyS99p2J22bP7Tvv9FGNzW62xDLKy77rVud0zMAUh7RJwUROmAoRoJ1soDIDXJH3-F4V_pZpI80_nGalXjX2SSbeNFtQVSilXdlS-CPwBE59A4lueDTvIg4925pI5PR50ShWQHMXnXU4xM5iFs9V0wtZsbKvs8Flsu2va_pnbhOb_3otEaz68iI2SI8srP-cz5e18l5SI-xtGie_IOWYkVN9Bs2O0ELcpI2hMT86m_dXJGUy5rMRpk6gTLGh3kis4wZN4OLrLvO_qTo3BU9I54nvCwiaQVWJwpZAq4Vf5jwXdrSCNnHLYOiPB1jix-Yor4JUJP8lk",
  "e": "AQAB",
  "alg": "RS384",
  "kid": "680182D6A0BF3BFF98E7DC211991EA44",
  "use": "sig",
  "kty": "RSA"
 },
  {
  "n": "AItqzHE1vr_intX7uYEz5wf7Ppn2FSJUdhDn-GNu4Pg6MWUOB9JV89M90a2v9xHD-mHLhVgiqiR5VBz7o6Oo75x5zREB_UauYO3P5TnJEREXAMGc_ZWQiXhHr9LrShUYBcEP2DB1xpibXnq2m6pCvK9bSCVxyMZFtTHKGes7cQV6KTUd5S4ujAa7EQgccf6WXTUu43h1_VRN9NxJcFgkD8C4JKJPz1jYmNP7neL3qQmVIBRurA_gMkXP2ipT--7pN3PsYMKdE6XhNhbbCUJMrNoX7cT-_urvLXYAjHfplch3VrRANgr2auCQ2DtOJb9bztaTdB0A14Imdvujbi1ISSc",
  "e": "AQAB",
  "alg": "RS512",
  "kid": "90387F85055ECED268E374A28414A056",
  "use": "sig",
  "kty": "RSA"
 }
* Connection #0 to host cilogon.org left intact
]}

@DrDaveD I'll look into that - indeed the node in question is behind yet another gateway, which may have an effect on response times. Would there be a way to increase the timeout?

Some of you have already guessed that I'm trying to use htgettoken with the cilogin.org/igwn issuer, and I'm apparently lacking some basic understanding of what's going on: htgettoken will tell me that there are files below /tmp and /run but the keycache isn't mentioned, and I probably did things the wrong way for some time. What I have found now, after leaving a token over night (they use to expire after 3 hours, right?):

  • Running htgettoken on the "failing" node would kind of refresh the token, but htdecodetoken (let's settle on that one since it wraps around scitokens-verify anyway) would present a timeout. Running the same htgettoken on the headnode would again kind of refresh the token, leaving the decoded content unchanged but make htdecodetoken succeed (not timeout) on the execute node, at least for a while.
  • Probably tokens are meant to be created only on submit nodes (as the dependencies on the igwn-iam package suggest), and I should completely forget about running htgettoken on an exec node?
  • The final frontier would be getting CVMFS work with scitokens but I'm getting some strange failures on the exec node (as root, everything looks reasonable - as mere user, I get "Permission denied" despite having a matching token in the keycache). Is this whole stuff documented somewhere? Right now I'm blaming myself for doing stupid things out of ignorance...

Somehow I'm not getting the knack of local files (the ones in /tmp and /run/user/$id) and the global keycache which seems to be able to hold only one token at a time (per scope / issuer / ... ?), and what gets written and read where...

Pardon me for being incredibly stupid!

@DrDaveD
Copy link
Contributor

DrDaveD commented Sep 8, 2023

@DrDaveD I'll look into that - indeed the node in question is behind yet another gateway, which may have an effect on response times. Would there be a way to increase the timeout?

There is not currently a way to increase the timeout, it is a compiled-in limit only. That would probably be a good feature request, however.

Some of you have already guessed that I'm trying to use htgettoken with the cilogin.org/igwn issuer, and I'm apparently lacking some basic understanding of what's going on: htgettoken will tell me that there are files below /tmp and /run but the keycache isn't mentioned, and I probably did things the wrong way for some time. What I have found now, after leaving a token over night (they use to expire after 3 hours, right?):

htgettoken does not mention the scitokens keycache because it does not use any scitokens code (it just passes through those tokens, it does nothing to validate them). The only thing in the htgettoken package that uses scitokens is htdecodetoken, and only via the scitokens-verify command if that happens to be installed. So if that's the only place you're using scitokens code you could ignore the error. On the other hand, the scitokens library is used by other things, for example cvmfs_scitoken_helper from the cvmfs-x509-helper package, so if you try to access protected ligo/igwn CVMFS repositories on the same machine, it may also experience the timeout.

Yes the cilogon.org/igwn token issuer is configured to generate tokens that expire in 3 hours.

* Running `htgettoken` on the "failing" node would kind of refresh the token, but `htdecodetoken` (let's settle on that one since it wraps around `scitokens-verify` anyway) would present a timeout. Running the same `htgettoken` on the headnode would again kind of refresh the token, **leaving the decoded content unchanged** but make `htdecodetoken` succeed (not timeout) on the execute node, at least for a while.

The vault server that htgettoken reads from caches those bearer tokens until --minsecs seconds (default 60) before they expire. That's why you see no change in the decoded content. The fact that there was no timeout the next time I'm sure was unrelated to the fact that you ran htgettoken again, but it could very well have been related to the fact that you ran htdecodetoken again because the scitokens keycache would have caused it do fewer network accesses the second time.

* Probably tokens are meant to be created only on submit nodes (as the dependencies on the `igwn-iam` package suggest), and I should completely forget about running `htgettoken` on an exec node?

I'm not familiar with the igwn-iam package you are referring to but yes, htgettoken is expected to be used only on submit nodes and not on execute nodes.

* The final frontier would be getting CVMFS work with scitokens but I'm getting some strange failures on the exec node (as root, everything looks reasonable - as mere user, I get "Permission denied" despite having a matching token in the keycache). Is this whole stuff documented somewhere? Right now I'm blaming myself for doing stupid things out of ignorance...

I believe you are confused about what the scitokens keycache is. It is only used for validating the signatures on JSON Web Tokens (JWTs -- scitokens are JWTs), so it caches the public keys of the token issuers. It does not cache individual JWTs.

There is not good documentation on this stuff that I can point you to, unfortunately. However I can tell you the best way to debug CVMFS access with scitokens is to set CVMFS_DEBUGLOG=/tmp/somefile.log in an /etc/cvmfs/config.d/<repo>.local file where <repo> is the name of the repository that you want more info on. Then after attempting to access it look toward the end of /tmp/somefile.log.authz to see what detailed messages are there.

Somehow I'm not getting the knack of local files (the ones in /tmp and /run/user/$id) and the global keycache which seems to be able to hold only one token at a time (per scope / issuer / ... ?), and what gets written and read where...

The file in /tmp is a vault token. That provides authenticated access to the vault server. The file in /run/user is the bearer token, stored under the WLCG Bearer Token Discovery standard. The scitokens keycache only caches the public key of the token issuer.

@steve8x8
Copy link
Author

steve8x8 commented Sep 8, 2023

@DrDaveD thanks for the explanation (and the confirmation of lack of docs)! I found that the failing node was using a rather long list of name servers in /etc/resolv.conf, most of them pointing to DNS proxies in the internal network of the pool. Since the setup was changed after installation, to add a gateway, this is no longer necessary and I can use "our" official DNS server - response times are much shorter now and I haven't seen another timeout yet. Will continue testing next week. Thanks again!

@steve8x8
Copy link
Author

Now that "everything seems to work" (iow, the small issue has been fixed to make room for the next, possibly bigger, one), I'm wondering -

  • while with X.509 authentication, the validity of a certificate could be checked locally, without network access (using the issuer data present in /etc/grid-security/certificates),
  • with SciTokens, it seems to always need an internet connection to the issuer to verify ("deserialize") a token that otherwise is protected against tampering with a signature. To me this looks like a regression, and prone to DoS attacks. (Actually, DNS request timing out already was kind of DoS...)
    Is this a misunderstanding on my side, or do I just miss the bigger idea behind it?

@DrDaveD
Copy link
Contributor

DrDaveD commented Sep 11, 2023

You are correct that there is a network connection needed to verify the first token, but after that the verifications should be able to be done without the network using the local key cache. I don't think that is a significant issue, however. Network is required for so many different things, this is a very small network access, and there are so many different ways to DoS grid workflows that I don't think that adding one more really changes anything.

@jbasney
Copy link
Member

jbasney commented Sep 11, 2023

related to #97

@bbockelm
Copy link
Contributor

In addition to what @DrDaveD mentions, I'd point out the public keys don't necessarily need to be hosted at the same location as the issuer itself. It's common to see issuers use a hosting or CDN service (e.g., CloudFlare) that are going to be fairly impervious to a DDoS.

The tradeoff here is the system is much closer to a "fail safe" than a "fail open". In case of a trust root compromise, trust in the keys can be pulled centrally; after a fixed amount of time, you've automatically revoked all signed credentials.

@steve8x8
Copy link
Author

steve8x8 commented Sep 25, 2023

To summarize, the timeout was caused by a DNS server not responding - it takes ten seconds to move on, but a second server, although listed in /etc/resolv.conf, wasn't considered anymore.

I think a 4-second timeout is too short; 15 seconds would at least allow a second DNS to jump in. But I've got to believe the current timeout had been chosen intentionally, so I've got to keep this in mind. (It would have been helpful if the error message would have told me what exactly was timing out...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants