Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: requirement for having reasonable timeouts for (external) services #1778

Open
elarlang opened this issue Nov 8, 2023 · 25 comments
Assignees
Labels
1) Discussion ongoing Issue is opened and assigned but no clear proposal yet next meeting Filter for leaders V12 _5.0 - prep This needs to be addressed to prepare 5.0

Comments

@elarlang
Copy link
Collaborator

elarlang commented Nov 8, 2023

update: to skip long discussion, jump here: #1778 (comment)


Via #1777 we have proposed new category with 2 requirements:

## V12.7 Application Resources

# Description L1 L2 L3 CWE
12.7.1 [ADDED] Verify that the application is designed in a way that a failure to access external resources does not result in the entire application failing, for example using the circuit breaker pattern.
12.7.2 [ADDED] Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them to prevent resource exhaustion. 404

For 12.7.1 there is issue #1748 and for 12.7.2 there is issue #1620.

At the moment it seems we need one more requirement which addresses timeout for connections:

  • 12.7.1 is for (fail safe) "when you can not connect at all"
  • 12.7.2 is for "when everything worked well, but you did not release the resources"

The point for proposed requirement - "when you are able to connect, but you can not get answer with expected time". Personally I think the likelihood for this scenario is much higher than for the proposed 12.7.1 and 12.7.2.

From denial of service point of view - it matters when it is not done for synchronous flow, such as calling some external connection / resource when building response for HTTP request.

It can be problematic for cron-jobs as well, but there it does not increase the likelihood for DoS that much.

@elarlang elarlang added the 1) Discussion ongoing Issue is opened and assigned but no clear proposal yet label Nov 8, 2023
@elarlang elarlang added the V12 label Nov 8, 2023
@tghosth
Copy link
Collaborator

tghosth commented Nov 8, 2023

I think this should be added to 12.7.1:

# Description L1 L2 L3 CWE
12.7.1 [ADDED] Verify that the application is designed in a way that a failure to access external resources or a delay in response does not result in the entire application failing, for example using the circuit breaker pattern.

@elarlang
Copy link
Collaborator Author

elarlang commented Nov 8, 2023

12.7.1 - the focus is - when connection fails, then application should still give at least some answer - so the situation must be handled via error handling

the focus for proposed one is connectionpool lockup - every connection must have (relatively short) timeout, maybe also connection pool+queue and when it's full - it already gives response that service is not available, not putting another request to the queue and cause denial of service

Idea is to address things like:

  • CWE-410: Insufficient Resource Pool
  • CWE-1088: Synchronous Access of Remote Resource without Timeout

Here also extra logging requirement that all connection fails and timeouts must be logged.

@tghosth
Copy link
Collaborator

tghosth commented Nov 9, 2023

12.7.1 - the focus is - when connection fails, then application should still give at least some answer - so the situation must be handled via error handling

I don't agree. The focus on making sure the application still operates correctly that could be providing a sensible error message but also carrying on other operations as well.

Please can you suggest a requirement text, I am still not seeing the difference you are proposing.

@elarlang elarlang added the 2) Awaiting response Awaiting a response from the original poster label Nov 9, 2023
@elarlang
Copy link
Collaborator Author

Maybe there are even 2 requirements.

Connection pool:

  • Problem to solve: if by default the connection pool to resource is really small, it will be bottle-neck for peak demand, all the connection pool is locked and the application can not provide availability (denial of service attack).
  • Requirement: Verify that the application has reasonable connection pool for every internal and external connection to handle peak demand.
  • CWE: CWE-410

Timeout:

  • Problem to solve: if there is problem on the external resource side or there is connection issue or there is really slow query (impact by attacker activity, mistake in program code, heavy demand, performance issues) then there is no connection pool lockout.
  • Requirement: Verify that the application has set reasonable timeout for every synchronous call to external service (to avoid connection pool lockout when there is connection or performane problem)
  • CWE: CWE-1088

@elarlang elarlang removed the 2) Awaiting response Awaiting a response from the original poster label Nov 11, 2023
@tghosth
Copy link
Collaborator

tghosth commented Nov 14, 2023

Time out / CWE-1088

We currently have:

# Description L1 L2 L3 CWE
12.7.1 [ADDED] Verify that the application is designed in a way that a failure to access external resources does not result in the entire application failing, for example using the circuit breaker pattern.

The whole point of circuit breaker pattern is to prevent this sort of lockout/lockup when relying on external resources. I think maybe we should just add CWE 1088 to this requirement. Your text seems like an example of the situation covered by this requirement.

Connection pool / CWE-410

We currently have:

# Description L1 L2 L3 CWE
12.7.2 [ADDED] Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them to prevent resource exhaustion. 404

I think your connection pool is similar case to this. but maybe we can add it more explicitly like this:

Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them or alternatively uses an appropriately sized resource pool to ensure resource reuse and to prevent resource exhaustion.

What do you think @elarlang ?

@elarlang
Copy link
Collaborator Author

elarlang commented Nov 14, 2023

If you have circuit braking handled (application does not die from program code logic perspective / the error is handled) - it does not eliminate the need to have timeouts set (and vice versa). Those are separate problems.

With second example for me it also combined two different problems to one requirement.

In a bit bigger picture - we created new separate section for those requirements, modern applications use external services a lot and those things getting more and more important. Shorter, one problem per requirement, requirements are better in my opinion.

@jmanico
Copy link
Member

jmanico commented Nov 14, 2023

I personally this this is not necessary and see it more as a functionality vs security problem. Suggest we drop it.

@elarlang
Copy link
Collaborator Author

I personally this this is not necessary and see it more as a functionality vs security problem. Suggest we drop it.

If we talk about timeouts here - I have proved in pen-test cases it is clearly security issue. If there is some time consuming call to external service, you find the really slow response response build, hammer it, take all resources in use because they don't have reasonable timeout and it's service down. Clear availability issue.

@elarlang
Copy link
Collaborator Author

App level DoS is serious thing. Especially, when there is high SLA and availability requirement and/or it provides service to another applications. For me this vector is underestimated in general and also maybe under-covered in ASVS.

Everyone talk about DDoS like just sending huge amount of packets towards application. This is just stupid way to do it. Mostly it's doable with pushing some resource demanding and/or time-consuming service and you can take the service busy. From big picture network level you can not even notice it.

@elarlang
Copy link
Collaborator Author

I try to conclude the situation here:

  • 12.7.1 - circuit braking - application must still give reasonable response, if there is issue with connection (it suits also to category 7.4)
  • 12.7.2 - release resources after finishing using them
  • +1 - have timeouts set for avoiding app level dos or collection dying connection to resource pool lockout
  • +1 - have reasonable connection pool set that you can not occupy all connection with dos or it just won't happen with peak visiting

Those are all separate problems. If you just fix 3 of them, the 4th one is still an issue (for me personally 12.7.2 is the less problematic here, but it's separate topic)

@tghosth
Copy link
Collaborator

tghosth commented Nov 16, 2023

I think I could get on board with the connection pool requirement.

Maybe:

Verify that the application uses appropriately sized resource pools or can dynamically resize them to ensure that a volumetric Denial of Service or peak-time load will not lead to the application becoming unavailable.

On the other hand, I think the circuit breaker pattern comes to specifically address the timeout problem to the extent that after a while it stops trying altogether

We could modify to:

Verify that the application is designed in a way that a failure to access external resources within pre-defined timeouts does not result in the entire application failing, for example using the circuit breaker pattern.

@elarlang
Copy link
Collaborator Author

Verify that the application uses appropriately sized resource pools or can dynamically resize them to ensure that a volumetric Denial of Service or peak-time load will not lead to the application becoming unavailable.

For dynamic size there should be also max-limit. If you have "auto-scale" cloud environment and attacker takes resource pool in use, then this constant auto-scale you have later written to the bill as well.

Like I wrote before, then circuit braking must only handle the fact that application can give response, if there is whatever connection issue.

@tghosth
Copy link
Collaborator

tghosth commented Nov 23, 2023

Like I wrote before, then circuit braking must only handle the fact that application can give response, if there is whatever connection issue.

Following discussion with Elar, the current "circuit breaker" requirement is:

Verify that the application is designed in a way that a failure to access external resources within pre-defined timeouts does not result in the entire application failing, for example using the circuit breaker pattern.

This is a slightly reactive mechanism that is basically telling the application to fail gracefully without completely becoming unavailable.

Elar's point is that there should be proactive use of timeouts to ensure that a large/complicated query will not run for longer than the timeout and afterwards the resource will be freed.

A caveat is that even if timeouts are implemented in the application, the database may still be processing the query and therefore tied up executing it.

@elarlang
Copy link
Collaborator Author

elarlang commented Dec 15, 2023

Just for linking update - 12.7.2 in the issue definition is now 12.7.1, and previous 12.7.1 is now 7.4.4 (see e5c4a4b).

At the moment it stands like this:

# Description L1 L2 L3 CWE
7.4.4 [ADDED] Verify that the application is designed in a way that a failure to access external resources does not result in the entire application failing, for example using the circuit breaker pattern.
12.7.1 [ADDED] Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them to prevent resource exhaustion. 404

@tghosth
Copy link
Collaborator

tghosth commented Dec 28, 2023

So did you still want to add another requirement here?

@elarlang elarlang added the 3) awaiting proposal There is some discussion in issue and reach to some results but it's not concluded with clear propos label Dec 28, 2023
@tghosth
Copy link
Collaborator

tghosth commented Jan 24, 2024

@elarlang is this on your todo list?

@tghosth tghosth added the _5.0 - prep This needs to be addressed to prepare 5.0 label Jan 24, 2024
@elarlang
Copy link
Collaborator Author

@elarlang is this on your todo list?

yes. I keep labels (at least those that are related to me) up-to-date, so if labels are saying so, it is so.

@jmanico
Copy link
Member

jmanico commented Jul 29, 2024

I like the idea of moving 7.4.4 to 12.7.2

So far we have

# Description L1 L2 L3 CWE
12.7.1 [ADDED] Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them to prevent resource exhaustion.   404

Then moving 7.4.4 we have

# Description L1 L2 L3 CWE
12.7.1 [ADDED] Verify that the application proactively releases system resources, such as database connections, open files, threads, etc, when it finishes using them to prevent resource exhaustion.   404
12.7.2 [MOVED from 7.4.4] Verify that the application is designed in a way that a failure to access external resources does not result in the entire application failing, for example using the circuit breaker pattern.  

@jmanico
Copy link
Member

jmanico commented Sep 4, 2024

I think 12.7.1 is more of an API issue and does not belong in 12.x. I suggest this is moved to 13.x. Timeouts are rarely a problem with file upload services or file download and I think it confuses the intent of this section.

@elarlang
Copy link
Collaborator Author

elarlang commented Sep 4, 2024

Comment for last Jim's comments:

  • note that 7.4.4 was moved away from this 12.* section and is not proposed as an idea to move it back (it is about error situation handling in the program code)
  • V12 title is "File and Resources" and V12.7 title is "Application Resources". If there is proposal to move 12.7 to somewhere else, it is a separate issue to discuss.

I need to work through my previous ideas to provide proposals.

@jmanico
Copy link
Member

jmanico commented Sep 5, 2024

I like the idea of dropping 12.7.1 and all of section 12.7 from v12. Let's do it! v12 is close to being "done" at some level.

@tghosth
Copy link
Collaborator

tghosth commented Sep 8, 2024

hi @elarlang, can you confirm what the currently proposed action here is?

@tghosth
Copy link
Collaborator

tghosth commented Nov 5, 2024

Waiting for feedback from @elarlang

@tghosth tghosth added 2) Awaiting response Awaiting a response from the original poster and removed 1) Discussion ongoing Issue is opened and assigned but no clear proposal yet 3) awaiting proposal There is some discussion in issue and reach to some results but it's not concluded with clear propos labels Nov 5, 2024
@cronchie
Copy link
Contributor

cronchie commented Nov 7, 2024

I'd be open to moving/modifying the current 12.7.1 into the proposed text of #2281. The circuit breaker pattern is a fantastic retry strategy, and coupled with the backoff + jitter proposed in #2281, it can help protect against DoSing external applications.

@tghosth tghosth assigned cronchie and unassigned elarlang Nov 7, 2024
@elarlang elarlang self-assigned this Nov 10, 2024
@elarlang elarlang added 1) Discussion ongoing Issue is opened and assigned but no clear proposal yet next meeting Filter for leaders and removed 2) Awaiting response Awaiting a response from the original poster labels Nov 10, 2024
@elarlang
Copy link
Collaborator Author

elarlang commented Nov 10, 2024

Awakening the monster here...

I reworked comments from the issue and made a summary with some new point of views, we can say that this comment is a new starting point without the need to go through the previous comments here.

First:

  • "circuit breaker" fail-safe is covered by 7.4.4
  • release resources after (successful) usage is covered by 12.7.1

We also have requirement for anti-automation that is closely related and maybe also covering some problems:

V11.2.2 Verify that anti-automation controls are in place to protect against excessive calls to application functions that could lead to data exfiltration, garbage data creation, quota exhaustion, rate limit breaches, denial of service, or overuse of costly resources.

The issue is, that connection-pool and resource-demanding or slow-responses you often run under the general "anti-automation" radar. Most likely we need to improve V11.2.2, avoid duplicates, check overlaps etc.

Problems to solve/cover:

  • Problem 1: The connection pool is too small
    • Example: gets filled when attacker hammering some resource-heavy functionality
    • Example: gets filled by peak-demand
    • area/solution: general availability and performance testing area; configuration, and auto-scale question - auto-scale must be with limit, otherwise the bill also auto-scales. May require documented decision per connection/service used
  • Problem 2: connection failure time-out is too long
    • Example: service (to connect) is not available and the application waits for too long
      • classical solution - waiting for timeout 1 minute, although it is clear, that if the connection is not successful in a few seconds, it is failed anyway
      • attacker can fill connection pool fast
      • for heavily used functionality connection pool gets filled without an attacker
    • area/solution: for just connectivity, it fails fast. May require documented decision per connection/service used
  • Problem 3: slow-response time-out is too long
    • complicated area, as sometimes services are resource and time-demanding to produce a response
    • Example: response from (an external) service takes too long but without reaching the timeout
      • an attacker can hit the endpoint and keep the server busy and the connection pool full
      • may not be only a connection pool problem, as resource-demanding responses may occupy other available resources, such as memory
    • Example: response from (an external) service takes too long and front-end connection gets timeout - but backend still processing the request, although there is no one waiting for the response anymore
      • classical example - slow database query, front-end gets timeout but database engine still processes the query, but no one waits for the answer
      • easy-dos scenario for an attacker
      • may happen under heavy usage
    • area/solution:
      • The fail-strategy must be analyzed per service used (can be different for different services using the same connection)
      • general definition for slow-queries AND if building some response is resource demanding, it must happen asynchronously with parallel-process count limit
      • Definition for slow-query per functionality/service call - to have alerting and monitoring capability. From "a security event" point of view is covered by V1.7+V7 requirements, but the problem or need can be highlighted in a documentation requirement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1) Discussion ongoing Issue is opened and assigned but no clear proposal yet next meeting Filter for leaders V12 _5.0 - prep This needs to be addressed to prepare 5.0
Projects
None yet
Development

No branches or pull requests

4 participants