Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High passive CPU utilization #274

Closed
ngocphamm opened this issue Apr 24, 2020 · 28 comments
Closed

High passive CPU utilization #274

ngocphamm opened this issue Apr 24, 2020 · 28 comments

Comments

@ngocphamm
Copy link

Just a question because I don't even know what defines "high" here. I just notice that AltTab is constantly among the top 3-5 processes that use CPU, and the percentage for it is also always around 8% like in the screenshot. Of course it will not when I play video, or run virtual machines, etc... but I'm wondering maybe something can be optimized here 🤔

2020-04-24 at 10 26 AM

@lwouis
Copy link
Owner

lwouis commented Apr 24, 2020

Hey @ngocphamm! Thank you for this feedback!

We had issues in the past with high passive CPU usage like this: #170. AltTab, under normal circumstances should use 0% CPU or close to that, maybe spiking at 0.1% if there is OS activity it is observing.

Here is the cause, and why it's such a problem to avoid: we use macOS Accessibility APIs to observe the system. Observations such as "a new window was created", "a window was resized", "a window had its title changed". We use these events to build a history of the windows. However, these APIs are ancient, poorly documented, and full of bugs.

There are 2 levels of observing: processes, and windows. The "window was created" event comes from observing a process for instance, but then the "window was destroyed" event comes form observing the window, not the process.

Here we try to subscribe, using the AXObserverAddNotification OS API:

https://github.com/lwouis/alt-tab-macos/blob/master/src/api-wrappers/AXUIElement.swift#L102

Depending on the code we get in return, we either retry to subscribe, or we stop:

https://github.com/lwouis/alt-tab-macos/blob/master/src/api-wrappers/AXUIElement.swift#L114

The issue is with some processes that take a long time to be subscribe-able to. We had the example of #182 which had a process that would literally never be ready, which is not normal, and for which I added a hardcoded exception. But there are legitimate cases like launching the GIMP app. You can imagine Photoshop or Unity or Maya too. These things can take up minutes to finish launching, until which point the subscription will finally succeed.

Because of that, if we see a new process, and try to subscribe to its events, if it fails we must retry on a loop, which will raise the CPU. There are ways to limit CPU usage, but they all have trade-offs:

  • Space the attempts with more delay (current delay is every 10ms). Here we may subscribe after the first windows get spawn by the app, thus not knowing they exist and not showing them in AltTab. For that one, a workaround I added was to re-ask for all open windows when we successfully subscribe:

self.observeNewWindows()

If we space the attempts more, let's say every 1min, then the user can have the app finally finish starting, and it's not showing in AltTab for 1 min.

  • Stop trying after some time as passed. This would prevent cases where we wait forever because some toxic process never finishes launching. Obvious downside is that if you actually have an app legitimately take let's say 5min to launch, it will not be tracked by AltTab then if we give up after 4min let's say.

I would guess that on your machine, you have some process never finishing to launch, and AltTab and looping on trying to subscribe to its accessibility events, never succeeding, and generating CPU usage infinitely. The more processes like this, the more retry loops happening.

You can verify that easily by going into Activity Monitor, selecting AltTab in the list, then clicking View > Sample Process. You should see a bunch of time spent in the functions above I shared.

Now to get to the bottom of this, you need to run the app locally, and check the values of these arrays: Applications.appsInSubscriptionRetryLoop and Windows.windowsInSubscriptionRetryLoop. Inside you will see the process IDs of the problem processes. From these IDs, you can go into Activity Monitor, and find out which processes are buggy.

I hope these explanations help. If you find out which processes are the issue, I can attempt to reproduce the issue on my machine by installing the apps, the way I did with this Octave.app app, and either hardcode an exception for the specific app, or find a potential universal solution, but once again I don't think it's possible without a trade-offs were we don't show some slow-to-start apps.

Please let me know! :)

@lwouis lwouis changed the title CPU usage a bit high? High passive CPU utilization Apr 24, 2020
@ngocphamm
Copy link
Author

Thanks for the very detailed explanation of this issue @lwouis!

I tried to do the Sample Process, but unfortunately I don't quite understand the output of it. If you want I can send you the dump file but let me do the stuff below (after I understand what exactly I have to do there) first.

What do you mean by this? How can I check the arrays?

Now to get to the bottom of this, you need to run the app locally, and check the values of these arrays: Applications.appsInSubscriptionRetryLoop and Windows.windowsInSubscriptionRetryLoop.

@lwouis
Copy link
Owner

lwouis commented Apr 25, 2020

Yeah you can just share the sample output here.

For the arrays, it's about getting the values of what's inside. You need to change the code and run the app locally to kind of experiment like this. Do you have XCode?

Assuming you can run the project locally (see https://github.com/lwouis/alt-tab-macos/blob/master/docs/CONTRIBUTING.md), you can add line like:

debugPrint("ngocphamm", Applications.appsInSubscriptionRetryLoop, Windows.windowsInSubscriptionRetryLoop)

On the line after:

https://github.com/lwouis/alt-tab-macos/blob/master/src/ui/App.swift#L157

Then after running the app, you will see logs starting with "ngocphamm", showing you the contents of the arrays.

@ngocphamm
Copy link
Author

This is the link for the sample file https://send.firefox.com/download/381922b0eca6f63c/#wFCgumzYcnqbL_wyxIA9bQ

And sorry I don't even have XCode installed. Would there be anyway you can make a test/dev copy of the app with the debugging line enabled?

By the way I just restarted the laptop and it doesn't look like AltTab is using a lot of CPU any more. Both % CPU and CPU Time reported is very low right now (% CPU was around 8-9% like I showed in the first post, and its CPU time was on the top 5 processes, just after kernel_task, WindowServer, which are normal, and MS Remote Desktop, which I use a lot for work). I will see if it comes back up after a while or not.

@lwouis
Copy link
Owner

lwouis commented Apr 25, 2020

Would there be anyway you can make a test/dev copy of the app with the debugging line enabled?

Of course! Here is a build where I added the code I mentioned: AltTab.app.zip

You can just launch the app, and open Console.app on your mac. Then search for "ngocphamm" and any time you invoke AltTab UI, it will print the 2 arrays containing the apps and windows that are being tried to subscribed to in a loop. You can then use the number at the beginning of these to identify in Activity Monitor which processes are being a problem.

Here is what Console.app shows for me (2 empty arrays, as all processes are successfully subscribed to on my machine):

image

This is the link for the sample file https://send.firefox.com/download/381922b0eca6f63c/#wFCgumzYcnqbL_wyxIA9bQ

Thank you! It seems you selected the "Sample Text" in the dropdown at the top, after sampling. I would need the "Percent of Parent" though to make a better assessment of what is taking the CPU time. Here is what it looks like if I run it right now, for instance:

image

You can see a sampling of the method calls, and where time is spent. When your CPU was running high when AltTab was idle, I expect to see all the time spent in the AXUIElementRef.subscribeWithRetry method, as it would be looping there every 10ms, 5 times per process, adding up quickly to noticeable CPU usage.

@ngocphamm
Copy link
Author

Thanks a lot @lwouis! I've started running the custom build of the app. Right now it's not showing anything in the arrays and is using very little CPU as expected. I will keep it running and in case I notice the consistently high CPU usage, I will do the sample process again and let you know about that, and the arrays.

Thanks for your help with this. It's truly appreciated!

@ngocphamm
Copy link
Author

I think I'm seeing this behavior again @lwouis. Unfortuately I have no longer ran your custom version to see the debugging arrays 😞 I didn't see the behavior for a few days after running that custom version you sent me so I updated it to be the latest version with a few nice fixes (thumbnail memory, and full arrow keys support).

Please find the sample file here https://send.firefox.com/download/6b2341bec71c27d0/#7BRsY-L5kmy4rVZm2_AAQg

If you need me to run the debugging version again, please let me know!

@lwouis
Copy link
Owner

lwouis commented May 4, 2020

Please find the sample file here

The logs you shared are still in the wrong format 😅See my instructions here. I still see a lot of retries, so even though I don't have the percentages, I'm assuming the issue is the AX subscription retry loop.

If you need me to run the debugging version again, please let me know!

If you run the debug version now, does it also get to higher CPU? It's worth a try if you haven't.

Basically I need to know which process are failing to get subscribed to. It's the only way I can imagine to fix it. As I said I blacklisted a faulty process in Octave.app last time. Either I blacklist more, since in your case, some other process is bugged, or we give up after N attempts, but then we risk missing out on legit slow processes like Gimp/Maya/etc.

@ngocphamm
Copy link
Author

Sorry but I'm a bit confused. What should I do after I see this screen? I selected "Percent of Parent" and hit "Save..." button to get you the log file. Was that not what you wanted?

2020-05-04 at 2 20 PM

I can run the debug version now, but it won't get this behavior probably until a few days later (what is happening is X days since the last update of AltTab), or maybe until the culprit app(s) got launched. I will run the debug version after I get to know how to properly get the sample file for you.

@lwouis
Copy link
Owner

lwouis commented May 5, 2020

Oh... then it means the export can't show the percentages I guess. Cause if you open the export file, you'll see there no percentage, so hard to know what's taking CPU time. Anyway on your screenshot I see that's it's the subscribeWithRetry as expected.

Next step has to be to identify the processes that can't be subscribed to. I thought about it, and decided that it may be generally useful to include info regarding these subscription retry loops in the debug profile attached when you send feedback through the in-app form.

I'll release an update that contains that change in a few minutes. This means that in the future, when you see AltTab using higher CPU than usual, you'll just have to click:

image

And that will open a ticket here with info on the problem apps or windows that fail to get subscribed to 👍

@ngocphamm
Copy link
Author

Nice. I just updated to latest version 3.17.2, and I will send the feedback the next time I notice the CPU spikes. Thank you!

@lwouis
Copy link
Owner

lwouis commented May 6, 2020

I'm wondering if a 5min timeout wouldn't be a pragmatic way to deal with this issue. After all, if some app needs more than 5min to launch, it could be considered a bug on their side. I can still imagine viable use-cases though like opening a huge Excel or Tableau file, or something that needs the network to start.

The issue is that there processes declare themselves as front, background, or daemon. We try to subscribe only to front processes as they could spawn windows. However, if an app is incorrectly written and has a background process declared as front, and that process can never be subscribed to, we will loop infinitely, causing high CPU.

It's really an issue on these third-party apps end (see #182), but the user is caught in the middle, and i'm not sure which side we should lean on: timeout and potentially have some apps not listed, or (current way) try indefinitely but wasting CPU if the third-party app is not letting us subscribe to it.

@ngocphamm
Copy link
Author

IMO, timeout would be good for 99% of the users. Would be nice to have a configurable timeout value so anyone who really wants to wait can extend the timeout.

@lwouis
Copy link
Owner

lwouis commented May 6, 2020

@ngocphamm I think I'll add a timeout soon then. Before that, however, I would like to see on your system what's triggering the issue. Because by adding a timeout, we may be side-stepping a bug that's on AltTab's side, not on the third-party side. I wish I had more data from users seeing above 0.1% CPU usage when idle.

@ngocphamm
Copy link
Author

ngocphamm commented May 6, 2020

Yup I will send a feedback with debug info as soon as I notice the high CPU usage! It might take days for it to get there, though, so please bear with me.

@ngocphamm
Copy link
Author

Just reported! #302

@lwouis
Copy link
Owner

lwouis commented May 7, 2020

@ngocphamm it's trying to subscribe to a window belonging to the "Location Menu" process. This seems like it's the OS process to track location. Maybe the report in #302 didn't reflect the actual issue. Could you try sending another one or two? If the CPU usage is still high, that is

@ngocphamm
Copy link
Author

Sorry I just updated the app again seeing the new version fixed the issue when it didn't show all the windows from other spaces, so it doesn't use up CPU for now. Is it still okay submitting the report at this normal state?

@lwouis
Copy link
Owner

lwouis commented May 7, 2020

so it doesn't use up CPU for now

That's interesting. I would guess that this window should never have been made. To get this subscription loop going, it's supposed to first pass the test of being an actual window. Given that its parent process is "Location Menu", I doubt it's it's a real window. I'm curious to how it was deemed valid earlier, thus the loop, but after a restart, it's not deemed valid.

Is it still okay submitting the report at this normal state?

I mean do it why not, it doesn't hurt, but if the CPU is low then it shouldn't prove useful. Let's see it

@ngocphamm
Copy link
Author

Also my guess on this is Location Menu is the menu bar icon that shows up whenever an app request access to my location (which I have a few doing so I think). This is the Open Files and Ports for that process (still running). Also, I have an app called Bartender to hide/show certain menu bar items, so maybe it's adding to the issue?

cwd
/
txt
/System/Library/CoreServices/LocationMenu.app/Contents/MacOS/LocationMenu
txt
/Library/Preferences/Logging/.plist-cache.PlaQPw7o
txt
/usr/share/icu/icudt64l.dat
txt
/private/var/db/timezone/tz/2019c.1.0/icutz/icutz44l.dat
txt
/System/Library/Fonts/SFNS.ttf
txt
/System/Library/CoreServices/SystemAppearance.bundle/Contents/Resources/SystemAppearance.car
txt
/System/Library/CoreServices/SystemAppearance.bundle/Contents/Resources/DarkAquaAppearance.car
txt
/System/Library/CoreServices/SystemAppearance.bundle/Contents/Resources/DarkAppearance.car
txt
/System/Library/Keyboard Layouts/AppleKeyboardLayouts.bundle/Contents/Resources/AppleKeyboardLayouts-L.dat
txt
/System/Library/Caches/com.apple.IntlDataCache.le.kbdx
txt
/usr/lib/libobjc-trampolines.dylib
txt
/System/Library/CoreServices/SystemAppearance.bundle/Contents/Resources/FunctionRowAppearance.car
txt
/private/var/folders/q4/2gt3yksn56dfc6rxcws07c0c0000gn/0/com.apple.LaunchServices.dv/com.apple.LaunchServices-1082-v2.csstore
txt
/usr/lib/dyld
0
/dev/null
1
/dev/null
2
/dev/null

so it doesn't use up CPU for now

That's interesting. I would guess that this window should never have been made. To get this subscription loop going, it's supposed to first pass the test of being an actual window. Given that its parent process is "Location Menu", I doubt it's it's a real window. I'm curious to how it was deemed valid earlier, thus the loop, but after a restart, it's not deemed valid.

Yeah the app is always back to normal after a restart (of the app), by updating it. The issue comes back at some random times. Last time was after a few days. This time was under a day (updated the app yesterday too).

Is it still okay submitting the report at this normal state?

I mean do it why not, it doesn't hurt, but if the CPU is low then it shouldn't prove useful. Let's see it

Just did #303

@lwouis
Copy link
Owner

lwouis commented May 7, 2020

Yeah #303 is not very useful as there is not much CPU usage and no subscription retries.

In #302, we see:

{wid: 0, title: Item-0, ownerPID: 493, ownerName: Location Menu, layer: 25}

wid (window ID) being 0 is very suspicious. Also I don't think Location Menu spawns real windows, so why are we trying to sub to this. I checked in the code, and there are only 3 places where we construct a window. All 3 are preceded with a check that the AXUIElement is an actual window.

I don't get how this seemingly broken window got passed that test, and we started subscribing to it.

@ngocphamm
Copy link
Author

ngocphamm commented May 7, 2020

Yeah this is not something that I know about...

If this helps, according to System Preferences > Security & Privacy > Privacy tab, I'm having 5 apps that have access to location: Little Snitch Helper, Find My, Maps, Home, and Control Plane. Out of those 5, the location icon (that "indicates an app that has used your location within the last 24 hours") only shows for Maps.

And also those system services (with the icon shown)
2020-05-07 at 9 01 AM

@ngocphamm
Copy link
Author

Could it be Control Plane, as I found a "fix" for a problem I've had with it since Catalina 18 days ago (by giving it access to location), and I reported this issue (with AltTab) 13 days ago 🤔

dustinrue/ControlPlane#506

@lwouis
Copy link
Owner

lwouis commented May 7, 2020

Here is the plan I suggest: let's wait, and please post another report if you see high CPU again. Let's see then if it's the same thing happening.

In the meanwhile I will also pay attention to other tickets that people periodically open to discuss things, and check if they have high CPU. That may give more data points.

After a while, maybe 2-4 weeks, I'll add the timeout. At the point, if there are rare bugs, they will be hidden which is not ideal, but at least it will stop wasting CPU after the timeout, which is a big upside.

@koekeishiya
Copy link

wid (window ID) being 0 is very suspicious. Also I don't think Location Menu spawns real windows, so why are we trying to sub to this. I checked in the code, and there are only 3 places where we construct a window. All 3 are preceded with a check that the AXUIElement is an actual window.

You should also ignore any window that reports back 0 as its ID: https://developer.apple.com/documentation/coregraphics/kcgnullwindowid?language=objc

The actual definition is: #define kCGNullWindowID ((CGWindowID)0)
Some windows still let you target them through the AX API even though this is the case.
The finder desktop windows are an example of this.

@lwouis
Copy link
Owner

lwouis commented May 10, 2020

Investigating how to deal with these apps that won't be subscribed to.

I can reproduce this scenario in a legitimate app, Gimp, which replies with an AXError of -25204 when trying to sub with AXObserverAddNotification while it's starting. Then when it finishes starting, it succeeds.

The other app I use to investigate is Octave.app. This one, after starting, gets stuck in a loop or something, and won't respond. this means that we infinitely retry and get the same -25204 code.

I thought: is there any way we can differentiate these 2 processes. Even a correlation may be a good workaround, better than a hard timeout.

Interestingly, Octave.app takes a long time to respond to the AXObserverAddNotification call. I can see in the console, even though the code is supposed to retry every 10ms, it prints only every 2s or so. The call is blocking for a while before returning, presumably because the OS tries to contact the process for a while before giving up, which is different from the Gimp case, where somehow the OS comes back straight away with the error. Also found people confirming that on the Phoenix project.

I tried to compare these values: debugPrint(app.bundleIdentifier, app.isFinishedLaunching, app.isActive, app.isTerminated, app.processIdentifier, app.localizedName). Unfortunately, they are not giving clues. I'm thinking that maybe with some old C APIs, there may be ways to get some more insights into the state of these processes, and discriminate apps launching that way.

Here is a list of big apps, and how they deal with the issue:

  • Chromium: doesn't check the return code
  • Phoenix: doesn't check the return code
  • Hammerspoon: doesn't check the return code
  • yabai: retries every 10ms, no timeout.
  • Karabiner. They throw an exception and don't deal with it.
  • slate. They remove the sub if they fail. I think it makes no sense.
  • Snapp: 20ms * 100 retries = 2s timeout
  • Here they mention polling but there is no code in that repo

@lwouis
Copy link
Owner

lwouis commented May 10, 2020

After thinking about this thing for a while, I decided to go with a timeout. Doing a number of attempts like Snapp is incorrect though, as the time between attempts varies deeply between apps (between 10ms or lower, and 6s I believe).

I just check the clock and after 2 min of trying, I stop trying. This means apps that takes more than 2 min to launch / be ready, will not have their windows listed in AltTab. It's really not ideal, but I believe these scenarios are less frequent than having zombies processes make AltTab uses constant CPU.

Other ideas I considered:

  • Exponential backoff, where we increase the delay between retries as we retry. I think it's pretty pointless. Let's say you open Excel and it takes 10min to open because you're opening some juicy corporate nonsense file. By this time, let's say the delay is already at 5min. This means you'll see the Excel windows in 5min. Also we would run into overflows and other joys of reaching huge numbers with the delay.

  • Spacing retries more. Then there is a real risk of missing out on the early events triggered by the app, and then we are desync'd with the app, creating all sort of problems.

  • Blacklist/Whitelist. I think these are reasonable. If people complain about an app not listing, I'm open to add the app to a whitelist with longer/no timeout. My worry though, is that it's unlikely that people complain that their CPU was at 6% for 2min. It will stop just within their tolerance, and we will miss out on useful reports that would help us understand the issue more. Oh well, what can we do.

@lwouis lwouis closed this as completed in 7ab7c82 May 10, 2020
lwouis pushed a commit that referenced this issue May 10, 2020
## [3.22.5](v3.22.4...v3.22.5) (2020-05-10)

### Bug Fixes

* implement a 2min timeout for unresponsive apps (closes [#274](#274)) ([7ab7c82](7ab7c82))
@lwouis
Copy link
Owner

lwouis commented May 13, 2020

You should also ignore any window that reports back 0 as its ID: https://developer.apple.com/documentation/coregraphics/kcgnullwindowid?language=objc

I had somehow never seen the case where such a filtering would be needed (I already filter based on many things such as role, subrole, title, etc). However, after working on #292 I discovered that specifically during login, some windows have ID 0, and should be ignored, thus I added your recommended filtering 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants