Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apcupsd-ups driver race condition #2007

Closed
tbsky opened this issue Aug 3, 2023 · 9 comments · Fixed by #2010
Closed

apcupsd-ups driver race condition #2007

tbsky opened this issue Aug 3, 2023 · 9 comments · Fixed by #2010
Labels
APC bug impacts-release-2.8.0 Issues reported against NUT release 2.8.0 (maybe vanilla or with minor packaging tweaks)
Milestone

Comments

@tbsky
Copy link

tbsky commented Aug 3, 2023

Hi:
we have more and more new APC ups which need to use nut apcupsd-ups driver (with serial modbus supported in apcupsd).
they work fine under nut 2.7.4, but it is not good under nut 2.8.0.

the difference is that the default nut apcupsd-ups driver polling interval is 60 seconds under 2.7.4 and change to 2 seconds under 2.8.0. the default normal polling interval settings for apcupsd is also 60 seconds. so nut 2.7.4 seems co-exist with apcupsd well.

we need to record the ups data like "ups.load" history so we use command "upsc myups" to read the data. and we found sometimes there are no "ups.load" data under 2.8.0. I write a little script and try to execute "upsc myups" every second and found it will fail to read the full data about 60 seconds interval. the normal full data for "upsc myups" is like below:

battery.charge: 100.0
battery.charge.low: 5
battery.date: 2021-11-10
battery.runtime: 1440.0
battery.runtime.low: 180
battery.voltage: 26.2
device.mfr: APC
device.type: ups
driver.name: apcupsd-ups
driver.parameter.pollinterval: 2
driver.parameter.port: localhost
driver.parameter.synchronous: auto
driver.version: 2.8.0
driver.version.internal: 0.6
input.frequency: 59.8
input.transfer.reason: Line voltage notch or spike
input.voltage: 105.7
output.current: 4.03
output.frequency: 59.8
output.voltage: 121.5
output.voltage.nominal: 120
power.percent: 33.9
ups.date: 2023-08-02
ups.delay.shutdown: 0
ups.delay.start: 0
ups.firmware: UPS 04.4
ups.firmware.aux: 00.5
ups.id: APC UPS
ups.load: 47.1
ups.mfr: APC
ups.mfr.date: 2021-11-10
ups.model: Smart-UPS 1500
ups.power.nominal: 1440
ups.realpower.nominal: 1000.0
ups.serial: AS2145160441
ups.status: BOOST OL
ups.temperature: 34.0
ups.test.result: OK
ups.time: 17:46:18

===================================================

when there is racing, "upsc myups" returns partial data like below:

battery.charge: 100.0
battery.runtime: 1440.0
device.mfr: APC
device.type: ups
driver.name: apcupsd-ups
driver.parameter.pollinterval: 2
driver.parameter.port: localhost
driver.parameter.synchronous: auto
driver.version: 2.8.0
driver.version.internal: 0.6
ups.mfr: APC
ups.model: Smart-UPS 1500
ups.power.nominal: 1440
ups.realpower.nominal: 1000.0
ups.status: BOOST OL

===================================================

but the "apcaccess" of apcupsd is always correct like below:

APC      : 001,042,1078
DATE     : 2023-08-02 17:56:48 +0800
HOSTNAME : love-1.nhs.hcc.edu.tw
VERSION  : 3.14.14 (31 May 2016) redhat
UPSNAME  : APC UPS
CABLE    : Custom Cable Smart
DRIVER   : MODBUS UPS Driver
UPSMODE  : Stand Alone
STARTTIME: 2023-02-18 23:22:30 +0800
MODEL    : Smart-UPS 1500
STATUS   : BOOST ONLINE
LINEV    : 105.5 Volts
LOADPCT  : 46.6 Percent
LOADAPNT : 33.6 Percent
BCHARGE  : 100.0 Percent
TIMELEFT : 24.0 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME  : 0 Seconds
OUTPUTV  : 121.4 Volts
DWAKE    : 0 Seconds
DSHUTD   : 0 Seconds
ITEMP    : 34.0 C
BATTV    : 26.2 Volts
LINEFREQ : 59.8 Hz
OUTCURNT : 4.03 Amps
LASTXFER : Line voltage notch or spike
NUMXFERS : 24
XONBATT  : 2023-07-29 15:35:27 +0800
TONBATT  : 0 Seconds
CUMONBATT: 1079 Seconds
XOFFBATT : 2023-07-29 15:35:36 +0800
LASTSTEST: 2023-07-25 15:53:35 +0800
SELFTEST : OK
STATFLAG : 0x0500000C
MANDATE  : 2021-11-10
SERIALNO : AS2145160441
BATTDATE : 2021-11-10
NOMOUTV  : 120 Volts
NOMPOWER : 1000 Watts
NOMAPNT  : 1440 VA
FIRMWARE : UPS 04.4 / 00.5
END APC  : 2023-08-02 17:57:04 +0800

===================================================

I found I can not set the apcupsd-ups polling interval back to "60" seconds under nut 2.8.0. the maximum value it will use is "10" seconds. current I can only set the polling interval to "7" seconds, since "7" is a prime number. it will has better common multiple with "60" seconds default with apcupsd.

@jimklimov jimklimov added bug APC impacts-release-2.8.0 Issues reported against NUT release 2.8.0 (maybe vanilla or with minor packaging tweaks) labels Aug 3, 2023
@jimklimov
Copy link
Member

jimklimov commented Aug 3, 2023

Hi, trying to understand this better so reformatted your message for "screenshots" a bit.

Did I understand correctly that with your testing script, you had lots of data for every once-a-second read from upsc a minute, except once every 60 seconds where it was actually refreshing? Was this "lots of data" cached (whether by NUT or by apcupsd itself), that is - numbers were same during the minute?

After the change of polling interval to "7", does the problem re-occur - e.g. every 7 minutes as the NUT query to apcupsd daemon lands into its refresh, or just somehow becoming (and staying) bad after a minute?

From your description, my first guess would be that either:

  1. NUT places some expiration to the values it had, or to the last successful read, so it removes the bad info and queries the device again; or
  2. NUT does query apcupsd every 2 seconds and this data does not expire per 1), but as apcupsd is refreshing its knowledge of the device every 60 seconds, it returns a partial response to NUT. Maybe its own client has some other type of query like "give me possibly-obsolete info too" which NUT does not know to use.
  3. UPDATE after investigation posted below: ...or NUT queries apcupsd daemon every 2 seconds, and wipes much of its locally cached info before each such query, so you often have a chance to see the reduced data set until it asks, gets and process()es the replies from apcupsd.

Are you in position to try building NUT from sources and investigating the problem this way (adding more debug messages where you see fit, etc.)?

@jimklimov
Copy link
Member

Looking at v2.7.4...v2.8.0 (search down for drivers/apcupsd-ups.c, possibly main.c or others) to see if any changes between the releases pop up for e.g. poll interval...

@jimklimov
Copy link
Member

jimklimov commented Aug 3, 2023

So one suspect code could be in getdata(), where it goes:

	for(x=0;nut_data[x].info_type;x++)
		if(!(nut_data[x].drv_flags & DU_FLAG_INIT) && !(nut_data[x].drv_flags & DU_FLAG_PRESERVE))
			dstate_delinfo(nut_data[x].info_type);

	if((p.fd=socket(AF_INET,SOCK_STREAM,0))==-1)
...

previously it did not care about DU_FLAG_PRESERVE.

So indeed it first deletes all data not marked as "init" (and newly, neither marked as "preserved") and then goes to query apcupsd daemon. That gives a window of time to see partial info and maybe an opportunity to re-architect something here (e.g. remember/timestamp what we had, get new info, only then delete the non-refreshed items). But seems this order of events was always here (at least, was in 2.7.4 and at that - was more severe before #796 which went into 2.8.0).

For poll interval seems there is a genuine bug in upsdrv_updateinfo() and upsdrv_initinfo():

-poll_interval=60;
+poll_interval = (poll_interval > POLL_INTERVAL_MIN) ? POLL_INTERVAL_MIN : poll_interval;

where newly added POLL_INTERVAL_MIN is 10. Apparently the comparison sign got inverted vs. the intent of #797 :)

@jimklimov jimklimov added this to the 2.8.1 milestone Aug 3, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 3, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 3, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 3, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 3, 2023
…oleted readings AFTER refreshing data from apcupsd daemon [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 3, 2023
…FTER refreshing data from apcupsd daemon [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov
Copy link
Member

Theorized a fix, proposed in #2010

If CI basically agrees it is buildable across the board, and if you have a chance to build it from https://github.com/jimklimov/nut/tree/issue-2007 e.g. per https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests to test (even if just from the build workspace) and report if it behaves better or explodes badly, that would be great.

@tbsky
Copy link
Author

tbsky commented Aug 4, 2023

Hi:
the code can not run correctly. strace like below:

newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=73, ...}, AT_EMPTY_PATH) = 0
close(4)                                = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
connect(1, {sa_family=AF_INET, sin_port=htons(3551), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOTSOCK (Socket operation on non-socket)
close(1)                                = 0
write(2, "can't communicate with apcupsd!\n", 32can't communicate with apcupsd!
) = 32
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 1

I tried to modify apcupsd-ups.c and replace related line to old 2.8.0 version and let it run:

current code => if (INVALID_FD_SOCK( p.fd = socket(AF_INET, SOCK_STREAM, 0) ))
2.8.0 => if((p.fd=socket(AF_INET,SOCK_STREAM,0))==-1)

the new version seems fine. now the default and minimum polling interval is "10" seconds. and I can set the polling interval to "60" seconds. I try to execute "upsc myups" every second with default "10" seconds polling interval. there is no partial data after 10 minutes. I will let it run longer to make sure.

currently I don't have APC ups nearby which can unplug the utility power. so I can not test the situation when the ups is on battery.

thanks a lot for your quick fix!

@jimklimov
Copy link
Member

Sounds encouraging, thanks for the report and for the bug detection and fix proposal! :)

@jimklimov
Copy link
Member

I wonder if the INVALID_FD_SOCK change would benefit from one more set of parentheses (so the macro's argument is surely a single boolean/int token for the parser...

jimklimov added a commit to jimklimov/nut that referenced this issue Aug 4, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 4, 2023
jimklimov added a commit to jimklimov/nut that referenced this issue Aug 4, 2023
… we are deleting an entry because it is too old [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov
Copy link
Member

Bumped the PR: gave a shot to the idea with parentheses above, and added logging (at verbosity level 6) to see if there were in fact some values not refreshed from apcupsd (and/or not marked as such due to coding errors) and so deleted after the refresh.

If you'd check that this also works without surprises, would be great too :)

@tbsky
Copy link
Author

tbsky commented Aug 5, 2023

Hi:
I try the new version and it works fine.
thanks a lot for your help!

alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
…oleted readings AFTER refreshing data from apcupsd daemon [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
Signed-off-by: Alex W Baulé <[email protected]>
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
…FTER refreshing data from apcupsd daemon [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
Signed-off-by: Alex W Baulé <[email protected]>
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
alexwbaule pushed a commit to alexwbaule/nut that referenced this issue Oct 4, 2023
… we are deleting an entry because it is too old [networkupstools#2007]

Signed-off-by: Jim Klimov <[email protected]>
Signed-off-by: Alex W Baulé <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APC bug impacts-release-2.8.0 Issues reported against NUT release 2.8.0 (maybe vanilla or with minor packaging tweaks)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants