Alternation leaves positional variable set sometimes #7518

Closed
p5pRT opened this issue Sep 30, 2004 · 7 comments

@p5pRT

p5pRT commented Sep 30, 2004

Migrated from rt.perl.org#31782 (status was 'resolved')

Searchable as RT31782$

@p5pRT

p5pRT commented Sep 30, 2004

From [email protected]

Created by [email protected]

This program, when run on most systems, returns a defined result for $1
only once: when the $ anchor finally matches. Thus $c ends up as 1.
But on this particular system, the variable $1 is left set with a
"leftover", and this causes a larger program to malfunction. $c ends
with a value of 15, and $d ends with a value of 14.

This is a loop that breaks a line into groups of words, those groups
being the largest group of space delimited words that is less than or
equal to 77 characters. The real code does different things depending
on which of the patterns matches - when the first alternative matches,
it believes that it is at an end, because the $ anchor matches.

Some simpler cases do not malfunction - this is an artificial test
case, of course, and it is about the same complexity as the actual
program that is failing.

#! /usr/bin/perl -w
$c = $d = $e = 0;
$a = '';
for ($i = 0; $i < 100; $i++) {
    $ii = substr($i, -1);
    for ($j = 1; $j < 6; $j++) {
        $a .= "$ii$j";
    }
    $a .= ' ';
}
#$a = "123456789 " x 100;
while ( $a =~ /\G ( .{1,77}? ) \s* $ |
               \G ( .{0,76}\S ) \s+ |
               \G ( .{1,77} ) /sgx) {
    $c++ if $1;
    $d++ if $2;
    $e++ if $3;
}
print "c $c d $d e $e \n ";

I believe that this is a bug - and the sysadmin of this
system would like to know if it is a bug or a feecher that
is settable through some configuration that he needs to deal
with.

Also, was it fixed on purpose or by accident in the newer versions
of Perl? This particular Mandrake system is the only one I know of
that fails in this manner.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.1:

Configured by gb at Mon Sep  1 17:25:48 CEST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.18-23mdksmp, archname=i386-linux-thread-multi
    uname='linux hp6.mandrakesoft.com 2.4.18-23mdksmp #1 smp fri aug 2 12:31:40 cest 2002 i686 unknown unknown gnulinux '
    config_args='-des -Dinc_version_list=5.8.0/i386-linux-thread-multi 5.8.0 5.6.1 5.6.0 -Darchname=i386-linux -Dcc=gcc -Doptimize=-O2 -fomit-frame-pointer -pipe -march=i586 -mcpu=pentiumpro  -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dman3ext=3pm -Dcf_by=MandrakeSoft -Dmyhostname=localhost -Dperladmin=root@localhost -Dd_dosuid -Ud_csh -Duseshrplib -Dusethreads'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -fomit-frame-pointer -pipe -march=i586 -mcpu=pentiumpro ',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.1/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    RC4


@INC for perl v5.8.1:
    /usr/lib/perl5/5.8.1/i386-linux-thread-multi
    /usr/lib/perl5/5.8.1
    /usr/lib/perl5/site_perl/5.8.1/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.1
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.1/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.1
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    .


Environment for perl v5.8.1:
    HOME=/home/nick
    LANG=en_US
    LANGUAGE=en_US:en
    LC_ADDRESS=en_US
    LC_COLLATE=en_US
    LC_CTYPE=en_US
    LC_IDENTIFICATION=en_US
    LC_MEASUREMENT=en_US
    LC_MESSAGES=en_US
    LC_MONETARY=en_US
    LC_NAME=en_US
    LC_NUMERIC=en_US
    LC_PAPER=en_US
    LC_TELEPHONE=en_US
    LC_TIME=en_US
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/games:/home/nick/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

p5pRT commented Oct 1, 2004

From @hvds

"njs.perlbug@spameater.squawk.com (via RT)" <perlbug-followup@perl.org> wrote:
:This program, when run on most systems, only returns a defined result
:for $1 once - when the $ anchor finally matches. Thus, $c is set to 1,
:But on this particular system, the variable $1 is left set with a
:"leftover" and this causes a larger program to malfunction. $c ends
:with a value of 15, and $d ends with a value of 14.

I wasn't able to reproduce your problem on my RedHat system here with any
of the released 5.8.[012345], so it is difficult for me to answer your
precise question.

However, I believe the regexp engine has historically had some bugs
handling the use of '\G' when not at the actual start of the pattern;
it is possible you could avoid the problem by refactoring the pattern:

  m{ \G (?:
        ( .{1,77}? ) \s* $
      | ( .{0,76} \S ) \s+
      | ( .{1,77} )
  )}sgx

There have also been various bugs fixed in the past regarding correctly
unsetting the capture variables on backtracking, and it is also quite
possible that the problem you are seeing on this particular system is
one such.

Note that your version output shows:

:Locally applied patches:
: RC4

.. ie this was "perl 5.8.1 release candidate 4", rather than the actual
final release version. In the event, many further patches (mostly minor)
were applied before this candidate was deemed ready for release.

In any event, the intention is that $1 should be undef if that alternate
was not the one that matched.

Hugo
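
The intended behaviour described above can be sketched as follows. This is only an illustration on a perl without the bug, not a workaround for the 5.8.1 RC4 build: the sample text and the `%hits` counter names are made up for the example. It also tests the groups with `defined` rather than for truth, since a captured group such as "0" is defined but false.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: 40 five-character words, each followed by a space.
my $text = join '', map { "w$_ " } 1000 .. 1039;

# One counter per alternative (names are illustrative only).
my %hits = (end => 0, word => 0, hard => 0);

while ($text =~ m{ \G (?:
                         ( .{1,77}? ) \s* $     # last chunk, up to end of string
                       | ( .{0,76} \S ) \s+     # chunk ending on a word boundary
                       | ( .{1,77} )            # hard break, no whitespace found
                   ) }sgx) {
    # Only the group belonging to the alternative that matched is defined.
    if    (defined $1) { $hits{end}++  }
    elsif (defined $2) { $hits{word}++ }
    else               { $hits{hard}++ }
}

# On a perl without the bug this prints: end=1 word=3 hard=0
print "end=$hits{end} word=$hits{word} hard=$hits{hard}\n";
```

On an affected build a stale $1 would still be defined, so this check shows the expected result rather than protecting against the bug.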

@p5pRT

p5pRT commented Oct 1, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Oct 1, 2004

From [email protected]

On Fri, 2004-10-01 at 08:48, Hugo van der Sanden via RT wrote:

> "njs.perlbug@spameater.squawk.com (via RT)" <perlbug-followup@perl.org> wrote:
> :This program, when run on most systems, only returns a defined result
> :for $1 once - when the $ anchor finally matches. Thus, $c is set to 1,
> :But on this particular system, the variable $1 is left set with a
> :"leftover" and this causes a larger program to malfunction. $c ends
> :with a value of 15, and $d ends with a value of 14.
>
> I wasn't able to reproduce your problem on my RedHat system here with any
> of the released 5.8.[012345], so it is difficult for me to answer your
> precise question.

When this problem was reported to me, I had the same issue in that I
could not reproduce it on any system I had. Because of this issue, I
have been given a login on one of the systems that is at this particular
level of Mandrake, and that system is where I submitted the perlbug
from, specifically because of the point that I thought that there might
be something odd with the environment and in the hopes of getting
exactly this sort of comment (see end of note).

> However, I believe the regexp engine has historically had some bugs
> handling the use of '\G' when not at the actual start of the pattern;
> it is possible you could avoid the problem by refactoring the pattern:
>
>   m{ \G (?:
>         ( .{1,77}? ) \s* $
>       | ( .{0,76} \S ) \s+
>       | ( .{1,77} )
>   )}sgx

I tried the suggested alteration of the pattern, however, and I am sorry
to report that it did not fix the problem: the capture variable for a
failed alternative is still not undef. Below is my test of the
suggestion. I will say that this is a better formulation of the pattern,
and I am going to patch this section to reduce the impact of the failure
on my code, so I will include your suggestion. I'm not sure whether the
pattern optimizer is smart enough to see the fact that every alternation
starts with \G and therefore it can be factored, and I may well have
written this particular pattern some time ago, before (?: ) entered my
personal toolkit.

[nick@grp nick]$ ./error_test
c 17 d 16 e 1 combos 16 combo3 1
[nick@grp nick]$ cat error_test
#! /usr/bin/perl -w
$c = $d = $e = $combos = $combo3 = 0;
$a = '';
for ($i = 0; $i < 100; $i++) {
    $ii = substr($i, -1);
    for ($j = 1; $j < 6; $j++) {
        $a .= "$ii$j";
    }
    $a .= ' ';
}
$a .= substr("123456789" x 10, 0, 78);
#$a = "123456789 " x 100;
while ( $a =~ m{ \G (?:
          ( .{1,77}? ) \s* $
        | ( .{0,76} \S ) \s+
        | ( .{1,77} ) )}sgx) {
    $c++ if $1;
    $d++ if $2;
    $e++ if $3;
    $combos++ if ($1 and $2) or ($2 and $3) or ($1 and $3);
    $combo3++ if $1 and $2 and $3;
}
print "c $c d $d e $e combos $combos combo3 $combo3\n ";
[nick@grp nick]$

I've tried some variations on this - this was a test where I tried using
your exact test and was also ensuring that the third part of the pattern
got exercised, but I have tried variations of the pattern insofar as I
can think of them. None have helped. I don't think I can write a program
that works properly in all circumstances without the proper setting of
the variables, without, say, applying the patterns individually and
testing the results one at a time. I guess that before I abandon that
approach, I should likely test it to see how much, if any, performance I
lose by running three simple patterns against the result instead of one
more complex one, or perhaps I should supply it as a private fix for
this environment.

I had thought that moving part of the complex program logic of walking
through the paragraph while advancing the pointer into the vector and
all the rest of the work that this multi-part pattern with alternate
bits was able to encapsulate simplified the main line perl code. I was
actually really happy with it.

Something like this will probably work reliably despite the failure:

pos($a) = $lagpos = $[;
while (pos($a) < length($a)) {
    if ($a =~ m{\G(.{1,77}?)\s*$}sg) {
        $c++ if $1;
        last;
    } elsif (pos($a) = $lagpos, $a =~ m{\G(.{0,76}\S)\s+}sg) {
        $d++ if $1;
    } elsif (pos($a) = $lagpos, $a =~ m{\G(.{1,77})}sg) {
        $e++ if $1;
    } else {
        print "Nothing matched.\n";
        last;
    }
    if (pos($a) == $lagpos) {
        print "Position is not advancing, ", pos($a), " ", $lagpos, "\n";
        exit(1);
    } else {
        $lagpos = pos($a);
        # print "lagpos set to $lagpos\n";
    }
}
print "c $c d $d e $e ";

Yes, the stuff with $lagpos is required since a failed pattern match
unsets the pos() associated with the variable, so you have to be
prepared to restore it. In any case, I think that this code is really
ugly and complex as compared to the first loop. And it is about 5%
slower than the original code (tested over 10,000 iterations).

I am also not sure when you have to initially set pos and when you do
not - my original test case works without the setting of pos() but the
separate patterns don't - when pos() is unset, this version will drag
the \G to the end of the pattern. So \G with an unset pos() means
different things depending on what surrounds the pattern, or so it
seems.
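
The pos() behaviour in question can be checked with a small sketch like the one below. The string and variable names are made up for illustration; it relies only on the documented rules that a failed /g match resets pos() to undef and that \G anchors at the start of the string when pos() is undefined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "aaa bbb";   # hypothetical sample string

# A successful scalar-context /g match leaves pos() just past the match.
my $ok = $s =~ /\Gaaa/g;
my $after_success = pos($s);

# A failed /g match (without /c) resets pos() to undef ...
my $failed = $s =~ /\Gzzz/g;
my $after_failure = pos($s);

# ... and with pos() undefined, \G anchors at offset 0 again.
my $restart = $s =~ /\G(\w+)/g ? $1 : undef;

print "after success: $after_success\n";
print "after failure: ", (defined $after_failure ? $after_failure : "undef"), "\n";
print "restart matched: $restart\n";
```

So an explicit `pos($a) = 0` is equivalent to leaving pos() unset for the first match; the difference only shows up once an earlier failed match has silently reset the position.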

> There have also been various bugs fixed in the past regarding correctly
> unsetting the capture variables on backtracking, and it is also quite
> possible that the problem you are seeing on this particular system is
> one such.

I had searched for such bugs without success before reporting this. As
far as I am concerned, this is all mooted by your below discovery that
the Mandrake people built Perl with a source that was not a final,
released-and-supported level.

> Note that your version output shows:
>
> :Locally applied patches:
> : RC4
>
> .. ie this was "perl 5.8.1 release candidate 4", rather than the actual
> final release version. In the event, many further patches (mostly minor)
> were applied before this candidate was deemed ready for release.
>
> In any event, the intention is that $1 should be undef if that alternate
> was not the one that matched.
>
> Hugo

I'm leaving this in because I'm bccing my response to the systems
administrator in the hope that he can possibly use this information to
drag a fix out of the Mandrake 9 people. Thanks.

Thanks for your help.

@p5pRT

p5pRT commented Oct 3, 2004

From @ysth

On Fri, Oct 01, 2004 at 04:42:22PM -0400, Nick Simicich wrote:

> I'm not sure whether the pattern optimizer is smart enough to see
> the fact that every alternation starts with \G and therefore it can
> be factored

No, it's not smart enough; a /\Gfoo|\Gbar/ pattern will try matching
first \Gfoo then \Gbar at each position in the string from the beginning
until the pos(). To see this in action, try:

  $ perl -we'use re "debug"; $_="x"x99; $_.="bar"; pos($_)=99; /\Gfoo|\Gbar/'

then try:

  $ perl -we'use re "debug"; $_="x"x99; $_.="bar"; pos($_)=99; /\G(?:foo|bar)/'

> Yes, the stuff with $lagpos is required since a failed pattern match
> unsets the pos() associated with the variable, so you have to be
> prepared to restore it. In any case, I think that this code is really
> ugly and complex as compared to the first loop. And it is about 5%
> slower than the original code (tested over 10,000 iterations).

Use /gc instead of just /g to preserve pos() when the regex fails.
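
As a rough sketch of that suggestion, the three-pattern loop could be written with /gc so that a failed match leaves pos() where it was and no saved position needs restoring. The sample text and the lexical `$c`, `$d`, `$e` counters are assumptions made for the example, not the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: 40 five-character words, each followed by a space.
my $text = join '', map { "w$_ " } 1000 .. 1039;
my ($c, $d, $e) = (0, 0, 0);

pos($text) = 0;
while (pos($text) < length $text) {
    # /c keeps pos() unchanged when a match fails, so each elsif
    # starts from the same position without any bookkeeping.
    if    ($text =~ m{ \G (.{1,77}?) \s* $ }sgcx) { $c++; last }
    elsif ($text =~ m{ \G (.{0,76}\S) \s+  }sgcx) { $d++ }
    elsif ($text =~ m{ \G (.{1,77})        }sgcx) { $e++ }
    else  { last }   # nothing matched; stop rather than loop forever
}

print "c $c d $d e $e\n";
```

Because each pattern has a single capture group, only $1 is consulted after a successful match, which sidesteps the stale-capture question for this loop.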

@p5pRT

p5pRT commented May 31, 2008

From [email protected]

I can reproduce this with perl-5.8.1 RC4 but not with perl-5.8.1 (or
any other perl).

So this seems to be fixed.

@p5pRT

p5pRT commented May 31, 2008

[email protected] - Status changed from 'open' to 'resolved'
