pcre error -27 (JIT stack limit) on long regex string #8278

Closed
randyzwitch opened this issue Sep 9, 2014 · 18 comments

Comments

@randyzwitch
Contributor

(Edit: Working off nightly 0.4 build)
I'm making a package to parse Apache logs. See code here:
https://github.com/randyzwitch/LogParser.jl

I'm fairly comfortable with the regex I wrote, having a 99% match rate on my test files. However, one particularly gnarly string triggers the following error:

julia> errorstring = """71.163.72.113 - - [30/Jul/2014:16:40:55 -0700] "GET emptymind.org/thevacantwall/wp-content/uploads/2013/02/DSC_006421.jpg HTTP/1.1" 200 492513 "http://images.search.yahoo.com/images/view;_ylt=AwrB8py9gdlTGEwADcSjzbkF;_ylu=X3oDMTI2cGZrZTA5BHNlYwNmcC1leHAEc2xrA2V4cARvaWQDNTA3NTRiMzYzY2E5OTEwNjBiMjc2YWJhMjkxMTEzY2MEZ3BvcwM0BGl0A2Jpbmc-?back=http%3A%2F%2Fus.yhs4.search.yahoo.com%2Fyhs%2Fsearch%3Fei%3DUTF-8%26p%3Dapartheid%2Bwall%2Bin%2Bpalestine%26type%3Dgrvydef%26param1%3D1%26param2%3Dsid%253Db01676f9c26355f014f8a9db87545d61%2526b%253DChrome%2526ip%253D71.163.72.113%2526p%253Dgroovorio%2526x%253DAC811262A746D3CD%2526dt%253DS940%2526f%253D7%2526a%253Dgrv_tuto1_14_30%26hsimp%3Dyhs-fullyhosted_003%26hspart%3Dironsource&w=588&h=387&imgurl=occupiedpalestine.files.wordpress.com%2F2012%2F08%2F5-peeking-through-the-wall.jpg%3Fw%3D588%26h%3D387&rurl=http%3A%2F%2Fwww.stopdebezetting.com%2Fwereldpers%2Fcompare-the-berlin-wall-vs-israel-s-apartheid-wall-in-palestine.html&size=49.0KB&name=...+%3Cb%3EApartheid+wall+in+Palestine%3C%2Fb%3E...+%7C+Or+you+go+peeking+through+the+%3Cb%3Ewall%3C%2Fb%3E&p=apartheid+wall+in+palestine&oid=50754b363ca991060b276aba291113cc&fr2=&fr=&tt=...+%3Cb%3EApartheid+wall+in+Palestine%3C%2Fb%3E...+%7C+Or+you+go+peeking+through+the+%3Cb%3Ewall%3C%2Fb%3E&b=0&ni=21&no=4&ts=&tab=organic&sigr=13evdtqdq&sigb=19k7nsjvb&sigi=12o2la1db&sigt=12lia2m0j&sign=12lia2m0j&.crumb=.yUtKgFI6DE&hsimp=yhs-fullyhosted_003&hspart=ironsource" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"""

julia> r = r"""([\d\.]+) ([\w.-]+) ([\w.-]+) (\[.+\]) "([^"\r\n]*|[^"\r\n\[]*\[.+\][^"]+|[^"\r\n]+.[^"]+)" (\d{3}) (\d+|-) ("(?:[^"]|\")+)"? ("(?:[^"]|\")+)"?"""

julia> match(r, errorstring)

error -27
while loading In[28], in expression starting on line 1

 in error at error.jl:21
 in exec at ./pcre.jl:136
 in match at ./regex.jl:119
 in match at ./regex.jl:133

Here's the man page explanation for -27:
PCRE_ERROR_JIT_STACKLIMIT (-27)

   This error is returned when a pattern that was successfully studied
   using a JIT compile option is being matched, but the memory available
   for the just-in-time processing stack is not large enough. See the
   pcrejit documentation for more details.

http://www.pcre.org/pcre.txt

This much of the regex works fine:

r"""([\d\.]+) ([\w.-]+) ([\w.-]+) (\[.+\]) "([^"\r\n]*|[^"\r\n\[]*\[.+\][^"]+|[^"\r\n]+.[^"]+)" (\d{3}) (\d+|-)"""

Any ideas what to do here or what the problem might be? A try/catch seems like the wrong way to handle this; it looks like a lower-level issue.
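
For anyone who wants to poke at the shorter pattern, here is a minimal sketch that runs it against a short log line. The log line is made up for illustration (it is not from the test files), and the indexing into `captures` assumes a Julia version where `match` returns a `RegexMatch` with a `captures` field.

```julia
# The shorter pattern from the post above, unchanged.
r = r"""([\d\.]+) ([\w.-]+) ([\w.-]+) (\[.+\]) "([^"\r\n]*|[^"\r\n\[]*\[.+\][^"]+|[^"\r\n]+.[^"]+)" (\d{3}) (\d+|-)"""

# Made-up Apache log line (an assumption, not taken from the real test files).
line = """1.2.3.4 - - [30/Jul/2014:16:40:55 -0700] "GET /index.html HTTP/1.1" 200 1234"""

m = match(r, line)

# Groups 1, 6 and 7 are the client address, the status code and the byte count.
ip, status, nbytes = m.captures[1], m.captures[6], m.captures[7]
println("ip=", ip, " status=", status, " bytes=", nbytes)
```

A short line like this stays well within the default JIT stack; the failure reported here only shows up on the very long referrer string.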

@stevengj
Member

stevengj commented Sep 9, 2014

See pcrestack on how to increase the PCRE stack size (or how to rearrange your regex to require less stack). It seems like it has to be done at compile time, and you may also need to increase the OS stack size.

@dcjones
Contributor

dcjones commented Sep 9, 2014

The default stack size is only 32KB. Maybe we should allocate a single stack of, say, 1MB and set every regex to use it when it is compiled.

This from the pcrejit manpage made me laugh:

(7) This is too much of a headache. Isn't there any better solution for JIT stack handling?

No, thanks to Windows. If POSIX threads were used everywhere, we could throw out this complicated API.

@ViralBShah
Member

It does seem reasonable to have a higher stack size, at least on Linux and Mac, if Windows is a problem.

ViralBShah reopened this Sep 9, 2014
@randyzwitch
Contributor Author

Thanks for confirming that the issue is a small stack default @dcjones.

@randyzwitch
Contributor Author

Is there a simple setting I can modify while compiling from source to play around with different stack size values?

@dcjones
Contributor

dcjones commented Sep 11, 2014

Not super simple, but if pat is your regex pattern, the following should work:

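# Allocate a JIT stack that starts at 32 KB and can grow up to 1 MB.
jit_stack = ccall((:pcre_jit_stack_alloc, :libpcre),
                  Ptr{Void}, (Cint, Cint), 32768, 1048576)
# Attach it to the pattern; with a NULL callback, the last argument is the stack itself.
ccall((:pcre_assign_jit_stack, :libpcre),
      Void, (Ptr{Void}, Ptr{Void}, Ptr{Void}), pat.extra, C_NULL, jit_stack)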

@dcjones
Contributor

dcjones commented Sep 11, 2014

In that example, 32768 (32 KB) is the initial stack size and 1048576 (1 MB) is the maximum, both in bytes.
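
The two magic numbers are easier to read when spelled out as multiples of 1024. This is only a small arithmetic sketch; the names `KB`, `initial_size` and `max_size` are illustrative and do not appear anywhere in Base.

```julia
# Sizes from the example above, written out in bytes.
const KB = 1024

initial_size = 32 * KB      # starting size of the JIT stack
max_size     = 1024 * KB    # upper bound the stack is allowed to grow to

println("initial = ", initial_size, " bytes")
println("max     = ", max_size, " bytes")
println("ratio   = ", div(max_size, initial_size))
```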

JeffBezanson changed the title from "pcre error -27 on long regex string" to "pcre error -27 (JIT stack limit) on long regex string" Sep 16, 2014
@randyzwitch
Contributor Author

Thanks @dcjones! I tried this on the bug example above and it worked. I also tested it on an array of 350,000 Apache log strings and didn't get any errors (that run previously failed on the example string).

Is this something that could be incorporated into Base easily or should I just build this fix into my package (or both)?

@JeffBezanson
Sponsor Member

Yes, I think we should use a bigger stack by default; 32KB is extremely small. It seems like the only way to do this is for us to explicitly call pcre_assign_jit_stack for every regex? Or at least intercept the error, print a nice message, and provide an easier way to do this.
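
As a rough illustration of the "intercept the error and print a nice message" idea, the sketch below maps a PCRE return code to a readable string. The helper name `pcre_error_message` and the wording of the messages are hypothetical; only the code -27 and its meaning come from the man page quoted earlier in this thread.

```julia
# Error code quoted from the PCRE man page earlier in the thread.
const PCRE_ERROR_JIT_STACKLIMIT = -27

# Hypothetical helper (not part of Base): turn a PCRE return code into
# a message that tells the user what went wrong and what to try.
function pcre_error_message(code::Integer)
    if code == PCRE_ERROR_JIT_STACKLIMIT
        return "PCRE error -27: JIT stack limit reached; " *
               "assign a larger JIT stack to the pattern"
    end
    # Fallback for codes this sketch does not know about.
    return "PCRE error $code"
end

println(pcre_error_message(-27))
println(pcre_error_message(-1))
```

Something along these lines could replace the bare "error -27" shown in the original report, though a larger default stack would make the message rarely needed.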

@dcjones
Contributor

dcjones commented Sep 17, 2014

I was going to make a PR to set all patterns to use a 1MB stack, but I'm running into an issue. If I define globals in pcre.jl like so:

const JIT_STACK_START_SIZE = 32768
const JIT_STACK_MAX_SIZE = 1048576
const JIT_STACK = ccall((:pcre_jit_stack_alloc, :libpcre), Ptr{Void},
                        (Cint, Cint), JIT_STACK_START_SIZE, JIT_STACK_MAX_SIZE)

JIT_STACK is always NULL. Yet it works from the repl. Why would that be?

@simonster
Member

@dcjones Maybe the ccall has to happen in __init__ since the pointer can't be saved in sys.so? Does it work if you remove sys.so/dylib/dll?

@dcjones
Contributor

dcjones commented Sep 17, 2014

Thanks @simonster, that was the issue.
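
A minimal sketch of the `__init__` pattern being discussed, under some loud assumptions: it is written in current Julia spelling (`Cvoid` and `Ref` instead of the `Void` used in this thread), the module name `JITStackSketch` is made up, and `Libc.malloc` stands in for `pcre_jit_stack_alloc` so the example runs without libpcre. The point is only that the pointer is filled in when the module loads rather than stored as a constant in the system image.

```julia
module JITStackSketch

# Sizes taken from the snippet earlier in the thread.
const JIT_STACK_START_SIZE = 32768
const JIT_STACK_MAX_SIZE = 1048576

# Placeholder that starts out NULL; a raw pointer computed while building
# the system image would not be valid in a later process.
const JIT_STACK = Ref{Ptr{Cvoid}}(C_NULL)

# Stand-in allocator (assumption): Libc.malloc replaces the real
# pcre_jit_stack_alloc call so the sketch has no libpcre dependency.
# The memory is never freed, which is acceptable only for a demo.
alloc_stack(startsize, maxsize) = Libc.malloc(maxsize)

# Runs each time the module is loaded, so the pointer is always fresh.
function __init__()
    JIT_STACK[] = alloc_stack(JIT_STACK_START_SIZE, JIT_STACK_MAX_SIZE)
    return nothing
end

end # module

println("stack allocated: ", JITStackSketch.JIT_STACK[] != C_NULL)
```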

@stevengj
Member

Isn't there a way to set the stack size when PCRE is compiled?

@pao
Member

pao commented Sep 17, 2014

That wouldn't help if your build used USE_SYSTEM_PCRE.

@randyzwitch
Contributor Author

Feels like a person building themselves and changing to use their own system PCRE would presumably know to change the stack size or have done it themselves? So if doing this at compile time takes an extra call out of every regex match function, that seems like a decent trader off to me.

Maybe just out a note in the make file to make sure stack size is large enough if you choose to use system PCRE?

@randyzwitch
Contributor Author

That's "trade off" and "put a note", iOS is not being good to me this morning

@nalimilan
Member

@randyzwitch People the least involved in Julia development are going to use distribution packages on Linux, and they'll use the system PCRE without even knowing it.

@StefanKarpinski
Sponsor Member

Since it's simple for us to set the stack size at run time, I can't see why we wouldn't.
