searchMatch matching " Bis " to "\s*Titel\s*" #63

Open
opened 2021-04-11 19:36:20 +00:00 by benibela · 0 comments
benibela commented 2021-04-11 19:36:20 +00:00 (Migrated from github.com)

When I load the text #10#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#66#105#115#10#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32#32 ("Bis" surrounded by whitespace) from an HTML file (https://github.com/benibela/videlibri/blob/master/_meta/tests/aDISWeb/list_orders_munich.html) and match it against \s*Titel\s*, I get a match:

$ wine ~/hg/programs/internet/xidel/xidel.exe \
  z:\\home\\benito\\hg\\programs\\internet\\VideLibri\\_meta\\tests\\\\aDISWeb\\list_orders_munich.html \
 -e 'let $x := (//text()[contains(.,"Bis")]) 
      return matches($x => string(), "\s*Titel\s*", "im")'
true

However, when I build the same string directly instead of reading it from the HTML file, I do not get a match:

$ wine ~/hg/programs/internet/xidel/xidel.exe \
   z:\\home\\benito\\hg\\programs\\internet\\VideLibri\\_meta\\tests\\\\aDISWeb\\list_orders_munich.html \
  -e 'x:cps((10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 66, 105, 115, 10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32)) => string-join("") => matches(  "\s*Titel\s*", "im")'
false
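
As a cross-check outside FLRE, the same input can be rebuilt from its code points and tested with a different regex engine. The sketch below uses Python's re module, whose syntax and flag semantics are not identical to those of XPath or FLRE, so it only illustrates the expected result (no match). The variable names are made up for this example.

```python
import re

# Code points from the report: LF, 19 spaces, "Bis", LF, 19 spaces.
code_points = [10] + [32] * 19 + [66, 105, 115] + [10] + [32] * 19
text = "".join(chr(cp) for cp in code_points)

# Python's re.IGNORECASE / re.MULTILINE stand in for the "im" flags
# passed to matches() in the xidel call (an approximation, not FLRE).
match = re.search(r"\s*Titel\s*", text, re.IGNORECASE | re.MULTILINE)

print(len(text))  # length of the input in characters
print(match)      # a match object, or None when nothing matches
```

The input contains no "Titel" in any letter case, so a correct engine should report no match, which agrees with the "false" result above.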

(flags case-insensitive and single line, so the regex becomes really big according to DumpRegularExpression ((?:[\x09-\x0d ]|\xc2\xa0|\xe1(?:\x9a\x80|\xa0\x8e)|\xe2(?:\x80[\x80-\x8b\xa8-\xa9\xaf]|\x81\x9f)|\xe3\x80\x80|\xef(?:\xbb\xbf|\xbf\xbe))*[Bb][Ii][Bb][Ll][Ii][Oo][Tt][Hh][Ee][Kk]|[Aa][Uu][Ss][Gg][Aa][Bb][Ee][Oo][Rr][Tt]|[Zz][Ww][Ee][Ii][Gg][Ss][Tt][Ee][Ll][Ll][Ee](?:[\x09-\x0d ]|\xc2\xa0|\xe1(?:\x9a\x80|\xa0\x8e)|\xe2(?:\x80[\x80-\x8b\xa8-\xa9\xaf]|\x81\x9f)|\xe3\x80\x80|\xef(?:\xbb\xbf|\xbf\xbe))*))
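
The size of the dump appears to come from two expansions: each letter is spelled out as a two-case class, and \s is written as alternatives over UTF-8 byte sequences. The following sketch decodes the byte sequences copied from the dump to show which code points they stand for, and mimics the letter expansion with a hypothetical helper. It is an illustration of the dump's shape, not FLRE's actual compilation code.

```python
# UTF-8 byte sequences copied from the whitespace alternatives in the dump
# (range endpoints such as \x80-\x8b are listed as their first and last member).
byte_sequences = [
    b"\xc2\xa0",
    b"\xe1\x9a\x80",
    b"\xe1\xa0\x8e",
    b"\xe2\x80\x80",
    b"\xe2\x80\x8b",
    b"\xe2\x80\xa8",
    b"\xe2\x80\xa9",
    b"\xe2\x80\xaf",
    b"\xe2\x81\x9f",
    b"\xe3\x80\x80",
    b"\xef\xbb\xbf",
    b"\xef\xbf\xbe",
]


def to_code_point(seq: bytes) -> str:
    """Decode one UTF-8 sequence and format it as U+XXXX."""
    return f"U+{ord(seq.decode('utf-8')):04X}"


def expand_case_insensitive(word: str) -> str:
    """Hypothetical helper: spell out each letter as a two-case class."""
    return "".join(f"[{c.upper()}{c.lower()}]" for c in word)


code_points = [to_code_point(seq) for seq in byte_sequences]
print(code_points)
print(expand_case_insensitive("Titel"))
```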

Anyway, the strings are identical in both cases, so the difference should have nothing to do with the HTML file.

If I look at TFLRE.SearchMatch, it takes the branch fifDFAReady in InternalFlags, case .. DFAMatch:, UnanchoredStart, with StartPosition = 0, UntilExcludingPosition = 43, MatchEnd = 92. Then it takes the next exit, with MatchBegin = 64, MatchEnd = 92.

That should be an impossible match, because the string is only 43 bytes long, should it not? Both MatchBegin = 64 and MatchEnd = 92 lie beyond UntilExcludingPosition = 43.
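
The invariant being violated can be written down as a small check. This is a minimal sketch with a hypothetical helper name. Whether FLRE treats MatchEnd as inclusive or exclusive is not stated here, so the check is deliberately lenient and still rejects the reported values.

```python
def span_is_within_input(match_begin: int, match_end: int,
                         start_position: int,
                         until_excluding_position: int) -> bool:
    """Return True if the reported span lies inside the searched range."""
    return start_position <= match_begin <= match_end <= until_excluding_position


# Values reported above: MatchBegin = 64, MatchEnd = 92,
# StartPosition = 0, UntilExcludingPosition = 43.
print(span_is_within_input(64, 92, 0, 43))
```

An assertion of this kind after the DFA search would flag the out-of-range result before it is returned as a match.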

It calls TFLREDFA.SearchMatchFast, and the problem occurs both with and without #58.

But when I remove all the assembly so that it uses the Pascal TFLREDFA.SearchMatchFast, it works without any problem (naturally, this only affects the 32-bit build).

Reference
BeRo1985/flre#63