XQuery test suite false negative #10
Labels
No labels
bug
duplicate
enhancement
help wanted
invalid
question
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
BeRo1985/flre#10
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The XQTS knows the following regexs that should match, but do not:
The Kelvin sign K should match k or K in case insensitive mode
And there should be the following Unicode classes:
\P{C}
\P{Cc}
\P{L}
\p{Lu}
\P{Lu}
\P{M}
\P{Me}
\P{N}
\P{No}
\P{P}
\P{Pf}
\P{S}
\P{Sk}
\P{Z}
\P{Zs}
And some more that should have matched
The ^$ stuff should work now. Please retest it.
And back-references are only partially supported, because full support for irregular expression back-references does need a backtracking algorithm, and FLRE avoids backtracking as much as possible. More details see this followed article series (whereupon FLRE was also modeled): https://swtch.com/~rsc/regexp/regexp1.html https://swtch.com/~rsc/regexp/regexp2.html and https://swtch.com/~rsc/regexp/regexp3.html
And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.
Now it complains that it matches too much
It is not really that irregular.
Seems mostly to be that if the capture group is optional and does not match anything in the input, the backreference should match the empty string. For that you could even remove both of them during the expansion
here
github: "Unfortunately, we don’t support that file type. " ...
here as png

And the list of missing Unicode classes:
\p{C}
\p{Cc}
\p{Cf}
\p{Cn}
\p{Co}
\p{IsAlphabeticPresentationForms}
\p{IsArabic}
\p{IsArmenian}
\p{IsArrows}
\p{IsBasicLatin}
\p{IsBengali}
\p{IsBlockElements}
\p{IsBopomofo}
\p{IsBopomofoExtended}
\p{IsBoxDrawing}
\p{IsBraillePatterns}
\p{IsByzantineMusicalSymbols}
\p{IsCherokee}
\p{IsCJKCompatibility}
\p{IsCJKCompatibilityForms}
\p{IsCJKCompatibilityIdeographs}
\p{IsCJKCompatibilityIdeographsSupplement}
\p{IsCJKRadicalsSupplement}
\p{IsCJKSymbolsandPunctuation}
\p{IsCJKUnifiedIdeographs}
\p{IsCJKUnifiedIdeographsExtensionA}
\p{IsCJKUnifiedIdeographsExtensionB}
\p{IsCombiningDiacriticalMarks}
\p{IsCombiningHalfMarks}
\p{IsCombiningMarksforSymbols}
\p{IsControlPictures}
\p{IsCurrencySymbols}
\p{IsCyrillic}
\p{IsDeseret}
\p{IsDevanagari}
\p{IsDingbats}
\p{IsEnclosedAlphanumerics}
\p{IsEnclosedCJKLettersandMonths}
\p{IsEthiopic}
\p{IsGeneralPunctuation}
\p{IsGeometricShapes}
\p{IsGeorgian}
\p{IsGothic}
\p{IsGreek}
\p{IsGreekExtended}
\p{IsGujarati}
\p{IsGurmukhi}
\p{IsHalfwidthandFullwidthForms}
\p{IsHangulCompatibilityJamo}
\p{IsHangulJamo}
\p{IsHangulSyllables}
\p{IsHebrew}
\p{IsHighSurrogates}
\p{IsHiragana}
\p{IsIdeographicDescriptionCharacters}
\p{IsIPAExtensions}
\p{IsKanbun}
\p{IsKangxiRadicals}
\p{IsKannada}
\p{IsKatakana}
\p{IsKhmer}
\p{IsLao}
\p{IsLatinExtendedAdditional}
\p{IsLetterlikeSymbols}
\p{IsLowSurrogates}
\p{IsMalayalam}
\p{IsMathematicalAlphanumericSymbols}
\p{IsMathematicalOperators}
\p{IsMiscellaneousSymbols}
\p{IsMiscellaneousTechnical}
\p{IsMongolian}
\p{IsMusicalSymbols}
\p{IsMyanmar}
\p{IsNumberForms}
\p{IsOgham}
\p{IsOldItalic}
\p{IsOpticalCharacterRecognition}
\p{IsOriya}
\p{IsPrivateUse}
\p{IsRunic}
\p{IsSinhala}
\p{IsSmallFormVariants}
\p{IsSpacingModifierLetters}
\p{IsSpecials}
\p{IsSuperscriptsandSubscripts}
\p{IsSyriac}
\p{IsTags}
\p{IsTamil}
\p{IsTelugu}
\p{IsThaana}
\p{IsThai}
\p{IsTibetan}
\p{IsUnifiedCanadianAboriginalSyllabics}
\p{IsYiRadicals}
\p{IsYiSyllables}
\p{L}
\p{Ll}
\p{Lm}
\p{Lo}
\p{Lt}
\p{Lu}
\p{M}
\p{Mc}
\p{Me}
\p{Mn}
\p{N}
\p{Nd}
\p{Nl}
\p{No}
\p{P}
\p{Pc}
\p{Pd}
\p{Pe}
\p{Pf}
\p{Pi}
\p{Po}
\p{Ps}
\p{S}
\p{Sc}
\p{Sk}
\p{Sm}
\p{So}
\p{Z}
\p{Zl}
\p{Zp}
\p{Zs}
The Unicode classes were not missing, instead the code for the \p{} parsing had just a bug, which I've fixed now.
Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}
And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match
Is there an option to make the escapes independent of the case sensitivity? So
\p{Lu}only matches upper case letters, even if matched case insensitive.The InBasicLatin parsing bug is fixed now. And I'll add IsCombiningMarksforSymbols later.
But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). The whole FLRE unicode support stuff is automatically converted from these offical Unicode data files, see src\Unicode\FLREConvertUnicode.dpr for the OfficalUnicodeResourceData->FLREData-converter-code.
Hm just use
(?-i)\p{Lu}(?i)? :)What other stuff?
Armenian is there. Range, 1328;ToChar:1423. And \p{IsArmenian} works for all the assigned code points https://en.wikipedia.org/wiki/Armenian_%28Unicode_block%29 except 0x589 (Armenian fullstop).
That would make sense. Guess a preprocessor could output that, even if it would not allow it as input
But can this work with backreferences? I.e.
(s(?-i)\p{Lu}(?i)t)\1should still matchsUtsutarmenian.txt
Perhaps because that is in the Armerian block, but the Common script
Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.
For example:
(s[uU]t)\1will transformed and compiled internally into(s[uU]t)(?hidden-internal-capture-group:clone-capturing-group-1:s[uU]t)(?hidden-internal-verify-operation:is-capture-group-1-equal-to-hidden-internal-capture-group-1).This is necessary, because, as already said, FLRE avoids backtracking as possible (except at the BitstateNFA algorithm for short and simple non-complex regexs), and the Thompson-NFA (Parallel-Threaded-NFA) algorithm can emulate backreferences only in that hacky way, since the Thompson-NFA (Parallel-Threaded-NFA) algorithm in FLRE does never backtrack instead it processes every input char exactly once. And the DFA code in FLRE sees
(s[uU]t)\1just only ass[uU]ts[uU]tand leaves it to the non-backtracking *NFA algorithms the final verification.Full real support for back-references would do need a full back-tracking-algorithm implementation, and back-tracking can be very slow and is even too vulnerable to ReDoS attacks (Regular expression Denial of Service, see https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS and https://en.wikipedia.org/wiki/ReDoS and even https://swtch.com/~rsc/regexp/regexp1.html ), therefore FLRE avoids back-tracking as possible, and this consequently FLRE has only emulated partial hacky support for back-references.
And as a small side-note: the Google RE2 regexp engine (which was the role model for FLRE) doesn't support back-references at all for these reasons.
Any questions left regarding the back-references topic? :)
Sounds complicated
Perhaps it could remove the ?i when cloning the group?
I do not really need the speed. It just needs to pass the test cases that have require this specific syntax.
My suggestion (and my offer for you) would be a syntax like (s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) or something like that, where (?P=x:y) (as a extension of the normal alternative (?P=x) backreference syntax) would be a compare-to-another-capturing-group capturing-group, where x = compare-with-capturing-group-index-or-named-group-name and y = manual-by-hand cloned but modified regex. Anything else would mean more effort and complicate the code, I think.
that would be great
Another one that should match (using the matches function from the above png):
(?i)(s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t)and(^|) *abc *(,|$)should work now. I've improved the beginning-and-ending-text-anchor-detection for to disable some speed optimizations then, and that had fixed at least(^|) *abc *(,|$).Thanks. Now I need to figure out how to combine it with my preprocessor
If there are parenthesis inside the (?P=..) does that create new capturing groups?
Tried to test it, but found this instead:
Hm yes, you can use here non-capturing-groups (?:) for example
(?:\w+)Should fixed now in the lastest commit.
It was IsBasicLatin in this test
Although it seems it gives both. Some even have \p{script=...} or \p{block=...}
You are aware that it is still failing the
abracadabra-abracadabra-3case, etc. ?Hello BeRo1985,
I'm one of Xidel's users; benibela's project that seems to using your FLRE-library.
He's dead set on updating Xidel's regular expression engine before fixing other bugs and releasing another update at all, but relies on you to do so.
If you would fix benibela's reported issues, besides helping benibela you would also help me in the process.
Thanks!
Perhaps i should close this general issue and open new ones for every sub issue.
Might be easier to handle
Some more observations with
one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|twenty-one|twenty-two|twenty-three|twenty-four|twenty-five|twenty-six|twenty-seven|twenty-eight|twenty-nine|thirty|thirty-one|thirty-two|thirty-three|thirty-four|thirty-five|thirty-six|thirty-seven|thirty-eight|thirty-nine|forty|forty-one|forty-two|forty-three|forty-four|forty-five|forty-six|forty-seven|forty-eight|forty-nine|fifty|fifty-one|fifty-two|fifty-three|fifty-four|fifty-five|fifty-six|fifty-seven|fifty-eight|fifty-nine|sixty|sixty-one|sixty-two|sixty-three|sixty-four|sixty-five|sixty-six|sixty-seven|sixty-eight|sixty-nine|seventyIt fails with most common prefixes and none of
six seven twenty thirty forty fifty sixtyare matched.And it depends on the order of the alternatives. When it is sorted to descending
two|twenty-two|twenty-three|twenty-six|twenty-seven|twenty-one|twenty-nine|twenty-four|twenty-five|twenty-eight|twenty|twelve|three|thirty-two|thirty-three|thirty-six|thirty-seven|thirty-one|thirty-nine|thirty-four|thirty-five|thirty-eight|thirty|thirteen|ten|sixty-two|sixty-three|sixty-six|sixty-seven|sixty-one|sixty-nine|sixty-four|sixty-five|sixty-eight|sixty|sixteen|six|seventy|seventeen|seven|one|nineteen|nine|fourteen|four|forty-two|forty-three|forty-six|forty-seven|forty-one|forty-nine|forty-four|forty-five|forty-eight|forty|five|fifty-two|fifty-three|fifty-six|fifty-seven|fifty-one|fifty-nine|fifty-four|fifty-five|fifty-eight|fifty|fifteen|eleven|eighteen|eightonlythirtyandfiftyare not matchedBut sorted ascending to
eight|eighteen|eleven|fifteen|fifty|fifty-eight|fifty-five|fifty-four|fifty-nine|fifty-one|fifty-seven|fifty-six|fifty-three|fifty-two|five|forty|forty-eight|forty-five|forty-four|forty-nine|forty-one|forty-seven|forty-six|forty-three|forty-two|four|fourteen|nine|nineteen|one|seven|seventeen|seventy|six|sixteen|sixty|sixty-eight|sixty-five|sixty-four|sixty-nine|sixty-one|sixty-seven|sixty-six|sixty-three|sixty-two|ten|thirteen|thirty|thirty-eight|thirty-five|thirty-four|thirty-nine|thirty-one|thirty-seven|thirty-six|thirty-three|thirty-two|three|twelve|twenty|twenty-eight|twenty-five|twenty-four|twenty-nine|twenty-one|twenty-seven|twenty-six|twenty-three|twenty-two|twogives no improvement over the unsorted case.If the regex is parenthesed to
(one)|(two)|(three)|(four)|(five)|(six)|(seven)|(eight)|(nine)|(ten)|(eleven)|(twelve)|(thirteen)|(fourteen)|(fifteen)|(sixteen)|(seventeen)|(eighteen)|(nineteen)|(twenty)|(twenty-one)|(twenty-two)|(twenty-three)|(twenty-four)|(twenty-five)|(twenty-six)|(twenty-seven)|(twenty-eight)|(twenty-nine)|(thirty)|(thirty-one)|(thirty-two)|(thirty-three)|(thirty-four)|(thirty-five)|(thirty-six)|(thirty-seven)|(thirty-eight)|(thirty-nine)|(forty)|(forty-one)|(forty-two)|(forty-three)|(forty-four)|(forty-five)|(forty-six)|(forty-seven)|(forty-eight)|(forty-nine)|(fifty)|(fifty-one)|(fifty-two)|(fifty-three)|(fifty-four)|(fifty-five)|(fifty-six)|(fifty-seven)|(fifty-eight)|(fifty-nine)|(sixty)|(sixty-one)|(sixty-two)|(sixty-three)|(sixty-four)|(sixty-five)|(sixty-six)|(sixty-seven)|(sixty-eight)|(sixty-nine)|(seventy)it works in all cases. But with non-capturing parentheses(?:one)|(?:two)|(?:three)|(?:four)|(?:five)|(?:six)|(?:seven)|(?:eight)|(?:nine)|(?:ten)|(?:eleven)|(?:twelve)|(?:thirteen)|(?:fourteen)|(?:fifteen)|(?:sixteen)|(?:seventeen)|(?:eighteen)|(?:nineteen)|(?:twenty)|(?:twenty-one)|(?:twenty-two)|(?:twenty-three)|(?:twenty-four)|(?:twenty-five)|(?:twenty-six)|(?:twenty-seven)|(?:twenty-eight)|(?:twenty-nine)|(?:thirty)|(?:thirty-one)|(?:thirty-two)|(?:thirty-three)|(?:thirty-four)|(?:thirty-five)|(?:thirty-six)|(?:thirty-seven)|(?:thirty-eight)|(?:thirty-nine)|(?:forty)|(?:forty-one)|(?:forty-two)|(?:forty-three)|(?:forty-four)|(?:forty-five)|(?:forty-six)|(?:forty-seven)|(?:forty-eight)|(?:forty-nine)|(?:fifty)|(?:fifty-one)|(?:fifty-two)|(?:fifty-three)|(?:fifty-four)|(?:fifty-five)|(?:fifty-six)|(?:fifty-seven)|(?:fifty-eight)|(?:fifty-nine)|(?:sixty)|(?:sixty-one)|(?:sixty-two)|(?:sixty-three)|(?:sixty-four)|(?:sixty-five)|(?:sixty-six)|(?:sixty-seven)|(?:sixty-eight)|(?:sixty-nine)|(?:seventy)there is no improvementOr with dates
jan|january|feb|february|mar|march|apr|april|may|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|decemberDoes not match any of
mar jun julSee issue #44. Commenting out the Prefix factoring resolves the disjunction issues above, i,e
and the months work.
This issue was too big. It is hard to see which test cases have been fixed and which ones are still failing.
I have opened a separate issue for each group of test cases: #65 #67 #68 #69 #70 #72 #73 #74