XQuery test suite false negative

benibela commented

2015-12-30 14:05:59 +00:00

(Migrated from github.com)

The XQTS knows the following regexs that should match, but do not:

f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches) ;


f := TFLRE.Create('^$', [rfMULTILINE]);
f.ExtractAll(LineEnding+'abcd'+LineEnding+'defg',matches) ;

f := TFLRE.Create('^$', [rfMULTILINE]);
f.ExtractAll(LineEnding+'abcd'+LineEnding+LineEnding+'defg',matches) ;

The XQTS knows the following regexs that should match, but do not: ``` f := TFLRE.Create('(124|864|377|3)', []); f.ExtractAll('abracadabra-abracadabra-3',matches) ; f := TFLRE.Create('^$', [rfMULTILINE]); f.ExtractAll(LineEnding+'abcd'+LineEnding+'defg',matches) ; f := TFLRE.Create('^$', [rfMULTILINE]); f.ExtractAll(LineEnding+'abcd'+LineEnding+LineEnding+'defg',matches) ; ```

benibela commented

2015-12-30 14:16:25 +00:00

(Migrated from github.com)

The Kelvin sign K should match k or K in case insensitive mode

And there should be the following Unicode classes:
\P{C}
\P{Cc}
\P{L}
\p{Lu}
\P{Lu}
\P{M}
\P{Me}
\P{N}
\P{No}
\P{P}
\P{Pf}
\P{S}
\P{Sk}
\P{Z}
\P{Zs}

The Kelvin sign K should match k or K in case insensitive mode And there should be the following Unicode classes: \P{C} \P{Cc} \P{L} \p{Lu} \P{Lu} \P{M} \P{Me} \P{N} \P{No} \P{P} \P{Pf} \P{S} \P{Sk} \P{Z} \P{Zs}

benibela commented

2015-12-30 14:29:31 +00:00

(Migrated from github.com)

And some more that should have matched

<test id="p86" regex="^" input="abc" result="y"/>
<test id="p87" regex="$" input="abc" result="y"/>
<test id="p205" regex="a*" input="" result="y"/>
<test id="p220" regex="(abc|)ef" input="abcdef" result="y"/>
<test id="p282" regex="([a-c]*)\1" input="abcabc" result="y"/>
<test id="p294" regex="(a)|\1" input="x" result="y"/>
<test id="p295" regex="(?:(b)?a)\1" input="a" result="y"/>
<test id="p341" regex="^" flags="i" input="ABC" result="y"/>
<test id="p342" regex="$" flags="i" input="ABC" result="y"/>
<test id="p394" regex="(a+|b){0,1}?" flags="i" input="AB" result="y"/>
<test id="p398" regex="a*" flags="i" input="" result="y"/>
<test id="p409" regex="(abc|)ef" flags="i" input="ABCDEF" result="y"/>
<test id="p489" regex="^(){3,5}" input="abc" result="y"/>
<test id="p662" regex="$" input="a\nb\n" result="y"/>
<test id="p665" regex="$" input="b\na\n" result="y"/>
<test id="p668" regex="$" input="b\na" result="y"/>
<test id="p671" regex="$" flags="m" input="a&#xA;b&#xA;" result="y"/>
<test id="p674" regex="$" flags="m" input="b&#xA;a&#xA;" result="y"/>
<test id="p677" regex="$" flags="m" input="b&#xA;a" result="y"/>
<test id="p908" regex="(\w)?(abc)\1b" input="abcab" result="y"/>
<test id="p955" regex="(.*)\d+\1" input="abc12bc" result="y"/>
<test id="p995" regex="^.{3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p996" regex="^(?:f|o|b){3,4}(.+)\1$" input="foobarbar" result="y"/>
<test id="p997" regex="^.{3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p998" regex="^(?:f|o|b){3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p999" regex="^.{3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1000" regex="^(?:f|o|b){3,4}(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1001" regex="^.{3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1002" regex="^(?:f|o|b){3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1003" regex="^.{2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1004" regex="^(?:f|o|b){2,3}?(.+)\1$" input="foobarbar" result="y"/>
<test id="p1005" regex="^.{2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1006" regex="^(?:f|o|b){2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/>
<test id="p1007" regex="^.{2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1008" regex="^(?:f|o|b){2,3}?(.+?)\1$" input="foobarbar" result="y"/>
<test id="p1009" regex="^.{2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1010" regex="^(?:f|o|b){2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/>
<test id="p1016" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1035" regex="X(A|B||C|D)Y" input="XXXYYY" result="y"/>
<test id="p1042" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1043" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1044" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1045" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1046" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1047" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1048" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1049" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1050" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1051" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/>
<test id="p1052" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/>
<test id="p1053" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/>
<test id="p1228" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/>
<test id="p1236" regex="(WORDS|WORLD|WORD)+S" input="WORDS" result="y"/>
<test id="p1601" regex="^\p{L}" input="㐀" result="y"/>
<test id="p1684" regex="^(foo|)bar$" input="bar" result="y"/>
<test id="p1685" regex="^(foo||baz)bar$" input="bar" result="y"/>
<test id="p1689" regex="^(?:foo|)bar$" input="bar" result="y"/>
<test id="p1690" regex="^(?:foo||baz)bar$" input="bar" result="y"/>
<test id="p1705" regex="^(.)(?:(.)+)*[BX]" input="ABCDE" result="y"/>

And some more that should have matched ``` <test id="p86" regex="^" input="abc" result="y"/> <test id="p87" regex="$" input="abc" result="y"/> <test id="p205" regex="a*" input="" result="y"/> <test id="p220" regex="(abc|)ef" input="abcdef" result="y"/> <test id="p282" regex="([a-c]*)\1" input="abcabc" result="y"/> <test id="p294" regex="(a)|\1" input="x" result="y"/> <test id="p295" regex="(?:(b)?a)\1" input="a" result="y"/> <test id="p341" regex="^" flags="i" input="ABC" result="y"/> <test id="p342" regex="$" flags="i" input="ABC" result="y"/> <test id="p394" regex="(a+|b){0,1}?" flags="i" input="AB" result="y"/> <test id="p398" regex="a*" flags="i" input="" result="y"/> <test id="p409" regex="(abc|)ef" flags="i" input="ABCDEF" result="y"/> <test id="p489" regex="^(){3,5}" input="abc" result="y"/> <test id="p662" regex="$" input="a\nb\n" result="y"/> <test id="p665" regex="$" input="b\na\n" result="y"/> <test id="p668" regex="$" input="b\na" result="y"/> <test id="p671" regex="$" flags="m" input="a
b
" result="y"/> <test id="p674" regex="$" flags="m" input="b
a
" result="y"/> <test id="p677" regex="$" flags="m" input="b
a" result="y"/> <test id="p908" regex="(\w)?(abc)\1b" input="abcab" result="y"/> <test id="p955" regex="(.*)\d+\1" input="abc12bc" result="y"/> <test id="p995" regex="^.{3,4}(.+)\1$" input="foobarbar" result="y"/> <test id="p996" regex="^(?:f|o|b){3,4}(.+)\1$" input="foobarbar" result="y"/> <test id="p997" regex="^.{3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/> <test id="p998" regex="^(?:f|o|b){3,4}((?:b|a|r)+)\1$" input="foobarbar" result="y"/> <test id="p999" regex="^.{3,4}(.+?)\1$" input="foobarbar" result="y"/> <test id="p1000" regex="^(?:f|o|b){3,4}(.+?)\1$" input="foobarbar" result="y"/> <test id="p1001" regex="^.{3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/> <test id="p1002" regex="^(?:f|o|b){3,4}((?:b|a|r)+?)\1$" input="foobarbar" result="y"/> <test id="p1003" regex="^.{2,3}?(.+)\1$" input="foobarbar" result="y"/> <test id="p1004" regex="^(?:f|o|b){2,3}?(.+)\1$" input="foobarbar" result="y"/> <test id="p1005" regex="^.{2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/> <test id="p1006" regex="^(?:f|o|b){2,3}?((?:b|a|r)+)\1$" input="foobarbar" result="y"/> <test id="p1007" regex="^.{2,3}?(.+?)\1$" input="foobarbar" result="y"/> <test id="p1008" regex="^(?:f|o|b){2,3}?(.+?)\1$" input="foobarbar" result="y"/> <test id="p1009" regex="^.{2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/> <test id="p1010" regex="^(?:f|o|b){2,3}?((?:b|a|r)+?)\1$" input="foobarbar" result="y"/> <test id="p1016" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/> <test id="p1035" regex="X(A|B||C|D)Y" input="XXXYYY" result="y"/> <test id="p1042" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1043" regex="^(XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/> <test id="p1044" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1045" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/> <test id="p1046" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1047" regex="^([TUV]+|XXXXXXXXXX|YYYYYYYYYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/> <test id="p1048" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1049" regex="^(XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/> <test id="p1050" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1051" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P):" input="ZEQQQX:" result="y"/> <test id="p1052" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQQQQQQQQQQQQQQQQP:" result="y"/> <test id="p1053" regex="^([TUV]+|XXX|YYY|Z.Q*X|Z[TE]Q*P|[MKJ]):" input="ZEQQQX:" result="y"/> <test id="p1228" regex="(WORDS|WORLD|WORD)S" input="WORDS" result="y"/> <test id="p1236" regex="(WORDS|WORLD|WORD)+S" input="WORDS" result="y"/> <test id="p1601" regex="^\p{L}" input="㐀" result="y"/> <test id="p1684" regex="^(foo|)bar$" input="bar" result="y"/> <test id="p1685" regex="^(foo||baz)bar$" input="bar" result="y"/> <test id="p1689" regex="^(?:foo|)bar$" input="bar" result="y"/> <test id="p1690" regex="^(?:foo||baz)bar$" input="bar" result="y"/> <test id="p1705" regex="^(.)(?:(.)+)*[BX]" input="ABCDE" result="y"/> ```

BeRo1985 commented

2015-12-30 23:52:04 +00:00

(Migrated from github.com)

The ^$ stuff should work now. Please retest it.

And back-references are only partially supported, because full support for irregular expression back-references does need a backtracking algorithm, and FLRE avoids backtracking as much as possible. More details see this followed article series (whereupon FLRE was also modeled): https://swtch.com/~rsc/regexp/regexp1.html https://swtch.com/~rsc/regexp/regexp2.html and https://swtch.com/~rsc/regexp/regexp3.html

And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.

The ^$ stuff should work now. Please retest it. And back-references are only partially supported, because full support for **irregular expression** back-references does need a backtracking algorithm, and FLRE avoids backtracking as much as possible. More details see this followed article series (whereupon FLRE was also modeled): https://swtch.com/~rsc/regexp/regexp1.html https://swtch.com/~rsc/regexp/regexp2.html and https://swtch.com/~rsc/regexp/regexp3.html And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.

benibela commented

2015-12-31 13:11:01 +00:00

(Migrated from github.com)

The ^$ stuff should work now. Please retest it.

Now it complains that it matches too much

And back-references are only partially supported, because full support for irregular expression

It is not really that irregular.

Seems mostly to be that if the capture group is optional and does not match anything in the input, the backreference should match the empty string. For that you could even remove both of them during the expansion

And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly.

here

github: "Unfortunately, we don’t support that file type. " ...

here as png

> The ^$ stuff should work now. Please retest it. Now it complains that it matches too much > And back-references are only partially supported, because full support for irregular expression It is not really that irregular. Seems mostly to be that if the capture group is optional and does not match anything in the input, the backreference should match the empty string. For that you could even remove both of them during the expansion > And please provide complete but minimal test case source codes (in single dpr or pas files), so that I can test&debug it quickly. here github: "Unfortunately, we don’t support that file type. " ... here as png ![flrematches](https://cloud.githubusercontent.com/assets/1205129/12064629/4249cf72-afc8-11e5-8a92-a7d7b014e812.png)

benibela commented

2015-12-31 13:31:13 +00:00

(Migrated from github.com)

And the list of missing Unicode classes:
\p{C}
\p{Cc}
\p{Cf}
\p{Cn}
\p{Co}
\p{IsAlphabeticPresentationForms}
\p{IsArabic}
\p{IsArmenian}
\p{IsArrows}
\p{IsBasicLatin}
\p{IsBengali}
\p{IsBlockElements}
\p{IsBopomofo}
\p{IsBopomofoExtended}
\p{IsBoxDrawing}
\p{IsBraillePatterns}
\p{IsByzantineMusicalSymbols}
\p{IsCherokee}
\p{IsCJKCompatibility}
\p{IsCJKCompatibilityForms}
\p{IsCJKCompatibilityIdeographs}
\p{IsCJKCompatibilityIdeographsSupplement}
\p{IsCJKRadicalsSupplement}
\p{IsCJKSymbolsandPunctuation}
\p{IsCJKUnifiedIdeographs}
\p{IsCJKUnifiedIdeographsExtensionA}
\p{IsCJKUnifiedIdeographsExtensionB}
\p{IsCombiningDiacriticalMarks}
\p{IsCombiningHalfMarks}
\p{IsCombiningMarksforSymbols}
\p{IsControlPictures}
\p{IsCurrencySymbols}
\p{IsCyrillic}
\p{IsDeseret}
\p{IsDevanagari}
\p{IsDingbats}
\p{IsEnclosedAlphanumerics}
\p{IsEnclosedCJKLettersandMonths}
\p{IsEthiopic}
\p{IsGeneralPunctuation}
\p{IsGeometricShapes}
\p{IsGeorgian}
\p{IsGothic}
\p{IsGreek}
\p{IsGreekExtended}
\p{IsGujarati}
\p{IsGurmukhi}
\p{IsHalfwidthandFullwidthForms}
\p{IsHangulCompatibilityJamo}
\p{IsHangulJamo}
\p{IsHangulSyllables}
\p{IsHebrew}
\p{IsHighSurrogates}
\p{IsHiragana}
\p{IsIdeographicDescriptionCharacters}
\p{IsIPAExtensions}
\p{IsKanbun}
\p{IsKangxiRadicals}
\p{IsKannada}
\p{IsKatakana}
\p{IsKhmer}
\p{IsLao}
\p{IsLatinExtendedAdditional}
\p{IsLetterlikeSymbols}
\p{IsLowSurrogates}
\p{IsMalayalam}
\p{IsMathematicalAlphanumericSymbols}
\p{IsMathematicalOperators}
\p{IsMiscellaneousSymbols}
\p{IsMiscellaneousTechnical}
\p{IsMongolian}
\p{IsMusicalSymbols}
\p{IsMyanmar}
\p{IsNumberForms}
\p{IsOgham}
\p{IsOldItalic}
\p{IsOpticalCharacterRecognition}
\p{IsOriya}
\p{IsPrivateUse}
\p{IsRunic}
\p{IsSinhala}
\p{IsSmallFormVariants}
\p{IsSpacingModifierLetters}
\p{IsSpecials}
\p{IsSuperscriptsandSubscripts}
\p{IsSyriac}
\p{IsTags}
\p{IsTamil}
\p{IsTelugu}
\p{IsThaana}
\p{IsThai}
\p{IsTibetan}
\p{IsUnifiedCanadianAboriginalSyllabics}
\p{IsYiRadicals}
\p{IsYiSyllables}
\p{L}
\p{Ll}
\p{Lm}
\p{Lo}
\p{Lt}
\p{Lu}
\p{M}
\p{Mc}
\p{Me}
\p{Mn}
\p{N}
\p{Nd}
\p{Nl}
\p{No}
\p{P}
\p{Pc}
\p{Pd}
\p{Pe}
\p{Pf}
\p{Pi}
\p{Po}
\p{Ps}
\p{S}
\p{Sc}
\p{Sk}
\p{Sm}
\p{So}
\p{Z}
\p{Zl}
\p{Zp}
\p{Zs}

And the list of missing Unicode classes: \p{C} \p{Cc} \p{Cf} \p{Cn} \p{Co} \p{IsAlphabeticPresentationForms} \p{IsArabic} \p{IsArmenian} \p{IsArrows} \p{IsBasicLatin} \p{IsBengali} \p{IsBlockElements} \p{IsBopomofo} \p{IsBopomofoExtended} \p{IsBoxDrawing} \p{IsBraillePatterns} \p{IsByzantineMusicalSymbols} \p{IsCherokee} \p{IsCJKCompatibility} \p{IsCJKCompatibilityForms} \p{IsCJKCompatibilityIdeographs} \p{IsCJKCompatibilityIdeographsSupplement} \p{IsCJKRadicalsSupplement} \p{IsCJKSymbolsandPunctuation} \p{IsCJKUnifiedIdeographs} \p{IsCJKUnifiedIdeographsExtensionA} \p{IsCJKUnifiedIdeographsExtensionB} \p{IsCombiningDiacriticalMarks} \p{IsCombiningHalfMarks} \p{IsCombiningMarksforSymbols} \p{IsControlPictures} \p{IsCurrencySymbols} \p{IsCyrillic} \p{IsDeseret} \p{IsDevanagari} \p{IsDingbats} \p{IsEnclosedAlphanumerics} \p{IsEnclosedCJKLettersandMonths} \p{IsEthiopic} \p{IsGeneralPunctuation} \p{IsGeometricShapes} \p{IsGeorgian} \p{IsGothic} \p{IsGreek} \p{IsGreekExtended} \p{IsGujarati} \p{IsGurmukhi} \p{IsHalfwidthandFullwidthForms} \p{IsHangulCompatibilityJamo} \p{IsHangulJamo} \p{IsHangulSyllables} \p{IsHebrew} \p{IsHighSurrogates} \p{IsHiragana} \p{IsIdeographicDescriptionCharacters} \p{IsIPAExtensions} \p{IsKanbun} \p{IsKangxiRadicals} \p{IsKannada} \p{IsKatakana} \p{IsKhmer} \p{IsLao} \p{IsLatinExtendedAdditional} \p{IsLetterlikeSymbols} \p{IsLowSurrogates} \p{IsMalayalam} \p{IsMathematicalAlphanumericSymbols} \p{IsMathematicalOperators} \p{IsMiscellaneousSymbols} \p{IsMiscellaneousTechnical} \p{IsMongolian} \p{IsMusicalSymbols} \p{IsMyanmar} \p{IsNumberForms} \p{IsOgham} \p{IsOldItalic} \p{IsOpticalCharacterRecognition} \p{IsOriya} \p{IsPrivateUse} \p{IsRunic} \p{IsSinhala} \p{IsSmallFormVariants} \p{IsSpacingModifierLetters} \p{IsSpecials} \p{IsSuperscriptsandSubscripts} \p{IsSyriac} \p{IsTags} \p{IsTamil} \p{IsTelugu} \p{IsThaana} \p{IsThai} \p{IsTibetan} \p{IsUnifiedCanadianAboriginalSyllabics} \p{IsYiRadicals} \p{IsYiSyllables} \p{L} \p{Ll} \p{Lm} \p{Lo} \p{Lt} \p{Lu} \p{M} \p{Mc} \p{Me} \p{Mn} \p{N} \p{Nd} \p{Nl} \p{No} \p{P} \p{Pc} \p{Pd} \p{Pe} \p{Pf} \p{Pi} \p{Po} \p{Ps} \p{S} \p{Sc} \p{Sk} \p{Sm} \p{So} \p{Z} \p{Zl} \p{Zp} \p{Zs}

BeRo1985 commented

2016-01-01 03:22:20 +00:00

(Migrated from github.com)

The Unicode classes were not missing, instead the code for the \p{} parsing had just a bug, which I've fixed now.

benibela commented

2016-01-01 13:48:52 +00:00

(Migrated from github.com)

Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}

And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match

Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols} And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match

benibela commented

2016-01-01 14:03:06 +00:00

(Migrated from github.com)

Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive.

Is there an option to make the escapes independent of the case sensitivity? So `\p{Lu}` only matches upper case letters, even if matched case insensitive.

BeRo1985 commented

2016-01-01 16:55:01 +00:00

(Migrated from github.com)

Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols}

And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match

The InBasicLatin parsing bug is fixed now. And I'll add IsCombiningMarksforSymbols later.

But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). The whole FLRE unicode support stuff is automatically converted from these offical Unicode data files, see src\Unicode\FLREConvertUnicode.dpr for the OfficalUnicodeResourceData->FLREData-converter-code.

Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive.

Hm just use (?-i)\p{Lu}(?i) ? :)

> Nice. It still needs \p{IsBasicLatin} and p{IsCombiningMarksforSymbols} > > And the test cases want unassigned characters to match (e.g. \u0530 for \p{IsArmenian}), but perhaps it is better, if they do not match The InBasicLatin parsing bug is fixed now. And I'll add IsCombiningMarksforSymbols later. But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). The whole FLRE unicode support stuff is automatically converted from these offical Unicode data files, see src\Unicode\FLREConvertUnicode.dpr for the OfficalUnicodeResourceData->FLREData-converter-code. > Is there an option to make the escapes independent of the case sensitivity? So \p{Lu} only matches upper case letters, even if matched case insensitive. Hm just use `(?-i)\p{Lu}(?i)` ? :)

benibela commented

2016-01-01 18:17:38 +00:00

(Migrated from github.com)

But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server).

What other stuff?

Armenian is there. Range, 1328;ToChar:1423. And \p{IsArmenian} works for all the assigned code points https://en.wikipedia.org/wiki/Armenian_%28Unicode_block%29 except 0x589 (Armenian fullstop).

Hm just use (?-i)\p{Lu}(?i) ? :)

That would make sense. Guess a preprocessor could output that, even if it would not allow it as input

But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut

> But the other stuff I couldn't find in the offical Unicode 8.0 resource data (eg. in DerivedGeneralCategory.txt and UnicodeData.txt of the Unicode ftp server). What other stuff? Armenian is there. Range, 1328;ToChar:1423. And \p{IsArmenian} works for all the assigned code points https://en.wikipedia.org/wiki/Armenian_%28Unicode_block%29 _except_ 0x589 (Armenian fullstop). > Hm just use (?-i)\p{Lu}(?i) ? :) That would make sense. Guess a preprocessor could output that, even if it would not allow it as input But can this work with backreferences? I.e.`(s(?-i)\p{Lu}(?i)t)\1` should still match `sUtsut`

benibela commented

2016-01-01 18:18:06 +00:00

(Migrated from github.com)

armenian.txt

[armenian.txt](https://github.com/BeRo1985/flre/files/75992/armenian.txt)

benibela commented

2016-01-01 18:27:36 +00:00

(Migrated from github.com)

0x589 (Armenian fullstop).

Perhaps because that is in the Armerian block, but the Common script

> 0x589 (Armenian fullstop). Perhaps because that is in the Armerian block, but the Common script

BeRo1985 commented

2016-01-01 21:24:59 +00:00

(Migrated from github.com)

But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut

Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.

For example: (s[uU]t)\1 will transformed and compiled internally into (s[uU]t)(?hidden-internal-capture-group:clone-capturing-group-1:s[uU]t)(?hidden-internal-verify-operation:is-capture-group-1-equal-to-hidden-internal-capture-group-1) .

This is necessary, because, as already said, FLRE avoids backtracking as possible (except at the BitstateNFA algorithm for short and simple non-complex regexs), and the Thompson-NFA (Parallel-Threaded-NFA) algorithm can emulate backreferences only in that hacky way, since the Thompson-NFA (Parallel-Threaded-NFA) algorithm in FLRE does never backtrack instead it processes every input char exactly once. And the DFA code in FLRE sees (s[uU]t)\1 just only as s[uU]ts[uU]t and leaves it to the non-backtracking *NFA algorithms the final verification.

Full real support for back-references would do need a full back-tracking-algorithm implementation, and back-tracking can be very slow and is even too vulnerable to ReDoS attacks (Regular expression Denial of Service, see https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS and https://en.wikipedia.org/wiki/ReDoS and even https://swtch.com/~rsc/regexp/regexp1.html ), therefore FLRE avoids back-tracking as possible, and this consequently FLRE has only emulated partial hacky support for back-references.

And as a small side-note: the Google RE2 regexp engine (which was the role model for FLRE) doesn't support back-references at all for these reasons.

Any questions left regarding the back-references topic? :)

> But can this work with backreferences? I.e.(s(?-i)\p{Lu}(?i)t)\1 should still match sUtsut Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering. For example: `(s[uU]t)\1` will transformed and compiled internally into `(s[uU]t)(?hidden-internal-capture-group:clone-capturing-group-1:s[uU]t)(?hidden-internal-verify-operation:is-capture-group-1-equal-to-hidden-internal-capture-group-1)` . This is necessary, because, as already said, FLRE avoids backtracking as possible (except at the BitstateNFA algorithm for short and simple non-complex regexs), and the Thompson-NFA (Parallel-Threaded-NFA) algorithm can emulate backreferences only in that hacky way, since the Thompson-NFA (Parallel-Threaded-NFA) algorithm in FLRE does never backtrack instead it processes every input char exactly once. And the DFA code in FLRE sees `(s[uU]t)\1` just only as `s[uU]ts[uU]t` and leaves it to the non-backtracking *NFA algorithms the final verification. Full real support for back-references would do need a full back-tracking-algorithm implementation, and back-tracking can be very slow and is even too vulnerable to ReDoS attacks (Regular expression Denial of Service, see https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS and https://en.wikipedia.org/wiki/ReDoS and even https://swtch.com/~rsc/regexp/regexp1.html ), therefore FLRE avoids back-tracking as possible, and this consequently FLRE has only emulated partial hacky support for back-references. And as a small side-note: the Google RE2 regexp engine (which was the role model for FLRE) doesn't support back-references at all for these reasons. Any questions left regarding the back-references topic? :)

benibela commented

2016-01-01 21:43:30 +00:00

(Migrated from github.com)

Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering.

Sounds complicated

Perhaps it could remove the ?i when cloning the group?

and back-tracking can be very slow

I do not really need the speed. It just needs to pass the test cases that have require this specific syntax.

> Since FLRE avoids back-tracking as possible, FLRE is emulating back-references by cloning sub byte-code instruction sequences of capturing groups to hidden internal capture groups and adding a hidden-internal-capturing-group-comparison-verification-operation-instruction, where hidden internal capturing groups have their own internal count-numbering. Sounds complicated Perhaps it could remove the ?i when cloning the group? > and back-tracking can be very slow I do not really need the speed. It just needs to pass the test cases that have require this specific syntax.

BeRo1985 commented

2016-01-01 22:06:00 +00:00

(Migrated from github.com)

My suggestion (and my offer for you) would be a syntax like (s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) or something like that, where (?P=x:y) (as a extension of the normal alternative (?P=x) backreference syntax) would be a compare-to-another-capturing-group capturing-group, where x = compare-with-capturing-group-index-or-named-group-name and y = manual-by-hand cloned but modified regex. Anything else would mean more effort and complicate the code, I think.

benibela commented

2016-01-01 22:34:47 +00:00

(Migrated from github.com)

that would be great

benibela commented

2016-01-01 23:41:56 +00:00

(Migrated from github.com)

Another one that should match (using the matches function from the above png):

  matches('abc, foobar','(^|) *abc *(,|$)')

Another one that should match (using the matches function from the above png): ``` matches('abc, foobar','(^|) *abc *(,|$)') ```

BeRo1985 commented

2016-01-02 04:39:40 +00:00

(Migrated from github.com)

(?i)(s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t) and (^|) *abc *(,|$) should work now. I've improved the beginning-and-ending-text-anchor-detection for to disable some speed optimizations then, and that had fixed at least (^|) *abc *(,|$).

`(?i)(s(?-i)\p{Lu}(?i)t)(?P=1:s\p{Lu}t)` and `(^|) *abc *(,|$)` should work now. I've improved the beginning-and-ending-text-anchor-detection for to disable some speed optimizations then, and that had fixed at least `(^|) *abc *(,|$)`.

benibela commented

2016-01-02 11:01:31 +00:00

(Migrated from github.com)

Thanks. Now I need to figure out how to combine it with my preprocessor

benibela commented

2016-01-02 16:58:04 +00:00

(Migrated from github.com)

If there are parenthesis inside the (?P=..) does that create new capturing groups?

Tried to test it, but found this instead:

 matches('aaxyy', '([a-c])\1(x)(y)\3', '');
 matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');

If there are parenthesis inside the (?P=..) does that create new capturing groups? Tried to test it, but found this instead: ``` matches('aaxyy', '([a-c])\1(x)(y)\3', ''); matches('aAxyy', '([a-c])\1(x)(y)\3', 'i'); ```

BeRo1985 commented

2016-01-02 18:03:21 +00:00

(Migrated from github.com)

If there are parenthesis inside the (?P=..) does that create new capturing groups?

Hm yes, you can use here non-capturing-groups (?:) for example (?:\w+)

Tried to test it, but found this instead:

matches('aaxyy', '([a-c])\1(x)(y)\3', '');
matches('aAxyy', '([a-c])\1(x)(y)\3', 'i');

Should fixed now in the lastest commit.

> If there are parenthesis inside the (?P=..) does that create new capturing groups? Hm yes, you can use here non-capturing-groups (?:) for example `(?:\w+)` > Tried to test it, but found this instead: > > matches('aaxyy', '([a-c])\1(x)(y)\3', ''); > matches('aAxyy', '([a-c])\1(x)(y)\3', 'i'); Should fixed now in the lastest commit.

benibela commented

2016-01-02 18:36:24 +00:00

(Migrated from github.com)

The InBasicLatin parsing bug is fixed now.

It was IsBasicLatin in this test

Although it seems it gives both. Some even have \p{script=...} or \p{block=...}

> The InBasicLatin parsing bug is fixed now. It was IsBasicLatin in this test Although it seems it gives both. Some even have \p{script=...} or \p{block=...}

benibela commented

2016-01-23 13:35:21 +00:00

(Migrated from github.com)

You are aware that it is still failing the abracadabra-abracadabra-3 case, etc. ?

You are aware that it is still failing the `abracadabra-abracadabra-3` case, etc. ?

Reino17 commented

2016-02-14 23:29:12 +00:00

(Migrated from github.com)

Hello BeRo1985,

I'm one of Xidel's users; benibela's project that seems to using your FLRE-library.
He's dead set on updating Xidel's regular expression engine before fixing other bugs and releasing another update at all, but relies on you to do so.
If you would fix benibela's reported issues, besides helping benibela you would also help me in the process.
Thanks!

Hello BeRo1985, I'm one of Xidel's users; benibela's project that seems to using your FLRE-library. He's dead set on updating Xidel's regular expression engine before fixing other bugs and releasing another update at all, but relies on you to do so. If you would fix benibela's reported issues, besides helping benibela you would also help me in the process. Thanks!

benibela commented

2016-02-15 20:44:11 +00:00

(Migrated from github.com)

Perhaps i should close this general issue and open new ones for every sub issue.

Might be easier to handle

Perhaps i should close this general issue and open new ones for every sub issue. Might be easier to handle

benibela commented

2016-04-10 14:22:47 +00:00

(Migrated from github.com)

f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches) ;

It fails with most common prefixes and none of six seven twenty thirty forty fifty sixty are matched.

benibela commented

2016-04-10 17:32:25 +00:00

(Migrated from github.com)

Or with dates jan|january|feb|february|mar|march|apr|april|may|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|december

Does not match any of mar jun jul

Or with dates `jan|january|feb|february|mar|march|apr|april|may|may|jun|june|jul|july|aug|august|sep|september|oct|october|nov|november|dec|december` Does not match any of `mar jun jul`

pyscripter commented

2018-04-16 23:16:37 +00:00

(Migrated from github.com)

See issue #44. Commenting out the Prefix factoring resolves the disjunction issues above, i,e

f := TFLRE.Create('(124|864|377|3)', []);
f.ExtractAll('abracadabra-abracadabra-3',matches)

and the months work.

See issue #44. Commenting out the Prefix factoring resolves the disjunction issues above, i,e ``` f := TFLRE.Create('(124|864|377|3)', []); f.ExtractAll('abracadabra-abracadabra-3',matches) ``` and the months work.

benibela commented

2021-05-08 16:14:37 +00:00

(Migrated from github.com)

This issue was too big. It is hard to see which test cases have been fixed and which ones are still failing.

I have opened a separate issue for each group of test cases: #65 #67 #68 #69 #70 #72 #73 #74

This issue was too big. It is hard to see which test cases have been fixed and which ones are still failing. I have opened a separate issue for each group of test cases: #65 #67 #68 #69 #70 #72 #73 #74

Rows
Columns

XQuery test suite false negative #10