pattern facets in Relax NG
Ken Beesley <ken.beesley <at> xrce.xerox.com>
2005-02-16 10:35:06 GMT
RE: pattern facets in Relax NG
I was very pleased to find that one can specify multiple
regular-expression 'pattern' facets, e.g.
Xtext = element xtext { xsd:string { pattern="..." pattern="..." }}
and that all of them have to be satisfied, and that nxml-mode
properly red-lines strings that aren't valid, as you type. Very nice.
Has any thought already been given to the following possibilities,
or can they already be done somehow?
1. Implement a negpattern="..." facet,
the complement of pattern. E.g.
Xtext = element xtext { xsd:string { pattern="..."
negpattern=".*z.*z.*" negpattern="zork|cumquat"}}
would allow any string matching some arbitrarily complex regular
expression, shown above as "...", but EXCLUDING any strings
that contain two 'z' letters,
and EXCLUDING the words "zork" and "cumquat".
Multiple patterns and negpatterns would all have to be satisfied
for validity. (Yes, I understand that the two negpatterns in the
example above could be unioned into one:
negpattern=".*z.*z.*|zork|cumquat".)
For non-trivial matching, it's often much easier to write a
general pattern that overrecognizes
somewhat and then filter the language accepted with something like a
negpattern.
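
The intended semantics of the proposed negpattern facet can be sketched outside Relax NG. The Python below is only an illustration under stated assumptions: negpattern is the hypothetical facet proposed above, not something Relax NG or XSD provides; is_valid is a made-up helper name; and "[a-z]+" is a placeholder standing in for the elided "..." pattern. Whole-string matching (re.fullmatch) is used because XSD pattern facets are implicitly anchored at both ends.

```python
import re

def is_valid(text, patterns, negpatterns):
    """Return True if text matches every pattern and none of the negpatterns.

    Matching is against the whole string, mirroring the implicit anchoring
    of XSD pattern facets.
    """
    matches_all = all(re.fullmatch(p, text) for p in patterns)
    matches_excluded = any(re.fullmatch(n, text) for n in negpatterns)
    return matches_all and not matches_excluded

# Placeholder for the elided "..." pattern; an assumption for this sketch only.
PATTERNS = ["[a-z]+"]
# The two negpatterns from the example above.
NEGPATTERNS = [".*z.*z.*", "zork|cumquat"]

if __name__ == "__main__":
    for word in ["zebra", "pizza", "zork", "cumquat"]:
        print(word, is_valid(word, PATTERNS, NEGPATTERNS))
```

Under these assumptions "zebra" is accepted, "pizza" is rejected for containing two 'z' letters, and "zork" and "cumquat" are rejected as excluded words.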
2. Add support for the definition of non-trivial regular expressions
via string interpolation, as in Perl. Python has a similar
mechanism. Possible utility: The phonology and
orthography of some languages can be very predictable, with
words being composed of one or more syllables, and syllables
having a metapattern like CVC* (a consonant, followed by
a vowel, followed by zero or more consonants). The definition
of a "possible word" pattern would be much facilitated if one
could define and interpolate strings in a Perl-like manner, e.g.
$Con = "(p|t|k|q|h|v|m|n|s)" ;
$Vow = "[ieaou]" ;
$Syl = "($Con$Vow$Con*)" ;
$Word = "$Syl+" ;
Xtext = element xtext { xsd:string { pattern="$Word" } }
This is an artificially simple example, of course, but some real
languages (like Hopi) are not tremendously more complex
in what constitutes a phonologically/orthographically possible
word. But the final pattern is definitely non-trivial, and trying
to write it as one monolithic regular expression would be both
tedious and error-prone. String interpolation would allow
subparts of the pattern (like the definition of Consonant and Vowel)
to be changed and re-used consistently.
With the ability to construct complex patterns, and the
ability to exclude words via negpatterns, one could
reasonably build a kind of real-time Hopi spell checker
using Relax NG and nxml-mode. It would red-line orthographically
impossible words as you type, which could be very useful in teaching
the orthography.
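
For comparison, here is roughly how the same composition looks with Python's string interpolation. This is a sketch of the idea only, not a Relax NG feature: the names CON, VOW, SYL, WORD and is_possible_word are invented for the illustration, the consonant and vowel inventories are the artificial ones from the example above, and non-capturing groups (?:...) are used where the example has plain parentheses.

```python
import re

# Building blocks, mirroring $Con, $Vow, $Syl and $Word above.
CON = "(?:p|t|k|q|h|v|m|n|s)"
VOW = "[ieaou]"
SYL = f"(?:{CON}{VOW}{CON}*)"   # consonant, vowel, zero or more consonants
WORD = f"{SYL}+"                # one or more syllables

def is_possible_word(text):
    """Return True if the whole string is a sequence of well-formed syllables."""
    return re.fullmatch(WORD, text) is not None

if __name__ == "__main__":
    for candidate in ["pat", "tiki", "pata", "apt", "px"]:
        print(candidate, is_possible_word(candidate))
```

Changing CON or VOW in one place changes every pattern built from them, which is the consistency benefit described above.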
3. (lower priority) Definition of non-trivial regular expressions
might be facilitated for some people by implementing xpattern
and xnegpattern facets that work like pattern and negpattern,
respectively, but ignore any non-literalized whitespace in the
pattern. Similar to Perl's "x" option, e.g. matching with /pattern/x
or substitution with s/pattern/string/x.
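
Python's re.VERBOSE flag behaves much like Perl's "x" option and gives an idea of what an xpattern facet might feel like. The sketch below is an analogy only; xpattern remains the hypothetical facet proposed above, and the variable names are invented. It reuses the artificial syllable pattern from point 2 and also shows that an escaped (literalized) space is still matched.

```python
import re

# The same pattern written twice: once compactly, once spread over several
# lines with comments. Under re.VERBOSE, unescaped whitespace and #-comments
# in the pattern are ignored.
COMPACT = "(?:(?:p|t|k|q|h|v|m|n|s)[ieaou](?:p|t|k|q|h|v|m|n|s)*)+"

VERBOSE = r"""
    (?:
        (?:p|t|k|q|h|v|m|n|s)    # onset consonant
        [ieaou]                  # vowel
        (?:p|t|k|q|h|v|m|n|s)*   # zero or more coda consonants
    )+
"""

compact_re = re.compile(COMPACT)
verbose_re = re.compile(VERBOSE, re.VERBOSE)

# A backslash-escaped space is literal even in verbose mode.
two_words_re = re.compile(r"pat \  tik", re.VERBOSE)

if __name__ == "__main__":
    for candidate in ["tiki", "ti ki", "apt"]:
        print(candidate,
              bool(compact_re.fullmatch(candidate)),
              bool(verbose_re.fullmatch(candidate)))
    print(bool(two_words_re.fullmatch("pat tik")))
```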
*****************************************************
If such functionality is already available somehow, please
point me to it.
Thanks,
Ken Beesley