[oclug][OT] Quick Perl question
Brenda J. Butler
bjb at achilles.net
Mon Nov 11 13:29:33 EST 2002
On Mon, Nov 11, 2002 at 10:54:21AM -0500, Michael P. Soulier wrote:
> On 08/11/02 Roger Messier did speaketh:
>
> > I'm new to Perl, and I'm having trouble with a
> > regular expression. I basically want to match a
> > word, so I currently have /\b[a-zA-Z]\b/. The
> > problem is that I want to match any word boundary
> > *except* a hyphen, so that something like 'hello-'
> > or '-hello' would not match. The current version
> > matches because the \b matches on the hyphen.
>
> Really?
>
> main::(-e:1): 0
> DB<1> if ("-hello" =~ /\b[a-zA-Z]\b/) { print "match\n" } \
> cont: else { print "no match\n" }
> no match
>
> It doesn't for me, nor should it, considering that you've told it to match
> only alphabetic chars with word boundaries. A hyphen is not included in your
> [a-zA-Z] character class.
Well... it fails to match because there is no + after [a-zA-Z]. It
would match "-H" or "H-".
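Here is a minimal sketch to check that claim. The helper name
matches() and the sample strings are made up for illustration; only
the two patterns come from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str matches $re, else 0.
sub matches {
    my ($re, $str) = @_;
    return $str =~ $re ? 1 : 0;
}

my $single = qr/\b[a-zA-Z]\b/;    # original: exactly one letter between boundaries
my $plus   = qr/\b[a-zA-Z]+\b/;   # with +: one or more letters between boundaries

for my $s ('-hello', '-H', 'H-') {
    printf "%-8s single=%d plus=%d\n",
        $s, matches($single, $s), matches($plus, $s);
}
```

The one-letter pattern should reject "-hello" but accept "-H" and
"H-"; once the + is added, "-hello" matches too, which is the
behaviour Roger was seeing.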
Also, do you want to match
shilly-shally
but not
shilly-
or
-shally
?
And do you want to distinguish between
shilly-shally
and
shillyshally
?
Generally, if you don't want to use the existing definition for
the short-cut characters (such as \b) then you have to write out
the expression without it. It is painful and ugly, but possible.
So, to express something like
/\b[a-zA-Z]+\b/
without using \b, and with a hyphen counting as part of the word
rather than as a boundary, you have to say something like:
/[^A-Za-z-][A-Za-z]+[^A-Za-z-]|^[A-Za-z]+[^A-Za-z-]|[^A-Za-z-][A-Za-z]+$|^[a-zA-Z]+$/
(I.e., a word surrounded by non-word characters, or at the beginning
of a line, or at the end of a line, or the only word on a line.
Here "non-word" means neither alpha nor hyphen.)
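The same four-way alternation can be laid out with the /x modifier so
that each branch sits on its own line. This is only a sketch: the
helper name has_word() and the sample strings are invented for the
example.

```perl
use strict;
use warnings;

# The four alternatives, one per line; /x lets us add whitespace and comments.
my $word = qr/
      [^A-Za-z-] [A-Za-z]+ [^A-Za-z-]   # word inside a line
    | ^          [A-Za-z]+ [^A-Za-z-]   # word at the beginning of a line
    | [^A-Za-z-] [A-Za-z]+ $            # word at the end of a line
    | ^          [A-Za-z]+ $            # only word on the line
/x;

# Hypothetical helper: returns 1 if $str contains such a word, else 0.
sub has_word {
    my ($str) = @_;
    return $str =~ $word ? 1 : 0;
}

for my $s ('hello', '-hello', 'hello-', 'say hello') {
    printf "%-10s %s\n", $s, has_word($s) ? 'match' : 'no match';
}
```

With these samples, "hello" and "say hello" should match, while
"-hello" and "hello-" should not, because the hyphen is excluded
from the set of characters allowed next to the word.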
And this doesn't include words that do have hyphens embedded
in them. Build up the expression piece by piece, and put it
together with alternation as above.
The pieces you need are:
Words made up of only alpha characters.
- within a line
- at the beginning of a line
- at the end of a line
- at the beginning and end of a line (only word on the line)
Words with hyphens embedded
- within a line
- at the beginning of a line
- at the end of a line
- at the beginning and end of a line (only word on the line).
A word with embedded hyphens is at least three characters long,
with the first and last characters being alpha characters.
You may wish to write the definition so that hyphens are not
allowed to be consecutive. Something like:
The first and last alpha characters:
[A-Za-z]+[A-Za-z]+
with exactly one hyphen embedded:
[A-Za-z]+[-][A-Za-z]+
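with repeated embedded hyphens allowed (this loose version also
accepts consecutive hyphens, and even no hyphen at all, since the
middle class includes letters):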
[A-Za-z]+[A-Za-z-]+[A-Za-z]+
with repeated (but no consecutive) embedded hyphens:
[A-Za-z]+([-][A-Za-z]+)*[-][A-Za-z]+
with non-word chars (neither alpha nor hyphen) at either end:
[^A-Za-z-][A-Za-z]+([-][A-Za-z]+)*[-][A-Za-z]+[^A-Za-z-]
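You'll need versions of the above to account for those
types of word at the beginning or end of a line. That's
because a class like [^A-Za-z-] has to consume an actual
character, and there is none before the first word or after
the last one. \b is a zero-width assertion, so it also matches
at the very start or end of the string (that's why \b is such
a handy shortcut).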
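The hyphenated-word pattern above can be checked one token at a time
by building it from a named piece. A rough sketch follows; the
variable names and the helper is_hyphenated_word() are made up here,
and \A and \z are used just to anchor a single token for testing.

```perl
use strict;
use warnings;

my $alpha = qr/[A-Za-z]+/;    # a run of letters

# Letters, then any number of "-letters" groups, then one final "-letters".
# That means at least one hyphen, and never two hyphens in a row.
my $hyphenated = qr/$alpha(?:[-]$alpha)*[-]$alpha/;

# Hypothetical helper: 1 if the whole token is a hyphenated word, else 0.
sub is_hyphenated_word {
    my ($token) = @_;
    return $token =~ /\A$hyphenated\z/ ? 1 : 0;
}

for my $t ('shilly-shally', 'a-b-c', 'shilly-', '-shally', 'a--b', 'shillyshally') {
    printf "%-14s %s\n", $t, is_hyphenated_word($t) ? 'yes' : 'no';
}
```

Only "shilly-shally" and "a-b-c" should come back as yes. The others
have a leading, trailing, or doubled hyphen, or no hyphen at all.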
Tricky stuff. Keep trying. If you still have problems
send them to the list and we'll help out.
The general approach is: Make up mutually exclusive subsets
of the types of words you want to match, and make regular
expressions for them, and then string them together with
alternation.
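Putting that approach together might look something like the sketch
below. The names ($edge, $word, $in_line, find_word) are invented for
the example, and it assumes you want both plain and hyphenated words.

```perl
use strict;
use warnings;

my $alpha      = qr/[A-Za-z]+/;            # plain word: letters only
my $hyphenated = qr/$alpha(?:-$alpha)+/;   # letters joined by single hyphens
my $word       = qr/(?:$hyphenated|$alpha)/;
my $edge       = qr/[^A-Za-z-]/;           # neither alpha nor hyphen

# One alternative per position of the word within the line.
my $in_line = qr/
      $edge $word $edge    # within a line
    | ^     $word $edge    # at the beginning of a line
    | $edge $word $        # at the end of a line
    | ^     $word $        # only word on the line
/x;

# Hypothetical helper: 1 if the line contains an acceptable word, else 0.
sub find_word {
    my ($line) = @_;
    return $line =~ $in_line ? 1 : 0;
}

for my $s ('hello', 'shilly-shally', 'shilly-', '-shally') {
    printf "%-14s %s\n", $s, find_word($s) ? 'match' : 'no match';
}
```

"hello" and "shilly-shally" should match; "shilly-" and "-shally"
should not.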
--
bjb at achilles dot net
Welcome to the GNU age! http://www.gnu.org
5F82 9855 E247 1F8A 49CD 053E FB03 E77F 2A19 D707