[oclug]Really basic Perl help
Ahmed Masud
masud at googgun.com
Thu Jan 9 16:53:12 EST 2003
Milan Budimirovic wrote:
>On Mon, 2003-01-06 at 16:24, Ahmed Masud wrote:
>
>
>>B McKee wrote:
>>
>>
>>
>>>In truth I want to be able to read in an HTML file and output a subsection
>>>that is framed with a specific <!-- Comment -->
>>>I gather there may be a module that would help, but I thought my needs could
>>>be handled by a simple regular expression search. Since this will end up on
>>>someone else's server I didn't want to have to install any extra software to
>>>make it work.
>>>
>>>
>>>
>>>
>>>
>><sigh> why make things Sooooooooooo complicated!
>>
>>#!/usr/bin/perl
>>$flag = 0;
>>while (<>) {
>> /<!-- comment -->/ && do { $flag = !$flag; next; } ;
>> print if ( $flag );
>>}
>>
>>A
>>
>>
>>
>
>Again, it's simple if the comment is always going to appear on a single
>line. But what if it's
>
><!-- this could be a very long comment spread out
>over two or more lines -->
>
>
>
perl sucks :) use flex instead
--- cut here save as htmlsplitter.l ---
%{
#include <stdio.h>
#include <stdlib.h>   /* malloc, realloc, free, exit */
#include <string.h>   /* strcpy, strcmp */
#define STEP 256
int print_flag = 0, count = 0, buflen = 0, commentmark = 0;
char *comment_buffer = NULL;
/* const char NEEDLE[] = "Silly and long comment\nwith tons of new lines\n"
       "in it, but we really\nhave to find things like\n"
       "this in the HTML file"; */
const char NEEDLE[] = "foo\nbar";
%}
htcb <!--[ \t]*
htce [ \t]*-->
ws [ \t]+
%x C
%option noyywrap
%%
{htcb} {
    BEGIN(C);
    free(comment_buffer);
    /* the opener may carry more than STEP bytes of padding, so size from yyleng */
    commentmark = count = yyleng;
    buflen = count + STEP;
    comment_buffer = (char *) malloc(buflen);
    if ( !comment_buffer ) {
        perror("malloc");
        exit(1);
    }
    strcpy(comment_buffer, yytext);
}
\n|. { if ( print_flag ) fputc(yytext[0], yyout); }
<C>{htce} {
    comment_buffer[count] = 0;
    if ( strcmp(comment_buffer + commentmark, NEEDLE) == 0 ) {
        print_flag = !print_flag;
    }
    else if ( print_flag ) {
        fprintf(yyout, "%s%s", comment_buffer, yytext);
    }
    BEGIN(INITIAL);
}
<C>.|\n {
    /* keep one byte spare for the NUL written when the comment closes */
    if ( count + 1 >= buflen ) {
        char *p;
        p = realloc(comment_buffer, buflen += STEP);
        if ( !p ) {
            fprintf(stderr, "You really should rethink these comments, out of memory\n");
            exit(1);
        }
        comment_buffer = p;
    }
    comment_buffer[count++] = yytext[0];
}
%%
int main(int argc, char *argv[]) {
    int k;
    if ( argc >= 2 ) {
        for ( k = 1; k < argc; k++ ) {
            yyin = fopen(argv[k], "r");
            if ( !yyin ) {
                perror(argv[k]);
                continue;   /* skip this file rather than scan a NULL stream */
            }
            yylex();
            fclose(yyin);
        }
    }
    else {
        yylex();
    }
    return 0;
}
--- end of htmlsplitter.l ---
After saving as htmlsplitter.l, simply run:
flex -o htmlsplitter.c htmlsplitter.l
gcc -o htmlsplitter htmlsplitter.c
Change the NEEDLE variable to whatever comment you are looking for, sans
leading and trailing spaces and tabs,
so <!-- foo bar --> is the same as <!--foo bar-->.
However, the whitespace within the comment counts, so
<!--foo  bar--> (two spaces) is not the same as <!--foo bar-->. I leave
allowing for space collapse as an exercise.
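One possible starting point for that exercise, sketched in C. collapse_ws
is a hypothetical helper and not part of htmlsplitter.l. It assumes that
newlines should count as whitespace too, so NEEDLE would then have to be
written in collapsed form ("foo bar" rather than "foo\nbar").

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not in htmlsplitter.l): rewrite s in place so that
 * every run of whitespace becomes a single space and leading/trailing
 * whitespace is dropped. Returns the new length of s. */
size_t collapse_ws(char *s)
{
    size_t in, out = 0;
    int pending = 0;   /* whitespace seen since the last non-space char */

    assert(s != NULL);
    for (in = 0; s[in] != '\0'; in++) {
        if (isspace((unsigned char)s[in])) {
            pending = 1;
        } else {
            /* emit one separator, but never at the very start */
            if (pending && out > 0)
                s[out++] = ' ';
            pending = 0;
            s[out++] = s[in];
        }
    }
    s[out] = '\0';
    return out;
}
```

In the scanner it could be called on comment_buffer+commentmark just
before the strcmp. Since it rewrites the buffer, a non-matching comment
would then be echoed in collapsed form unless you collapse a copy instead.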
The program above was tested with the following HTML file:
--- foo.html begins --
BEFORE THE WORLD
<!-- foo
bar -->HELLO WORLD<!-- foo
bar -->AFTER THE WORLD<!-- foo
bar --> this should show up on the same line as hello world <!-- and
this comment should as well! -->
<!-- foo bar should also be making an appearance on its own line -->
<!--foo
bar-->however this line
and any line following it should not be anywhere in the output
another line
--- foo.html ends --
While Perl is excellent for record extraction, flex is a far more powerful
tool for lexical analysis because it is designed to make lexical state
machine definitions and their corresponding actions easier to write.
Consider learning it for that purpose. man flex is a good start. Lex
and Yacc from O'Reilly is an excellent source.
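To give a feel for what flex saves you, here is a rough hand-written
sketch in plain C of the same two-state idea, working on an in-memory
string rather than a stream. filter_marked and its TEXT/COMMENT states
are hypothetical names made up for this illustration (TEXT plays the role
of INITIAL, COMMENT the role of the exclusive state C). It is a
simplification, not a drop-in replacement for the scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum state { TEXT, COMMENT };

/* Hypothetical helper: copy into out the parts of in that lie between
 * comments whose body (trimmed of spaces and tabs) equals needle. Other
 * comments are copied verbatim while printing is on. out must hold at
 * least strlen(in) + 1 bytes. An unterminated comment is dropped. */
size_t filter_marked(const char *in, const char *needle, char *out)
{
    enum state st = TEXT;
    int printing = 0;
    size_t n = 0;                /* bytes written to out so far */
    const char *start = NULL;    /* where the current comment began */
    const char *p = in;

    assert(in != NULL && needle != NULL && out != NULL);
    while (*p != '\0') {
        if (st == TEXT) {
            if (strncmp(p, "<!--", 4) == 0) {
                start = p;
                p += 4;
                st = COMMENT;
            } else {
                if (printing)
                    out[n++] = *p;
                p++;
            }
        } else if (strncmp(p, "-->", 3) == 0) {
            const char *b = start + 4, *e = p;
            /* trim spaces and tabs at both ends of the comment body */
            while (b < e && (*b == ' ' || *b == '\t')) b++;
            while (e > b && (e[-1] == ' ' || e[-1] == '\t')) e--;
            p += 3;
            if ((size_t)(e - b) == strlen(needle) &&
                memcmp(b, needle, (size_t)(e - b)) == 0) {
                printing = !printing;          /* marker comment: toggle */
            } else if (printing) {
                memcpy(out + n, start, (size_t)(p - start));
                n += (size_t)(p - start);      /* ordinary comment: echo */
            }
            st = TEXT;
        } else {
            p++;                               /* still inside the comment */
        }
    }
    out[n] = '\0';
    return n;
}
```

Every state, transition and buffer index has to be tracked by hand here,
which is the bookkeeping the flex rules express in a few pattern lines.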
Cheers,
Ahmed.