[oclug]Really basic Perl help

Ahmed Masud masud at googgun.com
Thu Jan 9 16:53:12 EST 2003


Milan Budimirovic wrote:

>On Mon, 2003-01-06 at 16:24, Ahmed Masud wrote:
>  
>
>>B McKee wrote:
>>
>>    
>>
>>>In truth I want to be able to read in an HTML file and output a subsection 
>>>that is framed with a specific <!-- Comment -->  
>>>I gather there may be a module that would help, but I thought my needs could 
>>>be handled by a simple regular expression search.  Since this will end up on 
>>>someone else's server I didn't want to have to install any extra software to 
>>>make it work.
>>>
>>> 
>>>
>>>      
>>>
>><sigh> why make things Sooooooooooo complicated!
>>
>>#!/usr/bin/perl
>>$flag = 0;
>>while (<>) {
>>    /<!-- comment -->/ && do { $flag = !$flag; next; } ;
>>    print if ( $flag );
>>}
>>
>>A
>>
>>    
>>
>
>Again, it's simple if the comment is always going to appear on a single
>line. But what if it's
>
><!-- this could be a very long comment spread out
>over two or more lines -->
>
>  
>
perl sucks :) use flex instead

--- cut here save as htmlsplitter.l ---
%{
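#include <stdio.h>
#include <stdlib.h>   /* malloc, realloc, free, exit */
#include <string.h>   /* strcpy, strlen, strcmp */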

#define STEP 256

int print_flag = 0,count =0, buflen=0, commentmark = 0;
char *comment_buffer = NULL;
/* const char NEEDLE[] = "Silly and long comment\nwith tons of new lines\nin it, but we really\nhave to find things like\nthis in the HTML file"; */
const char NEEDLE[] = "foo\nbar";
%}

htcb <!--[ \t]*
htce [ \t]*-->
ws [ \t]+

%x C
%option noyywrap
%%

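{htcb} {
        /* Opening "<!--": start buffering the comment text. */
        BEGIN(C);
        if (comment_buffer)
                free(comment_buffer);
        /* Size the buffer from the match so a long run of blanks after
           "<!--" cannot overflow it. */
        commentmark = count = yyleng;
        comment_buffer = (char *) malloc(buflen = count + STEP);
        if ( !comment_buffer ) {
                perror("malloc");
                exit(1);
        }
        strcpy(comment_buffer, yytext );
    }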
\n|.  { if ( print_flag ) fputc(yytext[0],yyout); }

<C>{htce} {
                comment_buffer[count] = 0;
                if (strcmp(comment_buffer+commentmark, NEEDLE)==0) {
                        print_flag=!print_flag;
                }
                else {
                        if ( print_flag ) {
                                fprintf(yyout, "%s%s", comment_buffer, yytext);
                        }
                }
                BEGIN(INITIAL);
        }
<C>.|\n {
        /* Keep one byte spare for the NUL written when "-->" is seen. */
        if ( count + 1 >= buflen ) {
                char *p;
                p = realloc(comment_buffer, buflen+=STEP);
                if ( !p ) {
                        fprintf(stderr, "You really should rethink these comments, out of memory\n" );
                        exit(1);
                }
                comment_buffer = p;
        }
        comment_buffer[count++] = yytext[0];
}

%%
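int main(int argc, char *argv[]) {
    int k;
    if ( argc >= 2 ) {
        for ( k = 1; k < argc; k++ ) {
            yyin = fopen ( argv[k], "r");
            if ( !yyin ) {
                perror(argv[k]);
                continue;   /* skip it; a NULL yyin would make flex read stdin */
            }
            yylex();
            fclose(yyin);
        }
    }
    else {
        yylex();    /* no file arguments: filter stdin */
    }
    return 0;
}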
--- end of htmlsplitter.l ---

After saving it as htmlsplitter.l, simply run:

flex -o htmlsplitter.c htmlsplitter.l
gcc -o htmlsplitter htmlsplitter.c

Change the NEEDLE variable to whatever comment you are looking for, sans
leading and trailing whitespace, so <!--        foo bar            --> is
the same as <!--foo bar-->. (Only spaces and tabs are trimmed; a newline
next to the delimiters counts as part of the comment.) However, the
spaces within the comment count, so <!--foo           bar--> is not the
same as <!--foo bar-->.  I leave allowing for space collapse as an
exercise.
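
One possible take on the space-collapse exercise, as a rough sketch only.
collapse_ws() and same_comment() are hypothetical helper names, not part of
htmlsplitter.l. This version assumes you want every run of whitespace,
newlines included, treated as a single space, which means a NEEDLE of
"foo\nbar" would also match <!--foo bar-->.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of htmlsplitter.l: returns a malloc'd copy
 * of s with leading/trailing whitespace dropped and every inner run of
 * whitespace (spaces, tabs, newlines) collapsed to a single space.
 * Returns NULL if out of memory; the caller frees the result. */
char *collapse_ws(const char *s)
{
    char *out = malloc(strlen(s) + 1);
    size_t n = 0;
    int pending = 0;              /* saw whitespace since the last word? */

    if (!out)
        return NULL;
    for (; *s; s++) {
        if (isspace((unsigned char)*s)) {
            pending = 1;
        } else {
            if (pending && n > 0)
                out[n++] = ' ';   /* one space stands in for the whole run */
            pending = 0;
            out[n++] = *s;
        }
    }
    out[n] = '\0';
    return out;
}

/* Nonzero if a and b are equal once whitespace is collapsed. */
int same_comment(const char *a, const char *b)
{
    char *ca = collapse_ws(a), *cb = collapse_ws(b);
    int eq = ca && cb && strcmp(ca, cb) == 0;

    free(ca);
    free(cb);
    return eq;
}
```

With these two functions pasted into the %{ ... %} section, the strcmp()
test in the <C>{htce} rule could become
same_comment(comment_buffer+commentmark, NEEDLE).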


The program above was tested with the following HTML file:

--- foo.html begins --
BEFORE THE WORLD
<!-- foo
bar -->HELLO WORLD<!-- foo
bar -->AFTER THE WORLD<!-- foo
bar --> this should show up on the same line as hello world <!-- and 
this comment should as well! -->
<!-- foo bar should also be making an appearance on its own line -->
<!--foo
bar-->however this line
and any line following it should not be anywhere in the output
another line
--- foo.html ends --

While Perl is excellent for record extraction, flex is a far more
powerful tool for lexical analysis because it is designed to make
lexical state machine definitions and the corresponding actions easier.
Consider learning it for that purpose. man flex is a good start, and
lex & yacc from O'Reilly is an excellent reference.
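
To make "lexical state machine" concrete: the %x C start condition in
htmlsplitter.l amounts to a two-state loop. Below is a rough hand-written
sketch of the same filter in plain C. split_html() and the state names are
hypothetical, it works on an in-memory string rather than a file, and the
caller is assumed to supply an output buffer of at least strlen(in) + 1
bytes.

```c
#include <string.h>

enum state { TEXT, COMMENT };     /* play the roles of INITIAL and <C> */

/* Hypothetical hand-rolled equivalent of the scanner. Copies the selected
 * text into out, toggling output at every comment whose body (minus blanks
 * next to the delimiters) equals needle. */
void split_html(const char *in, const char *needle, char *out)
{
    enum state st = TEXT;
    int print_flag = 0;
    const char *start = NULL;     /* where the current "<!--" began */
    size_t n = 0;

    while (*in) {
        if (st == TEXT && strncmp(in, "<!--", 4) == 0) {
            start = in;
            in += 4;
            st = COMMENT;
        } else if (st == COMMENT && strncmp(in, "-->", 3) == 0) {
            const char *b = start + 4, *e = in;

            in += 3;
            /* trim spaces and tabs, as {htcb} and {htce} do */
            while (b < e && (*b == ' ' || *b == '\t'))
                b++;
            while (e > b && (e[-1] == ' ' || e[-1] == '\t'))
                e--;
            if ((size_t)(e - b) == strlen(needle)
                && memcmp(b, needle, (size_t)(e - b)) == 0) {
                print_flag = !print_flag;       /* marker comment: toggle */
            } else if (print_flag) {
                /* ordinary comment inside the selected region: keep it */
                memcpy(out + n, start, (size_t)(in - start));
                n += (size_t)(in - start);
            }
            st = TEXT;
        } else {
            if (st == TEXT && print_flag)
                out[n++] = *in;
            in++;
        }
    }
    out[n] = '\0';
}
```

The flex version says the same thing in four rules and handles the input
buffering for you, which is the point of using it.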


Cheers,

Ahmed.



