Thursday, August 23, 2012


Some information about HTML Parser:

html:

HTML stands for HyperText Markup Language. It is mainly used for creating web pages and blogs. It is written in the form of HTML elements, i.e., HTML tags, which are enclosed within angle brackets.
Start tag = < >
End tag = </ >
You can put any of the following tag names within those angle brackets and write an HTML program. Here are some HTML tags.

1. html         11. b
2. head         12. address
3. body         13. strong
4. div          14. col
5. span         15. colgroup
6. p            16. td
7. br           17. ins
8. title        18. del
9. doctype      19. var
10. style       20. input
Here is a simple HTML program:

<html>
<h1> This is my first HTML program </h1>
<body>
<p> Hello world! </p>
</body>
</html>

As you may have guessed, this is the famous hello world program, which is the usual introductory program in any computing language. Copy the code as it is and paste it into a text editor (on Windows use Notepad; on Linux use gedit, Kate or Emacs). Save the file as program.html on the desktop. Close the text editor and open program.html from the desktop. Bingo!! Your first HTML program opens in your default browser and you will see the result. HTML is not compiled; the browser reads the markup and renders it directly.

<html> is the root element and tells the browser that the document is HTML. <body> holds the content that is shown on the page. <h1> is the top-level heading and appears in a large, bold font. <p> marks a paragraph, i.e., whatever text you want goes in as <p> your text </p>. Most elements need an end tag, though a few, such as <br>, have none. An advantage of HTML is that it is error tolerant: even if you make some mistakes in the markup, the browser will usually still display the page sensibly. For instance, the <h1> above really belongs inside <body>, yet browsers show it anyway. But make your errors carefully :D :D You can refer to www.w3schools.com for more information.

Now let us come back to the HTML parser. I got this as a topic for my holiday project. The code was not very complex. I wrote it in lex. You can see the code later in this article. So what is an HTML parser? HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Even though that is a Java library, I used lex instead and got good results. There are many kinds of HTML parser, e.g., ones that target PHP, Java, or plain text. Mine is of the last kind: it converts HTML into plain text, i.e., it removes the tags and keeps only the text part. The code is as follows.

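%{
#include <stdio.h>
%}

%%
"<"([a-zA-Z0-9]|[\_\|\ \t\*\-\+\?\=\`\~\!\@\&\#\$\%\^\&\*\(\)\{\}\[\]\;\:\'\"\,\.\/\\])+">" { ; /* discard the tag */ }
%%

/* Tell lex there is no further input after the first file. */
int yywrap(void) { return 1; }

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s input.html output.txt\n", argv[0]);
        return 1;
    }
    yyin = fopen(argv[1], "r");    /* HTML file to read */
    yyout = fopen(argv[2], "w");   /* text file to write */
    if (yyin == NULL || yyout == NULL) {
        perror("fopen");
        return 1;
    }
    yylex();                       /* unmatched text is echoed to yyout */
    fclose(yyin);
    fclose(yyout);
    return 0;
}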

The regular expression in the rule tells the lexical analyzer what to look for: a "<", then one or more letters, digits, or the listed special characters, then a ">". In other words, it matches an HTML tag. The empty action {;} means the matched tag is thrown away. Any text that no rule matches is copied unchanged to yyout, which is the default behaviour of lex. fopen(argv[1],"r") opens the HTML file named as the first argument for input, and fopen(argv[2],"w") opens the output text file. As this is a lexical program, it scans the whole input character by character and removes the tags it finds. Install the lex tools on Windows or Linux (on Linux, flex is the usual implementation) and copy this code into any text editor. Save the file as program.l. Open a command prompt or terminal and change to the directory where this file was saved. Now type this command.
lex program.l
Now you'll get lex.yy.c
Now run this command.
gcc -o program lex.yy.c
Now you'll get an executable file by the name program (if the linker complains that yywrap is missing, add -ll or -lfl to the gcc command).
Now type ./program inputhtmlfile outputtextfile
Then the tags in the input HTML file will be removed and you will get a text file in the current directory. That text file will contain only the text part.
Ex: Consider the HTML program which I gave earlier.


<html>
<h1> This is my first HTML program </h1>
<body>
<p> Hello world! </p>
</body>
</html>

Give that HTML program as input to the lex program in the way I described above. You'll get a text file with the following content.



 This is my first HTML program

 Hello world!

I found this lex HTML parser to be effective.



