NAME

       htmlstrip - Strip HTML markup code

SYNOPSIS

       htmlstrip [-o outputfile] [-O level] [-b blocksize] [-v] [inputfile]

DESCRIPTION

       HTMLstrip reads inputfile, or "stdin" if no file is given, and strips
       the HTML markup it contains. Use this program to shrink and compact
       your HTML files in a safe way.

   Recognized Content Types
       There are three disjoint types of content which HTMLstrip recognizes
       while parsing:

       HTML Tag (tag)
           This is just a single HTML tag, i.e. a string beginning with an
           opening angle bracket directly followed by an identifier,
           optionally followed by attributes and ending with a closing angle
           bracket.

       Preformatted (pre)
           This is any content enclosed in one of the following container
           tags:

             1. <nostrip>
             2. <pre>
             3. <xmp>

           The non-HTML-3.2-conforming "<nostrip>" tag is special here: it
           protects its body from HTMLstrip processing just as "<pre>" does,
           but the tag itself is removed from the output.

       Plain Text (txt)
           This is anything not falling into one of the two other categories,
           i.e. any content both outside of preformatted areas and outside of
           HTML tags.
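
       The three content types above can be illustrated with a small Python
       sketch. It is not HTMLstrip's actual parser: the names "SEGMENT_RE"
       and "classify" are hypothetical, and treating closing tags such as
       "</b>" as tags is an assumption made here for simplicity.

```python
import re

# Hypothetical tokenizer, not HTMLstrip's real parser.
# "pre" is tried first so that a whole <nostrip>/<pre>/<xmp> container is
# captured as one segment; anything else starting with "<" plus a letter
# (optionally preceded by "/") is treated as a tag.
SEGMENT_RE = re.compile(
    r"(?P<pre><(?P<name>nostrip|pre|xmp)\b[^>]*>.*?</(?P=name)\s*>)"
    r"|(?P<tag></?[A-Za-z][^>]*>)",
    re.DOTALL | re.IGNORECASE,
)


def classify(html):
    """Split html into (kind, text) pairs, kind being 'pre', 'tag' or 'txt'."""
    segments = []
    pos = 0
    for match in SEGMENT_RE.finditer(html):
        if match.start() > pos:
            # Everything between two matches is plain text.
            segments.append(("txt", html[pos:match.start()]))
        kind = "pre" if match.group("pre") is not None else "tag"
        segments.append((kind, match.group(0)))
        pos = match.end()
    if pos < len(html):
        segments.append(("txt", html[pos:]))
    return segments


print(classify("a <b>x</b>"))
# [('txt', 'a '), ('tag', '<b>'), ('txt', 'x'), ('tag', '</b>')]
```

       Each segment carries exactly one kind, which mirrors the statement
       that the three content types do not overlap.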

   Supported Stripping Levels
       The amount of stripping can be controlled by an optimization level,
       specified via option -O (see below). Higher levels also include all of
       the lower levels. The following stripping is done on each level:

       Level 0:
           No real stripping, just removing the sharp/comment-lines ("#...")
           [txt,tag].  Such lines are a standard feature of WML, so this is
           always done.

       Level 1:
           Minimal stripping: Same as level 0 plus stripping of blank and
           empty lines [txt].

       Level 2:
           Good stripping: Same as level 1 plus compression of multiple
           whitespace characters (more than one in sequence) to a single one
           [txt,tag] and stripping of trailing whitespace at the end of a line
           [txt,tag,pre].

           This level is the default because it provides good optimization
           while the HTML markup is not destroyed and remains human readable.

       Level 3:
           Best stripping: Same as level 2 plus stripping of leading
           whitespace on a line [txt]. This level can still be recommended
           when you want to make sure that the HTML markup is not destroyed in
           any case, but the resulting code is a little ugly because of the
           removed whitespace.

       Level 4:
           Expert stripping: Same as level 3 plus stripping of HTML comment
           lines ("<!-- ... -->") and crunching of HTML tag ends [tag].
           BE CAREFUL HERE: Comment lines are widely used to hide Java or
           JavaScript code from browsers which are not capable of ignoring
           such code. When using this optimization level, make sure all your
           JavaScript code is hidden correctly by adding HTMLstrip's
           "<nostrip>" tags around the comment delimiters.

       Level 5:
           Crazy stripping: Same as level 4 plus wrapping lines around to fit
           in an 80 column view window. This saves some newlines, but it
           leads to really unreadable markup code and can cause a lot of
           problems when this code is used to lay out the page in a browser.
           Use with care. This is only experimental!
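
       As a rough illustration of how levels 0 to 3 build on each other, the
       following Python sketch applies them to plain text (txt) only. It is
       an approximation under stated assumptions, not HTMLstrip's
       implementation: the function name "strip_text" is hypothetical, a
       sharp line is assumed to be any line starting with "#", and tags and
       preformatted areas are not handled at all.

```python
import re


def strip_text(text, level=2):
    """Approximate stripping levels 0-3 for plain text (hypothetical helper)."""
    lines = text.split("\n")
    # Level 0: remove sharp/comment lines (assumed: lines starting with "#").
    lines = [line for line in lines if not line.startswith("#")]
    if level >= 1:
        # Level 1: remove blank and empty lines.
        lines = [line for line in lines if line.strip() != ""]
    if level >= 2:
        # Level 2: collapse runs of spaces/tabs and drop trailing whitespace.
        lines = [re.sub(r"[ \t]+", " ", line).rstrip() for line in lines]
    if level >= 3:
        # Level 3: drop leading whitespace.
        lines = [line.lstrip() for line in lines]
    return "\n".join(lines)


print(repr(strip_text("# note\n\n  Hello    world  \n  bye", 3)))
# 'Hello world\nbye'
```

       Because each "if" only adds a step, a higher level always includes the
       work of the lower ones, as described above.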

       Additionally the following global strippings are done:

       "^\n":
           A leading newline is always stripped.

       "<suck>":
           The "<suck>" tag just absorbs itself and all whitespace around it.
           This is like the backslash for line continuation, but is done in
           Pass 8, i.e. really at the end. Use this inside HTML tag
           definitions to absorb whitespace, for instance around %body when
           used inside "<table>" structures, which at some points are
           newline-sensitive in Netscape Navigator.
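
       The effect of "<suck>" can be approximated with a single regular
       expression. This is a minimal sketch, not the code HTMLstrip runs in
       its final pass; the function name "absorb_suck" is hypothetical.

```python
import re


def absorb_suck(html):
    """Remove every <suck> tag together with the whitespace around it."""
    # \s also matches newlines, so line breaks next to the tag disappear too.
    return re.sub(r"\s*<suck>\s*", "", html, flags=re.IGNORECASE)


print(absorb_suck("<td>\n  <suck>\n  text\n<suck>  </td>"))
# <td>text</td>
```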

OPTIONS

       -o outputfile
           This redirects the output to outputfile. The output is sent to
           "stdout" if no such option is specified or if outputfile is "-".

       -O level
           This sets the optimization/stripping level, i.e. how much HTMLstrip
           should compress the contents.

       -b blocksize
           For efficiency reasons, input is divided into blocks of 16384
           characters by default. If you run into performance problems, you
           may try to change this value. Any value between 1024 and 32766 is
           allowed. With a value of 0, input is not divided into blocks.

       -v  This sets verbose mode where some processing information will be
           given on the console.
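
       The value rules for -b can be sketched in Python as follows. This is
       a naive illustration only: the function name "split_blocks" is
       hypothetical, and the sketch cuts at fixed offsets, whereas how
       HTMLstrip chooses its block boundaries is not documented here.

```python
def split_blocks(data, blocksize=16384):
    """Split data into fixed-size blocks following the -b value rules."""
    if blocksize == 0:
        # A value of 0 means the input is not divided at all.
        return [data]
    if not 1024 <= blocksize <= 32766:
        raise ValueError("blocksize must be 0 or between 1024 and 32766")
    # Naive fixed-offset slicing (assumption; real boundaries may differ).
    return [data[i:i + blocksize] for i in range(0, len(data), blocksize)]
```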

AUTHORS

        Ralf S. Engelschall
        rse@engelschall.com
        www.engelschall.com

        Denis Barbier
        barbier@engelschall.com