Yellow Rabbit

Parse the Complex Element

Web-engine in Lisp: Parse the Most Complex Element

The most complicated element. Much more complicated than comments. We have opening and closing tags, attributes and their meanings, and everything else, the elements can still be nested.

It even has a complex constructor :smile:


(defun make-element-node (tag attrs children)
  "Tag and list of name-value pairs"
  (let ((el (make-instance 'element-node :tag tag :children children)))
       (element-set-all-attrs el attrs)
       el))

For this to work, you need to add a function to the dom.lisp:


;; Set all attrs at once
(defun element-set-all-attrs (el vals)
  "Set all attrs at once. Vals is name-value pair list."
  (let ((attrs (element-node-attrs el)))
    (mapc #'(lambda (nv)
	      (setf (gethash (car nv) attrs) (cdr nv))) vals)))

We also need a lot of small parsers, starting with the parser of the tag name.


      (parse-tagname ()
          "Parse tag name"
          (let (ch)
            (with-output-to-string (s)
              (matchit $[@(tag-text ch) !(write-char ch s)]))))

Useful thing — skipping spaces, tabs, etc. It’s always successful.


       (skip-whitespace ()
          (let (ch)
            (matchit $[@(whitespace ch)])
            t))

Attribute values can be enclosed in either double or single quotes, so the parser must remember from what type of quotation the value began and work up to exactly the same closing quotation mark.


       (parse-value (quo)
          (let (ch)
            (with-output-to-string (s)
              (matchit $[@(any-text ch) !(char/= ch quo) !(write-char ch s)]))))


       (parse-quoted-value ()
          "Parse quoted value"
          (let (quo (oldindex index))
            (or (matchit
                  [ {[#\' !(setf quo #\')] [#\" !(setf quo #\")]}
                    !(parse-value quo)])
                (progn (setf index oldindex) nil))))

With these helpers, it’s easy to make the parser of the same attribute: first select the name (not empty), then the = and the value of the attribute in any quotes. The output will be a pair (name, value).


       (parse-attr ()
          "name=value"
          (let (name value (oldindex index))
            (if
              (matchit
                [ !(setf name (parse-tagname))
                  !(not (zerop (length name)))
                  #\= !(setf value (parse-quoted-value))])
              (cons name value)
              (progn (setf index oldindex) nil))))

Recognized attributes are pushed into the list. The order will be reversed.


       (parse-attrs ()
          "Parse attributes"
          (let (attrs)
            (matchit
              $[ !(skip-whitespace)
                 !(setf attr (parse-attr))
                 !(push attr attrs)])
            attrs))

The closing tag also requires attention.


       (parse-closing-tag (tagname)
          "Parse </tagname>"
          (let ((oldindex index))
            (or
              (matchit
                [#\< #\/ !(string= (parse-tagname) tagname) #\>])
              (progn (setf index oldindex) nil))))

Element parser sketch:


       (parse-element ()
          "<tagname $[attr=value]> ??? </tagname>"
          (let ((oldindex index)
                ch tagname attrs)
            (or (matchit
                  [!(setf tagname (parse-tagname))
                   {!(setf attrs (parse-attrs)) !T}
                   #\>
                   !(parse-closing-tag tagname)
                   !(make-element-node tagname attrs nil)])
                (progn (setf index oldindex) nil))))

The problem with this parser is that it does not provide anything between opening and closing tags. That is, it is time to think about recursion and how to organize descendants. In the meantime, you can check how the element parser works.


* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
    toy-engine
; Loading "toy-engine"

(TOY-ENGINE)
* (in-package :toy-engine)

#<PACKAGE "TOY-ENGINE">
* (defparameter *str* "table border=\"1\"   width=\"50%\"   ></table>")

*STR*
* (parse-html *str*)

#<ELEMENT-NODE {100504E943}>
* (gethash "width" (element-node-attrs *))

"50%"
T
* (pp->dot "element-node.dot" (lambda () (pp-dom **)))

"}"
* 

The tree after parsing the element

Recursion

Since we create a node only after we create all of its descendants, then we will logically accumulate them in the list and then simply transfer it to the node’s constructor. The most interesting function will be one that parses several sites one by one into the list of future descendants.

But first the parser of a single node:


       (parse-node ()
          "Text or node"
          (let ((oldindex index))
            (or (matchit
                  { !(parse-comment-or-element)
                    !(parse-text)})
                (progn (setf index oldindex) nil))))

It uses a small helper parser, which is needed to roll back if there is not a comment or an element. The fact is that elementary matches like #\< or "example" used inside the sequence [] can not be rolled back if they worked.


       (parse-comment-or-element ()
	  "< {comment element}"
	  (let ((oldindex index))
	    (or (matchit
		  [#\< {!(parse-comment) !(parse-element)}])
		(progn (setf index oldindex) nil))))

Let’s check that this function recognizes and creates different types of nodes:


* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
    toy-engine
; Loading "toy-engine"

(TOY-ENGINE)
* (in-package :toy-engine)

#<PACKAGE "TOY-ENGINE">
* (defparameter *str-text* "table border=\"1\"   width=\"50%\"   ></table>")

*STR-TEXT*
* (defparameter *text* (parse-html *str-text*))

*TEXT*
* (defparameter *str-comment* "<!-- coomn -->")

*STR-COMMENT*
* (defparameter *comment* (parse-html *str-comment*))

*COMMENT*
* (defparameter *str-element* "<table border=\"1\"   width=\"50%\"   ></table>>")

*STR-ELEMENT*
* (defparameter *element* (parse-html *str-element*))

*ELEMENT*
* *element*

#<ELEMENT-NODE {100516F043}>
* *text*

#<TEXT-NODE {100503DD53}>
* *comment*

#<COMMENT-NODE {10050C6A93}>
*

We collect several nodes in the list:


       (parse-nodes ()
          "Children"
          (let ((oldindex index) children node)
            (if (matchit
                   $[!(setf node (parse-node)) !(push node children)])
	        (reverse children)
                (progn (setf index oldindex) nil))))

We check how well this function recognizes a string with different parts of HTML:


* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
    toy-engine
; Loading "toy-engine"

(TOY-ENGINE)
* (in-package :toy-engine)

#<PACKAGE "TOY-ENGINE">
* (defparameter *str* "Text  rr<!-- coomn --> <p></p>")

*STR*
* (defparameter *ch* (parse-html *str*))

*CH*
* *ch*

(#<TEXT-NODE {1005AD9AA3}> #<COMMENT-NODE {1005AD9BE3}>
 #<TEXT-NODE {1005AD9EE3}> #<ELEMENT-NODE {1005ADA5E3}>)
*

If you now change the body of the parse-html function to


      (make-instance 'node :children (parse-nodes))

thereby making all recognized nodes on the upper level descendants of some node of type node.


* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
    toy-engine
; Loading "toy-engine"

(TOY-ENGINE)
* (in-package :toy-engine)

#<PACKAGE "TOY-ENGINE">
* (defparameter *str* "Text  rr<!-- coomn --> <p></p>")

*STR*
* (defparameter *ch* (parse-html *str*))

*CH*
* (pp->dot "children.dot" (lambda () (pp-dom *ch*)))

"}"
*

A tree after parsing several elements

Finally, the element parser contains the start of recursion:


       (parse-element ()
          "<tagname $[attr=value]> ??? </tagname>"
          (let ((oldindex index)
                ch tagname attrs children)
            (or (matchit
                  [!(setf tagname (parse-tagname))
                   {!(setf attrs (parse-attrs)) !t}
                   #\>
		   {!(setf children (parse-nodes)) !t}
                   !(parse-closing-tag tagname)
                   !(make-element-node tagname attrs children)])
                (progn (setf index oldindex) nil))))

A case study of a complex example:


* (ql:quickload 'toy-engine)
To load "toy-engine":
  Load 1 ASDF system:
    toy-engine
; Loading "toy-engine"
.
(TOY-ENGINE)
* (in-package :toy-engine)

#<PACKAGE "TOY-ENGINE">
* (defparameter *str* "Text  rr<!-- coomn --> <p>another text<table color=\"red\"><tr>Row</tr></table></p>")

*STR*
* (defparameter *ch* (parse-html *str*))

*CH*
* (pp->dot "parse-html.dot" (lambda () (pp-dom *ch*)))

"}"
*

Tree after parsing tiny HTML