About Anything

The personal blog of Al Stevens. Focus is overrated.

Emacs elisp regex preprocessor hack

I write a lot of elisp (that’s Emacs Lisp). I’m experimenting with Clojure and may yet find a role for it, but for now Emacs and elisp will remain our workhorses.

My brain has no trouble with lots of embedded parentheses, but escaped regular expressions drive me crazy. I just can’t read them. And because an Emacs regex is written as a Lisp string, every backslash has to be doubled. Trying to debug a simple doubly-escaped expression like this:

"\\([Aa][n]?[ ]*\\)?\\(\\(\\([Cc]ountry\\)\\|\\([Rr]epublic\\)\\|\\([Ii]ndependent [Ss]tate\\)\\|\\([Kk]ingdom\\)\\|\\([Mm]onarchy\\)\\|\\([Cc]onstitutional [Mm]onarchy\\)\\)\\|\\(\\(\\([Aa]utonomous\\)\\|\\([Ff]ederal\\)\\|\\([Ii]sland\\)\\|\\([Ii]slamic\\)\\|\\([Ii]ndependent\\)\\)[ ]*\\(\\([Cc]ountry\\)\\|\\([Rr]epublic\\)\\|\\([Ss]tate\\)\\)\\)\\)"

makes my head hurt — and it’s actually a single line. This one is used when we are trying to find all references to a country in one of our geographic dictionaries.
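To make the double escaping concrete, here is a minimal sketch with a much smaller pattern. The names `my-demo-regex` and `my-demo-first-animal` are hypothetical and exist only for this illustration.

```elisp
;; The regex Emacs should see is:   \(cat\|dog\)s?
;; Inside a Lisp string each of those backslashes must itself be escaped.
(defvar my-demo-regex "\\(cat\\|dog\\)s?"
  "Hypothetical example regex; the string itself holds single backslashes.")

(defun my-demo-first-animal (text)
  "Return the first animal matched in TEXT by `my-demo-regex', or nil."
  (when (string-match my-demo-regex text)
    (match-string 1 text)))

(length my-demo-regex)             ; 14, although 17 characters were typed
(my-demo-first-animal "two dogs")  ; "dog"
```

The reader strips one level of backslashes when it parses the string, and the regex engine consumes the second level.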

My hack is not perfect, but it allows me to write the above in a more readable form:

       "([Aa][n]?[ ]*)?
        (
          ( ([Cc]ountry) 
            | ([Rr]epublic) 
            | ([Ii]ndependent [Ss]tate) 
            | ([Kk]ingdom) 
            | ([Mm]onarchy) 
            | ([Cc]onstitutional [Mm]onarchy) )
          |
          ( 
            ( ([Aa]utonomous) 
              | ([Ff]ederal) 
              | ([Ii]sland) 
              | ([Ii]slamic) 
              | ([Ii]ndependent) )
            [ ]*
            ( ([Cc]ountry) 
              | ([Rr]epublic) 
              | ([Ss]tate) )
          )
        )"

The Lisp code I use to do the preprocessing is simple. I keep it in a “common” file that I load automatically whenever I start Emacs.

(defun xr-escape-regex (regex-string &optional nparam)
  "Escape (, ), {, } and | in REGEX-STRING, leaving [...] alternatives alone.
NPARAM is the index of the next placeholder; it is used only by the
recursive calls and should be omitted by callers."
  ;; Character alternatives like [a-zA-Z] are saved temporarily.
  (let ((ca-regex "\\[[^]]+\\]"))
    (cond
     ;; No character alternatives left: escape the remaining string.
     ((not (string-match ca-regex regex-string))
      (xr-escape-regex-basic regex-string))
     ;; Otherwise replace the first [ ] expression with an indexed placeholder.
     (t
      (let* ((n (or nparam 0))
             (ca-place-marker (format "---------- %d ----------" n))
             (saved-ca (match-string 0 regex-string))
             ;; Swap in the temporary placeholder.
             (regex-string-ca-removed
              (replace-match ca-place-marker t t regex-string)))
        ;; Recurse on the modified string with n incremented,
        ;; putting the alternatives back as we come back up the stack.
        (xr-replace-string
         (xr-escape-regex regex-string-ca-removed (1+ n))
         ca-place-marker saved-ca))))))
          
(defun xr-escape-regex-basic (regex-string)
  "Helper function that escapes '(', ')', '{', '}' and '|'.
Will happily escape these chars even if inside of [], so
should not be called directly.
Strips any white space around the (, ), {, } and | characters.
Uses xr-replace-string, because
Emacs replace-regexp-in-string chokes if rep contains \\."
  (xr-replace-string 
   (xr-replace-string
    (xr-replace-string
     (xr-replace-string
      (xr-replace-string
       regex-string
       "[ \t\n]*([ \t\n]*" "\\(")
      "[ \t\n]*)[ \t\n]*" "\\)")
     "[ \t\n]*{[ \t\n]*" "\\{")
    "[ \t\n]*}[ \t\n]*" "\\}")
   "[ \t\n]*|[ \t\n]*" "\\|"))

(defun xr-replace-string (string regexp replacement)
  "Replace all occurrences in STRING matched by
REGEXP with REPLACEMENT, inserted literally."
  (save-match-data
    (mapconcat #'identity
               (split-string string regexp)
               replacement)))
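The split-and-join helper sidesteps the backslash problem mentioned in the docstring. As far as I know, `replace-regexp-in-string` also accepts optional FIXEDCASE and LITERAL arguments, and a non-nil LITERAL inserts the replacement verbatim. The sketch below is one possible alternative, not the code this post uses; `my-replace-literal` is a hypothetical name.

```elisp
(defun my-replace-literal (string regexp replacement)
  "Replace every match of REGEXP in STRING with REPLACEMENT, verbatim.
Hypothetical alternative to `xr-replace-string'; same argument order."
  ;; FIXEDCASE = t keeps the replacement's case exactly as written.
  ;; LITERAL = t stops backslashes in REPLACEMENT from being interpreted.
  (replace-regexp-in-string regexp replacement string t t))

;; White space around the paren is swallowed and the paren is escaped.
(my-replace-literal "a ( b" "[ \t\n]*([ \t\n]*" "\\(")  ; prints as "a\\(b"
```

I have only reasoned through this for patterns that always match at least one character, which is the case for the five patterns used here.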

To use the preprocessor, I usually create a variable to hold the escaped regex string and then use that variable wherever the regex is needed. If it’s a global, it will look like this:

(defvar xr-ont-regex-capital-to-trim nil)
(setq xr-ont-regex-capital-to-trim
      (xr-escape-regex
       "and[ ]+((chief)|(only)|(largest))[ ]+(large[ ]+)?((city)|(town))[ ]+"))

After the setq, the global will hold the regex string:

"and[ ]+\\(\\(chief\\)\\|\\(only\\)\\|\\(largest\\)\\)[ ]+\\(large[ ]+\\)?\\(\\(city\\)\\|\\(town\\)\\)[ ]+"
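As a quick check that the escaped string behaves like an ordinary Emacs regex, here is a minimal sketch that feeds it to `string-match`. The string is the value shown above; the names `my-capital-regex` and `my-find-capital-phrase` and the sample sentences are made up for this illustration.

```elisp
(defvar my-capital-regex
  "and[ ]+\\(\\(chief\\)\\|\\(only\\)\\|\\(largest\\)\\)[ ]+\\(large[ ]+\\)?\\(\\(city\\)\\|\\(town\\)\\)[ ]+"
  "Hypothetical copy of the escaped regex string shown above.")

(defun my-find-capital-phrase (text)
  "Return the part of TEXT matched by `my-capital-regex', or nil."
  (when (string-match my-capital-regex text)
    (match-string 0 text)))

;; The trailing [ ]+ means the match includes the space after "city".
(my-find-capital-phrase "the capital and largest city of the region")
;; => "and largest city "
```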

For complex regular expressions, especially ones where white space actually matters, this function can produce bad results, because it strips any white space that sits next to a (, ), {, } or |. Most of my regular expressions are not that complex. I’ve so far been able to work around this by simply putting the significant white space inside a character alternative such as [ ].
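The white space caveat is easy to see in isolation. The sketch below is a simplified stand-in for the real preprocessor: it handles only ( and ), does not protect character alternatives, and uses the LITERAL argument of `replace-regexp-in-string` instead of the helper from this post. `my-escape-parens` is a hypothetical name.

```elisp
(defun my-escape-parens (s)
  "Escape ( and ) in S, eating any white space next to them.
Simplified illustration only; not the preprocessor from this post."
  (replace-regexp-in-string
   "[ \t\n]*)[ \t\n]*" "\\)"
   (replace-regexp-in-string "[ \t\n]*([ \t\n]*" "\\(" s t t)
   t t))

;; A bare space between two groups is treated as layout and dropped.
(my-escape-parens "(New) (York)")     ; prints as "\\(New\\)\\(York\\)"

;; Wrapped in a character alternative, the space survives.
(my-escape-parens "(New)[ ](York)")   ; prints as "\\(New\\)[ ]\\(York\\)"
```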

It works fine in GNU Emacs. I’ve not tested it with any other flavor.

Written by Al Stevens

March 7th, 2010 at 2:09 pm

Posted in Technical Things
