<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>About Anything &#187; really geeky</title>
	<atom:link href="http://www.alstevens.org/tag/really-geeky/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.alstevens.org</link>
	<description>The personal blog of Al Stevens. Focus is overrated.</description>
	<lastBuildDate>Mon, 07 Mar 2011 21:18:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Emacs elisp regex preprocessor hack</title>
		<link>http://www.alstevens.org/2010/03/07/elisp-regex-preprocessor-hack/</link>
		<comments>http://www.alstevens.org/2010/03/07/elisp-regex-preprocessor-hack/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 19:09:21 +0000</pubDate>
		<dc:creator>Al Stevens</dc:creator>
				<category><![CDATA[Technical Things]]></category>
		<category><![CDATA[emacs]]></category>
		<category><![CDATA[geeky]]></category>
		<category><![CDATA[lisp]]></category>
		<category><![CDATA[really geeky]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.alstevens.org/?p=429</guid>
		<description><![CDATA[I write a lot of elisp &#8212; that&#8217;s Emacs Lisp. I&#8217;m experimenting with Clojure, and may yet find a role for it, but for now, Emacs and elisp will remain our workhorses. My brain has no trouble with lots of embedded parentheses, but escaped regular expressions drive me crazy. I just can&#8217;t read them. And [...]]]></description>
			<content:encoded><![CDATA[<p>I write a lot of elisp &#8212; that&#8217;s <a href="http://directory.fsf.org/project/emacs/">Emacs</a> Lisp. I&#8217;m experimenting with <a href="http://clojure.org/">Clojure</a>, and may yet find a role for it, but for now, Emacs and elisp will remain our workhorses.</p>
<p>My brain has no trouble with lots of embedded parentheses, but escaped regular expressions drive me crazy. I just can&#8217;t read them. And because an Emacs Lisp regex is written as a Lisp string, every backslash has to be doubled. Trying to debug a simple doubly-escaped expression like this:</p>
<p>&#8220;\\([Aa][n]?[ ]*\\)?\\(\\(\\([Cc]ountry\\)\\|\\([Rr]epublic\\)\\|\\([Ii]ndependent [Ss]tate\\)\\|\\([Kk]ingdom\\)\\|\\([Mm]onarchy\\)\\|\\([Cc]onstitutional [Mm]onarchy\\)\\)\\|\\(\\(\\([Aa]utonomous\\)\\|\\([Ff]ederal\\)\\|\\([Ii]sland\\)\\|\\([Ii]slamic\\)\\|\\([Ii]ndependent\\)\\)[ ]*\\(\\([Cc]ountry\\)\\|\\([Rr]epublic\\)\\|\\([Ss]tate\\)\\)\\)\\)&#8221;</p>
<p>makes my head hurt &#8212; and it&#8217;s actually a single line. This one is used when we are trying to find all references to a country in one of our geographic dictionaries.</p>
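<p>For anyone wondering where the doubling comes from, here is a minimal sketch using only built-in functions. The function name my-double-escape-demo and the sample text are made up for illustration. The Lisp reader consumes one level of backslashes when it reads the string, so the regex engine sees only a single backslash before each paren or bar.</p>
<pre>
(defun my-double-escape-demo ()
  "Return the length of a small escaped regex and where it matches."
  (let ((regex "\\(cat\\|dog\\)s?"))
    (list
     ;; 14 characters reach the regex engine, not the 17 typed
     ;; between the quotes.
     (length regex)
     ;; Index of the first match in a sample string.
     (string-match regex "two dogs"))))

(my-double-escape-demo) ; returns (14 4)
</pre>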
<p>My hack is not perfect, but it allows me to write the above in a more readable form:</p>
<pre>
       "([Aa][n]?[ ]*)?
        (
          ( ([Cc]ountry)
            | ([Rr]epublic)
            | ([Ii]ndependent [Ss]tate)
            | ([Kk]ingdom)
            | ([Mm]onarchy)
            | ([Cc]onstitutional [Mm]onarchy) )
          |
          (
            ( ([Aa]utonomous)
              | ([Ff]ederal)
              | ([Ii]sland)
              | ([Ii]slamic)
              | ([Ii]ndependent) )
            [ ]*
            ( ([Cc]ountry)
              | ([Rr]epublic)
              | ([Ss]tate) )
          )
        )"
</pre>
<p>The Lisp code I use to do the pre-processing is simple. I include it in a &#8220;common&#8221; file which I load automatically whenever I start Emacs.</p>
<pre>
(defun xr-escape-regex (regex-string &#038;optional nparam)
  "Return REGEX-STRING with (, ), {, } and | backslash-escaped.
Character alternatives like [a-zA-Z] are left untouched.
NPARAM is the placeholder index used by the recursion; omit it."
  ;; Matches one character alternative, e.g. [a-zA-Z]
  (let ((ca-regex "\\[[^]]+\\]"))
    (cond
     ;; No character alternatives left: escape the remaining string.
     ((not (string-match ca-regex regex-string))
      (xr-escape-regex-basic regex-string))
     ;; Otherwise swap the first [ ] expression for an indexed placeholder.
     (t
      (let* ((n (or nparam 0))
             (ca-place-marker (format "---------- %d ----------" n))
             (saved-ca (match-string 0 regex-string))
             (regex-string-ca-removed
              (replace-match ca-place-marker t t regex-string)))
        ;; Recurse on the modified string with the next index,
        ;; putting each saved alternative back as the stack unwinds.
        (xr-replace-string
         (xr-escape-regex regex-string-ca-removed (1+ n))
         ca-place-marker saved-ca))))))

(defun xr-escape-regex-basic (regex-string)
  "Helper function that escapes '(', ')', '{', '}' and '|'.
Will happily escape these chars even if inside of [], so
should not be called directly.
Allows white space around the (, ), {, } and | characters.
Uses xr-replace-string, because
emacs replace-regexp-in-string chokes if rep contains \\."
  (xr-replace-string
   (xr-replace-string
    (xr-replace-string
     (xr-replace-string
      (xr-replace-string
       regex-string
       "[ \t\n]*([ \t\n]*" "\\(")
      "[ \t\n]*)[ \t\n]*" "\\)")
     "[ \t\n]*{[ \t\n]*" "\\{")
    "[ \t\n]*}[ \t\n]*" "\\}")
   "[ \t\n]*|[ \t\n]*" "\\|"))

(defun xr-replace-string (string regexp replacement)
  "Replace all occurrences in string matched by
regexp with replacement."
  (save-match-data
    (mapconcat #'identity
               (split-string string regexp)
               replacement)))
</pre>
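<p>The split-and-rejoin trick in xr-replace-string can be tried on its own. The sketch below uses only built-ins; the function name my-replace-both-ways and the sample input are invented for illustration. It also shows what I believe is an equivalent built-in route: passing a non-nil LITERAL (fifth) argument to replace-regexp-in-string, which inserts the replacement text verbatim, backslashes included.</p>
<pre>
(defun my-replace-both-ways (input)
  "Escape every ( in INPUT two different ways and return both results."
  (let ((escaped-paren "\\("))
    (list
     ;; The approach used above: split on the pattern,
     ;; then rejoin the pieces with the replacement.
     (mapconcat #'identity (split-string input "(") escaped-paren)
     ;; Built-in alternative: FIXEDCASE and LITERAL are both t.
     (replace-regexp-in-string "(" escaped-paren input t t))))

(my-replace-both-ways "f(x)") ; returns ("f\\(x)" "f\\(x)")
</pre>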
<p>To use it, I usually create a variable to hold the escaped regex and then use it where needed. If it&#8217;s a global, it will look like this:</p>
<pre>
(defvar xr-ont-regex-capital-to-trim nil)
(setq xr-ont-regex-capital-to-trim
      (xr-escape-regex
       "and[ ]+((chief)|(only)|(largest))[ ]+(large[ ]+)?((city)|(town))[ ]+"))
</pre>
<p>After the setq, the global will hold the regex string:</p>
<p>&#8220;and[ ]+\\(\\(chief\\)\\|\\(only\\)\\|\\(largest\\)\\)[ ]+\\(large[ ]+\\)?\\(\\(city\\)\\|\\(town\\)\\)[ ]+&#8221;</p>
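<p>Once escaped, the string is used like any other Emacs regex. A minimal sketch, with the escaped string pasted in literally so it runs on its own; the function name my-capital-phrase and the sample sentence are invented for illustration:</p>
<pre>
(defun my-capital-phrase (text)
  "Return the part of TEXT matched by the escaped regex, or nil."
  (let ((regex (concat "and[ ]+\\(\\(chief\\)\\|\\(only\\)\\|\\(largest\\)\\)"
                       "[ ]+\\(large[ ]+\\)?\\(\\(city\\)\\|\\(town\\)\\)[ ]+")))
    (when (string-match regex text)
      (match-string 0 text))))

(my-capital-phrase "London, capital and largest city of England")
;; returns "and largest city "
</pre>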
<p>For complex regular expressions, especially ones where white space actually matters, this function can produce bad results, because any white space next to a (, ), {, } or | outside of square brackets is stripped. Most of my regular expressions are not that complex. I&#8217;ve so far been able to work around this by simply embedding the white space in a character alternative, like [ ].</p>
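<p>A small sketch of where that caveat comes from, using the built-in replace-regexp-in-string as a stand-in for one step of xr-escape-regex-basic (the sample inputs and the name my-escape-open-paren are invented). A bare space next to a paren is swallowed, while a space wrapped in [ ] survives:</p>
<pre>
(defun my-escape-open-paren (s)
  "Escape ( in S, stripping surrounding white space, as the hack does."
  ;; The trailing t t asks for a case-preserving, literal replacement.
  (replace-regexp-in-string "[ \t\n]*([ \t\n]*" "\\(" s t t))

(my-escape-open-paren "in (the)")    ; returns "in\\(the)", the space is gone
(my-escape-open-paren "in[ ](the)")  ; returns "in[ ]\\(the)"
</pre>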
<p>It works fine in GNU Emacs. I&#8217;ve not tested it with any other flavor.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.alstevens.org/2010/03/07/elisp-regex-preprocessor-hack/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

