2024-10-28: String Splitting

I've been looking at one of the first examples used on the Raku language site: Raku by Example. This provides a little test case for exploring some string splitting and related Lisp libraries.

The program reads in a file of data and does some pre-processing. What struck me as interesting was the way the data was split up into fields.

The file format for a single line is:

Ana Dave | 3:0

and the Raku code to get this into four variables is:

    my ($pairing, $result) = $line.split(' | ');
    my ($p1, $p2)          = $pairing.words;
    my ($r1, $r2)          = $result.split(':');

What does this look like in Lisp?

No Libraries

The "no library" version could be written as:

       (let* ((space-posn (position #\space line))
              (bar-posn (position #\| line))
              (colon-posn (position #\: line))
              (pairing-1 (subseq line 0 space-posn))
              (pairing-2 (subseq line (+ space-posn 1) (- bar-posn 1)))
              (result-1 (subseq line (+ bar-posn 2) colon-posn))
              (result-2 (subseq line (+ colon-posn 1))))
       ; ...
       )

This uses position to find the indices of the three key split points in the string, and then careful subseq operations to retrieve the four required values.

This is not too bad to write for small examples, but it is clearly not robust: it assumes exactly one space around each separator. The code also does not show how the splitting relates to the structure of the string.
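To make the fragility concrete, here is a small sketch that wraps the same position/subseq steps in a function. The name parse-line-by-position is hypothetical, chosen only for this illustration; the second call uses a made-up line with two spaces between the names.

```lisp
;; Hypothetical wrapper around the position/subseq approach shown above.
(defun parse-line-by-position (line)
  "Return the two player names and two scores in LINE as a list of strings."
  (let* ((space-posn (position #\space line))
         (bar-posn (position #\| line))
         (colon-posn (position #\: line)))
    (list (subseq line 0 space-posn)
          (subseq line (+ space-posn 1) (- bar-posn 1))
          (subseq line (+ bar-posn 2) colon-posn)
          (subseq line (+ colon-posn 1)))))

(parse-line-by-position "Ana Dave | 3:0")   ; => ("Ana" "Dave" "3" "0")
(parse-line-by-position "Ana  Dave | 3:0")  ; => ("Ana" " Dave" "3" "0")
```

The extra space ends up inside the second name, because the offsets are hard-wired to the exact spacing of the sample line.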

Using UIOP Library

The UIOP library is an important one, because it comes along with ASDF, so you get it "for free" when using the library system. The library offers a useful function, split-string, which splits strings based on one or more separating characters.

In this way, our string is split by using all three separator characters in one go:

* (uiop:split-string "Ann Dave | 3:0" :separator " |:")
("Ann" "Dave" "" "" "3" "0")

The above code can then be rewritten, using destructuring-bind and removing all the empty strings:

        (destructuring-bind (pairing-1 pairing-2 result-1 result-2)
          (remove-if #'uiop:emptyp (uiop:split-string line :separator " |:"))
        ; ...
        )

This works, but it again ignores the structure of the data, which the Raku program reflected: first splitting the data into two halves, and then dealing with each half in a different way.
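The two-stage structure can be recovered with UIOP alone, at the cost of some trimming. The following is only a sketch: the function name parse-line-in-stages is hypothetical, and it assumes a single vertical bar per line. It splits on the bar first, trims the spaces left on each half with the standard string-trim, and then splits each half with its own separator.

```lisp
(require :asdf) ; UIOP is bundled with ASDF

;; Hypothetical helper: split on the bar first, then split each half.
(defun parse-line-in-stages (line)
  "Return the two player names and two scores in LINE as a list of strings."
  (destructuring-bind (pairing result)
      (mapcar (lambda (half) (string-trim " " half))
              (uiop:split-string line :separator "|"))
    (destructuring-bind (player-1 player-2)
        (uiop:split-string pairing :separator " ")
      (destructuring-bind (result-1 result-2)
          (uiop:split-string result :separator ":")
        (list player-1 player-2 result-1 result-2)))))

(parse-line-in-stages "Ann Dave | 3:0") ; => ("Ann" "Dave" "3" "0")
```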

Raku's split method also has an ability that uiop's split-string lacks: it can split a string on a multi-character substring, returning the parts before and after the substring.

[3] > 'a-b-c | d-e-f'.split(' | ')
(a-b-c d-e-f)
[4] > 'a-b-c | d-e-f'.split('|')
(a-b-c   d-e-f)

The first split has removed the vertical bar and surrounding spaces. Can we do this in Lisp?
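With no library at all, one possible approach uses the standard search function to find each occurrence of the separator. This is a minimal sketch: the name split-on-substring is hypothetical, and it makes no attempt to drop empty strings.

```lisp
;; Hypothetical helper: split STRING at each occurrence of SEPARATOR.
(defun split-on-substring (separator string)
  "Return the list of substrings of STRING lying between copies of SEPARATOR."
  (assert (plusp (length separator)) () "SEPARATOR must not be empty")
  (loop with step = (length separator)
        for start = 0 then (+ end step)          ; resume just after the last match
        for end = (search separator string :start2 start)
        collect (subseq string start end)        ; END is NIL for the final piece
        while end))

(split-on-substring " | " "a-b-c | d-e-f") ; => ("a-b-c" "d-e-f")
(split-on-substring "|" "a-b-c | d-e-f")   ; => ("a-b-c " " d-e-f")
```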

Using str Library

The str library provides many important string-processing functions.

The function we are looking at is str's split function. This does several things, but, to answer the points above, it can split on a character or on a substring:

* (str:split #\| "Ann Dave | 3:0")
("Ann Dave " " 3:0")
* (str:split " | " "Ann Dave | 3:0")
("Ann Dave" "3:0")

and it has a flag to remove empty strings from the result:

* (str:split #\space " one two  three ")
("" "one" "two" "" "three" "")
* (str:split #\space " one two  three " :omit-nulls t)
("one" "two" "three")

We use split and words to replicate the three split statements from the Raku code:

        (destructuring-bind (pairing result) (str:split " | " line)
          (destructuring-bind (pairing-1 pairing-2) (str:words pairing)
            (destructuring-bind (result-1 result-2) (str:split ":" result)
        ; ...
        )))

Of course, the binding part looks a bit ugly with those nested, repeated destructuring-bind forms. We can clean this up with another library: metabang-bind, whose bind macro combines let, destructuring-bind and multiple-value-bind.

With this, we write three destructuring bindings in a single form, evaluated in order, splitting the line in steps and echoing the original Raku program:

        (metabang-bind:bind 
          (((pairing result) (str:split " | " line))
           ((pairing-1 pairing-2) (str:words pairing))
           ((result-1 result-2) (str:split ":" result)))
        ; ...
        )

The only remaining point about Raku's treatment of the input data is that Raku is weakly typed here, and is happy to take the strings returned for the results and use them as numbers:

[0] > 1 + 2
3
[1] > 1 + '2'
3
[2] > "1" + 2
3
[3] > "1" + "2"
3

whereas strongly-typed Lisp is not so happy:

* (+ 1 2)
3
* (+ 1 "2")

debugger invoked on a TYPE-ERROR @B800002FFF in thread
#<THREAD tid=2954 "main thread" RUNNING {1000B90003}>:
  The value
    "2"
  is not of type
    NUMBER

For Lisp, we need to convert the results to integers, which is done after splitting the result variable.
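The standard function for this is parse-integer. As a short sketch of how it behaves (the helper name parse-score is hypothetical, and it assumes the score always contains a colon):

```lisp
;; Hypothetical helper: turn a score such as "3:0" into two integers.
(defun parse-score (result)
  "Return the numbers either side of the colon in RESULT as a list."
  (let ((colon-posn (position #\: result)))
    (list (parse-integer result :end colon-posn)
          (parse-integer result :start (1+ colon-posn)))))

(+ 1 (parse-integer "2"))             ; => 3
(parse-score "3:0")                   ; => (3 0)
(parse-integer "abc" :junk-allowed t) ; => NIL, instead of signalling an error
```

Without :junk-allowed, parse-integer signals an error on input that is not an integer, which is usually what you want when a data file is malformed.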

Final Program

For completeness, my final Lisp program is:

(load "~/quicklisp/setup.lisp")
(ql:quickload :metabang-bind :silent t)
(ql:quickload :str :silent t)

(format t "Tournament results:~%~%")

(let* ((lines (uiop:read-file-lines "scores.txt"))
       (names (str:words (first lines)))
       (matches (make-hash-table :test #'equalp))
       (sets (make-hash-table :test #'equalp)))
  ;
  (dolist (line (rest lines))
    (unless (str:emptyp line)
      (metabang-bind:bind 
        (((pairing result) (str:split " | " line))
         ((player-1 player-2) (str:words pairing))
         ((result-1 result-2) (mapcar #'parse-integer
                                      (str:split ":" result))))

        (incf (gethash player-1 sets 0) result-1)
        (incf (gethash player-2 sets 0) result-2)
        (if (> result-1 result-2)
          (incf (gethash player-1 matches 0))
          (incf (gethash player-2 matches 0))))))
  ;
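  ;; Sort by sets, then by matches. The second sort must be stable so that
  ;; the ordering by sets survives as the tie-break; plain sort does not
  ;; guarantee that. The default of 0 covers players with no wins.
  (setf names (sort names #'< :key #'(lambda (name) (gethash name sets 0))))
  (setf names (stable-sort names #'< :key #'(lambda (name) (gethash name matches 0))))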
  ;
  (dolist (name (reverse names))
    (format t "~a has won ~d ~:*~[matches~;match~:;matches~] and ~d set~:p~&"
            name 
            (gethash name matches 0)
            (gethash name sets 0))))

Note the use of format conditionals to handle plurals.

~:* steps back in the list of arguments, to reuse the count of matches
~[...~] picks the relevant match/matches based on whether the count is 0, 1, or more
~:p adds an "s" if the previous argument is not 1
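
These directives can be tried in isolation. The sketch below uses a cut-down control string with only the two counts; the function name summary is hypothetical.

```lisp
;; Hypothetical helper: format only the counts, to show the plural handling.
(defun summary (matches sets)
  "Return a string such as \"1 match and 3 sets\"."
  (format nil "~d ~:*~[matches~;match~:;matches~] and ~d set~:p"
          matches sets))

(summary 0 1) ; => "0 matches and 1 set"
(summary 1 3) ; => "1 match and 3 sets"
(summary 2 0) ; => "2 matches and 0 sets"
```

The ~:; clause is the default, so any count above 1 falls through to "matches".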