2024-10-25: Text Classification with ABCL and Weka

This is an exercise I've done in a few JVM-based languages. The idea is to do some simple text classification in Lisp using the Weka library. As we are working on the JVM, the code runs in ABCL (Armed Bear Common Lisp).

For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.

So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.

To install Weka, download the latest version (I used 3.8.6), and extract the file weka.jar.

Overall process

The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)

First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
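
As a rough illustration of this first step in plain Common Lisp: the helper names clean-text and split-words are made up for this sketch and do not appear in the script, where lowercasing and tokenising are left to Weka.

```lisp
(defun clean-text (text)
  "Replace every non-alphabetic character with a space and lowercase the rest."
  (string-downcase (substitute-if-not #\space #'alpha-char-p text)))

(defun split-words (text)
  "Split TEXT on spaces, dropping the empty tokens left by runs of spaces."
  (loop with start = 0
        for end = (position #\space text :start start)
        for token = (subseq text start end)
        unless (string= token "") collect token
        while end
        do (setf start (1+ end))))

(split-words (clean-text "Address: 3 High Street, London."))
;; => ("address" "high" "street" "london")
```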

Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that related forms like "walks" and "walking" are reduced to their stem "walk"; this means more words match across different documents.
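
To make the idea concrete, here is a toy version of both operations. The stop-list and the suffix rule are invented for illustration only: the script itself relies on Weka's Rainbow stop-list and Lovins stemmer, which are far more thorough than this.

```lisp
(defparameter *stop-words* '("the" "a" "an" "of" "in")
  "A tiny illustrative stop-list, not the one used by Weka.")

(defun remove-stop-words (words)
  "Drop every word that appears in *STOP-WORDS*."
  (remove-if (lambda (word) (member word *stop-words* :test #'string=))
             words))

(defun toy-stem (word)
  "Strip a trailing \"ing\" or \"s\". A toy rule, NOT the Lovins algorithm."
  (let ((len (length word)))
    (cond ((and (> len 4) (string= "ing" word :start2 (- len 3)))
           (subseq word 0 (- len 3)))
          ((and (> len 2) (string= "s" word :start2 (- len 1)))
           (subseq word 0 (- len 1)))
          (t word))))

(mapcar #'toy-stem
        (remove-stop-words '("the" "dog" "walks" "in" "a" "park")))
;; => ("dog" "walk" "park")
```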

Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
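
The representation can be sketched in a few lines of plain Lisp. Here a document is assumed to be a list of words, and build-vocabulary and document->vector are hypothetical helpers for illustration; in the script this work is done by Weka's StringToWordVector filter.

```lisp
(defun build-vocabulary (documents)
  "Return a sorted list of the distinct words across DOCUMENTS."
  (let ((words '()))
    (dolist (document documents)
      (dolist (word document)
        (pushnew word words :test #'string=)))
    (sort words #'string<)))

(defun document->vector (document vocabulary)
  "Return a list holding 1 where the vocabulary word occurs in DOCUMENT, else 0."
  (mapcar (lambda (word)
            (if (member word document :test #'string=) 1 0))
          vocabulary))

(let* ((documents '(("film" "award" "star")
                    ("phone" "software" "award")))
       (vocabulary (build-vocabulary documents)))
  ;; vocabulary is ("award" "film" "phone" "software" "star")
  (mapcar (lambda (document) (document->vector document vocabulary))
          documents))
;; => ((1 1 0 0 1) (1 0 1 1 0))
```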

Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.
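
One simple way to rank attributes is by the strength of their correlation with the class label. The sketch below shows that idea on a toy dataset; it is a simplified illustration with made-up helper names, not a reproduction of what Weka's evaluator does internally.

```lisp
(defun average (numbers)
  (/ (reduce #'+ numbers) (length numbers)))

(defun correlation (xs ys)
  "Pearson correlation between two equal-length lists of numbers.
Returns 0 when either list is constant."
  (let* ((mx (average xs))
         (my (average ys))
         (sxy (reduce #'+ (mapcar (lambda (x y) (* (- x mx) (- y my))) xs ys)))
         (sxx (reduce #'+ (mapcar (lambda (x) (expt (- x mx) 2)) xs)))
         (syy (reduce #'+ (mapcar (lambda (y) (expt (- y my) 2)) ys))))
    (if (or (zerop sxx) (zerop syy))
        0
        (/ sxy (sqrt (* sxx syy))))))

(defun top-attributes (columns class-values n)
  "Return the names of the N columns most correlated with CLASS-VALUES.
COLUMNS is a list of (name . values) pairs; ties keep their input order."
  (let ((scored (mapcar (lambda (column)
                          (cons (car column)
                                (abs (correlation (cdr column) class-values))))
                        columns)))
    (mapcar #'car
            (subseq (stable-sort scored #'> :key #'cdr)
                    0 (min n (length scored))))))

;; Four documents: the first two in one class, the last two in the other.
(top-attributes '(("award" 1 0 1 0)
                  ("film"  1 1 0 0)
                  ("phone" 0 0 1 1)
                  ("said"  1 1 1 1))
                '(1 1 0 0)
                2)
;; => ("film" "phone")
```

Words such as "film" and "phone" track the class perfectly (in opposite directions), whereas "award" and "said" carry no information about it, so they are the ones dropped.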

Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.
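
For readers unfamiliar with k-fold cross-validation, the splitting can be sketched as follows. The functions fold-bounds and train-test-split are hypothetical and work on plain lists; the script uses Weka's own methods for this, which may divide the data differently in detail.

```lisp
(defun fold-bounds (n k fold)
  "Return the start (inclusive) and end (exclusive) of test fold FOLD when
N items are divided into K contiguous folds. Earlier folds absorb any remainder."
  (multiple-value-bind (size extra) (floor n k)
    (let ((start (+ (* fold size) (min fold extra))))
      (values start (+ start size (if (< fold extra) 1 0))))))

(defun train-test-split (items k fold)
  "Return two values: the training list and the test list for FOLD."
  (multiple-value-bind (start end) (fold-bounds (length items) k fold)
    (values (append (subseq items 0 start) (subseq items end))
            (subseq items start end))))

;; Seven items in three folds: every item is tested exactly once.
(dotimes (fold 3)
  (multiple-value-bind (train test)
      (train-test-split '(a b c d e f g) 3 fold)
    (format t "fold ~a: train ~a test ~a~%" fold train test)))
```

Each fold trains a fresh model on the training part and tests it on the held-out part, so the combined results cover every instance once.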

Script

The script follows. It is actually one file, but broken up with some explanations.

https://bitbucket.org/pclprojects/workspace/snippets/rqG4Md/abcl-text-classification-with-weka

At the start of the script, import the required libraries.

(require :asdf)
(require :confusion-matrix) ; <1>
(require :java)

<1> Install Confusion Matrix - my own library providing a confusion matrix.

The following function exists because the preprocessing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the instances. Notice that the output result may have a different structure (number and type of attributes) to the input instances.

This function illustrates one of ABCL's Java FFI functions: jcall, which takes the name of the method, the object to call it on, and the arguments. For example, the expression

  (jcall "setInputFormat" filter-fn instances)

is equivalent to the Java call:

  filterFn.setInputFormat(instances);

(defun apply-filter (instances filter-fn)
  (jcall "setInputFormat" filter-fn instances)
  (dotimes (i (jcall "size" instances))                   ; <1>
    (jcall "input" filter-fn (jcall "get" instances i)))  ; <2>
  (jcall "batchFinished" filter-fn)

  (let ((result (jcall "getOutputFormat" filter-fn)))
    (do ((instance (jcall "output" filter-fn)
                   (jcall "output" filter-fn)))
      ((not instance) result)
      (jcall "add" result instance))))

<1> Use a loop based on index to iterate over the Java Instances.
<2> The Instances#get method returns a single instance.

The pre-process function covers the first four(!) steps described above.

Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).

Step 1 is done using Lisp's substitute-if-not, to replace all non-alphabetic characters with spaces.

Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Calling setOutputWordCounts with false means the values will be 1 or 0, not actual word counts.

Step 4 is achieved using a second filter, AttributeSelection, which pairs a CorrelationAttributeEval evaluator with the Ranker search to keep the 300 most predictive attributes.

Here, we use jnew to instantiate Java classes.

(defun pre-process (text-directory)
  (let ((loader (jnew "weka.core.converters.TextDirectoryLoader")))
    (jcall "setSource" loader (jnew "java.io.File" text-directory))
    (let ((instances (jcall "getDataSet" loader)))

      ; remove numbers/punctuation - step 1
      (dotimes (i (jcall "size" instances))
        (let ((instance (jcall "get" instances i)))
          ; replace every non-alphabetic character with a space
          (jcall "setValue" instance 0 
                   (substitute-if-not #\space                     ; <1>
                                      #'alpha-char-p
                                      (jcall "stringValue" instance 0)))))

      ; turn into vector of words, applying filters - steps 2 & 3   <2>
      (let ((string->words (jnew "weka.filters.unsupervised.attribute.StringToWordVector")))
        (jcall "setLowerCaseTokens" string->words t)
        (jcall "setOutputWordCounts" string->words nil)
        (jcall "setStemmer" string->words (jnew "weka.core.stemmers.LovinsStemmer"))
        (jcall "setStopwordsHandler" string->words (jnew "weka.core.stopwords.Rainbow"))
        (jcall "setTokenizer" string->words (jnew "weka.core.tokenizers.WordTokenizer"))
        ; -- apply the filter
        (setf instances (apply-filter instances string->words)))

      ; identify the class label
      (jcall "setClassIndex" instances 0)

      ; reduce number of attributes to 300 - step 4
      (let ((selector (jnew "weka.filters.supervised.attribute.AttributeSelection"))
            (ranker (jnew "weka.attributeSelection.Ranker")))
        (jcall "setEvaluator" selector (jnew "weka.attributeSelection.CorrelationAttributeEval"))
        (jcall "setNumToSelect" ranker 300)
        (jcall "setSearch" selector ranker)
        ; apply the filter
        (setf instances (apply-filter instances selector)))

      ; randomise order of instances
      (jcall "randomize" instances (jnew "java.util.Random"))

      instances)))

<1> Use substitute-if-not to replace non-letters with space.
<2> The StringToWordVector filter offers many options: some require a new instance (like the stemmer) and others a plain value (like whether to use lower case tokens).

Step 5 is the task of the evaluate-classifier function, which tests a given classification algorithm. Weka's Instances class provides trainCV and testCV to produce the train/test sets for each fold of a k-fold cross-validation, so we use those to build and evaluate a fresh classifier for each fold. The classifier is passed in as its fully qualified class name, and a confusion-matrix instance collates the results across all folds.

(defun number->class-label (n)                                          ; <1>
  "Convert real number into a class keyword."
  (if (> n 0.5) :positive :negative))

(defun evaluate-classifier (name classifier instances k)
  "Evaluate and report results of classifier based on k-fold CV."
  (let ((cm (cm:make-confusion-matrix :labels '(:positive :negative)))) ; <2>
    (dotimes (i k)
      (let ((model (jnew classifier))                                   ; <3>
            (train (jcall "trainCV" instances k i))                     ; <4>
            (test (jcall "testCV" instances k i)))
        (jcall "buildClassifier" model train)
        (dotimes (j (jcall "size" test))                                ; <5>
          (let ((instance (jcall "get" test j)))
            (cm:confusion-matrix-add 
              cm
              (number->class-label 
                (jcall "classValue" instance)) ; predicted class
              (number->class-label 
                (jcall "classifyInstance" model instance))))))) ; observed class
    ; report results
    (format t "Classifier: ~a~&" name)
    (format t " -- Precision      ~,3f~&" (cm:precision cm))            ; <6>
    (format t " -- Recall         ~,3f~&" (cm:recall cm))
    (format t " -- Geometric mean ~,3f~&" (cm:geometric-mean cm))))

<1> Simple function to convert a numeric prediction into a class label.
<2> Create a confusion-matrix to collate results.
<3> Construct an instance of the classifier type.
<4> Use Weka's trainCV and testCV to create train/test splits for each fold of the cross-validation process.
<5> Run through every test instance in turn, recording its predicted and observed class in the confusion matrix.
<6> Pull out aggregate results from the confusion matrix.
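
As a reminder of what the first two reported measures mean, here is a sketch using the textbook definitions of precision and recall on raw counts. These two functions are written for illustration and are not the confusion-matrix library's code; the counts are hypothetical and not taken from the run reported below.

```lisp
(defun precision (true-positives false-positives)
  "Fraction of instances labelled positive that really are positive."
  (/ true-positives (+ true-positives false-positives)))

(defun recall (true-positives false-negatives)
  "Fraction of truly positive instances that were labelled positive."
  (/ true-positives (+ true-positives false-negatives)))

;; Hypothetical counts: 90 true positives, 10 false positives, 5 false negatives.
(format t "Precision ~,3f~%" (precision 90 10))
(format t "Recall    ~,3f~%" (recall 90 5))
```

Both functions signal a division-by-zero error if their denominators are zero, which a real library would need to handle.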

The top level simply loads the dataset through the pre-process function, and then evaluates each classifier in turn. Notice how the classifiers are identified by their full class name.

(let ((data (pre-process "bbc/")))
  (dolist (classifier 
            '(("Decision Tree" "weka.classifiers.trees.J48")
              ("Naive Bayes" "weka.classifiers.bayes.NaiveBayes")
              ("Support Vector Machine" "weka.classifiers.functions.SMO")))
    (evaluate-classifier (first classifier)
                         (second classifier)
                         data
                         10)))
(exit)

On my system, the script runs through in about 30 seconds. The output is:

$ java --add-opens java.base/java.lang=ALL-UNNAMED -cp abcl.jar:weka.jar org.armedbear.lisp.Main --load abcl-weka-text.lisp
Armed Bear Common Lisp 1.9.2
Java 23 Oracle Corporation
OpenJDK 64-Bit Server VM
Low-level initialization completed in 0.388 seconds.
Startup completed in 2.186 seconds.
Classifier: Decision Tree
 -- Precision      0.950
 -- Recall         0.977
 -- Geometric mean 0.963
Classifier: Naive Bayes
 -- Precision      0.982
 -- Recall         0.974
 -- Geometric mean 0.978
Classifier: Support Vector Machine
 -- Precision      0.979
 -- Recall         0.990
 -- Geometric mean 0.985