This is an exercise I've done in a few JVM-based languages. The idea is to do some simple text classification in Lisp using the Weka library. As we are working on the JVM, the code runs in ABCL.
For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.
So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.
To install Weka, download the latest version (I used 3.8.6), and extract the file weka.jar.
Overall process
The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)
First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
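As a rough illustration of this first step, here is a minimal sketch in plain Common Lisp, with no Weka involved. The function names normalise-text and split-words are my own, made up for this example; the real script leaves the lowercasing and word splitting to Weka.

```lisp
(defun normalise-text (text)
  "Replace every non-alphabetic character with a space, then lowercase."
  (string-downcase (substitute-if-not #\space #'alpha-char-p text)))

(defun split-words (text)
  "Split TEXT on spaces, dropping the empty strings between repeated spaces."
  (let ((words '())
        (start nil))                       ; index where the current word began
    (dotimes (i (length text))
      (cond ((char= (char text i) #\space)
             (when start
               (push (subseq text start i) words)
               (setf start nil)))
            ((null start)
             (setf start i))))
    (when start                            ; word running up to the end of TEXT
      (push (subseq text start) words))
    (nreverse words)))

;; (split-words (normalise-text "Address: 3 High Street, London."))
;; => ("address" "high" "street" "london")
```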
Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
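To make the idea concrete, the sketch below filters a word list against a tiny stop-list and applies a toy suffix-stripper. Both are stand-ins I have invented for illustration: the stop-list is far shorter than the one Weka uses, and toy-stem is not the Lovins algorithm, it only knows about "ing" and "s".

```lisp
(defparameter *toy-stop-list* '("the" "a" "an" "of" "and")
  "Hypothetical stop-list, for illustration only.")

(defun strip-suffix (word suffix)
  "Return WORD without SUFFIX if WORD ends in it and at least 3 letters remain, else NIL."
  (let ((cut (- (length word) (length suffix))))
    (when (and (>= cut 3)
               (string= suffix word :start2 cut))
      (subseq word 0 cut))))

(defun toy-stem (word)
  "Very crude stemmer: strip a trailing \"ing\" or \"s\", otherwise return WORD."
  (or (strip-suffix word "ing")
      (strip-suffix word "s")
      word))

(defun simplify-words (words)
  "Drop stop-words from WORDS, then stem what is left."
  (mapcar #'toy-stem
          (remove-if (lambda (w) (member w *toy-stop-list* :test #'string=))
                     words)))

;; (simplify-words '("the" "dog" "walks" "and" "a" "cat" "walking"))
;; => ("dog" "walk" "cat" "walk")
```

After this step "walks" and "walking" count as the same word, which is the matching effect described above.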
Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
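The representation can be sketched in a few lines of plain Lisp. This is only a picture of the data layout, not what Weka does internally; the names vocabulary and document->instance are hypothetical, and each document is assumed to be a list of already-cleaned words.

```lisp
(defun vocabulary (documents)
  "Sorted list of the distinct words found across DOCUMENTS (lists of words)."
  (sort (copy-list
         (remove-duplicates (reduce #'append documents :initial-value '())
                            :test #'string=))
        #'string<))

(defun document->instance (words vocab label)
  "Return (LABEL v1 v2 ...), where each v is 1 if that vocabulary word is in WORDS, else 0."
  (cons label
        (mapcar (lambda (term)
                  (if (member term words :test #'string=) 1 0))
                vocab)))

;; With two tiny documents:
;; (vocabulary '(("film" "star" "award") ("phone" "star")))
;; => ("award" "film" "phone" "star")
;; (document->instance '("phone" "star") '("award" "film" "phone" "star") :tech)
;; => (:TECH 0 0 1 1)
```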
Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.
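One way to picture attribute selection is to score every attribute by how strongly it correlates with the class label and keep the top few. The sketch below does this with a plain Pearson correlation over 0/1 values. It is a simplification under my own assumptions: the function names are invented, and it does not reproduce how Weka's evaluator treats nominal classes.

```lisp
(defun mean (xs)
  (/ (reduce #'+ xs) (length xs)))

(defun pearson (xs ys)
  "Pearson correlation of two equal-length lists; 0 when either list is constant."
  (let* ((mx (mean xs))
         (my (mean ys))
         (sxy (reduce #'+ (mapcar (lambda (x y) (* (- x mx) (- y my))) xs ys)))
         (sxx (reduce #'+ (mapcar (lambda (x) (expt (- x mx) 2)) xs)))
         (syy (reduce #'+ (mapcar (lambda (y) (expt (- y my) 2)) ys))))
    (if (or (zerop sxx) (zerop syy))
        0
        (/ sxy (sqrt (* sxx syy))))))

(defun rank-attributes (columns labels n)
  "COLUMNS is a list of (name . values). Return the N names whose values
correlate most strongly (in absolute terms) with LABELS."
  (let ((scored (mapcar (lambda (col)
                          (cons (car col) (abs (pearson (cdr col) labels))))
                        columns)))
    (mapcar #'car
            (subseq (sort scored #'> :key #'cdr)
                    0 (min n (length scored))))))

;; Four documents, labels 0 = entertainment, 1 = tech (made-up data):
;; (rank-attributes '(("film" 1 1 0 0) ("phone" 0 0 1 1) ("star" 1 0 1 0))
;;                  '(0 0 1 1) 2)
;; keeps "film" and "phone" and drops "star", which appears equally in both classes.
```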
Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.
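For readers new to k-fold cross-validation, the following sketch shows the splitting idea on an ordinary list. It is an assumption-laden illustration: cv-split is a name I made up, and it assigns item i to fold (mod i k), which is simpler than, and not the same as, the way Weka divides its instances.

```lisp
(defun cv-split (items k fold)
  "Return two values: the training items and the test items for FOLD (0-based) of K."
  (let ((train '())
        (test '()))
    (loop for item in items
          for i from 0
          do (if (= (mod i k) fold)
                 (push item test)      ; held out for this fold
                 (push item train)))   ; used to build the model
    (values (nreverse train) (nreverse test))))

;; (cv-split '(a b c d e f) 3 1)
;; => (A C D F), (B E)
```

Running this for every fold from 0 to k-1 means each item is used for testing exactly once, and the per-fold results can then be pooled into one overall figure.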
Script
The script follows. It is actually one file, but broken up with some explanations.
https://bitbucket.org/pclprojects/workspace/snippets/rqG4Md/abcl-text-classification-with-weka
At the start of the script, import the required libraries.
(require :asdf)
(require :confusion-matrix) ; <1>
(require :java)
- Install Confusion Matrix - my own library providing a Confusion Matrix.
The following function was created because our preprocessing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the instances. Notice that the output result may have a different structure (number and type of attributes) to the input instances.
This function illustrates one of ABCL's Java FFI (JFFI) functions: jcall, which takes the name of the method, the object to use and the arguments. e.g. the expression
(jcall "setInputFormat" filter-fn instances)
is equivalent to the Java call:
filterFn.setInputFormat(instances);
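To try the calling convention without Weka on the classpath, here is a small sketch against java.util.ArrayList. It only does anything under ABCL (hence the #+abcl guard), the function name array-list-demo is invented for this example, and it uses jnew to create the Java object.

```lisp
#+abcl
(progn
  (require :java)

  (defun array-list-demo ()
    "Build a java.util.ArrayList from Lisp and return its size."
    (let ((items (jnew "java.util.ArrayList")))   ; new ArrayList()
      (jcall "add" items "entertainment")         ; items.add("entertainment")
      (jcall "add" items "tech")                  ; items.add("tech")
      (jcall "size" items))))                     ; items.size()
```

Calling (array-list-demo) at the ABCL prompt should return 2, since the Java int comes back as an ordinary Lisp integer.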
(defun apply-filter (instances filter-fn)
  (jcall "setInputFormat" filter-fn instances)
  (dotimes (i (jcall "size" instances))                   ; <1>
    (jcall "input" filter-fn (jcall "get" instances i)))  ; <2>
  (jcall "batchFinished" filter-fn)
  (let ((result (jcall "getOutputFormat" filter-fn)))
    (do ((instance (jcall "output" filter-fn) (jcall "output" filter-fn)))
        ((not instance) result)
      (jcall "add" result instance))))
- Use a loop based on index to iterate over the Java Instances.
- The Instances#get method returns the instance at the given index.
The pre-process function covers the first four(!) steps described above.
Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).
Step 1 is done using Lisp's substitute-if-not, to replace all non-alphabetic characters with spaces.
Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Setting outputWordCounts to false means the values will be 1 or 0, not actual word counts.
Step 4 is achieved using a second filter, AttributeSelection, configured with the CorrelationAttributeEval evaluator and a Ranker search to pick the most predictive 300 attributes.
Here, we use jnew to instantiate Java classes.
(defun pre-process (text-directory)
  (let ((loader (jnew "weka.core.converters.TextDirectoryLoader")))
    (jcall "setSource" loader (jnew "java.io.File" text-directory))
    (let ((instances (jcall "getDataSet" loader)))
      ; remove numbers/punctuation - step 1
      (dotimes (i (jcall "size" instances))
        (let ((instance (jcall "get" instances i)))
          ; replace every non-letter character with a space
          (jcall "setValue" instance 0
                 (substitute-if-not #\space                       ; <1>
                                    #'alpha-char-p
                                    (jcall "stringValue" instance 0)))))
      ; turn into vector of words, applying filters - steps 2 & 3 <2>
      (let ((string->words (jnew "weka.filters.unsupervised.attribute.StringToWordVector")))
        (jcall "setLowerCaseTokens" string->words t)
        (jcall "setOutputWordCounts" string->words nil)
        (jcall "setStemmer" string->words (jnew "weka.core.stemmers.LovinsStemmer"))
        (jcall "setStopwordsHandler" string->words (jnew "weka.core.stopwords.Rainbow"))
        (jcall "setTokenizer" string->words (jnew "weka.core.tokenizers.WordTokenizer"))
        ; -- apply the filter
        (setf instances (apply-filter instances string->words)))
      ; identify the class label
      (jcall "setClassIndex" instances 0)
      ; reduce number of attributes to 300 - step 4
      (let ((selector (jnew "weka.filters.supervised.attribute.AttributeSelection"))
            (ranker (jnew "weka.attributeSelection.Ranker")))
        (jcall "setEvaluator" selector (jnew "weka.attributeSelection.CorrelationAttributeEval"))
        (jcall "setNumToSelect" ranker 300)
        (jcall "setSearch" selector ranker)
        ; apply the filter
        (setf instances (apply-filter instances selector)))
      ; randomise order of instances
      (jcall "randomize" instances (jnew "java.util.Random"))
      instances)))
- Use substitute-if-not to replace non-letters with space.
- The StringToWordVector filter offers many options: some require providing a new instance (like the stemmer) and others a value (like whether to use lower case tokens).
Step 5 is the task of the evaluate-classifier function, used to test a given classification algorithm. Weka provides methods on Instances to access train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold, collating the results in a confusion-matrix instance. Each classifier is referenced by the fully qualified name of its class.
(defun number->class-label (n)                                          ; <1>
  "Convert real number into a class keyword."
  (if (> n 0.5) :positive :negative))

(defun evaluate-classifier (name classifier instances k)
  "Evaluate and report results of classifier based on k-fold CV."
  (let ((cm (cm:make-confusion-matrix :labels '(:positive :negative)))) ; <2>
    (dotimes (i k)
      (let ((model (jnew classifier))                                   ; <3>
            (train (jcall "trainCV" instances k i))                     ; <4>
            (test (jcall "testCV" instances k i)))
        (jcall "buildClassifier" model train)
        (dotimes (i (jcall "size" test))                                ; <5>
          (let ((instance (jcall "get" test i)))
            (cm:confusion-matrix-add cm
              (number->class-label (jcall "classValue" instance)) ; predicted class
              (number->class-label (jcall "classifyInstance" model instance))))))) ; observed class
    ; report results
    (format t "Classifier: ~a~&" name)
    (format t " -- Precision ~,3f~&" (cm:precision cm))                 ; <6>
    (format t " -- Recall ~,3f~&" (cm:recall cm))
    (format t " -- Geometric mean ~,3f~&" (cm:geometric-mean cm))))
- Simple function to convert a numeric prediction into a class label.
- Create a confusion-matrix to collate results.
- Construct an instance of the classifier type.
- Use Weka's trainCV and testCV to create train/test splits for each fold of the cross-validation process.
- Run through every test instance in turn, recording its predicted and observed class in the confusion matrix.
- Pull out aggregate results from the confusion matrix.
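For anyone unfamiliar with the reported measures, the sketch below computes precision and recall from raw counts using the usual textbook definitions. I am assuming, not asserting, that the confusion-matrix library follows the same conventions; the function names and the counts are made up for this example, and neither function guards against a zero denominator.

```lisp
(defun precision-from-counts (true-positives false-positives)
  "Of the instances the classifier called positive, the fraction that really are."
  (/ true-positives (+ true-positives false-positives)))

(defun recall-from-counts (true-positives false-negatives)
  "Of the instances that really are positive, the fraction the classifier found."
  (/ true-positives (+ true-positives false-negatives)))

;; Made-up counts: 90 true positives, 10 false positives, 30 false negatives.
(format t "Precision ~,3f~%" (precision-from-counts 90 10))
(format t "Recall ~,3f~%" (recall-from-counts 90 30))
```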
The top-level simply loads in the dataset, through the pre-process function, and then evaluates each classifier in turn. Notice how the classifiers are identified by their full class name.
(let ((data (pre-process "bbc/")))
  (dolist (classifier '(("Decision Tree" "weka.classifiers.trees.J48")
                        ("Naive Bayes" "weka.classifiers.bayes.NaiveBayes")
                        ("Support Vector Machine" "weka.classifiers.functions.SMO")))
    (evaluate-classifier (first classifier) (second classifier) data 10)))

(exit)
On my system, the script runs through in about 30 seconds. The output is:
$ java --add-opens java.base/java.lang=ALL-UNNAMED -cp abcl.jar:weka.jar org.armedbear.lisp.Main --load abcl-weka-text.lisp
Armed Bear Common Lisp 1.9.2
Java 23 Oracle Corporation
OpenJDK 64-Bit Server VM
Low-level initialization completed in 0.388 seconds.
Startup completed in 2.186 seconds.
Classifier: Decision Tree
 -- Precision 0.950
 -- Recall 0.977
 -- Geometric mean 0.963
Classifier: Naive Bayes
 -- Precision 0.982
 -- Recall 0.974
 -- Geometric mean 0.978
Classifier: Support Vector Machine
 -- Precision 0.979
 -- Recall 0.990
 -- Geometric mean 0.985