Clustering Program

Using Information Theory Methods

Garry Law

program download  |   test data  |    running the program with test data | running with your data


This program performs clustering of entities on the basis that has been set out in the paper "Clustering of Archaeological Entities - An Information Theory Based Approach".

The programme is in BASIC and will run on the QB64 compiler. It is aailable free on the internet and good advice sites exist for the language.

The program will run on categorised data or on presence / absence type data. It is not recommended that the two be mixed.

The data is set up in an array A(I,J) where conventionally I identifies the character and J the item (or entity).

If we were to cluster say archaeological sites on the numbers of different sorts of artefacts in them and say site number 3 has yielded 20 fishhooks of type number 12 then A(12,3) = 20. This is called categorised data. Zero scores are fine. If the number is not known then it is better  to delete either the site or the artefact type.

The calculations tend to get dominated by the predominant types of artefact so if you have a situation where all the sites have a large number of a few artefacts and only ones and twos of others you might be better to try a presence absence system as well.

If we wish to use presence  / absence data, then again for type 12 fishhooks in site 3 we would score A(12,3) = 1 ( scoring present as 1). On the other hand if site 4 lacked this artefact we would score it as A(12,4) = 0 (scoring absent as zero). This is less sensitive to missing data (scoring "missing" as zero).

The program reads data from a file called items.txt which must be in the same directory folder as the basic programme.

It runs iteratively requesting the user to press enter at each iteration.

The program clusters items reporting after each clustering iteration H(CHAR)  H(ITEMS) and J  (respectively for Hr Hc and J in the paper). At each iteration the pair of items which maximises the value of J is selected. After clustering the clustered item is represented by the sum of the data in the clustered entities and stored in the location of the lowest numbered of the entities.. Thus if items 3 and 7 are clustered then A(I,3) is replaced by A(I.3) + A(I,7) and the cluster hereafter is referred to as item 3.

It also reports which entities have been joined at each iteration the value of DEL-J (Delta J in the paper) and the value of the relative information R.

Note that if it is wished to cluster characters using their occurrence in sites as data then this is also possible but the data entry must be reversed so that J identifies the character and I the item.

Some test data and results on same is given below.

This is free for use - but is not warranted in any way.


 QBASIC Program:   download glclus.bas 


Test Data   test data  file  download items.txt   (NB save in same directory as the qbasic programme glclus.bas)

0 = absent  1 =  present

Railway Engines (locomotives) - a few at random ages AD 1851 to 1895

  Engine
Character usa1 usa2 usa3 usa4 uka ukb ukc ukd
Outside cylinders 1 1 1 1 1 0 1 0
Windows in cab side 1 1 1 1 0 0 1 0
Light over smoke box 1 1 1 1 0 1 0 1
Cow Catcher 0 1 1 1 0 0 0 0

Engines 1 to 4 are from the USA, 5 to 8 from the UK.


Running the Programme with the Test Data

You need to create a text file called results.dat in the same directory as the other files.

NB the program has locations specified for  items.txt and results.txt - you will need to change the locations in the program  to the directory location you have chosen.

Clicking on glclus.bas in the directory should run the programme using the test data.

It will prompt you to hit return at each stage of the clustering.

After it has run open results.dat in a text reader.

It will look like this:

"clustering programme using informaion methods"
"engines / locomotives data"
"Number 0f Items",8
"Number 0f characters",4
1,"usa1",1,1,1,0,0,0,0,0,0
2,"usa2",1,1,1,1,0,0,0,0,0
3,"usa3",1,1,1,1,0,0,0,0,0
4,"usa4",1,1,1,1,0,0,0,0,0
5,"uka",1,0,0,0,0,0,0,0,0
6,"ukb",0,0,1,0,0,0,0,0,0
7,"ukc",1,1,0,0,0,0,0,0,0
8,"ukd",0,0,1,0,0,0,0,0,0
" H(CHAR)   H(ITEMS)    J           Joined   DEL-J    R"
1.353525,1.929849,.2876422
"                                                       ",6,8,0,8.949499E-02
1.353525,1.860535,.2876422
"                                                       ",3,4,0,9.794407E-02
1.353525,1.583276,.2876422
"                                                       ",2,3,0,.1125848
1.353525,1.201367,.2876421
"                                                       ",5,7,-2.616241E-02,.1063178
1.353525,1.10589,.2614797
"                                                       ",1,2,-3.790072E-02,.1072778
1.353525,.7305882,.223579
"                                                       ",1,5,-8.945024E-02,7.990475E-02
1.353525,.3250831,.1341288
"Only two groups left clustering stops"
"n = ",20

The programme prints out the data matirix before the results. It prints out 9 characters irrespective of the actual number.

The clusterings can be plotted as a hierarchy using Delta J to scale them ( and using  -J to scale the last join) - as below.

The value of H(CHAR) never changes because the characters are never clustered together. H(ITEMS) steadily reduces as the items (Engines in this case) are clustered together.

DEL-J for the first three joinings is zero because they (Engines 2 to 4) have identical scores. Hence as they are joined J does not reduce.

wpe7.gif (6988 bytes)

The American engines are a clear group - the British ones less so, but clearly distiguished from the American ones.

The value of R peaks after clusters 2 and 3 join when there are 5 clusters left,suggesting this is the level of clustering which most efficiently characterises the data set, but three clusters is almost as good.

wpe5.gif (4383 bytes)


Running the Programme with Your Data

You need to replace the test data in items.txt  with your own.

Remember to save it with a distincive name after you have used it in case you overwrite it next time you enter a new data set.

items.txt looks like this:

engines / locomotives data
8,4
usa1
1
1
1
0
usa2
1
1
1
1
usa3
1
1
1
1
usa4
1
1
1
1
uka
1
0
0
0
ukb
0
0
1
0
ukc
1
1
0
0
ukd
0
0
1
0

The specification for your data is:

<data set name string>
<integer No of items>,<integer No of characters>
<name string for first item>
<integer, code for character No 1, item 1>
<integer, code for character No 2, item 1>
<integer, code for character No 3, item 1>
.

.

<integer, code for character No (last), item 1>
<name string for second item>
<integer, code for character No 1, item 2>
<integer, code for character No 2, item 2>
<integer, code for character No 3, item 2>
.

.

<integer, code for character No (last), item 2>

<name string for third item>

.

.

<name string for last item>
<integer, code for character No 1, item (last)>

.

.

<integer, code for character No (last), item (last)>

Have fun!                                       

Update: 01 September, 2014