Clustering Program

The programme is in BASIC and will run on the QB64 compiler. It is available free on the internet and good advice sites exist for the language.

The program will run on categorised data or on presence / absence type data. It is not recommended that the two be mixed.

The data is set up in an array A(I,J) where conventionally I identifies the character and J the item (or entity).

If we were to cluster say archaeological sites on the numbers of different sorts of artefacts in them and say site number 3 has yielded 20 fishhooks of type number 12 then A(12,3) = 20. This is called categorised data. Zero scores are fine. If the number is not known then it is better to delete either the site or the artefact type.

The calculations tend to get dominated by the predominant types of artefact so if you have a situation where all the sites have a large number of a few artefacts and only ones and twos of others you might be better to try a presence absence system as well.

If we wish to use presence / absence data, then again for type 12 fishhooks in site 3 we would score A(12,3) = 1 ( scoring present as 1). On the other hand if site 4 lacked this artefact we would score it as A(12,4) = 0 (scoring absent as zero). This is less sensitive to missing data (scoring "missing" as zero).

The program reads data from a file called items.txt which must be in the same directory folder as the basic programme.

The program clusters items reporting after each clustering iteration H(CHAR) H(ITEMS) and J (respectively for Hr Hc and J in the paper). At each iteration the pair of items which maximises the value of J is selected. After clustering the clustered item is represented by the sum of the data in the clustered entities and stored in the location of the lowest numbered of the entities.. Thus if items 3 and 7 are clustered then A(I,3) is replaced by A(I.3) + A(I,7) and the cluster hereafter is referred to as item 3.

It also reports which entities have been joined at each iteration the value of DEL-J (Delta J in the paper) and the value of the relative information R.

Note that if it is wished to cluster characters using their occurrence in sites as data then this is also possible but the data entry must be reversed so that J identifies the character and I the item.

Test Data test data file download items.txt (NB save in same directory as the qbasic programme glclus.bas)

You need to create a text file called results.dat in the same directory as the other files.

NB the program has locations specified for items.txt and results.txt - you will need to change the locations in the program to the directory location you have chosen.

Clicking on glclus.bas in the directory should run the programme using the test data.

"clustering programme using informaion methods"
"engines / locomotives data"
"Number 0f Items",8
"Number 0f characters",4
1,"usa1",1,1,1,0,0,0,0,0,0
2,"usa2",1,1,1,1,0,0,0,0,0
3,"usa3",1,1,1,1,0,0,0,0,0
4,"usa4",1,1,1,1,0,0,0,0,0
5,"uka",1,0,0,0,0,0,0,0,0
6,"ukb",0,0,1,0,0,0,0,0,0
7,"ukc",1,1,0,0,0,0,0,0,0
8,"ukd",0,0,1,0,0,0,0,0,0
" H(CHAR)   H(ITEMS)    J           Joined   DEL-J    R"
1.353525,1.929849,.2876422
"                                                       ",6,8,0,8.949499E-02
1.353525,1.860535,.2876422
"                                                       ",3,4,0,9.794407E-02
1.353525,1.583276,.2876422
"                                                       ",2,3,0,.1125848
1.353525,1.201367,.2876421
"                                                       ",5,7,-2.616241E-02,.1063178
1.353525,1.10589,.2614797
"                                                       ",1,2,-3.790072E-02,.1072778
1.353525,.7305882,.223579
"                                                       ",1,5,-8.945024E-02,7.990475E-02
1.353525,.3250831,.1341288
"Only two groups left clustering stops"
"n = ",20

The programme prints out the data matirix before the results. It prints out 9 characters irrespective of the actual number.

The clusterings can be plotted as a hierarchy using Delta J to scale them ( and using -J to scale the last join) - as below.

The value of H(CHAR) never changes because the characters are never clustered together. H(ITEMS) steadily reduces as the items (Engines in this case) are clustered together.

DEL-J for the first three joinings is zero because they (Engines 2 to 4) have identical scores. Hence as they are joined J does not reduce.

The American engines are a clear group - the British ones less so, but clearly distiguished from the American ones.

The value of R peaks after clusters 2 and 3 join when there are 5 clusters left,suggesting this is the level of clustering which most efficiently characterises the data set, but three clusters is almost as good.

Remember to save it with a distincive name after you have used it in case you overwrite it next time you enter a new data set.

<data set name string>
<integer No of items>,<integer No of characters>
<name string for first item>
<integer, code for character No 1, item 1>
<integer, code for character No 2, item 1>
<integer, code for character No 3, item 1>
.

<integer, code for character No (last), item 1>
<name string for second item>
<integer, code for character No 1, item 2>
<integer, code for character No 2, item 2>
<integer, code for character No 3, item 2>
.

<integer, code for character No (last), item 2>

<integer, code for character No (last), item (last)>

	Engine
Character	usa1	usa2	usa3	usa4	uka	ukb	ukc	ukd
Outside cylinders	1	1	1	1	1	0	1	0
Windows in cab side	1	1	1	1	0	0	1	0
Light over smoke box	1	1	1	1	0	1	0	1
Cow Catcher	0	1	1	1	0	0	0	0