# Introduction to microclass

## K-mers

All methods in this package are based on K-mer counting. K-mers are words of length $$K$$, found in the DNA sequence being analysed. There are $$4^K$$ possible words of length $$K$$, and all instances in a sequence are counted. Given the sequence “ATGCCTGAACTGACCTGC” we can, for instance, count 1-mers, 2-mers and 3-mers as follows:

library(microclass)
KmerCount("ATGCCTGAACTGACCTGC", K = 1, col.names = TRUE)
##      A C G T
## [1,] 4 6 4 4
KmerCount("ATGCCTGAACTGACCTGC", K = 2, col.names = TRUE)
##      AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## [1,]  1  2  0  1  0  2  0  3  2  2  0  0  0  0  4  0
KmerCount("ATGCCTGAACTGACCTGC", K = 3, col.names = TRUE)
##      AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC
## [1,]   0   1   0   0   0   1   0   1   0   0   0   0   0   0   1   0   0   0
##      CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT
## [1,]   0   0   0   0   0   2   0   0   0   0   0   0   3   0   1   1   0   0
##      GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC
## [1,]   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##      TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
## [1,]   0   0   2   2   0   0   0   0   0   0

## Classification in one line of code

The quickest way to perform prokaryote classification is through the taxMachine. It is a genus level classification methods based on the multinomial method, trained on the microcontax data set.

library(microcontax)
data(small.16S)
tax.tab <- taxMachine(small.16S$Sequence) ## Welcome to the taxMachine! ## classifying 71 sequences with median length 1338 ... ## creating model... ## caching model... ## classifying sequence 1 to 71 ## done! head(tax.tab) ## Genus D.score R.score P.recognized ## 1 Caldisphaera 7.950911 0.05188227 1.00000000 ## 2 Caldisphaera 8.692077 -0.07626455 0.18882257 ## 3 Caldisphaera 8.653983 0.22779976 1.00000000 ## 4 Caldisphaera 8.549670 0.23457126 1.00000000 ## 5 Caldisphaera 8.580826 0.28430467 1.00000000 ## 6 Caldisphaera 8.062746 -1.21785848 0.05899272 genus <- sapply(strsplit(small.16S$Header,split=" "),function(x){x[2]})
boxplot(tax.tab$R.score, factor(genus), main = "R-score") Along with each R.score value is also a P.recognized value. This is simply the probability that one of the training data sequences get an R.score of the same value, or lower. It is perhaps easier to interpret than the score itself. We typically use this to mark some sequences as unrecognized, and risky to include in further analyses: tax.tab$Is.Recognized <- tax.tab$P.recognized>0.01 tax.tab[35:45,] ## Genus D.score R.score P.recognized Is.Recognized ## 35 Acidianus 5.425759 0.59778317 1.00000000 TRUE ## 36 Acidianus 7.367201 0.14176582 1.00000000 TRUE ## 37 Acidianus 5.464705 0.56362329 1.00000000 TRUE ## 38 Acidianus 7.879450 -0.35603991 0.13953400 TRUE ## 39 Acidianus 7.195179 0.09568198 1.00000000 TRUE ## 40 Acidianus 7.832235 -0.44416463 0.12710836 TRUE ## 41 Acidianus 9.064079 -3.69599537 0.00610501 FALSE ## 42 Acidianus 8.097803 -0.93641982 0.07726423 TRUE ## 43 Acidianus 7.438815 0.19891170 1.00000000 TRUE ## 44 Acidianus 7.438897 0.17761285 1.00000000 TRUE ## 45 Acidianus 7.394431 0.26012677 1.00000000 TRUE In most cases we would accept as recognized any sequence with P.recognized>1e-4, since there is a substantial variation also in ‘proper’ sequences. Sequences with a lot of sequencing errors, chimera or completely new taxa usually get extremely negative R.scores. ## Custom classification If classification is to be carried out on a different taxonomic level than genus, using a different training data set, or using different parameters, the RDP and multinomial methods are available. The following example uses half the small.16 data set to train an RDP model and then tries to classify the other half of the data. rdp <- rdpTrain(small.16S$Sequence[seq(1,71,2)], genus[seq(1,71,2)])  # training step
predicted <- rdpClassify(small.16S\$Sequence[seq(2,71,2)], rdp)        # classification step
cat( "Number of errors:", sum(predicted != genus[seq(2,71,2)]) )
## Number of errors: 0