Rainette

The package website is at https://juba.github.io/rainette/.

Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).

Features

Installation

The package can be installed from CRAN:

install.packages("rainette")

The development version can be installed from R-universe:

install.packages("rainette", repos = "https://juba.r-universe.dev")

Usage

Let’s start with an example corpus provided by the excellent quanteda package:

library(quanteda)
library(rainette)
data_corpus_inaugural

First, we’ll use split_segments to split each document into segments of about 40 words (punctuation is taken into account):

corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
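As a rough illustration of what cutting a text into fixed-size segments means, here is a base R sketch. The helper name split_words is hypothetical and not part of rainette, and unlike split_segments it ignores punctuation and simply cuts every segment_size words.

```r
# Hypothetical helper (not part of rainette): cut a text into chunks of
# at most `segment_size` words, ignoring punctuation entirely.
split_words <- function(text, segment_size = 40) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings caused by leading spaces
  # Assign word i to chunk ceiling(i / segment_size)
  chunk <- ceiling(seq_along(words) / segment_size)
  unname(vapply(split(words, chunk), paste, character(1), collapse = " "))
}

# A made-up text of 100 "words"
text <- paste(paste0("w", 1:100), collapse = " ")
segments <- split_words(text, segment_size = 40)
length(segments)
```

The last segment is shorter than the others, which is one reason the clustering step can merge short segments with their neighbours.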

Next, we’ll apply some preprocessing and compute a document-term matrix with quanteda functions:

tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)

We can then apply a simple clustering on this matrix with the rainette function. We specify the number of clusters (k), and the minimum number of forms in each segment (min_segment_size). Segments which do not include enough forms will be merged with the following or previous one when possible.

res <- rainette(dtm, k = 6, min_segment_size = 15)

We can use the rainette_explor shiny interface to visualise and explore the different clusterings at each k:

rainette_explor(res, dtm, corpus)

The Cluster documents tab lets you browse and filter the documents in each cluster.

We can then use the R code generated by the interface to reproduce the displayed clustering plot:

rainette_plot(res, dtm, k = 5, type = "bar", n_terms = 20, free_scales = FALSE,
    measure = "chi2", show_negative = TRUE, text_size = 10)
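With measure = "chi2", terms are ranked by how strongly they are associated with each cluster. The following base R sketch shows the idea for a single term and a single cluster, using made-up counts. It is a toy illustration, not the code rainette runs internally.

```r
# Made-up counts of segments for one term and one cluster.
#                 in cluster   other clusters
# term present        30             10
# term absent         20            140
counts <- matrix(c(30, 10,
                   20, 140),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(term = c("present", "absent"),
                                 cluster = c("in", "out")))

assoc <- chisq.test(counts, correct = FALSE)  # no continuity correction
chi2 <- unname(assoc$statistic)

# The association is positive when the term appears in the cluster more
# often than expected under independence, negative otherwise.
observed <- counts["present", "in"]
expected <- assoc$expected["present", "in"]
direction <- if (observed > expected) "positive" else "negative"

chi2 > 3.84  # 3.84 is the 5% critical value for one degree of freedom
```

A large statistic with a positive direction marks a term that is characteristic of the cluster. A large statistic with a negative direction marks a term that is notably absent from it.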

Alternatively, we can cut the tree at a chosen k and add a group membership variable to the corpus metadata:

docvars(corpus)$cluster <- cutree(res, k = 5)
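To see what cutting a tree at a given k returns, here is a minimal base R sketch that uses stats::hclust on random toy data as a stand-in for a rainette result. The object names (toy, tree, toy_meta) are made up for the example. rainette result objects have their own structure, so this only illustrates the general shape of the output: one group number per item.

```r
set.seed(42)

# Ten toy "segments" described by four random numeric features
toy <- matrix(rnorm(40), nrow = 10,
              dimnames = list(paste0("seg", 1:10), NULL))

tree <- hclust(dist(toy))      # hierarchical clustering on Euclidean distances
groups <- cutree(tree, k = 3)  # one integer group label per segment

# Store the membership next to the items, as done with docvars() above
toy_meta <- data.frame(segment = names(groups), cluster = groups)
table(toy_meta$cluster)
```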

In addition, we can perform a double clustering, i.e. two simple clusterings computed with different min_segment_size values which are then “crossed” to generate more robust clusters. To do this, we use rainette2 on two rainette results:

res1 <- rainette(dtm, k = 5, min_segment_size = 10, min_split_members = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15, min_split_members = 10)
res <- rainette2(res1, res2, max_k = 5, min_members = 10)
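The idea of crossing two clusterings can be sketched in base R with a simple cross-tabulation. The membership vectors and the threshold below are made up, and this is only a simplified view: it counts shared segments and applies a minimum size, whereas rainette2 does more work to select the final partition.

```r
# Made-up cluster memberships of 12 segments in two clusterings
g1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
g2 <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 1, 3)

# Each row counts the segments that share one cluster of each clustering
crossed <- as.data.frame(table(first = g1, second = g2))

# Keep only the crossed clusters with enough members (toy threshold)
min_members <- 3
kept <- crossed[crossed$Freq >= min_members, ]
kept
```

Segments that fall in a small crossed cell are those on which the two clusterings disagree, so they are left out of the retained crossed clusters in this sketch.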

We can then use rainette2_explor to explore and visualise the results.

rainette2_explor(res, dtm, corpus)

Tell me more

Three vignettes are available:

Credits

This classification method was created by Max Reinert, and is described in several articles. Here are two references:

Thanks to Pierre Ratinaud, the author of Iramuteq, for releasing it as free and open source software. Even though the R code has been almost entirely rewritten, it has been an invaluable resource for understanding the algorithms.

Many thanks to Sébastien Rochette for the creation of the hex logo.

Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.