Geometric Data Analysis with GDAtools

Nicolas Robette

2021-05-31


This tutorial presents the use of the GDAtools package for geometric data analysis. For more detailed information on the statistical procedures themselves, see the books by Henry Rouanet and Brigitte Le Roux:

Le Roux B. and Rouanet H., 2004, Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis, Kluwer Academic Publishers, Dordrecht.

Le Roux B. and Rouanet H., 2010, Multiple Correspondence Analysis, SAGE, Series: Quantitative Applications in the Social Sciences, Volume 163, Thousand Oaks, CA.


Introduction

For this example of Multiple Correspondence Analysis (MCA), we use one of the data sets provided with the package. It describes the tastes and cultural practices of 2000 individuals: music genres listened to (French pop, rap, rock, jazz and classical) and film genres liked (comedy, crime, animation, science fiction, romance, musical). These 11 variables serve as “active” variables in the MCA and are complemented by 3 “supplementary” variables: gender, age and level of education.

library(GDAtools)
data(Taste)
str(Taste)
'data.frame':   2000 obs. of  14 variables:
 $ FrenchPop: Factor w/ 3 levels "No","Yes","NA": 2 1 2 1 2 1 1 1 1 2 ...
 $ Rap      : Factor w/ 3 levels "No","Yes","NA": 1 1 1 1 1 1 1 1 1 1 ...
 $ Rock     : Factor w/ 3 levels "No","Yes","NA": 1 1 2 1 1 2 1 1 2 1 ...
 $ Jazz     : Factor w/ 3 levels "No","Yes","NA": 1 2 1 1 1 1 1 1 1 1 ...
 $ Classical: Factor w/ 3 levels "No","Yes","NA": 1 2 1 2 1 1 1 1 1 1 ...
 $ Comedy   : Factor w/ 3 levels "No","Yes","NA": 1 2 1 1 1 1 2 2 2 2 ...
 $ Crime    : Factor w/ 3 levels "No","Yes","NA": 1 1 1 1 2 1 1 1 1 1 ...
 $ Animation: Factor w/ 3 levels "No","Yes","NA": 1 1 1 1 1 1 1 1 1 1 ...
 $ SciFi    : Factor w/ 3 levels "No","Yes","NA": 2 1 1 1 1 2 1 1 1 1 ...
 $ Love     : Factor w/ 3 levels "No","Yes","NA": 1 1 2 1 1 1 1 1 1 1 ...
 $ Musical  : Factor w/ 3 levels "No","Yes","NA": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gender   : Factor w/ 2 levels "Men","Women": 1 1 2 1 2 2 2 2 1 1 ...
 $ Age      : Factor w/ 3 levels "15-24","25-49",..: 2 3 2 3 2 2 2 2 1 3 ...
 $ Educ     : Factor w/ 4 levels "None","Low","Medium",..: 3 4 3 4 2 1 3 2 2 2 ...


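The active variables all have a “not available” (“NA”) category, which concerns only a few individuals. Note that “NA” is here an ordinary factor level rather than R's missing value, which is why the counts below compare each variable with the string "NA". The so-called “specific” MCA makes it possible to neutralise these categories in the construction of the factorial space, while retaining all the individuals.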

sapply(Taste[,1:11], function(x) sum(x=="NA"))
FrenchPop       Rap      Rock      Jazz Classical    Comedy     Crime Animation 
       10         9        10        15         5         3        15         4 
    SciFi      Love   Musical 
       12         7        11 


We start by identifying the ranks (positions) of the categories we wish to neutralise.

getindexcat(Taste[,1:11])
 [1] "FrenchPop.No"  "FrenchPop.Yes" "FrenchPop.NA"  "Rap.No"       
 [5] "Rap.Yes"       "Rap.NA"        "Rock.No"       "Rock.Yes"     
 [9] "Rock.NA"       "Jazz.No"       "Jazz.Yes"      "Jazz.NA"      
[13] "Classical.No"  "Classical.Yes" "Classical.NA"  "Comedy.No"    
[17] "Comedy.Yes"    "Comedy.NA"     "Crime.No"      "Crime.Yes"    
[21] "Crime.NA"      "Animation.No"  "Animation.Yes" "Animation.NA" 
[25] "SciFi.No"      "SciFi.Yes"     "SciFi.NA"      "Love.No"      
[29] "Love.Yes"      "Love.NA"       "Musical.No"    "Musical.Yes"  
[33] "Musical.NA"   


The vector of these ranks is then passed to the excl argument of the speMCA function.

mca <- speMCA(Taste[,1:11], excl=c(3,6,9,12,15,18,21,24,27,30,33))
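Typing the ranks by hand is error-prone when there are many variables. As a minimal sketch, the positions can also be derived by pattern matching on the category labels. The example below uses base R only and a small made-up data frame (toy, with two variables) standing in for Taste[,1:11]; the names toy, cat_labels and excl are purely illustrative.

```r
# Toy data frame standing in for Taste[,1:11] (hypothetical, two variables only)
toy <- data.frame(
  Rock = factor(c("No", "Yes", "NA", "No"), levels = c("No", "Yes", "NA")),
  Jazz = factor(c("Yes", "No", "No", "NA"), levels = c("No", "Yes", "NA"))
)

# Build "Variable.Category" labels, variable by variable, in level order
cat_labels <- unlist(lapply(names(toy), function(v) {
  paste(v, levels(toy[[v]]), sep = ".")
}))

# Positions of the labels that end in ".NA"
excl <- grep("\\.NA$", cat_labels)
excl
```

With the package loaded, applying the same grep() call to the output of getindexcat(Taste[,1:11]) should give the vector used above, since every label to neutralise in that listing ends in ".NA".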

The clouds

The Benzécri corrected inertia rates give an idea of how much information is represented by each axis.

modif.rate(mca)$modif
        mrate cum.mrate
1 67.30532896  67.30533
2 22.64536000  89.95069
3  7.17043134  97.12112
4  2.26387669  99.38500
5  0.59232858  99.97733
6  0.02267443 100.00000

The first two axes capture most of the information (almost 90% of the corrected inertia). In the following we will therefore concentrate on the plane formed by axes 1 and 2.
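To make the correction more concrete, here is a minimal sketch of the usual Benzécri formula: for Q active variables, only eigenvalues greater than 1/Q are kept, each is transformed into a pseudo-eigenvalue (Q/(Q-1))^2 * (lambda - 1/Q)^2, and the rates are the shares of these pseudo-eigenvalues. The eigenvalues and the value of Q below are made up for illustration; this is not the output of the MCA above, and the exact handling of specific MCA inside modif.rate may differ.

```r
# Hypothetical inputs: Q active variables and made-up MCA eigenvalues
Q <- 4
eig <- c(0.45, 0.35, 0.28, 0.22, 0.15, 0.05)

# Keep only the eigenvalues above the average eigenvalue 1/Q
keep <- eig > 1 / Q

# Benzecri pseudo-eigenvalues for the retained axes
pseudo <- (Q / (Q - 1))^2 * (eig[keep] - 1 / Q)^2

# Modified rates (in %) and their cumulative sum
mrate <- 100 * pseudo / sum(pseudo)
cum_mrate <- cumsum(mrate)

data.frame(mrate = mrate, cum.mrate = cum_mrate)
```

Because axes with eigenvalues at or below 1/Q are dropped, the table of modified rates has fewer rows than there are axes in the full solution, and the cumulative rate reaches 100 on the last retained axis.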

The cloud of individuals

The cloud of individuals does not have a particular shape (triangle, horseshoe…): the points seem to be spread over the whole plane.

ggcloud_indiv(mca)