DESCRIBER

ADSL.XPT DATA with 254 OBSERVATIONS and 18 VARIABLES

DESCRIBEstatic

About Hmisc::describe

Every time a dataset is created, whether for data management or for statistical analysis, it is imperative that each variable be reviewed. The evaluation should not only provide summary statistics and graphical displays to detect data errors but also present the results in a thorough but succinct manner. To accomplish this goal, descriptive summaries should be created for each variable according to its characteristics.

The best available option for generating descriptive dataset summaries is found in the Hmisc (Harrell Miscellaneous) package for the R statistical programming environment. Its describe function determines whether each variable is character, factor, category, binary, discrete numeric, or continuous numeric and prints a concise statistical summary suited to that type.

Of note:

  • For a binary variable, the sum (number of 1’s) and mean (proportion of 1’s) are printed

  • For any variable with at least 20 unique values, the 5 lowest and highest values are printed

  • A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed.

  • A frequency table is printed for any non-binary variable if it has no more than 20 unique values
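The thresholds in this list can be sketched in base R. This is only an illustration of the rules as stated here, not Hmisc's actual implementation; the helper name summary_plan() and the fields it returns are hypothetical, and "binary" is assumed to mean a numeric variable coded only as 0 and 1.

```r
# Hypothetical helper (not Hmisc code): reports which summaries the rules
# listed above would select for a single variable.
summary_plan <- function(x) {
  values   <- x[!is.na(x)]
  n_unique <- length(unique(values))
  # Assumption: "binary" means a numeric variable coded only as 0 and 1.
  is_binary <- is.numeric(values) && n_unique == 2 && all(values %in% c(0, 1))
  list(
    n_unique        = n_unique,
    binary          = is_binary,
    sum_and_mean    = is_binary,                           # count and proportion of 1's
    discrete        = is.numeric(values) && n_unique <= 10,
    quantiles       = is.numeric(values) && n_unique > 10, # skipped when discrete
    frequency_table = !is_binary && n_unique <= 20,
    extremes        = n_unique >= 20                       # 5 lowest and 5 highest
  )
}

# A 0/1 flag, a variable with many distinct values, and a short character variable
str(summary_plan(c(0, 1, 1, 0, NA)))
str(summary_plan(1:254))
str(summary_plan(c("F", "M", "F")))
```

Note that a variable with exactly 20 unique values satisfies both the frequency-table rule and the extremes rule as they are worded above.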

DETAILS

OVERVIEW

About the describer package

“If I take it up I must understand every detail,” said he. “Take time to consider. The smallest point may be the most essential.”
(Sherlock Holmes, The Adventure of the Red Circle)

For a couple of decades, we have been loyal users of the Hmisc package in general, and of the Hmisc::describe function in particular, as a way to explore data before any analysis. As is often the case in the R ecosystem, there are numerous ways to accomplish this task (two blog posts on summarizing data give a dated yet extensive review). Our appreciation for Hmisc::describe originated from its concise look (in the pre-rmarkdown days of Sweave/LaTeX/PDF) and its ability to work with SAS-formatted datasets (containing labels, formats, and special missing values). Indeed, in the clinical research industry, SAS-formatted datasets (SAS transport .xpt or native .sas7bdat files) remain widely used while the R language continues to grow in popularity. From our perspective, Dr. Frank Harrell, who developed the Hmisc package, has been a luminary in laying out the possibilities of the R language, particularly in the clinical research environment.

For some time now, we have wanted to reengineer the describe function to provide a modern, interactive alternative to the static (HTML and/or PDF) report. The datadigest package was an effort to build an interactive data explorer inspired by Hmisc::describe; the package leveraged JavaScript for interactivity, with htmlwidgets and Shiny interfaces for use in R. Since the release of datadigest, the R community has continued to deliver increasingly powerful frameworks for interactive displays. We therefore took the 2021 RStudio Table Contest as an opportunity to build an interactive interface for describe using tools available in R. We used reactable, with embedded interactive plotly figures, inside a flexdashboard to generate concise summaries of every variable in a dataset with minimal user configuration. So that other users can readily deploy such a summary table, we wrapped our work into the {describer} package.

For this challenge, we selected a CDISC (Clinical Data Interchange Standards Consortium) ADaM (Analysis Data Model) ADSL (Subject-Level Analysis Dataset) dataset as an illustration. The ADSL structure is one record per subject, with variables such as subject-level population flags, planned and actual treatment variables, demographic information, randomization factors, subgrouping variables, and important dates. The dataset originated from the PHUSE CDISC Pilot replication study.

AUTHORS


INSTRUCTIONS

The {describer} package provides an interface for the interactive table.

{describer} consists of two main functions:

  • describe_data(): creates a comprehensive tibble of variable metadata using Hmisc::describe as the engine

  • describer(): creates an interactive table using Hmisc::describe + reactable.


Usage:

Install the package from GitHub:

devtools::install_github("agstn/describer")
library(describer)

Create a tibble summary of the dataset using Hmisc::describe as the engine. This will be passed into the describer function next:

dat_descr <- describe_data(data)

Display results using describer(), which creates a reactable display with columns for variable number (NO), type of variable (TYPE), variable name and label (NAME - LABEL), number observed (OBSERVED), number and percent missing (MISSING), number of unique values (DISTINCT), and an interactive display (INTERACTIVE FIGURE).

For each variable, additional dropdown details are available based on variable type (character, numeric, date); they are viewable by selecting the expander icon in that variable's row.

describer(dat_descr)
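To make the OBSERVED, MISSING, and DISTINCT columns concrete, the counts behind them can be sketched in base R. This is not how describe_data() is implemented; the helper variable_counts(), its output column names, and the toy values are all assumptions made for illustration.

```r
# Toy data with made-up values; column names mimic typical ADSL variables.
toy <- data.frame(
  AGE   = c(63, 71, NA, 58),
  SEX   = c("F", "M", "F", NA),
  SAFFL = c("Y", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of {describer}: one row of counts per variable.
variable_counts <- function(data) {
  n_missing  <- vapply(data, function(x) sum(is.na(x)), integer(1),
                       USE.NAMES = FALSE)
  n_distinct <- vapply(data, function(x) length(unique(x[!is.na(x)])), integer(1),
                       USE.NAMES = FALSE)
  data.frame(
    NO          = seq_along(data),
    NAME        = names(data),
    OBSERVED    = nrow(data) - n_missing,
    MISSING     = n_missing,
    MISSING_PCT = round(100 * n_missing / nrow(data), 1),
    DISTINCT    = n_distinct,
    stringsAsFactors = FALSE
  )
}

counts <- variable_counts(toy)
print(counts)
```

Missing values are excluded before counting distinct values, so a variable that is entirely missing would report zero distinct values in this sketch.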

Interactivity:
  • Built-in Interactivity:

    • Search: Search the dataset variables by label

    • Sort: Sort columns of the reactable variables (alphabetically or numerically)

    • Figures: Interactive figures are provided for each dataset variable dependent on variable type. Zoom and hover for more details.

  • Additional Interactivity:

    • Filters: Linked filters can be added with {crosstalk}. A filter can be created for any column of the describe_data() output by adding crosstalk widgets and specifying a ‘SharedData’ object in the describer() function. In this example, the sidebar offers subsetting by variable type and filtering by % missing.
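A minimal sketch of the crosstalk side of such filters follows. It uses a small stand-in data frame rather than real describe_data() output, so the column names TYPE and MISSING_PCT and all values are assumptions. How the shared object is then handed to describer() is not shown here, because the argument name is not given above.

```r
library(crosstalk)

# Stand-in for the describe_data() output; column names and values are
# assumptions for illustration only.
dat_descr <- data.frame(
  NAME        = c("AGE", "SEX", "TRTSDT"),
  TYPE        = c("numeric", "character", "date"),
  MISSING_PCT = c(0, 1.2, 4.7),
  stringsAsFactors = FALSE
)

# Wrap the summary in a SharedData object so widgets and table share one selection.
shared_descr <- SharedData$new(dat_descr)

# Sidebar widgets: subset by variable type and filter by percent missing.
type_filter    <- filter_checkbox("type", "Variable type", shared_descr, ~TYPE)
missing_filter <- filter_slider("missing", "% missing", shared_descr, ~MISSING_PCT)
```

In a flexdashboard, the two widget objects would typically be printed in a sidebar chunk, with the table built from the same SharedData object so that both stay linked.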

Dependencies:

Install the latest development versions of the reactable and reactablefmtr packages from GitHub:

devtools::install_github("glin/reactable")
devtools::install_github("kcuilla/reactablefmtr")

METRICS