Using PLPBenchmarks on Real-World Data
In this vignette, we provide an introduction to how to use the PLPBenchmarks package to build and execute benchmarks on real-world data (RWD). We will not show the whole pipeline, as that requires a connection to a database. Working examples with the open-source Eunomia dataset exist in other vignettes (see vignette("BenchmarksWithEunomia") and vignette("AlgorithmComparisons")).
Step 1: Preparing the environment and specifying a benchmark design
To get started we load the PLPBenchmarks package and the
other packages we will need.
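A minimal setup chunk could look like the following; depending on how PLPBenchmarks declares its dependencies, the explicit calls for the other two packages may be unnecessary:

```r
library(PLPBenchmarks)
# If these are not attached automatically, load them explicitly:
library(PatientLevelPrediction)
library(DatabaseConnector)
```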
This loads the PatientLevelPrediction and
DatabaseConnector R packages.
We can see the problems for RWD, their specifications, and other information necessary to build the cohorts.
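For example, assuming the task definitions are exposed as a data frame named tasks (the object name here follows the description below and may differ in the package):

```r
# Inspect the pre-defined prediction problems and their specifications
head(tasks)
```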
We can also load the pre-defined model designs. The pre-specified model designs object is a list containing 13 modelDesign objects, one for each problem in the tasks data frame.
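A quick sanity check, assuming the loaded object is named modelDesigns (the name used with createBenchmarkDesign() later in this vignette):

```r
length(modelDesigns)      # expect 13, one per problem
class(modelDesigns[[1]])  # a modelDesign object
```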
Before moving on, we need to define two more objects.
The first object is the connectionDetails from the DatabaseConnector package that we will use to connect to our database. For example, for a PostgreSQL database, we would define something like:

```r
conDets <- createConnectionDetails(dbms = "postgresql",
                                   user = "user",
                                   password = "password",
                                   server = "server")
```

where "user" and "password" are the user credentials, and "server" is the host name of the server and the relevant schemas.
The second object we need to create is the databaseDetails from PatientLevelPrediction.

```r
dbDets <- createDatabaseDetails(connectionDetails = conDets,
                                cdmDatabaseSchema = "cdmDatabaseSchema",
                                cdmDatabaseName = "cdmDatabaseName",
                                cdmDatabaseId = "cdmDatabaseId",
                                cohortDatabaseSchema = "cohortDatabaseSchema",
                                cohortTable = "cohortTable",
                                outcomeDatabaseSchema = "outcomeDatabaseSchema",
                                outcomeTable = "outcomeTable")
```

The names in quotes should be replaced with the schemas and table names where our target and outcome cohorts are located.
Now we can create a benchmark design.
```r
benchmarkDesign <- createBenchmarkDesign(modelDesign = modelDesigns,
                                         databaseDetails = dbDets,
                                         saveDirectory = "gettingStarted")
```

Note that the saveDirectory argument takes a character string or a file path for the location to save the results.
Some more details:
The function createBenchmarkDesign() takes three arguments: a list of named model designs, a databaseDetails object created with the PatientLevelPrediction::createDatabaseDetails() function, and a name for the directory in which to save the results. The returned object is still a list, but this time a list of class benchmarkDesign, with the additional information provided in each list element (i.e. each model design for each problem). A nice feature of this function is that it identifies the unique sets of covariate and study population settings that need to be created in a benchmark design. This information is stored in the attributes of the returned object.
The advantage of this is that, for model designs that share common target and outcome cohorts along with common covariate and population settings, the plpData and studyPopulation objects will only be created once, saving the time of recreating them each time we run a model design.
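As a quick way to see what was identified, we can list the attributes of the returned object (the exact attribute names are not specified here, so this is just a generic inspection):

```r
# List the attributes attached to the benchmark design; the unique
# covariate and study population settings are stored among these.
names(attributes(benchmarkDesign))
```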
We can now inspect our benchmark design settings in a side-by-side comparison to check that everything is specified correctly.
```r
viewBenchmarkSettings(benchmarkDesign = benchmarkDesign) %>%
  knitr::kable() %>%
  kableExtra::kable_paper(lightable_options = "striped") %>%
  kableExtra::scroll_box(width = "100%", height = "200px")
```

If everything is specified correctly and we are ready to go, we can then proceed.
Step 2: Create Cohorts and extract covariates (and even create the study population)
Next, we create the cohorts in the database.

```r
createBenchmarkCohorts(benchmarkDesign = benchmarkDesign)
```

We are now ready to proceed.
The last thing to do before running the benchmark design is to extract the data (covariates) that are defined in each model design.

```r
extractBenchmarkData(benchmarkDesign = benchmarkDesign, createStudyPopulation = TRUE)
```

Notice that the function is responsible for two things:

- it extracts the covariates that are needed for each model design
- it optionally creates the study population (defaults to TRUE)

This is by convention, to provide flexibility for altering the covariate and population settings if needed.
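For example, if we wanted to adjust the population settings before they are materialized, we could defer that step using the same flag (a sketch based only on the createStudyPopulation argument described above):

```r
# Extract covariates only; skip creating the study populations for now
# so that population settings can still be adjusted.
extractBenchmarkData(benchmarkDesign = benchmarkDesign,
                     createStudyPopulation = FALSE)
```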
Step 3: Run the benchmark design
```r
runBenchmarkDesign(benchmarkDesign = benchmarkDesign)
```

This calls runBenchmarkDesign(), which will run all specified model designs.
Step 4: Inspect and share results
The following will collect all results in one object called
results.
```r
results <- getBenchmarkModelPerformance(benchmarkDesign = benchmarkDesign)
```

The results object is a list with two components: a performanceMetrics data frame and an executionTimes data frame. Optionally, we can also call

```r
viewBenchmarkResults(benchmarkDesign = benchmarkDesign, viewShiny = TRUE)
```

to create and view a Shiny app with the results.
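For example, to take a quick look at the collected results (using only the two components named above):

```r
# Inspect the performance metrics and execution times collected
# across all model designs in the benchmark
head(results$performanceMetrics)
head(results$executionTimes)
```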
For more complete examples, have a look at the other vignettes, which use the Eunomia dataset to simulate experiments.