Skip to contents

DuckDB is the default backend in TestGenerator, and it is usually the fastest place to start. Remote database testing is useful when you also want to check that your study code works on another SQL engine. This is especially helpful for code that will run in PostgreSQL, SQL Server, or Databricks/Spark in production.

The idea is the same as with local testing: you create a small, controlled test population, load it into a CDM, run your study code, and assert the expected result. The difference is that patientsCDM() uploads the test CDM to a remote database and returns a CDM reference connected to that backend.

Use patientsCDM() with a remote database

First create a Unit Test Definition JSON file with readPatients() or readPatients.csv(). Then call patientsCDM() with the backend you want to test.

library(TestGenerator)

cdm <- patientsCDM(
  pathJson = "tests/testthat/testCases",
  testName = "my_test_population",
  cdmVersion = "5.4",
  dbms = "postgresql"
)

The supported remote values for dbms are:

Backend dbms value
PostgreSQL "postgresql"
SQL Server "sqlserver"
Databricks/Spark "databricks"

For remote databases, patientsCDM() creates the CDM locally first, trims the vocabulary to what the test population needs, uploads the result to the remote database, and returns the remote CDM reference. By default, TestGenerator creates a temporary test schema for the upload. You can also provide writeSchema if you want to choose the schema name yourself.

cdm <- patientsCDM(
  pathJson = "tests/testthat/testCases",
  testName = "my_test_population",
  cdmVersion = "5.4",
  dbms = "sqlserver",
  writeSchema = "testgenerator_my_case"
)

When the test has finished, clean up the remote schema and close the connection:

This is important for shared remote databases. The test schemas are small, but cleaning them up keeps the database tidy and avoids name conflicts in later runs.

Environment variables for direct use

When you call patientsCDM() directly against a remote database, TestGenerator reads the connection details from environment variables.

Backend Required environment variables
PostgreSQL POSTGRESQL_SERVER, POSTGRESQL_DBNAME, POSTGRESQL_PORT, POSTGRESQL_USER, POSTGRESQL_PASSWORD
SQL Server SQLSERVER_SERVER, SQLSERVER_DBNAME, SQLSERVER_PORT, SQLSERVER_USER, SQLSERVER_PASSWORD
Databricks/Spark DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_HTTPPATH

For Databricks, TestGenerator also reads DATABRICKS_USER and DATABRICKS_WORKSPACE when they are set. If they are not set, it uses token as the user and hive_metastore as the workspace/catalog.

For SQL Server, TestGenerator reads SQL_SERVER_DRIVER when it is set. If it is not set, it uses ODBC Driver 18 for SQL Server.

A typical local .Renviron setup could look like this:

POSTGRESQL_SERVER=localhost
POSTGRESQL_DBNAME=cdm
POSTGRESQL_PORT=5432
POSTGRESQL_USER=postgres
POSTGRESQL_PASSWORD=your-password

SQLSERVER_SERVER=localhost
SQLSERVER_DBNAME=cdm
SQLSERVER_PORT=1433
SQLSERVER_USER=sa
SQLSERVER_PASSWORD=your-password

DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-token
DATABRICKS_HTTPPATH=/sql/1.0/warehouses/your-warehouse

Use values that match your own database or Databricks workspace. The database user needs permission to create schemas, create tables, insert data, read data, and drop the test schema during cleanup.

A minimal backend test pattern

A backend-specific test usually follows this shape:

testthat::test_that("study logic works on PostgreSQL", {
  cdm <- NULL
  on.exit({
    if (!is.null(cdm)) {
      TestGenerator::cleanupTestCdm(cdm)
    }
  }, add = TRUE)

  cdm <- TestGenerator::patientsCDM(
    pathJson = "tests/testthat/testCases",
    testName = "my_test_population",
    cdmVersion = "5.4",
    dbms = "postgresql"
  )

  result <- myPackage::runMyStudy(cdm)

  testthat::expect_equal(result$n_subjects, 3)
})

The exact expectations should be specific to your micro population. Good tests usually check counts, dates, cohort entry and exit, exclusions, or any other result that should be predictable from the small input dataset.