class TabularPatientBase(df, metadata=None, transform=True)

Base dataset class for tabular patient records. Subclass it if a specific task requires additional properties or functions. It uses rdt for the transform and reverse transform of the tabular data.

  • df (pd.DataFrame) – The input patient records in tabular format.

  • metadata

    Contains the metadata settings of the input data. It should contain the following keys:

    1. sdtypes: dict, the data type of each column in the input data. The keys are the column names and the values are the data types. Each data type must be one of the following: ‘numerical’, ‘categorical’, ‘datetime’, ‘boolean’.

    2. transformers: dict, the transformer to be used for each column. The keys are the column names and the values are the transformers. A transformer can also be given by its string name, e.g., {‘column1’: ‘OneHotEncoder’}.

  • transform (bool, default=True) – Whether to transform the raw self.df with the hypertransformer. If set to False, self.df is kept the same as the dataframe passed in.
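As a minimal sketch of the metadata layout described above, the snippet below builds a dict with the two documented keys and checks that every declared type is one of the four listed values. The column names and the helper `check_sdtypes` are hypothetical and used only for illustration; they are not part of the library.

```python
# Allowed values for `sdtypes`, as listed in the parameter description.
ALLOWED_SDTYPES = {"numerical", "categorical", "datetime", "boolean"}

# Hypothetical column names; replace them with the columns of your own dataframe.
metadata = {
    "sdtypes": {
        "age": "numerical",
        "gender": "categorical",
        "admission_date": "datetime",
        "smoker": "boolean",
    },
    # A transformer may be given by its string name, as in the description above.
    "transformers": {
        "gender": "OneHotEncoder",
    },
}


def check_sdtypes(meta):
    """Hypothetical helper: return the columns whose declared type is not allowed."""
    sdtypes = meta.get("sdtypes", {})
    return [col for col, sdtype in sdtypes.items() if sdtype not in ALLOWED_SDTYPES]


invalid_columns = check_sdtypes(metadata)
print(invalid_columns)
```

Checking the declared types before constructing the dataset is a design choice of this sketch; it surfaces a misspelled type name early instead of during the transform.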


>>> from import TabularPatientBase
>>> import pandas as pd
>>> df = pd.read_csv('tabular_patient.csv', index_col=0)
>>> # setting `transform=True` replaces dataset.df with dataset.df_transformed
>>> dataset = TabularPatientBase(df, transform=True)
>>> # transform raw dataframe to numerical tables
>>> df_transformed = dataset.transform(df)
>>> # reverse the transform to recover the original df
>>> df_raw = dataset.reverse_transform(df_transformed)

reverse_transform(df=None)

Reverse-transform the input dataframe back to the original format. If df=None, return self.df in the original format.


df (pd.DataFrame) – The dataframe to be transformed back to the original format.


transform(df=None)

Transform the input df, or self.df, with the hypertransformer. If transform=True was passed to __init__, self.df has already been transformed, so there is no need to call this function on it.


df (pd.DataFrame) – The dataframe to be transformed.
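To illustrate what a transform and reverse-transform round trip means, here is a minimal pandas-only sketch. It is not the library's implementation and does not use rdt: it encodes one categorical column as integer codes and then maps the codes back. The column names and the helpers `to_numeric` and `from_numeric` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 47], "gender": ["F", "M", "F"]})


def to_numeric(frame, column):
    """Replace a categorical column with integer codes; also return the category labels."""
    out = frame.copy()
    cat = pd.Categorical(out[column])
    out[column] = cat.codes
    return out, list(cat.categories)


def from_numeric(frame, column, categories):
    """Map integer codes back to the original category labels."""
    out = frame.copy()
    out[column] = [categories[code] for code in out[column]]
    return out


# Forward pass: every column is now numeric.
df_transformed, gender_categories = to_numeric(df, "gender")

# Reverse pass: the original labels are recovered from the stored categories.
df_raw = from_numeric(df_transformed, "gender", gender_categories)
```

The key point is that the forward step must keep whatever state is needed to invert it (here, the category list), which is the role the fitted transformer plays in the class above.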

class, metadata=None)

Load sequential patient inputs for longitudinal patient records generation.

  • data (dict) –

    A dict containing patient data in sequential and/or tabular form.

    data = {

    'x': np.ndarray or pd.DataFrame

    Static patient features in tabular form, typically baseline information.

    'v': list or np.ndarray

    Patient visit sequence in dense format or in tensor format (depending on the model's input requirement).

    • In dense format, it looks like [[c1,c2,c3],[c4,c5],…], with shape [n_patient, NA, NA];

    • In tensor format, it looks like [[0,1,1],[1,1,0],…] (multi-hot encoded), with shape [n_patient, max_num_visit, max_num_event].

    'y': np.ndarray or pd.Series

    • Target label for each patient when making risk detection, with shape [n_patient, n_class];

    • Target label for each visit when making a visit-level prediction, with shape [n_patient, NA, n_class].

    }


  • metadata (dict, optional) –

    A dict containing the configuration of the input patient data.

    metadata = {

    'voc': dict[Voc]

    Vocabularies that map each event index to the exact event name. There are generally three keys: 'diag', 'med', 'prod', corresponding to diagnosis, medication, and procedure. Each Voc object should have two functions: idx2word and word2idx.

    'visit': dict[str]

    A dict specifying the format of the input data for processing input visit sequences.

    visit: {

    'mode': 'tensor' or 'dense',

    'order': list[str] (required when mode='tensor')

    }

    'label': dict[str]

    A dict specifying the format of the input data for processing input labels.

    label: {

    'mode': 'tensor' or 'dense',

    }

    'max_visit': int

    The maximum number of visits considered when building tensor inputs; ignored when the visit mode is dense.

    }
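A minimal sketch of how the two visit formats described above relate, assuming a single event type with three hypothetical event codes. The helper `dense_to_tensor` and the feature values are illustrative only and are not part of the library; the helper converts a dense visit list into a multi-hot array of shape [n_patient, max_num_visit, max_num_event].

```python
import numpy as np
import pandas as pd


def dense_to_tensor(visits, max_visit, n_event):
    """Hypothetical helper: multi-hot encode dense visits, truncating or zero-padding to max_visit."""
    tensor = np.zeros((len(visits), max_visit, n_event), dtype=np.int64)
    for p, patient in enumerate(visits):
        for t, events in enumerate(patient[:max_visit]):
            tensor[p, t, events] = 1  # set every event index of this visit to 1
    return tensor


# Dense format: one list per patient, one inner list of event indices per visit.
v_dense = [
    [[0, 1], [2]],       # patient 0: two visits
    [[1], [0, 2], [1]],  # patient 1: three visits
]

data = {
    "x": pd.DataFrame({"age": [63, 48]}),  # static baseline features
    "v": v_dense,
    "y": np.array([[1, 0], [0, 1]]),       # patient-level labels, shape [n_patient, n_class]
}

# Keys follow the metadata description; the 'voc' entry is omitted in this sketch.
metadata = {
    "visit": {"mode": "dense"},
    "label": {"mode": "tensor"},
    "max_visit": 2,
}

v_tensor = dense_to_tensor(data["v"], metadata["max_visit"], n_event=3)
print(v_tensor.shape)  # (2, 2, 3)
```

In this sketch the third visit of the second patient is dropped because `max_visit` is 2, which mirrors the role `max_visit` plays when tensor inputs are built.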



Supports collation of batches of unequal-sized lists for densely stored sequential visit data.


config (dict) –


'visit_mode': 'dense' or 'tensor',

'label_mode': 'dense' or 'tensor',
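The collation idea can be sketched as follows. This is a simplified, hypothetical `collate_dense` function rather than the library's collator: it gathers a batch of samples whose visit lists differ in length and, depending on `visit_mode`, either keeps them as nested lists or zero-pads them into a multi-hot array. The sample layout (a dict with 'v' and 'y') and the `n_event` argument are assumptions, and `label_mode` is accepted but not used in this sketch.

```python
import numpy as np


def collate_dense(batch, config, n_event):
    """Hypothetical collate function for samples shaped like {'v': visits, 'y': label}."""
    visits = [sample["v"] for sample in batch]
    labels = np.array([sample["y"] for sample in batch])
    if config["visit_mode"] == "tensor":
        # Pad every patient to the longest visit sequence in the batch.
        max_visit = max(len(v) for v in visits)
        padded = np.zeros((len(visits), max_visit, n_event), dtype=np.int64)
        for p, patient in enumerate(visits):
            for t, events in enumerate(patient):
                padded[p, t, events] = 1
        visits = padded
    return {"v": visits, "y": labels}


batch = [
    {"v": [[0, 2]], "y": 1},            # one visit
    {"v": [[1], [0, 1], [2]], "y": 0},  # three visits
]

# Tensor mode: unequal visit lists become one zero-padded array.
out = collate_dense(batch, {"visit_mode": "tensor", "label_mode": "tensor"}, n_event=3)

# Dense mode: visit lists are passed through unchanged.
dense_out = collate_dense(batch, {"visit_mode": "dense", "label_mode": "dense"}, n_event=3)
```

A function of this shape could be passed as the `collate_fn` argument of a PyTorch `DataLoader`, since the default collation cannot stack lists of unequal length.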