data.patient_data
- class pytrial.data.patient_data.TabularPatientBase(df, metadata=None, transform=True)[source]
Base dataset class for tabular patient records. Subclass it when a specific task needs additional properties or methods. It uses rdt (https://docs.sdv.dev/rdt) to transform and reverse-transform the tabular data.
- Parameters
df (pd.DataFrame) – The input patient records in tabular format.
metadata (dict, optional) –
Configuration of the input data. It should contain the following keys:
sdtypes: dict, the data type of each column in the input data. The keys are the column names and the values are the data types. Each data type must be one of 'numerical', 'categorical', 'datetime', 'boolean'.
transformers: dict, the transformer to use for each column. The keys are the column names and the values are the transformers. Each transformer can be any of those listed at https://docs.sdv.dev/rdt/transformers-glossary/browse-transformers. A transformer can also be given by its string name, e.g., {'column1': 'OneHotEncoder'}.
transform (bool, default=True) –
Whether to transform the raw self.df with the hypertransformer. If set to False, self.df is kept the same as the dataframe that was passed in.
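As an illustration of the metadata layout described above, here is a minimal sketch of such a dict. The column names (age, gender, enrolled_on, smoker) are hypothetical, and the check at the end is only a sanity test written for this example; it is not part of pytrial.

```python
# Allowed sdtypes, as listed in the parameter description above.
ALLOWED_SDTYPES = {"numerical", "categorical", "datetime", "boolean"}

# Hypothetical columns; replace them with the columns of your own DataFrame.
metadata = {
    "sdtypes": {
        "age": "numerical",
        "gender": "categorical",
        "enrolled_on": "datetime",
        "smoker": "boolean",
    },
    # Transformers given by string name, as in {'column1': 'OneHotEncoder'}.
    "transformers": {
        "gender": "OneHotEncoder",
    },
}

# Sanity check for this example only: every declared type must be supported.
unknown = {col: t for col, t in metadata["sdtypes"].items()
           if t not in ALLOWED_SDTYPES}
assert not unknown, f"unsupported sdtypes: {unknown}"
```

Given the constructor signature above, the dict would then be passed as TabularPatientBase(df, metadata=metadata).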
Examples
>>> import pandas as pd
>>> from pytrial.data.patient_data import TabularPatientBase
>>> df = pd.read_csv('tabular_patient.csv', index_col=0)
>>> # setting `transform=True` replaces dataset.df with dataset.df_transformed
>>> dataset = TabularPatientBase(df, transform=True)
>>> # transform the raw dataframe into a numerical table
>>> df_transformed = dataset.transform(df)
>>> # reverse the transform to recover the original df
>>> df_raw = dataset.reverse_transform(df_transformed)
- class pytrial.data.patient_data.SequencePatientBase(data, metadata=None)[source]
Load sequential patient inputs for longitudinal patient records generation.
- Parameters
data (dict) –
A dict containing patient data in sequential and/or tabular form.
data = {
    'x': np.ndarray or pd.DataFrame
        Static patient features in tabular form, typically baseline information.
    'v': list or np.ndarray
        Patient visit sequences in dense or tensor format, depending on what the model requires as input.
        In dense format, it looks like [[c1,c2,c3],[c4,c5],…], with shape [n_patient, NA, NA];
        in tensor format, it looks like [[0,1,1],[1,1,0],…] (multi-hot encoded), with shape [n_patient, max_num_visit, max_num_event].
    'y': np.ndarray or pd.Series
        Target label for each patient in patient-level tasks such as risk detection, with shape [n_patient, n_class];
        or target label for each visit in visit-level prediction, with shape [n_patient, NA, n_class].
}
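To make the two visit formats concrete, the following sketch converts dense visits into the multi-hot layout. It assumes that event codes are already integer indices in [0, max_num_event) and that visits beyond max_num_visit are dropped. The helper name dense_to_multihot is hypothetical and not part of pytrial. Plain nested lists are used to keep the example dependency-free.

```python
def dense_to_multihot(visits, max_num_visit, max_num_event):
    """Convert dense visits [n_patient, NA, NA] into a multi-hot nested list
    of shape [n_patient, max_num_visit, max_num_event]."""
    tensor = []
    for patient in visits:
        # Start from an all-zero block so that missing visits are zero-padded.
        rows = [[0] * max_num_event for _ in range(max_num_visit)]
        # Assumption of this sketch: visits past max_num_visit are truncated.
        for i, visit in enumerate(patient[:max_num_visit]):
            for code in visit:
                rows[i][code] = 1
        tensor.append(rows)
    return tensor


# Two patients: the first has two visits, the second has one.
dense = [
    [[0, 1, 2], [3, 4]],
    [[1, 4]],
]
multihot = dense_to_multihot(dense, max_num_visit=2, max_num_event=5)
print(multihot[0])  # [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
print(multihot[1])  # [[0, 1, 0, 0, 1], [0, 0, 0, 0, 0]]
```

The nested list can be wrapped in an array before being stored under the 'v' key when a model expects tensor input.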
metadata (dict, optional) –
A dict containing the configuration of the input patient data.
metadata = {
    'voc': dict[Voc]
        Vocabularies that map each event index to its event name. In general there are three keys: 'diag', 'med', 'prod', corresponding to diagnosis, medication, and procedure. Each Voc object should have two functions: idx2word and word2idx.
    'visit': dict[str]
        A dict describing the format of the input visit sequences.
        visit: {
            'mode': 'tensor' or 'dense',
            'order': list[str] (required when mode='tensor')
        },
    'label': dict[str]
        A dict describing the format of the input labels.
        label: {
            'mode': 'tensor' or 'dense',
        }
    'max_visit': int
        The maximum number of visits considered when building tensor inputs; ignored when the visit mode is 'dense'.
}
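A minimal sketch of such a metadata dict follows. ToyVoc is a hypothetical stand-in that only mimics the two lookups named above; it is not pytrial's own Voc class, whose interface may differ. The event names, the contents of 'order' (assumed here to list the event types), and the max_visit value are all illustrative.

```python
class ToyVoc:
    """Hypothetical vocabulary with the two lookups described above."""

    def __init__(self, words):
        self._idx2word = dict(enumerate(words))
        self._word2idx = {word: idx for idx, word in self._idx2word.items()}

    def idx2word(self, idx):
        # Map an event index back to its event name.
        return self._idx2word[idx]

    def word2idx(self, word):
        # Map an event name to its event index.
        return self._word2idx[word]


metadata = {
    "voc": {
        "diag": ToyVoc(["I10", "E11"]),   # placeholder diagnosis codes
        "med": ToyVoc(["metformin"]),     # placeholder medication
        "prod": ToyVoc(["ecg"]),          # placeholder procedure
    },
    # Assumption: 'order' lists the event types in the order they are encoded.
    "visit": {"mode": "tensor", "order": ["diag", "med", "prod"]},
    "label": {"mode": "tensor"},
    "max_visit": 20,
}

# 'order' is required when the visit mode is 'tensor'.
if metadata["visit"]["mode"] == "tensor":
    assert "order" in metadata["visit"]
```

Given the constructor signature above, the dict would be passed as SequencePatientBase(data, metadata=metadata).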