Basic Patient Data Class
PyTrial offers several basic data classes for organizing patient and trial data.
They define the basic structure of the input data and make subsequent model training and prediction more convenient, so these standard data classes serve as the inputs for many tasks. Here, we show two examples of building patient data.
We categorize the input data into two types: tabular and sequential patient data. Tabular data can be represented as a table in which each row is a patient and each column is a feature. Sequential data gives each patient a sequence of visits, where each visit has multiple features, e.g., events, lab tests, etc.
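To make the two shapes concrete, the sketch below builds toy records with plain Python containers. All values and field names are invented for illustration and are not taken from PyTrial's demo data.

```python
# Tabular patient data: one row per patient, one column per feature.
# All values here are invented for illustration.
tabular_patients = [
    {"age": 63, "race": "asian", "target_label": True},
    {"age": 48, "race": None, "target_label": False},  # raw data may have missing values
]

# Sequential patient data: one list of visits per patient,
# where each visit holds the features recorded at that visit.
sequential_patients = [
    [  # patient 0 has two visits
        {"events": ["E1", "E2"], "lab_tests": ["T1"]},
        {"events": ["E3"], "lab_tests": []},
    ],
    [  # patient 1 has one visit
        {"events": ["E2"], "lab_tests": ["T1", "T2"]},
    ],
]

n_features = len(tabular_patients[0])
visits_per_patient = [len(visits) for visits in sequential_patients]
print(n_features)          # 3
print(visits_per_patient)  # [2, 1]
```

Every tabular row has the same set of columns, whereas the number of visits can differ from patient to patient, which is why the two types get separate data classes.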
Patient Data: Tabular
Colab example is available at Example: data.patient_data.TabularPatientBase.
We have pytrial.data.patient_data.TabularPatientBase for the tabular patient data.
from pytrial.data.patient_data import TabularPatientBase
Suppose we receive patient data as a pandas.DataFrame whose raw features are a mixture of text, numbers, and missing values. We usually need to preprocess the data before passing it to models, and TabularPatientBase provides a convenient way to do this.
Let’s first load the raw demo data for creating a TabularPatientBase instance.
# load the raw demo data
from pytrial.data.demo_data import load_trial_patient_tabular
data = load_trial_patient_tabular()
# parse the raw data
df = data['data']
metadata = data['metadata']
Then, we can pass the raw dataframe to the target data class.
import rdt
# create a TabularPatientBase instance
patient_data = TabularPatientBase(
    df=df,  # the raw dataframe
    metadata={
        'sdtypes': {
            'race': 'categorical',
            'target_label': 'boolean',
        },  # the data type of each column
        'transformers': {
            'race': rdt.transformers.FrequencyEncoder(),
        },  # the transformer for each column
    },
)
A list of available data transformers can be found at https://docs.sdv.dev/rdt/transformers-glossary/browse-transformers.
By default, TabularPatientBase automatically detects the data type of each column and applies the corresponding transformer, e.g., rdt.transformers.FrequencyEncoder for categorical features. In many cases you can therefore leave the metadata empty, like this:
# leave the metadata empty and let the class automatically detect the data types and apply transformers
patient_data = TabularPatientBase(df=df)
However, sometimes you may want to customize the data types and transformers in case the automatically detected ones are wrong. That is why the above example assigns 'race': 'categorical' and 'race': rdt.transformers.FrequencyEncoder(), which makes the data class follow our custom settings.
Note that we can pass only 'sdtypes' for a column without specifying the corresponding transformer, in which case the data class picks the default transformer for the given data type, as with 'target_label': 'boolean' in the above example.
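The override behavior described above can be sketched in plain Python. This is not PyTrial's or rdt's implementation: the helpers detect_sdtype and resolve_metadata, the detection heuristic, and the placeholder transformer names are all assumptions made for illustration.

```python
# Hypothetical mapping from data type to a default transformer name.
# The names are placeholders, not real rdt classes.
DEFAULT_TRANSFORMERS = {
    "categorical": "default_categorical",
    "boolean": "default_boolean",
    "numerical": "default_numerical",
}

def detect_sdtype(values):
    """Guess a data type from raw column values (toy heuristic)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    return "categorical"

def resolve_metadata(columns, user_metadata=None):
    """Use the user's settings where given, otherwise fall back to detection."""
    user_metadata = user_metadata or {}
    user_sdtypes = user_metadata.get("sdtypes", {})
    user_transformers = user_metadata.get("transformers", {})
    resolved = {}
    for name, values in columns.items():
        sdtype = user_sdtypes.get(name, detect_sdtype(values))
        # Without an explicit transformer, take the default for the data type.
        transformer = user_transformers.get(name, DEFAULT_TRANSFORMERS[sdtype])
        resolved[name] = (sdtype, transformer)
    return resolved

columns = {
    "age": [63, 48, None],
    "race": [1, 2, 1],           # integer-coded, so the heuristic guesses "numerical"
    "target_label": [1, 0, 1],   # also looks numerical to the heuristic
}
auto = resolve_metadata(columns)
custom = resolve_metadata(columns, {
    "sdtypes": {"race": "categorical", "target_label": "boolean"},
    "transformers": {"race": "my_frequency_encoder"},
})
print(auto["race"])            # ('numerical', 'default_numerical')
print(custom["race"])          # ('categorical', 'my_frequency_encoder')
print(custom["target_label"])  # ('boolean', 'default_boolean')
```

The integer-coded race column shows the kind of case where automatic detection guesses wrong and an explicit data type is worth passing.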
We can check the transformed tabular data by
# the transformed values
patient_data.df
We can also transform the data back to its original format by
# transform the data back to its original format
df_raw = patient_data.reverse_transform()
Or we can pass another dataframe (here df_prime, assumed to share the columns of df) to the data class to be transformed, like
# pass another dataframe to the dataclass to be transformed
df_prime_transformed = patient_data.transform(df_prime)
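The idea behind these three operations is that a mapping is learned once from the original data and then reused, both for new data and for the reverse direction. The sketch below illustrates this with a toy encoder; ToyCategoryEncoder is a hypothetical class written for this example and does not reproduce how rdt or TabularPatientBase encode columns.

```python
class ToyCategoryEncoder:
    """Maps each category to an integer code and back (illustrative only)."""

    def fit(self, values):
        # Assign codes in order of first appearance.
        self.category_to_code = {}
        for value in values:
            self.category_to_code.setdefault(value, len(self.category_to_code))
        self.code_to_category = {c: v for v, c in self.category_to_code.items()}
        return self

    def transform(self, values):
        return [self.category_to_code[v] for v in values]

    def reverse_transform(self, codes):
        return [self.code_to_category[c] for c in codes]

race = ["asian", "white", "asian", "black"]
encoder = ToyCategoryEncoder().fit(race)

encoded = encoder.transform(race)               # transformed values
restored = encoder.reverse_transform(encoded)   # back to the original format
new_encoded = encoder.transform(["black", "white"])  # new data, same mapping

print(encoded)      # [0, 1, 0, 2]
print(restored)     # ['asian', 'white', 'asian', 'black']
print(new_encoded)  # [2, 1]
```

Reusing the fitted mapping keeps the encoding of new data consistent with the data the models were trained on.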
Patient Data: Sequence
Colab example is available at Example: data.patient_data.SequencePatientBase.
We have pytrial.data.patient_data.SequencePatientBase for the sequential patient data.
from pytrial.data.patient_data import SequencePatientBase
Load the raw demo data to see how to create a SequencePatientBase instance.
from pytrial.data.demo_data import load_synthetic_ehr_sequence
data = load_synthetic_ehr_sequence()
data.keys()
'''
dict_keys(['visit', 'feature', 'order', 'n_num_feature', 'y', 'voc', 'cat_cardinalities'])
'''
# the raw visit data
data['visit'][0]
'''
[
[[0, 1, 2, 3, 5, 7, 41, 313, 1], [0, 1, 82], [2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 51, 19, 26]],
[[0, 1, 10, 69], [1, 4], [0, 2, 3, 6, 7, 41, 12, 13, 14, 16, 52, 54, 22, 28]]
]
'''
# the order
data['order']
'''
['diag', 'prod', 'med']
'''
# the vocabulary
data['voc']
'''
{'diag': <promptehr.data.Voc at 0x7f150c615cd0>,
'prod': <promptehr.data.Voc at 0x7f14af789750>,
'med': <promptehr.data.Voc at 0x7f14af7b7310>}
'''
In the above example, data['visit'] is a list of patients, where each patient is a list of visits and each visit is a list of event lists. That is, data['visit'][0] holds the visits of the first patient, and data['visit'][0][0] is that patient's first visit.
Note that data['visit'][0][0] contains three lists, each holding several events of the same type, for example, diagnosed diseases represented by ICD codes.
data['order'] gives the order of the event types in each visit, which should be the same for all patients. In the above example, data['order'] is ['diag', 'prod', 'med'], which means the first list of data['visit'][0][0] holds the diagnosed diseases, and so on.
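The pairing between data['order'] and the lists inside a visit can be made explicit with a few lines of plain Python. The values below are copied from the demo output above; label_visit is a hypothetical helper for illustration and not part of PyTrial.

```python
order = ['diag', 'prod', 'med']

# The visits of the first patient, copied from the demo output above.
first_patient = [
    [[0, 1, 2, 3, 5, 7, 41, 313, 1], [0, 1, 82],
     [2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 51, 19, 26]],
    [[0, 1, 10, 69], [1, 4],
     [0, 2, 3, 6, 7, 41, 12, 13, 14, 16, 52, 54, 22, 28]],
]

def label_visit(visit, order):
    """Pair each event list in a visit with its event type name."""
    if len(visit) != len(order):
        raise ValueError("each visit needs one event list per entry in order")
    return dict(zip(order, visit))

first_visit = label_visit(first_patient[0], order)
print(first_visit['diag'])  # [0, 1, 2, 3, 5, 7, 41, 313, 1]
print(first_visit['prod'])  # [0, 1, 82]
```

Because the pairing is purely positional, every visit of every patient has to list its event types in the same order.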
data['voc'] contains the vocabulary for each event type. Each voc object should have the same format as pytrial.data.vocab_data.Vocab.
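A vocabulary of this kind is essentially a two-way mapping between raw codes and the integer indices that appear in the visits. The sketch below is a hypothetical stand-in: the class name ToyVocab and its attribute and method names are assumptions, so consult pytrial.data.vocab_data.Vocab for the actual interface.

```python
class ToyVocab:
    """Two-way mapping between raw codes and integer indices (illustrative only)."""

    def __init__(self):
        self.idx2word = {}
        self.word2idx = {}

    def add_sentence(self, sentence):
        # Register every unseen code with the next free index.
        for word in sentence:
            if word not in self.word2idx:
                index = len(self.word2idx)
                self.word2idx[word] = index
                self.idx2word[index] = word

diag_voc = ToyVocab()
diag_voc.add_sentence(["I10", "E11", "I10"])  # example ICD-10 code strings
diag_voc.add_sentence(["J45"])

encoded_visit = [diag_voc.word2idx[code] for code in ["E11", "J45"]]
decoded_visit = [diag_voc.idx2word[index] for index in encoded_visit]
print(encoded_visit)  # [1, 2]
print(decoded_visit)  # ['E11', 'J45']
```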
Once we have the raw data, we can create a SequencePatientBase instance.
# create a ``SequencePatientBase`` instance
seqdata = SequencePatientBase(
    data={'v': data['visit'], 'y': data['y'], 'x': data['feature']},
    metadata={
        'visit': {
            'mode': 'dense',
            'order': data['order'],  # need to pass the ``order`` here
        },
        'label': {'mode': 'tensor'},
        'voc': data['voc'],  # need to pass the ``voc`` here
        'max_visit': 20,
    }
)
The parameter data contains the raw data, including visits, labels, and baseline features; the parameter metadata customizes the output data format.
Then, we can check the transformed data by
from torch.utils.data import DataLoader
from pytrial.data.patient_data import SeqPatientCollator # we need a collation function to process the input SequencePatient dataset
# let's see the outputs
collate_fn = SeqPatientCollator()
loader = DataLoader(seqdata, batch_size=2, collate_fn=collate_fn, num_workers=0)
loader = iter(loader)
batch = next(loader)
print(batch.keys())
'''
dict_keys(['v', 'x', 'y'])
'''
batch['v'].keys()
'''
dict_keys(['diag', 'prod', 'med'])
'''
batch['v']['diag'][0]
'''
[[0, 1, 2, 3, 5, 7, 41, 313, 1], [0, 1, 10, 69]]
'''
The dataloader returns the visits with keys corresponding to data['order'], i.e., ['diag', 'prod', 'med'] in the above example. batch['v']['diag'][0] contains the diagnosis events of the first patient, who has two visits.
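The relationship between the raw nested lists and the batch layout can be reproduced with a small regrouping function. This is only a sketch of the reshaping, not the actual logic of SeqPatientCollator; regroup_by_event_type is a hypothetical helper, and the input is the first patient from the demo output above.

```python
def regroup_by_event_type(patients, order):
    """Turn patient -> visit -> event-type lists into event type -> patient -> visit."""
    batch = {name: [] for name in order}
    for visits in patients:
        for position, name in enumerate(order):
            # Collect this event type's list from every visit of the patient.
            batch[name].append([visit[position] for visit in visits])
    return batch

patients = [
    [  # first patient, two visits (copied from the demo output above)
        [[0, 1, 2, 3, 5, 7, 41, 313, 1], [0, 1, 82],
         [2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 51, 19, 26]],
        [[0, 1, 10, 69], [1, 4],
         [0, 2, 3, 6, 7, 41, 12, 13, 14, 16, 52, 54, 22, 28]],
    ],
]

batch_v = regroup_by_event_type(patients, ['diag', 'prod', 'med'])
print(batch_v['diag'][0])  # [[0, 1, 2, 3, 5, 7, 41, 313, 1], [0, 1, 10, 69]]
print(batch_v['prod'][0])  # [[0, 1, 82], [1, 4]]
```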