trial_search.WhitenBERT
- class pytrial.tasks.trial_search.models.whiten_bert.WhitenBERT(layer_mode='last_first', bert_name='emilyalsentzer/Bio_ClinicalBERT', device='cuda:0', experiment_id='test')[source]
Bases:
pytrial.tasks.trial_search.models.base.TrialSearchBase
Implement a postprocessing method to improve BERT embeddings for similarity search [1].
- Parameters
layer_mode ({'last_first', 'last'}) – Which layers of embeddings to aggregate. 'last_first' combines the last and the first layers; 'last' uses the last layer only.
bert_name (str, optional (default = 'emilyalsentzer/Bio_ClinicalBERT')) – The name of base BERT model used for encoding input texts.
device (str, optional (default = 'cuda:0')) – The device this model runs on, typically 'cpu' or 'cuda:0'.
experiment_id (str, optional (default = 'test')) – The name of current experiment.
Notes
[1] Huang, J., Tang, D., Zhong, W., Lu, S., Shou, L., Gong, M., … & Duan, N. (2021, November). WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 238-244).
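A minimal usage sketch. The trial records and column names below are illustrative, not from the library; the WhitenBERT calls require PyTrial installed (plus model download) and are therefore shown as comments for orientation only:

```python
import pandas as pd

# Hypothetical trial records; 'nctid' serves as the unique tag column.
trials = pd.DataFrame({
    'nctid': ['NCT0001', 'NCT0002', 'NCT0003'],
    'title': ['A study of drug A in diabetes',
              'A trial of drug B in hypertension',
              'A study of drug C in asthma'],
    'criteria': ['age > 18', 'age > 40', 'age > 12'],
})
train_data = {'x': trials, 'fields': ['title', 'criteria'], 'tag': 'nctid'}

# With PyTrial installed, fitting encodes all trials into embeddings:
# from pytrial.tasks.trial_search.models.whiten_bert import WhitenBERT
# model = WhitenBERT(layer_mode='last_first', device='cpu')
# model.fit(train_data)
# rank, sim = model.predict(train_data, top_k=2, return_df=True)
```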
- encode(inputs, batch_size=None, num_workers=None, return_dict=True, verbose=True)[source]
Encode input documents and output the document embeddings.
- Parameters
inputs (dict) –
The input documents to encode:
'x' is the dataframe that contains multiple sections of a trial.
'fields' is the list of fields to be encoded.
'tag' is the unique index column name of each document, e.g., 'nctid'.
inputs = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
batch_size (int, optional) – The batch size when encoding trials.
num_workers (int, optional) – The number of workers when building the val dataloader.
return_dict (bool) –
Whether to return a dict of results.
If set True, return dict[np.ndarray] keyed by tag.
Else, return np.ndarray with the same order as the input documents.
verbose (bool) – Whether to display a progress bar.
- Returns
embs – Encoded trial-level embeddings, with the tag as key and the embedding as value.
- Return type
dict[np.ndarray]
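The two return modes can be sketched with mock embeddings; the tags, the 768-dimension, and the values below are illustrative stand-ins for real encode output:

```python
import numpy as np

tags = ['NCT0001', 'NCT0002']
vectors = np.random.rand(2, 768)  # mock trial embeddings

# return_dict=True: a dict keyed by the tag column values
embs_dict = {tag: vec for tag, vec in zip(tags, vectors)}

# return_dict=False: a single array, rows in input-document order
embs_array = np.stack([embs_dict[t] for t in tags])
```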
- evaluate(test_data)[source]
Evaluate retrieval performance on the given target trials and their candidate trials.
- Parameters
test_data (dict) –
The provided labeled dataset for test trials, in the format below.
test_data = {
    'x': pd.DataFrame,
    'y': pd.DataFrame,
}
- Returns
results – A dict mapping metric names to their values.
- Return type
dict[float]
Notes
x =
| target_trial | trial1 | trial2 | trial3 |
| nct01        | nct02  | nct03  | nct04  |

y =
| label1 | label2 | label3 |
| 0      | 0      | 1      |
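The Notes tables above can be built as follows. The NCT ids and labels are copied from the example tables; the evaluate call itself is only sketched, since it requires a fitted model:

```python
import pandas as pd

# One target trial (nct01) with three candidate trials, as in the Notes tables.
x = pd.DataFrame({
    'target_trial': ['nct01'],
    'trial1': ['nct02'],
    'trial2': ['nct03'],
    'trial3': ['nct04'],
})
# Relevance label of each candidate: only trial3 (nct04) is relevant.
y = pd.DataFrame({'label1': [0], 'label2': [0], 'label3': [1]})
test_data = {'x': x, 'y': y}

# With a fitted model:
# results = model.evaluate(test_data)  # dict of metric name -> float
```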
- fit(train_data, valid_data=None)[source]
Go over all trials and encode them into embeddings. Note that this is a post-processing method based on a pretrained BERT model, so it does NOT need to be trained.
- Parameters
train_data (dict) –
The data for encoding.
'x' is the dataframe that contains multiple sections of a trial.
'fields' is the list of fields to be encoded.
'tag' is the unique index column name of each document, e.g., 'nctid'.
train_data = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
valid_data (Not used.) – A placeholder; this model does not need training.
- load_model(input_dir)[source]
Load only the saved embeddings; the model itself is not loaded.
- Parameters
input_dir (str) – The directory to load the embeddings from.
- predict(test_data, top_k=10, return_df=True)[source]
Predict the top-k most relevant trials for the input documents.
- Parameters
test_data (dict) –
Shares the same input format as train_data in the fit function. If fields and tag are not given, the ones used during training are reused.
test_data = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
top_k (int) – Number of retrieved candidates.
return_df (bool) –
If set True, return dataframes of the computed similarity rankings.
Else, return rank_list = [[(doc1, sim1), (doc2, sim2), ...], [(doc1, sim1), ...]].
- Returns
rank (pd.DataFrame) – A dataframe containing the top-ranked NCT ids for each input trial.
sim (pd.DataFrame) – A dataframe containing the corresponding similarities.
rank_list (list[list[tuple]]) – A list of lists of (doc, similarity) tuples, one inner list per input trial.
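When return_df is False, the nested rank_list can be unpacked as below; the NCT ids and similarity values are mock data in the documented format, not real model output:

```python
# Mock output in the documented rank_list format:
# one inner list of (doc, similarity) tuples per input trial.
rank_list = [
    [('NCT0002', 0.91), ('NCT0003', 0.87)],   # candidates for trial 1
    [('NCT0001', 0.89), ('NCT0003', 0.75)],   # candidates for trial 2
]

# Top-ranked candidate id for each input trial
top_hits = [cands[0][0] for cands in rank_list]
```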