trial_search.WhitenBERT
- class pytrial.tasks.trial_search.models.whiten_bert.WhitenBERT(layer_mode='last_first', bert_name='emilyalsentzer/Bio_ClinicalBERT', device='cuda:0', experiment_id='test')[source]
Bases:
pytrial.tasks.trial_search.models.base.TrialSearchBase
Implement a postprocessing method to improve BERT embeddings for similarity search [1].
- Parameters
layer_mode ({'last_first', 'last'}) – Which layers of embeddings to aggregate. 'last_first' combines the last and the first layers; 'last' uses the last layer only.
bert_name (str, optional (default = 'emilyalsentzer/Bio_ClinicalBERT')) – The name of base BERT model used for encoding input texts.
device (str, optional (default = 'cuda:0')) – The device this model runs on, typically 'cpu' or 'cuda:0'.
experiment_id (str, optional (default = 'test')) – The name of current experiment.
Notes
[1] Huang, J., Tang, D., Zhong, W., Lu, S., Shou, L., Gong, M., … & Duan, N. (2021, November). WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 238-244).
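A minimal usage sketch. The trial records and column names below are illustrative, not from the library; the WhitenBERT calls require PyTrial installed (plus model download) and are therefore shown as comments for orientation only:

```python
import pandas as pd

# Hypothetical trial records; 'nctid' serves as the unique tag column.
trials = pd.DataFrame({
    'nctid': ['NCT0001', 'NCT0002', 'NCT0003'],
    'title': ['A study of drug A in diabetes',
              'A trial of drug B in hypertension',
              'A study of drug C in asthma'],
    'criteria': ['age > 18', 'age > 40', 'age > 12'],
})
train_data = {'x': trials, 'fields': ['title', 'criteria'], 'tag': 'nctid'}

# With PyTrial installed, fitting encodes all trials into embeddings:
# from pytrial.tasks.trial_search.models.whiten_bert import WhitenBERT
# model = WhitenBERT(layer_mode='last_first', device='cpu')
# model.fit(train_data)
# rank, sim = model.predict(train_data, top_k=2, return_df=True)
```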
- encode(inputs, batch_size=None, num_workers=None, return_dict=True, verbose=True)[source]
Encode input documents and output the document embeddings.
- Parameters
inputs (dict) –
The input documents to encode:
'x' is the dataframe that contains multiple sections of a trial.
'fields' is the list of fields to be encoded.
'tag' is the unique index column name of each document, e.g., 'nctid'.
inputs = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
batch_size (int, optional) – The batch size when encoding trials.
num_workers (int, optional) – The number of workers when building the val dataloader.
return_dict (bool) –
Whether to return a dict of results.
If set True, return dict[np.ndarray] keyed by tag.
Else, return np.ndarray with the same order as the input documents.
verbose (bool) – Whether to display a progress bar.
- Returns
embs – Encoded trial-level embeddings, with the tag as key and the embedding as value.
- Return type
dict[np.ndarray]
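The two return modes can be sketched with mock embeddings; the tags, the 768-dimension, and the values below are illustrative stand-ins for real encode output:

```python
import numpy as np

tags = ['NCT0001', 'NCT0002']
vectors = np.random.rand(2, 768)  # mock trial embeddings

# return_dict=True: a dict keyed by the tag column values
embs_dict = {tag: vec for tag, vec in zip(tags, vectors)}

# return_dict=False: a single array, rows in input-document order
embs_array = np.stack([embs_dict[t] for t in tags])
```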
- evaluate(test_data)[source]
Evaluate retrieval performance on the given target trials and their candidate trials.
- Parameters
test_data (dict) –
The provided labeled dataset for test trials, in the format below.
test_data = {
    'x': pd.DataFrame,
    'y': pd.DataFrame,
}
- Returns
results – A dict mapping metric names to their values.
- Return type
dict[float]
Notes
x =
| target_trial | trial1 | trial2 | trial3 |
| nct01        | nct02  | nct03  | nct04  |

y =
| label1 | label2 | label3 |
| 0      | 0      | 1      |
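The Notes tables above can be built as follows. The NCT ids and labels are copied from the example tables; the evaluate call itself is only sketched, since it requires a fitted model:

```python
import pandas as pd

# One target trial (nct01) with three candidate trials, as in the Notes tables.
x = pd.DataFrame({
    'target_trial': ['nct01'],
    'trial1': ['nct02'],
    'trial2': ['nct03'],
    'trial3': ['nct04'],
})
# Relevance label of each candidate: only trial3 (nct04) is relevant.
y = pd.DataFrame({'label1': [0], 'label2': [0], 'label3': [1]})
test_data = {'x': x, 'y': y}

# With a fitted model:
# results = model.evaluate(test_data)  # dict of metric name -> float
```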
- fit(train_data, valid_data=None)[source]
Go over all trials and encode them into embeddings. Note that this is a post-processing method based on a pretrained BERT model, so it does NOT need to be trained.
- Parameters
train_data (dict) –
The data for encoding.
'x' is the dataframe that contains multiple sections of a trial.
'fields' is the list of fields to be encoded.
'tag' is the unique index column name of each document, e.g., 'nctid'.
train_data = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
valid_data (Not used.) – A placeholder; this model does not need training.
- load_model(input_dir)[source]
Load only the saved embeddings; the model itself is not loaded.
- Parameters
input_dir (str) – The directory to load the embeddings from.
- predict(test_data, top_k=10, return_df=True)[source]
Predict the top-k most relevant trials for the input documents.
- Parameters
test_data (dict) –
Shares the same input format as train_data in the fit function. If fields and tag are not given, the ones used during training are reused.
test_data = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}
top_k (int) – Number of retrieved candidates.
return_df (bool) –
If set True, return dataframes of the computed similarity rankings.
Else, return rank_list = [[(doc1, sim1), (doc2, sim2), ...], [(doc1, sim1), ...]].
- Returns
rank (pd.DataFrame) – A dataframe containing the top-ranked NCT ids for each input trial.
sim (pd.DataFrame) – A dataframe containing the corresponding similarities.
rank_list (list[list[tuple]]) – A list of lists of (doc, similarity) tuples, one inner list per input trial.
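When return_df is False, the nested rank_list can be unpacked as below; the NCT ids and similarity values are mock data in the documented format, not real model output:

```python
# Mock output in the documented rank_list format:
# one inner list of (doc, similarity) tuples per input trial.
rank_list = [
    [('NCT0002', 0.91), ('NCT0003', 0.87)],   # candidates for trial 1
    [('NCT0001', 0.89), ('NCT0003', 0.75)],   # candidates for trial 2
]

# Top-ranked candidate id for each input trial
top_hits = [cands[0][0] for cands in rank_list]
```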