trial_search.WhitenBERT

class pytrial.tasks.trial_search.models.whiten_bert.WhitenBERT(layer_mode='last_first', bert_name='emilyalsentzer/Bio_ClinicalBERT', device='cuda:0', experiment_id='test')[source]

Bases: pytrial.tasks.trial_search.models.base.TrialSearchBase

Implement a postprocessing method to improve BERT embeddings for similarity search [1].

Parameters
  • layer_mode ({'last_first', 'last'}) – The mode of layers whose embeddings are aggregated. 'last_first' means use the last layer and the first layer; 'last' means use the last layer only.

  • bert_name (str, optional (default = 'emilyalsentzer/Bio_ClinicalBERT')) – The name of base BERT model used for encoding input texts.

  • device (str, optional (default = 'cuda:0')) – The device of this model, typically 'cpu' or 'cuda:0'.

  • experiment_id (str, optional (default = 'test')) – The name of current experiment.

Notes

[1] Huang, J., Tang, D., Zhong, W., Lu, S., Shou, L., Gong, M., … & Duan, N. (2021, November). WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 238-244).
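Example

A minimal instantiation sketch. It assumes WhitenBERT is importable from pytrial.tasks.trial_search, as the page title suggests; the argument values are the documented defaults, except device, which is set to 'cpu' for illustration.

    from pytrial.tasks.trial_search import WhitenBERT

    # Build the post-processing model on top of a pretrained clinical BERT.
    # No training is required; fit() below only encodes the trial corpus.
    model = WhitenBERT(
        layer_mode='last_first',
        bert_name='emilyalsentzer/Bio_ClinicalBERT',
        device='cpu',            # use 'cuda:0' if a GPU is available
        experiment_id='test',
    )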

encode(inputs, batch_size=None, num_workers=None, return_dict=True, verbose=True)[source]

Encode input documents and output the document embeddings.

Parameters
  • inputs (dict) –

    The input documents to encode:

    • 'x' is the dataframe that contains multiple sections of a trial.

    • 'fields' is the list of fields to be encoded.

    • 'tag' is the unique index column name of each document, e.g., 'nctid'.

    inputs =
    {
        'x': pd.DataFrame,
        'fields': list[str],
        'tag': str,
    }

  • batch_size (int, optional) – The batch size when encoding trials.

  • num_workers (int, optional) – The number of workers when building the val dataloader.

  • return_dict (bool) –

    Whether to return a dict of results.

    • If set True, return dict[np.ndarray].

    • Else, return np.ndarray with the order same as the input documents.

  • verbose (bool) – Whether to show a progress bar or not.

Returns

embs – Encoded trial-level embeddings with key (tag) and value (embedding).

Return type

dict[np.ndarray]
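Example

A sketch of the documented input format, reusing the model instance from the instantiation sketch above; the column names 'nctid' and 'title' are illustrative, not required by the API.

    import pandas as pd

    docs = pd.DataFrame({
        'nctid': ['nct01', 'nct02'],
        'title': ['A phase 2 trial of drug A.', 'A phase 3 trial of drug B.'],
    })

    # 'fields' lists the text columns to encode; 'tag' is the unique index column.
    inputs = {'x': docs, 'fields': ['title'], 'tag': 'nctid'}

    embs = model.encode(inputs, return_dict=True)  # dict: nctid -> np.ndarray embedding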

evaluate(test_data)[source]

Evaluate the ranking of the corresponding candidate trials for each given target trial.

Parameters

test_data (dict) –

The provided labeled dataset for test trials. Follow the format listed below.

test_data =
{
    'x': pd.DataFrame,
    'y': pd.DataFrame,
}

Returns

results – A dict mapping metric names to their values.

Return type

dict[float]

Notes

x =

target_trial | trial1 | trial2 | trial3
nct01        | nct02  | nct03  | nct04

y =

label1 | label2 | label3
0      | 0      | 1
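Example

A sketch of calling evaluate with test_data laid out as in the tables above; the NCT ids and labels mirror that example, and the exact metric names returned depend on the library.

    import pandas as pd

    x = pd.DataFrame({
        'target_trial': ['nct01'],
        'trial1': ['nct02'],
        'trial2': ['nct03'],
        'trial3': ['nct04'],
    })
    y = pd.DataFrame({'label1': [0], 'label2': [0], 'label3': [1]})

    results = model.evaluate({'x': x, 'y': y})  # dict: metric name -> value
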
fit(train_data, valid_data=None)[source]

Go over all trials and encode them into embeddings. Note that this is a post-processing method based on a pretrained BERT model, so it does NOT need to be trained.

Parameters
  • train_data (dict) –

    The data for encoding.

    • 'x' is the dataframe that contains multiple sections of a trial.

    • 'fields' is the list of fields to be encoded.

    • 'tag' is the unique index column name of each document, e.g., 'nctid'.

    train_data =
    {
        'x': pd.DataFrame,
        'fields': list[str],
        'tag': str,
    }

  • valid_data (Not used.) – This is a placeholder because this model does not need training.
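Example

A sketch of the fit call with the documented train_data layout; because WhitenBERT only post-processes pretrained BERT embeddings, fit encodes the corpus without any parameter updates, and valid_data is omitted. The column names are illustrative.

    import pandas as pd

    trials = pd.DataFrame({
        'nctid': ['nct01', 'nct02', 'nct03'],
        'title': ['Trial A', 'Trial B', 'Trial C'],
        'criteria': ['Adults aged 18-65 ...', 'Pediatric patients ...', 'Healthy volunteers ...'],
    })

    train_data = {'x': trials, 'fields': ['title', 'criteria'], 'tag': 'nctid'}
    model.fit(train_data)  # encodes all trials; no gradient updates are performed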

load_model(input_dir)[source]

Only load the embeddings. Do not load the model.

Parameters

input_dir (str) – The input directory from which the saved embeddings are loaded.

predict(test_data, top_k=10, return_df=True)[source]

Predict the top-k relevant trials for the input documents.

Parameters
  • test_data (dict) –

    Shares the same input format as train_data in the fit function. If 'fields' and 'tag' are not given, the ones used during fit are reused.

    test_data =
    {
        'x': pd.DataFrame,
        'fields': list[str],
        'tag': str,
    }

  • top_k (int) – Number of retrieved candidates.

  • return_df (bool) –

    • If set True, return dataframes for the computed similarity ranking.

    • Else, return rank_list=[[(doc1,sim1),(doc2,sim2)], [(doc1,sim1),…]].

Returns

  • rank (pd.DataFrame) – A dataframe containing the top ranked NCT ids for each input trial.

  • sim (pd.DataFrame) – A dataframe containing the corresponding similarities.

  • rank_list (list[list[tuple]]) – A list of lists of (doc, similarity) tuples, returned when return_df is False.
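Example

A sketch of retrieving the top candidates for a query trial, continuing the fit sketch above. It assumes the two dataframes are returned together as a tuple when return_df=True; the column names remain illustrative.

    import pandas as pd

    query = pd.DataFrame({
        'nctid': ['nct01'],
        'title': ['Trial A'],
        'criteria': ['Adults aged 18-65 ...'],
    })

    # 'fields' and 'tag' are omitted, so the ones used during fit are reused.
    rank, sim = model.predict({'x': query}, top_k=5, return_df=True)

    # With return_df=False, a nested list of (doc, similarity) tuples is returned instead.
    rank_list = model.predict({'x': query}, top_k=5, return_df=False)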

save_model(output_dir)[source]

Only save the embeddings. Do not save the model.

Parameters

output_dir (str) – The output directory where the embeddings are saved.
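Example

A sketch of persisting and restoring the encoded trial embeddings; only the embeddings are written and read, not the BERT weights. The directory path is arbitrary.

    model.save_model('./checkpoints/whitenbert')  # writes the trial embeddings only
    model.load_model('./checkpoints/whitenbert')  # restores them for later predict/evaluate calls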