model_utils.bert
- class pytrial.model_utils.bert.BERT(bertname='emilyalsentzer/Bio_ClinicalBERT', proj_dim=None, max_length=512, device='cpu')[source]
Bases: torch.nn.modules.module.Module
A pretrained BERT model for obtaining text embeddings.
- Parameters
bertname (str (default='emilyalsentzer/Bio_ClinicalBERT')) – The name of a pretrained BERT model to fetch from the Hugging Face model hub: https://huggingface.co/models. Alternatively, pass the path of a directory containing a local pretrained BERT model.
proj_dim (int or None) – The output dimension of a linear projection head added on top of the BERT encoder. Note that if given, the projection head is randomly initialized and needs further training.
max_length (int) – Maximum acceptable number of tokens for each sentence.
device (str) – The device on which this model runs, typically 'cpu' or 'cuda:0'.
Examples
>>> model = BERT()
>>> emb = model.encode('The goal of life is comfort.')
>>> print(emb.shape)
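A minimal sketch of using the optional projection head; the value 128 and the input sentence are made up for illustration, and the assumption that the output dimension equals proj_dim follows from the parameter description above. The projection head is untrained at this point.

>>> model = BERT(proj_dim=128)
>>> emb = model.encode('The trial met its primary endpoint.')
>>> print(emb.shape)  # expected [1, 128] if encode returns the projected embeddings (assumption)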
- encode(input_text, is_train=False, batch_size=None)[source]
Encode the input texts into embeddings.
- Parameters
input_text (str or list[str]) – A sentence or a list of sentences to be encoded.
is_train (bool) – Set to True if the model's parameters will be updated through training.
batch_size (int) – The batch size to use when encoding long documents with many sentences. If set to None, all sentences are encoded at once.
- Returns
outputs – The encoded sentence embeddings with shape [num_sent, emb_dim].
- Return type
torch.Tensor
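As a hedged sketch of batched encoding, using only the arguments documented above; the sentences and batch size are invented for the example:

>>> model = BERT()
>>> sentences = ['Aspirin reduces fever.', 'The trial enrolled 120 patients.']
>>> emb = model.encode(sentences, batch_size=32)
>>> print(emb.shape)  # [2, emb_dim], e.g. [2, 768] for the default encoder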
- forward(input_ids, attention_mask=None, token_type_ids=None, return_hidden_states=False)[source]
Forward pass of the model.
- Parameters
input_ids (torch.Tensor) – The input token ids with shape [batch_size, seq_len].
attention_mask (torch.Tensor) – The attention mask with shape [batch_size, seq_len].
token_type_ids (torch.Tensor) – The token type ids with shape [batch_size, seq_len].
return_hidden_states (bool) – Whether to return the hidden states of all layers.
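A minimal sketch of calling forward directly, assuming the inputs are tokenized with the Hugging Face tokenizer matching bertname (this pairing is an assumption, not confirmed by this page):

>>> from transformers import AutoTokenizer
>>> model = BERT()
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> batch = tokenizer(['Blood pressure was stable.'], padding=True, truncation=True, max_length=512, return_tensors='pt')
>>> out = model.forward(batch['input_ids'], attention_mask=batch['attention_mask'])

For everyday use, encode is the simpler entry point; forward is useful when you already have tokenized tensors or need the hidden states of all layers via return_hidden_states=True.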