model_utils.bert

class pytrial.model_utils.bert.BERT(bertname='emilyalsentzer/Bio_ClinicalBERT', proj_dim=None, max_length=512, device='cpu')[source]

Bases: torch.nn.modules.module.Module

The pretrained BERT model for extracting text embeddings.

Parameters
  • bertname (str (default='emilyalsentzer/Bio_ClinicalBERT')) – The name of the pretrained BERT model to load from the Hugging Face model hub (https://huggingface.co/models), or the path to a directory containing a locally saved pretrained BERT.

  • proj_dim (int or None) – The output dimension of a linear projection head added on top of the BERT encoder. Note that if given, the projection head is RANDOMLY initialized and needs further training.

  • max_length (int) – Maximum acceptable number of tokens for each sentence.

  • device (str) – The device this model runs on, typically 'cpu' or 'cuda:0'.

Examples

>>> model = BERT()
>>> emb = model.encode('The goal of life is comfort.')
>>> print(emb.shape)
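
A hedged sketch of the projection-head and device options; proj_dim=128, max_length=128, and the input sentence are illustrative values, not part of the original example. Because the head is randomly initialized, the projected embeddings are not meaningful until the head is fine-tuned. With the default Bio_ClinicalBERT encoder (a BERT-base model), the unprojected embedding dimension is typically 768.

>>> model = BERT(proj_dim=128, max_length=128, device='cpu')  # illustrative settings
>>> emb = model.encode('Subject reports mild headache after dosing.')
>>> print(emb.shape)  # projected to 128 dims; the head still needs training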
encode(input_text, is_train=False, batch_size=None)[source]

Encode the input texts into embeddings.

Parameters
  • input_text (str or list[str]) – A sentence or a list of sentences to be encoded.

  • is_train (bool) – Set to True if this model's parameters will be updated by training.

  • batch_size (int) – The batch size to use when encoding long documents with many sentences. When set to None, all sentences are encoded at once.

Returns

outputs – The encoded sentence embeddings with shape [num_sent, emb_dim].

Return type

torch.Tensor
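
A minimal sketch of encoding a list of sentences in mini-batches; the sentences and batch_size=2 are illustrative.

>>> model = BERT()
>>> sents = ['Patient denies chest pain.',
...          'History of type 2 diabetes mellitus.',
...          'Discharged home in stable condition.']
>>> embs = model.encode(sents, batch_size=2)  # encode two sentences per batch
>>> print(embs.shape)  # [num_sent, emb_dim] per the Returns note, i.e. 3 rows here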

forward(input_ids, attention_mask=None, token_type_ids=None, return_hidden_states=False)[source]

Forward pass of the model.

Parameters
  • input_ids (torch.Tensor) – The input token ids with shape [batch_size, seq_len].

  • attention_mask (torch.Tensor) – The attention mask with shape [batch_size, seq_len].

  • token_type_ids (torch.Tensor) – The token type ids with shape [batch_size, seq_len].

  • return_hidden_states (bool) – Whether to return the hidden states of all layers.
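
A hedged sketch of calling forward directly with pre-tokenized inputs; it assumes the matching Hugging Face tokenizer is loaded separately via transformers.AutoTokenizer (not part of this class), and since the return value is not documented above, the example only shows the call.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
>>> batch = tokenizer(['chest pain and dyspnea', 'no acute distress'],
...                   padding=True, truncation=True, max_length=512,
...                   return_tensors='pt')
>>> model = BERT()
>>> out = model.forward(batch['input_ids'],
...                     attention_mask=batch['attention_mask'],
...                     token_type_ids=batch['token_type_ids'],
...                     return_hidden_states=True)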