trial_search.Doc2Vec
- class pytrial.tasks.trial_search.models.doc2vec.Doc2Vec(emb_dim=128, epochs=10, window=5, min_count=5, max_vocab_size=None, num_workers=4, experiment_id='test')[source]
Bases:
pytrial.tasks.trial_search.models.base.TrialSearchBase
Implement the Doc2Vec model for trial document similarity search.
- Parameters
emb_dim (int, optional (default=128)) – Dimensionality of the embedding vectors.
epochs (int, optional (default=10)) – Number of training iterations (epochs) over the corpus.
window (int, optional (default=5)) – The maximum distance between the current and predicted word within a sentence.
min_count (int, optional (default=5)) – Ignores all words with total frequency lower than this.
max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
num_workers (int, optional (default=4)) – Use this many worker threads to train the model (faster training on multicore machines).
experiment_id (str, optional (default='test')) – The name of the current experiment.
- encode(inputs)[source]
Encode input documents and output the document embeddings.
- Parameters
inputs (dict) –
The documents to be encoded. x: a dataframe of trial documents. fields: the list of columns in x to be used.
inputs =
{
'x': pd.DataFrame,
'fields': list[str],
}
- Returns
embs – The encoded trial document embeddings.
- Return type
np.ndarray
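A minimal sketch of the expected input structure. The column names below are hypothetical, and the encode call itself (which requires pytrial and a fitted model) is shown as a comment:

```python
import pandas as pd

# Hypothetical trial documents; the column names are illustrative,
# not required by the API.
df = pd.DataFrame({
    'title':    ['A study of drug A', 'A trial of drug B'],
    'criteria': ['adults aged 18-65', 'children under 12'],
})

# 'fields' selects which text columns of df are encoded.
inputs = {'x': df, 'fields': ['title', 'criteria']}

# Assuming a fitted Doc2Vec model (requires pytrial):
# embs = model.encode(inputs)   # np.ndarray, one emb_dim vector per row of df
```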
- fit(train_data, valid_data=None)[source]
Train the doc2vec model to get document embeddings for trial search.
- Parameters
train_data (dict) –
Training corpus for the model.
x: a dataframe of trial documents.
fields: optional, the fields of documents to use for training. If not given, the model uses all fields in x.
tag: optional, the field in x that serves as unique identifiers. Typically it is the nct_id of each trial. If not given, the model takes integer tags.
train_data =
{
'x': pd.DataFrame,
'fields': list[str],
'tag': str,
}
- valid_data: Ignored.
Not used, present here for API consistency by convention.
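A minimal sketch of assembling train_data, assuming a dataframe with hypothetical columns nct_id, title, and criteria. The training call itself requires pytrial and is shown as a comment:

```python
import pandas as pd

# Hypothetical trial corpus; the column names are illustrative.
df = pd.DataFrame({
    'nct_id':   ['NCT0001', 'NCT0002'],
    'title':    ['A study of drug A', 'A trial of drug B'],
    'criteria': ['adults aged 18-65', 'children under 12'],
})

train_data = {
    'x': df,                          # the trial documents
    'fields': ['title', 'criteria'],  # text columns used for training
    'tag': 'nct_id',                  # unique identifier per trial
}

# Assumes pytrial is installed; uncomment to train:
# from pytrial.tasks.trial_search.models.doc2vec import Doc2Vec
# model = Doc2Vec(emb_dim=128, epochs=10)
# model.fit(train_data)
```

Omitting 'fields' makes the model use all columns of x, and omitting 'tag' makes it fall back to integer tags.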
- load_model(checkpoint)[source]
Load the pretrained model from disk.
- Parameters
checkpoint (str) –
The path to the pretrained model.
If a directory, the only *.pth.tar checkpoint file in it will be loaded.
If a filepath, the model will be loaded from that file.
- predict(test_data, top_k=10)[source]
Take the input documents and find the most similar documents in the training corpus.
- Parameters
test_data (dict) –
Trial docs to be predicted. x: a dataframe of trial documents. fields: optional, the fields of the documents to use. If not given, the model uses all fields in x.
test_data =
{
'x': pd.DataFrame,
'fields': list[str],
}
top_k (int, optional (default=10)) – The number of top similar documents to be retrieved.
- Returns
pred – For each input trial, a list of (tag, similarity) pairs for the top_k most similar documents in the training corpus.
- Return type
list[list[tuple[str, float]]]
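A sketch of consuming the return value, assuming the training tags were nct_id strings. The values below are made up for illustration, not actual model output:

```python
# Hypothetical result of model.predict(test_data, top_k=2):
# one inner list per input trial, each holding (tag, similarity) pairs.
pred = [
    [('NCT0002', 0.91), ('NCT0007', 0.85)],  # neighbors of the 1st test trial
    [('NCT0001', 0.88), ('NCT0003', 0.80)],  # neighbors of the 2nd test trial
]

for i, neighbors in enumerate(pred):
    for tag, sim in neighbors:
        print(f'test trial {i}: {tag} (similarity {sim:.2f})')
```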