trial_search.Doc2Vec

class pytrial.tasks.trial_search.models.doc2vec.Doc2Vec(emb_dim=128, epochs=10, window=5, min_count=5, max_vocab_size=None, num_workers=4, experiment_id='test')[source]

Bases: pytrial.tasks.trial_search.models.base.TrialSearchBase

Implement the Doc2Vec model for trial document similarity search.

Parameters
  • emb_dim (int, optional (default=128)) – Dimensionality of the embedding vectors.

  • epochs (int, optional (default=10)) – Number of training iterations (epochs) over the corpus.

  • window (int, optional (default=5)) – The maximum distance between the current and predicted word within a sentence.

  • min_count (int, optional (default=5)) – Ignores all words with total frequency lower than this.

  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

  • num_workers (int, optional (default=4)) – Use this many worker threads to train the model (faster training on multicore machines).

  • experiment_id (str, optional (default='test')) – The name of the current experiment.

encode(inputs)[source]

Encode input documents and output the document embeddings.

Parameters

inputs (dict) –

The documents to be encoded, where x is a dataframe of trial documents and fields is the list of columns in x to encode.

inputs = {
    'x': pd.DataFrame,
    'fields': list[str],
}

Returns

embs – The encoded trial document embeddings.

Return type

np.ndarray
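As a sketch of the expected input and output shapes (the column names below are illustrative, not part of the API):

```python
import pandas as pd

# A tiny illustrative corpus; 'title' and 'criteria' are hypothetical columns.
trials = pd.DataFrame({
    'title': ['A phase 2 study of drug A', 'A phase 3 study of drug B'],
    'criteria': ['adults over 18', 'adults with hypertension'],
})

# Input dict in the format encode() expects.
inputs = {
    'x': trials,
    'fields': ['title', 'criteria'],
}

# With emb_dim=128, encode(inputs) would return an np.ndarray of
# shape (len(trials), 128): one embedding row per document.
expected_shape = (len(inputs['x']), 128)
```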

fit(train_data, valid_data=None)[source]

Train the Doc2Vec model to get document embeddings for trial search.

Parameters

train_data (dict) –

Training corpus for the model.

  • x: a dataframe of trial documents.

  • fields: optional, the list of document fields (columns) to use for training. If not given, the model uses all fields in x.

  • tag: optional, the field in x that serves as the unique document identifier, typically the nct_id of each trial. If not given, the model falls back to integer tags.

train_data = {
    'x': pd.DataFrame,
    'fields': list[str],
    'tag': str,
}

valid_data – Ignored. Not used; present for API consistency by convention.
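A minimal sketch of assembling train_data, assuming the corpus has an nct_id identifier column (the column names are illustrative):

```python
import pandas as pd

# Hypothetical trial corpus with an nct_id identifier column.
trials = pd.DataFrame({
    'nct_id': ['NCT0001', 'NCT0002'],
    'title': ['A phase 2 study of drug A', 'A phase 3 study of drug B'],
    'criteria': ['adults over 18', 'adults with hypertension'],
})

train_data = {
    'x': trials,
    'fields': ['title', 'criteria'],  # omit to train on all columns
    'tag': 'nct_id',                  # omit to fall back to integer tags
}
```

Passing this dict to fit(train_data) would train one embedding per row of trials, keyed by its nct_id.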

load_model(checkpoint)[source]

Load the pretrained model from disk.

Parameters

checkpoint (str) –

The path to the pretrained model.

  • If a directory, the single checkpoint file matching *.pth.tar in it will be loaded.

  • If a filepath, will load from this file.

predict(test_data, top_k=10)[source]

For each input document, find the most similar documents in the training corpus.

Parameters
  • test_data (dict) –

    Trial docs to be queried, where x is a dataframe of trial documents and fields is the optional list of fields to use. If fields is not given, the model uses all fields in x.

    test_data = {
        'x': pd.DataFrame,
        'fields': list[str],
    }

  • top_k (int, optional (default=10)) – The number of top similar documents to be retrieved.

Returns

pred – For each input trial, a list of (tag, similarity) pairs for the top_k most similar documents in the training corpus.

Return type

list[list[tuple[str, float]]]
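The return structure can be illustrated with plain Python values (the NCT ids and scores here are made up):

```python
# Hypothetical output of predict(test_data, top_k=3) for two query trials:
# one inner list per input trial, each holding (tag, similarity) pairs
# sorted by descending similarity.
pred = [
    [('NCT0004', 0.91), ('NCT0002', 0.87), ('NCT0009', 0.80)],
    [('NCT0001', 0.95), ('NCT0007', 0.76), ('NCT0003', 0.64)],
]

# The best match for the first query and its similarity score:
best_tag, best_score = pred[0][0]
```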

save_model(output_dir)[source]

Save the trained model.

Parameters

output_dir (str) – The output directory for saving. The checkpoint is saved as checkpoint.pth.tar by default.