Top2Vec API Guide — Top2Vec 1.0.36 documentation

class top2vec.top2vec.Document(document_index, topics)

document_index

Alias for field number 0

topics

Alias for field number 1
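Document is a namedtuple, so its fields can be accessed by name or by position. A minimal sketch of its field layout (the values here are made up; the real class is top2vec.top2vec.Document):

```python
from collections import namedtuple

# Illustrative stand-in for top2vec.top2vec.Document:
# field 0 is document_index, field 1 is topics.
Document = namedtuple("Document", ["document_index", "topics"])

doc = Document(document_index=7, topics=[2, 5])
print(doc.document_index)  # field number 0 -> 7
print(doc.topics)          # field number 1 -> [2, 5]
```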

class top2vec.top2vec.Top2Vec(documents: list[str], contextual_top2vec=False, c_top2vec_smoothing_window=5, min_count=50, topic_merge_delta=0.1, ngram_vocab=False, ngram_vocab_args=None, embedding_model='all-MiniLM-L6-v2', embedding_model_path=None, embedding_batch_size=32, split_documents=False, document_chunker='sequential', chunk_length=100, max_num_chunks=None, chunk_overlap_ratio=0.5, chunk_len_coverage_ratio=1.0, sentencizer=None, speed='learn', use_corpus_file=False, document_ids=None, keep_documents=True, workers=None, tokenizer=None, use_embedding_model_tokenizer=True, umap_args=None, gpu_umap=False, hdbscan_args=None, gpu_hdbscan=False, index_topics=False, verbose=True)

Creates jointly embedded topic, document and word vectors.

Parameters:

compute_topics(umap_args=None, hdbscan_args=None, topic_merge_delta=0.001, gpu_umap=False, gpu_hdbscan=False, index_topics=False, contextual_top2vec=None, c_top2vec_smoothing_window=5)

Computes topics from current document vectors.

New topic vectors will be computed along with new topic descriptions. Documents will be reassigned to new topics. If topics were previously reduced they will be removed. You will need to call hierarchical_topic_reduction to recompute them.

This is useful for experimenting with different umap and hdbscan parameters and also if many new documents were added since training the initial model.

Parameters:

get_num_topics(reduced=False)

Get number of topics.

This is the number of topics Top2Vec has found in the data by default. If reduced is True, the number of reduced topics is returned.

Parameters:

reduced (bool ( Optional , default False )) – The number of original topics will be returned by default. If True will return the number of reduced topics, if hierarchical topic reduction has been performed.

Returns:

num_topics

Return type:

int

get_topic_hierarchy()

Get the hierarchy of reduced topics. The mapping of each original topic to the reduced topics is returned.

Hierarchical topic reduction must be performed before calling this method.

Returns:

hierarchy – Each index of the hierarchy corresponds to the topic number of a reduced topic. For each reduced topic the topic numbers of the original topics that were merged to create it are listed.

Example: [[3], [2, 4], [0, 1], …]

Reduced Topic 0 contains original Topic 3.
Reduced Topic 1 contains original Topics 2 and 4.
Reduced Topic 2 contains original Topics 0 and 1.

Return type:

list of lists of int
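The returned hierarchy can be inverted to look up which reduced topic any original topic was merged into. A minimal sketch using the example hierarchy above (the values are illustrative, not from a real model):

```python
# Hypothetical hierarchy as returned by get_topic_hierarchy():
# index = reduced topic number, value = original topics merged into it.
hierarchy = [[3], [2, 4], [0, 1]]

# Invert it: map each original topic to its reduced topic.
original_to_reduced = {
    original: reduced
    for reduced, originals in enumerate(hierarchy)
    for original in originals
}
print(original_to_reduced)  # {3: 0, 2: 1, 4: 1, 0: 2, 1: 2}
```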

get_topic_sizes(reduced=False)

Get topic sizes.

Top2Vec: the number of documents most similar to each topic. Topics are in decreasing order of size.

Contextual Top2Vec: the number of tokens most similar to each topic.

The sizes of the original topics are returned unless reduced=True, in which case the sizes of the reduced topics will be returned.

Parameters:

reduced (bool ( Optional , default False )) – Original topic sizes are returned by default. If True the reduced topic sizes will be returned.

Returns:
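The sizes amount to a tally of how many documents are assigned to each topic. A minimal pure-Python sketch of that tally (doc_topics is a made-up assignment list, not the model's internal state):

```python
from collections import Counter

# Hypothetical per-document topic assignments.
doc_topics = [0, 0, 0, 1, 1, 2]

# Tally documents per topic; most_common() yields (topic, size)
# pairs sorted by size, largest first.
counts = Counter(doc_topics)
topic_nums, topic_sizes = zip(*counts.most_common())
print(topic_sizes)  # (3, 2, 1)
print(topic_nums)   # (0, 1, 2)
```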

get_topics(num_topics=None, reduced=False)

Get topics, ordered by decreasing size. All topics are returned if num_topics is not specified.

The original topics found are returned unless reduced=True, in which case reduced topics will be returned.

Each topic consists of the 50 words most semantically similar to the topic: the 50 words closest to the topic vector, along with the cosine similarity of each word to that vector. The higher the score, the more relevant the word is to the topic.

Parameters:

Returns:

hierarchical_topic_reduction(num_topics, interval=None)

Reduce the number of topics discovered by Top2Vec.

The most representative topics of the corpus will be found by iteratively merging the smallest topic into the topic most similar to it, until num_topics topics remain.

Parameters:

Returns:

hierarchy – Each index of hierarchy corresponds to the reduced topics, for each reduced topic the indexes of the original topics that were merged to create it are listed.

Example: [[3], [2, 4], [0, 1], …]

Reduced Topic 0 contains original Topic 3.
Reduced Topic 1 contains original Topics 2 and 4.
Reduced Topic 2 contains original Topics 0 and 1.

Return type:

list of lists of int

classmethod load(file)

Load a pre-trained model from the specified file.

Parameters:

file (str) – File from which the model will be loaded.

query_topics(query, num_topics, reduced=False, tokenizer=None)

Semantic search of topics using text query.

The query text is embedded, and the topics closest to the resulting query vector are returned, ordered by proximity. Successive topics in the list are less semantically similar to the query.

Parameters:

Returns:

save(file)

Saves the current model to the specified file.

Parameters:

file (str) – File where model will be saved.

search_topics(keywords, num_topics, keywords_neg=None, reduced=False)

Semantic search of topics using keywords.

The topics most semantically similar to the combination of the keywords will be returned. If negative keywords are provided, the topics will be semantically dissimilar to those words. Topics are ordered by decreasing similarity to the keywords. Too many keywords, or certain combinations of words, may give strange results. This method finds the average vector of all the keyword vectors (negative keyword vectors are subtracted) and returns the topics closest to the resulting vector.

Parameters:

Returns:
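The averaged-vector ranking described above can be sketched in pure Python. The vectors below are tiny made-up 2-D stand-ins for the model's learned embeddings; a real model works with high-dimensional keyword and topic vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical keyword and negative-keyword vectors.
keyword_vecs = [[1.0, 0.0], [0.8, 0.2]]
neg_vecs = [[0.0, 1.0]]

# Average the keyword vectors, subtracting the negative-keyword average.
dim = len(keyword_vecs[0])
avg = [
    sum(v[i] for v in keyword_vecs) / len(keyword_vecs)
    - sum(v[i] for v in neg_vecs) / len(neg_vecs)
    for i in range(dim)
]

# Hypothetical topic vectors; rank by decreasing cosine similarity.
topic_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
ranked = sorted(topic_vecs, key=lambda t: cosine(avg, topic_vecs[t]), reverse=True)
print(ranked)  # [0, 2, 1] -- topic 0 is closest to the combined vector
```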

search_topics_by_vector(vector, num_topics, reduced=False)

Semantic search of topics using a vector.

These are the topics closest to the vector. Topics are ordered by proximity to the vector. Successive topics in the list are less semantically similar to the vector.

Parameters:

Returns:

search_words_by_vector(vector, num_words, use_index=False, ef=None)

Semantic search of words using a vector.

These are the words closest to the vector. Words are ordered by proximity to the vector. Successive words in the list are less semantically similar to the vector.

Parameters:

Returns:

similar_words(keywords, num_words, keywords_neg=None, use_index=False, ef=None)

Semantic similarity search of words.

The words most semantically similar to the combination of the keywords will be returned. If negative keywords are provided, the words will be semantically dissimilar to those words. Too many keywords, or certain combinations of words, may give strange results. This method finds the average vector of all the keyword vectors (negative keyword vectors are subtracted) and returns the words closest to the resulting vector.

Parameters:

Returns:

class top2vec.top2vec.Topic(topic_index, tokens, score)

score

Alias for field number 2

tokens

Alias for field number 1

topic_index

Alias for field number 0

top2vec.top2vec.default_tokenizer(document)

Tokenize a document for training and remove words that are too long or too short.

Parameters:

document (str) – Input document.

Returns:

tokenized_document – List of tokens.

Return type:

List of str
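A minimal sketch of a tokenizer in the spirit of default_tokenizer: lowercase word tokens, dropping words that are too short or too long. The length bounds used here (2 and 15) are illustrative assumptions, not the library's guaranteed values:

```python
import re

def simple_tokenize(document, min_len=2, max_len=15):
    """Lowercase alphabetic tokens, filtered by length."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_tokenize("A Top2Vec model embeds documents!"))
# ['top', 'vec', 'model', 'embeds', 'documents']
```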

top2vec.top2vec.get_chunks(tokens, chunk_length, max_num_chunks, chunk_overlap_ratio)

Split a document into sequential chunks

Parameters:

Returns:

chunked_document – List of document chunks.

Return type:

List of str
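A minimal sketch of sequential chunking, assuming chunk_overlap_ratio controls how far each chunk's start advances (stride = chunk_length * (1 - chunk_overlap_ratio)); the exact semantics inside top2vec may differ:

```python
def sequential_chunks(tokens, chunk_length, max_num_chunks, chunk_overlap_ratio):
    """Split a token list into overlapping sequential chunks."""
    stride = max(1, int(chunk_length * (1 - chunk_overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), stride):
        if max_num_chunks is not None and len(chunks) >= max_num_chunks:
            break
        chunk = tokens[start:start + chunk_length]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if start + chunk_length >= len(tokens):
            break  # the rest of the document is already covered
    return chunks

tokens = [f"w{i}" for i in range(10)]
print(sequential_chunks(tokens, chunk_length=4, max_num_chunks=None,
                        chunk_overlap_ratio=0.5))
# ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```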

top2vec.top2vec.get_random_chunks(tokens, chunk_length, chunk_len_coverage_ratio, max_num_chunks)

Split a document into chunks starting at random positions

Parameters:

Returns:

chunked_document – List of document chunks.

Return type:

List of str
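A minimal sketch of random chunking, assuming the number of chunks is chosen so that roughly chunk_len_coverage_ratio of the token count is covered; the parameter semantics here are assumptions, not the library's contract:

```python
import random

def random_chunks(tokens, chunk_length, chunk_len_coverage_ratio,
                  max_num_chunks, seed=None):
    """Draw fixed-length chunks starting at random positions."""
    rng = random.Random(seed)
    num_chunks = max(1, int(len(tokens) * chunk_len_coverage_ratio / chunk_length))
    if max_num_chunks is not None:
        num_chunks = min(num_chunks, max_num_chunks)
    last_start = max(0, len(tokens) - chunk_length)
    chunks = []
    for _ in range(num_chunks):
        start = rng.randint(0, last_start)
        chunks.append(" ".join(tokens[start:start + chunk_length]))
    return chunks

tokens = [f"w{i}" for i in range(20)]
chunks = random_chunks(tokens, chunk_length=5, chunk_len_coverage_ratio=1.0,
                       max_num_chunks=None, seed=0)
print(len(chunks))  # 4 chunks, each of 5 tokens
```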