Model Specific Functions

This module contains wrappers that load models, tokenize inputs, run inference, and format outputs for the models included by default in the model_types list, so that they can all be called through a uniform API.

This is where new models can be included in the code.

model_types

This script reads the model types from model_types.csv.

load_models

This file contains functions to load models, tokenizers, etc. Because not all models come from Hugging Face and not all of them may be installed, the required imports are placed directly inside the corresponding loading functions.

activation_extractor.model_functions.load_models.load_model(model_name, model_type, **kwargs)[source]

Loads a PyTorch model according to the passed model name. For sequence models, it also loads the corresponding tokenizer. For image models, it loads the image processor.

Parameters:

model_name (str) – the name of the model to load
model_type (str) – A model type (see list of included models).

Returns:

tuple with (model, tokenizer) or (model, processor).

activation_extractor.model_functions.load_models.load_tokenizer(model_name, tokenizer_type, **kwargs)[source]

Loads a tokenizer of the given type for a model. This function is called inside load_model() for sequence models.

Parameters:

model_name (str) – the model name (for Hugging Face models it should be the same as the name of the loaded model)
tokenizer_type (str) – the type of tokenizer (valid types: AutoTokenizer and T5Tokenizer)

Returns:

the tokenizer object
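The deferred imports used by this module, combined with the two valid tokenizer types, can be illustrated with a minimal sketch. It is not the library's implementation: the name load_tokenizer_sketch is hypothetical, and it assumes both tokenizer classes come from the transformers package and are created with from_pretrained().

```python
def load_tokenizer_sketch(model_name, tokenizer_type, **kwargs):
    """Illustrative stand-in for a tokenizer loader (hypothetical name)."""
    if tokenizer_type == "AutoTokenizer":
        # Imported here, not at module level, so that transformers is only
        # required when this branch is actually used.
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(model_name, **kwargs)
    if tokenizer_type == "T5Tokenizer":
        from transformers import T5Tokenizer

        return T5Tokenizer.from_pretrained(model_name, **kwargs)
    # Unknown types fail early, before any optional package is imported.
    raise ValueError(
        f"Unknown tokenizer_type {tokenizer_type!r}; "
        "expected 'AutoTokenizer' or 'T5Tokenizer'"
    )
```

With this layout, importing the module never fails because of a missing optional package; the ImportError only appears when the corresponding branch is requested.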

tokenize_funs

Defines a tokenizer wrapper function for the models included by default.

activation_extractor.model_functions.tokenize_funs.define_tokenize_function(model_type, tokenizer, device=None)[source]

Define the right function to tokenize the inputs based on the model type. This function is called inside inferencer.tokenizer().

Parameters:

model_type (str) – the model type (from the list in activation_extractor.model_functions.model_types)
tokenizer – the loaded tokenizer object

Returns:

the function used to tokenize the inputs
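A function that returns a tokenize function is essentially a closure factory. The sketch below shows the idea under stated assumptions: the model type names "callable_tokenizer" and "encode_method", the ToyTokenizer class, and the function name are all hypothetical and exist only for illustration.

```python
class ToyTokenizer:
    """Stand-in tokenizer: maps each character to its code point."""

    def __call__(self, sequences):
        return {"input_ids": [[ord(c) for c in s] for s in sequences]}

    def encode(self, sequence):
        return [ord(c) for c in sequence]


def define_tokenize_function_sketch(model_type, tokenizer, device=None):
    """Return a tokenize(sequences) function suited to model_type.

    device is kept only to mirror the documented signature; this sketch
    does not use it.
    """
    if model_type == "callable_tokenizer":  # hypothetical type name
        def tokenize(sequences):
            # Batch call, as with tokenizers that accept a list of strings.
            return tokenizer(sequences)
    elif model_type == "encode_method":  # hypothetical type name
        def tokenize(sequences):
            # Per-sequence call, for tokenizers exposing only encode().
            return [tokenizer.encode(s) for s in sequences]
    else:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return tokenize


tokenize = define_tokenize_function_sketch("callable_tokenizer", ToyTokenizer())
print(tokenize(["AC"]))
```

The caller always uses tokenize(sequences) and never needs to know which tokenizer interface sits underneath.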

inference_funs

This file defines an inferencer wrapper for the included models.

activation_extractor.model_functions.inference_funs.define_inference_function(model_type, model, tokenizer, device)[source]

Define the right function to do inference based on the model type. The resulting function is called as inferencer.inference(). The functions move the tokenized input to device before performing inference.

Parameters:

model_type (str) – the model type (from the list in activation_extractor.model_functions.model_types)
model – the loaded PyTorch model
tokenizer – the loaded tokenizer object
device (str) – the device (cpu, cuda…)

Returns:

the function used to do the inference
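The "move to device, then run the model" behaviour can be sketched without PyTorch by using stand-in objects. FakeTensor, FakeModel, and the function name below are hypothetical; with a real model, the forward call would typically also be wrapped in torch.no_grad().

```python
class FakeTensor:
    """Stand-in for a tensor; records which device it lives on."""

    def __init__(self, values, device="cpu"):
        self.values = values
        self.device = device

    def to(self, device):
        # Like a real tensor, return a copy placed on the target device.
        return FakeTensor(self.values, device)


class FakeModel:
    """Stand-in model: reports the device of the input it received."""

    def __call__(self, input_ids):
        return {"seen_device": input_ids.device, "n_tokens": len(input_ids.values)}


def define_inference_function_sketch(model, device):
    """Return an inference(tokenized) function bound to model and device."""

    def inference(tokenized):
        # Move every tokenized field to the target device before the call.
        moved = {name: value.to(device) for name, value in tokenized.items()}
        return model(**moved)

    return inference


inference = define_inference_function_sketch(FakeModel(), "cuda:0")
print(inference({"input_ids": FakeTensor([1, 2, 3])}))
```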

default_hooked_layers

This file contains functions to get relevant layer (module) names to hook from the models included by default.

activation_extractor.model_functions.default_hooked_layers.get_layers_to_hook(model, model_type, modality='sequence', return_structure=False)[source]

Get a list of default layers to hook (extract activations from) for each model type.

Parameters:

model – the PyTorch model object
model_type (str) – A model type (protein - esm, prot_t5, ankh; dna - nucleotide-transformer, hyenadna, evo, caduceus).

Returns:

the list of layers/modules names

Return type:

list
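One way to build such a list is to filter the names yielded by the model's named_modules() method, which on a torch.nn.Module yields (name, module) pairs. The sketch below is an assumption about the approach, not the library's selection rule: select_layer_names, the regular expression, and ToyModel (a stand-in that only mimics named_modules()) are hypothetical.

```python
import re


def select_layer_names(model, pattern):
    """Return the names of all modules whose full name matches pattern."""
    regex = re.compile(pattern)
    return [name for name, _ in model.named_modules() if regex.fullmatch(name)]


class ToyModel:
    """Stand-in exposing named_modules() like torch.nn.Module does."""

    def named_modules(self):
        names = [
            "", "embed", "layers", "layers.0", "layers.0.attn",
            "layers.1", "layers.1.attn", "norm",
        ]
        for name in names:
            yield name, object()


# Keep only the top-level blocks, skipping their submodules.
print(select_layer_names(ToyModel(), r"layers\.\d+"))
```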

embedding_to_numpy

activation_extractor.model_functions.embedding_to_numpy.embedding_to_numpy(embeddings)[source]

Converts different types of module outputs to a NumPy array, handling the output formats of the different models. It also moves the data from GPU to CPU.

Parameters:

embeddings – intermediate output object from a PyTorch model layer/module.

Returns:

the intermediate output as a NumPy array

Return type:

numpy array
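A conversion of this kind might look like the sketch below. It rests on two assumptions that the documentation above does not confirm: that tuple outputs carry the tensor of interest in their first element, and that tensors expose detach(), cpu(), and numpy() as PyTorch tensors do. The names to_numpy_sketch and FakeTensor are hypothetical, and FakeTensor stands in for a real tensor so that the example runs without PyTorch.

```python
import numpy as np


def to_numpy_sketch(output):
    """Convert a module output to a NumPy array (illustrative only)."""
    # Assumption: for tuple outputs, the first element is the one of interest.
    if isinstance(output, tuple):
        output = output[0]
    # Tensor-like objects: drop the autograd graph, move to CPU, convert.
    if hasattr(output, "detach"):
        return output.detach().cpu().numpy()
    # Anything else (lists, arrays) is handed to NumPy directly.
    return np.asarray(output)


class FakeTensor:
    """Stand-in mimicking the tensor methods used above."""

    def __init__(self, data):
        self.data = data

    def detach(self):
        return self

    def cpu(self):
        return self

    def numpy(self):
        return np.array(self.data)


array = to_numpy_sketch((FakeTensor([[1.0, 2.0]]), None))
print(array.shape)
```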