Octopod Text

The text components of Octopod are housed here. This includes sample model architectures and a dataset class.

Model Architectures

class octopod.text.models.multi_task_bert.BertForMultiTaskClassification(config, pretrained_task_dict=None, new_task_dict=None, dropout=0.1)

PyTorch BERT class for multi-task learning. This model allows you to load in pretrained tasks in addition to creating new ones.

Examples

To instantiate a completely new instance of BertForMultiTaskClassification and load pretrained BERT weights into this architecture, you can use the from_pretrained method of the base class, specifying the name of the weights to load, e.g.:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    new_task_dict=new_task_dict
)

# DO SOME TRAINING

model.save(SOME_FOLDER, SOME_MODEL_ID)

To instantiate an instance of BertForMultiTaskClassification that has layers for pretrained tasks and new tasks, you would do the following:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    pretrained_task_dict=pretrained_task_dict,
    new_task_dict=new_task_dict
)

model.load(SOME_FOLDER, SOME_MODEL_ID)

# DO SOME TRAINING

Parameters
  • config (json file) – Defines the BERT model architecture. Note: you will most likely be instantiating the class with the from_pretrained method so you don’t need to come up with your own config.

  • pretrained_task_dict (dict) – dictionary mapping each pretrained task to the number of labels it has

  • new_task_dict (dict) – dictionary mapping each new task to the number of labels it has

  • dropout (float) – dropout percentage for Dropout layer
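
For reference, a task dictionary simply maps each task name to its number of labels. A minimal sketch (the task names here are hypothetical):

new_task_dict = {
    'sentiment': 2,  # hypothetical binary sentiment task
    'topic': 5,      # hypothetical five-way topic classification task
}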

export(folder, model_id, model_name=None)

Exports the entire model state dict to a specific folder.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers attribute before being saved.

Parameters
  • folder (str or Path) – place to store state dictionaries

  • model_id (int) – unique id for this model

  • model_name (str (defaults to None)) – Name to store the model under; if None, defaults to multi_task_bert_{model_id}.pth

Side Effects

saves one file:
  • folder / model_name
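
A minimal usage sketch (the folder and model_id values here are hypothetical):

model.export('saved_models', 7)  # saves saved_models/multi_task_bert_7.pth
model.export('saved_models', 7, model_name='my_model.pth')  # saves saved_models/my_model.pth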

forward(tokenized_input)

Defines the forward pass for the BERT model

Parameters

tokenized_input (torch tensor of integers) – integers representing the token ids of the input text

Returns

A dictionary mapping each task to its logits

Return type

dict
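
For example, a forward pass might look like the following sketch, assuming a transformers BERT tokenizer and the hypothetical 'sentiment' task from the earlier sketch:

import torch

token_ids = tokenizer.encode('octopod does multi-task learning')  # list of integer token ids
logits_dict = model(torch.tensor([token_ids]))  # batch containing one sequence
sentiment_logits = logits_dict['sentiment']  # one entry per task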

freeze_bert()

Freeze all core Bert layers

freeze_pretrained_classifiers_and_bert()

Freeze pretrained classifier layers and core Bert layers

import_model(folder, file)

Imports the entire model state dict from a specific folder.

Note: to export a model that was loaded via import_model, use the export method

Parameters
  • folder (str or Path) – place where the state dictionary is stored

  • file (str) – filename of the exported model object
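
For example, to re-import a model produced by export (names hypothetical):

model.import_model('saved_models', 'multi_task_bert_7.pth')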

load(folder, model_id)

Loads the model state dicts from a specific folder.

Parameters
  • folder (str or Path) – place where state dictionaries are stored

  • model_id (int) – unique id for this model

Side Effects

loads from three files:
  • folder / f'bert_dict_{model_id}.pth'

  • folder / f'dropout_dict_{model_id}.pth'

  • folder / f'pretrained_classifiers_dict_{model_id}.pth'

save(folder, model_id)

Saves the model state dicts to a specific folder. Each part of the model is saved separately to allow for new classifiers to be added later.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers_dict.

Parameters
  • folder (str or Path) – place to store state dictionaries

  • model_id (int) – unique id for this model

Side Effects

saves three files:
  • folder / f'bert_dict_{model_id}.pth'

  • folder / f'dropout_dict_{model_id}.pth'

  • folder / f'pretrained_classifiers_dict_{model_id}.pth'
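
A round-trip sketch (folder and model_id hypothetical); as in the examples above, load is called on a freshly instantiated model of the same architecture:

model.save('saved_models', 7)  # writes the three files listed above

# later, on a new BertForMultiTaskClassification instance:
model.load('saved_models', 7)  # reads the same three files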

unfreeze_pretrained_classifiers()

Unfreeze pretrained classifier layers

unfreeze_pretrained_classifiers_and_bert()

Unfreeze pretrained classifiers and core Bert layers
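
The freeze and unfreeze methods support a common two-stage fine-tuning schedule; a sketch (the training loops themselves are elided):

model.freeze_bert()  # stage 1: train only the classifier heads
# ... train for a few epochs ...

model.unfreeze_pretrained_classifiers_and_bert()  # stage 2: fine-tune everything
# ... continue training, typically with a lower learning rate ...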

Dataset

class octopod.text.dataset.OctopodTextDataset(x, y, tokenizer, max_seq_length=128)

Load data for use with a BERT model

Parameters
  • x (pandas Series) – the text to be used

  • y (list) – a list of dummy-encoded or string categories; string categories will be encoded using an sklearn label encoder

  • tokenizer (pretrained BERT Tokenizer) – BERT tokenizer, likely from the transformers library

  • max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
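
A minimal construction sketch, assuming a transformers tokenizer (the text and labels here are hypothetical):

import pandas as pd
from transformers import BertTokenizer

from octopod.text.dataset import OctopodTextDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

x = pd.Series(['great product', 'would not buy again'])
y = ['positive', 'negative']  # string categories, encoded internally via an sklearn label encoder

dataset = OctopodTextDataset(x, y, tokenizer, max_seq_length=128)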

class octopod.text.dataset.OctopodTextDatasetMultiLabel(x, y, tokenizer, max_seq_length=128)

Multi-label subclass of OctopodTextDataset

Parameters
  • x (pandas Series) – the text to be used

  • y (list) – a list of lists of binary-encoded categories or string categories, with length equal to the number of classes in the multi-label task. For a 4-class multi-label task, a sample binary-encoded label would be [1, 0, 0, 1]. A string example would be ['cat', 'dog'] (if the classes were ['cat', 'frog', 'rabbit', 'dog']), which will be encoded using an sklearn label encoder to [1, 0, 0, 1].

  • tokenizer (pretrained BERT Tokenizer) – BERT tokenizer, likely from the transformers library

  • max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
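
Similarly, a hypothetical multi-label construction, where each sample carries a list of string labels:

import pandas as pd
from transformers import BertTokenizer

from octopod.text.dataset import OctopodTextDatasetMultiLabel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

x = pd.Series(['a photo of a cat and a dog', 'a photo of a frog'])
y = [['cat', 'dog'], ['frog']]  # per-sample string labels, encoded internally to binary vectors

dataset = OctopodTextDatasetMultiLabel(x, y, tokenizer, max_seq_length=128)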