Octopod Text

The text components of Octopod are housed here. This includes sample model architectures and a dataset class.

Model Architectures

class octopod.text.models.multi_task_bert.BertForMultiTaskClassification(config, pretrained_task_dict=None, new_task_dict=None, dropout=0.1)

PyTorch BERT class for multi-task learning. This model allows you to load in pretrained tasks in addition to creating new ones.

Examples

To instantiate a completely new instance of BertForMultiTaskClassification and load pretrained BERT weights into this architecture, you can use the from_pretrained method of the base class, specifying the name of the weights to load, e.g.:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    new_task_dict=new_task_dict
)

# DO SOME TRAINING

model.save(SOME_FOLDER, SOME_MODEL_ID)

To instantiate an instance of BertForMultiTaskClassification that has layers for pretrained tasks and new tasks, you would do the following:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    pretrained_task_dict=pretrained_task_dict,
    new_task_dict=new_task_dict
)

model.load(SOME_FOLDER, SOME_MODEL_ID)

# DO SOME TRAINING

Parameters
  • config (json file) – Defines the BERT model architecture. Note: you will most likely be instantiating the class with the from_pretrained method so you don’t need to come up with your own config.

  • pretrained_task_dict (dict) – dictionary mapping each pretrained task to the number of labels it has

  • new_task_dict (dict) – dictionary mapping each new task to the number of labels it has

  • dropout (float) – dropout percentage for Dropout layer
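
For reference, a task dictionary simply maps each task name to its number of labels. A minimal sketch (the task names here are hypothetical):

new_task_dict = {
    'sentiment': 2,  # hypothetical binary sentiment task
    'topic': 5,      # hypothetical five-way topic classification task
}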

export(folder, model_id, model_name=None)

Exports the entire model state dict to a specific folder.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers attribute before being saved.

Parameters
  • folder (str or Path) – place to store state dictionaries

  • model_id (int) – unique id for this model

  • model_name (str (defaults to None)) – Name to store the model under; if None, defaults to multi_task_bert_{model_id}.pth

Side Effects

saves one file:
  • folder / model_name
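
A minimal usage sketch (the folder and model_id values here are hypothetical):

model.export('saved_models', 7)  # saves saved_models/multi_task_bert_7.pth
model.export('saved_models', 7, model_name='my_model.pth')  # saves saved_models/my_model.pth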

forward(tokenized_input)

Defines the forward pass for the BERT model

Parameters

tokenized_input (torch tensor of integers) – integers representing the token ids of the input text

Returns

A dictionary mapping each task to its logits

Return type

dict
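
For example, a forward pass might look like the following sketch, assuming a transformers BERT tokenizer and the hypothetical 'sentiment' task from the earlier sketch:

import torch

token_ids = tokenizer.encode('octopod does multi-task learning')  # list of integer token ids
logits_dict = model(torch.tensor([token_ids]))  # batch containing one sequence
sentiment_logits = logits_dict['sentiment']  # one entry per task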

freeze_bert()

Freeze all core Bert layers

freeze_pretrained_classifiers_and_bert()

Freeze pretrained classifier layers and core Bert layers

import_model(folder, file)

Imports the entire model state dict from a specific folder.

Note: to export a model that was loaded via import_model, use the export method

Parameters
  • folder (str or Path) – place where the state dictionary is stored

  • file (str) – filename of the exported model object
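
For example, to re-import a model produced by export (names hypothetical):

model.import_model('saved_models', 'multi_task_bert_7.pth')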

load(folder, model_id)

Loads the model state dicts from a specific folder.

Parameters
  • folder (str or Path) – place where state dictionaries are stored

  • model_id (int) – unique id for this model

Side Effects

loads from three files:
  • folder / f'bert_dict_{model_id}.pth'

  • folder / f'dropout_dict_{model_id}.pth'

  • folder / f'pretrained_classifiers_dict_{model_id}.pth'

save(folder, model_id)

Saves the model state dicts to a specific folder. Each part of the model is saved separately to allow for new classifiers to be added later.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers_dict.

Parameters
  • folder (str or Path) – place to store state dictionaries

  • model_id (int) – unique id for this model

Side Effects

saves three files:
  • folder / f'bert_dict_{model_id}.pth'

  • folder / f'dropout_dict_{model_id}.pth'

  • folder / f'pretrained_classifiers_dict_{model_id}.pth'
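
A round-trip sketch (folder and model_id hypothetical); as in the examples above, load is called on a freshly instantiated model of the same architecture:

model.save('saved_models', 7)  # writes the three files listed above

# later, on a new BertForMultiTaskClassification instance:
model.load('saved_models', 7)  # reads the same three files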

unfreeze_pretrained_classifiers()

Unfreeze pretrained classifier layers

unfreeze_pretrained_classifiers_and_bert()

Unfreeze pretrained classifiers and core Bert layers
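
The freeze and unfreeze methods support a common two-stage fine-tuning schedule; a sketch (the training loops themselves are elided):

model.freeze_bert()  # stage 1: train only the classifier heads
# ... train for a few epochs ...

model.unfreeze_pretrained_classifiers_and_bert()  # stage 2: fine-tune everything
# ... continue training, typically with a lower learning rate ...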

Dataset

class octopod.text.dataset.OctopodTextDataset(x, y, tokenizer, max_seq_length=128)

Load data for use with a BERT model

Parameters
  • x (pandas Series) – the text to be used

  • y (list) – a list of dummy-encoded or string categories; string categories will be encoded using an sklearn label encoder

  • tokenizer (pretrained BERT Tokenizer) – BERT tokenizer, likely from the transformers library

  • max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
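
A minimal construction sketch, assuming a transformers tokenizer (the text and labels here are hypothetical):

import pandas as pd
from transformers import BertTokenizer

from octopod.text.dataset import OctopodTextDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

x = pd.Series(['great product', 'would not buy again'])
y = ['positive', 'negative']  # string categories, encoded internally via an sklearn label encoder

dataset = OctopodTextDataset(x, y, tokenizer, max_seq_length=128)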

class octopod.text.dataset.OctopodTextDatasetMultiLabel(x, y, tokenizer, max_seq_length=128)

Multi-label subclass of OctopodTextDataset

Parameters
  • x (pandas Series) – the text to be used

  • y (list) – a list of lists of binary-encoded categories or string categories, with length equal to the number of classes in the multi-label task. For a 4-class multi-label task, a sample binary-encoded label would be [1, 0, 0, 1]. A string example would be ['cat', 'dog'] (if the classes were ['cat', 'frog', 'rabbit', 'dog']), which will be encoded using an sklearn label encoder to [1, 0, 0, 1].

  • tokenizer (pretrained BERT Tokenizer) – BERT tokenizer, likely from the transformers library

  • max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
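
Similarly, a hypothetical multi-label construction, where each sample carries a list of string labels:

import pandas as pd
from transformers import BertTokenizer

from octopod.text.dataset import OctopodTextDatasetMultiLabel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

x = pd.Series(['a photo of a cat and a dog', 'a photo of a frog'])
y = [['cat', 'dog'], ['frog']]  # per-sample string labels, encoded internally to binary vectors

dataset = OctopodTextDatasetMultiLabel(x, y, tokenizer, max_seq_length=128)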