Octopod Text¶
The text aspects of Octopod are housed here. This includes sample model architectures and a dataset class.
Model Architectures¶
class octopod.text.models.multi_task_bert.BertForMultiTaskClassification(config, pretrained_task_dict=None, new_task_dict=None, dropout=0.1)¶

PyTorch BERT class for multitask learning. This model allows you to load in some pretrained tasks in addition to creating new ones.
Examples
To instantiate a completely new instance of BertForMultiTaskClassification and load the weights into this architecture, you can use the from_pretrained method of the base class, specifying the name of the weights to load, e.g.:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    new_task_dict=new_task_dict
)
# DO SOME TRAINING
model.save(SOME_FOLDER, SOME_MODEL_ID)
To instantiate an instance of BertForMultiTaskClassification that has layers for pretrained tasks as well as new tasks, you would do the following:

model = BertForMultiTaskClassification.from_pretrained(
    'bert-base-uncased',
    pretrained_task_dict=pretrained_task_dict,
    new_task_dict=new_task_dict
)
model.load(SOME_FOLDER, SOME_MODEL_ID)
# DO SOME TRAINING
- Parameters
config (json file) – Defines the BERT model architecture. Note: you will most likely be instantiating the class with the from_pretrained method so you don’t need to come up with your own config.
pretrained_task_dict (dict) – dictionary mapping each pretrained task to the number of labels it has
new_task_dict (dict) – dictionary mapping each new task to the number of labels it has
dropout (float) – dropout percentage for Dropout layer
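As the parameter descriptions above note, each task dictionary maps a task name to its number of labels. A minimal sketch (the task names here are hypothetical, not from Octopod):

```python
# Hypothetical task dictionaries: each key is a task name and each
# value is the number of labels that task predicts, as described above.
new_task_dict = {
    'sentiment': 3,  # e.g. negative / neutral / positive
    'priority': 2,   # e.g. low / high
}

pretrained_task_dict = {
    'topic': 5,  # a previously trained 5-label task
}
```

These are the dictionaries you would pass as pretrained_task_dict and new_task_dict when calling from_pretrained.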
export(folder, model_id, model_name=None)¶

Exports the entire model state dict to a specific folder.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers attribute before being saved.
- Parameters
folder (str or Path) – place to store state dictionaries
model_id (int) – unique id for this model
model_name (str (defaults to None)) – name to store the model under; if None, defaults to multi_task_bert_{model_id}.pth
Side Effects
- saves one file:
folder / model_name
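The default file-name rule described above can be sketched as a small helper (hypothetical, not part of Octopod):

```python
from pathlib import Path

def default_export_path(folder, model_id, model_name=None):
    # Mirrors the documented behavior: if model_name is None, the file
    # defaults to multi_task_bert_{model_id}.pth inside the folder.
    if model_name is None:
        model_name = f'multi_task_bert_{model_id}.pth'
    return Path(folder) / model_name
```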
forward(tokenized_input)¶

Defines the forward pass for the BERT model.

- Parameters
tokenized_input (torch tensor of integers) – integers represent tokens for each word
- Returns
A dictionary mapping each task to its logits
- Return type
dict
freeze_bert()¶

Freeze all core Bert layers.
freeze_pretrained_classifiers_and_bert()¶

Freeze pretrained classifier layers and core Bert layers.
import_model(folder, file)¶

Imports the entire model state dict from a specific folder.

Note: to produce a file that can be loaded with this method, use the export method.

- Parameters
folder (str or Path) – place where the exported state dictionary is stored
file (str) – filename of the exported model object
load(folder, model_id)¶

Loads the model state dicts from a specific folder.
- Parameters
folder (str or Path) – place where state dictionaries are stored
model_id (int) – unique id for this model
Side Effects
- loads from three files:
folder / f'bert_dict_{model_id}.pth'
folder / f'dropout_dict_{model_id}.pth'
folder / f'pretrained_classifiers_dict_{model_id}.pth'
save(folder, model_id)¶

Saves the model state dicts to a specific folder. Each part of the model is saved separately to allow for new classifiers to be added later.

Note: if the model has pretrained_classifiers and new_classifiers, they will be combined into the pretrained_classifiers_dict.
- Parameters
folder (str or Path) – place to store state dictionaries
model_id (int) – unique id for this model
Side Effects
- saves three files:
folder / f'bert_dict_{model_id}.pth'
folder / f'dropout_dict_{model_id}.pth'
folder / f'pretrained_classifiers_dict_{model_id}.pth'
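Per the Side Effects above, save writes (and load later reads) three files whose names are derived from model_id. A sketch of that naming scheme (hypothetical helper, not part of Octopod):

```python
from pathlib import Path

def state_dict_paths(folder, model_id):
    # The three files listed in the Side Effects sections for save/load.
    return [
        Path(folder) / f'bert_dict_{model_id}.pth',
        Path(folder) / f'dropout_dict_{model_id}.pth',
        Path(folder) / f'pretrained_classifiers_dict_{model_id}.pth',
    ]
```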
unfreeze_pretrained_classifiers()¶

Unfreeze pretrained classifier layers.
unfreeze_pretrained_classifiers_and_bert()¶

Unfreeze pretrained classifier layers and core Bert layers.
Dataset¶
class octopod.text.dataset.OctopodTextDataset(x, y, tokenizer, max_seq_length=128)¶

Load data for use with a BERT model.
- Parameters
x (pandas Series) – the text to be used
y (list) – a list of dummy-encoded or string categories; string categories will be encoded using an sklearn label encoder
tokenizer (pretrained BERT Tokenizer) – BERT tokenizer likely from transformers
max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
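The y parameter above says string categories are encoded with an sklearn label encoder. A pure-Python sketch of that behavior (sklearn's LabelEncoder assigns indices to the sorted unique classes):

```python
def encode_labels(y):
    # Sketch of sklearn LabelEncoder.fit_transform: classes are the
    # sorted unique labels; each label maps to its class index.
    classes = sorted(set(y))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in y], classes

encoded, classes = encode_labels(['dog', 'cat', 'dog', 'bird'])
# classes == ['bird', 'cat', 'dog'], encoded == [2, 1, 2, 0]
```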
class octopod.text.dataset.OctopodTextDatasetMultiLabel(x, y, tokenizer, max_seq_length=128)¶

Multi-label subclass of OctopodTextDataset.
- Parameters
x (pandas Series) – the text to be used
y (list) – a list of lists of binary-encoded categories, or of string categories, with length equal to the number of classes in the multi-label task. For a 4-class multi-label task, a sample binary-encoded list would be [1, 0, 0, 1]. A string example would be ['cat', 'dog'] (if the classes were ['cat', 'frog', 'rabbit', 'dog']), which will be encoded using an sklearn label encoder to [1, 0, 0, 1].
tokenizer (pretrained BERT Tokenizer) – BERT tokenizer likely from transformers
max_seq_length (int (defaults to 128)) – Maximum number of tokens to allow
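The binary encoding described for y can be sketched as follows, using the classes from the example above (hypothetical helper, not part of Octopod):

```python
def binarize(sample_labels, classes):
    # 1 if the class appears in this sample's label list, else 0.
    return [1 if c in sample_labels else 0 for c in classes]

classes = ['cat', 'frog', 'rabbit', 'dog']
binarize(['cat', 'dog'], classes)  # -> [1, 0, 0, 1]
```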