text_classification_query()
Automatically fits a text classification model to your dataset. All standard text preprocessing steps are applied automatically where applicable. Stored as 'text_classification' in the models dictionary.
Dataset Guidelines
- One column in the file should contain the text to be classified.
- One column should contain the label for each text and SHOULD BE NAMED 'label'. If it is named something else, specify that name with the label_column parameter.
- Your instruction should be about the text you want to classify. For example, if you have a tweet column (containing the text) and a sentiment column (with 0-2 representing mood), your instruction should be about the tweet: 'please perform analysis on tweets'. A sketch of a matching dataset follows this list.
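For illustration, a minimal CSV matching these guidelines might be built like this. The file name and column names are hypothetical; because the label column here is not named 'label', a query on this file would need label_column='sentiment':

```python
import pandas as pd

# Illustrative dataset following the guidelines above: one text column,
# one label column. File and column names are hypothetical.
pd.DataFrame({
    'tweet': ['loving this library', 'worst update ever', 'it is okay I guess'],
    'sentiment': [2, 0, 1],  # 0-2 representing mood, as in the guideline above
}).to_csv('tweets.csv', index=False)
```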
| Argument | Description |
| --- | --- |
| instruction=None | An English-language statement of the task you would like completed, e.g. 'predict the median house value' or 'please estimate the number of households'. Should correspond to the column of text that you want to classify. |
| label_column='label' | The name of the column containing your labels. If the name is already 'label' (or similar), this parameter is not required. |
| preprocess=True | Whether you want your dataset to be intelligently preprocessed. |
| test_size=0.2 | The proportion of your dataset that is used for testing. |
| random_state=49 | The random seed used for reproducibility. |
| learning_rate=1e-2 | The rate at which your model learns via gradient descent. |
| epochs=20 | Number of epochs, applied to every model created in the process. |
| monitor='val_loss' | The metric that the query should minimize/maximize. The default setting tries to minimize your validation loss. |
| batch_size=32 | The number of data points provided to your model in every pass. |
| max_text_length=200 | The maximum text length used for classification; longer text is truncated. |
| max_features=20000 | The size of the input embedding layer in the model. |
| generate_plots=True | Whether you want libra to create accuracy and loss plots for you. |
| save_model=False | Whether you want the model weights and architecture saved as .json and .h5 files. |
| save_path=os.getcwd() | Where the save_model output should be stored. Defaults to the current working directory. |
```python
from libra import client

new_client = client('path_to_csv')
new_client.text_classification_query('Please estimate the sentiment')
new_client.classify_text('new text to classify')
```
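Building on that, here is a hedged sketch of a call that overrides some of the defaults from the argument table. 'tweets.csv' and its column names come from the hypothetical dataset sketch earlier, and the last line assumes the fitted model is reachable through the client's models dictionary, as described above:

```python
from libra import client

new_client = client('tweets.csv')  # hypothetical file from the sketch above
new_client.text_classification_query(
    'Please estimate the sentiment',
    label_column='sentiment',  # required: the label column is not named 'label'
    epochs=20,
    test_size=0.2,
    save_model=True,  # writes the architecture as .json and the weights as .h5
)
# Per the description above, the result is stored under 'text_classification'.
print(new_client.models['text_classification'])
```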
summarization_query()
Automatically fits a transfer-learning document summarization model to your dataset. The model has frozen layers with pretrained weights to help with small dataset sizes. Stored as 'doc_summarization' in the models dictionary.
Dataset Guidelines
- The text that you want summarized should be the target of the instruction. For example, if you want to summarize tweets, the instruction could be 'summarize long textual tweets'.
- The resulting summary should be in a column called 'summary'. THIS IS ESSENTIAL.
- Your instruction should be about the text column, not the summary column. A sketch of a matching dataset follows this list.
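An illustrative dataset meeting these guidelines might look like the following; the file name and text-column name are hypothetical, and only the 'summary' column name is mandatory:

```python
import pandas as pd

# Illustrative dataset: the text to summarize plus the required 'summary' column.
pd.DataFrame({
    'original_text': ['A long multi-paragraph tweet thread about model training...'],
    'summary': ['Thread about model training'],  # column MUST be named 'summary'
}).to_csv('articles.csv', index=False)
```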
| Argument | Description |
| --- | --- |
| drop=None | Columns to drop manually, e.g. columns with links or badly formatted numbers. |
| epochs=10 | Number of epochs, applied to every model created in the process. |
| batch_size=32 | The number of data points provided to your model in every pass. |
| learning_rate=1e-2 | The rate at which your model learns via gradient descent. |
| max_text_length=512 | The maximum length of text that can be summarized. |
| max_summary_length=150 | Maximum length of the output summary. The longer this is, the less accurate it tends to be. |
| gpu=False | Whether to use the GPU instead of the built-in CPU. |
| generate_plots=True | Whether you want libra to create accuracy and loss plots for you. |
| save_model=False | Whether you want the model weights and architecture saved as .json and .h5 files. |
| save_path=os.getcwd() | Where the save_model output should be stored. Defaults to the current working directory. |
```python
from libra import client

newClient = client('path_to_csv')
newClient.summarization_query("Please summarize original text")
newClient.get_summary('new text to summarize')
```
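As a hedged variation, the sketch below uses the hypothetical 'articles.csv' from the dataset sketch together with a few non-default arguments from the table, and assumes the fitted model is reachable through the client's models dictionary under the 'doc_summarization' key stated above:

```python
from libra import client

newClient = client('articles.csv')  # hypothetical file from the sketch above
newClient.summarization_query(
    'Please summarize original text',
    max_summary_length=100,  # shorter summaries tend to be more accurate
    gpu=True,                # use a GPU if one is available
)
print(newClient.models['doc_summarization'])
```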
image_caption_query()
Automatically fits a caption-generation transfer learning model to your dataset. The model has frozen layers with pretrained weights to help with small dataset sizes. Stored as 'image_caption' in the models dictionary.
Dataset Guidelines
- One column in the file should be the path to each image. This column is found automatically.
- The target of your instruction should be the caption column. For example, if your caption column is called short tweets, your instruction could be 'please shorten this text into short tweets'. A sketch of a matching dataset follows this list.
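An illustrative dataset meeting these guidelines might be built like this; all file, path, and column names are hypothetical:

```python
import pandas as pd

# Illustrative dataset: an image-path column (found automatically) and a
# caption column that the instruction targets.
pd.DataFrame({
    'image_path': ['images/cat.jpg', 'images/dog.jpg'],
    'short tweets': ['a cat napping on a couch', 'a dog running in the park'],
}).to_csv('captions.csv', index=False)
```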
| Argument | Description |
| --- | --- |
| instruction | An English-language statement of the task you would like completed, e.g. 'predict the median house value' or 'please estimate the number of households'. Should correspond to the column of captions in the dataset. |
| drop=None | Columns to drop manually, e.g. columns with links or badly formatted numbers. |
| epochs=10 | Number of epochs, applied to every model created in the process. |
| preprocess=True | Whether you want your dataset to be intelligently preprocessed. |
| random_state=49 | The random seed used for reproducibility. |
| buffer_size=1000 | Maximum number of elements that will be buffered while selecting captions. |
| embedding_dim=256 | Sets the size of the word embedding mapping. |
| units=512 | The number of recurrent units in the decoder. |
| gpu=False | Whether to use the GPU instead of the built-in CPU. |
| generate_plots=True | Whether you want libra to create accuracy and loss plots for you. |
```python
from libra import client

newClient = client('path_to_csv')
newClient.image_caption_query('Generate image captions')
newClient.generate_caption('path to image')
```
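The hedged sketch below uses the hypothetical 'captions.csv' from the dataset sketch and sets a few decoder-related arguments from the table above; the values shown are just the documented defaults made explicit:

```python
from libra import client

newClient = client('captions.csv')  # hypothetical file from the sketch above
newClient.image_caption_query(
    'please shorten this text into short tweets',
    epochs=10,
    embedding_dim=256,  # size of the word embedding mapping
    units=512,          # recurrent units in the decoder
)
newClient.generate_caption('images/cat.jpg')  # caption a new image
```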
generate_text()
Automatically generates text of a specified length from an initial prefix. Stored as 'generated_text' in the models dictionary.
Dataset Guidelines
- Provide a text file containing the initial text that you want to continue from,
- OR pass the prefix you want directly via the prefix argument.
| Argument | Description |
| --- | --- |
| file_data=True | Whether the file you provided to the client should be used as the prefix for generating text. |
| prefix=None | If file_data is False, the prefix string you would like to use. |
| tempreture=0.3 | Sampling temperature (float); lower values make the next-word probability distribution sharper. |
| maxLength=512 | The desired length of the generated text. |
| top_k=50 | Sample only from the k most probable next tokens. |
| top_p=0.9 | Nucleus sampling: sample from the smallest set of tokens whose cumulative probability exceeds p. |
| return_sequences=2 | How many different variations you want returned. |
```python
from libra import client

newClient = client('path_to_txt')
newClient.generate_text('generate text', file_data=False, prefix='Hello there!')
```
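To show how the sampling arguments fit together, here is a hedged sketch that requests two variations from the same prefix. The parameter names are spelled exactly as they appear in the argument table above, and the values are illustrative:

```python
from libra import client

newClient = client('path_to_txt')
newClient.generate_text(
    'generate text',
    file_data=False,
    prefix='Hello there!',
    tempreture=0.3,      # lower values sharpen the next-word distribution
    top_k=50,            # sample only from the 50 most probable tokens
    top_p=0.9,           # nucleus sampling with cumulative probability 0.9
    return_sequences=2,  # return two different variations
)
```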
named_entity_query()
Automatically detects named entities, such as person names, geographic locations, organizations/companies, and addresses, from the text in the column your instruction targets. Stored as 'named_entity_recognition' in the models dictionary.
Dataset Guidelines
- The column that you want to extract named entities from should be the target of the instruction. For example, if you want to extract entities from a text column, the instruction could be 'extract ner from text', or simply 'text'. A sketch of a matching dataset follows this list.
- Your instruction should refer to the column name, not the text content itself.
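An illustrative dataset meeting these guidelines could be as small as a single text column; the file and column names here are hypothetical:

```python
import pandas as pd

# Illustrative dataset: one text column to extract named entities from.
pd.DataFrame({
    'text': ['Tim Cook announced new offices in Austin, Texas.'],
}).to_csv('entities.csv', index=False)
```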
| Argument | Description |
| --- | --- |
| instruction | An English-language statement of the task you would like completed. Should correspond to the column of text you want to extract named entities from. |
```python
from libra import client

newClient = client('path_to_csv')
newClient.named_entity_query('detect from text')
```