gsitk is a library built on top of scikit-learn that eases the development of machine-learning-driven NLP projects. It relies on numpy, pandas and related libraries to simplify development.
gsitk manages datasets, features, classifiers and evaluation techniques, so that writing an evaluation pipeline is fast and simple.
Installation and use
gsitk can be installed via pip, which is the recommended way:
pip install gsitk
Alternatively, gsitk can be installed by cloning this repository.
gsitk saves datasets and some other necessary resources to disk.
By default, all these data are stored in
The environment variable $DATA_PATH can be set to specify an alternative directory.
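As a minimal sketch, the variable can also be set from Python. This assumes that gsitk reads DATA_PATH when it is imported, which is why the variable is set first; the directory name `gsitk_data` is hypothetical.

```python
import os
import tempfile

# Hypothetical location; any writable directory works.
data_dir = os.path.join(tempfile.gettempdir(), "gsitk_data")
os.makedirs(data_dir, exist_ok=True)

# Assumption: gsitk reads DATA_PATH at import time, so set it before importing.
os.environ["DATA_PATH"] = data_dir

# import gsitk  # uncomment once gsitk is installed
```

Setting the variable in the shell before launching Python has the same effect and avoids any dependency on import order.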
SIMON feature extractor
gsitk includes the implementation of the SIMON feature extractor. To use it, two things are needed:
- A sentiment lexicon
- A word embeddings model that is gensim compatible.
    from gsitk.features import simon
    from nltk.corpus import opinion_lexicon
    from gensim.models.keyedvectors import KeyedVectors

    # Sentiment lexicon: one list of positive words and one of negative words
    lexicon = [list(opinion_lexicon.positive()), list(opinion_lexicon.negative())]

    # Word embeddings model loaded through gensim
    embedding_model = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True
    )

    simon_transformer = simon.Simon(
        lexicon=lexicon, n_lexicon_words=200, embedding=embedding_model
    )
    # simon_transformer has the fit() and transform() methods, so it can be used in a Pipeline
To improve performance, it is recommended to use a more complete scikit-learn pipeline that applies normalization and feature selection in conjunction with the SIMON feature extraction.
    from gsitk.features import simon

    # lexicon and embedding_model are defined as in the previous example
    simon_model = simon.Simon(
        lexicon=lexicon, n_lexicon_words=200, embedding=embedding_model
    )
    model = simon.simon_pipeline(simon_transformer=simon_model, percentile=25)
    # model also implements fit() and transform()
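To illustrate what percentile-based feature selection and normalization do to a feature matrix, here is a rough sketch using only standard scikit-learn components. It is not gsitk's implementation of simon_pipeline: the choice of SelectPercentile, Normalizer and LogisticRegression, the step order, and the random data standing in for SIMON features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Random data standing in for SIMON features (assumption: 40 documents, 20 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = np.array([0, 1] * 20)

pipeline = Pipeline([
    # Keep the 25% of features with the highest ANOVA F-score
    ("select", SelectPercentile(f_classif, percentile=25)),
    # Scale each sample (row) to unit L2 norm
    ("normalize", Normalizer()),
    # Hypothetical downstream classifier
    ("classify", LogisticRegression()),
])
pipeline.fit(X, y)

# Output of the preprocessing steps only (everything except the classifier)
selected = pipeline[:-1].transform(X)
predictions = pipeline.predict(X)
```

With 20 input features and percentile=25, the selection step keeps 5 columns, so a lower percentile trades feature coverage for a smaller, less noisy input to the classifier.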