
Classification of Literary Genres

Андрей Киселёв

Let's build, and then improve, a simple language-independent text classifier, starting from natural mathematical considerations.

February 9, 2016, Moscow Python Meetup No. 32


Transcript

  1. Training Set

     Genres: Fantasy, Classic, Murder Mystery, Space Fiction.

     Books: A Clash of Kings; Ten Little Niggers; Foundation; The Black Company; The Mystery of Edwin Drood; Hyperion; A Wizard of Earthsea; The Adventures of Sherlock Holmes; Starship Troopers; Dragonflight; Perry Mason stories; Dune; The King of Swords; The Murders in the Rue Morgue; Star Kings; Miecz przeznaczenia; The Suicide Club; Туманность Андромеды; The Lord of the Rings; Лунная Радуга; Люди как Боги.
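A minimal sketch of how such a training set might be wired up for the code on the following slides. The directory layout, genre names, and helper function here are illustrative assumptions; the talk does not specify how the files were organized:

```python
# Hypothetical layout: one plain-text file per book, grouped into one
# directory per genre under a common root. Not the speaker's actual layout.
from pathlib import Path

GENRES = ["fantasy", "classic", "murder_mystery", "space_fiction"]

def load_training_set(root):
    """Collect parallel lists of file paths and genre labels."""
    training_set, training_labels = [], []
    for genre in GENRES:
        for path in sorted(Path(root, genre).glob("*.txt")):
            training_set.append(str(path))
            training_labels.append(genre)
    return training_set, training_labels
```

The parallel lists `training_set` and `training_labels` are exactly what `CountVectorizer(input="filename")` and the chi² selection below expect.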
  2. Vectorization

     • CountVectorizer converts a collection of raw documents to a matrix of token counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from stop_words import get_stop_words

cv = CountVectorizer(input="filename", stop_words=get_stop_words("russian"))
corpus_tdm = cv.fit_transform(training_set)
```
  3. Feature Selection

     • The chi² test checks the independence of two events: occurrence of a term and occurrence of a class.

```python
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=10000)
tdmnew = ch2.fit_transform(corpus_tdm, training_labels)
```
  4. Euclidean Distance

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

euclidean_distance_matrix = pairwise_distances(tdmnew)
sns.heatmap(euclidean_distance_matrix, xticklabels=filenames, yticklabels=filenames)
plt.show()  # sns.plt was removed from seaborn; call matplotlib's pyplot directly
```
  5. Normalization

     • A normalized vector points in the same direction as the original but has norm 1.

```python
import numpy as np
from sklearn.preprocessing import normalize

tdmnew = normalize(tdmnew.astype(np.float64), axis=1, norm="l2")
```
  6. Cosine Similarity

     • Cosine similarity is a measure of similarity between two vectors: the cosine of the angle between them.
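The definition above can be written directly with NumPy. This is a generic sketch of the formula ⟨a, b⟩ / (‖a‖·‖b‖), not code from the talk:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Orthogonal vectors score 0, parallel vectors score 1, regardless of their lengths. This is why the L2 normalization on the previous slide matters: after it, Euclidean distance and cosine similarity rank document pairs identically.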
  7.

```python
import numpy as np
from math import sqrt

ch2 = SelectKBest(chi2, k=10000)
tdmnew = ch2.fit_transform(corpus_tdm, training_labels)
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
names = np.asarray(cv.get_feature_names_out())[ch2.get_support()]

def new_metric_tensor(feature_names):
    # Pairwise feature similarity decays exponentially with the distance
    # between feature names; distance() is left undefined on the slide
    # (presumably a string edit distance between the terms).
    metric_tensor = np.ndarray(shape=(len(feature_names), len(feature_names)))
    for i in range(0, len(feature_names)):
        for j in range(i, len(feature_names)):
            feature_similarity = 1.01 ** (-distance(feature_names[i], feature_names[j]))
            metric_tensor[i, j] = feature_similarity
            metric_tensor[j, i] = feature_similarity
    return metric_tensor

def soft_cosine(a, b, metric_tensor):
    # Inner product <a, b> = a^T M b under the metric tensor M.
    # Returns the soft cosine *distance*, i.e. 1 minus the similarity.
    inner = lambda a, b: np.dot(np.transpose(a), metric_tensor).dot(b)
    return 1 - inner(a, b) / (sqrt(inner(a, a)) * sqrt(inner(b, b)))
```
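A quick self-contained sanity check of that inner product (the formula restated here, with hypothetical two-dimensional vectors): with the identity matrix as metric tensor, soft cosine distance reduces to the ordinary cosine distance, while a tensor that declares all features identical makes any two nonnegative vectors perfectly similar:

```python
import numpy as np

def soft_cosine_distance(a, b, S):
    """1 minus soft cosine similarity under metric tensor S: <a, b> = a^T S b."""
    inner = lambda x, y: x @ S @ y
    return 1 - inner(a, b) / (np.sqrt(inner(a, a)) * np.sqrt(inner(b, b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

identity = np.eye(2)        # features unrelated: ordinary cosine distance
all_same = np.ones((2, 2))  # every feature counts as the same feature
```

With `identity` the result is 1 − cos 45° ≈ 0.293; with `all_same` it is 0, because the inner product collapses to a product of total counts.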
  8. Nearest Centroid Classifier

     • Assigns the label of the class whose centroid (center of mass) is closest.
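The decision rule fits in a few lines. This is a generic illustration of the idea, not the talk's code; scikit-learn's `NearestCentroid`, used on the next slide, implements the same rule with more options:

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Assign x the label of the class whose centroid is closest (Euclidean)."""
    y_train = np.asarray(y_train)
    labels = sorted(set(y_train))
    # Centroid = mean of all training vectors belonging to the class.
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in labels}
    return min(labels, key=lambda c: np.linalg.norm(x - centroids[c]))
```

On normalized document vectors this amounts to comparing a new book against an "average book" of each genre.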
  9.

```python
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import NearestCentroid
from stop_words import get_stop_words

x_train, x_test, y_train, y_test = train_test_split(
    training_set, training_labels, test_size=0.86)

text_clf = Pipeline([
    ('vect', TfidfVectorizer(input='filename', stop_words=get_stop_words('russian'))),
    ('select', SelectKBest(chi2, k=10000)),
    ('clf', NearestCentroid()),
])
text_clf.fit(x_train, y_train)
print(text_clf.score(x_test, y_test))
```
  10. Inner Product Comparison

      Accuracy on the handpicked test set per genre, and overall under cross-validation:

                             Fantasy   Sci-Fi   Detective   Cross-validation
      Dot product            61.8%     60.4%    56.4%       63%
      Similarity matrix      79.6%     66.6%    57.8%       71.5%