
Classification of Literary Genres

Андрей Киселёв

Let's build, and then improve, a simple language-independent text classifier, starting from natural mathematical considerations.

February 9, 2016, Moscow Python Meetup No. 32


Transcript

  1. Training Set

     Genres: Fantasy, Classic, Murder Mystery, Space Fiction.

     Books: A Clash of Kings; Ten Little Niggers; Foundation; The Black Company; The Mystery of Edwin Drood; Hyperion; A Wizard of Earthsea; The Adventures of Sherlock Holmes; Starship Troopers; Dragonflight; Perry Mason stories; Dune; The King of Swords; The Murders in the Rue Morgue; Star Kings; Miecz przeznaczenia; The Suicide Club; Туманность Андромеды; The Lord of the Rings; Лунная Радуга; Люди как Боги.
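A minimal sketch of how such a training set might be wired up for the code on the following slides. The directory layout, genre names, and helper function here are illustrative assumptions; the talk does not specify how the files were organized:

```python
# Hypothetical layout: one plain-text file per book, grouped into one
# directory per genre under a common root. Not the speaker's actual layout.
from pathlib import Path

GENRES = ["fantasy", "classic", "murder_mystery", "space_fiction"]

def load_training_set(root):
    """Collect parallel lists of file paths and genre labels."""
    training_set, training_labels = [], []
    for genre in GENRES:
        for path in sorted(Path(root, genre).glob("*.txt")):
            training_set.append(str(path))
            training_labels.append(genre)
    return training_set, training_labels
```

The parallel lists `training_set` and `training_labels` are exactly what `CountVectorizer(input="filename")` and the chi² selection below expect.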
  2. Vectorization

     • CountVectorizer converts a collection of raw documents to a matrix of token counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from stop_words import get_stop_words

cv = CountVectorizer(input="filename", stop_words=get_stop_words("russian"))
corpus_tdm = cv.fit_transform(training_set)
```
  3. Feature Selection

     • The chi² test checks the independence of two events: occurrence of a term and occurrence of a class.

```python
from sklearn.feature_selection import SelectKBest, chi2

ch2 = SelectKBest(chi2, k=10000)
tdmnew = ch2.fit_transform(corpus_tdm, training_labels)
```
  4. Euclidean Distance

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

euclidean_distance_matrix = pairwise_distances(tdmnew)
sns.heatmap(euclidean_distance_matrix, xticklabels=filenames, yticklabels=filenames)
plt.show()  # sns.plt was removed from seaborn; call matplotlib's pyplot directly
```
  5. Normalization

     • A normalized vector points in the same direction as the original but has norm 1.

```python
import numpy as np
from sklearn.preprocessing import normalize

tdmnew = normalize(tdmnew.astype(np.float64), axis=1, norm="l2")
```
  6. Cosine Similarity

     • Cosine similarity is a measure of similarity between two vectors: the cosine of the angle between them.
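The definition above can be written directly with NumPy. This is a generic sketch of the formula ⟨a, b⟩ / (‖a‖·‖b‖), not code from the talk:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Orthogonal vectors score 0, parallel vectors score 1, regardless of their lengths. This is why the L2 normalization on the previous slide matters: after it, Euclidean distance and cosine similarity rank document pairs identically.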
  7.

```python
import numpy as np
from math import sqrt

ch2 = SelectKBest(chi2, k=10000)
tdmnew = ch2.fit_transform(corpus_tdm, training_labels)
# get_feature_names() was renamed to get_feature_names_out() in scikit-learn 1.0
names = np.asarray(cv.get_feature_names_out())[ch2.get_support()]

def new_metric_tensor(feature_names):
    # Pairwise feature similarity decays exponentially with the distance
    # between feature names; distance() is left undefined on the slide
    # (presumably a string edit distance between the terms).
    metric_tensor = np.ndarray(shape=(len(feature_names), len(feature_names)))
    for i in range(0, len(feature_names)):
        for j in range(i, len(feature_names)):
            feature_similarity = 1.01 ** (-distance(feature_names[i], feature_names[j]))
            metric_tensor[i, j] = feature_similarity
            metric_tensor[j, i] = feature_similarity
    return metric_tensor

def soft_cosine(a, b, metric_tensor):
    # Inner product <a, b> = a^T M b under the metric tensor M.
    # Returns the soft cosine *distance*, i.e. 1 minus the similarity.
    inner = lambda a, b: np.dot(np.transpose(a), metric_tensor).dot(b)
    return 1 - inner(a, b) / (sqrt(inner(a, a)) * sqrt(inner(b, b)))
```
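A quick self-contained sanity check of that inner product (the formula restated here, with hypothetical two-dimensional vectors): with the identity matrix as metric tensor, soft cosine distance reduces to the ordinary cosine distance, while a tensor that declares all features identical makes any two nonnegative vectors perfectly similar:

```python
import numpy as np

def soft_cosine_distance(a, b, S):
    """1 minus soft cosine similarity under metric tensor S: <a, b> = a^T S b."""
    inner = lambda x, y: x @ S @ y
    return 1 - inner(a, b) / (np.sqrt(inner(a, a)) * np.sqrt(inner(b, b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

identity = np.eye(2)        # features unrelated: ordinary cosine distance
all_same = np.ones((2, 2))  # every feature counts as the same feature
```

With `identity` the result is 1 − cos 45° ≈ 0.293; with `all_same` it is 0, because the inner product collapses to a product of total counts.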
  8. Nearest Centroid Classifier

     • Assigns the label of the class whose centroid (center of mass) is closest.
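The decision rule fits in a few lines. This is a generic illustration of the idea, not the talk's code; scikit-learn's `NearestCentroid`, used on the next slide, implements the same rule with more options:

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Assign x the label of the class whose centroid is closest (Euclidean)."""
    y_train = np.asarray(y_train)
    labels = sorted(set(y_train))
    # Centroid = mean of all training vectors belonging to the class.
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in labels}
    return min(labels, key=lambda c: np.linalg.norm(x - centroids[c]))
```

On normalized document vectors this amounts to comparing a new book against an "average book" of each genre.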
  9.

```python
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import NearestCentroid
from stop_words import get_stop_words

x_train, x_test, y_train, y_test = train_test_split(
    training_set, training_labels, test_size=0.86)

text_clf = Pipeline([
    ('vect', TfidfVectorizer(input='filename', stop_words=get_stop_words('russian'))),
    ('select', SelectKBest(chi2, k=10000)),
    ('clf', NearestCentroid()),
])
text_clf.fit(x_train, y_train)
print(text_clf.score(x_test, y_test))
```
  10. Inner Product Comparison

      Accuracy on the handpicked test set per genre, and overall under cross-validation:

                             Fantasy   Sci-Fi   Detective   Cross-validation
      Dot product            61.8%     60.4%    56.4%       63%
      Similarity matrix      79.6%     66.6%    57.8%       71.5%