Item Based Search and More Like This

Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission
is strictly prohibited.

Alex Ksikes

What is content based search?
• retrieves sets of results that are not necessarily directly accessible with full text search.
• should also work on multimedia documents, e.g. images or videos.
• the query is made of a set of documents rather than keywords (query-by-example).
• results are a set of “similar” or “related” documents.
• the search is performed over the whole content of the documents, hence the name.


Features
• multimedia documents may have no apparent structure.
• the number of variables to consider may be very large.
• for example, images may have millions of pixels which, taken sequentially, have no obvious underlying pattern.
• the information must be condensed into meaningful pieces of information called features.
• many features have been engineered for all types of applications.

Example (Image Intensity Histograms)
• for images, we could take the pixel intensity histogram of the image and represent it as a vector.
• images can then be matched using a similarity measure between their respective feature vectors.
• here the intensity histogram of this image serves as a feature vector in order to differentiate between different types of yeasts (image courtesy of Yeast Resource Center).
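A minimal sketch of the idea, assuming a grayscale image given as nested lists of integer intensities in [0, 255]. The function names, the number of bins and the use of histogram intersection as the similarity measure are illustrative assumptions, not something the slides specify.

```python
def intensity_histogram(pixels, bins=8, max_value=255):
    """Return a normalized intensity histogram (a list of `bins` floats)."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            # Map an intensity in [0, max_value] to a bin index in [0, bins - 1].
            index = min(value * bins // (max_value + 1), bins - 1)
            counts[index] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized histograms (assumed measure)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Two tiny hypothetical images: one dark, one bright.
dark = [[0, 10, 20], [30, 5, 15]]
bright = [[250, 240, 230], [220, 245, 235]]
h_dark = intensity_histogram(dark, bins=4)
h_bright = intensity_histogram(bright, bins=4)
print(histogram_intersection(h_dark, h_dark), histogram_intersection(h_dark, h_bright))
```

The histogram discards pixel positions entirely, which is what makes it a compact feature, and also why two very different images can share the same histogram.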


Principle
• documents are represented as feature vectors and matched using an appropriate metric.
• documents with “close enough” features are then thought to be similar.
• ingredients of a content based search system:
  ‣ the relevant features must be chosen and extracted.
  ‣ a metric must be properly chosen.
  ‣ an algorithm should be crafted to perform the matching efficiently.
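The three ingredients can be sketched with the simplest possible choices: hand-written feature vectors, the Euclidean metric, and a brute-force scan as the matching algorithm. The document ids and vectors below are hypothetical, and a linear scan is only an assumption for illustration; it is exactly the part a real system would replace with something more efficient.

```python
import math


def euclidean(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def most_similar(query, index, k=2, metric=euclidean):
    """Return the ids of the k items closest to `query` (brute force, O(N))."""
    ranked = sorted(index.items(), key=lambda item: metric(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical feature vectors keyed by document id.
index = {"a": [0.9, 0.1, 0.0], "b": [0.1, 0.8, 0.1], "c": [0.8, 0.2, 0.0]}
results = most_similar([1.0, 0.0, 0.0], index, k=2)
print(results)
```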

Features Continued
• the idea is to capture some aspect or characteristic of the data.
• for text, for example, the words and their order within each document would be of interest.
• for images, we might want to consider the color usage, texture composition or shape.
• let’s cover different types of features.

Images: Color Histograms
• one simple way of modeling the color of an image is to compute a histogram of RGB triplets.
• same as the intensity histograms previously discussed, except that RGB triplets replace intensity values.
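A small sketch of a color histogram, assuming the image is a flat list of (r, g, b) tuples with 8-bit channels. Quantizing each channel to a few levels before counting is an assumption added here to keep the number of distinct triplets manageable; the slides do not say how the triplets are binned.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Normalized histogram of quantized (r, g, b) triplets.

    Assumes 8-bit channels and that `levels` divides 256 evenly.
    """
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {triplet: n / total for triplet, n in counts.items()}


# Hypothetical image: two reddish pixels and two bluish pixels.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 5, 250)]
hist = color_histogram(pixels, levels=4)
print(hist)
```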


Images: Texture Histograms
• Tamura et al. (1978) mathematically defined and studied six basic features that correspond to the human visual perception of texture.
• out of these six features, coarseness was the most fundamental, followed by contrast and directionality.
• to model texture, a window around each pixel is taken, and coarseness (C), contrast (N) and directionality (D) are computed within that window.
• the feature is the histogram of the three values C, N and D.

Images: Texture Histograms … (image courtesy of Stefan Rüger, Multimedia Information Retrieval)

Other Feature Types
• let the object be an image of M × N pixels and denote by p(i,j) the intensity of this image at pixel (i,j). The average of pixel intensities can then be written as follows:
  $\mu = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i,j)$
• central moments of the quantity p for k > 1:
  $p_k = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( p(i,j) - \mu \right)^k$
• knowledge of μ and of all central moments is sufficient to reconstruct the distribution of p. Therefore, the vector (μ, p2, p3, ..., pk) could be used as a feature of the distribution p.
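As a sketch, the moment vector can be computed directly, taking the mean as the average over all pixels and the k-th central moment as the average of (p − μ)^k. The function name and the cut-off `max_order` are assumptions for illustration.

```python
def moment_features(pixels, max_order=4):
    """Return [mean, p2, p3, ..., p_max_order] for a non-empty grayscale image."""
    values = [v for row in pixels for v in row]
    n = len(values)
    mean = sum(values) / n
    # k-th central moment: average of (value - mean) ** k over all pixels.
    moments = [
        sum((v - mean) ** k for v in values) / n
        for k in range(2, max_order + 1)
    ]
    return [mean] + moments


features = moment_features([[1, 2], [3, 4]], max_order=3)
print(features)
```

In practice only the first few moments are kept, so the vector is a lossy summary of the distribution rather than a full reconstruction.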

In Chemoinformatics
• make use of spectral features, i.e. counts of recurring substructures.
• count recurring substructures within small molecules represented in 1D, 2D or 3D (Azencott et al., 2007).
  ‣ in 1D, the molecule is represented as a SMILES string. The feature vector of the molecule is made of counts of all substrings up to a certain maximum size, a superset of EdgeNGrams.
  ‣ in 2D, a spectral vector can be devised as a count of all sub-paths along a molecular carbon chain.
  ‣ in 3D, a feature vector can be built by counting distances between specific atoms of importance.
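The 1D case can be sketched as plain substring counting. This is a simplification under a stated assumption: the SMILES string is treated as a raw character sequence, so multi-character atoms such as "Cl" are not tokenized the way a chemistry toolkit would.

```python
from collections import Counter


def substring_counts(smiles, max_size=3):
    """Count every substring of length 1..max_size in a SMILES string."""
    counts = Counter()
    for size in range(1, max_size + 1):
        for start in range(len(smiles) - size + 1):
            counts[smiles[start:start + size]] += 1
    return counts


# "CCO" is the SMILES string for ethanol.
features = substring_counts("CCO", max_size=2)
print(features)
```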

Different Measures between Features
• distance between vectors.
• induced similarity measure.
• or distance between probability distributions:
  ‣ the Kullback-Leibler divergence measures the degree of difference between two probability distributions v and w.
  ‣ it is the expected number of extra bits required to code samples from v when using a code based on w.
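A sketch of these measures. The Euclidean distance and the mapping 1 / (1 + d) from a distance to a similarity are assumed choices for illustration, since the slide's own formulas did not survive in the transcript. The Kullback-Leibler divergence uses its standard definition, D(v || w) = Σ_i v_i log2(v_i / w_i), with a base-2 logarithm so the result is in bits.

```python
import math


def euclidean_distance(v, w):
    """One possible distance between feature vectors (assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def similarity_from_distance(d):
    """One possible induced similarity (assumed choice): 1 at d = 0, tends to 0."""
    return 1.0 / (1.0 + d)


def kl_divergence(v, w):
    """D(v || w) in bits; assumes w[i] > 0 wherever v[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(v, w) if a > 0)


v = [0.5, 0.5]
w = [0.25, 0.75]
print(kl_divergence(v, w), kl_divergence(w, v))
```

The two printed values differ: the divergence is not symmetric, so strictly speaking it is not a distance.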

Curse of Dimensionality
• indexing high-dimensional vectors efficiently is very challenging (Bellman, 1966).
• consider an n-dimensional unit hypercube [0,1]^n where the data points are uniformly distributed:
  ‣ to capture a fraction p of the data in a sub-cube, the edge length l of this sub-cube in each dimension is l = p^(1/n).
  ‣ to capture 1% of the data in a 10-dimensional unit hypercube, we would need l = (1/100)^(1/10) ≈ 0.63!
  ‣ that is, 63% of the range in each dimension is required; at 500 dimensions, it becomes 99%!
• most of the volume enclosed in the hypercube is actually located near its surface!
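The arithmetic can be checked in a few lines. The function name is hypothetical; the formula l = p^(1/n) is the one given above.

```python
def edge_length(fraction, dimensions):
    """Edge length of a sub-cube holding `fraction` of uniformly distributed data."""
    return fraction ** (1.0 / dimensions)


# Edge length needed to capture 1% of the data as the dimension grows.
for n in (1, 2, 10, 100, 500):
    print(n, round(edge_length(0.01, n), 3))
```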

Curse of Dimensionality …
• with a similar argument, Beyer et al. (1999) showed that, as the dimensionality of the space increases, all the points tend to exhibit the same distance with respect to each other.
• this has the ultimate consequence of making the simple nearest neighbor search approach ill-defined.
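A small simulation can illustrate the effect. It is a sketch under assumptions: uniformly random points, Euclidean distance, and "relative contrast" (farthest minus nearest distance, divided by the nearest) as the quantity to watch. It is not the analysis of Beyer et al.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point)))
        )
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low_dim = relative_contrast(2)
high_dim = relative_contrast(500)
print(low_dim, high_dim)
```

When the contrast approaches zero, the nearest neighbor is barely closer than the farthest point, which is what makes the search ill-defined.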

Learning to Rank
• two-phase scheme:
  1) a chunk of the relevant documents is identified using a simple retrieval model (top-k retrieval).
  2) a more accurate but computationally expensive model is used for ranking.
• the training data consists of query-document pairs together with an ordinal or boolean score.
• scores are usually determined by human judges who assess the relevance of each document with respect to a given query.


Bayesian Sets ...
• Bayesian Sets (Ghahramani and Heller, 2005) takes a probabilistic view of the data, instead of devising a metric on a feature space.
• the query is a set of items.
• the information retrieval method should rank items x by how well x fits with the query set.

Bayesian Sets with Sparse Binary Data ...
• alpha and beta are hyper-parameters usually set to be proportional to the mean of the data. A tilde means that the quantity depends on the values of the items as well.
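The following sketch scores items against a query set of N items using the closed form for binary data as recalled from Ghahramani and Heller (2005); it should be checked against the paper. The log score of item x is c + Σ_j q_j x_j, with α̃_j = α_j + Σ_i x_ij and β̃_j = β_j + N − Σ_i x_ij (sums over the query items), c = Σ_j [log(α_j + β_j) − log(α_j + β_j + N) + log β̃_j − log β_j] and q_j = log α̃_j − log α_j − log β̃_j + log β_j. The function name, the proportionality constant `scale` and the handling of constant features are assumptions.

```python
import math


def bayesian_sets_scores(items, query_indices, scale=2.0):
    """Log scores of all items (binary feature lists) against the query items."""
    n_items = len(items)
    n_features = len(items[0])
    n_query = len(query_indices)
    means = [sum(item[j] for item in items) / n_items for j in range(n_features)]
    c = 0.0
    q = []
    for j in range(n_features):
        # Hyper-parameters proportional to the feature mean (scale is assumed).
        alpha = scale * means[j]
        beta = scale * (1.0 - means[j])
        if alpha == 0.0 or beta == 0.0:
            # Feature has the same value in every item: ignore it (assumption).
            q.append(0.0)
            continue
        on = sum(items[i][j] for i in query_indices)
        alpha_t = alpha + on
        beta_t = beta + n_query - on
        c += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
              + math.log(beta_t) - math.log(beta))
        q.append(math.log(alpha_t) - math.log(alpha)
                 - math.log(beta_t) + math.log(beta))
    return [c + sum(qj * xj for qj, xj in zip(q, item)) for item in items]


# Hypothetical binary data: items 0 and 1 share features, as do items 2 and 3.
items = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
scores = bayesian_sets_scores(items, query_indices=[0])
ranking = sorted(range(len(items)), key=lambda i: -scores[i])
print(ranking)
```

Querying with item 0 ranks item 0 first and item 1, which shares two of its features, second.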

Bayesian Sets …
• queries based on multiple items are possible.
• reduces the work involved in setting up a similarity search solution to just feature engineering.
• reduces the handling of complex content based searches to choosing the right plugin, i.e. feature extractor.
• generic and open-ended approach, because completely new data types could be handled in the future by writing the right feature extractor.
• to stress these particularities, the search algorithm is referred to as item based as opposed to content based.

Bayesian Sets
• the matrix product c + Xq looks like the vector space model up to a constant and a different weighting scheme of the features (terms).
• therefore, if we can binarize our data and put it in text form, we could use a search engine such as Lucene to perform this computation!
• we would only need to change the scoring function to the sum of the q_j. The alpha / beta terms that do not depend on the items could be ignored (they penalize features present in the doc but not in the queried items).
• in order not to make the query too long, we could greedily select only the best q_j, using a heuristic similar to More Like This.
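A sketch of the greedy step and of the simplified scoring, assuming the weights q_j have already been computed and that each binary feature is represented as a term. The function names, the example weights and the use of a Python set to stand in for an indexed document are all hypothetical.

```python
def select_query_terms(weights, max_query_terms=3):
    """Greedily keep the features with the largest weights q_j."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_query_terms])


def score(doc_terms, query_weights):
    """Sum of q_j over the selected query terms present in the document."""
    return sum(w for term, w in query_weights.items() if term in doc_terms)


# Hypothetical weights q_j for four binary features expressed as terms.
weights = {"red": 0.7, "round": 0.4, "large": 0.1, "blue": -0.7}
query = select_query_terms(weights, max_query_terms=2)
print(query, score({"red", "large"}, query))
```

Dropping the low-weight terms trades a little ranking accuracy for a much shorter query, which is the same trade-off More Like This makes.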

More Like This
• given a piece of text, it attempts to find the best terms (highest tf-idf) characterizing this text, i.e. feature selection.
  ‣ these are the terms that contribute most to the score of this text or document.
• forms a boolean query from these terms.
• different filtering options specify how terms should be selected from the like_text.

{
  "more_like_this" : {
    "fields" : ["name.first", "name.last"],
    "like_text" : "text like this one",
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
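The heuristic can be sketched in a few lines. This is an illustration of the idea only, not Lucene's implementation: the tokenizer is a whitespace split, the idf formula is a simplified log(N / df), and the document frequencies are hypothetical. The parameter names `min_term_freq` and `max_query_terms` mirror the query above.

```python
import math
from collections import Counter


def select_terms(text, doc_freq, n_docs, min_term_freq=1, max_query_terms=12):
    """Pick the terms of `text` with the highest tf-idf (simplified formula)."""
    tf = Counter(text.lower().split())
    scored = []
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if freq < min_term_freq or df == 0:
            continue  # too rare in the text, or unknown to the index
        scored.append((freq * math.log(n_docs / df), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_query_terms]]


def to_bool_query(field, terms):
    """Form a boolean query with one optional clause per selected term."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}


# Hypothetical document frequencies for an index of 1000 documents.
doc_freq = {"the": 1000, "elasticsearch": 10, "search": 100}
terms = select_terms("the elasticsearch search the", doc_freq, n_docs=1000,
                     max_query_terms=2)
print(to_bool_query("body", terms))
```

The very frequent term "the" gets an idf of zero and is dropped, which is the feature-selection effect described above.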

More Like This …
• searching for multiple items has recently been added to Elasticsearch.
• performs an MLT query per field from the text fetched in that field (treated as one multi-value item).

{
  "more_like_this" : {
    "fields" : ["name.first", "name.last"],
    "docs" : [
      { "_index" : "test", "_type" : "type", "_id" : "1" },
      { "_index" : "test", "_type" : "type", "_id" : "2" }
    ],
    "ids" : ["3", "4"],
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}

What’s Missing?
• however, this only works on text for now.
  ‣ we need different features for different applications, and features that can be binarized.
• play with different similarity functions for selecting « interesting terms ».
  ‣ need a custom similarity mimicking BSets.
• some challenges at the interface level: how to add multiple items?

What could be next?
• go beyond searching for similar documents and look into searching for the most significant terms of a query (like_query).
  ‣ using aggs with significant terms.
• this newly generated query could be used as a classifier query.
  ‣ get the significant terms for a query in a specific category and use these terms to classify new documents based on how well they « fit » this query (how well they belong to this category).
  ‣ a cheap way of creating a classifier.
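One way this could look, sketched as plain Python dictionaries for the request bodies. The `significant_terms` aggregation and the `bool` / `term` queries are part of the Elasticsearch query DSL, but the field names, the aggregation name "keywords", the example buckets and the idea of using each bucket's score as a boost are assumptions made for illustration.

```python
def significant_terms_request(category_field, category, text_field, size=10):
    """Request body asking for the significant terms of one category."""
    return {
        "size": 0,
        "query": {"term": {category_field: category}},
        "aggs": {
            "keywords": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


def classifier_query(buckets, text_field):
    """Turn aggregation buckets into a query scoring how well a doc fits."""
    return {
        "bool": {
            "should": [
                {"term": {text_field: {"value": b["key"], "boost": b["score"]}}}
                for b in buckets
            ]
        }
    }


request = significant_terms_request("category", "sports", "body", size=2)
# Hypothetical buckets, shaped like the aggregation's response.
buckets = [{"key": "match", "score": 0.8}, {"key": "goal", "score": 0.5}]
query = classifier_query(buckets, "body")
print(query)
```

A new document would then be assigned to the category whose classifier query gives it the highest score.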

Questions

Thank you! http://elasticsearch.com/support @elasticsearch
