
Item Based Search and More Like This

In this talk I'll be introducing the concept of query-by-example. I'll cover a couple of features used for multimedia search, and some matching procedures such as kNN and Bayesian Sets. Finally, I'll conclude with Lucene internally performing a matrix multiply, and with using More Like This as an online feature selection algorithm.

Elasticsearch Inc

July 30, 2014


Transcript

  1. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited.
     Alex Ksikes, Item Based Search and More Like This
  2. What is content-based search?
     • retrieves sets of results that are not necessarily directly accessible with full-text search.
     • should also work on multimedia documents, i.e. images or videos.
     • the query is made of a set of documents rather than of keywords: query-by-example.
     • results are a set of “similar” or “related” documents.
     • the search is performed over the whole content of the documents, hence the name.
  3. Features
     • multimedia documents may have no apparent structure.
     • the number of variables to consider may be very large.
     • for example: images may have millions of pixels which, taken sequentially, have no obvious underlying pattern.
     • the information must be condensed into meaningful pieces of information called features.
     • many features have been engineered for all types of applications.
  4. Example (Image Intensity Histograms)
     • for images we could take the pixel intensity histogram of the image and represent it as a vector.
     • images can then be matched using a similarity measure between their respective feature vectors.
     • here the intensity histogram of this image serves as a feature vector in order to differentiate between different types of yeasts (courtesy of the Yeast Resource Center).
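As a sketch of the idea (in Python, not part of the talk), an intensity-histogram feature vector can be computed as follows; the image is assumed to be a plain list of rows of 0-255 intensity values, and all names are illustrative.

```python
def intensity_histogram(image, bins=16):
    """Bucket pixel intensities (0-255) into `bins` counts, normalized to sum to 1."""
    counts = [0] * bins
    width = 256 / bins
    for row in image:
        for p in row:
            counts[min(int(p / width), bins - 1)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Two tiny toy "images": one dark, one bright.
dark = [[10, 20], [30, 40]]
bright = [[200, 210], [220, 230]]
print(intensity_histogram(dark, bins=4))    # all mass in the first bucket
print(intensity_histogram(bright, bins=4))  # all mass in the last bucket
```

Two images can then be compared by measuring the distance between their histogram vectors.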
  5. Example (Image Intensity Histograms) …
  6. Principle
     • documents are represented as feature vectors and matched using an appropriate metric.
     • documents with “close enough” features are then thought to be similar.
     • ingredients of a content-based search system:
       ‣ the relevant features must be chosen and extracted.
       ‣ a metric must be properly chosen.
       ‣ an algorithm should be crafted to perform the matching efficiently.
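A minimal sketch of this matching step (brute-force kNN in Python, with illustrative names): feature vectors live in an index, and the query vector is ranked against all of them by Euclidean distance.

```python
import math

def euclidean(v, w):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def knn(query_vec, index, k=2):
    """Return the k closest (doc_id, distance) pairs, brute force."""
    scored = [(doc_id, euclidean(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda t: t[1])[:k]

index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(knn([1.0, 0.0], index, k=2))  # "a" first, then "b"
```

Brute force is O(documents × dimensions) per query; the efficiency concern in the last bullet is exactly why real systems need smarter index structures.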
  7. Features Continued
     • the idea is to capture some aspect or characteristic of the data.
     • for text, for example, the words and their order within each document would be of interest.
     • for images, we might want to consider the color usage, texture composition or shape.
     • let’s cover different types of features.
  8. Images: Color Histograms
     • one simple way of modeling the color of an image consists of computing a histogram of RGB triplets.
     • same as the intensity histograms previously discussed, only that now RGB triplets have replaced intensity values.
  9. Images: Texture Histograms
     • Tamura et al. (1978) mathematically defined and studied six basic features that correspond to the human visual perception of texture.
     • out of these six features, coarseness was the most fundamental, followed by contrast and directionality.
     • to model texture, a window around each pixel is taken, and coarseness (C), contrast (N) and directionality (D) are computed within that window.
     • a histogram of the three values C, N and D is then built.
  10. Images: Texture Histograms …
      » image courtesy of Stefan Rüger in Multimedia Information Retrieval
  11. Other Feature Types
      • let the object be an image and denote by p(i,j) the intensity of this image at pixel (i,j). The average of the pixel intensities can then be written as:
        μ = (1/N) Σ_{i,j} p(i,j), where N is the number of pixels.
      • the central moments of the quantity p for k > 1:
        p_k = (1/N) Σ_{i,j} (p(i,j) − μ)^k
      • μ together with all the central moments is sufficient to reconstruct the distribution of p. Therefore, the vector (μ, p_2, p_3, ..., p_k) could be used as a feature of the distribution p.
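A small sketch of these moment features in Python (illustrative, not from the talk): the mean plus the central moments up to order k_max form the feature vector.

```python
def moment_features(image, k_max=4):
    """Feature vector (mean, central moments p_2 .. p_k_max) of pixel intensities."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mu = sum(pixels) / n
    # Central moment of order k: average of (p - mu)^k over all pixels.
    moments = [sum((p - mu) ** k for p in pixels) / n for k in range(2, k_max + 1)]
    return [mu] + moments

# Toy 2x2 image: intensities 0 and 2, so mu = 1 and odd moments vanish.
print(moment_features([[0, 2], [0, 2]]))  # [1.0, 1.0, 0.0, 1.0]
```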
  12. In Chemoinformatics
      • make use of spectral features, i.e. counting recurring substructures.
      • counting recurring substructures within small molecules represented in 1D, 2D or 3D (Azencott et al., 2007):
        ‣ in 1D, the molecule is represented as a SMILES string. The feature vector of the molecule is made of the counts of all substrings up to a certain maximum size, a superset of EdgeNGrams.
        ‣ in 2D, a spectral vector can be devised as a count of all sub-paths along a molecular carbon chain.
        ‣ in 3D, a feature vector can be built by counting distances between specific atoms of importance.
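The 1D case can be sketched in a few lines of Python (illustrative, not from the talk): count every substring of the SMILES string up to a maximum length.

```python
from collections import Counter

def smiles_substring_features(smiles, max_len=3):
    """Counts of all substrings up to max_len characters (1D spectral features)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(smiles) - n + 1):
            counts[smiles[i:i + n]] += 1
    return counts

# "CCO" is the SMILES string for ethanol.
print(smiles_substring_features("CCO", max_len=2))
# Counter({'C': 2, 'O': 1, 'CC': 1, 'CO': 1})
```

Two molecules can then be matched by comparing their count vectors, just like any other feature vector.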
  13. Different Measures between Features
      • distance between vectors, e.g. the Euclidean distance:
        d(v, w) = sqrt(Σ_i (v_i − w_i)^2)
      • which induces a similarity measure, e.g. s(v, w) = 1 / (1 + d(v, w)).
      • or a distance between probability distributions:
        ‣ the Kullback-Leibler divergence KL(v ‖ w) = Σ_i v_i log(v_i / w_i) measures the degree of difference between two probability distributions v and w.
        ‣ it is the expected number of extra bits required to code samples from v when using a code based on w.
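These measures are a few lines each in Python (a sketch with illustrative names; the induced similarity 1/(1+d) is one common choice among several):

```python
import math

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def similarity(v, w):
    # One common way to turn a distance into a similarity in (0, 1].
    return 1.0 / (1.0 + euclidean(v, w))

def kl_divergence(v, w):
    # v, w are probability distributions (sum to 1, strictly positive here).
    # In base 2: expected extra bits to code samples from v with a code for w.
    return sum(vi * math.log2(vi / wi) for vi, wi in zip(v, w))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(similarity(p, p))     # 1.0 for identical vectors
print(kl_divergence(p, q))  # > 0 whenever the distributions differ
```

Note that KL divergence is not symmetric: KL(v ‖ w) generally differs from KL(w ‖ v), so it is a divergence rather than a true metric.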
  14. Curse of Dimensionality
      • indexing high-dimensional vectors efficiently is very challenging (Bellman, 1966).
      • consider an n-dimensional unit hypercube [0,1]^n where the data points are uniformly distributed:
        ‣ to capture a portion p of the data, the length l in each dimension of this volume can be written as l = p^(1/n).
        ‣ for 1% of the data in a 10-dimensional unit hypercube, we would have l = (1/100)^(1/10) ≈ 0.63!
        ‣ that requires 63% of the range in each dimension; at 500 dimensions, it becomes 99%!
      • most of the volume enclosed in the hypercube is actually located on its surface!
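The slide's numbers can be checked directly with the formula l = p^(1/n) (a quick Python illustration, not part of the talk):

```python
def edge_length(p, n):
    """Edge length per dimension needed to capture fraction p of uniform data in [0,1]^n."""
    return p ** (1.0 / n)

for n in (1, 10, 500):
    print(n, round(edge_length(0.01, n), 2))
# 1 -> 0.01, 10 -> 0.63, 500 -> 0.99
```

So a "local" neighborhood capturing just 1% of the data must span nearly the whole range of every coordinate once dimensionality is high, which is why spatial index structures degrade to scans.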
  15. Curse of Dimensionality …
      • by a similar argument, Beyer et al. (1999) showed that, as the dimensionality of the space increases, all points tend to exhibit the same distance to each other.
      • this has the ultimate consequence of making the simple nearest neighbor search approach ill-defined.
  16. Learning to Rank
      • two-phase scheme:
        1) a chunk of the relevant documents is identified using a simple retrieval model (top-k retrieval).
        2) a more accurate but computationally expensive model is used for ranking.
      • the training data consists of query-document pairs together with an ordinal or boolean score.
      • scores are usually determined by human judges who assess the relevance of each document with respect to a given query.
  17. Bayesian Sets …
      • Bayesian Sets (Ghahramani and Heller, 2005) takes a probabilistic view of the data, instead of devising a metric on a feature space.
      • the query is a set of items.
      • our information retrieval method should rank items x by how well x fits with the query set.
  18. Bayesian Sets with Sparse Binary Data …
      • for binary feature vectors x under a Beta-Bernoulli model, the log score is linear in x:
        log score(x) = c + Σ_j q_j x_j
        q_j = log(α'_j / α_j) − log(β'_j / β_j)
        α'_j = α_j + Σ_i x_ij,  β'_j = β_j + N − Σ_i x_ij  (sums over the N items of the query set)
        c = Σ_j [ log(α_j + β_j) − log(α_j + β_j + N) + log(β'_j) − log(β_j) ]
      ‣ where alpha and beta are hyper-parameters usually set to be proportional to the mean of the data. The primed (tilde in the original) quantities depend on the values of the queried items as well.
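As a sketch (in Python, with illustrative names), the closed-form score from the Ghahramani and Heller (2005) paper for sparse binary data can be computed directly:

```python
import math

def bayesian_sets_scores(X, query_ids, alpha, beta):
    """Log scores c + sum_j q_j * x_j for binary rows X, given a query set of row ids."""
    n = len(query_ids)
    n_feats = len(alpha)
    # Per-feature counts over the query set.
    s = [sum(X[i][j] for i in query_ids) for j in range(n_feats)]
    alpha_t = [alpha[j] + s[j] for j in range(n_feats)]        # "alpha tilde"
    beta_t = [beta[j] + n - s[j] for j in range(n_feats)]      # "beta tilde"
    c = sum(math.log(alpha[j] + beta[j]) - math.log(alpha[j] + beta[j] + n)
            + math.log(beta_t[j]) - math.log(beta[j]) for j in range(n_feats))
    q = [math.log(alpha_t[j]) - math.log(alpha[j])
         - math.log(beta_t[j]) + math.log(beta[j]) for j in range(n_feats)]
    return [c + sum(q[j] * row[j] for j in range(n_feats)) for row in X]

# Toy data: item 0 is the query; item 1 shares its feature, item 2 does not.
X = [[1, 0], [1, 1], [0, 1]]
scores = bayesian_sets_scores(X, query_ids=[0], alpha=[0.5, 0.5], beta=[0.5, 0.5])
print(scores)  # item 0 scores highest, item 2 lowest
```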
  19. Bayesian Sets …
      • multiple item based queries are possible.
      • reduces the work involved in setting up a similarity search based solution to just feature engineering.
      • reduces the handling of complex content-based searches to choosing the right plugin, i.e. feature extractor.
      • a generic and open-ended approach, because completely new data types could be handled in the future by writing the right feature extractor.
      • to stress these particularities, the search algorithm is referred to as item-based as opposed to content-based.
  20. Bayesian Sets
      • the matrix product c + Xq looks like the vector space model, up to a constant and a different weighting scheme of the features (terms).
      • therefore, if we can binarize our data and put it in text form, we could use a search engine such as Lucene to perform this computation!
      • we would only need to change the scoring function to the sum of the q_j. The alpha / beta terms not dependent on the items could be ignored (they penalize features present in the document but not in the queried items).
      • in order not to make the query too long, we could greedily select only the best q_j -> use a heuristic similar to More Like This.
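A tiny Python sketch of the observation (illustrative names, not the actual Lucene machinery): the scores for all items are the matrix-vector product c + Xq, which is what a search engine computes when each matched term j contributes weight q_j, and keeping only the largest q_j keeps the disjunctive query short.

```python
def scores(X, q, c=0.0):
    """c + Xq: one log score per binary row of X."""
    return [c + sum(x_j * q_j for x_j, q_j in zip(row, q)) for row in X]

def top_terms(q, vocabulary, max_query_terms=2):
    """Greedy online feature selection: keep only the highest-weight terms."""
    ranked = sorted(zip(vocabulary, q), key=lambda t: t[1], reverse=True)
    return ranked[:max_query_terms]

q = [2.0, 0.1, 1.5]
vocab = ["term_a", "term_b", "term_c"]
print(scores([[1, 0, 1]], q))  # [3.5]
print(top_terms(q, vocab))     # [('term_a', 2.0), ('term_c', 1.5)]
```

More Like This plays the role of `top_terms` here, except that it ranks terms by tf-idf rather than by the Bayesian Sets weights q_j.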
  21. More Like This
      • given a piece of text, it attempts to find the best terms (highest tf-idf) characterizing this text, i.e. feature selection.
        ‣ terms that most contribute to the score of this text or document.
      • forms a boolean query from these terms.
      • different ways of specifying how terms should be selected from the like_text:

      {
        "more_like_this" : {
          "fields" : ["name.first", "name.last"],
          "like_text" : "text like this one",
          "min_term_freq" : 1,
          "max_query_terms" : 12
        }
      }
  22. More Like This …
      • searching for multiple items has recently been added to Elasticsearch.
      • performs an MLT query per field from the text fetched in that field (treated as one multi-valued item):

      {
        "more_like_this" : {
          "fields" : ["name.first", "name.last"],
          "docs" : [
            { "_index" : "test", "_type" : "type", "_id" : "1" },
            { "_index" : "test", "_type" : "type", "_id" : "2" }
          ],
          "ids" : ["3", "4"],
          "min_term_freq" : 1,
          "max_query_terms" : 12
        }
      }
  23. What’s Missing?
      • however, this only works on text for now.
        ‣ we need different features for different applications, and they would have to be binarizable.
      • play with different similarity functions for selecting “interesting terms”.
        ‣ need a custom similarity mimicking Bayesian Sets.
      • some challenges at the interface level: how to add multiple items?
  24. What could be next?
      • go beyond searching for similar documents and look into searching for the most significant terms of a query (like_query).
        ‣ using aggregations with significant terms.
      • this newly generated query could be used as a classifier query.
        ‣ get the significant terms for a query in a specific category and use these terms to classify new documents based on how well they “fit” within this query (how well they belong to this category).
        ‣ a cheap way of creating a classifier.
  25. thank you!
      http://elasticsearch.com/support
      @elasticsearch