benefit from larger training corpora (1B+ words) and higher dimensionality (typically 50-300)
• Some models (e.g., GloVe) are closer to matrix factorization than to neural networks
• Can successfully uncover semantic and syntactic word relationships from unlabeled corpora (Wikipedia, Google News, Common Crawl)
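The semantic relationships these models uncover are often demonstrated with vector arithmetic (e.g., king - man + woman ≈ queen). A minimal sketch of that arithmetic, using hypothetical toy vectors rather than trained embeddings (real models use 50-300 dimensions trained on 1B+ words):

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" -- illustrative only, not trained.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy arithmetic: king - man + woman should land nearest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max((w for w in vecs if w != "king"),
              key=lambda w: cosine(target, vecs[w]))
print(nearest)  # -> queen
```

With real pretrained vectors (e.g., loaded via a library such as Gensim), the same nearest-neighbor search over the full vocabulary recovers analogies like this directly from unlabeled text.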