How to choose a good project topic: an analysis using Natural Language Processing (NLP)

About me

Curiosity, Computing, Dedication

Why Data Science?

Achievements
Microsoft Azure Machine Learning Award (2016)
Admission into the MIT Global Entrepreneurship Bootcamp (2015)
IBM Master the Mainframe World Championship (New York 2014, Brazil 2012)

Why this project?
Computer Science at UFPE
Curious to understand what each university's status actually means
What do they do that we don't?
What can we do to stand out more?

What data are we going to investigate?

Carnegie Mellon
6th place in the World University Rankings 2016-2017 by subject: computer science
23rd place in the World University Rankings 2017

UFPE
Not listed in the World University Rankings 2016-2017 by subject: computer science
801+ in the World University Rankings 2017

Source: https://www.timeshighereducation.com/

    [
      {
        "year": 2003,
        "title": "Adaptive Motion for Quadruped Robots",
        "abstract": "Robotic soccer is a complex task that requires multiple autonomous agents to collaborate in an adversarial environment to achieve specific objectives. [...]",
        "university": "Carnegie Mellon",
        "degree": "Computer Science"
      },
      {
        "year": 2007,
        "title": "Mapping the CMMI Model to the ISO/IEC 12207 norm",
        "abstract": "Software standards and models guide organizations in definition and deployment of software development processes and, therefore, they help them to improve [...]",
        "university": "UFPE",
        "degree": "Computer Science"
      }
    ]
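Records in this shape can be read with Python's standard json module. The sketch below is only an illustration: the two records are shortened copies of the sample above (abstracts elided), and the variable names are assumptions, not the author's code.

```python
import json
from collections import Counter

# Two shortened records in the same shape as the sample above (abstracts elided).
raw = """
[
  {"year": 2003, "title": "Adaptive Motion for Quadruped Robots",
   "abstract": "[...]", "university": "Carnegie Mellon", "degree": "Computer Science"},
  {"year": 2007, "title": "Mapping the CMMI Model to the ISO/IEC 12207 norm",
   "abstract": "[...]", "university": "UFPE", "degree": "Computer Science"}
]
"""

records = json.loads(raw)

# Count how many projects each university contributes to the dataset.
projects_per_university = Counter(record["university"] for record in records)
print(projects_per_university)
```

Grouping by the "university" field is what later makes it possible to compare the two sets of abstracts side by side.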

What hypotheses are we going to investigate?

Hypotheses
They do more work applying computing in different contexts
They produce more original research and fewer literature reviews

Investigating the topics of the projects

Extracting keywords
Extract the tokens and lemmas from the abstracts
Convert the texts into a matrix of token counts
Transform the matrix into a tf-idf representation

    def get_lemma(doc):
        """Return the lemmas of a parsed (spaCy) document, skipping
        punctuation, digits, stop words and lemmas shorter than 3 characters."""
        tokens = []
        for token in doc:
            if not token.is_punct and not token.is_digit and not token.is_stop:
                if len(token.lemma_) > 2:
                    tokens.append(token.lemma_)
        return tokens

    # parsed_abstract is assumed to be one abstract already parsed with nlp(...)
    lemmas = get_lemma(parsed_abstract)
    print(lemmas)
    # Output:
    # ['study', 'propose', 'develop', 'nationwide', 'survey', 'asses', 'level',
    #  'maturity', 'project', 'management', 'junior', 'enterprise', 'brazil',
    #  'rely', 'project', 'management', 'maturity', 'model', 'create', 'darci',
    #  'prado', 'enterprise', 'survey', 'select', 'state', 'federation', 'junior',
    #  'enterprises', 'affiliate', 'brazil', 'junior', 'fejece', 'fejepe', 'unijr',
    #  'fejemg', 'fejesp', 'fejepar', 'fejesc', 'rio', 'concentro', 'addition',
    #  'state', 'not', 'affiliate', 'paraíba', 'sergipe', 'total', 'state',
    #  'search', 'goal', 'reach', 'junior', 'enterprise', 'brazil']

    print("Number of words:", len(lemmas))
    # Output:
    # Number of words: 54

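    # Map each feature (lemma) to its column index in the count matrix.
    # Note: scikit-learn 1.0 introduced get_feature_names_out() as the
    # replacement for get_feature_names(), which was removed in 1.2.
    print({k: v for v, k in enumerate(cvec.get_feature_names())})
    # Output:
    # {'addition': 0, 'enterprises': 9, 'maturity': 21, 'affiliate': 1,
    #  'search': 31, 'management': 20, 'goal': 17, 'fejesp': 16, 'concentro': 4,
    #  'model': 22, 'total': 37, 'survey': 36, 'fejemg': 12, 'select': 32,
    #  'level': 19, 'develop': 7, 'federation': 10, 'asses': 2, 'junior': 18,
    #  'project': 26, 'fejesc': 15, 'rely': 29, [...]}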

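    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    # The "documents" given to the vectorizer are row indices of df. The custom
    # tokenizer looks up the abstract for each index, parses it with nlp and
    # returns its lemmas. lowercase=False keeps the vectorizer from calling
    # .lower() on the integer index.
    cvec = CountVectorizer(
        stop_words='english',
        ngram_range=(1, 1),
        lowercase=False,
        tokenizer=lambda key: get_lemma(nlp(df.iloc[key]['abstract'])),
    )

    # fit(raw_documents, y=None)
    # Learn a vocabulary dictionary of all tokens in the raw documents
    cvec.fit(list(range(0, len(df))))

    # transform(raw_documents)
    # Transform documents to document-term matrix
    cvec_counts = cvec.transform(list(range(0, len(df))))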

doc_id  word_id  word        occurrences
0       0        addition    1
0       8        enterprise  3
0       18       junior      4

tf-idf
“The goal of using tf-idf instead of simply using the frequency of occurrence of a token is to scale down the impact of tokens that occur very frequently in a given corpus and that are therefore empirically less informative.” - Source

term-frequency x inverse document-frequency
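As a hedged sketch of what that product means, the default weighting of scikit-learn's TfidfTransformer (smoothed idf, L2 normalisation) can be written as below. The symbols are notation chosen here, not taken from the slides: n is the number of documents, df(t) the number of documents containing term t, tf(t, d) the count of t in document d, and w(t, d) the final weight.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
w(t, d) = \frac{\text{tf-idf}(t, d)}{\sqrt{\sum_{t'} \text{tf-idf}(t', d)^{2}}}
```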

    # TfidfTransformer(): transform a count matrix to a normalized tf or
    # tf-idf representation
    transformer = TfidfTransformer()
    transformed_weights = transformer.fit_transform(cvec_counts)

    # i is the row index of the document being inspected
    weights = transformed_weights.toarray()[i]
    # Output:
    # [ 0.10369517  0.20739034  0.10369517  0.31108551  0.10369517  0.10369517
    #   0.10369517  0.10369517  0.31108551  0.10369517  0.10369517  0.10369517
    #   0.10369517  0.10369517  0.10369517  0.10369517  0.10369517  0.10369517
    #   0.41478068  0.10369517  0.20739034  0.20739034  0.10369517  0.10369517
    #   0.10369517  0.10369517  0.20739034  0.10369517  0.10369517  0.10369517
    #   0.10369517  0.10369517  0.10369517  0.10369517  0.31108551  0.10369517
    #   0.20739034  0.10369517  0.10369517]
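The vector above is consistent with every term having the same idf: each weight is a raw count (1, 2, 3 or 4) divided by the L2 norm of the counts, and 0.10369517 is roughly 1/sqrt(93). The sketch below re-implements that weighting in plain Python, assuming scikit-learn's default settings (smoothed idf, L2 normalisation). The function name tfidf_weights and the toy lemma lists are hypothetical, and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """tf-idf with smoothed idf and L2 normalisation: one dict of weights per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit length (assumes a non-empty document).
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Toy lemma lists (hypothetical); the first reuses a few lemmas from the example above.
docs = [
    ["junior", "junior", "enterprise", "project", "management", "survey"],
    ["robot", "soccer", "project", "agent", "robot"],
]
weights = tfidf_weights(docs)

# Keywords of the first document, highest weight first.
keywords = sorted(weights[0], key=weights[0].get, reverse=True)
print(keywords[:3])
```

Terms that occur in both toy documents ("project") end up with a lower weight than terms of equal count that occur in only one, which is the effect the quote describes.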

Carnegie Mellon

Creating clusters by topic
Same pre-processing
Remove the words that appear only once
Build a model with Latent Semantic Indexing

Latent Semantic Indexing
“[...] words used in the same context tend to have similar meanings.” - Source

Latent Semantic Indexing
Linear algebra (matrix decomposition)
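The steps listed for the topic clusters (drop words that occur once, then fit an LSI model) come down to a truncated singular value decomposition of the term-document matrix. The sketch below shows that idea with NumPy on a hypothetical toy corpus of token lists. It is not the author's code: the slides do not say which library was used, and in practice a dedicated implementation such as gensim's LsiModel or scikit-learn's TruncatedSVD would be the usual choice.

```python
from collections import Counter
import numpy as np

# Hypothetical toy corpus: three "interface" documents and three "graph" documents.
docs = [
    ["human", "computer", "interface", "abc"],
    ["computer", "interface", "user"],
    ["human", "user", "interface"],
    ["graph", "trees"],
    ["graph", "trees", "minors", "widths"],
    ["graph", "minors"],
]

# 1. Drop tokens that occur only once in the whole corpus.
freq = Counter(tok for doc in docs for tok in doc)
docs = [[tok for tok in doc if freq[tok] > 1] for doc in docs]

# 2. Build the term-document count matrix (rows: terms, columns: documents).
vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for tok in doc:
        A[index[tok], j] += 1

# 3. Truncated SVD: keep only the k largest singular values ("topics").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # two "interface" documents
print(cosine(doc_vecs[0], doc_vecs[3]))  # an "interface" and a "graph" document
```

In the reduced space, documents from the same group point in nearly the same direction even when they share few words, while documents from different groups do not.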

Subject                      Title
human computer interaction   Human machine interface for ABC computer applications
human computer interaction   Relation of user perceived response time to error measurement
...                          ...
graph theory                 The generation of random, binary, ordered trees

id    Title
c1    human, interface, computer
c5    user, response time
...   ...
m1    trees

      c1      c2   c3   c4   c5   m1   m2   m3
c1
c2    0.91
c3    1.00
c4    1.00
c5    0.85
m1   -0.85
m2   -0.85
m3   -0.85

“This happened not because the HCI titles are similar, since they are not, but because they contrast with the non-HCI titles in a similar way.” - Landauer, T. K., Foltz, P. W., & Laham, D.

software project model algorithm object robot logic proof language robot
object visual

network wireless model game english player


Investigating the nature of the projects

Most frequent words in the titles
Same pre-processing
Use the number of times each term appears
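Counting how often each term appears in the titles can be sketched with the standard library alone. This is a simplified illustration, not the author's code: it replaces the lemma-based pre-processing with a regular-expression tokenizer, the stop-word list is hypothetical, and the third title is invented for the example (the first two come from the sample records shown earlier).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "for", "to", "of", "a", "an", "and", "in", "on"}

titles = [
    "Adaptive Motion for Quadruped Robots",
    "Mapping the CMMI Model to the ISO/IEC 12207 norm",
    "A Motion Model for Soccer Robots",  # invented title, for illustration only
]

def title_terms(title):
    """Lower-case a title and keep alphabetic words that are not stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]

# Number of times each term appears across all titles.
term_counts = Counter(term for title in titles for term in title_terms(title))
print(term_counts.most_common(3))
```

Running the same count separately for each university gives two ranked term lists that can be compared directly.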

Carnegie Mellon

Findings

Overall, the topics are similar at the two universities
Differences: at UFPE there is more work on business and startups, and at Carnegie Mellon more on robots and mobile
At UFPE there are many literature reviews; at Carnegie Mellon there are not

Call to action

Talk more about what we are working on (blogs, Medium, podcasts)
Instead of doing a study about a topic, do a study applying the topic

“It’s not enough to be good. In order to be found, you have to be findable.” - Austin Kleon

Thank you

Déborah Mesquita
https://github.com/dmesquita
https://medium.com/@dehhmesquita
https://twitter.com/dehhmesquita
