The L2F Strategy for Sentiment Analysis and Topic Classification




Fernando Batista and Ricardo Ribeiro

Outline
- Data
- Approach
- Experiments
- Features
- Submitted runs
- Conclusions and future work
- TASS discussion

Data

Labeled data: 7200 tweets
- Training: 80% (5755 tweets)
- Development: 20% (1444 tweets)

User information
- We used an XML file with information about the users who authored at least one of the tweets in the data.
- It includes the user type, which takes one of three possible values: periodista (journalist), famoso (famous person), and politico (politician).

Sentiment Lexicons in Spanish (Perez-Rosas et al., 2012)
- Only the most robust part was used: fullstrengthlexicon, which contains 1346 words automatically labelled with sentiment polarity.

Approach

Both tasks are treated as classification tasks and share the same method: Maximum Entropy (ME) models.
- The most successful recent experiments cast such tasks as binary classification problems (discriminating between two classes).
- Maximum Entropy models offer a clean way of expressing and combining different information properties.
- They produce probabilistic classifications, a generalisation of Boolean classification that provides probability distributions over the classes.
- The ME models used in this study were trained using the MegaM tool (Daume, 2004), which uses an efficient implementation of conjugate gradient (for binary problems).
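To make the probabilistic output concrete, here is a minimal sketch of how a binary Maximum Entropy model (equivalent to logistic regression) turns weighted features into a class probability. It is an illustration only, not MegaM itself: the function name, the feature names, and the weight values are all hypothetical.

```python
import math


def positive_probability(model, features):
    """Return P(first class | features) under a binary maximum entropy model.

    model:    dict mapping feature name -> learned weight (hypothetical values)
    features: dict mapping feature name -> feature value (e.g. 1.0 or 3.0)
    """
    # Linear score: sum of learned weight times feature value.
    score = sum(model.get(name, 0.0) * value for name, value in features.items())
    # The logistic function maps the score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights for an <other, P> style classifier.
model = {"bueno": 1.2, "NO_bueno": -1.5, "HTTP": 0.0}

p_plain = positive_probability(model, {"bueno": 1.0})
p_negated = positive_probability(model, {"NO_bueno": 1.0, "HTTP": 1.0})
print(p_plain, p_negated)
```

A probability above 0.5 favours the first class of the pair, and the probability itself can be reused later when several binary classifiers have to be compared.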

Sentiment Analysis

6 possible classes:
- N, N+: negative polarity
- P, P+: positive polarity
- NEU: contains both positive and negative sentiments
- NONE: without polarity information

The plus sign (+) signals the sentiment intensity.

Sentiment Analysis Strategy

The first interesting results were achieved by combining 5 different binary classifiers:
- <NONE, other> was used to discriminate between NONE and any other class.
- <other, N> and <other, P> detect negative and positive sentiments, respectively.
- <N, N+> and <P, P+> capture the sentiment intensity.

[Diagram: tweet data enters the NONE - Other classifier, which decides NONE. The remaining tweets go to the Other - N and Other - P classifiers, leading to the groups "N, N+", NEU, and "P, P+". The N - N+ classifier then separates N from N+, and the P - P+ classifier separates P from P+.]

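Sentiment Analysis: Adopted Strategy

[Diagram: tweet data feeds the Other - N and Other - P classifiers. Their outputs lead to the groups "N, N+", NEU, NONE, and "P, P+". The N - N+ classifier then separates N from N+, and the P - P+ classifier separates P from P+.]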
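The diagram does not spell out how the four classifier outputs are merged, so the following is only one plausible reading, sketched under stated assumptions: NEU is returned when both polarity classifiers fire (matching the class definition "contains both positive and negative sentiments"), NONE when neither fires, and a 0.5 decision threshold is used throughout. The function and parameter names are hypothetical.

```python
def combine_sentiment(p_neg, p_pos, p_neg_strong, p_pos_strong, threshold=0.5):
    """Merge four binary classifier probabilities into one of six labels.

    p_neg:        probability of N from the <other, N> classifier
    p_pos:        probability of P from the <other, P> classifier
    p_neg_strong: probability of N+ from the <N, N+> classifier
    p_pos_strong: probability of P+ from the <P, P+> classifier
    The merging rules below are assumptions, not taken from the slides.
    """
    is_neg = p_neg >= threshold
    is_pos = p_pos >= threshold
    if is_neg and is_pos:
        return "NEU"   # assumed: both polarities detected
    if is_neg:
        return "N+" if p_neg_strong >= threshold else "N"
    if is_pos:
        return "P+" if p_pos_strong >= threshold else "P"
    return "NONE"      # assumed: no polarity detected


print(combine_sentiment(0.8, 0.1, 0.7, 0.2))
```

Under this reading no dedicated <NONE, other> classifier is needed, which is consistent with that node being absent from the adopted-strategy diagram.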

Topic Detection

- 10 distinct binary classifiers, one for each topic.
- Each classifier selects its corresponding topic. If no topic is selected, the most probable topic is chosen based on the available classification probabilities.

[Diagram: tweet data feeds classifiers C1, C2, ..., C10, which output topic labels (Other, Eco, ..., Cin).]
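The selection rule with its fallback can be sketched as follows. This is a minimal illustration, assuming each binary classifier returns a probability for its topic and "selects" the topic at a 0.5 threshold; the threshold, the function name, and the example probabilities are assumptions, and the topic labels are the abbreviations visible in the diagram.

```python
def select_topics(topic_probs, threshold=0.5):
    """Pick every topic whose classifier fires; fall back to the most probable.

    topic_probs: dict mapping topic label -> probability from that topic's
                 binary classifier (hypothetical values in the example).
    """
    selected = [topic for topic, p in topic_probs.items() if p >= threshold]
    if selected:
        return selected
    # No classifier selected its topic: return the single most probable one.
    return [max(topic_probs, key=topic_probs.get)]


print(select_topics({"Other": 0.2, "Eco": 0.7, "Cin": 0.6}))
print(select_topics({"Other": 0.4, "Eco": 0.1, "Cin": 0.3}))
```

Because every classifier decides independently, a tweet may receive several topics, while the fallback guarantees that it always receives at least one.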

Experiments

Tweet content pre-processing
- The content of each tweet was first tokenized using twokenize (https://bitbucket.org/jasonbaldridge/twokenize/src), a tokenization tool for English tweets, with some minor modifications for dealing with Spanish instead of English.

Features
- Most of the features were used for both sentiment analysis and topic detection, with small differences, especially concerning the use of punctuation marks.

Features

The following features, concerning the tweet text, were used for each tweet:
- Punctuation marks: used as features for the sentiment task, but not for topic detection.
- Negation: all words after "nunca" (never) or "no" (no) were prefixed by "NO_" until reaching a punctuation mark or the end of the tweet.
- URLs: each token starting with "http:" was converted into the token "HTTP".
- Hashtags: all tokens starting with "#" were expanded into two tokens, one with and the other without the "#". A lesser weight was given to the stripped version of the token.
- Mentions: all tokens starting with "@" were used, but the token "@USER" was introduced as well, with a smaller weight.
- Lengthened words: all words with more than 3 repeating letters were also used. Whenever they occur, two more features are produced: "LONG_WORD", with a lower weight, and the corresponding word without repetitions, with a high weight (3.0).
- Case: all cased words were used, together with the corresponding lowercase words. Uppercase words were also assigned a higher weight, since they are often used for emphasis.

Apart from the features extracted from the text, two more features were used:
- Username of the author of the tweet.
- Usertype, corresponding to the user classification according to the file users-info.xml.

Some experiments use feature bigrams involving the following tokens: HTTP, words starting with "#" (without the "#" symbol), @USER, LONG_WORD, and all other words converted to lowercase.
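The text-based rules above can be sketched as a small feature extractor. This is an illustration under assumptions, not the authors' code: only the 3.0 weight for de-repeated words comes from the slides, while the other weight values, the punctuation set, the function name, and the choice to apply the "NO_" prefix only to plain words are hypothetical.

```python
import re

# Only DEREPEATED (3.0) is stated in the slides; the other weights are assumed.
LOW, DEFAULT, UPPER, DEREPEATED = 0.5, 1.0, 2.0, 3.0
PUNCT = set(".,;:!?")                 # assumed set of punctuation tokens
NEGATORS = {"no", "nunca"}
REPEAT = re.compile(r"(\w)\1{3,}")    # same character more than 3 times in a row


def extract_features(tokens, use_punct=True):
    """Map a tokenized tweet to a list of (feature, weight) pairs."""
    feats = []
    negated = False
    for tok in tokens:
        if tok in PUNCT:
            negated = False                       # negation scope ends here
            if use_punct:                         # sentiment task only
                feats.append((tok, DEFAULT))
        elif tok.startswith("http:"):
            feats.append(("HTTP", DEFAULT))
        elif tok.startswith("#") and len(tok) > 1:
            feats += [(tok, DEFAULT), (tok[1:], LOW)]
        elif tok.startswith("@"):
            feats += [(tok, DEFAULT), ("@USER", LOW)]
        else:
            prefix = "NO_" if negated else ""
            feats.append((prefix + tok, UPPER if tok.isupper() else DEFAULT))
            if tok != tok.lower():                # also keep the lowercase form
                feats.append((prefix + tok.lower(), DEFAULT))
            if REPEAT.search(tok):
                feats.append(("LONG_WORD", LOW))
                collapsed = REPEAT.sub(r"\1", tok.lower())
                feats.append((prefix + collapsed, DEREPEATED))
            if tok.lower() in NEGATORS:
                negated = True                    # prefix the following words
    return feats


tokens = ["no", "me", "gusta", ".", "GOOOOOL", "#tass", "http://t.co/x", "@ana"]
for feature in extract_features(tokens):
    print(feature)
```

Calling `extract_features(tokens, use_punct=False)` drops the punctuation features, mirroring the topic detection setting.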

Experiments

Sentiment analysis
- Baseline: 52.5 Accuracy (Acc) on the development set
- Plus the tweet author's name: 53.6 Acc (+1.1)
- Plus user type: 54.2 (+0.6)
- Best results, achieved by using punctuation marks: 55.1 Acc (+0.9)

Topic detection
- Differences across experiments were subtle, because improvements in one classifier may worsen results in another classifier.
- Adding the author's name produced slightly better results.
- Contrary to what was expected, providing the user type as a feature did not improve results.
- Adding punctuation marks decreased the overall performance.

Feature set          Sentiment analysis   Topic detection
Unigrams only        63.4 (55.2)          64.9 (43.2)
Unigrams, Bigrams    62.2 (53.8)          65.4 (42.5)
Sentiment lexicon    63.2 (54.8)

Future
- Improve tokenization and normalisation
- Named entity detection
- Specific writing styles
- Use the remaining available information. For example, the use of the sentiment polarity type (AGREEMENT, DISAGREEMENT), together with other information about the user (e.g. number of tweets, number of followers, number of following), would probably have an impact on the results.

Thank you