Benchmarking sentiment analysis methods for large-scale texts: A case for using continuum-scored words and word shift graphs. [arxiv]

Andrew J. Reagan, Brian Tivnan, Jake Ryland Williams, Christopher M. Danforth, Peter Sheridan Dodds


Abstract

The emergence and global adoption of social media have made possible the real-time estimation of population-scale sentiment, with profound implications for our understanding of human behavior. Given the growing assortment of sentiment-measuring instruments, comparisons between them are evidently required. Here, we perform detailed quantitative tests and qualitative assessments of 6 dictionary-based methods applied to 4 different corpora, and briefly examine a further 7 methods. We show that a dictionary-based method will only perform both reliably and meaningfully if (1) the dictionary covers a sufficiently large portion of a given text's lexicon when weighted by word usage frequency; and (2) words are scored on a continuous scale.
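The two conditions in the abstract can be illustrated with a short sketch. It is not the project's implementation and does not use the labMTsimple API: the names `TOY_SCORES`, `tokenize`, and `score_text` are hypothetical, the word scores are invented for illustration, and the 1 to 9 scale is an assumption. The sketch computes a frequency-weighted average of continuous word scores, along with the fraction of word usage that the dictionary covers.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: word -> score on an assumed continuous 1-9 scale.
# The values are invented for illustration, not taken from a published dictionary.
TOY_SCORES = {"happy": 8.0, "love": 8.5, "the": 5.0, "sad": 2.5, "hate": 2.0}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_text(text, scores):
    """Return (frequency-weighted average score, usage-weighted coverage).

    Coverage is the fraction of all word occurrences that appear in the
    dictionary. The average is None when no word is matched.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    matched_total = sum(n for w, n in counts.items() if w in scores)
    coverage = matched_total / total if total else 0.0
    if matched_total == 0:
        return None, coverage
    weighted_sum = sum(scores[w] * n for w, n in counts.items() if w in scores)
    return weighted_sum / matched_total, coverage


if __name__ == "__main__":
    average, coverage = score_text("I love the happy dog", TOY_SCORES)
    print(f"average score: {average:.2f}, coverage: {coverage:.0%}")
```

In this toy case three of the five tokens are in the dictionary, so the reported average rests on 60% of the word usage. A low coverage value signals that the score reflects only a small part of the text.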

Code

All code for this project is publicly available at https://github.com/andyreagan/sentiment-analysis-comparison. The sentiment dictionaries are linked from there; the data itself is found in the labMTsimple package at https://github.com/andyreagan/labMT-simple.

Data

Twitter (we provide the message IDs for all tweets used in our study, respecting the Streaming API Terms of Service).

Terms of service: https://dev.twitter.com/overview/terms/agreement-and-policy.

Message IDs (~200GB total): list of files by day.

New York Times.

License: https://catalog.ldc.upenn.edu/license/the-new-york-times-annotated-corpus-ldc2008t19.pdf.

Data: https://catalog.ldc.upenn.edu/LDC2008T19.

Movie Reviews (processed IMDb archive of the rec.arts.movies.reviews newsgroup, http://reviews.imdb.com/Reviews).

Available at http://www.cs.cornell.edu/people/pabo/movie-review-data/ as polarity dataset v2.0.

Direct link to details of this corpus: http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt.

Google Books.

License: Creative Commons Attribution-Non Commercial ShareAlike 3.0 Unported License.

Available at http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html.

A tarball of the full data directory used by the notebook scripts can be downloaded here (warning: the download is roughly 25 GB).