Social media analytics: a survey of techniques, tools and platforms

1 Introduction

Social media is defined as web-based and mobile-based Internet applications that allow the creation, access and exchange of user-generated content that is ubiquitously accessible (Kaplan and Haenlein 2010). Besides social networking media (e.g., Twitter and Facebook), for convenience, we will also use the term ‘social media’ to encompass really simple syndication (RSS) feeds, blogs, wikis and news, all typically yielding unstructured text and accessible through the web. Social media is especially important for research into computational social science that investigates questions (Lazer et al. 2009) using quantitative techniques (e.g., computational statistics, machine learning and complexity) and so-called big data for data mining and simulation modeling (Cioffi-Revilla 2010).

This has led to numerous data services, tools and analytics platforms. However, this easy availability of social media data for academic research may change significantly due to commercial pressures. In addition, as discussed in Sect. 2, the tools available to researchers are far from ideal. They either give superficial access to the raw data or (for non-superficial access) require researchers to program analytics in a language such as Java.

1.1 Terminology

We start with definitions of some of the key techniques related to analyzing unstructured textual data:

1.2 Research challenges

Social media scraping and analytics provides a rich source of academic research challenges for social scientists, computer scientists and funding bodies. Challenges include:

1.3 Social media research and applications

Social media data is clearly the largest, richest and most dynamic evidence base of human behavior, bringing new opportunities to understand individuals, groups and society. Innovative scientists and industry professionals are increasingly finding novel ways of automatically collecting, combining and analyzing this wealth of data. Naturally, doing justice to these pioneering social media applications in a few paragraphs is challenging. Three illustrative areas are: business, bioscience and social science.

The early business adopters of social media analysis were typically companies in retail and finance. Retail companies use social media for brand awareness, product and customer service improvement, advertising and marketing strategies, network structure analysis, news propagation and even fraud detection. In finance, social media is used for measuring market sentiment, and news data is used for trading. As an illustration, Bollen et al. (2011) measured the sentiment of a random sample of Twitter data, finding that Dow Jones Industrial Average (DJIA) prices were correlated with the Twitter sentiment 2–3 days earlier with 87.6 % accuracy. Wolfram (2010) used Twitter data to train a Support Vector Regression (SVR) model to predict prices of individual NASDAQ stocks, finding a ‘significant advantage’ for forecasting prices 15 min into the future.

In the biosciences, social media is being used to collect data on large cohorts for behavioral change initiatives and impact monitoring, such as tackling smoking and obesity or monitoring diseases. An example is the work of Penn State University biologists (Salathé et al. 2012), who have developed innovative systems and techniques to track the spread of infectious diseases with the help of news Web sites, blogs and social media.

Computational social science applications include: monitoring public responses to announcements, speeches and events, especially political comments and initiatives; insights into community behavior; social media polling of (hard to contact) groups; and early detection of emerging events, as with Twitter. For example, Lerman et al. (2008) use computational linguistics to automatically predict the impact of news on the public perception of political candidates. Yessenov and Misailovic (2009) use movie review comments to study the effect of various approaches to extracting text features on the accuracy of four machine learning methods: Naive Bayes, Decision Trees, Maximum Entropy and K-Means clustering. Lastly, Karabulut (2013) found that Facebook’s Gross National Happiness (GNH) exhibits peaks and troughs in line with major public events in the USA.

1.4 Social media overview

For this paper, we group social media tools into:

2 Social media methodology and critique

The two major impediments to using social media for academic research are firstly access to comprehensive data sets and secondly tools that allow ‘deep’ data analysis without the need to be able to program in a language such as Java. The majority of social media resources are commercial and companies are naturally trying to monetize their data. As discussed, it is important that researchers have access to open-source ‘big’ (social media) data sets and facilities for experimentation. Otherwise, social media research could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Recently, there has been a modest response, as Twitter and Gnip are piloting a new program for data access, starting with 5 all-access data grants to select applicants.

2.1 Methodology

Research requirements can be grouped into: data, analytics and facilities.

2.1.1 Data

Researchers need online access to historic and real-time social media data, especially the principal sources, to conduct world-leading research:

2.1.2 Analytics

Currently, social media data is typically either available via simple general routines or requires the researcher to program their analytics in a language such as MATLAB, Java or Python. As discussed above, researchers require:

2.1.3 Facilities

Lastly, the sheer volume of social media data being generated argues for national and international facilities to be established to support social media research (cf. Wharton Research Data Services https://wrds-web.wharton.upenn.edu):

2.2 Critique

Much needs to be done to support social media research. As discussed, the majority of current social media resources are commercial and expensive, and full access is difficult for academics to obtain.

2.2.1 Data

In general, access to important sources of social media data is frequently restricted and full commercial access is expensive.

2.2.2 Analytics

Analytical tools provided by vendors are often tied to a single data set, may be limited in analytical capability, and data charges make them expensive to use.

2.2.3 Facilities

There are an increasing number of powerful commercial platforms, such as the ones supplied by SAS and Thomson Reuters, but the charges are largely prohibitive for academic research. Either comparable facilities need to be provided by national science foundations or vendors need to be persuaded to introduce the concept of an ‘educational license.’

3 Social media data

Clearly, there is a large and increasing number of (commercial) services providing access to social networking media (e.g., Twitter, Facebook and Wikipedia) and news services (e.g., Thomson Reuters Machine Readable News). Equivalent major academic services are scarce. We start by discussing the types of data and formats produced by these services.

3.1 Types of data

Although we focus on social media, as discussed, researchers are continually finding new and innovative sources of data to bring together and analyze. So when considering textual data analysis, we should consider multiple sources (e.g., social networking media, RSS feeds, blogs and news) supplemented by numeric (financial) data, telecoms data, geospatial data and potentially speech and video data. Using multiple data sources is certainly the future of analytics.

Broadly, data subdivides into:

And into:

3.2 Text data formats

The four most common formats used to markup text are: HTML, XML, JSON and CSV.

For completeness, HTML and XML are so-called markup languages (markup and content) that define a set of simple syntactic rules for encoding documents in a format that is both human readable and machine readable. A markup element comprises a start-tag (e.g., <tag>), content text and an end-tag (e.g., </tag>).

Many feeds use JavaScript Object Notation (JSON), a lightweight data-interchange format based on a subset of the JavaScript programming language. JSON is a language-independent text format that uses conventions familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python and many others. JSON’s basic types are: Number, String, Boolean, Array (an ordered sequence of values, comma-separated and enclosed in square brackets) and Object (an unordered collection of key:value pairs). The JSON format is illustrated in Fig. 1 for a query on the Twitter API for the string ‘UCL,’ which returns two ‘text’ results from the Twitter user ‘uclnews.’

Fig. 1 JSON Example
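To make the JSON structure concrete, the short sketch below constructs a minimal, hypothetical result of the kind returned by the Twitter Search API for the query ‘UCL’ and parses it with Python’s standard json module; the field names echo the v1 search format discussed in Sect. 4.3.2.1, and the values are invented purely for illustration.

```python
import json

# A minimal, hypothetical Twitter Search API result for the query 'UCL'.
# Field names echo the old v1 search format; the values are invented.
raw = '''
{
  "query": "UCL",
  "results": [
    {"created_at": "Mon, 01 Jul 2013 10:00:00 +0000",
     "from_user": "uclnews",
     "text": "UCL researchers publish a new study"},
    {"created_at": "Mon, 01 Jul 2013 12:30:00 +0000",
     "from_user": "uclnews",
     "text": "UCL hosts a social media analytics workshop"}
  ]
}
'''

data = json.loads(raw)              # JSON Object -> Python dict
for tweet in data["results"]:       # JSON Array  -> Python list
    print(tweet["from_user"], "-", tweet["text"])
```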

Comma-separated values are not a single, well-defined format but rather refer to any text file that: (a) is plain text using a character set such as ASCII, Unicode or EBCDIC; (b) consists of text records (e.g., one record per line); (c) with records divided into fields separated by delimiters (e.g., comma, semicolon and tab); and (d) where every record has the same sequence of fields.

4 Social media providers

Social media data resources broadly subdivide into those providing:

4.1 Open-source databases

A major open source of social media is Wikipedia, which offers free copies of all available content to interested users (Wikimedia Foundation 2014). These databases can be used for mirroring, database queries and social media analytics. They include dumps from any Wikimedia Foundation project (http://dumps.wikimedia.org/) and English Wikipedia dumps in SQL and XML (http://dumps.wikimedia.org/enwiki/).

Another example of freely available data for research is the World Bank Databank (http://databank.worldbank.org/data/databases.aspx), which provides over 40 databases, such as Gender Statistics, Health Nutrition and Population Statistics, Global Economic Prospects, World Development Indicators and Global Development Finance. Most of the databases can be filtered by country/region, series/topic or time (years and quarters). In addition, tools are provided to allow reports to be customized and displayed in table, chart or map formats.

4.2 Data access via tools

As discussed, most commercial services provide access to social media data via online tools, both to control access to the raw data and increasingly to monetize the data.

4.2.1 Freely accessible sources

Google, with tools such as Trends and Insights, is a good example of this category. Google is the largest ‘scraper’ in the world, but it does its best to ‘discourage’ scraping of its own pages. (For an introduction to how to surreptitiously scrape Google, and avoid being ‘banned,’ see http://google-scraper.squabbel.com.) Google’s strategy is to provide a wide range of packages, such as Google Analytics, rather than the programmable HTTP-based APIs that would be more useful from a researcher’s viewpoint.

Figure 2 illustrates how Google Trends displays a particular search term, in this case ‘libor.’ Using Google Trends you can compare up to five topics at a time and also see how often those topics have been mentioned and in which geographic regions the topics have been searched for the most.

Fig. 2 Google Trends

4.2.2 Commercial sources

There is an increasing number of commercial services that scrape social networking media and then provide paid-for access via simple analytics tools. (The more comprehensive platforms with extensive analytics are reviewed in Sect. 8.) In addition, companies such as Twitter are both restricting free access to their data and licensing their data to commercial data resellers, such as Gnip and DataSift.

Gnip is the world’s largest provider of social data. Gnip was the first to partner with Twitter to make their social data available and, since then, has been the first to work with Tumblr, Foursquare, WordPress, Disqus, StockTwits and other leading social platforms. Gnip delivers social data to customers in more than 40 countries, and Gnip’s customers deliver social media analytics to more than 95 % of the Fortune 500. Real-time data from Gnip can be delivered as a ‘Firehose’ of every single activity or via PowerTrack, a proprietary filtering tool that allows users to build queries around only the data they need. PowerTrack rules can filter data streams based on keywords, geo boundaries, phrase matches and even the type of content or media in the activity. The company then offers enrichments to these data streams, such as Profile Geo (to add significantly more usable geo data for Twitter), URL expansion and language detection, to further enhance the value of the data delivered. In addition to real-time data access, the company also offers Historical PowerTrack and Search API access for Twitter, which give customers the ability to pull any Tweet since the first message on March 21, 2006.

Gnip provides access to premium data feeds (Gnip’s ‘Complete Access’ sources are publishers that have an agreement with Gnip to resell their data) and free data feeds (Gnip’s ‘Managed Public API Access’ sources provide access to normalized and consolidated free data from their APIs, although Gnip’s paid services are required for the Data Collectors) via its dashboard (see Fig. 3). The user only sees the feeds in the dashboard that were paid for under a sales agreement. To select a feed, the user clicks on a publisher and then chooses a specific feed from that publisher, as shown in Fig. 3. Different types of feeds serve different use cases and correspond to different types of queries and API endpoints on the publisher’s source API. After selecting the feed, the user is assisted by Gnip to configure it with any required parameters before it begins collecting data; this includes adding at least one rule. Under ‘Get Data’ > ‘Advanced Settings,’ the user can also configure how often the feed queries the source API for data (the ‘query rate’) and choose between the publisher’s native data format and Gnip’s Activity Streams format (XML for Enterprise Data Collector feeds).

Fig. 3 Gnip Dashboard, Publishers and Feeds

4.3 Data feed access via APIs

For researchers, arguably the most useful sources of social media data are those that provide programmable access via APIs, typically using HTTP-based protocols. Given their importance to academics, here, we review individually wikis, social networking media, RSS feeds, news, etc.

4.3.1 Wiki media

Wikipedia (and wikis in general) provides academics with large open-source repositories of user-generated (crowd-sourced) content. What is not widely known is that Wikipedia provides HTTP-based APIs that allow programmable access and searching (i.e., scraping) that returns data in a variety of formats, including XML. In fact, the API is not unique to Wikipedia but is part of MediaWiki’s (http://www.mediawiki.org/) open-source toolkit and hence can be used with any MediaWiki-based wiki.

The wiki HTTP-based API works by accepting requests containing one or more input arguments and returning strings, often in XML format, that can be parsed and used by the requesting client. Other supported formats include JSON, WDDX, YAML and PHP serialized. An example request is: http://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size&acprefix=hollywood&format=xml.

The HTTP request must contain: (a) the requested ‘action,’ such as a query, edit or delete operation; (b) an authentication request; and (c) any other supported actions. For example, the above request returns an XML string listing the first 10 Wikipedia categories with the prefix ‘hollywood.’ Vaswani (2011) provides a detailed description of how to scrape Wikipedia using an Apache/PHP development environment and an HTTP client capable of transmitting GET and PUT requests and handling responses.
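As a minimal illustration of the request/response cycle described above, the following Python sketch issues the same allcategories query and parses the XML reply using only the standard library; the element names in the response (e.g., <c> for a category) reflect the MediaWiki API output at the time of writing and should be checked against the current API documentation.

```python
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

# Same example query as in the text: the first 10 categories prefixed 'hollywood'.
params = {
    "action": "query",
    "list": "allcategories",
    "acprop": "size",
    "acprefix": "hollywood",
    "format": "xml",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)

request = urllib.request.Request(url, headers={"User-Agent": "research-scraper/0.1"})
with urllib.request.urlopen(request) as response:
    xml_text = response.read().decode("utf-8")

# Parse the XML string returned by the MediaWiki API.
root = ET.fromstring(xml_text)
for category in root.iter("c"):          # each <c> element is one category
    print(category.get("size"), category.text)
```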

4.3.2 Social networking media

As with Wikipedia, popular social networks, such as Facebook, Twitter and Foursquare, make a proportion of their data accessible via APIs.

Although many social networking media sites provide APIs, not all sites (e.g., Bing, LinkedIn and Skype) provide API access for scraping data. While more and more social networks are shifting to publicly available content, many leading networks are restricting free access, even to academics. For example, Foursquare announced in December 2013 that it will no longer allow private check-ins on iOS 7, and has now partnered with Gnip to provide a continuous stream of anonymized check-in data. The data is available in two packages: the full Firehose access level and a filtered version via Gnip’s PowerTrack service. Here, we briefly discuss the APIs provided by Twitter and Facebook.

4.3.2.1 Twitter

The default account setting keeps users’ Tweets public, although users can protect their Tweets and make them visible only to their approved Twitter followers; fewer than 10 % of all Twitter accounts are private. Tweets from public accounts (including replies and mentions) are available in JSON format through Twitter’s Search API for batch requests of past data and through the Streaming API for near real-time data.

One may retrieve recent Tweets containing particular keywords through Twitter’s Search API (part of REST API v1.1) with the following API call: https://api.twitter.com/1.1/search/tweets.json?q=APPLE and real-time data using the streaming API call: https://stream.twitter.com/1/statuses/sample.json.

Twitter’s Streaming API allows data to be accessed via filtering (by keywords, user IDs or location) or by sampling of all updates from a selected set of users. The default access level, ‘Spritzer,’ allows sampling of roughly 1 % of all public statuses, with the option to retrieve 10 % of all statuses via the ‘Gardenhose’ access level (more suitable for data mining and research applications). In social media, streaming APIs are often called a Firehose: a syndication feed that publishes all public activities as they happen in one big stream. Twitter has recently announced the Twitter Data Grants program, through which researchers can apply for access to Twitter’s public tweets and historical data in order to gain insights from its massive data set (Twitter has more than 500 million tweets a day); research institutions and academics will not get the Firehose access level but instead will only get the data set needed for their research project. Researchers can apply at the following address: https://engineering.twitter.com/research/data-grants.
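The sketch below shows what a call to the Search API endpoint quoted above might look like from Python; it assumes the reader has registered a Twitter application and obtained an OAuth 2 bearer token (the BEARER_TOKEN placeholder is hypothetical) and uses the third-party requests library for brevity.

```python
import requests

BEARER_TOKEN = "..."  # placeholder: obtained by registering a Twitter application

# Search API (REST API v1.1): recent Tweets containing the keyword 'APPLE'.
url = "https://api.twitter.com/1.1/search/tweets.json"
headers = {"Authorization": "Bearer " + BEARER_TOKEN}
params = {"q": "APPLE", "count": 10, "result_type": "mixed"}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()

# Each status object carries fields such as 'created_at', 'user' and 'text'.
for tweet in response.json().get("statuses", []):
    print(tweet["created_at"], tweet["user"]["screen_name"], tweet["text"])
```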

Twitter results are stored in a JSON array of objects containing the fields shown in Fig. 4. The JSON array consists of a list of objects matching the supplied filters and the search string, where each object is a Tweet and its structure is clearly specified by the object’s fields, e.g., ‘created_at’ and ‘from_user’. The example in Fig. 4 consists of the output of calling Twitter’s GET search API via http://search.twitter.com/search.json?q=financial%20times&rpp=1&include_entities=true&result_type=mixed where the parameters specify that the search query is ‘financial times,’ one result per page, each Tweet should have a node called ‘entities’ (i.e., metadata about the Tweet) and list ‘mixed’ results types, i.e., include both popular and real-time results in the response.

Fig. 4 Example Output in JSON for Twitter REST API v1

4.3.2.2 Facebook

Facebook’s privacy issues are more complex than Twitter’s, meaning that many status messages are harder to obtain than Tweets, requiring ‘open authorization’ status from users. Facebook currently stores all data as objects and has a series of APIs, ranging from the Graph and Public Feed APIs to the Keyword Insight API. In order to access the properties of an object, its unique ID must be known to make the API call. Facebook’s Search API (part of Facebook’s Graph API) can be accessed by calling https://graph.facebook.com/search?q=QUERY&type=page. The detailed API query format is shown in Fig. 5. Here, ‘QUERY’ can be replaced by any search term, and ‘page’ can be replaced with ‘post,’ ‘user,’ ‘page,’ ‘event,’ ‘group,’ ‘place,’ ‘checkin,’ ‘location’ or ‘placetopic.’ The results of this search will contain the unique ID for each object. Given the individual ID for a particular search result, one can use https://graph.facebook.com/ID to obtain further page details such as the number of ‘likes.’ This kind of information is of interest to companies when it comes to brand awareness and competition monitoring.

Fig. 5 Facebook Graph API Search Query Format

The Facebook Graph API search queries require an access token included in the request. Searching for pages and places requires an ‘app access token’, whereas searching for other types requires a user access token.

Replacing ‘page’ with ‘post’ in the aforementioned search URL will return all public statuses containing the search term. Batch requests can be sent by following the procedure outlined at https://developers.facebook.com/docs/reference/api/batch/, and information on retrieving real-time updates can be found at https://developers.facebook.com/docs/reference/api/realtime/. Facebook also returns data in JSON format, so it can be retrieved and stored using the same methods as for data from Twitter, although the fields differ depending on the search type, as illustrated in Fig. 6.

Fig. 6 Facebook Graph API Search Results for q=’Centrica’ and type=’page’
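A minimal sketch of the two-step pattern described above (search for objects, then fetch an object by its ID) is given below in Python; the access token is a placeholder, the requests library is a convenience choice, and the exact fields returned (e.g., 'likes') depend on the Graph API version in use.

```python
import requests

ACCESS_TOKEN = "..."  # placeholder: app or user access token, depending on the query type

# Step 1: Graph API search for public pages matching the query 'Centrica'.
search_url = "https://graph.facebook.com/search"
params = {"q": "Centrica", "type": "page", "access_token": ACCESS_TOKEN}
results = requests.get(search_url, params=params).json()

# Step 2: use each object's unique ID to retrieve further details, e.g. 'likes'.
for page in results.get("data", []):
    detail = requests.get("https://graph.facebook.com/" + page["id"],
                          params={"access_token": ACCESS_TOKEN}).json()
    print(detail.get("name"), detail.get("likes"))
```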

4.3.3 RSS feeds

A large number of Web sites already provide access to content via RSS feeds. This is the syndication standard for publishing regular updates to web-based content based on a type of XML file that resides on an Internet server. For Web sites, RSS feeds can be created manually or automatically (with software).

An RSS feed reader reads the RSS feed file, finds what is new, converts it to HTML and displays it. The program fragment in Fig. 7 shows the control and channel statements for an RSS feed. The channel statements define the overall feed or channel; there is one set of channel statements per RSS file.

Fig. 7 Example RSS Feed Control and Channel Statements
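As a rough sketch of what such a reader does, the Python fragment below uses the third-party feedparser module (an assumption; any RSS library would do) to read a hypothetical feed, print the channel information and list the items.

```python
import feedparser  # third-party module: pip install feedparser

# Hypothetical feed URL; any feed conforming to the RSS standard will do.
FEED_URL = "http://example.com/news/rss.xml"

feed = feedparser.parse(FEED_URL)

# The channel statements describe the overall feed ...
print("Channel:", feed.feed.get("title"), "-", feed.feed.get("description"))

# ... while each item element is one update within that channel.
for entry in feed.entries:
    print(entry.get("published"), entry.get("title"), entry.get("link"))
```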

4.3.4 Blogs, news groups and chat services

Blog scraping is the process of scanning through a large number of blogs, usually daily, searching for and copying content. This process is conducted through automated software. Figure 8 illustrates example code for blog scraping: it retrieves a Web site’s source code via Java’s URL class, which can then be parsed via regular expressions to capture the target content.

Fig. 8 Example Code for Blog Scraping
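As a rough Python equivalent of the Java approach described above (fetch the page source, then apply regular expressions to capture the target content), consider the hedged sketch below; the URL and the pattern are invented for illustration, and in practice an HTML parser is usually more robust than regular expressions.

```python
import re
import urllib.request

# Hypothetical blog URL; a real scraper would iterate over many blogs, usually daily.
URL = "http://example-blog.com/"

request = urllib.request.Request(URL, headers={"User-Agent": "research-scraper/0.1"})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="ignore")

# Capture target content with a regular expression, here the text of <h2> headings.
titles = re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.DOTALL)
for title in titles:
    print(re.sub(r"<[^>]+>", "", title).strip())   # strip any nested tags
```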

4.3.5 News feeds

News feeds are delivered in a variety of textual formats, often as machine-readable XML documents, JSON or CSV files. They include numerical values, tags and other properties that represent the underlying news stories. For testing purposes, historical information is often delivered via flat files, while live data for production is processed and delivered through direct data feeds or APIs. Figure 9 shows a snippet of the software calls to retrieve filtered NY Times articles.

Fig. 9 Scraping New York Times Articles
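The calls in Fig. 9 are not reproduced here; as an indicative alternative, the sketch below queries the New York Times Article Search API, whose endpoint, parameters and API key (a placeholder) are assumptions that should be checked against the current NYT developer documentation.

```python
import requests

API_KEY = "..."  # placeholder: key issued by the NYT developer site

# Article Search endpoint, filtered by a query term and a begin date (YYYYMMDD).
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {"q": "financial markets", "begin_date": "20140101", "api-key": API_KEY}

payload = requests.get(url, params=params).json()

# Each returned document carries metadata such as the publication date and headline.
for doc in payload.get("response", {}).get("docs", []):
    print(doc.get("pub_date"), doc.get("headline", {}).get("main"))
```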

Having examined the ‘classic’ social media data feeds, as an illustration of scraping innovative data sources, we will briefly look at geospatial feeds.

4.3.6 Geospatial feeds

Much of the ‘geospatial’ social media data come from mobile devices that generate location- and time-sensitive data. One can differentiate between four types of mobile social media feeds (Kaplan 2012):

With increasingly advanced mobile devices, notably smartphones, content (photos, SMS messages, etc.) has geographical identification added, a process known as ‘geotagging.’ These geospatial metadata are usually latitude and longitude coordinates, though they can also include altitude, bearing, distance, accuracy data or place names. GeoRSS is an emerging standard for encoding geographic location into a web feed, with two primary encodings: GeoRSS Geography Markup Language (GML) and GeoRSS Simple.

Example tools are GeoNetwork opensource, a free and comprehensive cataloging application for geographically referenced information, and FeedBurner, a web feed provider that can also provide geotagged feeds if the specified feed settings allow it.

As an illustration, Fig. 10 shows pseudo-code for analyzing a geospatial feed.

Fig. 10 Pseudo-code for Analyzing a Geospatial Feed
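To complement the pseudo-code in Fig. 10, the self-contained Python sketch below extracts latitude/longitude pairs from a hand-written GeoRSS Simple fragment using the standard library; the feed content is invented, and real feeds may use the GML encoding instead.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written GeoRSS Simple fragment (values are invented).
GEORSS = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <entry>
    <title>Geotagged photo near UCL</title>
    <georss:point>51.5246 -0.1340</georss:point>
  </entry>
</feed>
"""

ns = {"atom": "http://www.w3.org/2005/Atom",
      "georss": "http://www.georss.org/georss"}

root = ET.fromstring(GEORSS)
for entry in root.findall("atom:entry", ns):
    title = entry.findtext("atom:title", namespaces=ns)
    point = entry.findtext("georss:point", namespaces=ns)
    lat, lon = (float(x) for x in point.split())   # latitude/longitude pair
    print(title, lat, lon)
```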

5 Text cleaning, tagging and storing

The importance of ‘quality versus quantity’ of data in social media scraping and analytics cannot be overstated (i.e., garbage in and garbage out). In fact, many details of analytics models are defined by the types and quality of the data. The nature of the data will also influence the database and hardware used.

Naturally, unstructured textual data can be very noisy (i.e., dirty). Hence, data cleaning (or cleansing, scrubbing) is an important area in social media analytics. The process of data cleaning may involve removing typographical errors or validating and correcting values against a known list of entities. Specifically, text may contain misspelled words, quotations, program code, extra spaces, extra line breaks, special characters, foreign words, etc. So in order to achieve high-quality text mining, it is necessary to conduct data cleaning as the first step: spell checking, removing duplicates, finding and replacing text, changing the case of text, removing spaces and non-printing characters from text, fixing numbers, number signs and outliers, fixing dates and times, transforming and rearranging columns, rows and table data, etc.

Having reviewed the types and sources of raw data, we now turn to ‘cleaning’ or ‘cleansing’ the data to remove incorrect, inconsistent or missing information. Before discussing strategies for data cleaning, it is essential to identify possible data problems (Narang 2009):

5.1 Cleansing data

A traditional approach to text data cleaning is to ‘pull’ data into a spreadsheet or spreadsheet-like table and then reformat the text. For example, Google Refine is a standalone desktop application for data cleaning and transformation to various formats. Transformation expressions are written in the proprietary Google Refine Expression Language (GREL) or in Jython (an implementation of the Python programming language written in Java). Figure 11 illustrates text cleansing.

Fig. 11 Text Cleansing Pseudo-code
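As a minimal sketch of a few of the cleaning steps listed above (collapsing whitespace, removing non-printing characters, normalizing case and dropping exact duplicates), consider the following Python fragment; it is illustrative only and far from a complete cleaning pipeline.

```python
import re
import unicodedata

def clean_text(text):
    """Apply a few basic cleaning steps to a single text snippet."""
    # Collapse extra spaces, tabs and line breaks into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop any remaining non-printing (control) characters.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    # Normalize case for downstream matching.
    return text.lower()

def clean_corpus(texts):
    """Clean a list of texts and remove exact duplicates, preserving order."""
    seen, cleaned = set(), []
    for t in texts:
        c = clean_text(t)
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned

print(clean_corpus(["Great   product!!\n", "great product!!", "Terrible\tservice"]))
# ['great product!!', 'terrible service']
```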

5.2 Tagging unstructured data

Since most of the social media data is generated by humans and therefore is unstructured (i.e., it lacks a pre-defined structure or data model), an algorithm is required to transform it into structured data to gain any insight. Therefore, unstructured data need to be preprocessed, tagged and then parsed in order to quantify/analyze the social media data.

Adding extra information to the data (i.e., tagging the data) can be performed manually or via rules engines, which seek patterns or interpret the data using techniques such as data mining and text analytics. Algorithms exploit the linguistic, auditory and visual structure inherent in all forms of human communication. Tagging the unstructured data usually involves tagging it with metadata or part-of-speech (POS) tags. Clearly, the unstructured nature of social media data leads to ambiguity and irregularity when it is processed by a machine in an automatic fashion.
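As an indicative example of POS tagging, the sketch below uses the Python NLTK library (introduced later in Sect. 7.4); the model names passed to nltk.download reflect current NLTK releases and may differ in older versions.

```python
import nltk

# One-off downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tweet = "UCL researchers release new social media analytics toolkit"

tokens = nltk.word_tokenize(tweet)   # split the raw text into word tokens
tagged = nltk.pos_tag(tokens)        # attach a part-of-speech tag to each token

print(tagged)
# e.g. [('UCL', 'NNP'), ('researchers', 'NNS'), ('release', 'VBP'), ...]
```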

Using a single data set can provide some interesting insights. However, combining more data sets and processing the unstructured data can result in more valuable insights, allowing us to answer questions that were impossible beforehand.

5.3 Storing data

As discussed, the nature of the social media data strongly influences the design of the database and possibly the supporting hardware. It is also important to note that each social platform has very specific (and narrow) rules about how its data can be stored and used; these can be found in the Terms of Service for each platform.

For completeness, databases comprise:

5.3.1 Apache (noSQL) databases and tools

The growth of ultra-large Web sites such as Facebook and Google has led to the development of noSQL databases as a way of breaking through the speed constraints that relational databases incur. A key driver has been Google’s MapReduce, i.e., the software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers (Chandrasekar and Kowsalya 2011). It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004. The model is inspired by the ‘Map’ and ‘Reduce’ functions commonly used in functional programming. MapReduce (conceptually) takes as input a list of records, and the ‘Map’ computation splits them among the different computers in a cluster. The result of the Map computation is a list of key/value pairs. The corresponding ‘Reduce’ computation takes each set of values that has the same key and combines them into a single value. A MapReduce program is composed of a ‘Map()’ procedure for filtering and sorting and a ‘Reduce()’ procedure for a summary operation (e.g., counting and grouping).

Figure 12 provides the canonical example application of MapReduce: counting the appearances of each different word in a set of documents (MapReduce 2011).

Fig. 12 The Canonical Example Application of MapReduce
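To make the word-count example concrete without a cluster, the sketch below mimics the Map, shuffle and Reduce steps in plain, single-machine Python; a framework such as Hadoop would distribute exactly these steps across many nodes.

```python
from itertools import groupby
from operator import itemgetter

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog barks",
]

# Map: emit a (word, 1) pair for every word occurrence in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: bring together all pairs that share the same key (the word).
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the values for each key to obtain the word counts.
counts = {word: sum(count for _, count in pairs) for word, pairs in grouped}
print(counts)   # e.g. {'barks': 1, 'brown': 1, 'dog': 2, 'fox': 1, ...}
```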

5.3.1.1 Apache open-source software

The research community is increasingly using Apache software for social media analytics. Within the Apache Software Foundation, three levels of software are relevant:

6 Social media analytics techniques

As discussed, opinion mining (or sentiment analysis) is an attempt to take advantage of the vast amounts of user-generated text and news content online. One of the primary characteristics of such content is its textual disorder and high diversity. Here, natural language processing, computational linguistics and text analytics are deployed to identify and extract subjective information from source text. The general aim is to determine the attitude of a writer (or speaker) with respect to some topic or the overall contextual polarity of a document.

6.1 Computational science techniques

Automated sentiment analysis of digital texts uses elements from machine learning such as latent semantic analysis, support vector machines, bag-of-words model and semantic orientation (Turney 2002). In simple terms, the techniques employ three broad areas:

Machine learning aims to solve problems involving huge amounts of data with many variables and is commonly used in areas such as pattern recognition (speech, images), financial algorithms (credit scoring, algorithmic trading) (Nuti et al. 2011), energy forecasting (load, price) and biology (tumor detection, drug discovery). Figure 13 illustrates the two types of learning in machine learning and their algorithm categories.

Fig. 13 Machine Learning Overview

These techniques are deployed in two ways:

6.1.1 Stream processing

Lastly, we should mention stream processing (Botan et al. 2010). Increasingly, analytics applications that consume real-time social media, financial ‘ticker’ and sensor network data need to process high-volume temporal data with low latency. These applications require support for online analysis of rapidly changing data streams. However, traditional database management systems (DBMSs) have no pre-defined notion of time and cannot handle data online in near real time. This has led to the development of Data Stream Management Systems (DSMSs) (Hebrail 2008), which process data in main memory without storing it on disk, handle transient data streams online and run continuous queries on those streams. Example commercial systems include the Oracle CEP engine, StreamBase and Microsoft’s StreamInsight (Chandramouli et al. 2010).

6.2 Sentiment analysis

Sentiment is about mining attitudes, emotions and feelings; it deals with subjective impressions rather than facts. Generally speaking, sentiment analysis aims to determine the attitude expressed by the text writer or speaker with respect to a topic or the overall contextual polarity of a document (Mejova 2009). Pang and Lee (2008) provide thorough documentation of the fundamentals and approaches of sentiment classification and extraction, including sentiment polarity, degrees of positivity, subjectivity detection, opinion identification, non-factual information, term presence versus frequency, POS (parts of speech), syntax, negation, topic-oriented features and term-based features beyond term unigrams.

6.2.1 Sentiment classification

Sentiment analysis divides into specific subtasks:

Perhaps the most difficult analysis is identifying sentiment orientation/polarity and strength: positive (wonderful, elegant, amazing, cool), neutral (fine, ok) and negative (horrible, disgusting, poor, flakey, sucks), not least because of slang.

A popular approach is to assign orientation/polarity scores (+1, 0, −1) to all words: positive opinion (+1), neutral opinion (0) and negative opinion (−1). The overall orientation/polarity score of the text is then the sum of the orientation scores of all ‘opinion’ words found. However, there are various potential problems with this simplistic approach, such as negation (e.g., ‘there is nothing I hate about this product’). One method of estimating the sentiment orientation/polarity of a text is pointwise mutual information (PMI), a measure of association used in information theory and statistics.
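For reference, a common formulation (following Turney 2002, cited above) defines PMI between two terms and derives a semantic orientation score for a phrase by comparing its association with paradigm words such as ‘excellent’ and ‘poor’:

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$$

$$\mathrm{SO}(\mathrm{phrase}) = \mathrm{PMI}(\mathrm{phrase}, \text{`excellent'}) - \mathrm{PMI}(\mathrm{phrase}, \text{`poor'})$$

Here p(w1, w2) is the probability that the two terms co-occur; a positive SO indicates a positive orientation and a negative SO a negative one.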

6.2.2 Supervised learning methods

There are a number of popular computational statistics and machine learning techniques used for sentiment analysis. For a good introduction, see Khan et al. (2010). Techniques include:

The bag-of-words model is a simplifying representation commonly used in natural language processing and information retrieval (IR), where a sentence or a document is represented as an unordered collection of words, disregarding grammar and even word order. It is a model traditionally applied to sentiment analysis thanks to its simplicity.

6.2.2.1 Naïve Bayes classifier (NBC)

As an example of sentiment analysis, we briefly describe a Naive Bayes classifier (Murphy 2006). The Naive Bayes classifier is general purpose, simple to implement and works well for a range of applications. It classifies data in two steps:

The classifier calculates the probability of a text belonging to each of the categories tested against, and the category with the highest probability for the given text wins:

$$\text{classify}(\text{word}_1, \text{word}_2, \ldots, \text{word}_n) = \arg\max_{\text{cat}} P(\text{cat}) \prod_{i=1}^{n} P(\text{word}_i \mid \text{cat})$$

Figure 14 provides an example of sentiment classification using a Naïve Bayes classifier in Python. There are a number of Naïve Bayes classifier programs available in Java, including the jBNC toolkit (http://jbnc.sourceforge.net), WEKA (www.cs.waikato.ac.nz/ml/weka) and Alchemy API (www.alchemyapi.com/api/demo.html).

Fig. 14 Sentiment Classification Example using Python
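In the same spirit as Fig. 14, the following is a minimal, self-contained sketch using NLTK’s NaiveBayesClassifier with bag-of-words features; the tiny training set is invented purely for illustration, and a real application would use a much larger labeled corpus.

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Bag-of-words features: each word present in the text maps to True."""
    return {word: True for word in text.lower().split()}

# Tiny, invented training set of (text, label) pairs.
train = [
    ("an amazing wonderful product", "pos"),
    ("elegant design and cool features", "pos"),
    ("horrible service and poor quality", "neg"),
    ("this is disgusting and it sucks", "neg"),
]

classifier = NaiveBayesClassifier.train([(features(text), label) for text, label in train])

print(classifier.classify(features("wonderful elegant phone")))   # expected: 'pos'
print(classifier.classify(features("poor quality it sucks")))     # expected: 'neg'
```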

We next look at the range of social media tools available, starting with ‘tools’ and ‘toolkits,’ and in the subsequent section at ‘comprehensive’ social media platforms. Since there are a large number of social media textual data services, tools and platforms, we restrict ourselves to examining a few leading examples.

7 Social media analytics tools

The opinion mining field is crowded with (commercial) providers, most of which are skewed toward sentiment analysis of customer feedback about products and services. Fortunately, there is a vast spectrum of tools for textual analysis, ranging from simple open-source tools to libraries, multi-function commercial toolkits and platforms. This section focuses on individual tools and toolkits for scraping, cleaning and analytics; the next section looks at what we call social media platforms, which provide both archive data and real-time feeds, as well as sophisticated analytics tools.

7.1 Scientific programming tools

Popular scientific analytics libraries and tools have been enhanced to provide support for sourcing, searching and analyzing text. Examples include: R—used for statistical programming, MATLAB—used for numeric scientific programming, and Mathematica—used for symbolic scientific programming (computer algebra).

Data processing and data modeling, e.g., regression analysis, are straightforward using MATLAB, which provides time-series analysis, GUIs and array-based statistics. MATLAB is significantly faster than many traditional programming languages for numerical work and can be used for a wide range of applications. Moreover, its exhaustive built-in plotting functions make it a comprehensive analytics toolkit. More computationally powerful algorithms can be developed using it in conjunction with add-on packages (e.g., FastICA for independent component analysis).

Python can be used for (natural) language detection, title and content extraction, and query matching; when used in conjunction with a module such as scikit-learn, it can be trained to perform sentiment analysis, e.g., using a Naive Bayes classifier.

Another example, Apache UIMA (Unstructured Information Management Architecture), is an open-source project for analyzing ‘big data’ and discovering information that is relevant to the user.

7.2 Business toolkits

Business Toolkits are commercial suites of tools that allow users to source, search and analyze text for a range of commercial purposes.

SAS Sentiment Analysis Manager, part of the SAS Text Analytics program, can be used to scrape content sources, including mainstream Web sites and social media outlets as well as internal organizational text sources, and to create reports that describe the expressed feelings of consumers, customers and competitors in real time.

RapidMiner (Hirudkar and Sherekar 2013) is a popular toolkit offering an open-source Community Edition released under the GNU AGPL, as well as an Enterprise Edition offered under a commercial license. RapidMiner provides data mining and machine learning procedures, including data loading and transformation (Extract, Transform, Load, a.k.a. ETL), data preprocessing and visualization, modeling, evaluation and deployment. RapidMiner is written in Java and uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modeling schemes from the R project.

Other examples are Lexalytics, which provides a commercial sentiment analysis engine for many OEM and direct customers, and IBM SPSS Statistics, one of the most widely used programs for statistical analysis in social science.

7.3 Social media monitoring tools

Social media monitoring tools are sentiment analysis tools for tracking and measuring what people are saying (typically) about a company or its products, or any topic across the web’s social media landscape.

Examples in the area of social media monitoring include: Social Mention (http://socialmention.com/), which provides social media alerts similar to Google Alerts; Amplified Analytics (http://www.amplifiedanalytics.com/), which focuses on product reviews and marketing information; Lithium Social Media Monitoring; and Trackur, an online reputation monitoring tool that tracks what is being said on the Internet.

Google also provides a few useful free tools. Google Trends shows how often a particular search term is entered relative to the total search volume. Another tool built around Google Search is Google Alerts, a content change detection tool that provides notifications automatically. Google also acquired FeedBurner, an RSS feed management service, in 2007.

7.4 Text analysis tools

Text analysis tools are broad-based tools for natural language processing and text analysis. Examples of companies in the text analysis area include OpenAmplify and Jodange, whose tools automatically filter and aggregate thoughts, feelings and statements from traditional and social media.

There are also a large number of freely available tools produced by academic groups and non-governmental organizations (NGOs) for sourcing, searching and analyzing opinions. Examples include the Stanford NLP group tools and LingPipe, a suite of Java libraries for the linguistic analysis of human language (Teufl et al. 2010).

A variety of open-source text analytics tools are available, especially for sentiment analysis. A popular open-source text analysis tool is the Python Natural Language Toolkit (NLTK, www.nltk.org/), which includes open-source Python modules, linguistic data and documentation for text analytics. Another is GATE (http://gate.ac.uk/sentiment).

We should also mention the Lexalytics Sentiment Toolkit, which performs automatic sentiment analysis on input documents. It is powerful when used on a large number of documents, but it does not perform data scraping.

Other commercial software for text mining include: AeroText, Attensity, Clarabridge, IBM LanguageWare, SPSS Text Analytics for Surveys, Language Computer Corporation, STATISTICA Text Miner and WordStat.

7.5 Data visualization tools

Data visualization tools provide business intelligence (BI) capabilities and allow different types of users to gain insights from ‘big’ data. Users can perform exploratory analysis through interactive user interfaces available on the majority of devices, with a recent focus on mobile devices (smartphones and tablets). Data visualization tools help users identify patterns, trends and relationships in the data that were previously latent. Fast ad hoc visualization of the data can reveal patterns and outliers, and it can be performed on large-scale data sets using frameworks such as Apache Hadoop or Amazon Kinesis. Two notable visualization tools are SAS Visual Analytics and Tableau.

7.6 Case study: SAS Sentiment Analysis and Social Media Analytics

SAS is a leading advanced analytics software suite for BI, data management and predictive analytics. SAS Sentiment Analysis (SAS Institute 2013) automatically rates and classifies opinions. It also performs data scraping from Web sites, social media and internal file systems, and then processes the scraped content in a unified format to evaluate relevance with regard to pre-defined topics. SAS Sentiment Analysis identifies trends and emotional changes. Experts can refine the sentiment models through an interactive workbench. The tool automatically assigns sentiment scores to the input documents as they are retrieved in real time.

SAS Sentiment Analysis combines statistical modeling and linguistics (rule-based natural language processing techniques) in order to output accurate sentiment analysis results. The tool monitors and evaluates sentiment changes over time; it extracts sentiments in real time as the scraped data is being retrieved and generates reports showing patterns and detailed reactions.

The software identifies where (i.e., on what channel) the topic is being discussed and quantifies perceptions in the market as it scrapes and analyzes both internal and external content about the organization (or the concept being analyzed) and its competitors, identifying positive, neutral, negative or ‘no sentiment’ texts in real time.

SAS Sentiment Analysis and SAS Social Media Analytics have a user-friendly interface for developing models; users can upload sentiment analysis models directly to the server in order to minimize manual model deployment. More advanced users can use the interactive workbench to refine their models. The software includes graphics to instantly illustrate the text classification (i.e., positive, negative, neutral or unclassified) and point-and-click exploration for drilling down into the classified text. The tool also provides some workbench functionality through APIs, allowing automatic/programmatic integration with other modules/projects. Figure 15 illustrates the SAS Social Media Analytics graphical reports, which provide user-friendly sentiment insights. The SAS software has crawling plugins for the most popular social media sites, including Facebook, Twitter, Bing, LinkedIn, Flickr and Google. It can also be customized to crawl any Web site using the mark-up matcher, which provides a point-and-click interface to indicate what areas need to be extracted from an HTML or XML document. SAS Social Media Analytics gathers online conversations from popular networking sites (e.g., Facebook and Twitter), blogs and review sites (e.g., TripAdvisor and Priceline), and scores the data for influence and sentiment. It provides visualization tools for real-time tracking and allows users to submit customized queries, returning a geographical visualization with brand-specific commentary from Twitter, as illustrated in Fig. 16.

Fig. 15 Graphical Reports with Sentiment Insights

Fig. 16 SAS Visualization of Real-Time Tracking via Twitter

8 Social media analytics platforms

Here, we examine comprehensive social media platforms that combine social media archives, data feeds, data mining and data analysis tools. Simply put, the platforms are different from tools and toolkits since platforms are more comprehensive and provide both tools and data.

They broadly subdivide into:

8.1 News platforms

The two most prominent business news feed providers are Thomson Reuters and Bloomberg.

Computers read news in real time and automatically provide key indicators and meaningful insights. The news items are automatically retrieved, analyzed and interpreted in a few milliseconds. Machine-readable news indicators can potentially improve quantitative strategies, risk management and decision making.

Examples of machine-readable news include Thomson Reuters Machine Readable News, Bloomberg’s Event-Driven Trading Feed and AlphaFlash (Deutsche Börse’s machine-readable news feed). Thomson Reuters Machine Readable News (Thomson Reuters 2012a, b, c) has Reuters News content dating back to 1987 and comprehensive news from over 50 third parties dating back to 2003, such as PR Newswire, Business Wire and the Regulatory News Service (LSE). The feed offers full text and comprehensive metadata via streaming XML.

Thomson Reuters News Analytics uses Natural Language Processing (NLP) techniques to score news items on tens of thousands of companies and nearly 40 commodities and energy topics. Items are measured across the following dimensions:

8.2 Social network media platforms

Attensity, Brandwatch, Salesforce Marketing Cloud (previously called Radian6) and Sysomos MAP (Media Analysis Platform) are examples of social media monitoring platforms, which measure demographics, influential topics and sentiments. They include text analytics and sentiment analysis on online consumer conversations and provide user-friendly interfaces for customizing the search query, dashboards, reports and file export features (e.g., to Excel or CSV format). Most of the platforms scrape a range of social network media using a distributed crawler that targets: micro-blogging (Twitter via full Twitter Firehose), blogs (Blogger, WordPress, etc.), social networks (Facebook and MySpace), forums, news sites, images sites (Flickr) and corporate sites. Some of the platforms provide multi-language support for widely used languages (e.g., English, French, German, Italian and Spanish).

Sentiment analysis platforms use two main methodologies. One involves a statistical or model-based approach wherein the system learns to assess sentiment by analyzing large quantities of pre-scored material. The other method utilizes a large dictionary of pre-scored phrases.

RapidMiner is a platform that combines data mining and data analysis and, depending on the requirements, can be open source. It uses the WEKA machine learning library and provides access to data sources such as Excel, Access, Oracle, IBM, MySQL, PostgreSQL and text files.

Mozenda provides a point-and-click user interface for extracting specific information from Web sites and allows automation and data export to CSV, TSV or XML files.

DataSift provides access to both real-time and historical social data from the leading social networks and millions of other sources, enabling clients to aggregate, filter, gain insights and discover trends from the billions of public social conversations. Once the data is aggregated and processed (DataSift can filter and add context, such as enrichments, including language processing, geodata and demographics, and categorization, including spam detection, intent identification and machine learning), customers can use pre-built integrations with popular BI tools, applications and developer tools to deliver the data into their businesses, or use the DataSift APIs to stream real-time data into their applications.

A growing number of social media analytics platforms are being founded. Other notable platforms that handle sentiment and semantic analysis of Web and Web 2.0-sourced material include Google Analytics, HP Autonomy IDOL (Intelligent Data Operating Layer), IBM SPSS Modeler, Adobe SocialAnalytics, GraphDive, Keen IO, Mass Relevance, Parse.ly, ViralHeat, Socialbakers, DachisGroup, evolve24, OpenAmplify and AdmantX.

Recently, more specialized social analytics platforms have emerged. One of them is iSpot.tv, which launched its own social media analytics platform that matches television ads with mentions on Twitter and Facebook. It provides real-time reports about when and where an ad appears, together with what people are saying about it on social networks (iSpot.tv monitors almost 80 different networks).

Thomson Reuters has recently announced that it is now incorporating Twitter sentiment analysis into the Thomson Reuters Eikon market analysis and trading platform, providing visualizations and charts based on the sentiment data. In the previous year, Bloomberg incorporated tweets related to specific companies into a wider data stream.

8.3 Case study: Thomson Reuters News Analytics

Thomson Reuters News Analytics (TRNA) provides a huge news archive with analytics to read and interpret news, offering meaningful insights. TRNA scores news items on over 25,000 equities and nearly 40 topics (commodities and energy). The platform scrapes and analyzes news data in real time and feeds the data into other programs/projects or quantitative strategies.

TRNA uses an NLP system from Lexalytics, one of the linguistic technology leaders, which can track news sentiment over time and score text across the dimensions mentioned in Sect. 8.1.

The platform’s text scoring and metadata has more than 80 fields (Thomson Reuters 2010), such as:

A snippet of the news sentiment analysis is illustrated in Fig. 17.

Fig. 17 Thomson Reuters News Discovery Application with Sentiment Analysis

In 2012, Thomson Reuters extended its machine-readable news offering to include sentiment analysis and scoring for social media. The extension is called Thomson Reuters News Analytics (TRNA) for Internet News and Social Media, and it aggregates content from over four million social media channels and 50,000 Internet news sites. The content is analyzed by TRNA in real time, generating a quantifiable output across dimensions such as sentiment, relevance, novelty, volume, category and source rank. This extension uses the same extensive metadata tagging (across more than 80 fields).

TRNA for Internet News and Social Media is a powerful platform for analyzing, tagging and filtering millions of public and premium sources of Internet content, turning big data into actionable ideas. The platform also provides a way to analyze the big data visually. It can be combined with Panopticon Data Visualization Software in order to reach meaningful conclusions more quickly through visually intuitive displays (Thomson Reuters 2012a, b, c), as illustrated in Fig. 18.

Fig. 18 Combining TRNA for Internet News and Social Media with Panopticon Data Visualization Software

Thomson Reuters also expanded the News Analytics service with MarketPsych Indices (Thomson Reuters 2012a, b, c), which allow for real-time psychological analysis of news and social media. The Thomson Reuters MarketPsych Indices (TRMI) service gives a quantitative view of market psychology as it attempts to identify human emotion and sentiment. It is a complement to TRNA and uses NLP processing created by MarketPsych (http://www.marketpsych.com), a leading company in behavioral psychology applied to financial markets.

Behavioral economists have extensively investigated whether emotions affect markets in predictable ways, and TRMI attempts to measure the state of ‘emotions’ in real time in order to identify patterns as they emerge. TRMI has two key indicator types:

The Thomson Reuters platform thus allows news and social media to be exploited to spot opportunities and capitalize on market inefficiencies (Thomson Reuters 2013).

9 Experimental computational environment for social media

As discussed in Sect. 2 (methodology and critique), researchers arguably require a comprehensive experimental computational environment/facility for social media research with the following attributes:

9.1 Data

9.2 Analytics

Realistically, this is best facilitated at a national or international level.

To provide some insight into the structure of an experimental computational environment for social media (analytics), below we present the system architecture of the UCL SocialSTORM analytics platform developed by Dr. Michal Galas and his colleagues (Galas et al. 2012) at University College London (UCL).

University College London’s social media streaming, storage and analytics platform (SocialSTORM) is a cloud-based ‘central hub’ platform, which facilitates the acquisition of text-based data from online sources such as Twitter, Facebook, RSS media and news. The system includes facilities to upload and run Java-coded simulation models to analyze the aggregated data, which may comprise scraped social data and/or users’ own proprietary data.

9.3 System architecture

Figure 19 shows the architecture of the SocialSTORM platform, and the following section outlines the key components of the overall system. The basic idea is that each external feed has a dedicated connectivity engine (API) and this streams data to the message bus, which handles internal communication, analytics and storage.

Fig. 19 SocialSTORM Platform Architecture

9.4 Platform components

The platform comprises the following modules, which are illustrated in Fig. 20:

Fig. 20 Environment System Architecture and Modules

10 Conclusions

As discussed, the easy availability of APIs provided by Twitter, Facebook and news services has led to an ‘explosion’ of data services, software tools for scraping and sentiment analysis, and social media analytics platforms. This paper surveys some of the social media software tools and, for completeness, introduces social media scraping, data cleaning and sentiment analysis.

Perhaps, the biggest concern is that companies are increasingly restricting access to their data to monetize their content. It is important that researchers have access to computational environments and especially ‘big’ social media data for experimentation. Otherwise, computational social science could become the exclusive domain of major companies, government agencies and a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Arguably what is required are public-domain computational environments and data facilities for quantitative social science, which can be accessed by researchers via a cloud-based facility.