CRAN Task View: Web Technologies and Services

Maintainer: Mauricio Vargas Sepulveda, Will Beasley
Contact: m.sepulveda at mail.utoronto.ca
Version: 2024-10-27
URL: https://CRAN.R-project.org/view=WebTechnologies
Source: https://github.com/cran-task-views/WebTechnologies/
Contributions: Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.
Citation: Mauricio Vargas Sepulveda, Will Beasley (2024). CRAN Task View: Web Technologies and Services. Version 2024-10-27. URL https://CRAN.R-project.org/view=WebTechnologies.
Installation: The packages from this task view can be installed automatically using the ctv package. For example, ctv::install.views("WebTechnologies", coreOnly = TRUE) installs all the core packages or ctv::update.views("WebTechnologies") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.

0. Introduction

Tools for Working with the Web

This task view recommends packages and strategies for efficiently interacting with resources over the internet with R. This task view focuses on:

  1. Direct data download and ingestion,
  2. Online services,
  3. Frameworks for building web-based R applications,
  4. Low-level operations, and
  5. Resources

If you have suggestions for improving or growing this task view, please submit an issue or a pull request in the GitHub repository linked above. If you can’t contribute on GitHub, please e-mail the task view maintainer. If you have an issue with a package discussed below, please contact the package’s maintainer.

Thanks to all contributors to this task view, especially to Scott Chamberlain, Thomas Leeper, Patrick Mair, Karthik Ram, and Christopher Gandrud who maintained this task view up to 2021.

Core Tools For HTTP Requests

The bulk of R’s capabilities for making web requests are supplied by CRAN packages that are layered on top of libcurl. A handful of packages provide the foundation for most modern approaches.

  1. httr2 and its predecessor httr are user-facing clients for HTTP requests. They leverage the curl package for most operations. If you are developing a package that calls a web service, we recommend reading their vignettes.
  2. crul is another package that leverages curl. It is an R6-based client that supports asynchronous HTTP requests, a pagination helper, HTTP mocking via webmockr, and request caching for unit tests via vcr. crul is intended to be called by other packages rather than directly by R users. Unlike httr2, crul’s current version does not support OAuth. Additional options may be passed to curl when instantiating crul’s R6 classes.
  3. curl is the lower-level package that provides a close interface between R and the libcurl C library. It is not intended to be called directly by typical R users. curl may be useful for operations on web-based XML or with FTP (as crul and httr2 are focused primarily on HTTP).
  4. utils and base are the base R packages that provide download.file(), url(), and related functions. These functions also use libcurl.
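As a rough illustration of the user-facing layer, the following sketch builds a request with httr2 without sending it. The base URL, path, query parameters, and user agent string are hypothetical placeholders, and `fetch_json()` is an illustrative helper rather than part of httr2.

```r
library(httr2)

# Hypothetical base URL and path; substitute the service you are calling.
req <- request("https://api.example.com") |>
  req_url_path_append("v1", "items") |>
  req_url_query(page = 1, per_page = 50) |>
  req_user_agent("mypkg (https://example.com/mypkg)") |>
  req_retry(max_tries = 3)

# Nothing has been sent yet; the request is just an R object.
print(req$url)

# Illustrative helper: send the request and parse a JSON response body.
fetch_json <- function(req) {
  resp <- req_perform(req)
  resp_body_json(resp)
}
```

Calling `fetch_json(req)` would perform the request over the network, so it only makes sense against a real endpoint.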

Before you Start Using Web Scraping Tools

You may have code that performs web scraping, and it may be very efficient in terms of time or resource usage, but first we need to talk about whether it is legal and ethical for you to run it.

You can use the ‘polite’ package, which builds upon the principles of seeking permission, taking slowly, and never asking twice. The package builds on awesome toolkits for defining and managing HTTP sessions (‘httr’ and ‘rvest’), declaring the user agent string and investigating site policies (‘robots.txt’), and utilizing rate-limiting and response caching (‘ratelimitr’ and ‘memoise’).

The problem is not technical, but ethical and legal. You can technically log into an art auction site and scrape the prices of all the paintings, but if you need an account and must use ‘RSelenium’ to extract the information by automating clicks in the browser, you are subject to the site’s Terms of Service (ToS).

Another problem is that some websites require specific connections. You can connect to a site from a university or government building and access content for free, but if you connect from home, you may find that you need a paid subscription to access the same content. If you scrape a site from a university, you might be breaking some laws if you are not careful about the goal and scope of the scraping.
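The ‘polite’ workflow mentioned above can be sketched as follows. This is a minimal sketch, not a complete scraper: `scrape_politely()` and its argument names are hypothetical, and the host, path, and user agent you pass in are your own. It relies only on the package's `bow()`, `nod()`, and `scrape()` functions.

```r
# Hypothetical wrapper around the polite workflow.
scrape_politely <- function(host, path, agent) {
  # Introduce yourself to the host and read its robots.txt;
  # 'delay' is the minimum number of seconds between requests.
  session <- polite::bow(host, user_agent = agent, delay = 5)

  # Move to a specific path within the same session, so permissions
  # for that path are checked against the site's policy.
  page <- polite::nod(session, path = path)

  # Fetch and parse the page, honoring the delay set above.
  polite::scrape(page)
}
```

A call such as `scrape_politely("https://www.example.com", "some/page", "your name (your@email)")` would contact the site, so run it only against hosts whose terms permit scraping.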

1. Direct data download and ingestion

In recent years, many functions have been updated to accommodate web pages that are protected with TLS/SSL. Consequently, you can usually download a file if its URL starts with “http” or “https”.

If the data file is not accessible via a simple url, you probably want to skip to the Online services section. It describes how to work with specific web services such as AWS, Google Documents, Twitter, REDCap, PubMed, and Wikipedia.

If the information is served by a database engine, please review the cloud services in the Online services section below, as well as the Databases with R CRAN Task View.

Ingest a remote file directly

Many base and CRAN packages provide functions that accept a url and return a data.frame or list.
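For instance, base R's `read.csv()` accepts a URL in place of a file path. The sketch below uses a temporary local file exposed through a `file://` URL as a stand-in for a remote server, so it runs without network access; the file contents are made up, and the path handling assumes a Unix-like system. An “https” URL would be passed the same way.

```r
# Stand-in for a remote CSV: write a small file and address it by URL.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(pkg = c("httr2", "crul"), core = c(TRUE, TRUE)),
  csv_path,
  row.names = FALSE
)
csv_url <- paste0("file://", csv_path)

# One step: the URL goes straight to the reader, which returns a data.frame.
pkgs <- read.csv(csv_url)
str(pkgs)
```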

Download a remote file, then ingest it

If you need to process a different type of file, you can accomplish this in two steps. First download the file from a server to your local computer; second pass the path of the new local file to a function in a package like haven or foreign.
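The two steps can be sketched with `download.file()` and the foreign package. As an assumption for the sake of a self-contained example, a small Stata file served through a `file://` URL stands in for the remote server (Unix-like paths assumed), and the data are made up; with a real server only `remote_url` would change.

```r
library(foreign)  # ships with R as a recommended package

# Stand-in for a remote server: a small Stata file behind a file:// URL.
source_path <- tempfile(fileext = ".dta")
write.dta(data.frame(id = 1:3, score = c(2.5, 3.5, 4.5)), source_path)
remote_url <- paste0("file://", source_path)

# Step 1: download to a local file; mode = "wb" keeps binary files intact.
local_path <- tempfile(fileext = ".dta")
download.file(remote_url, destfile = local_path, mode = "wb", quiet = TRUE)

# Step 2: pass the local path to a format-specific reader.
scores <- read.dta(local_path)
print(scores)
```

A reader from haven, such as `haven::read_dta()`, could be swapped in at step 2 in the same way.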

Many base and CRAN packages provide functions that download files:

Parsing Structured Web Data

The vast majority of web-based data is structured as plain text, HTML, XML, or JSON. Web service APIs increasingly rely on JSON, but XML is still prevalent in many applications. There are several packages for working specifically with these formats. Their functions can be used to interact directly with insecure web pages or to parse locally stored or in-memory web files. Colloquially, these activities are called web scraping.
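As a small illustration of parsing in-memory content, the sketch below uses jsonlite for JSON and xml2 for XML. The sample strings are made up for the example; in practice they would be the body of an HTTP response or the contents of a downloaded file.

```r
library(jsonlite)
library(xml2)

# JSON: objects become named lists, arrays of scalars become vectors.
json_txt <- '{"name": "example", "tags": ["alpha", "beta"]}'
parsed <- fromJSON(json_txt)
parsed$tags

# XML: select nodes with XPath, then extract an attribute from each node.
xml_txt <- "<packages><package name='crul'/><package name='curl'/></packages>"
doc <- read_xml(xml_txt)
pkg_names <- xml_attr(xml_find_all(doc, "//package"), "name")
pkg_names
```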

2. Online services

Cloud Computing and Storage

Software Development

Document and Images

Data Processing and Visualization

Machine Learning and Translation

This list describes online services. For a more complete treatment of the topic, please see the MachineLearning CRAN Task View.

Spatial Analysis

This list describes online services. For a more complete treatment of the topic, please see the Analysis of Spatial Data CRAN Task View.

The following packages provide an interface to their associated services, unless noted otherwise.

Survey, Questionnaire, and Data Capture Tools

Web Analytics

The following packages interface with online services that facilitate web analytics.

The following packages interface with tools that facilitate web analytics.

Publications

Generating Synthetic Data

Sports Analytics

Many CRAN packages interact with services facilitating sports analysis. For a more complete treatment of the topic, please see the SportsAnalytics CRAN Task View.

Reproducible Research

Using packages in this Web Technologies task view can help you acquire data programmatically, which can facilitate Reproducible Research. Please see the ReproducibleResearch CRAN Task View for more tools and information:

“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, understood, and verified.”

Other Web Services

3. Frameworks for building web-based R applications

Other Useful Packages and Functions

4. Low-level operations

Tools for Working with URLs

Additional tools for internet communication

For specialized situations, the following resources may be useful:

Handling HTTP Errors/Codes

Security

5. Resources

CRAN packages

Core: crul, curl, httr, httr2.
Regular: ajv, analogsea, arrow, aRxiv, aws.signature, AzureAppInsights, AzureAuth, AzureContainers, AzureCosmosR, AzureGraph, AzureKusto, AzureQstor, AzureRMR, AzureStor, AzureTableStor, AzureVision, AzureVM, beakr, bigrquery, boilerpipeR, boxr, brandwatchR, clarifai, crunch, crunchy, data.table, dataone, datarobot, dataverse, duckduckr, europepmc, FastRWeb, fauxpas, fbRads, fiery, gargle, geonapi, geosapi, gh, gistr, git2r, gitlabr, gmailr, googleAnalyticsR, googleAuthR, googleCloudStorageR, googleComputeEngineR, googledrive, googleLanguageR, googlesheets4, googleVis, graphTweets, gsheet, gtrendsR, hackeRnews, htm2txt, htmltools, httpcache, httpcode, httping, httpRequest, httptest, httpuv, imguR, instaR, ipaddress, jqr, js, jsonlite, jsonvalidate, jstor, languagelayeR, longurl, mailR, mapsapi, mathpix, Microsoft365R, mime, mscstexta4r, mscsweblm4r, nanonext, ndjson, nominatimlite, notifyme, oai, OAIHarvester, opencage, opencpu, OpenML, osrm, ows4R, paws, pdftables, pins, plotly, plumber, PostcodesioR, postlightmercury, pubmed.mineR, pushoverr, qualtRics, radiant, RAdwords, rapiclient, Rcrawler, rcrossref, RCurl, rdatacite, readr, redcapAPI, REDCapCAST, REDCapDM, REDCapR, REDCapTidieR, repmis, reqres, request, rerddap, restfulr, ReviewR, Rexperigen, Rfacebook, rfigshare, RgoogleMaps, rhub, rio, rjson, RJSONIO, Rlinkedin, rLTP, roadoi, ROAuth, robotoolbox, robotstxt, Rook, rorcid, rosetteApi, routr, rpinterest, RPushbullet, rrefine, RSclient, rscopus, rsdmx, RSelenium, Rserve, RSmartlyIO, rtoot, rtweet, rvest, RYandexTranslate, scholar, selectr, seleniumPipes, sendmailR, servr, shiny, slackr, spiderbar, streamR, swagger, tidyREDCap, tidyRSS, uaparserjs, urlshorteneR, urltools, V8, vcr, vkR, W3CMarkupValidator, WebAnalytics, webmockr, webreadr, webshot, webutils, whisker, WikidataQueryServiceR, WikidataR, WikipediR, WufooR, XML, xml2, XML2R, xslt, yaml, yhatr, zen4R.
Archived: rgeolocate.

Other resources