Introduction: Webscraping
Institutional research (IR) professionals often times rely on external sources of data in addition to internal data. For example, IR professionals in the California Community College (CCC) system may be interested in data provided by the state’s Chancellor’s Office (CO) such as those on the Student Success Metrics (SSM), Data Mart, Scorecard, or Launchboard. In addition, they may be interested in data provided by the California State University (CSU; e.g., the CSU admissions dashboard) or the University of California (UC; e.g., the UC admissions dashboard).
These external data sources provide rich and useful information via their online interfaces, but sometimes there may be a need to download the raw data files for the information to be presented in a different manner, combined with other sources, or analyzed further. For example, achievement data could be downloaded for all colleges in order for an institution to benchmark itself against other institutions as this view is not offered online.
In this vignette, we illustrate how to leverage R to webscrape (automatically download) online data files. Think of webscraping as launching a web browser that one could navigate and command using a set of instructions (code) instead of manually moving the mouse/trackpad and clicking on links/buttons. Although initially effortful, scripting the download process allows one to re-run the download process at future dates with minimal effort. The payoff is especially worthwhile if the download process involves a lot of files at each run (e.g., 116 files, one for each CCC, or lots of iterations based on various dropdown menus) or if the process is recurring (e.g., weekly).
In R, the easiest way to webscrape data is via the rvest package, which allows one to webscrape html pages. However, this vignette focuses on the more complicated RSelenium package because it supports pages that utilize javascript (rvest
does not support this), which is quite common on many websites. Learning how to webscrape using RSelenium
allows one to download data from nearly all sites.
Installation and Setup
Virtualization (VT-x) in System BIOS
Before we start to webscrape in R using the RSelenium package, we have to enable and install a few hardware and software dependencies.
First, the user must enable virtualization on their computer by turning on VT-x in their BIOS. For those not familiar with how to do this, do a web search for “Computer Make/Model VT-x”. This usually requires rebooting the computer, entering the BIOS by hitting one of the function (F#) keys, and navigating to a page that allows the user to enable VT-x.
What this option does is it allows Windows to run a virtual server on the system, which is needed for us to launch a virtual server with a web browser (web driver) that R could control using code.
Docker and Selenium Server
Download and install Docker Desktop for Windows, which requires admin rights. For those not familiar, Docker allows users to install different applications as containers in a virtual environment. For our use case, this application will be the Selenium Web Driver in a container, giving us a web browser that we could control. The installation process should ask the user to restart their computer to finish the installation process.
On Windows 10, click the start menu and type “Edit local users and groups”. Go to the “Groups” folder, and navigate to “docker-users”. Right-click this icon and click “Add to group…”. Then add your username to this group. This step also requires admin rights.
On Windows 10, launch “Windows Powershell” from the Start Menu. Type the following in the terminal to install Firefox in a Selenium server container inside Docker.
The previous command downloads the specified container image. In this vignette, we pick Firefox over Chrome as it allows for more control in the browser (navigating and clicking). We stick to a single version, Firefox 2.53.0, in order to ensure consistency (codes and commands may be different depending on the version of the browser).
SelectorGadget Extension for Chrome
In order to identify different elements on a webpage for clicking or selecting dropdown menus, install the SelectorGadget extension in Chrome (the browser on your computer that you use regularly, not the one used for webscraping).
RSelenium in R
In R, download and install R with the following command:
Rtools to access the zip
command in Windows
We will be making use of the makeFirefoxProfile
function from RSelenium
, which will require the zip
command to package the results. Most Windows systems do not come with this command. The easiest way to get access to this is to install Rtools on the Windows system. For Rtools 4.0 or earlier, remember to add the the make utilities into the PATH
environment variable.
Summary of Webscraping Workflow
The workflow for scraping data are as follows:
- Launch the Selenium server in Docker via Windows Powershell.
- Start R and load the
RSelenium
package. - In R, connect to the Selenium server.
- In R, visit the site of interest, navigate the page, and download the necessary files.
- Exit R.
- Stop Selenium server in Docker via Windows Powershell.
This workflow will be leveraged in a few examples in this vignette.
Example: Webscrape Scorecard Data for All Colleges
Exploration and Strategy
In this example, we illustrate how to download each college’s data file on the CCC Scorecard. That is, the goal is to download the data provided for each college present in the college dropdown, as shown here:
Before going into technical details, let’s first explore the site and formulate a strategy. On the Scorecard site, select one college. We’ll select Irvine Valley College for illustration. The browser should take the visitor to the following page:
There are two things worth noting:
- The URL listed in the browser is now
https://scorecard.cccco.edu/scorecardrates.aspx?CollegeID=892
. Irvine Valley College (IVC) hasCollegeID=892
in this URL. For those not familiar, this is the district/college code assigned to IVC by the state, as used in all MIS data submissions. A list of the district/college codes can be found here. - To download the data file, scroll to the bottom, and notice the Five Year download button, as seen below. The button has the URL
https://scorecard.cccco.edu/fiveyear.aspx?CollegeID=892
, and clicking it downloads a file named892_FiveYear.xlsx
.
Now, recall the goal is to download the aforementioned excel file for all colleges. Based on the previous exploration, one could download the files by repeatedly visiting the URL https://scorecard.cccco.edu/fiveyear.aspx?CollegeID=YYY
, where we change YYY
to be various college codes. A general webscraping strategy is as follows:
- Obtain the list of college ID’s from here, and make it accessible in R.
- In R, construct the various URL’s of interest based on the previous list.
- Visit the URL’s generated in the previous step to download the many excel files.
In the next section, we’ll put this strategy into action with R and RSelenium
.
Execution
Launch Windows Powershell from the Start Menu. Then run the following command in the terminal.
docker run -d -v ~/Downloads/Selenium-Downloads/://home/seluser/Downloads -p 4445:4444 -p 5901:5900 selenium/standalone-firefox:2.53.0
The previous command has a few parameters worth noting:
docker run
is the command to run an application.selenium/standalone-firefox:2.53.0
is the application to run.- Forward port 4444 and 5900 on the virtual server to ports 4445 and 5901, respectively, on the host computer (the user’s computer). In R, we will connect to
localhost
(host computer) on port 4445, which will connect to the virtual server on port 4444. - Map the virtual server’s download folder,
/home/seluser/Downloads
, to the host computer’s download folder,~/Downloads/Selenium-Downloads
, where~
refers toC:/Users/USERNAME
andUSERNAME
is the username of the current user on the Windows system. This mapping allows files to be downloaded to Selenium’s default folder,/home/seluser/Downloads
, which will be accessible for the user atC:/Users/USERNAME/Downloads/Selenium-Downloads
.
The Windows Powershell terminal should look similar to the following after running the command:
Next, let’s launch R. In R, we create a Firefox profile, and connect to the Selenium Web Driver / Server using the following code:
library(RSelenium)
# Firefox profile
fprof <- makeFirefoxProfile(list(browser.download.dir='/home/seluser/Downloads'
, browser.download.folderList=2L
, browser.download.manager.showWhenStarting=FALSE
, browser.helperApps.neverAsk.saveToDisk='text/csv;application/vnd.ms-excel;application/zip' # This says if a CSV/XLS/XLSX/ZIP is encountered, then download file automatically into default download folder
# See MIME Types here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
))
# Remote driver
remDr <- remoteDriver(browserName='firefox'
, port=4445L
, extraCapabilities=fprof
, remoteServerAddr='localhost'
)
## ## If the user wants to use chrome, they will also need to download the Chrome version of Selenium
## remDr <- remoteDriver(browserName='chrome'
## , port=4445L
## , remoteServerAddr='localhost'
## , extraCapabilities=list(chromeOptions=list(prefs=list('profile.default_content_settings.popups'=0L, 'download.prompt_for_download'=FALSE, 'download.default_directory'='/home/seluser/Downloads'))) # https://stackoverflow.com/questions/35504731/specify-download-folder-in-rselenium
## )
# Connect to selenium server
remDr$open()
## [1] "Connecting to remote server"
## $applicationCacheEnabled
## [1] TRUE
##
## $rotatable
## [1] FALSE
##
## $handlesAlerts
## [1] TRUE
##
## $databaseEnabled
## [1] TRUE
##
## $version
## [1] "45.0.2"
##
## $platform
## [1] "LINUX"
##
## $nativeEvents
## [1] FALSE
##
## $acceptSslCerts
## [1] TRUE
##
## $webdriver.remote.sessionid
## [1] "6a53fec1-485f-4834-98b1-b5e1ac651615"
##
## $webStorageEnabled
## [1] TRUE
##
## $locationContextEnabled
## [1] TRUE
##
## $browserName
## [1] "firefox"
##
## $takesScreenshot
## [1] TRUE
##
## $javascriptEnabled
## [1] TRUE
##
## $cssSelectorsEnabled
## [1] TRUE
##
## $id
## [1] "6a53fec1-485f-4834-98b1-b5e1ac651615"
Before constructing the URL’s of the excel files and visiting those links, let’s first explore some basics of RSelenium
.
# URL
scorecard_url <- 'https://scorecard.cccco.edu/scorecard.aspx'
# Visit URL
remDr$navigate(scorecard_url)
# Get title of page
remDr$getTitle()
## [[1]]
## [1] "2019 Student Success Scorecard"
# Get a screenshot of the current page, to confirm we are on the right page
# remDr$screenshot(display=TRUE) # Running this should open up an image of the page in a browser
remDr$screenshot(file='RSelenium_Screenshot_Scorecard.png')
Running the remDr$screenshot(display=TRUE)
line should open up the following in an image viewer: