Download Voteview data in parallel

library(filibustr)

The Voteview functions have the power to download lots of data on many years of Congress. One downside of this power is that downloading many large datasets from the internet can be slow.

One way to speed up your data downloads is to download data in parallel. When you call a Voteview function to download data from multiple Congresses (i.e., when length(congress) > 1), filibustr will download data in parallel if you have set up that capability.

Everything described below is a purely optional way to accelerate your data imports. If you don’t set up parallel computing processes, the Voteview functions will simply download data sequentially.

Setting up for parallel downloads

Downloading data in parallel requires a little setup first.

Make sure the `{mirai}` and `{carrier}` packages are installed

Under the hood, the Voteview functions use purrr::in_parallel() for parallel downloads. purrr::in_parallel() depends on two packages (mirai and carrier) that are not otherwise used in filibustr, so you may not have them installed.

To check if you have installed the required versions of these packages, run this code. It will prompt you to install any packages you’re missing.

rlang::check_installed(c("carrier", "mirai"), version = c("0.3.0", "2.5.1"))

Set parallel processes

To download Voteview data in parallel, use mirai::daemons() to create parallel processes (mirai calls these “daemons”).

# detect the number of cores available on your machine
parallel::detectCores()

# launch a specific number of processes, or
mirai::daemons(4)
# launch a process on all but one available cores
mirai::daemons(parallel::detectCores() - 1)

How many processes should I create?

In general, if you split the work up across more processes, the download will finish faster. Theoretically, N processes can finish the download up to N times faster.

At the same time, there can be diminishing returns to creating a large number of processes.

First, there is some overhead involved with creating and communicating with parallel processes.
Second, consider the number of pieces of work. Multi-Congress data downloads get one file per Congress. That is the unit of work that the parallel processes can work on.
- If you are downloading data on 5 Congresses but create 8 parallel processes, then 3 of those processes have nothing to do.
- Similarly, if you’re downloading data on 12 Congresses, there’s not much difference between 7, 8, and 9 processes: in each case, at least one process has to download two files, one after the other.
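
The arithmetic behind these two examples can be sketched in a few lines of base R. This is an idealized model: it assumes every file takes about the same time to download and ignores process overhead. `download_rounds()` is a hypothetical helper written for illustration, not a filibustr function.

```r
# Hypothetical helper (not part of filibustr): with one file per Congress,
# a download needs roughly ceiling(files / processes) sequential "rounds".
download_rounds <- function(n_congresses, n_processes) {
  stopifnot(n_congresses >= 1, n_processes >= 1)
  ceiling(n_congresses / n_processes)
}

# 5 Congresses on 8 processes: a single round, with 8 - 5 = 3 idle processes
download_rounds(5, 8)

# 12 Congresses: 7, 8, and 9 processes all need two rounds
download_rounds(12, 7:9)

# going from 6 to 12 processes is what finally drops it to one round
download_rounds(12, c(6, 12))
```

Under this model, adding processes only helps when it reduces the number of rounds.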

Also, there is less benefit to creating more processes than the number of cores available on your machine (which you can check with parallel::detectCores()). A good rule of thumb (per the purrr documentation) is to use at most one fewer than the number of cores on your machine, leaving one core free for the main R process.
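
One way to combine these guidelines is a small helper that caps the number of processes at both the number of Congresses and the number of cores minus one. This is a minimal sketch, not something filibustr provides: `suggest_daemons()` is a hypothetical name, and the fallback to a single process when the core count is unknown is an assumption made here.

```r
# Hypothetical helper (not part of filibustr): suggest how many daemons to launch
suggest_daemons <- function(n_congresses, n_cores = parallel::detectCores()) {
  # detectCores() can return NA when the core count is unknown;
  # assume a single core in that case
  if (is.na(n_cores)) n_cores <- 1L
  # no more processes than files, and leave one core for the main R process
  max(1L, min(n_congresses, n_cores - 1L))
}

# e.g., for a download covering the 95th through 118th Congresses
n_daemons <- suggest_daemons(length(95:118))

# then launch that many processes (left commented out in this sketch):
# mirai::daemons(n_daemons)
```

The result is always at least 1, so the helper stays usable on a single-core machine.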

Downloading data in parallel

Once you’ve set up your parallel processes, just call the Voteview functions like normal, and they will automatically download data in parallel!

Reminder: parallel processing only affects downloads where length(congress) > 1.

# download Voteview data from multiple Congresses
get_voteview_members(congress = 95:118)

get_voteview_rollcall_votes(congress = 95:118)

When you’re done with all your parallel processing, you can close the daemon connections with mirai::daemons(0) if you’d like. Otherwise, the connections will close automatically when your R session ends.

mirai::daemons(0)

More details