Using the Rdatasets API¶
This tutorial explains the usage of the Rdatasets API in Data Retriever. It includes both the CLI (Command Line Interface) commands as well as the Python interface for the same.
Command Line Interface¶
Listing the Rdatasets¶
The retriever ls rdataset command displays the Rdatasets.
$ retriever ls rdataset -h (gives listing options)
usage: retriever ls rdataset [-h] [-p P [P ...]] all
positional arguments:
all display all the packages present in rdatasets
optional arguments:
-h, --help show this help message and exit
-p P [P ...] display a list of all rdatasets present in the package(s)
Examples
This example will display all the Rdatasets present with their package name, dataset name and script name
$ retriever ls rdataset
List of all available Rdatasets
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: vcd Dataset: vonbort Script Name: rdataset-vcd-vonbort
Package: vcd Dataset: weldondice Script Name: rdataset-vcd-weldondice
Package: vcd Dataset: womenqueue Script Name: rdataset-vcd-womenqueue
This example will display all the Rdatasets present in the packages vcd and aer
$ retriever ls rdataset -p vcd aer
List of all available Rdatasets in packages: ['vcd', 'aer']
Package: vcd Dataset: arthritis Script Name: rdataset-vcd-arthritis
Package: vcd Dataset: baseball Script Name: rdataset-vcd-baseball
Package: vcd Dataset: brokenmarriage Script Name: rdataset-vcd-brokenmarriage
...
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
This example will display all the Rdatasets present in the package vcd
$ retriever ls rdataset -p vcd
List of all available Rdatasets in packages: ['vcd', 'aer']
Package: vcd Dataset: arthritis Script Name: rdataset-vcd-arthritis
Package: vcd Dataset: baseball Script Name: rdataset-vcd-baseball
Package: vcd Dataset: brokenmarriage Script Name: rdataset-vcd-brokenmarriage
...
This example will display all the packages present in rdatasets
$ retriever ls rdataset all
List of all the packages present in Rdatasets
aer cluster dragracer fpp2 gt islr mass multgee plyr robustbase stevedata
asaur count drc gap histdata kmsurv mediation nycflights13 pscl rpart survival
boot daag ecdat geepack hlmdiag lattice mi openintro psych sandwich texmex
cardata datasets evir ggplot2 hsaur lme4 mosaicdata palmerpenguins quantreg sem tidyr
causaldata dplyr forecast ggplot2movies hwde lmec mstate plm reshape2 stat2data vcd
Downloading the Rdatasets¶
The retriever download rdataset-<package>-<dataset> command downloads the Rdataset dataset which exists in the package package.
You can also copy the script name from the output of retriever ls rdataset.
Example
This example downloads the rdataset-vcd-bundesliga dataset.
$ retriever download rdataset-vcd-bundesliga
=> Installing rdataset-vcd-bundesliga
Downloading Bundesliga.csv: 60.0B [00:00, 117B/s]
Done!
The downloaded raw data files are stored in the raw_data directory in the ~/.retriever directory.
Installing the Rdatasets¶
The retriever install <engine> rdataset-<package>-<dataset> command downloads the raw data, creates the script for it and then installs
the Rdataset dataset present in the package package into the provided engine.
Example
This example install the rdataset-aer-usmoney dataset into the postgres engine.
$ retriever install postgres rdataset-aer-usmoney
=> Installing rdataset-aer-usmoney
Downloading USMoney.csv: 1.00B [00:00, 2.52B/s]
Processing... USMoney.csv
Successfully wrote scripts to /home/user/.retriever/rdataset-scripts/usmoney.csv.json
Updating script name to rdataset-aer-usmoney.json
Updating the contents of script rdataset-aer-usmoney
Successfully updated rdataset_aer_usmoney.json
Updated the script rdataset-aer-usmoney
Creating database rdataset_aer_usmoney...
Installing rdataset_aer_usmoney.usmoney
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:00<00:00, 2225.09rows/s]
Done!
The script created for the Rdataset is stored in the rdataset-scripts directory in the ~/.retriever directory.
Python Interface in Data Retriever¶
Updating Rdatasets Catalog¶
The function update_rdataset_catalog creates/updates the datasets_url.json in the ~/.retriever/rdataset-scripts directory,
which contains the information about all the Rdatasets.
>>> import retriever as rt
>>> rt.update_rdataset_catalog()
Note
The update_rdataset_catalog function has a default argument test which is set to False.
If test is set to True, then the contents of the datasets_url.json file would be returned as
a dict.
Listing Rdatasets¶
The function display_all_rdataset_names prints the package, dataset name and the script name for the Rdatasets present in the package(s) requested.
If no package is specified, it prints all the rdatasets, and if all is passed as the function argument then all the package names are displayed.
Note
The function argument package_name takes a list as an input when you want to display rdatasets based on the packages.
If you want to display all packages names, set package_name argument to all (refer to the example below).
>>> import retriever as rt
>>>
>>> # Display all Rdatasets
>>> rt.display_all_rdataset_names()
List of all available Rdatasets
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: vcd Dataset: vonbort Script Name: rdataset-vcd-vonbort
Package: vcd Dataset: weldondice Script Name: rdataset-vcd-weldondice
Package: vcd Dataset: womenqueue Script Name: rdataset-vcd-womenqueue
>>>
>>> # Display all the Rdatasets present in packages 'aer' and 'drc'
>>> rt.display_all_rdataset_names(['aer', 'drc'])
List of all available Rdatasets in packages: ['aer', 'drc']
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: drc Dataset: spinach Script Name: rdataset-drc-spinach
Package: drc Dataset: terbuthylazin Script Name: rdataset-drc-terbuthylazin
Package: drc Dataset: vinclozolin Script Name: rdataset-drc-vinclozolin
>>>
>>> # Display all the packages in Rdatasets
>>> rt.display_all_rdataset_names('all')
List of all the packages present in Rdatasets
aer cluster dragracer fpp2 gt islr mass multgee plyr robustbase stevedata
asaur count drc gap histdata kmsurv mediation nycflights13 pscl rpart survival
boot daag ecdat geepack hlmdiag lattice mi openintro psych sandwich texmex
cardata datasets evir ggplot2 hsaur lme4 mosaicdata palmerpenguins quantreg sem tidyr
causaldata dplyr forecast ggplot2movies hwde lmec mstate plm reshape2 stat2data vcd
Downloading a Rdataset¶
>>> import retriever as rt
>>> rt.download('rdataset-drc-earthworms')
Installing a Rdataset¶
>>> import retriever as rt
>>> rt.install_postgres('rdataset-mass-galaxies')
Note
For downloading or installing the Rdatasets, the script name should follow the syntax given below.
The script name should be rdataset-<package name>-<dataset name>. The package name and dataset name
should be valid.
- Example:
Correct:
rdataset-drc-earthwormsIncorrect:
rdataset-drcearthworms,rdatasetdrcearthworms