Using the Rdatasets API
This tutorial explains how to use the Rdatasets API in Data Retriever. It covers both the command line interface (CLI) commands and the equivalent Python interface.
Command Line Interface
Listing the Rdatasets
The retriever ls rdataset command displays the Rdatasets.
$ retriever ls rdataset -h
(gives listing options)
usage: retriever ls rdataset [-h] [-p P [P ...]] all
positional arguments:
  all           display all the packages present in rdatasets
optional arguments:
  -h, --help    show this help message and exit
  -p P [P ...]  display a list of all rdatasets present in the package(s)
Examples
This example will display all the Rdatasets present with their package name, dataset name and script name
$ retriever ls rdataset
List of all available Rdatasets
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: vcd Dataset: vonbort Script Name: rdataset-vcd-vonbort
Package: vcd Dataset: weldondice Script Name: rdataset-vcd-weldondice
Package: vcd Dataset: womenqueue Script Name: rdataset-vcd-womenqueue
This example will display all the Rdatasets present in the packages vcd and aer
$ retriever ls rdataset -p vcd aer
List of all available Rdatasets in packages: ['vcd', 'aer']
Package: vcd Dataset: arthritis Script Name: rdataset-vcd-arthritis
Package: vcd Dataset: baseball Script Name: rdataset-vcd-baseball
Package: vcd Dataset: brokenmarriage Script Name: rdataset-vcd-brokenmarriage
...
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
This example will display all the Rdatasets present in the package vcd
$ retriever ls rdataset -p vcd
List of all available Rdatasets in packages: ['vcd']
Package: vcd Dataset: arthritis Script Name: rdataset-vcd-arthritis
Package: vcd Dataset: baseball Script Name: rdataset-vcd-baseball
Package: vcd Dataset: brokenmarriage Script Name: rdataset-vcd-brokenmarriage
...
This example will display all the packages present in rdatasets
$ retriever ls rdataset all
List of all the packages present in Rdatasets
aer cluster dragracer fpp2 gt islr mass multgee plyr robustbase stevedata
asaur count drc gap histdata kmsurv mediation nycflights13 pscl rpart survival
boot daag ecdat geepack hlmdiag lattice mi openintro psych sandwich texmex
cardata datasets evir ggplot2 hsaur lme4 mosaicdata palmerpenguins quantreg sem tidyr
causaldata dplyr forecast ggplot2movies hwde lmec mstate plm reshape2 stat2data vcd
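If you want to work with the listing programmatically, the lines printed by retriever ls rdataset can be parsed into records. The following is a minimal sketch, assuming the output keeps the "Package: ... Dataset: ... Script Name: ..." layout shown in the examples above; the names LISTING_LINE and parse_listing are hypothetical helpers and not part of Data Retriever.

```python
import re

# Pattern for one listing line. This assumes the
# "Package: ... Dataset: ... Script Name: ..." layout shown in the examples.
LISTING_LINE = re.compile(
    r"Package:\s*(?P<package>\S+)\s+"
    r"Dataset:\s*(?P<dataset>\S+)\s+"
    r"Script Name:\s*(?P<script>\S+)"
)


def parse_listing(text):
    """Return one dict per listing line; headers and '...' lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LISTING_LINE.search(line)
        if match:
            rows.append(match.groupdict())
    return rows


sample = """List of all available Rdatasets
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
...
Package: vcd Dataset: womenqueue Script Name: rdataset-vcd-womenqueue
"""

rows = parse_listing(sample)
print([row["script"] for row in rows])
# ['rdataset-aer-affairs', 'rdataset-vcd-womenqueue']
```

The script values collected this way can be passed unchanged to the download and install commands.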
Downloading the Rdatasets
The retriever download rdataset-<package>-<dataset> command downloads the dataset <dataset> from the package <package>.
You can also copy the script name from the output of retriever ls rdataset.
Example
This example downloads the rdataset-vcd-bundesliga dataset.
$ retriever download rdataset-vcd-bundesliga
=> Installing rdataset-vcd-bundesliga
Downloading Bundesliga.csv: 60.0B [00:00, 117B/s]
Done!
The downloaded raw data files are stored in the raw_data directory inside the ~/.retriever directory.
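To check what has been downloaded, you can list the files under that directory. This is a small sketch that relies only on the ~/.retriever/raw_data location stated above; list_raw_data is a hypothetical helper, and the way files are grouped inside raw_data is not assumed.

```python
from pathlib import Path


def list_raw_data(retriever_home=None):
    """Return sorted relative paths of all files under <retriever_home>/raw_data.

    retriever_home defaults to ~/.retriever, the location named in this tutorial.
    An empty list is returned if nothing has been downloaded yet.
    """
    home = Path(retriever_home) if retriever_home else Path.home() / ".retriever"
    raw_dir = home / "raw_data"
    if not raw_dir.is_dir():
        return []
    return sorted(
        path.relative_to(raw_dir).as_posix()
        for path in raw_dir.rglob("*")
        if path.is_file()
    )


for name in list_raw_data():
    print(name)
```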
Installing the Rdatasets
The retriever install <engine> rdataset-<package>-<dataset> command downloads the raw data, creates a script for it, and then installs the dataset <dataset> from the package <package> into the given engine.
Example
This example installs the rdataset-aer-usmoney dataset into the postgres engine.
$ retriever install postgres rdataset-aer-usmoney
=> Installing rdataset-aer-usmoney
Downloading USMoney.csv: 1.00B [00:00, 2.52B/s]
Processing... USMoney.csv
Successfully wrote scripts to /home/user/.retriever/rdataset-scripts/usmoney.csv.json
Updating script name to rdataset-aer-usmoney.json
Updating the contents of script rdataset-aer-usmoney
Successfully updated rdataset_aer_usmoney.json
Updated the script rdataset-aer-usmoney
Creating database rdataset_aer_usmoney...
Installing rdataset_aer_usmoney.usmoney
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:00<00:00, 2225.09rows/s]
Done!
The script created for the Rdataset is stored in the rdataset-scripts directory inside the ~/.retriever directory.
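When the install step is driven from another program, the command can be assembled as an argument list instead of a shell string. The sketch below only builds the retriever install <engine> rdataset-<package>-<dataset> form described above; build_install_command and run_install are hypothetical helpers, and run_install assumes the retriever executable is on PATH and the engine is already configured.

```python
import shutil
import subprocess


def build_install_command(engine, package, dataset):
    """Build the argument list for: retriever install <engine> rdataset-<package>-<dataset>."""
    script_name = f"rdataset-{package}-{dataset}"
    return ["retriever", "install", engine, script_name]


def run_install(engine, package, dataset):
    """Run the install command; raises if retriever is missing or the command fails."""
    if shutil.which("retriever") is None:
        raise RuntimeError("retriever executable not found on PATH")
    command = build_install_command(engine, package, dataset)
    return subprocess.run(command, check=True)


# Only print the command here; call run_install(...) to actually execute it.
print(" ".join(build_install_command("postgres", "aer", "usmoney")))
# retriever install postgres rdataset-aer-usmoney
```

Passing a list to subprocess.run avoids shell quoting issues if a package or dataset name ever contains unusual characters.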
Python Interface in Data Retriever
Updating Rdatasets Catalog
The function update_rdataset_catalog creates or updates the datasets_url.json file in the ~/.retriever/rdataset-scripts directory, which contains the information about all the Rdatasets.
>>> import retriever as rt
>>> rt.update_rdataset_catalog()
Note
The update_rdataset_catalog function takes an argument test, which defaults to False.
If test is set to True, the contents of the datasets_url.json file are returned as a dict.
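Once the catalog has been written, it can also be read back directly from disk. The following sketch assumes only the file location given above and that the file holds JSON; load_catalog is a hypothetical helper, and nothing is assumed about the keys inside the catalog.

```python
import json
from pathlib import Path


def load_catalog(path=None):
    """Return the parsed datasets_url.json, or None if the file does not exist yet."""
    if path is None:
        # Default location as described in this tutorial.
        path = Path.home() / ".retriever" / "rdataset-scripts" / "datasets_url.json"
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


catalog = load_catalog()
if catalog is None:
    print("No catalog found; run rt.update_rdataset_catalog() first.")
else:
    print(f"Catalog loaded with {len(catalog)} top-level entries.")
```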
Listing Rdatasets
The function display_all_rdataset_names prints the package, dataset name, and script name for the Rdatasets in the requested package(s).
If no package is specified, it prints all the Rdatasets; if all is passed as the argument, it prints all the package names instead.
Note
The package_name argument takes a list when you want to display Rdatasets by package.
To display all package names, set package_name to the string 'all' (refer to the example below).
>>> import retriever as rt
>>>
>>> # Display all Rdatasets
>>> rt.display_all_rdataset_names()
List of all available Rdatasets
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: vcd Dataset: vonbort Script Name: rdataset-vcd-vonbort
Package: vcd Dataset: weldondice Script Name: rdataset-vcd-weldondice
Package: vcd Dataset: womenqueue Script Name: rdataset-vcd-womenqueue
>>>
>>> # Display all the Rdatasets present in packages 'aer' and 'drc'
>>> rt.display_all_rdataset_names(['aer', 'drc'])
List of all available Rdatasets in packages: ['aer', 'drc']
Package: aer Dataset: affairs Script Name: rdataset-aer-affairs
Package: aer Dataset: argentinacpi Script Name: rdataset-aer-argentinacpi
Package: aer Dataset: bankwages Script Name: rdataset-aer-bankwages
...
Package: drc Dataset: spinach Script Name: rdataset-drc-spinach
Package: drc Dataset: terbuthylazin Script Name: rdataset-drc-terbuthylazin
Package: drc Dataset: vinclozolin Script Name: rdataset-drc-vinclozolin
>>>
>>> # Display all the packages in Rdatasets
>>> rt.display_all_rdataset_names('all')
List of all the packages present in Rdatasets
aer cluster dragracer fpp2 gt islr mass multgee plyr robustbase stevedata
asaur count drc gap histdata kmsurv mediation nycflights13 pscl rpart survival
boot daag ecdat geepack hlmdiag lattice mi openintro psych sandwich texmex
cardata datasets evir ggplot2 hsaur lme4 mosaicdata palmerpenguins quantreg sem tidyr
causaldata dplyr forecast ggplot2movies hwde lmec mstate plm reshape2 stat2data vcd
Downloading an Rdataset
>>> import retriever as rt
>>> rt.download('rdataset-drc-earthworms')
Installing an Rdataset
>>> import retriever as rt
>>> rt.install_postgres('rdataset-mass-galaxies')
Note
For downloading or installing the Rdatasets, the script name must follow the syntax rdataset-<package name>-<dataset name>, and both the package name and the dataset name must be valid names of an existing package and dataset.
Example:
- Correct: rdataset-drc-earthworms
- Incorrect: rdataset-drcearthworms, rdatasetdrcearthworms
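The naming rule can be checked before calling download or install. This is a minimal sketch that tests only the shape rdataset-<package name>-<dataset name>; it does not verify that the package or dataset actually exists, and split_script_name is a hypothetical helper rather than a Data Retriever function.

```python
def split_script_name(script_name):
    """Return (package, dataset) for a well-formed name, or None otherwise.

    A well-formed name has the prefix 'rdataset', then a non-empty package
    name, then a non-empty dataset name, separated by hyphens.
    """
    parts = script_name.split("-", 2)
    if len(parts) != 3 or parts[0] != "rdataset" or not all(parts[1:]):
        return None
    return parts[1], parts[2]


for name in (
    "rdataset-drc-earthworms",
    "rdataset-drcearthworms",
    "rdatasetdrcearthworms",
):
    print(name, "->", split_script_name(name))
# rdataset-drc-earthworms -> ('drc', 'earthworms')
# rdataset-drcearthworms -> None
# rdatasetdrcearthworms -> None
```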