Data Retriever using Python
Data Retriever is written purely in Python. The Python interface provides the core functionality supported by the CLI (Command Line Interface).
Installation
The installation instructions for the CLI and the Python module are the same. Links are provided below for convenience.
- Instructions for installing from binaries: project website.
- Instructions for installing from source: Install from Source.
Note: The Python interface requires version 2.1 or later.
Tutorial
Importing retriever
>>> import retriever
In this tutorial, the module will be referred to as rt.
>>> import retriever as rt
List Datasets
List the available datasets using the dataset_names function.
The function returns a list of all the currently available scripts.
>>> rt.dataset_names()
['abalone-age',
'antarctic-breed-bird',
.
.
'wine-composition',
'wine-quality']
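Because dataset_names returns a plain list of strings, it can be filtered with ordinary Python. The sketch below uses a hypothetical helper, names_with_prefix, and a short hand-written sample list in place of the real return value, so it runs without retriever installed. With the module available, rt.dataset_names() could be passed instead.

```python
def names_with_prefix(names, prefix):
    """Return the dataset names that start with `prefix`, sorted."""
    return sorted(name for name in names if name.startswith(prefix))


# Sample names copied from the listing above; a stand-in for rt.dataset_names().
sample_names = ["abalone-age", "antarctic-breed-bird",
                "wine-composition", "wine-quality"]

wine_names = names_with_prefix(sample_names, "wine")
print(wine_names)  # ['wine-composition', 'wine-quality']
```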
For a more detailed description of the scripts installed in retriever, the datasets function can be used. This function returns a list of Scripts objects.
From these objects, we can access the attributes of each available script as follows.
>>> for dataset in rt.datasets():
...     print(dataset.name)
...
abalone-age
airports
amniote-life-hist
antarctic-breed-bird
aquatic-animal-excretion
.
.
The Scripts class provides many attributes. Some notably useful ones are:
- name
- citation
- description
- keywords
- title
- urls
- version
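As a rough illustration of reading these attributes, the sketch below builds a one-line summary from a script object. The helper summarize and the SimpleNamespace stand-in are hypothetical, and the code assumes that keywords is a list of strings, which the text above does not state. With retriever installed, each item of rt.datasets() could be passed in place of the stand-in.

```python
from types import SimpleNamespace


def summarize(script):
    """Build a one-line summary from a script's name, version and keywords."""
    # getattr with a default keeps the helper working if an attribute is absent.
    name = getattr(script, "name", "unknown")
    version = getattr(script, "version", "n/a")
    keywords = getattr(script, "keywords", None) or []  # assumed: list of str
    return "{} (version {}): {}".format(name, version, ", ".join(keywords))


# Hypothetical stand-in for one item of rt.datasets(); the values are made up.
example = SimpleNamespace(name="example-dataset", version="1.0.0",
                          keywords=["demo", "sample"])
print(summarize(example))  # example-dataset (version 1.0.0): demo, sample
```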
You can also add more datasets locally; see the Adding dataset documentation.
Update Datasets
If there are no scripts available, or you want to update scripts to the latest version, check_for_updates will download the most recent version of all scripts.
>>> rt.check_for_updates()
Downloading scripts...
Download Progress: [####################] 100.00%
The retriever is up-to-date
Downloading recipes for all datasets can take a while depending on the internet connection.
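Since a full update can be slow, one option is to trigger it only when no scripts are listed. The sketch below is a minimal illustration of that idea, assuming dataset_names returns a list as described above. The helper ensure_scripts and the FakeRetriever class are hypothetical; the fake stands in for the real module so the example runs offline, and with retriever installed the imported module itself could be passed.

```python
def ensure_scripts(rt_module):
    """Call check_for_updates() only when no scripts are listed.

    Returns True if an update was triggered, False otherwise.
    """
    if rt_module.dataset_names():
        return False
    rt_module.check_for_updates()
    return True


class FakeRetriever:
    """Hypothetical stand-in that mimics the two calls used above."""

    def __init__(self):
        self.names = []

    def dataset_names(self):
        return list(self.names)

    def check_for_updates(self):
        # Pretend the update fetched one script.
        self.names = ["iris"]


fake = FakeRetriever()
first = ensure_scripts(fake)   # no scripts yet, so an update runs
second = ensure_scripts(fake)  # scripts now present, so nothing happens
print(first, second)  # True False
```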
Download Datasets
To download datasets directly without cleaning them, use the download function:
def download(dataset, path='./', quiet=False, subdir=False, debug=False):
A simple download of the iris dataset can be done using the following.
>>> rt.download("iris")
Output:
=> Downloading iris
Downloading bezdekIris.data...
100% 0 seconds Copying bezdekIris.data
The files will be downloaded into your current working directory by default.
You can change the download location by using the path parameter.
Here, we are downloading the NPN dataset to our Desktop directory:
>>> rt.download("NPN", "/Users/username/Desktop")
Output:
=> Downloading NPN
Downloading 2009-01-01.xml...
11 MBB
Downloading 2009-04-02.xml...
42 MBB
.
.
A description of the parameters of the download function:
path (String): Specify the dataset download path.
quiet (Bool): Setting True minimizes the console output.
subdir (Bool): Setting True keeps the subdirectories for archived files.
debug (Bool): Setting True helps in debugging in case of errors.
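The parameters above can also be passed by keyword. The following sketch creates the target directory first and then calls download with path and quiet set. The helper download_quietly and the RecordingRetriever class are hypothetical; the recorder only logs the call so the example runs without network access, and the real module could be passed in its place.

```python
import os
import tempfile


def download_quietly(rt_module, dataset, target_dir):
    """Create `target_dir` if needed, then download `dataset` into it."""
    os.makedirs(target_dir, exist_ok=True)
    # path and quiet are keyword parameters from the signature shown earlier.
    rt_module.download(dataset, path=target_dir, quiet=True)
    return target_dir


class RecordingRetriever:
    """Hypothetical stand-in that records calls instead of downloading."""

    def __init__(self):
        self.calls = []

    def download(self, dataset, path="./", quiet=False, subdir=False, debug=False):
        self.calls.append((dataset, path, quiet))


recorder = RecordingRetriever()
target = os.path.join(tempfile.mkdtemp(), "raw_data")
download_quietly(recorder, "iris", target)
print(recorder.calls[0][0], recorder.calls[0][2])  # iris True
```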
Install Datasets
Retriever supports installation of datasets into 7 major databases and file formats.
- csv
- json
- msaccess
- mysql
- postgres
- sqlite
- xml
There are separate functions for installing into each of the 7 backends:
def install_csv(dataset, table_name=None, compile=False, debug=False,
                quiet=False, use_cache=True):
def install_json(dataset, table_name=None, compile=False,
                 debug=False, quiet=False, use_cache=True, pretty=False):
def install_msaccess(dataset, file=None, table_name=None,
                     compile=False, debug=False, quiet=False, use_cache=True):
def install_mysql(dataset, user='root', password='', host='localhost',
                  port=3306, database_name=None, table_name=None,
                  compile=False, debug=False, quiet=False, use_cache=True):
def install_postgres(dataset, user='postgres', password='',
                     host='localhost', port=5432, database='postgres',
                     database_name=None, table_name=None,
                     compile=False, debug=False, quiet=False, use_cache=True):
def install_sqlite(dataset, file=None, table_name=None,
                   compile=False, debug=False, quiet=False, use_cache=True):
def install_xml(dataset, table_name=None, compile=False, debug=False,
                quiet=False, use_cache=True):
A description of default parameters mentioned above:
compile (Bool): Setting True recompiles scripts upon installation.
database_name (String): Specify database name. For postgres, mysql users.
debug (Bool): Setting True helps in debugging in case of errors.
file (String): Enter file_name for database. For msaccess, sqlite users.
host (String): Specify host name for database. For postgres, mysql users.
password (String): Specify password for database. For postgres, mysql users.
port (Int): Specify the port number for installation. For postgres, mysql users.
pretty (Bool): Setting True adds indentation in JSON files.
quiet (Bool): Setting True minimizes the console output.
table_name (String): Specify the table name to install.
use_cache (Bool): Setting False reinstalls scripts even if they are already installed.
user (String): Specify the username. For postgres, mysql users.
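Because the seven functions share the install_<backend> naming pattern, the backend can be chosen at run time by name. The sketch below shows one possible way to do that. The helper install and the StubRetriever class are hypothetical; the stub only echoes its arguments so the example runs without retriever or a database, and the real module could be passed in its place.

```python
BACKENDS = ("csv", "json", "msaccess", "mysql", "postgres", "sqlite", "xml")


def install(rt_module, backend, dataset, **options):
    """Look up install_<backend> on the module and call it with the options."""
    if backend not in BACKENDS:
        raise ValueError("unsupported backend: {!r}".format(backend))
    installer = getattr(rt_module, "install_" + backend)
    return installer(dataset, **options)


class StubRetriever:
    """Hypothetical stand-in that echoes its arguments instead of installing."""

    def install_json(self, dataset, **options):
        return ("json", dataset, options)


stub = StubRetriever()
result = install(stub, "json", "wine-composition", pretty=True)
print(result)  # ('json', 'wine-composition', {'pretty': True})
```

Rejecting unknown backend names before the lookup gives a clearer error than the AttributeError that getattr would otherwise raise.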
Examples of Installing Datasets:
Here, we are installing the dataset wine-composition as a CSV file in our current working directory.
>>> rt.install_csv("wine-composition")
=> Installing wine-composition
Downloading wine.data...
100% 0 seconds Progress: 178/178 rows inserted into ./wine_composition_WineComposition.csv totaling 178
The installed file is called wine_composition_WineComposition.csv
Similarly, we can install any available dataset as a JSON file:
>>> rt.install_json("wine-composition")
=> Installing wine-composition
Progress: 178/178 rows inserted into ./wine_composition_WineComposition.json totaling 178
The wine-composition dataset is now installed as a JSON file called wine_composition_WineComposition.json in our current working directory.
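An installed CSV file can be checked with the standard library alone. The sketch below counts data rows with the csv module, assuming the first row of the file is a header, which this page does not confirm for retriever's CSV output. The helper count_rows is hypothetical, and a tiny made-up file stands in for wine_composition_WineComposition.csv so the example runs anywhere.

```python
import csv
import os
import tempfile


def count_rows(csv_path):
    """Count the data rows in a CSV file, assuming the first row is a header."""
    with open(csv_path, newline="") as handle:
        return sum(1 for _ in csv.DictReader(handle))


# Tiny made-up file standing in for wine_composition_WineComposition.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "value"])  # header row
    writer.writerows([[1, "a"], [2, "b"], [3, "c"]])

print(count_rows(demo_path))  # 3
```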