Data Retriever using Python

Data Retriever is written purely in Python. The Python interface provides the core functionality supported by the CLI (Command Line Interface).

Installation

The installation instructions for the CLI and module are the same. Links have been provided below for convenience.

Instructions for installing from binaries: project website.
Instructions for installing from source: install from Source.

Note: The Python interface requires Data Retriever version 2.1 or above.

Tutorial

Importing retriever

>>> import retriever

In this tutorial, the module will be referred to as rt.

>>> import retriever as rt

List Datasets

List the available datasets using the dataset_names function. The function returns a list of all the currently available scripts.

>>> rt.dataset_names()

['abalone-age',
 'antarctic-breed-bird',
 .
 .
 'wine-composition',
 'wine-quality']
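Because the result is an ordinary Python list, it can be filtered with standard tools. The following is a minimal sketch, assuming dataset_names returns a flat list of strings as shown above; names_with_prefix is a hypothetical helper, not part of retriever, and the sample list stands in for the real return value.

```python
def names_with_prefix(names, prefix):
    """Return the dataset names that start with `prefix`, preserving order."""
    return [name for name in names if name.startswith(prefix)]


# Sample copied from the listing above; in practice pass rt.dataset_names().
sample = ["abalone-age", "antarctic-breed-bird", "wine-composition", "wine-quality"]
print(names_with_prefix(sample, "wine"))
```

With the sample list, this prints the two wine datasets.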

For a more detailed description of the scripts installed in retriever, use the datasets function. This function returns a list of Scripts objects. From these objects, we can access each script's attributes as follows.

>>> for dataset in rt.datasets():
...     print(dataset.name)

abalone-age
airports
amniote-life-hist
antarctic-breed-bird
aquatic-animal-excretion
.
.

The Scripts class provides many attributes. Some particularly useful ones are:

- name
- citation
- description
- keywords
- title
- urls
- version

You can also add more datasets locally; see the Adding dataset documentation.

Update Datasets

If there are no scripts available, or you want to update scripts to the latest version, check_for_updates will download the most recent version of all scripts.

>>> rt.check_for_updates()

Downloading scripts...
Download Progress: [####################] 100.00%
The retriever is up-to-date

Downloading recipes for all datasets can take a while depending on the internet connection.
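Since the update can be slow, one option is to trigger it only when no scripts are available. The sketch below assumes that dataset_names returns an empty list in that case, which is an assumption rather than documented behaviour; ensure_scripts is a hypothetical helper that takes the module as an argument.

```python
def ensure_scripts(rt_module):
    """Call check_for_updates() only when dataset_names() returns nothing.

    `rt_module` is expected to be the imported retriever module.
    Returns True if an update was triggered, False otherwise.
    """
    if rt_module.dataset_names():
        return False
    rt_module.check_for_updates()
    return True
```

Typical use would be ensure_scripts(rt) once at the start of a session.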

Download Datasets

To download datasets directly, without cleaning them, use the download function:

def download(dataset, path='./', quiet=False, subdir=False, debug=False):

A simple download for the iris dataset can be done using the following.

>>> rt.download("iris")

Output:

=> Downloading iris

Downloading bezdekIris.data...
100%  0 seconds Copying bezdekIris.data

The files are downloaded into your current working directory by default. You can change the download location with the path parameter. Here, we download the NPN dataset to our Desktop directory:

>>> rt.download("NPN", "/Users/username/Desktop")

Output:

=> Downloading NPN

Downloading 2009-01-01.xml...
11  MBB
Downloading 2009-04-02.xml...
42  MBB
.
.

path (String): Specify dataset download path.

quiet  (Bool): Setting True minimizes the console output.

subdir (Bool): Setting True keeps the subdirectories for archived files.

debug  (Bool): Setting True helps in debugging in case of errors.
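The path and quiet parameters combine naturally when fetching several datasets in one go. The following sketch uses a hypothetical helper, download_all, which receives the download function as an argument (rt.download in practice) and relies only on the keyword names shown in the signature above.

```python
import os


def download_all(download, datasets, path, quiet=True):
    """Download each dataset into `path`, creating the directory if needed.

    `download` is expected to be rt.download; passing it in keeps the helper
    usable without network access. Returns a dict mapping each dataset name
    to None on success or to the exception that was raised.
    """
    os.makedirs(path, exist_ok=True)
    results = {}
    for dataset in datasets:
        try:
            download(dataset, path=path, quiet=quiet)
            results[dataset] = None
        except Exception as exc:  # keep going so one failure does not stop the batch
            results[dataset] = exc
    return results
```

For example, download_all(rt.download, ["iris", "NPN"], "./raw_data") would attempt both downloads and report any failure per dataset.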

Install Datasets

Retriever supports installation of datasets into 7 major databases and file formats.

- csv
- json
- msaccess
- mysql
- postgres
- sqlite
- xml

There are separate functions for installing into each of the 7 backends:

def install_csv(dataset, table_name=None, compile=False, debug=False,
            quiet=False, use_cache=True):

def install_json(dataset, table_name=None, compile=False,
             debug=False, quiet=False, use_cache=True, pretty=False):

def install_msaccess(dataset, file=None, table_name=None,
                 compile=False, debug=False, quiet=False, use_cache=True):

def install_mysql(dataset, user='root', password='', host='localhost',
              port=3306, database_name=None, table_name=None,
              compile=False, debug=False, quiet=False, use_cache=True):

def install_postgres(dataset, user='postgres', password='',
                 host='localhost', port=5432, database='postgres',
                 database_name=None, table_name=None,
                 compile=False, debug=False, quiet=False, use_cache=True):

def install_sqlite(dataset, file=None, table_name=None,
               compile=False, debug=False, quiet=False, use_cache=True):

def install_xml(dataset, table_name=None, compile=False, debug=False,
            quiet=False, use_cache=True):

A description of the parameters listed above:

compile         (Bool): Setting True recompiles scripts upon installation.

database_name (String): Specify database name. For postgres, mysql users.

debug           (Bool): Setting True helps in debugging in case of errors.

file          (String): Specify the file name for the database. For msaccess, sqlite users.

host          (String): Specify host name for database. For postgres, mysql users.

password      (String): Specify password for database. For postgres, mysql users.

port             (Int): Specify the port number for installation. For postgres, mysql users.

pretty          (Bool): Setting True adds indentation in JSON files.

quiet           (Bool): Setting True minimizes the console output.

table_name    (String): Specify the table name to install.

use_cache       (Bool): Setting False reinstalls scripts even if they are already installed.

user          (String): Specify the username. For postgres, mysql users.
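Because the seven functions share the install_ prefix, the backend can be chosen at run time. This is a minimal sketch: install and BACKENDS are hypothetical names, the module is passed in as an argument, and keyword options are forwarded unchanged to whichever function is selected.

```python
# Backend names taken from the list of supported formats above.
BACKENDS = ("csv", "json", "msaccess", "mysql", "postgres", "sqlite", "xml")


def install(rt_module, backend, dataset, **options):
    """Look up rt_module.install_<backend> and call it with the given options."""
    if backend not in BACKENDS:
        raise ValueError("unsupported backend: %r" % (backend,))
    installer = getattr(rt_module, "install_" + backend)
    return installer(dataset, **options)
```

For example, install(rt, "sqlite", "iris", file="iris.db") would be equivalent to calling rt.install_sqlite("iris", file="iris.db").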

Examples of installing datasets:

Here, we install the dataset wine-composition as a CSV file in our current working directory.

>>> rt.install_csv("wine-composition")

=> Installing wine-composition

Downloading wine.data...
100%  0 seconds Progress: 178/178 rows inserted into ./wine_composition_WineComposition.csv totaling 178

The installed file is called wine_composition_WineComposition.csv.
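Once installed, the CSV file can be read back with the standard library. The sketch below assumes the file begins with a header row; read_rows is a hypothetical helper and does not depend on retriever.

```python
import csv


def read_rows(csv_path):
    """Return the rows of a CSV file as a list of dicts keyed by the header row."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

For the example above, read_rows("wine_composition_WineComposition.csv") would return one dictionary per row, with every value held as a string.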

Similarly, we can install any available dataset as a JSON file:

>>> rt.install_json("wine-composition")

=> Installing wine-composition

Progress: 178/178 rows inserted into ./wine_composition_WineComposition.json totaling 178

The wine-composition dataset is now installed as a JSON file called wine_composition_WineComposition.json in our current working directory.