Data Retriever using Python

Data Retriever is written purely in Python. The Python interface provides the core functionality supported by the CLI (Command Line Interface).

Installation

The installation instructions for the CLI and module are the same. Links have been provided below for convenience.

Note: The Python interface requires retriever version 2.1 or above.

Tutorial

Importing retriever

>>> import retriever

In this tutorial, the module will be referred to as rt.

>>> import retriever as rt

List Datasets

List the available datasets using the dataset_names function. It returns a list of all currently available scripts.

>>> rt.dataset_names()

['abalone-age',
 'antarctic-breed-bird',
 .
 .
 'wine-composition',
 'wine-quality']

For a more detailed description of the scripts installed in retriever, the datasets function can be used. This function returns a list of objects of the Script class. From these objects, we can access each available script's attributes.

>>> for dataset in rt.datasets():
...     print(dataset.name)

The Script class provides many different attributes. Some notably useful ones are:

name
citation
description
keywords
title
urls
version
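
For example, one might look up a single script by name and inspect its attributes (this sketch assumes the iris dataset is among the available scripts):

>>> iris = next(s for s in rt.datasets() if s.name == "iris")
>>> print(iris.title)
>>> print(iris.citation)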

You can also add more datasets locally yourself; see the documentation on adding datasets.

Update Datasets

If there are no scripts available, or you want to update scripts to the latest version, check_for_updates will download the most recent version of all scripts.

>>> rt.check_for_updates()

Downloading recipes for all datasets can take a while, depending on your internet connection.

Download Datasets

To download datasets directly, without cleaning them, use the download function:

def download(dataset, path='./', quiet=False, subdir=False, debug=False):

A simple download of the iris dataset can be done using:

>>> rt.download("iris")

Output:

=> Downloading iris

Downloading bezdekIris.data...
100%  0 seconds Copying bezdekIris.data

This downloads the dataset in your current working directory. You can control where the dataset is downloaded using the path parameter.

path (String): Specify dataset download path.

quiet  (Bool): Setting True minimizes the console output.

subdir (Bool): Setting True keeps the subdirectories for archived files.

debug  (Bool): Setting True helps in debugging in case of errors.
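
Putting these parameters together, a quieter download into a specific directory might look like this (the "./data" path here is only an illustration):

>>> rt.download("iris", path="./data", quiet=True)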

Install Datasets

Retriever supports installation of datasets into 7 major databases and file formats.

There are separate functions for installing into each of the 7 backends:

def install_csv(dataset, table_name=None, compile=False, debug=False,
                quiet=False, use_cache=True):

def install_json(dataset, table_name=None, compile=False,
                 debug=False, quiet=False, use_cache=True):

def install_msaccess(dataset, file=None, table_name=None,
                     compile=False, debug=False, quiet=False, use_cache=True):

def install_mysql(dataset, user='root', password='', host='localhost',
                  port=3306, database_name=None, table_name=None,
                  compile=False, debug=False, quiet=False, use_cache=True):

def install_postgres(dataset, user='postgres', password='',
                     host='localhost', port=5432, database='postgres',
                     database_name=None, table_name=None,
                     compile=False, debug=False, quiet=False, use_cache=True):

def install_sqlite(dataset, file=None, table_name=None,
                   compile=False, debug=False, quiet=False, use_cache=True):

def install_xml(dataset, table_name=None, compile=False, debug=False,
                quiet=False, use_cache=True):
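
For example, the iris dataset could be installed into a CSV file or an SQLite database as follows (the file name "iris.db" is only illustrative):

>>> rt.install_csv("iris")
>>> rt.install_sqlite("iris", file="iris.db")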

A description of the default parameters mentioned above: