Data Storage (storage)

Orange.data.storage.Storage is an abstract class representing a data object in which rows represent data instances (examples, in machine learning terminology) and columns represent variables (features, attributes, classes, targets, meta attributes).

Data is divided into three parts that represent independent variables (X), dependent variables (Y) and meta data (metas). If practical, the class should expose those parts as properties. In the associated domain (Orange.data.Domain), the three parts correspond to the lists of variable descriptors attributes, class_vars and metas, respectively.

Any of those parts may be missing, dense, sparse or sparse boolean. The difference between the latter two is that sparse data can be seen as a list of (variable, value) pairs, while in sparse boolean data a variable (item) is simply present or absent, as in market basket analysis. The actual storage of sparse data depends upon the storage type.
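As a rough illustration of the difference between the representations, the same row could be held in the three ways shown below. The variable names and the dict/set encodings are assumptions made for this sketch only; as noted above, the actual sparse storage depends on the storage type.

```python
# Hypothetical variables, for illustration only.
variables = ["bread", "milk", "eggs", "beer"]

# Dense: one value per variable, in the order of the variables.
dense_row = [2.0, 0.0, 0.0, 1.0]

# Sparse: only the non-zero entries, as (variable, value) pairs.
sparse_row = {var: val for var, val in zip(variables, dense_row) if val != 0}

# Sparse boolean: only presence matters, as in market basket analysis.
sparse_bool_row = {var for var, val in zip(variables, dense_row) if val != 0}


def to_dense(sparse, variables):
    """Expand (variable, value) pairs back into a dense list."""
    return [sparse.get(var, 0.0) for var in variables]
```

The sparse boolean form cannot be expanded back to the original values, since it records only which items are present.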

There is no uniform constructor signature: every derived class provides one or more specific constructors.

There are currently two derived classes, Orange.data.Table and Orange.data.sql.Table. The former stores the data in memory, in numpy objects, and the latter in SQL (currently, only PostgreSQL is supported).

Derived classes must implement at least the methods for getting rows and the number of instances (__getitem__ and __len__). To make the storage fast enough to be practically useful, a derived class must also reimplement a number of filters, preprocessors and aggregators. For instance, the method _filter_values(self, filter) returns a new storage which contains only the rows that match the criteria given in the filter. Orange.data.Table implements an efficient method based on numpy indexing, while Orange.data.sql.Table, which "stores" a table as an SQL query, converts the filter into a WHERE clause.
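The two strategies can be contrasted with a small sketch. The classes ListStorage and QueryStorage, the (column, value) filter tuples and the "?" placeholder style are all hypothetical simplifications; they do not derive from the real Storage class and are not how Orange.data.Table or Orange.data.sql.Table are written.

```python
class ListStorage:
    """Toy in-memory storage: rows are kept as a list of dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]

    def _filter_values(self, flt):
        # Assumption: a filter is just a (column, value) equality test.
        column, value = flt
        return ListStorage(r for r in self.rows if r[column] == value)


class QueryStorage:
    """Toy lazy storage: the "table" is only a description of a query."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def _filter_values(self, flt):
        # Nothing is executed here; the condition is appended to the query.
        return QueryStorage(self.table, self.conditions + (tuple(flt),))

    def sql(self):
        # Identifiers are trusted in this toy; values are passed as parameters.
        query = f"SELECT * FROM {self.table}"
        if self.conditions:
            query += " WHERE " + " AND ".join(f"{c} = ?" for c, _ in self.conditions)
        return query, [v for _, v in self.conditions]
```

In the first class each filter does the work immediately and returns the matching rows; in the second, filters only accumulate conditions, and the database does the work when the query is finally run.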

Orange.data.storage.domain (Orange.data.Domain)

The domain describing the columns of the data.

Data access

Orange.data.storage.__getitem__(self, index)

Return one or more rows of the data.

  • If the index is an int, e.g. data[7], the corresponding row is returned as an instance of Instance. Concrete implementations of Storage use specific derived classes for instances.

  • If the index is a slice or a sequence of ints (e.g. data[7:10] or data[[7, 42, 15]]), indexing returns a new storage with the selected rows.

  • If there are two indices, where the first is an int (a row number) and the second can be interpreted as columns, e.g. data[3, 5] or data[3, 'gender'] or data[3, y] (where y is an instance of Variable), a single value is returned as an instance of Value.

  • In all other cases, the first index should be a row index, a slice or a sequence, and the second index, which represents a set of columns, should be an int, a slice, a sequence or a numpy array. The result is a new storage with a new domain.
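The four cases above can be sketched as a dispatch inside __getitem__. ToyTable is a hypothetical stand-in: it keeps rows as plain lists, uses column names in place of a domain, and returns plain lists and values rather than Instance and Value objects.

```python
class ToyTable:
    """Hypothetical stand-in for a storage; not Orange's implementation."""

    def __init__(self, columns, rows):
        self.columns = list(columns)   # column names play the role of the domain
        self.rows = [list(r) for r in rows]

    def __len__(self):
        return len(self.rows)

    def _col_index(self, col):
        # A column may be given by position or by name.
        return col if isinstance(col, int) else self.columns.index(col)

    def _row_indices(self, key):
        if isinstance(key, slice):
            return list(range(*key.indices(len(self))))
        return list(key)

    def __getitem__(self, key):
        if isinstance(key, int):               # data[7]: a single row
            return self.rows[key]
        if not isinstance(key, tuple):         # data[7:10] or data[[7, 42, 15]]
            rows = [self.rows[i] for i in self._row_indices(key)]
            return ToyTable(self.columns, rows)
        row_key, col_key = key
        if isinstance(row_key, int) and isinstance(col_key, (int, str)):
            return self.rows[row_key][self._col_index(col_key)]   # data[3, 'gender']
        # Remaining cases: a set of rows and a set of columns give a new
        # table with a new list of columns.
        rows = [row_key] if isinstance(row_key, int) else self._row_indices(row_key)
        if isinstance(col_key, slice):
            cols = list(range(*col_key.indices(len(self.columns))))
        elif isinstance(col_key, (int, str)):
            cols = [self._col_index(col_key)]
        else:
            cols = [self._col_index(c) for c in col_key]
        return ToyTable([self.columns[c] for c in cols],
                        [[self.rows[r][c] for c in cols] for r in rows])
```

For example, with columns gender, age and height, t[2, "age"] gives a single value, while t[0:2, ["age"]] gives a new two-row table whose only column is age.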

Orange.data.storage.__len__(self)

Return the number of data instances (rows).

Inspection

Storage.X_density, Storage.Y_density, Storage.metas_density

Indicate whether the attributes, classes and meta attributes are dense (Storage.DENSE) or sparse (Storage.SPARSE). If they are sparse and all values are 0 or 1, they are marked as Storage.SPARSE_BOOL. The Storage class provides a default DENSE. If the data has no attributes, classes or meta attributes, the corresponding method should re

Filters

Storage should define the following methods to optimize the filtering operations as allowed by the underlying data structure. Orange.data.Table executes them directly through numpy (or bottleneck or related) methods, while Orange.data.sql.Table appends them to the WHERE clause of the query that defines the data.

These methods should not be called directly but through the classes defined in Orange.data.filter. Methods in Orange.data.filter also provide the slower fallback functions for the functions not defined in the storage.
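The relation between a filter class and the storage method can be pictured roughly as follows. This is a sketch of the pattern only: IsDefinedSketch and FastRows are hypothetical names, unknown values are encoded as None, and the use of NotImplementedError to signal a missing optimized method is an assumption rather than a statement about Orange.data.filter.

```python
class IsDefinedSketch:
    """Hypothetical filter object; illustrative, not Orange's code."""

    def __init__(self, negate=False):
        self.negate = negate

    def __call__(self, data):
        # Prefer the storage's own optimized method, if it provides one.
        fast = getattr(data, "_filter_is_defined", None)
        if fast is not None:
            try:
                return fast(negate=self.negate)
            except NotImplementedError:
                pass
        # Slower fallback: inspect every row in Python.
        return [row for row in data
                if all(v is not None for v in row) != self.negate]


class FastRows(list):
    """Toy storage that supplies its own version of the filter."""

    def _filter_is_defined(self, columns=None, negate=False):
        self.used_fast_path = True
        return FastRows(r for r in self
                        if all(v is not None for v in r) != negate)
```

Calling IsDefinedSketch()(rows) on a plain list takes the fallback path, while the same call on a FastRows object is delegated to the storage.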

Orange.data.storage._filter_is_defined(self, columns=None, negate=False)

Extract rows without undefined values.

Parameters:
  • columns (sequence of ints, variable names or descriptors) -- optional list of columns that are checked for unknowns

  • negate (bool) -- invert the selection

Returns:

a new storage of the same type or Table

Return type:

Orange.data.storage.Storage
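The intended behaviour of the columns and negate arguments can be sketched on plain lists. The sketch assumes that unknown values are encoded as NaN and that columns are given as integer positions only; filter_is_defined is a hypothetical free function, not the storage method itself.

```python
import math


def filter_is_defined(rows, columns=None, negate=False):
    """Keep rows with no NaN in the checked columns (all columns if None)."""

    def defined(row):
        cols = range(len(row)) if columns is None else columns
        return not any(math.isnan(row[c]) for c in cols)

    # With negate=True the selection is inverted: rows with unknowns are kept.
    return [row for row in rows if defined(row) != negate]
```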

Orange.data.storage._filter_has_class(self, negate=False)

Return rows with known value of the target attribute. If there are multiple classes, all must be defined.

Parameters:

negate (bool) -- invert the selection

Returns:

a new storage of the same type or Table

Return type:

Orange.data.storage.Storage

Orange.data.storage._filter_same_value(self, column, value, negate=False)

Select rows based on a value of the given variable.

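Parameters:
  • column (int, variable name or descriptor) -- the column that is checked

  • value -- the value that the variable must have

  • negate (bool) -- invert the selection
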
Returns:

a new storage of the same type or Table

Return type:

Orange.data.storage.Storage

Orange.data.storage._filter_values(self, filter)

Apply the given filter to the data.

Parameters:

filter (Orange.data.Filter) -- A filter for selecting the rows

Returns:

a new storage of the same type or Table

Return type:

Orange.data.storage.Storage

Aggregators

Similarly to filters, storage classes should provide several methods for fast computation of statistics. These methods are not called directly but by modules within Orange.statistics.

Orange.data.storage._compute_basic_stats(self, columns=None, include_metas=False, compute_variance=False)

Compute basic statistics for the specified variables: the minimal and maximal value, the mean and the variance (or a zero placeholder), and the number of missing and defined values.

Parameters:
  • columns (list of ints, variable names or descriptors of type Orange.data.Variable) -- a list of columns for which the statistics are computed; if None, the function computes the statistics for all variables

  • include_metas (bool) -- a flag which tells whether to include meta attributes (applicable only if columns is None)

  • compute_variance (bool) -- a flag which tells whether to compute the variance

Returns:

a list with a tuple (min, max, mean, variance, #nans, #non-nans) for each variable

Return type:

list
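The shape of one result tuple can be sketched for a single column. The sketch assumes that unknowns are encoded as NaN, uses the population variance, and returns NaN for min, max and mean when no value is defined; none of these choices is specified above, and compute_basic_stats is a hypothetical helper.

```python
import math


def compute_basic_stats(column, compute_variance=False):
    """Return (min, max, mean, variance, #nans, #non-nans) for one column."""
    known = [v for v in column if not math.isnan(v)]
    n_nans = len(column) - len(known)
    if not known:
        # Assumption: nothing to summarize, so report NaN placeholders.
        return (math.nan, math.nan, math.nan, 0.0, n_nans, 0)
    mean = sum(known) / len(known)
    variance = 0.0  # zero placeholder when the variance is not requested
    if compute_variance:
        variance = sum((v - mean) ** 2 for v in known) / len(known)
    return (min(known), max(known), mean, variance, n_nans, len(known))
```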

Orange.data.storage._compute_distributions(self, columns=None)

Compute the distribution for the specified variables. The result is a list of pairs containing the distribution and the number of rows for which the variable value was missing.

For discrete variables, the distribution is represented as a vector with absolute frequency of each value. For continuous variables, the result is a 2-d array of shape (2, number-of-distinct-values); the first row contains (distinct) values of the variables and the second has their absolute frequencies.

Parameters:

columns (list of ints, variable names or descriptors of type Orange.data.Variable) -- a list of columns for which the distributions are computed; if None, the function runs over all variables

Returns:

a list of distributions

Return type:

list of numpy arrays
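The two result formats can be sketched with plain lists in place of numpy arrays. The sketch assumes that discrete values are stored as indices 0..n_values-1, that unknowns are NaN, and that distinct continuous values are reported in ascending order; the function names are hypothetical.

```python
import math
from collections import Counter


def discrete_distribution(column, n_values):
    """Absolute frequency of each value index, plus the number of unknowns."""
    counts = [0] * n_values
    missing = 0
    for v in column:
        if math.isnan(v):
            missing += 1
        else:
            counts[int(v)] += 1
    return counts, missing


def continuous_distribution(column):
    """Two rows (distinct values, their frequencies), plus the number of unknowns."""
    known = [v for v in column if not math.isnan(v)]
    freq = Counter(known)
    values = sorted(freq)
    return [values, [freq[v] for v in values]], len(column) - len(known)
```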

Storage._compute_contingency(col_vars=None, row_var=None)

Compute contingency matrices for one or more discrete or continuous variables against the specified discrete variable.

The resulting list contains a pair for each column variable. The first element contains the contingencies and the second element gives the distribution of the row variable for instances in which the value of the column variable is missing.

The format of contingencies returned depends on the variable type:

  • for discrete variables, it is a numpy array in which element (i, j) contains the count of rows with the i-th value of the row variable and the j-th value of the column variable.

  • for continuous variables, the contingency is a list of two arrays, where the first array contains the ordered distinct values of the column variable and element (i, j) of the second array contains the count of rows with the i-th value of the row variable and the j-th of the ordered values of the column variable.

Parameters:
  • col_vars (list of ints, variable names or descriptors of type Orange.data.Variable) -- variables whose values will correspond to columns of contingency matrices

  • row_var (int, variable name or Orange.data.DiscreteVariable) -- a discrete variable whose values will correspond to the rows of contingency matrices
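For a discrete column variable, the pair described above can be sketched with nested lists in place of numpy arrays. The sketch assumes that both variables are stored as value indices with NaN for unknowns, and that rows with an unknown row variable are skipped; discrete_contingency is a hypothetical helper.

```python
import math


def discrete_contingency(col_values, row_values, n_col, n_row):
    """Element [i][j] counts rows with the i-th row value and j-th column value."""
    table = [[0] * n_col for _ in range(n_row)]
    # Distribution of the row variable where the column value is missing.
    missing = [0] * n_row
    for c, r in zip(col_values, row_values):
        if math.isnan(r):
            continue  # assumption: unknown row variable, row is skipped
        if math.isnan(c):
            missing[int(r)] += 1
        else:
            table[int(r)][int(c)] += 1
    return table, missing
```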