
5  Data input format

5.1  A Foreword on Additional Attributes

Besides projection data (e.g. features or distance matrices), input files may contain additional attributes, holding for instance the name of an associated picture file, the class of the object (if applicable), and so on. An input file may contain zero, one or several such attributes.

5.2  Multisource mode

If the user has several data source files that concern a single set of objects (each file containing different features), several of these files can be loaded in a single project and projected simultaneously.

This mode is available depending on the Explorer3D options (see section 7.7). If the option is not activated, each loading of a data source file reinitializes Explorer3D and thus discards the current projection. Otherwise, the user is asked whether the new file contains a new set of features for the currently observed objects or a new set of objects, in which case Explorer3D is reinitialized.

In multisource mode, each data source file can use any of the supported input formats (features, distance matrices, etc.). Objects must appear in the same order in each of the files, unless the file is a subset file.

5.3  Subset files

Several kinds of input file can contain data for only a subset of the objects. This mainly makes sense in multisource mode, where some input may be available for only a subset of the objects, but where the user wishes to compare the resulting projection with a former one based on another input data file.

A subset file is denoted by the presence of the SUBSET keyword (see the various file formats). The global rank of each object must then be given. In the current release, the user is supposed to load at least one complete set before loading a subset.

Subset files can contain additional data columns. The value of these columns will be set to “UNDEFINED” for the missing objects.
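
The filling rule for additional columns can be pictured with a short Python sketch. It is only an illustration, not Explorer3D code: the function name `expand_subset_column` and its parameters are hypothetical, and global ranks are assumed to start at 0.

```python
def expand_subset_column(n_objects, global_ranks, values, missing="UNDEFINED"):
    """Spread a subset column over the whole object set.

    Hypothetical helper: objects absent from the subset get the `missing` value.
    """
    column = [missing] * n_objects
    for rank, value in zip(global_ranks, values):
        column[rank] = value
    return column


# A subset giving a class for objects 0 and 3 out of 5 objects.
print(expand_subset_column(5, [0, 3], ["setosa", "virginica"]))
# ['setosa', 'UNDEFINED', 'UNDEFINED', 'virginica', 'UNDEFINED']
```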

5.4  Features input file

Input files commonly consist of a set of features, i.e. a table where each object is described according to a set of features. A 3D view is then computed by projecting the objects into a low-dimensional space that reflects the original high-dimensional space formed by the original features.

5.4.1  General format

Explorer3D accepts text files, structured as follows:

[SUBSET [START WITH x]]
number of objects
number of original features (excluding additional features)
names of original features followed by names of additional features
Description of objects (list of original and additional features, one object per line)

Feature names and values are separated by a delimiter, which is by default the space character. This delimiter can be explicitly defined and changed in case the space character cannot be used (see section 5.4.2).

Here is a sample file (first lines of the standard “iris” data set):

150
4
A B C D class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
...

In this sample, 4 original attributes are given, named “A”, “B”, “C” and “D”. They are followed by an additional attribute, “class”. In this sample, the original attributes consist of numerical values (reals).

Remark: the user should be careful not to leave trailing spaces at the end of lines, especially on the first two lines. Such trailing spaces might be misinterpreted by Explorer3D.
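
To make the layout concrete, here is a minimal Python sketch of a reader for such files. It is not Explorer3D's own loader: the function name `parse_features` and the returned dictionary are hypothetical, and the sketch assumes a space-delimited file with no SUBSET line, no user-defined delimiter and only numerical original features.

```python
def parse_features(text):
    """Parse a space-delimited feature file (illustrative sketch only)."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_objects = int(lines[0])   # line 1: number of objects
    n_features = int(lines[1])  # line 2: number of original features
    names = lines[2].split(" ")
    rows = [line.split(" ") for line in lines[3:3 + n_objects]]
    return {
        "n_objects": n_objects,
        "feature_names": names[:n_features],
        "additional_names": names[n_features:],
        # Assumption: every original feature is a real value.
        "features": [[float(v) for v in row[:n_features]] for row in rows],
        "additional": [row[n_features:] for row in rows],
    }


sample = """3
4
A B C D class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa"""

data = parse_features(sample)
print(data["feature_names"], data["additional_names"])
```

The sample above declares 3 objects because only three description lines are included.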

5.4.2  User-defined delimiter

If the space character cannot be used as a delimiter, an additional line can be added near the beginning of the file (on its 3rd line, thus after the number of attributes and before the names of attributes) to declare the delimiter used. For instance, if “|” is the delimiter, the file starts like this:

150
4
|
A|B|C|D|class
5.1|3.5|1.4|0.2|Iris-setosa
4.9|3.0|1.4|0.2|Iris-setosa
4.7|3.2|1.3|0.2|Iris-setosa
...
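
For users who generate their files with a script, the following Python sketch shows one way to produce this layout. The function name `format_features` and its parameters are hypothetical; the sketch simply follows the structure described above and declares the delimiter on the 3rd line only when it is not the space character.

```python
def format_features(feature_names, additional_names, rows, delimiter=" "):
    """Return the text of a feature file (illustrative sketch only).

    Each row holds the original feature values followed by the additional ones.
    """
    lines = [str(len(rows)), str(len(feature_names))]
    if delimiter != " ":
        # Non-default delimiter: declare it on its own line, before the names.
        lines.append(delimiter)
    lines.append(delimiter.join(feature_names + additional_names))
    for row in rows:
        lines.append(delimiter.join(str(value) for value in row))
    # Joining with newlines leaves no trailing space at the end of any line.
    return "\n".join(lines) + "\n"


text = format_features(
    ["A", "B", "C", "D"],
    ["class"],
    [[5.1, 3.5, 1.4, 0.2, "Iris-setosa"], [4.9, 3.0, 1.4, 0.2, "Iris-setosa"]],
    delimiter="|",
)
print(text)
```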

5.4.3  Subset data file

If “SUBSET” is set on the first line, the file is supposed to contain only a subset of the global set of objects. An additional numeric value must then be added at the beginning of each object-description line, corresponding to the rank of the object in the whole set, starting at 0. If the ranks given do not start at 0, the user must use the optional parameter “START WITH”, followed by the offset. For instance, if the user's own object numbering starts at 1, the first line reads “SUBSET START WITH 1”.

For instance:

SUBSET START WITH 1
3
4
A B C D classe
5 5.1 3.5 1.4 0.2 Iris-setosa
20 4.9 3.0 1.4 0.2 Iris-setosa
110 4.7 3.2 1.3 0.2 Iris-setosa

The user only gives 3 objects, which she has numbered 5, 20 and 110 in her own numbering system that starts at 1. These objects are thus numbered 4, 19 and 109 in Explorer3D.
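
The rank conversion amounts to subtracting the offset. The Python sketch below illustrates it; the helper names `parse_subset_header` and `to_global_rank` are hypothetical and only mirror the rule stated above.

```python
def parse_subset_header(line):
    """Return the offset declared by a 'SUBSET [START WITH x]' line (sketch only)."""
    tokens = line.split()
    if not tokens or tokens[0] != "SUBSET":
        raise ValueError("not a subset header")
    if len(tokens) == 1:
        return 0  # no START WITH: ranks already start at 0
    if len(tokens) == 4 and tokens[1:3] == ["START", "WITH"]:
        return int(tokens[3])
    raise ValueError("malformed subset header")


def to_global_rank(user_rank, offset):
    """Convert a rank in the user's numbering to a 0-based global rank."""
    return user_rank - offset


offset = parse_subset_header("SUBSET START WITH 1")
print([to_global_rank(rank, offset) for rank in (5, 20, 110)])
# [4, 19, 109]
```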

5.4.4  How symbolic attributes are managed

Original attributes are by default supposed to be numerical (real). Symbolic values can also be handled. Symbolic means that the values do not belong to a continuous domain. Thus, both integers and strings may be considered as symbolic attributes (integers are by default considered as numerical values unless they are explicitly declared as symbols).

Symbolic attributes are denoted by a “.S” suffix in their attribute name. Let us observe the following example:

151
5
R1.S R2.S R3.S R4.S R5 Clas
A Green YES + 19 TRUE
B Red NO - 17 TRUE
C Blue NO - 49 FALSE
...

The first four attributes are symbolic.

The name of a numerical attribute can be suffixed by “.N”, but this is optional.

In the current implementation of Explorer3D, symbolic attributes are pre-processed in order to binarize them. A first scan of the column lists the existing values of the attribute. The single attribute is then replaced by a list of attributes (taking value 0 or 1), each one corresponding to a value of the symbolic one. For a given object, all of these attributes are set to 0, except the one corresponding to the original symbolic value. These generated attributes are named after the original attribute, suffixed by “$rank”, where rank is the rank of the symbolic value in the list of existing ones.
For instance, in our example data set, R1 is replaced with one attribute per distinct value, “R1$1”, “R1$2”, and so on; the values of “R1$1” and “R1$2” are respectively 1 and 0 for the first object, and 0 and 1 for the second object.
This decomposition is fully transparent to the user.
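
The binarization described above can be sketched in a few lines of Python. This is an illustration, not the Explorer3D implementation: the function name `binarize` is hypothetical, and the sketch assumes that ranks start at 1 and follow the order in which values first appear in the column.

```python
def binarize(name, values):
    """Replace one symbolic column by 0/1 columns named '<name>$<rank>' (sketch only)."""
    base = name[:-2] if name.endswith(".S") else name
    distinct = []
    for value in values:  # first scan: list the existing values
        if value not in distinct:
            distinct.append(value)
    return {
        # Assumption: ranks start at 1, in order of first appearance.
        f"{base}${rank}": [1 if value == symbol else 0 for value in values]
        for rank, symbol in enumerate(distinct, start=1)
    }


print(binarize("R1.S", ["A", "B", "C"]))
# {'R1$1': [1, 0, 0], 'R1$2': [0, 1, 0], 'R1$3': [0, 0, 1]}
```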

5.5  Distance files

Distance files contain distance matrices, i.e. matrices whose elements correspond to the distance between pairs of objects. The 3D view is then computed so that the distances in the 3D space correspond as closely as possible to the distances in the matrix. The internal format of the file is as follows:

number of objects [COMPLETE]
distances between objects (one object per line)
[Number or list of names of additional attributes
Values of additional attributes]

The square brackets denote optional parts.

By default, only the upper triangular part of the distance matrix is given: if we have 3 objects a, b and c, the first line will contain distance(a,b) and distance(a,c), and the second line distance(b,c).

The optional keyword “COMPLETE” may appear on the first line, which means that the full square matrix is given. In our example, the first line will contain distance(a,a), distance(a,b) and distance(a,c), the second one distance(b,a), distance(b,b) and distance(b,c), etc.
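
The relation between the two layouts can be illustrated by the Python sketch below, which rebuilds a full square matrix from upper triangular rows. The function name `expand_upper_triangular` is hypothetical, and the sketch relies on assumptions that are not stated by the format itself: the matrix is symmetric, the diagonal is 0, and the last object has no row of its own.

```python
def expand_upper_triangular(rows):
    """Build the full square matrix from upper triangular rows (sketch only).

    rows[i] holds the distances from object i to objects i+1, i+2, ...
    """
    n = len(rows) + 1
    matrix = [[0.0] * n for _ in range(n)]  # assumption: zero diagonal
    for i, row in enumerate(rows):
        if len(row) != n - 1 - i:
            raise ValueError(f"row {i} should contain {n - 1 - i} distances")
        for k, distance in enumerate(row):
            j = i + 1 + k
            matrix[i][j] = distance
            matrix[j][i] = distance  # assumption: symmetric distances
    return matrix


# Three objects a, b, c: first row is d(a,b) d(a,c), second row is d(b,c).
for line in expand_upper_triangular([[1.0, 2.0], [3.0]]):
    print(line)
```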

If additional attributes are given and only their number is specified, their names are automatically generated as follows: “Att1”, “Att2”, etc.

Here is a sample file (very first lines of an inter-illumination distance matrix):

166 
... 
1.0694574 1.1302139 1.0019832 1.0004523 ...
1.0656028 0.96607274 1.1858556 ...
...
image class 
ms0001_1.jpg ms0001-Mazarine-Fr-SW-Début-12eme 
ms0001_2.jpg ms0001-Mazarine-Fr-SW-Début-12eme 
...

This data set consists of 166 objects. For each of them we have two additional attributes: the name of an associated image file, and the name of the document the illumination comes from.
The first object belongs to document “ms0001 - Mazarine - Fr - SW - Début 12eme”, and the user can find a picture of the illumination in the file “ms0001_1.jpg”. The second illumination comes from the same document and can be viewed in file “ms0001_2.jpg”. The distance between the first and the second object is 1.0694574; the distance between the first and the third object is 1.1302139; etc.

5.6  Raw 3D coordinates file

This kind of file does not start with the number of objects it contains (it is computed automatically). This might be reconsidered in some future release of Explorer3D.

Optionally, the file may start with a line that contains the names of the 3 attributes.

Each remaining line contains the 3 coordinates (real values) of an object. These values might be followed by additional attributes.

Here is an example file with three objects:

0.5 0.5 0.5
-0.5 -0.5 -0.5
0 0 -2
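
As an illustration, a file of this kind could be read with the Python sketch below. The function name `parse_raw_3d` is hypothetical, the file is assumed to be space-delimited, and the way the optional names line is detected (first token not being a number) is an assumption of this sketch, not a documented rule.

```python
def _is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_raw_3d(text):
    """Read a raw 3D coordinates file without SUBSET line (sketch only)."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    names = None
    # Assumption: a first line whose first token is not a number holds the names.
    if lines and not _is_number(lines[0][0]):
        names = lines.pop(0)
    coords = [tuple(float(v) for v in tokens[:3]) for tokens in lines]
    extras = [tokens[3:] for tokens in lines]  # additional attributes, if any
    return names, coords, extras


names, coords, extras = parse_raw_3d("0.5 0.5 0.5\n-0.5 -0.5 -0.5\n0 0 -2\n")
print(len(coords), coords)  # the number of objects is simply the number of lines
```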

5.6.1  Subset data file

Optionally, the keyword “SUBSET” can be added on the first line of the file. Each remaining line then consists of four values, the first one being the global rank of the object. As with feature input files, SUBSET may be followed by “START WITH” to specify an offset. For instance:

SUBSET START WITH 1
4 0.5 0.5 0.5
2 -0.5 -0.5 -0.5
1 0 0 -2

This file contains the coordinates of objects number 4, 2 and 1, with an offset of 1; that is to say, the global ranks of these objects are 3, 1 and 0.

5.7  Pure additional attributes files

The user can load files that consist of additional attributes with no projection data. Such files are structured like feature input files, with the number of projection features set to 0. Such files only make sense in multisource mode.

5.8  Reserved attribute names (additional attributes)

Some reserved attribute names can be used. They are linked to pre-defined visual behaviours of Explorer3D (i.e. automatic object coloring, direct link to picture files) that would otherwise have to be set by hand (see section 7.2.1). Two names are currently reserved:

5.9  Data import

If the internal format of the data file does not match the Explorer3D format, a data import tool can be used. The tool is loaded automatically when the user tries to load a misformatted file. It can also be launched manually from the “Files / Data import tool” menu. This wizard allows the user to reorganize rows and columns, and guides the user step by step. Note that only text feature input files are currently supported (no binary files, and neither distance matrices nor raw 3D coordinates). This nevertheless covers most needs. Do not hesitate to contact the authors if additional import features are needed.

To import spreadsheet data, first save your file in CSV (i.e. text) format, and then open this file in the Explorer3D import tool.

