
README

Author: Mostafa BAMHA <Mostafa.Bamha (at) univ-orleans.fr>

Table of Contents

1 Copyright
2 Overview
3 Quick Start
4 Directory Structure

1 Copyright

Copyright 2017-2019 University of Orléans

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

2 Overview

This guide describes how to use the source code developed under the GIRAFON project, funded by the Centre-Val de Loire region, for the study in:

MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets.

Transactions on Large-Scale Data- and Knowledge-Centered Systems: Special Issue on Database and Expert-Systems Applications, Vol. 39, Springer Berlin / Heidelberg. 2018.

The MapFIM+ source code archive is available at the address: http://www.univ-orleans.fr/lifo/software/MapFIM/MapFIM.tar

3 Quick Start

The only requirement for running the code is a Hadoop cluster. It does not have to be a full-fledged cluster; a single-node pseudo-distributed installation of Hadoop is enough. For more details about starting a Hadoop cluster, please see http://hadoop.apache.org/docs/current/index.html
The code works with Hadoop version 2.7.2 or higher.
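You can check the version of the Hadoop installation, for example with:

 $ hadoop version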

3.1 Build

For compilation, you need Maven version 3.0 or higher.
$ cd mapfim-hadoop
mapfim-hadoop$ mvn clean package
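If the build succeeds, the MapFIM+ jar is produced under the target directory; you can verify it, for example with:

 mapfim-hadoop$ ls target/mapfim-0.0.1-hadoop.jar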

3.2 Frequent Itemset Mining using MapFIM+

Here are the steps to perform Frequent Itemset Mining with the MapFIM+ algorithm, using Hadoop versions 2.XX.XX (or higher), on a small sample of the retail dataset.
The parameters to provide include the input data, the output directory, the support threshold, the maximum memory (MaxMemory) available to store/mine each prefix-projected database, the number of Hadoop reducers, etc.

The complete list of MapFIM+'s parameters is given in the following command:

 hadoop jar target/mapfim-0.0.1-hadoop.jar girafon.MapFimV2.App <input> <output> <support> <maxmemory> <#maxDatabases> <#numberOfreducers> <ProgramFileName> <#TreeEntries> 
All parameters should be chosen carefully to fit the mining problem and the capacities of the processing nodes. Choose maxMemory depending on the input size: 20 Mbytes, 50 Mbytes, or 4000 Mbytes are in general good values.
For small input data sizes and high values of the maxMemory parameter, MapFIM+ will use the local mining program directly, since the input data fits in a processing node's memory and no prefix-projected databases are generated.
For performance, MaxMemory should be chosen as high as possible, depending on the processing nodes' capacities.
For example, using the following command, MapFIM+ will generate the Frequent Itemsets for a support of 100, while using prefix-projected databases of at most 2 Mbytes for local mining with the Eclat program stored in "./bin/eclat".
 hadoop jar target/mapfim-0.0.1-hadoop.jar girafon.MapFimV2.App mapfim_input/retail.dat mapfim_output 100 2 10000 1 ./bin/eclat 1000000
This execution generates the 6451 Frequent Itemsets present in the input file. The Frequent Itemsets are stored in the HDFS folder "mapfim_output". In this execution, only prefix-projected databases of at most 2 Mbytes are used.
----------------------------SUMMARY-------------------------
Input            : mapfim_input/retail.dat
Output           : mapfim_output
Support          : 100
Max Memory Allow : 2 Mbytes for each prefix-projected database
Max Data Allow   : 1164288 -- This parameter is generated depending on the values of MaxMemoryAllow and gamma parameters
#Databases       : 10000
#Reducer         : 1
Eclat Folder     : ./bin/eclat -- The program used to mine prefix-projected Databases 
max Tree size    : 1000000
Number of FIMs   : 6451
------------------------------------------------------------
Important: The output folder <mapfim_output> is created during the execution of the MapFIM+ program. Therefore, make sure that the output folder does not exist before starting the program, or use another output folder.
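If a previous run left this folder behind on HDFS, you can remove it before re-running; for example (assuming the output folder is named "mapfim_output"):

 mapfim-hadoop$ hadoop fs -rm -r mapfim_output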

3.2.1 Upload raw data to HDFS

 mapfim-hadoop$ hadoop fs -mkdir  -p mapfim_input 
    
 mapfim-hadoop$ hadoop fs -put ./data/retail.dat  mapfim_input/
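You can then check that the dataset was uploaded correctly, for example with:

 mapfim-hadoop$ hadoop fs -ls mapfim_input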

The file "./data/retail.dat" contains one record per line. Each line corresponds to the list of item IDs of a given transaction (for simplicity, the item IDs of each transaction are given in sorted order). Item IDs are separated by " ".
In the following dataset example, the first transaction "12 155 168" contains three items: item "12", item "155" and item "168".

12 155 168  
38 39 41 110 569 16430 
32 41 48 78 544 785 2631 12929 16430 16431 
38 39 41 48 387 716 840 1031 2696 5314 12921 
39 156 1009 1479 1481 5962 6300 7158 15570 16430 16431 
39 48 297 425 1444 1531 1677 2090 12929 16430 
39 3642 4811 16430 16431 

3.2.2 Benchmark and datasets repositories

A set of benchmark datasets for Frequent Itemset Mining is available at the repository "http://fimi.ua.ac.be/data/".
A local copy of a small retail dataset is stored in the "data" directory. To test other datasets (Webdocs, ...), you need to download them from the address "http://fimi.ua.ac.be/data/".
 mapfim-hadoop$ hadoop jar target/mapfim-0.0.1-hadoop.jar girafon.MapFimV2.App mapfim_input/retail.dat mapfim_output2 100 20 10000 2 ./bin/eclat 1000000

This job runs the MapFIM+ program on the "retail.dat" dataset using a support of 100 and MaxMemory fixed to 20 Mbytes. The output result is stored in the HDFS directory "mapfim_output2".
Notice that the number of Frequent Itemsets, 6451, is the same as in the previous execution with MaxMemory fixed to 2 Mbytes. The main difference between the two executions is that, with 20 Mbytes for MaxMemory, the input data fits in the available memory, so MapFIM+ automatically switches to mining the "retail.dat" dataset with the program "./bin/eclat", without using prefix-projected databases.

----------------------------SUMMARY-------------------------
Input            : mapfim_input/retail.dat
Output           : mapfim_output2
Support          : 100
Max Memory Allow : 20 Mbytes for each prefix-projected database
Max Data Allow   : 1164288 -- This parameter is generated depending on the values of MaxMemoryAllow and gamma parameters
#Databases       : 10000
#Reducer         : 2
Eclat Folder     : ./bin/eclat -- The program used to mine prefix-projected Databases 
max Tree size    : 1000000
Number of FIMs   : 6451
------------------------------------------------------------
You can run MapFIM+ on the retail.dat dataset automatically using the script "runRetail.sh": this script builds the jar file, uploads the dataset to HDFS, and finally runs MapFIM+ with fixed values for the input parameters.
 mapfim-hadoop$ ./runRetail.sh
To test MapFIM+ on other datasets, you need to download a dataset from the repository "http://fimi.ua.ac.be/data/" and then adapt one of the scripts "_runWebdocs.sh.example" and "_runSynthetic.sh.example" (see the example commands below).
For efficiency, the number of reducers should be set according to the cluster and input data sizes.
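As a sketch, downloading the Webdocs dataset and uploading it to HDFS could look like the following (the exact archive name "webdocs.dat.gz" is an assumption, not confirmed in this README; check the repository listing first):

 mapfim-hadoop$ wget http://fimi.ua.ac.be/data/webdocs.dat.gz
 mapfim-hadoop$ gunzip webdocs.dat.gz
 mapfim-hadoop$ hadoop fs -put ./webdocs.dat mapfim_input/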

3.2.3 Frequent Itemsets result using MapFIM+

To show the MapFIM+ output result stored in the HDFS folder "mapfim_output", use the following command:
 mapfim-hadoop$ hdfs dfs -cat mapfim_output/*/part*
For a given output folder "mapfim_output", the HDFS sub-folder "mapfim_output/1" contains the Frequent Itemsets of size 1, the sub-folder "mapfim_output/2" contains the Frequent Itemsets of size 2, and so on; in general, the folder "mapfim_output/XXXXX" contains the Frequent Itemsets of size "XXXXX".
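Assuming each Frequent Itemset is written on its own output line, you can list the sub-folders and count the total number of Frequent Itemsets (6451 in the retail example above), for example with:

 mapfim-hadoop$ hdfs dfs -ls mapfim_output
 mapfim-hadoop$ hdfs dfs -cat mapfim_output/*/part* | wc -l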

A local copy of the output result is also stored on the local file system, in the directory called "result__input" (see the mapfim-hadoop directory structure below).

4 Directory Structure

drwxr-xr-x 1   4096 Jan. 30 10:48 betaFIMs__input
drwxr-xr-x 1   4096 Jan. 29 14:31 bin
-rw-r--r-- 1    998 Jan. 29 12:01 .classpath
drwxr-xr-x 1   4096 Jan. 30 13:58 data
-rw-r--r-- 1      8 Jan. 29 19:07 .gitignore
-rw-r--r-- 1   1174 Jan. 29 14:28 pom.xml
-rw-r--r-- 1    542 Jan. 29 12:02 .project
-rw------- 1  14521 Feb.  1 17:01 Quick_start.html
-rw------- 1  14521 Feb.  1 17:01 README.html
drwxr-xr-x 1   4096 Jan. 30 12:21 result__input
-rw-rw-rw- 1   1045 Jan. 30 12:09 _runDelicious.sh.example
-rwxr-xr-x 1   1019 Feb.  1 15:23 runRetail.sh
-rw-rw-rw- 1   1188 Jan. 30 12:11 _runSynthetic.sh.example
-rw-rw-rw- 1   1104 Jan. 30 12:18 _runWebdocs.sh.example
drwxr-xr-x 1   4096 Jan. 29 12:01 .settings
drwxr-xr-x 1   4096 Jan. 29 12:01 src
drwxr-xr-x 1   4096 Feb.  1 16:54 target

Date: 2019-01-31 09:58:14 PDT