
FP-Growth Frequent Itemset Mining in Java


FP-Growth is an algorithm for mining frequent itemsets. Its typical use case is discovering correlations between items: support measures how strong a correlation is, and by setting a minimum support you can filter out weakly correlated items. Compared with Apriori, which has to scan the data many times and is therefore heavily I/O-bound, FP-Growth scans the dataset only twice, which makes it considerably more efficient. The figure below is taken from the original paper:

Table 1 from the original paper (image not reproduced here)

The left column is the original dataset, i.e. the raw transactions. First, scan it and count how often each element occurs, then sort the elements by descending count to obtain the middle table. With minSupport = 3, every element whose count is below minSupport is removed. The original transactions are then scanned again and each row is rebuilt: the infrequent elements are dropped and the remaining elements are reordered within each row by descending total count, giving the rightmost column. For example, in the first row on the left, f occurs only 2 < 3 times, so f is removed and the row becomes [d, a]. These two passes are the only times the data is scanned.
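
To make the two scans concrete, here is a small self-contained sketch. The transactions and item names are made up for illustration (they are not the exact table from the paper): the first pass counts item frequencies, the second pass rewrites each transaction with infrequent items removed and the survivors in descending global frequency order.

    import java.util.*;

    public class TransactionFilterDemo {
        public static void main(String[] args) {
            // Hypothetical raw transactions, for illustration only.
            List<List<String>> raw = Arrays.asList(
                    Arrays.asList("f", "d", "a"),
                    Arrays.asList("d", "a", "c"),
                    Arrays.asList("a", "d", "e"),
                    Arrays.asList("c", "d"));
            int minSupport = 3;

            // First scan: count how many transactions contain each item.
            Map<String, Integer> counts = new HashMap<>();
            for (List<String> t : raw) {
                for (String item : new HashSet<>(t)) {
                    counts.merge(item, 1, Integer::sum);
                }
            }

            // Keep only frequent items, ordered by descending count.
            List<String> order = new ArrayList<>();
            counts.forEach((item, c) -> { if (c >= minSupport) order.add(item); });
            order.sort((a, b) -> counts.get(b) - counts.get(a));

            // Second scan: rewrite each transaction with infrequent items removed
            // and the survivors listed in global frequency order.
            for (List<String> t : raw) {
                List<String> rewritten = new ArrayList<>();
                for (String item : order) {
                    if (t.contains(item)) {
                        rewritten.add(item);
                    }
                }
                System.out.println(t + " -> " + rewritten); // e.g. [f, d, a] -> [d, a]
            }
        }
    }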

Setting minSupport (named minStep in the code):

    private int minStep = 1;

    public int getMinStep() {
        return minStep;
    }

    public void setMinStep(int minStep) {
        this.minStep = minStep;
    }

The next task is to build an FP-tree, which requires a node structure. Since this is a Java implementation, I define an FPTreeNode class:

    private class FPTreeNode {
        int count = 0;                      // occurrence count
        int blockAddress = 0;               // the item itself; I mine relationships between block addresses, hence the name
        FPTreeNode parent = null;           // link to the parent node
        FPTreeNode nextSimilarNode = null;  // link to the next node holding the same item
        List<FPTreeNode> childSet = new ArrayList<>();

        public FPTreeNode(int blockAddress) {
            this.blockAddress = blockAddress;
            count = 1;
        }
    }
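
To make the two pointer fields concrete, here is a rough sketch of a helper that could live in the same class. It is not part of the original code; printPrefixPaths is just an illustration of how the fields are used later during mining: the nextSimilarNode chain enumerates every tree node holding one item, and the parent links recover that node's prefix path.

    // Rough sketch (assumes it sits in the same class as FPTreeNode): given the
    // header-table entry for one item, walk its nextSimilarNode chain and climb
    // the parent pointers to print every prefix path, as FPGrowth does later.
    private void printPrefixPaths(FPTreeNode header) {
        for (FPTreeNode n = header.nextSimilarNode; n != null; n = n.nextSimilarNode) {
            List<Integer> prefix = new ArrayList<>();
            // The root node is created with blockAddress 0, so stop there.
            for (FPTreeNode p = n.parent; p != null && p.blockAddress != 0; p = p.parent) {
                prefix.add(p.blockAddress);
            }
            System.out.println(n.blockAddress + " x" + n.count + " with prefix " + prefix);
        }
    }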

The resulting tree is shown in the figure below (blue lines: nextSimilarNode, red lines: parent).

Building the FP-tree above takes three main steps (mining follows as step 4):

1. First, traverse the dataset and build the transaction lists from which each element's total occurrence count is tallied.

    // First scan: group the trace records into time-window transactions of blockAddresses.
    // listsCount is assumed to be a class-level field filled alongside the returned lists:
    // private List<List<Integer>> listsCount = new ArrayList<>();
    public List<List<Integer>> init(File file) {
        List<List<Integer>> lists = new ArrayList<>();
        try {
            BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
            String string = "";
            double start = -1.0;
            List<Integer> list = new ArrayList<>();
            List<Integer> listCount = new ArrayList<>();
            double timePeroid = 180.0; // window length in seconds (180 s for the example below)
            while ((string = bufferedReader.readLine()) != null) {
                // each line: time period, start block, access count
                String[] s = string.split(",");
                if (start == -1.0) {
                    start = Double.valueOf(s[0]) + timePeroid;
                }
                if (start > Double.valueOf(s[0])) {
                    if (!list.contains(Integer.valueOf(s[1]))) {
                        list.add(Integer.valueOf(s[1]));
                        listCount.add(Integer.valueOf(s[2]));
                    }
                } else {
                    lists.add(new ArrayList<>(list));
                    listsCount.add(new ArrayList<>(listCount));
                    start = Double.valueOf(s[0]) + timePeroid;
                    list.clear();
                    listCount.clear();
                    list.add(Integer.valueOf(s[1]));
                    listCount.add(Integer.valueOf(s[2]));
                }
            }
            if (list.size() > 0) {
                lists.add(new ArrayList<>(list));
                listsCount.add(new ArrayList<>(listCount));
            }
            bufferedReader.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lists;
    }

2. Filter by minSupport to build the header table, sorted in descending order of occurrence count.

    public List<FPTreeNode> buildTable(List<List<Integer>> lists) {
        // Header table of frequent items; entries with count below minSupport are dropped.
        List<FPTreeNode> trees = new ArrayList<FPTreeNode>();
        if (lists.size() == 0) {
            return null;
        }
        HashMap<Integer, FPTreeNode> hashMap = new HashMap<>();
        for (int i = 0; i < lists.size(); i++) {
            List<Integer> tmp = lists.get(i);
            for (int j = 0; j < tmp.size(); j++) {
                int val = tmp.get(j);
                if (hashMap.containsKey(val)) {
                    hashMap.get(val).count++;
                } else {
                    hashMap.put(val, new FPTreeNode(val));
                }
            }
        }
        Iterator<Map.Entry<Integer, FPTreeNode>> iterator = hashMap.entrySet().iterator();
        while (iterator.hasNext()) {
            Map.Entry<Integer, FPTreeNode> entry = iterator.next();
            if (entry.getValue().count >= minStep) {
                trees.add(entry.getValue());
            }
        }
        // Sort the frequent items in descending order of count.
        Collections.sort(trees, new Comparator<FPTreeNode>() {
            @Override
            public int compare(FPTreeNode o1, FPTreeNode o2) {
                if (o1.count < o2.count) {
                    return 1;
                } else {
                    if (o1.count > o2.count) {
                        return -1;
                    }
                }
                return 0;
            }
        });
        return trees;
    }

3. Build the FP-tree recursively, working bottom-up. The flow is illustrated in the figure below (red arrows mark the processing order, and the blue "shaded" FP-trees are the projections that get created).

    /**
     * Returns the items of list that are frequent, ordered by descending frequency.
     * @param list one raw transaction
     * @param trees the header table, already sorted by descending count
     * @return the filtered, reordered transaction
     */
    public List<Integer> sortbyTrees(List<Integer> list, List<FPTreeNode> trees) {
        List<Integer> tmp = new ArrayList<>();
        for (int i = 0; i < trees.size(); i++) {
            int block = trees.get(i).blockAddress;
            if (list.contains(block)) {
                tmp.add(block);
            }
        }
        return tmp;
    }

    public FPTreeNode findChild(FPTreeNode root, int node) {
        for (FPTreeNode treeNode : root.childSet) {
            if (treeNode.blockAddress == node) {
                return treeNode;
            }
        }
        return null;
    }

    public void addNode(FPTreeNode node, List<Integer> list, List<FPTreeNode> trees) {
        if (list.size() > 0) {
            int val = list.remove(0);
            FPTreeNode node1 = new FPTreeNode(val);
            node1.parent = node;
            node.childSet.add(node1);
            // Append the new node to the end of its item's nextSimilarNode chain,
            // starting from the header-table entry.
            for (FPTreeNode treeNode : trees) {
                if (treeNode.blockAddress == val) {
                    while (treeNode.nextSimilarNode != null) {
                        treeNode = treeNode.nextSimilarNode;
                    }
                    treeNode.nextSimilarNode = node1;
                    break;
                }
            }
            addNode(node1, list, trees);
        }
    }

    public FPTreeNode buildFPTree(List<List<Integer>> lists, List<FPTreeNode> trees) {
        FPTreeNode root = new FPTreeNode(0);
        for (int i = 0; i < lists.size(); i++) {
            // Reduce the transaction to its frequent items in descending frequency order.
            List<Integer> tmp = sortbyTrees(lists.get(i), trees);
            FPTreeNode subRoot = root;
            FPTreeNode tmpRoot = root;
            if (root.childSet.size() > 0) {
                // Follow the existing path as far as it matches, incrementing counts.
                while (tmp.size() > 0 && (tmpRoot = findChild(subRoot, tmp.get(0))) != null) {
                    tmpRoot.count++;
                    subRoot = tmpRoot;
                    tmp.remove(0);
                }
            }
            // Attach the remaining items as a new branch.
            addNode(subRoot, tmp, trees);
        }
        return root;
    }

4. Mine the frequent itemsets with the FP-Growth algorithm, working from the bottom of the header table upwards.

    public void FPGrowth(List<List<Integer>> transRecords, List<Integer> postPattern) {
        // Build the header table; its entries are also the frequent 1-itemsets.
        List<FPTreeNode> trees = buildTable(transRecords);
        // Build the FP-tree.
        FPTreeNode root1 = buildFPTree(transRecords, trees);
        if (root1.childSet.size() == 0) {
            return;
        }
        // Output the frequent patterns found at this level of recursion.
        if (postPattern.size() > 0) {
            for (FPTreeNode node : trees) {
                System.out.print(node.count + ":" + node.blockAddress);
                for (int val : postPattern) {
                    System.out.print(" " + val);
                }
                System.out.println();
            }
        }
        for (int i = trees.size() - 1; i >= 0; i--) {
            FPTreeNode node = trees.get(i);
            List<Integer> tmp = new ArrayList<>();
            tmp.add(node.blockAddress);
            if (postPattern.size() > 0) {
                tmp.addAll(postPattern);
            }
            // Collect the conditional pattern base of this header item into records.
            List<List<Integer>> records = new ArrayList<>();
            FPTreeNode nextNode = node.nextSimilarNode;
            while (nextNode != null) {
                int cnt = nextNode.count;
                List<Integer> prenodes = new ArrayList<Integer>();
                FPTreeNode parent = nextNode;
                while ((parent = parent.parent) != null && parent.blockAddress != 0) {
                    prenodes.add(parent.blockAddress);
                }
                // Each prefix path is repeated once per occurrence of the node.
                while (cnt > 0) {
                    cnt--;
                    records.add(prenodes);
                }
                nextNode = nextNode.nextSimilarNode;
            }
            // Recurse on the conditional pattern base with the extended postfix pattern.
            FPGrowth(records, tmp);
        }
    }
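
For reference, a minimal driver might look like the sketch below. The wrapper class name FPGrowthMiner and the trace file name are placeholders of my own; init, setMinStep, and FPGrowth are the methods shown above, and mining starts with an empty postfix pattern.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class Demo {
        public static void main(String[] args) {
            // FPGrowthMiner is a placeholder name for the class holding the methods above.
            FPGrowthMiner miner = new FPGrowthMiner();
            miner.setMinStep(1); // minimum support; 1 matches the small example below
            // Group the trace file (placeholder name) into time-window transactions.
            List<List<Integer>> transactions = miner.init(new File("blocktrace.csv"));
            // Mine frequent itemsets, starting with an empty postfix pattern.
            miner.FPGrowth(transactions, new ArrayList<Integer>());
        }
    }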

I have tested this myself and it works. The dataset format is as follows (the goal is to find correlations among the values in the second column; the first column is a timestamp in seconds, and all second-column values falling within each 180 s window form one transaction):

    60,2104409088,1486
    120,2104409088,667
    120,2104410112,783
    180,2104410112,1467
    240,2104410112,1152
    240,2104411136,301
    300,2104411136,1447
    360,2104411136,1429
    420,2104411136,225
    420,2104412160,1209
    480,2104412160,1470
    540,2104412160,722
    540,2104413184,715
    600,2104413184,1455

The generated transaction sequences are:

[[2104409088, 2104410112], [2104410112, 2104411136], [2104411136, 2104412160, 2104413184], [2104413184]]

The frequent itemsets finally obtained are listed below; with such a small dataset the patterns are not very pronounced. Each output line is printed as support:item followed by the postfix pattern items:

    1:2104413184 2104412160
    1:2104411136 2104412160
    1:2104413184 2104411136 2104412160
    1:2104410112 2104409088
    1:2104411136 2104413184
    1:2104410112 2104411136
