机器学习算法库.html

LR(逻辑回归)

主要用于分类.

回归的英文是Regression，意思是“回退，退化，倒退”。回归分析的意思借用了“倒退，倒推”的含义。简单说就是“由果索因”的过程，是一种归纳的思想--当看到大量的事实所呈现的状态，推断出原因是如何的；当看到大量的数字对是某种状态，推断出它们之间蕴含的关系是如何的。

回归指研究一组随机变量(Y1 ，Y2 ，…，Yi)和另一组(X1，X2，…，Xk)变量之间关系的统计分析方法，又称多重回归分析。通常前者是因变量，后者是自变量。当因变量和自变量为线性关系时，称为线性回归(Linear Regression)。

逻辑回归(Logistic Regression)是一个被logistic方程归一化后的线性回归，将线性回归输出的很大范围的数，压缩到0和1之间，这样的输出值表达为某一类别的概率。

逻辑函数（sigmoid函数）:

lr1

Sigmoid 函数有个很漂亮的“S”形，如下图所示:

lr2

训练数据格式：

label1,value1,value2··· ··· ··· ···

Label为0, 1, ..., k - 1

Value为数值

预测数据格式：

value1,value2··· ···

即在训练数据格式上将label去掉

结果数据格式：

value1,value2--label ··· ···

构建分类模型

方法签名：

public void lrModelBuild(String masterUrl, String inputPath, String modelPath, int numClass, DKMLConf dkmlconf) throws TException;

签名参数说明：

masterUrl：local[2], 或spark://IP:PORT，spark-client...等
inputPath：训练数据所在路径
modelPath：模型保存路径
numClass：分类数目
dkmlconf：为参数优化,可以使用默认值0表示使用系统默认参数。

模型预测

方法签名：

public void lrModelPredict(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlconf) throws TException;

签名参数说明：

masterUrl： local[2], 或spark://IP:PORT，spark-client...等
inputPath： 训练数据所在路径
modelPath： 模型保存路径
outputPath： 结果保存路径
dkmlconf： 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：有一信用卡还款信息数据，包括用户的一些属性信息：性别、年龄、金额、年限、之前的还款记录等等，以及类别信息：正常与违约。应用LRModel可以预测其他用户的还款信息是正常还款还是可能会发生违约。

程序清单:

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
     
TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}

    //逻辑回归 模型训练
    @Test
    public void testLRModelBuild() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/lr/data";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/lr";
        int numClass = 10;
        client.lrModelBuild(masterUri, inputPath, modelPath, numClass,dkmlConf);
    }

    //逻辑回归 模型预测
    @Test
    public void testLRModelPredict() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/lr/test";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/lr";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/lrp";
        client.lrModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
}
    @After
    public void close() {
        transport.close();
    }

RandomForest(随机森林)

主要用于分类和回归

随机森林(Random Forest)是用随机的方式建立一个森林，森林里面有很多的决策树组成，随机森林的每一棵决策树之间是没有关联的。在得到森林之后，当有一个新的输入样本进入的时候，就让森林中的每一棵决策树分别进行一下判断，看看这个样本应该属于哪一类（对于分类算法），然后看看哪一类被选择最多，就预测这个样本为那一类。决策树实际上是将空间用超平面进行划分的一种方法，每次分割的时候，都将当前的空间一分为二，比如说下面的决策树：

rf1

就是将空间划分成下面的样子：

rf2

这样使得每一个叶子节点都是在空间中的一个不相交的区域，在进行决策的时候，会根据输入样本每一维feature的值，一步一步往下，最后使得样本落入N个区域中的一个。

森林里面有很多的决策树组成，随机森林的每一棵决策树之间是没有关联的。在得到森林之后，当有一个新的输入样本进入的时候，就让森林中的每一棵决策树分别进行一下判断，看看这个样本应该属于哪一类（对于分类算法），然后看看哪一类被选择最多，就预测这个样本为那一类。

训练数据格式：

label1,value1,value2··· ··· ··· ···

Label为0, 1, ..., k - 1

Value为数值

预测数据格式：

value1,value2··· ···

即在训练数据格式上将label去掉

结果数据格式：

value1,value2--label ··· ···

构建分类模型

方法签名：

public void rfClassModelBuild(String masterUrl, String inputPath, String modelPath, int numClass, DKMLConf dkmlConf) throws TException

签名参数说明：

   masterUrl 为 local[2], 或spark://IP:PORT ，spark-client...等
   inputPath 训练数据所在路径
   modelPath 模型保存路径
   numClass 分类数目
   dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

构建回归模型

方法签名：

public void rfRegresModelBuild(String masterUrl, String inputPath, String modelPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl 为local[2], 或spark://IP:PORT ，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

模型预测

方法签名：

public void rfRegresModelBuild(String masterUrl, String inputPath, String modelPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl 为local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：基于信用卡还款信息数据也可以用RFClassModel来预测用户的还款行为信息。如果是基于一些房屋的数据来预测房屋的售价可以用RFRegresModel。

程序清单

 String serverIP = "192.168.1.163";
 String masterUri = "spark://192.168.1.163:7077";
 String hdfsUri = "hdfs://192.168.1.163:9000";

 int serverPort = 9873;
 TTransport transport;
 DKML.Client client;
 SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
 DKMLConf dkmlConf;

 @Before
 public void open() throws TTransportException {
     transport = new TSocket(serverIP, serverPort);
     transport.open();
     TProtocol protocol = new TCompactProtocol(transport);
     client = new DKML.Client(protocol);
     dkmlConf = new DKMLConf();
     //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}

 //随机森林 构建分类模型
 @Test
 public void testRFClassModelBuild() throws TException {
     String inputPath = hdfsUri+ "/fitting/dkmldatas/rf/data";
     String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/rf";
     int numClass = 2;
     client.rfClassModelBuild(masterUri, inputPath, modelPath,numClass,dkmlConf);
 }

 //随机森林 模型预测
 @Test
 public void testRFModelPredict() throws TException {
     String inputPath = hdfsUri+ "/fitting/dkmldatas/rf/test";
     String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/rf";
     String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/rfp";
     client.rfModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
 }

 //随机森林 构建回归模型
 @Test
 public void testRFRegresModelBuild() throws TException {
     String inputPath = hdfsUri+ "/fitting/dkmldatas/rf/data";
     String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/rfl";
     client.rfRegresModelBuild(masterUri, inputPath, modelPath,dkmlConf);
 }
 @After
 public void close() {
     transport.close();
 }

SVM(支持向量机)

主要用于分类

支持向量机(support vector machine)是一种二类分类模型，支持向量意思就是数据集种的某些点，位置比较特殊，找分类线的时候，一般就看聚集在一起的两类数据，他们各自的最边缘位置的点，也就是最靠近划分直线的那几个点，而其他点对这条直线的最终位置的确定起不了作用，这些对分类线起决定作用的点就是支持向量，“机”即为算法。

支持向量机是一种二类分类模型，其基本模型定义为特征空间上的间隔最大的线性分类器，即支持向量机的学习策略便是间隔最大化，最终可转化为一个凸二次规划问题的求解。

svm1

SVM是一个由分类超平面定义的判别分类器。

训练数据格式：

label1,value1,value2··· ··· ··· ···

分类：Label为0, 1，仅支持二分类

回归：Label为数值

Value为数值

预测数据格式：

value1,value2··· ···

即在训练数据格式上将label去掉

结果数据格式：

value1,value2--label ··· ···

构建分类模型

方法签名：

public void svmModelBuild(String masterUrl, String inputPath, String modelPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT ，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

模型预测

方法签名：

public void svmModelPredict(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
outputPath 结果保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：SVMModel也可用于对信用卡数据用户还款行为的预测。

程序清单

String serverIP = "192.168.1.163";
 String masterUri = "spark://192.168.1.163:7077";
 String hdfsUri = "hdfs://192.168.1.163:9000";

 int serverPort = 9873;
 TTransport transport;
 DKML.Client client;
 SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
 DKMLConf dkmlConf;

 @Before
 public void open() throws TTransportException {
     transport = new TSocket(serverIP, serverPort);
     transport.open();
     TProtocol protocol = new TCompactProtocol(transport);
     client = new DKML.Client(protocol);
     dkmlConf = new DKMLConf();
     //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}

 @Test
 public void testSVMModelBuild() throws TException {
     String inputPath = hdfsUri+ "/fitting/dkmldatas/svm/data";
     String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/svm";
     client.svmModelBuild(masterUri, inputPath, modelPath,dkmlConf);
 }

 @Test
 public void testSVMModelPredict() throws TException {
     String inputPath = hdfsUri+ "/fitting/dkmldatas/svm/test";
     String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/svm";
     String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/svmp";
     client.svmModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
 }
 @After
 public void close() {
     transport.close();
 }

NaiveBayes(朴素贝叶斯)

主要用于分类

贝叶斯分类是一系列分类算法的总称，这类算法均以贝叶斯定理为基础，故统称为贝叶斯分类。朴素贝叶斯算法（Naive Bayesian) 是其中应用最为广泛的分类算法之一。

分类是将一个未知样本分到几个预先已知类的过程。数据分类问题的解决是一个两步过程：第一步,建立一个模型，描述预先的数据集或概念集。通过分析由属性描述的样本（或实例，对象等）来构造模型。假定每一个样本都有一个预先定义的类，由一个被称为类标签的属性确定。为建立模型而被分析的数据元组形成训练数据集，该步也称作有指导的学习。

贝叶斯公式：

nb1

贝叶斯图形：

nb2

训练数据格式：

label1,value1,value2··· ··· ··· ···

值要求为非负

预测数据格式：

value1,value2

即在训练数据格式上将label去掉

结果数据格式：

value1,value2--label

构建模型

方法签名：

public void nbModelBuild(String masterUrl, String inputPath, String modelPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT ，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

预测

方法签名：

public void nbModelPredict(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
outputPath 预测结果保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：对信用卡用户还款行为的预测也可以应用NBModel来进行预测。

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}
    // 朴素贝叶斯 模型构建
    @Test
    public void testNBModelBuild() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/nb/data/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/nb";
        client.nbModelBuild(masterUri, inputPath, modelPath,dkmlConf);
    }

    // 朴素贝叶斯 模型训练
    @Test
    public void testNBModelPredict() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/nb/test/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/nb";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/nbp";
        client.nbModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
}

    @After
    public void close() {
        transport.close();
    }

K-means(K均值)

主要用来聚类

聚类指的是一种学习方式，即把物理或抽象对象的集合分组为由彼此类似的对象组成的多个类的分析过程。

K-均值是把数据集按照k个簇分类，其中k是用户给定的，其中每个簇是通过质心来计算簇的中心点。

首先创建一个初始划分，随机地选择 k 个对象，每个对象初始地代表了一个簇中心。对于其他的对象，根据其与各个簇中心的距离，将它们赋给最近的簇。当有新的对象加入簇或者已有对象离开簇的时候，重新计算簇的平均值，然后对对象进行重新分配。这个过程不断重复，直到没有簇中对象的变化。

K均值公式：

km1

K均值图形：

km2

训练用数据格式：

value1,value2

预测用数据格式：

value1,value2

结果数据格式：

value1,value2--label

构建聚类模型

方法签名：

public void kmModelBuild(String masterUrl, String inputPath, String modelPath, int numClusters, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
numClusters 聚类数目
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

聚类模型预测

方法签名：

public void kmModelPredict(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
outputPath 预测结果保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：有一航空会员数据，包括会员性别、航空里程、消费金额等，可以应用KMModel根据需求对会员的级别进行聚类，比如聚类成高中低三个类别，或者S、A、B、C四个类别。

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}

    // kmeans聚类 模型训练
    @Test
    public void testKMModelBuild() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/km/data/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/km";
        int numClusters = 4;
        client.kmModelBuild(masterUri, inputPath, modelPath, numClusters,dkmlConf);
    }

    // kmeans聚类 模型预测
    @Test
    public void testKMModelPredict() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/km/test/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/km";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/kmp";
        client.kmModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
   
 }
@After
    public void close() {
        transport.close();
    }

GaussianMixtures(高斯混合模型)

主要用来聚类

混合高斯模型基于多变量正态分布，常用于聚类，通过选择成分最大化后验概率来完成聚类。与k-means聚类相似，高斯混合模型也使用迭代算法计算，最终收敛到局部最优。高斯混合模型在各类尺寸不同、聚类间有相关关系的的时候可能比k-means聚类更合适。使用高斯混合模型的聚类属于软聚类方法（一个观测量按概率属于各个类，而不是完全属于某个类），各点的后验概率提示了各数据点属于各个类的可能性。

高斯模型公式：

gm1

高斯模型图形：

gm2

训练用数据格式：

value1,value2

预用数据格式：

value测1,value2

结果数据格式：

value1,value2--label

模型构建

方法签名：

public void gmModelBuild(String masterUrl, String inputPath, String modelPath, int numClusters, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
numClusters 聚类数目
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

模型预测

方法签名：

public void gmModelPredict(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
outputPath 预测结果保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：对航空会员级别聚类也可以应用GMModel来进行聚类分析。

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}


    // 高斯混合模型 模型构建
    @Test
    public void testGMModelBuild() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/gmm/data/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/gm";
        int numClusters = 2;
        client.gmModelBuild(masterUri, inputPath, modelPath,numClusters,dkmlConf);
    }

    // 高斯混合模型 模型预测
    @Test
    public void testGMModelPredict() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/gmm/test/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/gm";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/gmp";
        client.gmModelPredict(masterUri, inputPath, modelPath, outputPath,dkmlConf);
    }
@After
    public void close() {
        transport.close();
    }

PCA(主成分分析)

主要用来对数据降维,去噪

主成分分析是设法将原来众多具有一定相关性（比如P个指标），重新组合成一组新的互相无关的综合指标来代替原来的指标。

主成分分析，是考察多个变量间相关性一种多元统计方法，研究如何通过少数几个主成分来揭示多个变量间的内部结构，即从原始变量中导出少数几个主成分，使它们尽可能多地保留原始变量的信息，且彼此间互不相关.通常数学上的处理就是将原来P个指标作线性组合，作为新的综合指标。

通过正交变换将一组可能存在相关性的变量转换为一组线性不相关的变量，转换后的这组变量叫主成分。

pca1

训练用数据格式：

value1,value2,value3,value4 ······

结果数据格式：

value1,value2

方法签名：

public void pcaModel(String masterUrl, String inputPath, String outputPath, int k, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
outputPath 结果保存路径
k 主成分数目
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：信用卡数据中用户属性信息可能有一部分是多余的或者起作用较少的，可以用PCAModel对用户属性信息进行降维。

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}



    // 主成分分析
    @Test
    public void testPCAModel() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/pca/";
        // 输出目录为linux本地已存在目录。
        String outputPath = "/opt/pca";
        int k = 3;
        client.pcaModel(masterUri, inputPath, outputPath,k,dkmlConf);
    }
    @After
    public void close() {
        transport.close();
    }

FP-growth(FP-growth)

主要用于挖掘关联规则的频繁项集

FP-Growth算法是韩家炜等人在2000年提出的关联分析算法，它采取如下分治策略：将提供频繁项集的数据库压缩到一棵频繁模式树（FP-tree），但仍保留项集关联信息。

在算法中使用了一种称为频繁模式树（Frequent Pattern Tree）的数据结构。FP-tree是一种特殊的前缀树，由频繁项头表和项前缀树构成。FP-Growth算法基于以上的结构加快整个挖掘过程。

FP-Growth算法图形：

fp1

训练用数据格式：

value1,value2··· ··· ··· ···

每行数据以逗号分隔

结果数据格式：

数据项：频繁次数

方法签名：

public void fpGrowthModelBuild(String masterUrl, String inputPath, String outputPath, double minSupport, DKMLConf dkmlConf) throws TException

签名参数说明：

  masterUrl local[2], 或spark://IP:PORT，spark-client...等
  inputPath 训练数据所在路径
  outputPath 训练结果保存路径
  minSupport 最小支持度，默认0.3，相当于占30%，超过此支持度的将被选出
  dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：有一超市购物数据，可以应用FPGrowthModel来分析出顾客经常一起购买的商品，可以对这些商品进行搭配促销。如下图：

fp2

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}

    // 频繁项集
    @Test
    public void testFPGrowthModel() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/fpgrowth/";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/fpgrowth";
        double minSupport = 0.3;
        client.fpGrowthModelBuild(masterUri,inputPath,outputPath,minSupport,dkmlConf);
    }

  @After
    public void close() {
        transport.close();
    }

ALS(交替最小二乘法的协同过滤算法)

主要用于推荐

意为交替最小二乘法常用于基于矩阵分解的推荐系统中。例如：将用户(user)对商品(item)的评分矩阵分解为两个矩阵：一个是用户对商品隐含特征的偏好矩阵，另一个是商品所包含的隐含特征的矩阵。在这个矩阵分解的过程中，评分缺失项得到了填充，也就是说我们可以基于这个填充的评分来给用户最佳商品推荐了。

交替最小二乘法的公式：

als1

交替最小二乘法的图形：

als2

训练用数据格式：

userID, productID, rating ······

userID：用户id，数值型

productID：商品id，数值型

Rating：用户对商品的打分，数值型

数据用英文逗号分隔

预测用数据格式：

给用户推荐产品

userID每行一个

给产品推荐用户

productID每行一个

结果数据格式：

userID--productID：rating, productID：rating, ······

推荐模型构建

方法签名：

public void alsModelBuild(String masterUrl, String inputPath, String modelPath, int rank, int numIterations, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 训练数据所在路径
modelPath 模型保存路径
rank 特征数目，默认10，用户打分时所考虑的特征角度
numIterations 迭代次数，推荐10-20，默认10

给产品推荐用户

方法签名：

public void rmUsers(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 预测用数据所在路径
modelPath 模型保存路径
outputPath 预测结果保存路径

给用户推荐产品

方法签名：

public void rmpProducts(String masterUrl, String inputPath, String modelPath, String outputPath, DKMLConf dkmlConf) throws TException

签名参数说明：

masterUrl local[2], 或spark://IP:PORT，spark-client...等
inputPath 预测用数据所在路径
modelPath 模型保存路径
outputPath 预测结果保存路径
dkmlConf 为参数优化,可以使用默认值0表示使用系统默认参数。

范例：有一豆瓣电影影评数据，包括用户ID、电影ID及打分，就可以应用ALSModel进行对用户推荐电影，或者对新上电影推荐潜在的用户。

程序清单

    String serverIP = "192.168.1.163";
    String masterUri = "spark://192.168.1.163:7077";
    String hdfsUri = "hdfs://192.168.1.163:9000";

    int serverPort = 9873;
    TTransport transport;
    DKML.Client client;
    SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
    DKMLConf dkmlConf;

    @Before
    public void open() throws TTransportException {
        transport = new TSocket(serverIP, serverPort);
        transport.open();
        TProtocol protocol = new TCompactProtocol(transport);
        client = new DKML.Client(protocol);
        dkmlConf = new DKMLConf();
        //开启kerberos后需配置。
//        dkmlConf.PRINCIPAL = "spark";
//        dkmlConf.KEYTAB = "/root/spark.keytab";
}


    // 构建als模型。
    @Test
    public void testALSModelBuild() throws TException {
        String inputPath = hdfsUri + "/fitting/dkmldatas/als/data/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/als";
        int rank = 10;
        int numIterations = 10;

        client.alsModelBuild(masterUri,inputPath, modelPath, rank, numIterations,dkmlConf);
    }

    // 使用als模型进行预测
    @Test
    public void testRMProducts() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/als/test/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/als";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/rmp";
        client.rmpProducts(masterUri,inputPath,modelPath,outputPath,dkmlConf);
    }

    // 使用als模型进行预测
    @Test
    public void testRMUsers() throws TException {
        String inputPath = hdfsUri+ "/fitting/dkmldatas/als/test/";
        String modelPath = hdfsUri+ "/user/root/fittingtest/dkml/als";
        String outputPath = hdfsUri+ "/user/root/fittingtest/dkml/rmu";
        client.rmUsers(masterUri,inputPath,modelPath,outputPath,dkmlConf);
    }
    @After
    public void close() {
        transport.close();
    }

官网地址：http://www.dksou.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

机器学习算法库.html

LR(逻辑回归)

RandomForest(随机森林)

SVM(支持向量机)

NaiveBayes(朴素贝叶斯)

K-means(K均值)

GaussianMixtures(高斯混合模型)

PCA(主成分分析)

FP-growth(FP-growth)

ALS(交替最小二乘法的协同过滤算法)

Fitting V1.0 文档

目录

用户安装使用手册

一、概述

二、Fitting打包说明

三、安装说明

软硬件环境

软件的安装卸载和升级

四、操作说明

用户管理

Fitting集群组件管理

安装主机状态监控功能

用户开发使用手册JAVA版

一、开发依赖说明JAVA版

二、数据源(datasource)模块JAVA版

三、SQL工具类(sqlutils)模块JAVA版

四、实时流(stream)模块JAVA版

五、搜索(search)模块JAVA版

六、自然语言处理(nlp)模块JAVA版

七、机器学习(ml)模块JAVA版

八、图计算(graphx)模块JAVA版

九、ElasticSQL模块JAVA版

十、数据处理(dataprocess)模块JAVA版

Clone this wiki locally