
Optimize the PCA algorithm with SPU #259

Closed
Candicepan opened this issue Jul 21, 2023 · 71 comments · Fixed by #300
Assignees
Labels
challenge OSCP SecretFlow Open Source Contribution Plan

Comments

@Candicepan
Copy link
Contributor

Candicepan commented Jul 21, 2023

This issue is a Phase 2 task of the SecretFlow Open Source Contribution Plan (SF OSCP); community developers are welcome to take part.
If you are interested in claiming a task but have not registered yet, please complete registration first.

Task description

  • Task name: Optimize the PCA algorithm with SPU
  • Technical area: SPU/SML
  • Difficulty: challenge 🌟🌟🌟

Detailed requirements

  • Security: reveal as little as possible
  • Functionality: implement the basic functionality of the algorithm, including:
    • better performance than the existing pure power iteration, or faster convergence (reflected in support for a larger feature dimension)
    • (recommended) no explicit computation of the covariance matrix
  • Convergence: include experimental data obtained with the simulator that demonstrates convergence/accuracy (ideally with a comparison against plaintext sklearn results; see the reference PR for details)
  • Correctness: please make sure the submitted code runs as-is
  • Code style: Python code must be formatted with black + isort (the CI pipeline includes a style-check gate)
  • Submission: link this issue and submit the code to https://github.com/secretflow/spu/tree/main/sml
  • Special notes: if a feature has special constraints, e.g. it requires FM128 or more fxp bits, state them clearly in the doc comments

Skill requirements

  • Familiarity with classic machine learning algorithms
  • Familiarity with JAX or NumPy; able to implement the algorithm with NumPy

Instructions

Development notes

For the following parts, code comments explaining the corresponding modules are mandatory, including (a minimal illustration follows after this list):

  • the meaning of the hyperparameters of __init__
  • how the algorithm in fit is concretely implemented
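
A minimal sketch of the kind of doc comments the task expects; the class and parameter names below are purely illustrative, not the actual sml implementation:

class PCA:
    def __init__(self, method="power_iteration", n_components=2, max_iter=100):
        """
        Parameters
        ----------
        method : str
            Solver to use, e.g. "power_iteration" or "rsvd" (names assumed here).
        n_components : int
            Number of principal components to keep.
        max_iter : int
            Fixed iteration count of the solver; SPU cannot run data-dependent
            control flow, so the loop bound must be chosen up front.
        """
        ...

    def fit(self, X):
        """Describe the concrete algorithm here (how the top n_components
        directions are obtained) plus SPU-specific caveats such as a required
        field (e.g. FM128), extra fxp bits, or input scaling."""
        ...
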
@Candicepan Candicepan converted this from a draft issue Jul 21, 2023
@Candicepan Candicepan added OSCP SecretFlow Open Source Contribution Plan challenge labels Jul 21, 2023
@tarantula-leo
Copy link
Contributor

tarantula-leo give it to me.

@Candicepan Candicepan moved this from Needs Triage to In Progress in OSCP Phase 2 Aug 1, 2023
@Candicepan
Copy link
Contributor Author

tarantula-leo give it to me.

Thanks for claiming the task~ To avoid wasted effort, please submit your design plan first and move on to development only after it has been confirmed.

  • Design plan: a brief note on which algorithm or optimizer you plan to use to meet the task requirements is enough

@tarantula-leo
Copy link
Contributor

Implement it with QR decomposition.

@deadlywing
Copy link
Contributor

Hi, could you be a bit more specific?

To clarify: this time we hope the PCA algorithm can avoid computing the cov matrix first; please check whether your idea meets that requirement~

Thanks

@tarantula-leo
Copy link
Contributor

What is the motivation for avoiding the covariance computation? I was planning to use QR decomposition plus QR iteration.

@deadlywing
Copy link
Contributor

Because computing the covariance matrix is already costly when n is large, and doing an eigendecomposition on top of it may well not beat the current power iteration. This round of tasks puts more emphasis on performance, hence the requirement (rough operation counts below).

That said, the main point is performance: if QR iteration really is faster, that is fine too~ feel free to try.

Thanks
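
Roughly, for an n × p data matrix X (standard textbook operation counts, not SPU measurements):

\Sigma = \tfrac{1}{n} X^\top X \in \mathbb{R}^{p \times p} \;\Rightarrow\; O(np^2) \text{ multiplications}, \qquad \text{full eigendecomposition of } \Sigma \;\Rightarrow\; O(p^3).

Methods that only need matrix-vector products X v (power iteration, or a randomized projection applied to X directly) cost O(np) per product and never materialize \Sigma.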

@tarantula-leo
Copy link
Contributor

  1. Does "computing the covariance is costly when n is large" refer to the cost of the matrix multiplication?
  2. When I tested the original PCA code with 10000 features it exited silently without any error. Do you see the same thing, or is it a problem with my machine?

@deadlywing
Copy link
Contributor

  1. Yes, it is a matrix multiplication whose complexity depends on n
  2. Could you share the code to reproduce it? I'll give it a try

@tarantula-leo
Copy link
Contributor

  1. The code used is simple_pca.py and simple_pca_test.py, with the data set to:
    A = random.normal(random.PRNGKey(0), (15, 10000))

@deadlywing
Copy link
Contributor

Hello, I ran it and it does not exit silently (although the result is wrong, presumably because the precision is not sufficient).

@deadlywing
Copy link
Contributor

Then it was probably an OOM?

@tarantula-leo
Copy link
Contributor

Does transform run to completion on your side? On mine this line is never reached:
print("X_transformed_jax", result[0])

@deadlywing
Copy link
Contributor

[screenshot of the run output]

It does reach that line on my side, although the result is wrong

@tarantula-leo
Copy link
Contributor

This does not look like a precision issue; it seems to be an algorithm-logic problem. A division by zero can happen here (a guarded variant is sketched below):
vec /= jnp.linalg.norm(vec)
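
A guarded variant (an illustrative sketch, not code from the repository); note that under fixed-point arithmetic the epsilon has to stay representable, so something around 1e-6 rather than machine epsilon:

import jax.numpy as jnp

def safe_normalize(vec, eps=1e-6):
    # The norm can reach 0 once the deflated matrix has lower rank than the
    # number of components being extracted; adding eps keeps the division finite.
    return vec / (jnp.linalg.norm(vec) + eps)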

@tarantula-leo
Copy link
Contributor

A = jnp.array([-0.3721109, 0.26423115, -0.18252768, -0.7368197, -0.44030377, -0.1521442, -0.67135346, -0.5908641, 0.73168886, 0.5673026])
A = A.reshape(2,5)

Data like this should also fail.

@deadlywing
Copy link
Contributor

A = jnp.array([-0.3721109, 0.26423115, -0.18252768, -0.7368197, -0.44030377, -0.1521442, -0.67135346, -0.5908641, 0.73168886, 0.5673026])
A = A.reshape(2,5)

Data like this should also fail.

Sorry, I don't quite follow: do you mean that with this as the input data, it also fails in the end?

@deadlywing
Copy link
Contributor

This does not look like a precision issue; it seems to be an algorithm-logic problem. A division by zero can happen here:
vec /= jnp.linalg.norm(vec)

In theory, if a division by zero already happens there, the cov matrix is only positive semi-definite and too many principal components are being extracted; I would call that an input problem~

@deadlywing
Copy link
Contributor

I gave it a quick try; it looks like an accuracy problem of power iteration itself: when the feature dimension p is large, it may need a very large number of iterations.

In my tests, at around p = 800 the results already drift badly under the default configuration~

So, seen this way, if your algorithm converges faster than power iteration, that alone is already quite valuable~

@tarantula-leo
Copy link
Contributor

Any iterative method probably has precision issues. For example, does the newton_inverse implementation used for Pearson have similar problems? Does that mean iterative implementations in general cannot support somewhat larger data sizes?

@deadlywing
Copy link
Contributor

It is similar: as the matrix dimension and condition number grow, the accuracy of newton_inverse also degrades and the iteration count then has to be increased, so that count is also something of a magic number;

so, you could also quickly try QR, though I am not sure whether QR iteration converges faster than power iteration;

@tarantula-leo
Copy link
Contributor

Plain QR iteration is probably not efficient enough:

import sys
import os
sys.path.append(os.path.join(os.path.dirname(__file__), '../../'))
import jax.numpy as jnp
from jax import random
from sklearn.datasets import load_iris
import spu.utils.simulation as spsim
import spu.spu_pb2 as spu_pb2
class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.eig_val = None
        self.eig_vec = None
        self.k_eigenvectors = None

    def fit(self, X):
        X = X - jnp.mean(X, axis=0)
        C = jnp.cov(X.T, ddof = 0)
        qr = []
        n = C.shape[0]
        Q = jnp.eye(n)

        for i in range(0, 100):
            qr = jnp.linalg.qr(C)
            Q = jnp.dot(Q,qr[0])
            C = jnp.dot(qr[1], qr[0])

        AK = jnp.dot(qr[0], qr[1])
        self.eig_val = jnp.diag(AK)
        klarge_index = jnp.argsort(self.eig_val)[::-1][:self.n_components]
        self.eig_vec = Q
        self.k_eigenvectors = (self.eig_vec.T)[klarge_index]
    
    def transform(self, X):
        return jnp.dot(X - jnp.mean(X, axis=0), self.k_eigenvectors.T)


iris = load_iris()
A = iris.data

# A = jnp.array([-1,-1,0,2,0,-2,0,0,1,1])
# A=A.reshape(2,5)

def fit_and_transform(X, n_components):
    pca = PCA(n_components)
    pca.fit(X)
    return pca.transform(X)

sim_aby = spsim.Simulator.simple(
    3, spu_pb2.ProtocolKind.ABY3, spu_pb2.FieldType.FM64
)
result = spsim.sim_jax(sim_aby, fit_and_transform,  static_argnums=(1,))(A, 2)
print(result)

@deadlywing
Copy link
Contributor

Indeed; looking just at the iteration part, the QR decomposition itself is already costly, and there are two matrix multiplications on top of it. Unless the convergence speed has a decisive advantage, it probably will not beat the current power iteration.

@tarantula-leo
Copy link
Contributor

Do you have any design ideas for iterative computation in general? PCA, Newton_inverse and NMF all involve iteration; right now we basically use a fixed iteration count, SPU cannot use control_flow, and accuracy drops a lot when the data is high-dimensional.

@deadlywing
Copy link
Contributor

For large matrices many iterative algorithms have this problem; at the moment the only remedy is to tune the iteration count by hand to reach higher precision. As I see it there are two separate precision issues here:

  1. The iterative algorithm itself; this part can be judged by consulting the literature and analyzing the theoretical convergence rate. Power iteration and QR iteration are typical textbook algorithms, simple but probably not optimal in performance, so it may be necessary to look at the algorithms used by mature linear-algebra libraries (e.g. Eigen, NumPy; this is also why the task is rated "challenge").
  2. MPC error; this part can be addressed by setting the field to FM128 and raising the fxp bits.

That said, for PCA the original hope was to implement an SVD algorithm: on one hand it avoids computing the cov matrix, on the other it provides an important basic building block for other algorithms.

@tarantula-leo
Copy link
Contributor

Using gram_schmidt for the QR should mitigate the overflow, but the scale still needs to be handled.

@tarantula-leo
Copy link
Contributor

  1. The precision loss seems to come mainly from solving for the eigenvectors, not from the QR decomposition
  2. As the matrix whose SVD is being computed grows (here this scales with n_component), the number of iterations has to grow
  3. Also, if the dataset gets larger, the scale needs to be adjusted based on the plaintext behavior
import numpy as np
import jax.numpy as jnp
from jax import random

def eigh_power(A):
    n = A.shape[0]
    eig_val = []
    eig_vec = []
    for _ in range(n):
        vec = jnp.ones((n,))
        for _ in range(100):
            # Power iteration
            vec = jnp.dot(A, vec)
            vec /= jnp.linalg.norm(vec)

        eigval = jnp.dot(vec.T, jnp.dot(A, vec))
        eig_vec.append(vec)
        eig_val.append(eigval)
        A -= eigval * jnp.outer(vec, vec)
    eig_vecs = jnp.column_stack(eig_vec)
    eig_vals = jnp.array(eig_val)
    return eig_vals, eig_vecs
    
def svd(A):
    m, n = A.shape
    # sigma, U = jnp.linalg.eigh(jnp.dot(A, A.T))
    # print(jnp.dot(A, A.T))
    sigma, U = eigh_power(jnp.dot(A, A.T))
    # if use eigh_qr 
    # arg_sort = jnp.argsort(sigma)[::-1]
    # sigma = sigma[arg_sort]
    # U = U[:, arg_sort]
    sigma_sqrt = jnp.sqrt(sigma)
    # sigma_matrix = jnp.diag(sigma_sqrt)
    sigma_inv = jnp.diag((1/sigma_sqrt))
    V = jnp.dot(jnp.dot(sigma_inv, U.T), A)
    return (U, sigma_sqrt, V)

def gram_schmidt(A):
    Q=jnp.zeros_like(A)
    cnt = 0
    for a in A.T:
        u = jnp.copy(a)
        for i in range(0, cnt):
            u -= jnp.dot(jnp.dot(Q[:, i].T, a), Q[:, i])
        e = u / jnp.linalg.norm(u)
        Q = Q.at[:,cnt].set(e)
        cnt += 1
    return Q

def iteration(A, Omega, power_iter = 4):
    # iter = 7 if rank < 0.1 * min(A.shape) else 4
    Y = jnp.dot(A, Omega)

    for _ in range(power_iter):
        Y = jnp.dot(A, jnp.dot(A.T, Y))
        # print(Y)
    # Q, _ = jnp.linalg.qr(Y)
    Q = gram_schmidt(Y / 10000)
    return Q

def rsvd(A, Omega):
    Q = iteration(A, Omega)
    B = jnp.dot(Q.T, A)
    # u_tilde, s, v = jnp.linalg.svd(B, full_matrices=False)
    u_tilde, s, v = svd(B)
    u = jnp.dot(Q, u_tilde)
    return u, s, v

A = random.normal(random.PRNGKey(0), (1000,10))
A = A - jnp.mean(A, axis=0)
rank = 5
random_state = np.random.RandomState(0)
Omega = random_state.normal(size=(A.shape[1], rank)) / 10000000

def main(mat,Omega):
    u, s, v = rsvd(mat, Omega)
    v = v[:rank,:]
    return u,s,v

# test
print(main(A,Omega)[1])

# sklearn
from sklearn.utils.extmath import randomized_svd
u1,s1,v1=randomized_svd(A,rank,power_iteration_normalizer = 'QR',svd_lapack_driver = 'gesvd',n_oversamples=0, random_state=0)
print(s1)

# spu
import spu.utils.simulation as spsim
import spu.spu_pb2 as spu_pb2
config = spu_pb2.RuntimeConfig(
        protocol=spu_pb2.ProtocolKind.ABY3,
        field=spu_pb2.FieldType.FM128,
        fxp_fraction_bits=30,
    )
sim_aby = spsim.Simulator(3, config)
result = spsim.sim_jax(sim_aby, main)(A, Omega)
print(result[1])

@deadlywing
Copy link
Contributor

  1. The precision loss seems to come mainly from solving for the eigenvectors, not from the QR decomposition
  2. As the matrix whose SVD is being computed grows (here this scales with n_component), the number of iterations has to grow
  3. Also, if the dataset gets larger, the scale needs to be adjusted based on the plaintext behavior
  1. I tried it; solving for the eigenvectors is indeed also a bottleneck, but the QR still matters: after replacing JAX's qr with gram_schmidt, the supported dimension got somewhat larger.
  2, 3: I think we can leave a hyperparameter hook for both;

btw, which iteration count do you mean exactly, the one in the iteration function or the one in eigh_power? Or both could be exposed for the user to tune

@tarantula-leo
Copy link
Contributor

The iteration count I mean is the one in eigh_power. In simple_pca the power_iteration round count scales with n_feature, whereas here it scales with n_component; the count inside iteration is usually just 4 or 7, chosen from the n_component/n_feature ratio.
Should this code, plus a scale hyperparameter, go in as a PR? I am thinking of not touching the pca path and adding a new rsvd path instead.

@deadlywing
Copy link
Contributor

The iteration count I mean is the one in eigh_power. In simple_pca the power_iteration round count scales with n_feature, whereas here it scales with n_component; the count inside iteration is usually just 4 or 7, chosen from the n_component/n_feature ratio.
Should this code, plus a scale hyperparameter, go in as a PR? I am thinking of not touching the pca path and adding a new rsvd path instead.

Personally, I think you could create an extmath.py under the utils folder and move some of what we discussed into it as shared utility functions (a rough interface sketch follows below), including:

  1. rsvd
  2. eigh (qr and power iter)
  3. QR decomposition (the JAX API, i.e. Householder, and gram_schmidt)

Then I think you could also add a method=rsvd option to the existing pca.

I would also ask you to describe the known issues of the current approach, e.g. it needs FM128, the scale may need adjusting, and the iteration count has to grow with n_component (these can go directly into the doc comment of the PCA class~)
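
A rough sketch of what such a function-level extmath.py interface could look like; the names, signatures and defaults are assumptions for illustration, not the merged sml API:

import jax.numpy as jnp

def qr_gram_schmidt(A):
    """Classical Gram-Schmidt orthogonalization; returns only Q."""
    ...

def eigh_power(A, n_iter=100):
    """Eigen-pairs of a symmetric matrix via power iteration plus deflation."""
    ...

def rsvd(A, omega, n_power_iter=4, n_eigh_iter=100, scale=1e4):
    """Randomized SVD built on the two helpers above; omega is the random
    projection matrix, scale rescales intermediates against fixed-point overflow."""
    ...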

@tarantula-leo
Copy link
Contributor

Roughly this integration:

  1. In extmath.py use plain functions (def), no classes
  2. After modifying the pca file, add a condition to choose between power_iteration and rsvd

@deadlywing
Copy link
Contributor

Roughly this integration:

  1. In extmath.py use plain functions (def), no classes
  2. After modifying the pca file, add a condition to choose between power_iteration and rsvd

Hmm, I think that's ok

  1. extmath.py is mainly there to provide some higher-level math functions, so plain functions are fine; no need to design classes
  2. Distinguishing by method is enough: just add an rsvd branch inside fit (see the sketch below)
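
A minimal sketch of that dispatch, assuming hypothetical helpers eigh_power and rsvd like the ones sketched above; because method is a static Python string, the branch is resolved at trace time and never turns into secret-dependent control flow inside SPU:

import jax.numpy as jnp
# from sml.utils.extmath import eigh_power, rsvd   # module path and names assumed

def pca_fit(X, n_components, method="power_iteration", omega=None):
    mean = jnp.mean(X, axis=0)
    Xc = X - mean
    if method == "power_iteration":
        # existing path: covariance matrix + power iteration
        _, eig_vecs = eigh_power(jnp.cov(Xc.T))
        components = eig_vecs[:, :n_components].T
    else:  # "rsvd"
        # new path: randomized SVD on the centered data, no covariance matrix
        _, _, v = rsvd(Xc, omega)
        components = v[:n_components, :]
    return mean, components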

@tarantula-leo
Copy link
Contributor

  1. Would adjusting the simulation parameters break the existing simple_pca? It seems to fail when I run it with n_component = 5 and this dataset/config:
    X = random.normal(random.PRNGKey(0), (1000, 10))
    config = spu_pb2.RuntimeConfig(
        protocol=spu_pb2.ProtocolKind.ABY3,
        field=spu_pb2.FieldType.FM128,
        fxp_fraction_bits=30,
    )
    sim = spsim.Simulator(3, config)
  2. transform also needs a change: whether the mean-removal step is applied there should likewise be selected via method
    X = X - self._mean
    return jnp.dot(X, self._components)

@deadlywing
Copy link
Contributor

  1. I ran it locally and looked at the results; once the data gets a bit larger the precision is no longer that high, which is what makes the assertion fail
  2. Mean removal should always be needed, even with svd, shouldn't it?

@tarantula-leo
Copy link
Contributor

With SVD, mean removal is needed in fit but not in transform.

@tarantula-leo
Copy link
Contributor

I have submitted the extmath PR first; for the PCA part I still need to find time to write up the underlying theory.

@deadlywing
Copy link
Contributor

@deadlywing
Copy link
Contributor

Also, by definition PCA essentially looks for the projection direction of maximum variance (in symbols below); without mean-centering, it is easily affected by the scale and offset of the data~
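
In symbols (the standard textbook formulation), the first principal direction solves

w_1 = \arg\max_{\lVert w \rVert = 1} \frac{1}{n} \sum_{i=1}^{n} \bigl( w^\top (x_i - \bar{x}) \bigr)^2 = \arg\max_{\lVert w \rVert = 1} w^\top \Sigma\, w, \qquad \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top,

so if \bar{x} is not subtracted, the mean gets mixed into the "variance" being maximized and the leading direction drifts toward the data's overall offset.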

@tarantula-leo
Copy link
Contributor

I compared the results using fit_transform; looking at the source, it seems that step is not used there:
https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/decomposition/_pca.py
@deadlywing
Copy link
Contributor

I compared the results using fit_transform; looking at the source, it seems that step is not used there:
https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/decomposition/_pca.py

fit_transform does fit and transform together, and fit itself already centers the data, so that step is not needed again afterwards;
transform, on the other hand, is normally applied to a different dataset, so it does need to be standardized with the mean kept from fit.

Anyway, the simplest check is to compare the results: whether you use svd or power iter, the transform results should be the same~ (a quick plaintext check is sketched below)
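
A quick plaintext sanity check along those lines (a sketch using sklearn defaults, not code from the PR):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(100, 10))
for solver in ("full", "randomized"):
    pca = PCA(n_components=3, svd_solver=solver, random_state=0)
    Z1 = pca.fit_transform(X)   # U * S of the centered data
    Z2 = pca.transform(X)       # (X - pca.mean_) @ pca.components_.T
    assert np.allclose(Z1, Z2, atol=1e-6)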

@tarantula-leo
Copy link
Contributor

I was previously validating against the fit_transform below, and the results did not quite match; if I call fit and transform separately, the results do agree

            # X_new = X * V = U * S * Vt * V = U * S
            U *= S[: self.n_components_]

@tarantula-leo
Copy link
Contributor

May I ask what causes this error when running with emul? Also, simulation requires a modified config; how do I set that for emul?

[2023-08-10 16:49:29,458] Start multiprocess cluster...
[2023-08-10 16:49:29,819] [MainProcess] Starting grpc server at 127.0.0.1:9920
[2023-08-10 16:49:30,194] [MainProcess] Starting grpc server at 127.0.0.1:9921
[2023-08-10 16:49:30,501] [MainProcess] Run : builtin_spu_init at node:0
[2023-08-10 16:49:30,507] [MainProcess] Run : builtin_spu_init at node:1
I0810 16:49:30.510710 30898 external/com_github_brpc_brpc/src/brpc/server.cpp:1113] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=9930.
I0810 16:49:30.511318 30898 external/com_github_brpc_brpc/src/brpc/server.cpp:1116] Check out http://node2:9930 in web browser.
I0810 16:49:30.515334 30912 external/com_github_brpc_brpc/src/brpc/server.cpp:1113] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=9931.
I0810 16:49:30.515905 30912 external/com_github_brpc_brpc/src/brpc/server.cpp:1116] Check out http://node2:9931 in web browser.
[2023-08-10 16:49:30,597] [MainProcess] Starting grpc server at 127.0.0.1:9922
I0810 16:49:30.611969 30924 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345] Checking Socket{id=0 addr=127.0.0.1:9931} (0x7f14600e37c0)
I0810 16:49:30.612149 30924 external/com_github_brpc_brpc/src/brpc/socket.cpp:2405] Revived Socket{id=0 addr=127.0.0.1:9931} (0x7f14600e37c0) (Connectable)
I0810 16:49:30.617444 30934 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345] Checking Socket{id=1 addr=127.0.0.1:9932} (0x7f14600e35c0)
[2023-08-10 16:49:30,999] [MainProcess] Starting grpc server at 127.0.0.1:9923
[2023-08-10 16:49:31,371] [MainProcess] Starting grpc server at 127.0.0.1:9924
I0810 16:49:31.629143 30932 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345] Checking Socket{id=1 addr=127.0.0.1:9932} (0x7f14600e3a00)
I0810 16:49:39.534662 30912 external/com_github_brpc_brpc/src/brpc/server.cpp:1173] Server[yacl::link::internal::ReceiverServiceImpl] is going to quit
[2023-08-10 16:49:39,536] [MainProcess] Traceback (most recent call last):
  File "/root/training/Python3.8/lib/python3.8/site-packages/spu/utils/distributed.py", line 324, in Run
    ret_objs = fn(self, *args, **kwargs)
  File "/root/training/Python3.8/lib/python3.8/site-packages/spu/utils/distributed.py", line 541, in builtin_spu_init
    link = libspu.link.create_brpc(desc, my_rank)
RuntimeError: what: 
        [external/yacl/yacl/link/context.cc:167] connect to mesh failed, failed to setup connection to rank=2
stacktrace: 
#0 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7f14756d9fed
#1 pybind11::cpp_function::dispatcher()+0x7f14756c1956
#2 PyCFunction_Call+0x43bdca



I0810 16:49:39.735126 30932 external/com_github_brpc_brpc/src/brpc/socket.cpp:2345] Checking Socket{id=0 addr=127.0.0.1:9931} (0x7f14600e37c0)
I0810 16:49:40.530438 30898 external/com_github_brpc_brpc/src/brpc/server.cpp:1173] Server[yacl::link::internal::ReceiverServiceImpl] is going to quit
[2023-08-10 16:49:40,532] [MainProcess] Traceback (most recent call last):
  File "/root/training/Python3.8/lib/python3.8/site-packages/spu/utils/distributed.py", line 324, in Run
    ret_objs = fn(self, *args, **kwargs)
  File "/root/training/Python3.8/lib/python3.8/site-packages/spu/utils/distributed.py", line 541, in builtin_spu_init
    link = libspu.link.create_brpc(desc, my_rank)
RuntimeError: what: 
        [external/yacl/yacl/link/context.cc:167] connect to mesh failed, failed to setup connection to rank=2
stacktrace: 
#0 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7f14756d9fed
#1 pybind11::cpp_function::dispatcher()+0x7f14756c1956
#2 PyCFunction_Call+0x43bdca



[2023-08-10 16:49:40,534] Shutdown multiprocess cluster...

@tpppppub
Copy link
Member

  1. Please provide the code and the steps to reproduce
  2. When creating the emulator, just point it at the path of your modified config.json

@tarantula-leo
Copy link
Contributor

In the emulation config.json I don't see this parameter, only the one for adjusting field:
fxp_fraction_bits=30
In simulation it is set like this:

        config = spu_pb2.RuntimeConfig(
                protocol=spu_pb2.ProtocolKind.ABY3,
                field=spu_pb2.FieldType.FM128,
                fxp_fraction_bits=30,
            )
        sim = spsim.Simulator(3, config)

Can I just add this parameter to config.json?

@tpppppub
Copy link
Member

Can I just add this parameter to config.json?

Add "fxp_fraction_bits": 30 under runtime_config (see the sketch below).
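
A minimal sketch of what the runtime_config block inside config.json would then contain, mirroring the simulation settings above and shown here as a Python dict for illustration (the other keys of the file stay untouched; the exact file layout may differ):

runtime_config = {
    "protocol": "ABY3",
    "field": "FM128",
    "fxp_fraction_bits": 30,
}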

@deadlywing
Copy link
Contributor

The util PR has been merged; please open a new PR for the pca part

@tarantula-leo
Copy link
Contributor

For the config issue, should I create a new config.json?

@tpppppub
Copy link
Member

Create one under the PCA subdirectory for now

@tarantula-leo
Copy link
Contributor

The util PR has been merged; please open a new PR for the pca part

Submitted; please check whether the emul part reports any errors.

@Candicepan Candicepan moved this from In Progress to In Review in OSCP Phase 2 Aug 11, 2023
deadlywing pushed a commit that referenced this issue Aug 15, 2023
# Pull Request

## What problem does this PR solve?
Optimize the PCA algorithm with SPU
Issue Number: Fixed #259

## Possible side effects?

- Performance:
1. Faster convergence (reflected in support for a larger feature dimension)
2. No explicit computation of the covariance matrix of the original dataset
- Backward compatibility:
@github-project-automation github-project-automation bot moved this from In Review to Done in OSCP Phase 2 Aug 15, 2023
Labels
challenge OSCP SecretFlow Open Source Contribution Plan
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants