This repository is based on a project at a production plant that applies Bayesian methods to find change points in a sequence of production machine data, in order to alert engineers to perform maintenance.
The production line uses mounter machines to place components onto a PCB. The mounter is adjusted according to a planned schedule. However, sometimes the mounter's placement accuracy drops below a threshold, causing defective products, recalls, and unplanned stoppages for emergency maintenance.
Experienced engineers know that these "sudden" drops in accuracy are not that sudden after all. There are usually signs before performance crosses the final threshold, in the form of a small step-wise or continuous deviation to one side of the designated position. These deviations are so small and irregular that they do not trigger automated optical inspection (AOI) alerts. But the changes accumulate until they finally cause trouble.
At the time of the project, engineers relied on frequent manual inspection of the performance data, as well as personal experience, to spot these signs and plan preventive maintenance. They wanted an algorithm that automatically warns them when change points are detected in the performance data over time. The challenge is to detect a pattern change when the base pattern is itself a noisy random distribution.
The problem can be formulated as time-series change point detection. Bayesian methods are used for their ability to detect changes in a statistical distribution. The project is largely an implementation of the original paper by Schuetz and Holschneider, University of Potsdam. Kudos to the authors.
We demonstrate the Bayesian method with a simple case: a school has 60% boys and 40% girls. Boys always wear pants; girls wear pants 50% of the time and dresses 50% of the time. If you see a pupil wearing a pair of pants, what is the probability that the pupil is a boy or a girl?
Assuming the school has U pupils in total, there are U * P(Boy) * P(Pants|Boy) boys who wear pants, and U * P(Girl) * P(Pants|Girl) girls who wear pants. The latter divided by the total is the probability that the person in pants you see is a girl:
$$P(\text{Girl} \mid \text{Pants}) = \frac{P(\text{Girl}) \, P(\text{Pants} \mid \text{Girl})}{P(\text{Boy}) \, P(\text{Pants} \mid \text{Boy}) + P(\text{Girl}) \, P(\text{Pants} \mid \text{Girl})}$$
To generalize the case, we have Bayes' theorem:
$$P(B \mid A) = \frac{P(A \mid B) \, P(B)}{P(A \mid B) \, P(B) + P(A \mid \neg B) \, P(\neg B)}$$

or

$$P(B \mid A) = \frac{P(AB)}{P(A)}, \qquad \text{or} \qquad P(B \mid A) \, P(A) = P(AB)$$
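Plugging in the numbers of the school example:

$$P(\text{Girl} \mid \text{Pants}) = \frac{0.4 \times 0.5}{0.6 \times 1.0 + 0.4 \times 0.5} = \frac{0.2}{0.8} = 0.25$$

so the pupil in pants is a girl with probability 0.25 and a boy with probability 0.75.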
Based on the engineers' requests, three types of changes were identified that need to be detected (illustrated with synthetic data after the list):
- A step-wise change in the average placement position of the components
- A continuous change in the average
- A step-wise change in the standard deviation
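To make the three patterns concrete, here is a small synthetic illustration (the series and parameter values are invented for demonstration, not taken from the plant data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)

# 1) step-wise change in the average at t = 50
stepMean = np.where(t < 50, 0.0, 0.5) + rng.normal(0, 0.1, 100)

# 2) continuous (linear) change in the average starting at t = 50
driftMean = np.where(t < 50, 0.0, 0.01 * (t - 50)) + rng.normal(0, 0.1, 100)

# 3) step-wise change in the standard deviation at t = 50
stepStd = rng.normal(0.0, np.where(t < 50, 0.1, 0.3))
```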
Mounter placement position is the average position plus a random dispersion from the average:

$$y = F_\theta \, \beta + \epsilon$$

where $y = (y_1, \ldots, y_n)^T$ are the measured positions, $\theta$ is the position of the change point, $\beta$ holds the four amplitudes of the basis functions, and $\epsilon$ is the noise. The basis functions are the constant levels $\phi_-^\theta, \phi_+^\theta$ and the linear trends $\zeta_-^\theta, \zeta_+^\theta$ before ($-$) and after ($+$) the change point, collected in the design matrix

$$F_\theta = \begin{pmatrix} (\phi_-^\theta)_1 & (\zeta_-^\theta)_1 & (\zeta_+^\theta)_1 & (\phi_+^\theta)_1 \\ \vdots & \vdots & \ddots & \vdots \\ (\phi_-^\theta)_n & (\zeta_-^\theta)_n & (\zeta_+^\theta)_n & (\phi_+^\theta)_n \end{pmatrix}$$

The objective is to calculate the probability of a change in any of these parameters.
Noise is assumed to be normally distributed, $\epsilon \sim \mathcal{N}(0, \, \sigma^2 \, \Omega_{\theta, s_1, s_2})$, where $s_1$ and $s_2$ control the dispersion before and after the change point:

$$(\Omega_{\theta, s_1, s_2})_{ij} = \big[ 1 + s_1 (\zeta_-^\theta)_j + s_2 (\zeta_+^\theta)_j \big]^2 \, \delta_{ij}$$
Based on the above equations, we get the probability density function of the measurements (the likelihood):

$$p(y \mid \theta, \beta, \sigma, s_1, s_2) = \frac{1}{(2\pi\sigma^2)^{n/2} \sqrt{\det \Omega_{\theta,s_1,s_2}}} \exp\!\left( -\frac{(y - F_\theta \beta)^T \, \Omega_{\theta,s_1,s_2}^{-1} \, (y - F_\theta \beta)}{2\sigma^2} \right)$$

There exists a $\beta$ that maximizes this likelihood, the maximum likelihood estimator

$$\beta^* = \big( F_\theta^T \, \Omega^{-1} F_\theta \big)^{-1} F_\theta^T \, \Omega^{-1} y$$

where $\Omega$ abbreviates $\Omega_{\theta, s_1, s_2}$.
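The log of this density can be written directly in NumPy. A minimal sketch, assuming the model pieces ($F_\theta$, $\Omega$) are available as arrays; the function name and signature are mine:

```python
import numpy as np

def logLikelihood(y, F, beta, sigma, Om):
    """Log of p(y | theta, beta, sigma, s), with Om = Omega_{theta, s1, s2}."""
    n = len(y)
    resid = y - F @ beta                           # y - F_theta * beta
    _, omLogdet = np.linalg.slogdet(Om)            # log det(Omega), numerically stable
    quad = (resid.T @ np.linalg.solve(Om, resid)).item()
    return -0.5 * (n * np.log(2 * np.pi * sigma**2) + omLogdet + quad / sigma**2)
```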
We can also estimate the system's error through the residual:

$$r^2 = (y - F_\theta \beta^*)^T \, \Omega^{-1} (y - F_\theta \beta^*), \qquad \hat{\sigma} = \sqrt{\frac{r^2}{n + 1}}$$
Assuming all parameters are independent, their joint prior is

$$p(\theta, \beta, \sigma, s_1, s_2) = p(\theta) \, p(\beta) \, p(\sigma) \, p(s_1) \, p(s_2)$$
Since the true prior is unknown, I use a noninformative prior (Jeffreys prior), which is proportional to the square root of the determinant of the Fisher information:

$$p(\varphi) \propto \sqrt{\det I(\varphi)}$$

When $x$ is normally distributed, the Jeffreys prior does not depend on the average, so we define flat priors for the location parameters $\theta$ and $\beta$. For the scale parameter $\sigma$, the Fisher information is $2/\sigma^2$, so the prior is simply the reciprocal:

$$p(\sigma) \propto \frac{1}{\sigma}$$
Now the problem is simplified to: given the likelihood function and the prior, calculate the posterior (Bayesian inference):

$$p(\theta, \beta, \sigma, s \mid y) = \frac{p(y \mid \theta, \beta, \sigma, s) \, p(\theta, \beta, \sigma, s)}{p(y)}$$

or simply

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$
Taking the priors and the likelihood together, partially integrating out all the other parameters gives the posterior of the parameter of interest. Partial integration over $\beta$ (a Gaussian integral) and over $\sigma$ (with the $1/\sigma$ prior) can be done analytically and leaves

$$p(\theta, s_1, s_2 \mid y) \propto \big[ \det \Omega \cdot \det(F_\theta^T \, \Omega^{-1} F_\theta) \big]^{-1/2} \, \big( r^2 \big)^{-(n-4)/2}$$

Then I used a numeric method to integrate over $s_1$ and $s_2$, which gives the final posterior $p(\theta \mid y)$. In the same way we can get the posteriors of all the parameters. A high probability in any of the posteriors indicates a change point of the corresponding type.
Calculating the basis functions $\phi_\pm^\theta$ (mode `'constant'`) and $\zeta_\pm^\theta$ (mode `'linear'`):
```python
import numpy as np
from matplotlib import pyplot as plt

def xiMinus(theta, t, mode='constant'):
    """Basis function before the change point theta:
    mode 'constant' -> phi_minus (level), mode 'linear' -> zeta_minus (trend)."""
    scale = 1
    if mode == 'constant':
        if t <= theta:
            return -1.0 / scale
        else:
            return 0.0
    if mode == 'linear':
        if t <= theta:
            return 1.0 / scale * (theta - t)
        else:
            return 0.0

def xiPlus(theta, t, mode='constant'):
    """Basis function after the change point theta:
    mode 'constant' -> phi_plus (level), mode 'linear' -> zeta_plus (trend)."""
    scale = 1
    if mode == 'constant':
        if t <= theta:
            return 0.0
        else:
            return 1.0 / scale
    if mode == 'linear':
        if t <= theta:
            return 0.0
        else:
            return 1.0 / scale * (t - theta)

# Demo: the two linear basis functions around a change point at theta = 5
theta = 5
n = np.arange(11)
result = np.zeros(len(n))
for t in n:
    result[t] = xiMinus(theta, t, 'linear') + xiPlus(theta, t, 'linear')

fig, ax = plt.subplots()
ax.plot(result)
```
Calculating the design matrix $F_\theta$:
```python
def fTheta(theta, n):
    """Design matrix F_theta with columns phi_-, zeta_-, zeta_+, phi_+."""
    f = np.zeros([n, 4])
    for j in np.arange(0, n):
        f[j][0] = xiMinus(theta, j, 'constant')
        f[j][1] = xiMinus(theta, j, 'linear')
        f[j][2] = xiPlus(theta, j, 'linear')
        f[j][3] = xiPlus(theta, j, 'constant')
    return f

f = fTheta(2, 5)
f
```

```
array([[-1.,  2.,  0.,  0.],
       [-1.,  1.,  0.,  0.],
       [-1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  2.,  1.]])
```
Calculating the covariance matrix $\Omega_{\theta, s_1, s_2}$:
```python
def omegaTheta(theta, s, n):
    """Diagonal covariance matrix Omega_{theta, s1, s2}; s1 = s[1], s2 = s[2]."""
    om = np.zeros([n, n])
    for j in np.arange(0, n):
        om[j][j] = np.power(1 + s[1] * xiMinus(theta, j, 'linear')
                              + s[2] * xiPlus(theta, j, 'linear'), 2)
    return om

om = omegaTheta(2, [0, -1, 1, 0], 5)
om
```

```
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  4.,  0.],
       [ 0.,  0.,  0.,  0.,  9.]])
```
Maximum likelihood:
Calculating the maximum likelihood estimator $\beta^*$, the residual $r^2$, and the error estimate $\hat{\sigma}$:
```python
import scipy.linalg

def betaStar(y, fTomInvf, fTomInv):
    """Maximum likelihood estimator: solve (F' Om^-1 F) beta* = F' Om^-1 y."""
    target = np.matmul(fTomInv, y)
    return np.linalg.solve(fTomInvf, target)

def residuumSq(y, betastar, omInv):
    """Residual r^2 = (y - F beta*)' Om^-1 (y - F beta*); uses the global f."""
    ytoFBetastar = np.subtract(y, np.matmul(f, betastar))
    tmp = np.matmul(np.transpose(ytoFBetastar), omInv)
    result = np.matmul(tmp, ytoFBetastar)
    return result.tolist()[0][0]

def sigmaHat(rsq, n):
    """Error estimate sigma_hat = sqrt(r^2 / (n + 1))."""
    return np.sqrt(rsq / (n + 1))

def inverseFunc(mat):
    """Invert mat, falling back to an SVD pseudo-inverse when ill-conditioned."""
    if np.linalg.cond(mat) < 1 / np.finfo(mat.dtype).eps:
        return np.linalg.solve(mat, np.eye(mat.shape[1], dtype=float))
    else:
        U, S, Vh = scipy.linalg.svd(mat)       # pinv(mat) = V diag(1/S) U'
        tmp = np.matmul(np.transpose(Vh), np.linalg.inv(np.diag(S)))
        return np.matmul(tmp, np.transpose(U))

y = np.array([1, 2, 3, 4, 5])
y = y.reshape(-1, 1)

om = np.where(om == 0, 1e-7, om)               # regularize zero variances
_, omLogdet = np.linalg.slogdet(om)            # log-determinants avoid overflow
omDet = np.exp(omLogdet)
omInv = inverseFunc(om)
fTomInv = np.matmul(np.transpose(f), omInv)    # F' Om^-1
fTomInvf = np.matmul(fTomInv, f)               # F' Om^-1 F
_, logdet = np.linalg.slogdet(fTomInvf)
fTomInvfDet = np.exp(logdet)
omDetfTomInvfDet = omDet * fTomInvfDet         # det(Om) * det(F' Om^-1 F)

betastar = betaStar(y, fTomInvf, fTomInv)
rsq = residuumSq(y, betastar, omInv)
sigmahat = sigmaHat(rsq, len(y))
print(betastar)
print(rsq)
print(sigmahat)
```

```
[[-3.]
 [-1.]
 [ 1.]
 [ 3.]]
8.486838935292248e-18
1.18931625562e-09
```
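Putting these pieces together, the marginal posterior over $\theta$ can be scanned position by position. Below is a minimal sketch based on the marginal form derived above, with $s_1, s_2$ held fixed and the edge positions skipped; the function name and grid handling are mine, not the project's API:

```python
def posteriorTheta(y, s, n):
    """Normalized marginal posterior p(theta | y) for fixed s (same list
    convention as omegaTheta: s1 = s[1], s2 = s[2])."""
    logp = np.full(n, -np.inf)
    for theta in range(1, n - 1):              # interior candidate positions
        F = fTheta(theta, n)
        Om = omegaTheta(theta, s, n)
        Om = np.where(Om == 0, 1e-7, Om)       # regularize zero variances
        OmInv = inverseFunc(Om)
        FtOmInv = np.matmul(np.transpose(F), OmInv)
        FtOmInvF = np.matmul(FtOmInv, F)
        bstar = np.linalg.lstsq(FtOmInvF, np.matmul(FtOmInv, y), rcond=None)[0]
        resid = y - np.matmul(F, bstar)
        r2 = np.matmul(np.matmul(np.transpose(resid), OmInv), resid).item()
        _, ldOm = np.linalg.slogdet(Om)
        _, ldF = np.linalg.slogdet(FtOmInvF)
        # log of [det(Om) det(F' Om^-1 F)]^(-1/2) * (r^2)^(-(n-4)/2)
        logp[theta] = -0.5 * (ldOm + ldF) - 0.5 * (n - 4) * np.log(r2)
    p = np.exp(logp - np.max(logp))            # stable exponentiation
    return p / p.sum()                         # normalize to probabilities
```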
Test on water level data from the Nile river. The blue line is the original data, the yellow line is the posterior of the change point.
The most computationally intensive part of the algorithm is the partial integration. The original version took 20 minutes to get the posterior of $\theta$.
If we know approximately what the posterior looks like, and it doesn't change drastically, we can use sampling methods to reduce the computational load. Fortunately this fits our situation on the production line.
I used Markov chain Monte Carlo (MCMC) with Gibbs sampling to sample the parameters and generate the posterior. Pseudo code for the integration over $\theta$, $s_1$, and $s_2$ (a Python sketch follows the list):
initialize $s_1$ and $s_2$ to 0
FOR $i <$ iterations DO:
- use $s_1$ and $s_2$ to get $p(\theta, s \mid y)$
- take $p(\theta, s \mid y)$ as the estimated posterior and sample $\theta$ from it
- use the sampled $\theta$ to calculate $p(s_1 \mid \theta, s_2, y)$
- take $p(s_1 \mid \theta, s_2, y)$ as the estimated posterior and sample $s_1$
- use the sampled $s_1$ to calculate $p(s_2 \mid \theta, s_1, y)$
- take $p(s_2 \mid \theta, s_1, y)$ as the estimated posterior and sample $s_2$
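A minimal sketch of this loop, assuming a helper `logPThetaS(theta, s1, s2, y)` that returns $\log p(\theta, s \mid y)$ up to a constant (it could be built from the marginal form above); the grids and names are mine, not the project's API:

```python
def sampleFromGrid(grid, logp, rng):
    """Draw one grid value from unnormalized log-probabilities."""
    p = np.exp(logp - np.max(logp))
    return rng.choice(grid, p=p / p.sum())

def gibbs(y, iterations=10, sGrid=np.linspace(-0.9, 0.9, 19)):
    rng = np.random.default_rng()
    thetaGrid = np.arange(1, len(y) - 1)
    s1, s2 = 0.0, 0.0                  # initialize s1 and s2 to 0
    thetaSamples = []
    for _ in range(iterations):
        # sample theta from the conditional p(theta | s1, s2, y)
        lp = np.array([logPThetaS(th, s1, s2, y) for th in thetaGrid])
        theta = sampleFromGrid(thetaGrid, lp, rng)
        # sample s1 from p(s1 | theta, s2, y)
        lp = np.array([logPThetaS(theta, v, s2, y) for v in sGrid])
        s1 = sampleFromGrid(sGrid, lp, rng)
        # sample s2 from p(s2 | theta, s1, y)
        lp = np.array([logPThetaS(theta, s1, v, y) for v in sGrid])
        s2 = sampleFromGrid(sGrid, lp, rng)
        thetaSamples.append(theta)
    return np.array(thetaSamples)      # histogram approximates p(theta | y)
```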
Fortunately, the evaluation for each candidate change point is independent of the others, so the computation can be parallelized using Python's multiprocessing. I used a Bayes class to pass global variables between processes (a sketch of the parallelization follows the timing results). The tested setup:
- a 2.7 GHz / 16 GB dual-core machine
- data batch size of 100
- MCMC algorithm with 10 iterations
The total runtime is below 2.5 seconds. In production, an Intel Xeon E7-4820 with 8 cores is used, and the runtime is less than 0.5 seconds.
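For reference, a minimal sketch of how the per-$\theta$ evaluation could be spread over worker processes with `multiprocessing.Pool`; `logPThetaS` is the same hypothetical helper as in the Gibbs sketch, and the actual project wraps the shared state in a Bayes class instead of module globals:

```python
from multiprocessing import Pool

def posteriorAtTheta(theta):
    # evaluate the unnormalized posterior at one candidate change point
    # (s1 = s2 = 0 for brevity; y is shared with the workers at fork time)
    return logPThetaS(theta, 0.0, 0.0, y)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        logp = pool.map(posteriorAtTheta, range(1, len(y) - 1))
```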