This file was empty at the start of the study phase and grew with my knowledge. It is supposed to be
- A useful knowledge base for myself. Because everything is worded in a way I get, and all the related information is in one place.
- Potentially useful for future years' students of the Computer Vision lecture at ETH, to get an overview over the topics.
There could be mistakes, but I tried to mark the places where I am uncertain well.
- Italic titles mark topics whose learning is unfinished
- ==Q== and ==TODO== denote less-important uncertainties/questions and more important todos, respectively.
- Self-Q and Self-A mean I asked and answered that question myself.
Best viewed in typora. State of this file: exactly as little or much as I was prepared for the exam.
[TOC]
- Point lies on line (incidence test): dot product
- Line through two points (construction): cross product
- Intersection of two lines (construction): cross product

Are all "tests" for incidence dot products, and all "constructions" cross products? I think so, just because of dimensionality. (Tiny sketch below.)
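A tiny numpy sketch of these operations (my own illustration, not lecture code). Points and lines are 3-vectors up to scale:

```python
import numpy as np

def line_through(p, q):
    """Construct the line through two homogeneous 2D points (cross product)."""
    return np.cross(p, q)

def intersect(l, m):
    """Construct the intersection point of two lines (cross product)."""
    return np.cross(l, m)

def lies_on(p, l, eps=1e-9):
    """Test incidence of point and line (dot product)."""
    return abs(np.dot(p, l)) < eps

p1 = np.array([0.0, 0.0, 1.0])   # origin
p2 = np.array([1.0, 1.0, 1.0])
l = line_through(p1, p2)         # the diagonal y = x
assert lies_on(np.array([2.0, 2.0, 1.0]), l)

# Parallel lines meet in a point at infinity (last coordinate 0):
m = line_through(np.array([0.0, 1.0, 1.0]), np.array([1.0, 2.0, 1.0]))
print(intersect(l, m))           # third entry is 0
```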
So the line is represented by
so
In the special case of parallel lines, the intersection will have a zero third coordinate, i.e. it is a point at infinity.
Similarly to before, if we have
The same special case can not happen with lines... unless both points are at infinity. Then the first two entries of the line representing vector are zero, so it is the line at infinity.
Given a point and a slope, we want a line that the point lies on - this is given by computing the line as
To fulfill the slope part, only
So if it's
(How) can I test if a line has a specific slope?
I guess I just read it off the line's entries. But can I also do it with a multiplication? I think so. The check is if
Self-Q: What happens when a false point (intersection of two parallel lines) is propagated? E.g. a line through it and another valid point.. would give some line. How does that line look like?
Self-A: That just works out. The point of intersection is the one at infinity in the direction of the parallel lines and thus works as a slope and can be used to build a new line.
It's alright that the slope ends in zero, because that's what slopes do. It's also okay if even the resulting line ends in zero, as that is just its shift - it does not have to be normalizable to one.
Points and Planes in 3D use 4 coordinates.
Points and Planes are dual in 3d, much like Points and Lines are in 2d. A point is the intersection of three planes, and a plane is the construction from three points.
A plane can be represented by three distinct points on it; their linear combinations describe any point on the plane.
A point can also be described as the intersection of three planes: $\begin{bmatrix}\pi_1^T \\ \pi_2^T \\ \pi_3^T\end{bmatrix} p = 0$.
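A minimal sketch of that construction (my own; the point comes out as the null vector of the stacked plane rows):

```python
import numpy as np

def point_from_planes(pi1, pi2, pi3):
    A = np.vstack([pi1, pi2, pi3])   # (3 x 4), rows are plane vectors
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                    # homogeneous 4-vector with A @ p ~ 0

# The planes x=1, y=2, z=3 meet in the point (1, 2, 3):
p = point_from_planes(np.array([1., 0, 0, -1]),
                      np.array([0., 1, 0, -2]),
                      np.array([0., 0, 1, -3]))
print(p / p[-1])                     # -> [1. 2. 3. 1.]
```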
By inserting
So a plane can be either represented by three points
Homogeneous representation for 3D lines is not as straightforward as it is for 3D points and planes in 3D. A 3D line has four DoF so ideally we would use a 5d representation. That would be incompatible with the rest though. [Rephrased from (https://engineering.purdue.edu/kak/computervision/ECE661Folder/Lecture6.pdf)]
We can represent a line as a combination of two points, or as a linear combination of planes.
The two points you choose for the explicit representation should lie on the two planes you choose for the implicit representation (obviously, because they lie on the line). So they should fulfill the plane equation.
Because of this, we can go back and forth between the representations.
Notice that the order in the equation does not matter, as long as one of the factors is transposed such that the result is a 2x2 matrix and not a scalar.
Both representations result in the same line, a parametric
We can backproject lines from image space to planes in 3d space by doing
We can also treat the image point as the intersection of a line in the x-dimension and a line in the y-dimension in the image. Backproject those lines, then intersect the found planes in 3d:
$$
x = p_x \Rightarrow 0 = p_x - x \Rightarrow \ell_x = \begin{bmatrix}-1\0\p_x\end{bmatrix}\label{backproplinex}
$$
This is a vertical line, since
See also https://engineering.purdue.edu/kak/computervision/ECE661Folder/Lecture17.pdf for arguing about backprojecting points and that the camera center is weird.
Q: Is a projection matrix really always invertible by transposing it like I do here?
A: That works because for points we have $x_{2d} = Px_{3d}$, and a point in 3d is on a plane if $x^T_{3d}\pi = 0$. Combining those, every point on the plane satisfies $x_{2d}^T\ell = x_{3d}^T P^T \ell = 0$, so the backprojected plane is $\pi = P^T\ell$ - the backprojection uses $P^T$ without needing an actual inverse.
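Tiny sketch of that backprojection with a toy camera (my own illustration):

```python
import numpy as np

P = np.hstack([np.eye(3), np.zeros((3, 1))])   # toy camera P = [I | 0]
l = np.array([-1.0, 0.0, 5.0])                 # image line x = 5

pi = P.T @ l                                   # backprojected plane (4-vector)

# Any 3D point projecting onto the line also lies on that plane:
X = np.array([5.0, 7.0, 1.0, 1.0])             # projects to (5, 7, 1) -> on l
print(np.dot(P @ X, l), np.dot(X, pi))         # both ~ 0
```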
Note: Even though we can project a point with
Basically we first get the dual line representation (
Source from https://engineering.purdue.edu/kak/computervision/ECE661Folder/Lecture6.pdf
Self-Q: What does it mean in 3d when we intersect two lines that are not concurrent but also not parallel? I assume we get a point in infinity... but which one? Can I run a computation afterwards on a real point and the infinite point and get a line again? If yes, what is its direction?
Self-A: I... guess not. I would probably take three of the four plane equations in the two dual lines (plane representation) and solve the same system as above, then do it again and check if the two resulting points are the same.
I guess this is synonymous to a Conic Section (circle, ellipse, hyperbola, parabola).
A quadric would be the same, but not just a section in 2D, but in 3D (or more) I think.
A conic $C$ going through a point $x$ is defined by $x^TCx = 0$.
Source . Because the equation is independent of scale (equal to zero), it has only 5 DoF, so we only need 5 points to define the conic. Placing the conic coefficients in
and can solve it for
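A small numpy sketch of that 5-point fit (my own illustration; each point gives one linear constraint on the 6 conic coefficients, and we take the null space):

```python
import numpy as np

def fit_conic(points):                     # points: list of (x, y) tuples
    A = []
    for x, y in points:
        A.append([x*x, x*y, y*y, x, y, 1.0])
    _, _, Vt = np.linalg.svd(np.array(A))
    a, b, c, d, e, f = Vt[-1]              # conic coefficients up to scale
    return np.array([[a,   b/2, d/2],
                     [b/2, c,   e/2],
                     [d/2, e/2, f  ]])

# Five points on the unit circle recover x^2 + y^2 - 1 = 0 (up to scale):
ts = [0.1, 0.9, 2.0, 3.5, 5.0]
C = fit_conic([(np.cos(t), np.sin(t)) for t in ts])
print(C / C[0, 0])                         # ~ diag(1, 1, -1)
```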
Given a conic and a point. Denote
In the
We can define a conic with points on it
This works because with the tangent line definition
A degenerate conic is just two lines in 2d. That happens if
The degenerate conic could also be a single line (rank 1). Then
In the dual representation
We can not switch representations in the degenerate cases, because
And what are the conics good for? Calibration, it was claimed.
Another example is simply computing with them. You have a circle (conic) in an image and you want the 3d cone that contains this circle, originating from the camera center? Well, simply do $Q_\text{cone} = P^T C P$.
For projecting the quadric to a conic, they use the dual quadric though, because that one is defined by planes and backprojecting planes is "very easy" compared to backprojecting points.
For these, I like to think of a conic as a function that takes a point and checks whether it lies on the conic. So this quadric is just a wrapper that transforms the points before and after ... or the planes, in the second case.
==Q==: Can there be multiple degenerate rank-2 conics that describe two points at infinity or is that unique?
I believe they are 3d "stacks" of conics. The structure is similar:
- $X^TQX = 0$, but this time $Q$ is $(4 \times 4)$. It has $\mathbf 9$ degrees of freedom.
  - It's 9 DoF because $\text{size}(\{x^2, y^2, z^2, \Xi^2, xy, xz, x\Xi, yz, y\Xi, z\Xi\}) = 10$, but the equation is only up to scale as it is equal to zero, so we subtract one DoF.
- Like $Cx$ yields a tangent line, $Qx$ yields a tangent plane (if $x$ indeed lies on the quadric).
- As long as the quadric is not degenerate (i.e. the matrix $Q$ has full rank, i.e. $\det Q \neq 0$), a dual representation exists. It describes the quadric using tangent planes instead of points: $\pi^TQ^\star \pi = 0$
Also called isotropic points or cyclic points. They are two special points at infinity.
Any line going through them is called an isotropic line.
$$
(1, \mathrm i, 0) \text{ and } (1, -\mathrm i, 0)
$$
These lie on the "complexification" of every real circle. The complexification is obtained by taking the homogeneous circle equation and taking all complex solutions. So in other words, the circular points are a solution to every circle of the form
$$
x^2 + y^2 + 2axz + 2byz + cz^2 = 0
$$
Which makes sense, because $1^2 + (\pm\mathrm i)^2 = 0$ and the remaining terms vanish since $z = 0$.
Other solutions at infinity are the same points, up to scale.
Under Similarity Transforms (only scale, rotate, translate), they are invariant.
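Tiny numerical sanity check of that invariance (my own, not from the lecture): a similarity maps $(1, \mathrm i, 0)$ to a multiple of itself.

```python
import numpy as np

s, theta, tx, ty = 2.0, 0.7, 3.0, -1.0
H = np.array([[s*np.cos(theta), -s*np.sin(theta), tx],
              [s*np.sin(theta),  s*np.cos(theta), ty],
              [0.0,              0.0,             1.0]])

I = np.array([1.0, 1.0j, 0.0])   # circular point (1, i, 0)
img = H @ I
print(img / img[0])              # -> [1.+0.j  0.+1.j  0.+0.j], i.e. the same point
```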
We know that the rank-2 degenerate conics can be either two lines (
Applying any homography to
Angles in Euclidean Space: Normalize the vectors and then the dot product is the cosine: $$ \cos \theta = \frac{\vec u}{|\vec u |} \cdot \frac{\vec v}{|\vec v|} $$
In projective space: If we know the absolute dual conic
Lecture 2 Wednesday at 1:19:22 touched this only briefly, but the cis.upenn.edu pdf covers it on p44.
The problem with lines is that they are not oriented - so multiple angles are correct.
We take the lines
(table with figures: cross-ratio definition | visualization)
Since
This quote has been posted in the discord, but I don't think it is incredibly useful. Also, the absolute conic is imo not an identity matrix... the last row is zeros only.
Points on the Plane at infinity. This is for 3D. It has no real points and can not be observed in an image. It is described by points, so is not a dual.
The absolute conic
==Q==: I believe it describes two lines in infinity? Or is
Don't try to geometrically, intuitively understand or visualize this. It's just what rolls out of equations.
Preserving the absolute dual quadric is equivalent to having a similarity.
The absolute dual quadric has 8 degrees of freedom in general, and "the plane at infinity is a nullvector of it". (I think this means $Q^*_\infty \pi_\infty = 0$; e.g. for $Q^*_\infty = \mathrm{diag}(1,1,1,0)$ the nullvector is $\pi_\infty = (0,0,0,1)^T$.)
Similar to Angles in 2D, but planes and the quadric, instead of lines and the conic.
Always move the center of mass to the origin and scale it down to the [0,1] range. This is doable by constructing a matrix that does the opposite and inverting it. This takes the points in [0,1] range, scales them by sigma, and shifts the mean to our data mean: $$ T = \begin{bmatrix} \sigma_x & 0 & \mu_x \\ 0 & \sigma_y & \mu_y \\ 0 & 0 & 1 \end{bmatrix} ^{-1} $$
Reason for normalization: avoiding numerical issues due to big scale differences. And things like PCA would work better, but not sure that matters here.
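A small sketch of building such a normalization matrix (using the per-axis standard deviation as sigma; the exact scaling convention in the lecture may differ):

```python
import numpy as np

def normalization_matrix(pts):            # pts: (N, 2) array of 2D points
    mu = pts.mean(axis=0)
    sigma = pts.std(axis=0)
    T_inv = np.array([[sigma[0], 0.0,      mu[0]],
                      [0.0,      sigma[1], mu[1]],
                      [0.0,      0.0,      1.0  ]])
    return np.linalg.inv(T_inv)           # maps the data to roughly unit range

pts = np.random.rand(20, 2) * 100 + 500
T = normalization_matrix(pts)
pts_h = np.hstack([pts, np.ones((20, 1))])
normed = (T @ pts_h.T).T
print(normed[:, :2].mean(axis=0), normed[:, :2].std(axis=0))  # ~0 mean, ~1 std
```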
- Calibration: how many measurements do we need, and why? Each measurement yields two constraints, so we need 3 measurements to fill the 5 DoF.
  - But why do we have 5 DoF? That is just the number of unknown variables in the pinhole calibration matrix $K$.
- Fundamental Matrix
  - Equation for mapping points: depends on convention. $x'^TFx = 0$ is fulfilled for all corresponding points, but also for other points that happen to be on the same epipolar line.
  - DoF and shape of the Fundamental Matrix: $F$ has 7 DoF and $E$ has 5. I think I have put the arguments about this further below.
  - ==Q== 2016: Why do we need eight correspondences to extract the fundamental matrix when the matrix only has seven degrees of freedom? I guess because a linear solution can not enforce the determinant constraint.
- Affine, Similarity, Projective Transformations
  - Under which of these groups are SIFT descriptors invariant? A: under similarities only, not affinities, not projectivities. ==Q==: Why?
The focal length
The projection works via similar triangles:
If we want some translation that is added onto the 2d point, we can do that by adding it times Z (because of the impending implicit division by Z). Since the fourth column is zeros anyway, we can apply this also to non-homogeneous 3d points.
We can also add rotation and translation before the projection:
$$
\begin{bmatrix} R & \overbrace{-R\tilde C}^{t} \\ \begin{matrix}0&0&0\end{matrix} & 1 \end{bmatrix}
$$
Notice that
What is the point of the principal point $(p_x, p_y)$ that shifts things around after the projection?
It's wherever the axis from the center of the lens happens to be in the image. That is usually close to the image center, but not exactly. A slight shift of a millimeter can be calibrated using this.
How can focal length be different for
Well, the focal length
Focal lengths have to be positive.
A slightly different camera model: a spherical lens is relatively easy to produce, but the incoming rays are slightly distorted - in a pinhole camera they would not be. The ray exiting the lens might have a slightly different direction based on the incoming angle. So the further away from the image center, the more distortion we have. That distortion is typically radially symmetric. We fix it by adding a nonlinear function between
- Iso-metric = same distance. Preserves the Euclidean distance.
- Preserves orientation or reverses it (mirrors, if $r_{11}, r_{21}$ are multiplied by $-1$)
- Preserves areas and lengths
- Is basically a rotation and a translation, so in 2d it has 3 DoF: $(\theta, t_x, t_y)$. As a map: $x \mapsto Rx + \vec t$
- There is scaling: lengths are not preserved, but their ratios are.
- The circular points are invariant
- As a map: $x \mapsto s\cdot R x + \vec t$
- The linear part need no longer strictly be a rotation + scale: $x \mapsto Ax + \vec t$
- Parallel lines remain parallel
- (not sure if lines that intersect remain necessarily intersecting? I believe they do, because the projectivity also features that.)
- Ratios of parallel lines' lengths are preserved (e.g. midpoint remains in the middle)
- different directional lines can be squashed differently
- Linear combinations of vectors are preserved - I.e. the centroid is still the centroid after the affine transform.
- The line at infinity remains the line at infinity. Not the points on it though, those might shift density.
The most powerful / least constrained of them all.
- preserves concurrency: lines that intersect remain intersecting
- preserves collinearity: points on the same line remain collinear after the projective transform
- preserves order of contact (how strongly things touch):
- 0th: a crossing, not tangent
- 1st: the two curves are tangent
- 2nd: the curvatures are equal in that point (e.g. circle in curve)
- 3rd: 2nd derivative is equal too
- 4th: 3rd derivative (2nd derivative of the curvature) is equal too
(figure: examples of 1st, 2nd, and 3rd order of contact)
- Also called Homography
If we map two points, we get the new line. Thus
See
Note that this homography matrix is applied on the same side as it is for the line, but transposed.
The points fulfill
The lines fulfill
When we have a plane that touches the quadric, the image of that will also be tangent in the projections (images). This actually gives you a constraint because that's not obvious.
In 3D, the homographies are now $(4\times4)$ matrices.
Same hierarchy: what are the invariants ("promises") in 3d?
Instead of a line at infinity, we now have a plane. It is invariant under affine transforms. It has a
The circular points are now instead a full absolute conic, and invariant under similarities. It is in the plane at infinity. All spheres intersected with the plane at infinity have this complex conic.
The plane at infinity is invariant under Affinities, just like the line at infinity is in 2d. The absolute conic is invariant under Similarities, just like the circular points in 2d.
Two planes are parallel in 3d if the line of intersection lies in the plane at infinity.
Self-Q: what does that look like in the numbers? Is the first coord of the line zero, or what?
Self-A: No! Lines in 3D are stacked planes or stacked points. So we would simply compute the line by representing as stacked planes, then solve for the null-space
In 3d, the matrices are $(4\times4)$.
Like projectivity, but the lowest row is fixed to $(0,0,0,1)$. So it has 12 DoF instead of 15.
Affine projections preserve the center of mass. This can be useful to find a robust "common point" among multiple views.
Like Affine, but
Just a rotation matrix
A rotation always happens in a plane. In 3d it also happens around an axis, but that is misleading. In $N$d, we want to count the number of distinct orthogonal planes to rotate in - so $\binom{N}{2} = \frac{N(N-1)}{2}$.
So in 2d there is $1$, in 3d there are $3$, and in 4d there are $6$.
We want to find
Since
Q: What about the possibility that we find an orthogonal point instead of the same point? Isn't that possible in 2d, in homogeneous coords?
Self-A: For the cross product to be zero,
But how do we "simply factor them in"? How do we algebraically reach the big matrix with the constraints?
From Hartley, the book, page 89:
In my own words:
We take a look at each row of the projection matrix separately and write out the full matrix of the cross product. We notice that each entry of
It looks like they have 3d points in non-homogeneous coordinates but 2d points in homogeneous ones.
Self-Q: Why don't we use the dot product instead of the cross product? That should also represent the angle, no? Self-A: I guess because we need to normalize it then, which is more effort?
Q what was that about optimization requiring that the
- All Points on line through camera center: ambiguous
- All Points on one plane: no info on how to deal with points not on that plane
- All Points on a twisted cubic curve including the camera projection center
The not-so-special case of those: if you take a picture of a wall (a plane), you get only 8 of the 11 DoFs with this linear method. "But you could place the camera anywhere and adjust the intrinsics and you will always get a solution."
Start with Linear Solution, then optimize the geometric error using maximum likelihood.
- Normalize all points (mean, variance) in 2d and 3d (call these normalizations $U^{-1}$ and $T^{-1}$)
- Direct Linear Transform
- Minimize the sum of squared distances in the projected plane: $\underset P{\arg\min} \sum{d(x_i, PX_i)^2}$
- Once a good $P$ was found, undo the normalization by applying $U$ and $T$ on the correct sides of the found $P$: $P_\text{final} \mathop{:=} U \cdot P_\text{found} \cdot T$
Sometimes it can make sense to choose the distance in the world instead of in the image. Or use mahalanobis distance to mix both.
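A minimal sketch of the DLT core of this algorithm (my own; normalization and the geometric refinement are left out, and the constraint rows follow the usual cross-product construction):

```python
import numpy as np

def dlt_projection_matrix(X3d, x2d):
    """X3d: (N,4) homogeneous 3D points, x2d: (N,3) homogeneous 2D points."""
    rows = []
    for X, (x, y, w) in zip(X3d, x2d):
        rows.append(np.hstack([np.zeros(4), -w * X,  y * X]))
        rows.append(np.hstack([ w * X, np.zeros(4), -x * X]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 4)            # P up to scale

# Toy check: recover a known camera from noiseless correspondences.
P_true = np.array([[800., 0, 320, 10], [0, 800, 240, 20], [0, 0, 1, 1]])
X = np.hstack([np.random.rand(6, 3) * 4 + 1, np.ones((6, 1))])
x = (P_true @ X.T).T
P_est = dlt_projection_matrix(X, x)
print(P_est / P_est[-1, -1] - P_true / P_true[-1, -1])   # ~ zeros
```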
The calibration matrix
Two planes with squares on them: perform Canny edge detection and straight line fitting to get detected edges. Then intersect them to find the corners of the squares. Finally use that to apply the Gold Standard Algorithm and find the camera parameters.
We can often assume that the skew is zero, pixels are square, the principal point is known. Clamping the values like that can of course only increase the error compared to unconstrained solving.
Projecting the dual absolute quadric into the image starts out as $\omega^* = P\, Q^*_\infty\, P^T$.
So if we could find the absolute conic in the images, we could find $K$ (since $\omega^* = KK^T$).
In real life, we would have calibration grids again, to make the detection more robust. But once detected, we think of it as just three squares at different perspectives in 3D.
For each of our squares in 3d, we compute the homography that we can now apply to any points, including the two circular points. That is, we map the square to
To my understanding in other words: We could also take the three planes of the squares, intersect each in 3d with the plane in infinity, and do the same thing.
This is Roughly Zhang's calibration method. The advantage is that you don't need to know your 3d object locations precisely - easier to carry around and reassemble.
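My own rough sketch of that idea (assuming we already have the square-to-image homographies `H_i`; the helper names are made up, and the sign/Cholesky handling is the usual convention rather than lecture code). The imaged circular points are $h_1 \pm \mathrm i h_2$, each giving two real constraints on $\omega = (KK^T)^{-1}$:

```python
import numpy as np

def omega_constraints(H):
    h1, h2 = H[:, 0], H[:, 1]
    def v(a, b):   # row expressing a^T omega b in the 6 parameters of omega
        return np.array([a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[1]*b[1],
                         a[0]*b[2] + a[2]*b[0], a[1]*b[2] + a[2]*b[1], a[2]*b[2]])
    return [v(h1, h2), v(h1, h1) - v(h2, h2)]   # Im and Re parts of the constraint

def calibrate(homographies):                    # needs >= 3 homographies
    A = np.array([row for H in homographies for row in omega_constraints(H)])
    w = np.linalg.svd(A)[2][-1]                 # null vector = omega up to scale
    omega = np.array([[w[0], w[1], w[3]],
                      [w[1], w[2], w[4]],
                      [w[3], w[4], w[5]]])
    if omega[0, 0] < 0:                         # fix the arbitrary sign
        omega = -omega
    L = np.linalg.cholesky(omega)               # omega = K^{-T} K^{-1} = L L^T
    K = np.linalg.inv(L.T)                      # hence K = L^{-T}, up to scale
    return K / K[2, 2]
```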
Nice writeup, thanks for sharing! To my understanding though, the method doesn't use the inverse homography to get two points on the absolute conic in 3D space, but instead fits the imaged absolute conic to the images (now in 2D space) of the circular points in 3D. This is because if we had the absolute conic in space, that would still not give us P (just \Omega^*), so we need the image of the absolute conic, from which we can then compute K as described in the slide I posted above. Makes sense that we need to know the coordinates of the corners in 3D! Maybe we can assume that the corners map to (0,1), (1,0) etc. everytime because it's a simple translation transformation between the three different projections that would not change the fact that the imaged circular points lie on the imaged absolute conic everytime (but this has the nice property that the points would be different due the transformation, so we get the 6/5 distinct points we need instead of only 2 which we'd get if we would use a single coordinate system for all three planes)? https://discord.com/channels/910462893387030569/910462893387030572/931144853620920390
My current understanding, which may well be wrong, is: We have 3 planes in 3D, each of which has 2 corresponding circular points and each of which has a square with four corners in it (see slide). However, we don't know the location of the corners in 3D. Because we know the corners of each square lie in a plane, for each square we can come up with its own coordinate system and assign some arbitrary, yet consistent coordinates to the square's corners such that Z=0 (on a plane). We can see the three squares as essentially one square being transformed in 3D space. While the coordinates change, the plane at infinity stays fixed (because the plane at infinity will be fixed under anything that's an affinity/similarity/Euclidean transform). Also, and this is important: the absolute conic stays fixed (not pointwise, but as a set) under those transformations, by the "proof" I wrote above. Then, for each square (plane), we can compute a homography between the corners of the square and the image points (0,0), (1,0), (0,1), (1,1). Essentially, we have to compute a projection matrix P again. (So if we were to use e.g. DLT, we would need 6 point correspondences, right? Why do 4 suffice then?). The circular points of each plane are points that are on the absolute conic, and that will remain on the absolute conic again under H (by above "proof"). So after imaging all 3·2 = 6 circular points to the camera plane using the 3 homographies, we would obtain 6 points on the absolute conic in the image plane. Then we can fit a conic through these points and proceed with a Cholesky decomposition of the conic matrix of the imaged absolute conic (see slide, but there \Omega^* is the dual absolute conic, so we need to invert). This will give us K^{-1} K^{-T}, and from there it's just a matrix inversion to K. https://discord.com/channels/910462893387030569/910462893387030572/932969058045919302
Perhaps I also want to watch the 40-minute video by Stachniss.
Goal: Find camera intrinsics
Approach: perform DLT on at least six known points. Each one gives two constraints, so we have the five DoF constrained that the
We do so by decomposing the found $P$ (e.g. with an RQ decomposition into $K[R\mid t]$).
Nonlinearities such as radial distortion are not contained in $P$ and have to be handled separately.
The algebraic error $x \times PX_i $ is zero when perfect, but otherwise hard to interpret. The geometric error in the projected plane has a clear interpretation.
The issue here is that $\ell,X,P$ are all in projective space (i.e., in homogeneous coordinates) and therefore defined up to scale. You could arbitrarily scale any of the components and the math still checks out, but if the constraint is not perfectly fulfilled, the residual is scaled as well. Therefore the residual that you get from this has no direct geometric meaning; you just know that it will vanish if everything fits perfectly.
- Marcel Geppert Jan 2022
We know `M = P[:3,:3]` for this. We had
The QR-Decomposition creates an orthogonal
The camera center
See also: Slides about camera matrices
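A small sketch of that decomposition using scipy's RQ (my own; the sign conventions are the usual ones, not necessarily the lecture's):

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    M = P[:, :3]
    K, R = rq(M)                        # M = K R, K upper triangular
    D = np.diag(np.sign(np.diag(K)))    # force positive focal lengths
    K, R = K @ D, D @ R                 # D is its own inverse, so K R is unchanged
    if np.linalg.det(R) < 0:            # make R a proper rotation
        R = -R                          # fine: P is only defined up to scale/sign
    C = -np.linalg.inv(M) @ P[:, 3]     # camera center (non-homogeneous)
    return K / K[2, 2], R, C
```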
Knowing the point in one image, it must lie on a line in 3D. Thus, we already know that the point must lie on the projection of that line in 2d on the second image, even if we know nothing about the point in 3D.
If we project camera 1's center onto the second image, it lies at some point. Every 3d line we get from image 1 will go through its camera center, so every such projected line goes through that point.
We call this point on the second image the epipole, and all these lines are epipolar lines (for specific points).
Given some point
The epipole itself does not yield an epipolar line - its 3d line collapses into the epipole on the other image.
If you only move sideways from image 1 to image 2, the line connecting the two cameras will not intersect the image plane. Then the epipole is at infinity.
The plane connecting the two cameras and the point in image1 will intersect image plane 2 in a horizontal line - that is the epipolar line then.
If you move straight forward, the line through both cameras does intersect the image plane... in the middle (kinda. Maybe not, but wherever the camera is.). Each point gets an epipolar line through the epipole, so it's a star-like pattern if you visualize them. But actually we know that the point will not just switch sides, so it's only a half-plane.
In the case where the camera is stable and the object only translates, the object points in 3d all moved in the same way. Connecting where they were and where they are now, we get parallel lines. Those lines hence meet at infinity in 3d. If the camera moves and the object is fixed, the motion is the same in the other direction, so the baseline connecting the two camera centers is also parallel.
But on the projection, these lines meet in some vanishing point. The baseline is parallel, so it must also intersect that point. We earlier defined the epipole as the point where the other camera is projected to ... so the epipole certainly lies on the baseline. The vanishing point certainly lies on the image plane. Since all parallel lines meet at a single point (infinity), the epipole must be equal to the vanishing point. So if we know 2 point correspondences, we can intersect their "parallel" lines and find the epipole.
We know that a homography from one image to the other must exist, because we can always map from image1 to a plane and from there to image2, and homographies are transitive.
We want a specific kind of homography that "maps to a plane" in the sense that the epipolar line is mapped to the epipolar line. So we require
So we can for every point say
So the fundamental matrix maps a point from image1 to a line in image2.
This also means that
$F$ is the unique $(3\times3)$ rank-2 matrix that satisfies $x'^T F x = 0$ for all corresponding $x, x'$.
(But other points on the line
We can find the epipole when we know the fundamental matrix, because $e'^T[e']_\times = \vec 0_{1\times3}$. So
But what is the fundamental matrix good for?
Once I know
So given
This section is not actually useful. It is just a justification for how epipolar lines work and that the homography can be represented by a single matrix $H$. At least that is my impression after being confused for a while.
I am still confused what the exact point to be made is though. I think the argument stated the intersection between the plane and the epipolar line was key.. but the plane can never cover all points?? Perhaps the argument was just to say something obvious??
We choose a point
This only works if the epipolar line is not equal to
The homography maps from 2d to 1d (from plane to line) so it is rank deficient.
This pollefeys definition is, surprisingly, consistent with wikipedia.
- If $F$ is the fundamental matrix for two images, then $F^T$ is the fundamental matrix for the opposite direction. (So I guess $F$ is orthogonal?)
- The epipolar line in the other image is found as $\ell' = Fx$
- Epipoles are where all epipolar lines are - so $e'^TFx = 0\ \forall x$. That means that $e'^T F=0$. Similarly, by varying $x'$ instead we get $Fe=0$. So we can find the epipoles as the left and right nullspaces of $F$
- $F$ is rank 2, so not invertible. It maps from a point to a line.
- $F$ has $7$ DoF, because it is $(3\times3)$ but loses one to scaling and one to being rank 2.
https://www.robots.ox.ac.uk/~vgg/hzbook/hzbook2/HZepipolar.pdf https://studfile.net/preview/16498100/page:12/
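Small sketch of reading the epipoles off $F$ with an SVD (my own code):

```python
import numpy as np

def epipoles(F):
    U, S, Vt = np.linalg.svd(F)
    e = Vt[-1]          # right null vector:  F e   = 0  (epipole in image 1)
    e_prime = U[:, -1]  # left  null vector:  e'^T F = 0 (epipole in image 2)
    # dehomogenize (fails if an epipole is at infinity):
    return e / e[-1], e_prime / e_prime[-1]
```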
We defined
So
This contains only
If we have
If we apply a homography to the camera and the inverse homography to the points, this leaves the image points unaffected. The fundamental matrix only relates image points and is hence also unaffected.
This means though, that given just a fundamental matrix, we have a projective ambiguity (at least; and in fact at most a projective one): There is no unique pair of
Why is the epipole now present in the
The basic idea here was simply that
I didn't actually understand every detail here. But the rough idea should suffice.
Make sure the data is normalized to avoid as many numerical issues as we can avoid.
We could use the eight-point-algorithm: Solve for the 9-vector using least squares on 8 points, then enforce rank = 2. But it is better to use fewer points and get an exact solution - because interpolation of any kind can be misleading here.
We have 7 DoF and hence need 7 point correspondences. Then we use
We could simply use the SVD to deal with the overdetermination (in the 8-point case), but that would optimize the Frobenius norm / algebraic error, not the geometric one.
Instead we solve for the 9-vector in the two-dimensional null space, $F = \alpha F_1 + (1-\alpha)F_2$, and pick $\alpha$ such that $\det F = 0$. This satisfies the seven constraints exactly, and is rank 2.
Alternatively, the two solutions can also be multiplied like
One advantage of using only seven points is that RANSAC has better chances of selecting seven inliers than of selecting eight inliers.
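For reference, a hedged sketch of the plain normalized eight-point variant (my own code; the seven-point cubic step is not included, and the normalization is a simple whitening):

```python
import numpy as np

def normalize(pts):                       # pts: (N, 2)
    mu, sigma = pts.mean(0), pts.std()
    T = np.array([[1/sigma, 0, -mu[0]/sigma],
                  [0, 1/sigma, -mu[1]/sigma],
                  [0, 0, 1]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ pts_h.T).T, T

def eight_point(x1, x2):                  # (N, 2) matching points, N >= 8
    x1n, T1 = normalize(x1)
    x2n, T2 = normalize(x2)
    # Each correspondence gives one row of the constraint x2^T F x1 = 0.
    A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1n, x2n)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)           # enforce rank 2
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    return T2.T @ F @ T1                  # undo the normalization
```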
The exam from 2020 claims that
Working from some equations I found in my notes: $E = (K')^T F K$, $E = R[t]_\times$. We can say $F = K^{-T} E K^{-1}$ (since the $K$ is the same in our exam question for both images). And then insert the $E$ to get $F = K^{-T} R[t]_\times K^{-1}$. See
In the end, the results should be equal because the cross product applied on the left vs on the right is just a difference in sign, and since everything is always only up to scale... that should not matter.
Fundamental Matrix has more DoF (7 instead of 5). Essential Matrix suffices if we already know the camera intrinsics.
The fundamental matrix takes image coordinates as inputs, the essential matrix takes normalized points. Because the cameras are normalized, their coordinate systems are related by just a rotation and translation.
Wikipedia has the second image first. This is consistent with
$$ E = (K')^TFK, \qquad y'^T E y = 0, \qquad E = R[t]_\times\label{eqwiki1} $$ https://cs.stackexchange.com/a/135185/59764 https://en.wikipedia.org/wiki/Essential_matrix https://www.youtube.com/watch?v=auhpPoAqprk
(It seems like pollefeys defined the essential matrix the other way around though!)
Just like the fundamental matrix...
- $\ell' = Ep$ is the epipolar line on the second image
- $E:\; x \mapsto \ell'$, $E^T:\; x' \mapsto \ell$ (it maps points to epipolar lines)
- $\ell = E^Tp'$ is thus the epipolar line on the first image
- A point $x'$ lies on the line $\ell'$ if $\ell'^T x' = 0$, equivalently $x'^T\ell' = 0$. So we have $x'^T E x = 0$ for all correspondences $(x, x')$.
- $F$ has $7$ DoF, $E$ has $5$. $E$ has two less because of the constraint that two (real, because singular values are always real and non-negative) singular values must be equal. The second DoF is maybe to make the equal ones nonzero?
Inconsistent with the fundamental matrix definition.
$$
E = K^T F K' \qquad y^TEy' = 0,\qquad E = [t]_\times R
$$
For the essential matrix, we have
The properties are similar to the fundamental matrix except for the last one, which is new:
- Essential Matrix has two equal nonzero singular values. This is the additional constraint compared to the fundamental Matrix.
It has 5 DoF, so 5 points are enough. But there are 9 unknown values, so a 4-dimensional nullspace. We choose scale
So in the end, if you have a calibrated camera, you need only 5 point correspondences but a 10th-degree polynomial to solve. But it's quite efficiently computable.
Both Wikipedia and Pollefeys agree that
We have a general
If we now use the usual forms for the projections:
The epipole
The pseudoinverse of
The reason the pseudoinverse is that way is that it is a right pseudoinverse: we are okay if the backprojection is potentially ambiguous, but `P backproj(x)` should equal `x`. So `backproj` is a right pseudoinverse:
Known:
- Normalize the images by left-multiplying with $K^{-1}$. This leaves us with only rotation and translation parts.
- Set up a constraint matrix equation where $E$ is written as a vector with 9 elements. We only take five correspondences, since the number of DoFs is $5$. Five equations from $x'^T E x = 0$.
- Solve $A \vec E = 0$ by performing SVD and taking the singular vector of the smallest singular value (last row of $V^H$).
- Enforce the internal constraints of $E$, namely that the first two singular values need to be equal and the third one zero: since $E$ is up to scale, we can choose the two equal singular values arbitrarily. So we reconstruct the matrix form by forcing the singular values to be $1, 1, 0$ (sketched in code below): $$ \begin{align*} U, S, V^H &\mathop{:=} \mathrm{SVD}(\vec E)\\ E &\mathop{:=} U \,\mathrm{diag}(1, 1, 0)\, V^H \end{align*} $$
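The SVD steps as a tiny sketch, following the simplified recipe above (my own code; `A` is the stacked constraint matrix with one row `np.kron(x2, x1)` per normalized correspondence — the full five-point solver with the 10th-degree polynomial is more involved):

```python
import numpy as np

def estimate_E(A):
    E = np.linalg.svd(A)[2][-1].reshape(3, 3)   # smallest right singular vector
    U, S, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt    # enforce singular values (1,1,0)
```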
Demazure 1988, proof here on pdf page 50.
We have three images. We know the point on two of them. We can construct a plane going through the third camera as well as the known image point and its camera. Then intersect that with the third image plane to get a line. Do it again with the other known point. Intersect the lines. And we find the third image point.
But if the 3d point lies on the same plane as the three cameras, the projection gives the same line twice, which is bad.
However, we can construct the 3d point from the two known image planes by constructing the ray going through the image point and the camera. Then project that point onto the third image. So we can solve this problem with the same data.
If we know one image position, we can backproject the point into 3d space by representing it as two lines, backproject those, and stack the resulting planes.
$$
\begin{bmatrix}\ell_x^T P \\ \ell_y^T P\end{bmatrix} X = 0
$$
If we know more images (e.g. with projection
We use linear algebra and determinant definitions to reduce this (I hope this is not relevant for this exam). After several steps of linear algebra we get determinants where some elements are repeated, so that determinant is zero.
We end up with a linear equation (linear in each view's 2d point coordinates. Never
We actually just rederived the epipolar geometry.
(My interpretation: This equation basically is the same as saying we want to construct a matrix
To do this with more than two views, we only take the
We get something still linear but now there are three views combined, so we get 3x3x3=27 coefficients. This can solve the trifocal problem directly. For four views, it would be $3^4 = 81$ coefficients.
The four-view approach is conceptually nice but not very usable because we need to see around 15 points per view and solving it is slow.
With more than three views, we perform iterative algorithms.
When
The Moore-Penrose Inverse for the left side can be computed as
With affine cameras, this is supposedly easier and amounts to normalizing the data, performing the SVD, and taking the closest rank 3 approximation.
Without this simplification, we did the following:
- We know camera intrinsics and have a lot of pictures
- Take two images and normalize their points by applying $K^{-1}$. Then estimate the Essential Matrix using the constraint matrix from the lecture.
- Decompose $E$ into four possible poses and keep the one that has the most points in front of the camera when used to triangulate
- Triangulation works by taking the matching 2d point from both images and applying the projection backwards, intersecting the two lines (see the sketch after this list). Knowing $K, R, t$ we can compute $P$ for each image. We know that $Pp^{3D} = p^{2D}$, but only up to scale, so if we add a $\lambda$ we have (with $P_1, P_2, P_3$ denoting the rows of $P$) $$ P_1X = \lambda x ,\qquad P_2 X = \lambda y, \qquad P_3X=\lambda $$ and creating the combination of the first two with the third equation can be done by inserting the third of these equations as the $\lambda$, so $$ P_1 X = P_3 X\, x, \qquad P_2 X = P_3 X\, y $$ For triangulation, we combine the equations from two images as constraints on $X$: if a vector satisfies this, every scaling of it will also satisfy it, so we look for the unit vector here. When you found an $X$, you can reweight it to minimize the geometric error.
- We filter out points that are obviously wrong (behind the camera in 3d space)
- For each reconstructed 3d point, we make sure both image points are associated with it. These are the "correspondences".
- Add more images by estimating each new image's pose using the 3d points, then triangulating all its points back using each of the already registered images, producing more correspondences. For this estimation, we again work in the normalized image plane, so we left-multiply by $K^{-1}$ and solve the constraint matrix for the projection vector. But this time against the 3d points.
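The linear two-view triangulation step from that list as a small sketch (my own code):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3,4) cameras; x1, x2: (x, y) image points. Returns the 3D point."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],       # rows built from  x * (P3 X) = P1 X  etc.
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]      # unit null vector of the 4x4 system
    return X[:3] / X[3]
```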
We simply computed the vector as per lecture slides. But why, again? Well, we just took the equation
We know
So we decompose
Goal: Find the line in the image.
Approach: For every point, consider every possible line, and increase that lines' counter if it goes through that point. To make this reasonably possible, we use bins instead of continuous parameter space.
If we take the line equation
We go through all possible values
Problem: The parameter space is unbounded.
If the points were only almost on the same line, there will be similar parameters in the parameter space, but not the same one. So maybe we want to blur the results a bit by also counting the adjacent grid entries.
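A minimal accumulator sketch (my own code) using the polar $(\theta, \rho)$ parameterization that comes up a bit further below; bin sizes are chosen arbitrarily here:

```python
import numpy as np

def hough_lines(points, img_diag, n_theta=180, n_rho=200):
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)      # one rho per theta
        rho_bin = np.round((rho + img_diag) / (2 * img_diag) * (n_rho - 1)).astype(int)
        acc[np.arange(n_theta), rho_bin] += 1               # vote
    return acc, thetas

# The brightest accumulator cell corresponds to the dominant line:
pts = [(t, 2 * t + 1) for t in range(50)]                   # points on y = 2x + 1
acc, thetas = hough_lines(pts, img_diag=150.0)
print(np.unravel_index(acc.argmax(), acc.shape))
```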
- Lines that have the same angle are parallel in image space. In parameter space they are thus bright spots at the same $\theta$ coordinate.
- A square has four lines (two parallel pairs), so at two $\theta$ values there are two bright spots each. The distance between these line pairs is the same.
- A circle has a lot of lines at the tangents, but the lines crossing the circle have two intersections. The lines that are represented by brighter spots are the ones that were encountered more often while looping over all the circle points. The tangent lines are only generated by one point, the crossing (secant) lines by two points - that's why they are brighter.
Self-Q: What?! Shouldn't the tangent be dimmer? A line $(a,b)$ is bright if many points vote for it. The tangent line has only one point... Answer by Crimson: Perhaps they mean a circle with a thickness of more than one pixel. Then there are "tangential" lines that go through more than just two pixels.
- Self-Q: how to go from the circle visualization to actually finding the circle center and radius?
- Self-A: It's probably easier to do a 3d hough transform instead, mapping circles to planes. See https://en.wikipedia.org/wiki/Circle_Hough_Transform
This section is still confused and would need some revisiting
The goal of this paper was that parallel lines generate vanishing points in the third view. See here for more, because I did not really get this. This quote from there explains it better than I do below.
After the first Hough transform (upper row in the figure below), the features (or "peaks") that can be detected are straight lines (as in the case of a classical Hough transform). After some filtering (leaving only the detected peaks intact), we can apply a second Hough transform. This brings us back to normal XY-coordinates. In this space, one can see which lines have been detected in the previous step (with the original image re-emerging in the first subspace). Now peaks can be found at the intersections of several of the lines detected in the previous step. For instance, a clear peak is found in the third subspace. This corresponds to the vanishing point of the vertical direction. Beside this special point, one can also find the vanishing points of the different horizontal directions, where fewer but longer (i.e. stronger) lines intersect. After a third Hough transform, we again get the peaks found after the first Hough transform. This is easy to understand, since the intersection points of different lines intersecting one single line are of course all collinear on this line. However, we can also distinguish a new peak, in the third subspace, where four different lines intersect. This is the line on which all the horizontal vanishing points lie, or in other words: the horizon of the image. Of course, one can go on and take a fourth and even a fifth Hough transform. However, the physical interpretation of the features found in these spaces seems less evident. Same Source but archived
If we have a point where a lot of lines intersect, it is likely a vanishing point because normally you have around three lines only intersect in a point in an image. We do the hough transform a few times forth and back, to find the vanishing points in infinity.
First we find the lines in the image by using a hough transform. Then we map those lines back to the image space (split into three subspaces too. Explained below) to see them.
For instance, a clear peak is found in the third subspace. This corresponds to the vanishing point of the vertical direction. Beside this special point, one can also find the vanishing points of the different horizontal directions, where fewer but longer (i.e. stronger) lines intersect.
Similarly, we find the horizontal vanishing points in the second subspace (middle of the image).
and we can go one step further with another step from the lines to pointspace and back to linespace. This gives us the third row of images, where we find the horizon - i.e. the line where all parallel lines' intersections lie.
Note in this image that we think of the second subspace as the cube walls left and right from us. And the third is above and below us. We can use each subspace as two walls because we don't care about the direction of lines.
One point
So we can apply one hough transform after another: The first one maps the points in xy-space into lines in ab-space. But then we can take each point of the line again and continue to find patterns in there.
If we use the polar parameters
We hence need an alternative approach that maintains the symmetry but is also bounded in parameter space. We keep the equation
- If $|a|\leq 1$ and $|b| \leq 1$: everything is fine the straightforward way. We have one parameter space where we map this case to. So for a point $(x,y)$ we draw the line in ab-space of which each point $(a,b)$ represents a possible line in xy-space that goes through $(x, y)$.
- Case where $|a| \gt 1$:
  - If $a$ gets too large, we use the axis $\frac{1}{a}$ instead. That constrains the graph in this second parameter space to lie in $\left]{-1}, 1\right[$. If $|a| \ge |b|$ then their fraction $\frac{b}{a}$ is always in that range too.
  - To cover the remaining cases, $|a|\gt 1 \text{ and } |a| \lt |b|$: $b$ is getting too large, even more so than $a$. So we choose like before: one axis as $\frac{1}{b} \in \mathopen]{-1}, 1\mathclose[$ and the other as $\frac{a}{b} \in \mathopen]{-1},1\mathclose[$.
So we end up with three subspaces that together are the ab-space. Since each of those spaces is bounded, we can for a given point
==Q==: Why are they still straight lines though? They were deformed now - and not both axes in the same way..? Or if they are no longer straight, why is this not a problem but the cosine was?
My questions in discord:
Did someone understand this fully? I also read the explanation at https://homes.esat.kuleuven.be/~tuytelaa/CHT.html but I don't understand it fully. a) Why is it important that the symmetry is given? What is the problem with the polar representation where all the possible lines for a point (x,y) do not lie themselves on a line in the parameter space (They lie on a cosine instead)? b) I understand that parallel lines meet in their vanishing point, and by introducing a representation that can deal with infinities, we can find the vanishing point in the second or third subspace. We need a hough transform for this because the vanishing point did not actually exist in the original image, so we transform once to the point representation and back to end up with infinitely long lines that do actually meet there. But I do not understand how transforming a second time to the point space and back would give us the horizon. The horizon is the line of points where parallels intersect, I believe. Is the point simply that doing another hough transform forward and backward will generate a line that connects the two horizontal vanishing points we found, and this is the horizon? And because the lines that go through the vanishing points are very long, they are weighted very strongly? c) Why do we need any hough transforms for this? It sounds like it would be easier to find the vanishing points in some more direct way, at least after the first step of finding all the lines?
Say we found a line using the Hough transform. Then that line is based on where the grid tiles' middles were, so there is some inaccuracy. We could solve that with a least squares fit of the involved points, but then we only minimize the error in the y direction. If the error in both directions is relevant, we want to minimize the orthogonal distance.
This is "Total Least Squares", instead of "standard least squares".
Start at some point and take more points to fit a line to until the new point is too far away from our line. Then restart with a new hypothesis from that point on.
Hough Transform is a nicer solution.
Requires the number of lines to be chosen beforehand.
We assign randomly which point belongs to which "cluster". In this case, the clusters are lines that we want to assign the points to. We fit the lines to the clusters and then measure the residual for each point and re-assign it based on that.
Repeat until convergence.
If we use a squared error, one outlier can do a lot of harm. That is because least squares "assumes a gaussian" (todo: find out where that was said in the lecture and understand it again. It was a cool fact presented by Siyu Tang.), so the probability of far-away points declines like the exponential of a squared distance.
Short answer by Sextus Empiricus on StackExchange:
Multiplying Gaussian probabilities results in a sum of squared residuals in the exponent. We see the least squares in the log likelihood.
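In symbols (my own one-liner, assuming i.i.d. Gaussian noise with a fixed $\sigma$):
$$
\hat\theta = \arg\max_\theta \prod_i e^{-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}}
= \arg\max_\theta \left(-\sum_i \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right)
= \arg\min_\theta \sum_i \left(y_i - f_\theta(x_i)\right)^2
$$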
To make K-Means more robust we optimize some error measure that does not explode for points that are far away:
$$
\frac{u^2}{\sigma^2+u^2}
$$
The larger
This works with a few points that are off.
If you minimize the median, you are assuming at least half the points are correct!
Similar to k-Means, except it's not just about binary labels: We talk about probabilities. "How likely does this point belong to our hypothetical line?"
The point can either come from noise or be generated from the line. We assume that if the point is indeed generated from the line, it follows a gaussian distribution around the line - further away means less likely. The noise is uniform over the whole image. $$ \begin{aligned} P(\text{point} \;|\; \text{line and noise}) &= P(\text{point} \;|\; \text{line})\,P(\text{point comes from line}) + P(\text{point}\;|\;\text{noise})\,P(\text{point comes from noise}) \\ &= P(\text{point}\;|\;\text{line})\,\lambda + P(\text{point}\;|\;\text{noise})\,(1-\lambda) \end{aligned} $$
We don't have to look in detail at the math here
The algorithm steps are basically these two, repeated:
- Given some assignments, choose the most likely lines. Consider the points' assignment probabilities as weights.
- Given some lines, choose which points are assigned to it (This is basically just thresholding the distance to the line)
But in contrast to k-Means, the assignments are probabilities per line, not just one line selected.
I would solve the first point by placing the weights in a diagonal matrix $W$ and solving the corresponding weighted least squares problem.
This was skipped in the lecture although the topic was on the slides.
From each point we start and consider the nearby points, compute their mean, and move the point there. Repeat until this point's location is stable. Do for all points independently and see which ones end up at more or less the same place.
- EM can deal with probabilities, K-Means only with binary labels
- EM is less sensitive to outliers and bad initialization. But EM and K-Means both are affected by the initialization.
- EM/K-Means require a model prior and the number of clusters $k$.
- EM and K-Means can get stuck in local optima
- Mean-Shift has no prior about the data while EM assumes Gaussians
- Mean-shift is robust to outliers because these are too far away to affect other points. That effect is huge in k-means since it affects the cluster mean, and less huge in EM where the probability of the outlier gives it a low weight.
As another alternative to assign points to lines.
We try all lines that go through two points. That means we might not find the best line, but we find a pretty okay line. We look at the line and count how many inliers it has, based on an arbitrary threshold that defines how far away from the line a point may be to still count as an inlier.
Once we reach a sufficient number of inliers (or a predefined number of iterations) we stop and keep the best line. Optionally, we can now improve it further - see the lecture on Mathematical Foundations in CG and CV.
The number of iterations required, based on the outlier ratio: $$ N = \left\lceil\frac{\log(1-p)}{\log\left(1-(1-r)^s\right)}\right\rceil $$
$$ \begin{align*} \text{where}\quad p &\quad\text{is how certain we want to be}\\ s &\quad\text{is how many data points define the model}\\ r &\quad\text{is the outlier percentage} \end{align*} $$
This guarantees us that with $N$ iterations, the probability that at least one sample consists only of inliers is $p$. And since we defined what is still an inlier, it is a useful statement to make.
Derivation of this formula: the probability that at least one point of a sample is an outlier equals the probability that not all points are inliers, $= (1 -(1-r)^s)$. We repeat that for $N$ iterations, so the chance that we fail every time is $(1-(1-r)^s)^N$, and we want this failure probability to be $1-p$, so that we succeed at least once at selecting $s$ inliers with probability $p$: $$ (1-(1-r)^s)^N = 1-p $$ Alternative phrasing that is maybe easier: $$ (1-(\text{inlier probability})^s)^N = \text{desired failure probability} $$
We can take 7 matches, compute
RANSAC assumes that there is some good set that leads to a good solution. If there is no good solution and the problem is illposed with the set, then RANSAC gets confused.
If we have almost degenerate data (E.g. all the points except for a few are on the same plane) this will not work out well (low probability of success). If RANSAC picks six points in the plane out of the eight, you will find a solution that works perfectly for all points in the plane but has a random epipole so you will very likely not get a solution that works for the actually interesting points.
This was somewhat covered in lecture 08 Wednesday. We check whether we can "squeeze" the points into a more constrained model (less dimensions. More-dimensional null-space). And if we still have almost the same number of outliers then it did not really matter ... we decrease dimensionality until it does matter, and then we know what dimensionality our data samples actually had. If that is too small, we try to find something in the outliers of the six-dimensional space to expand the degenerate case to a non-degenerate case.
Alternative to RANSAC: Minimize Median Residual instead of maximizing inlier count.
Imagine: for a particular hypothesis you sort the points by error amount. LMedS minimizes the error at the 50th percentile. RANSAC seeks the most densely populated stripe of a user-defined size.
We have several simple classifiers that each look at one single feature. So they are one-dimensional thresholding classifiers. There are a limited number of threshold values to try, because there is finite data. So trying each threshold is linear (times the complexity of the classifier evaluation).
We try every feature with every threshold possible and see which one gives the best accuracy. Each classifier is assigned a weight
So if the previous classifier's error was above 50%, the
The lecture slides omitted the minus - I think this was by mistake. The lecture slides also do not modify the weights of the correctly classified points while the original paper reduces them. In the original paper, the formula is also different in that they also update the correctly classified points' weights:
- Real-Time
- Very efficient, because it uses simple rectangular filters. Those are very efficient thanks to the integral image: any rectangle sum can then be computed in constant time (given the integral image, which itself takes only a single pass); see the sketch below.
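Sketch of the integral-image trick (my own code): one cumulative-sum pass, then any rectangle sum in O(1) with four lookups.

```python
import numpy as np

def integral_image(img):
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)      # zero-padded for easy indexing
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the padded integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())   # both 30.0
```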
Have a lot of features that are really simple to compute, then use AdaBoost to select the most informative features (sparsely).
See Advanced Machine Learning notes for details.
tl;dr: Hard-Margin SVM searches for a line with maximal margin. That is apparently equal to minimizing the norm of $w$ under the margin constraints.
A multicut is the set of edges that need to be cut to separate the multiple distinct components.
To avoid cutting an edge that has no impact, we require that for every cycle in the original graph, at least two edges need to be cut. Mathematically we say this by counting an edge as cut if its value is $1$.
The
I'm not going into the details of how we do the optimization. But the heuristic solver gave me much faster solutions.
This is called the Minimum Cost Multicut Problem and it can be used for image segmentation by choosing appropriate values for the edges. We get the coefficients e.g. from tracking features.
We can also add
Constraints:
- consistency: Any edge should only be not-a-cut if the node is valid. If a node is invalid, all its edges should be a cut.
- cycle consistency: no cycle should be cut in only a single edge.
We look at detected humans over time. They should not split up or join together.
Problem: If the tracking is bad, this is hard. If we instead cut the graph, it is more robust.
You can either think about how to choose coefficients (example in MathFound lecture) or just run a deep neural network to do that for you. However, you need then to still verify whether the parameters it gives you are reasonable and the graph cut valid.
We add the uniqueness constraint: only one label per cycle.
If we assume an affine camera, i.e. absence of perspective - parallel lines are actually parallel - then we can solve SfM a lot simpler than with full projectivities.
An affine camera matrix is no longer only defined up to scale because the last row is fixed to $(0,0,0,1)$.
Call this leftmost vector
$$
P: (2m\times3),\qquad M:(3\times n)\\
m \mathop{:=} \text{num cameras}, \qquad n \mathop{:=} \text{num points}
$$
The columns in
If we know the camera poses
As an additional constraint for choosing
This $C$ here is actually very much related to the absolute conics/quadrics we saw earlier (==Q== How?)
After applying this all, we can interpolate to viewpoints that we have not seen. We got all this just with an SVD - all thanks to not having perspective. We can compute both the points and the camera positions just from 2d observations and knowing which 2d points belong together.
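A minimal Tomasi-Kanade-style factorization sketch of this (my own code, ignoring the metric/affine upgrade step):

```python
import numpy as np

def affine_factorization(W):
    """W: (2m, n) matrix of 2D observations (m cameras, n points)."""
    W = W - W.mean(axis=1, keepdims=True)       # move centroids to the origin
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :3] * np.sqrt(S[:3])               # (2m, 3) stacked affine cameras
    X = np.sqrt(S[:3])[:, None] * Vt[:3]        # (3, n) 3D structure
    return P, X                                 # W ~ P @ X, up to an affine ambiguity
```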
That would also be possible, but is way harder than Affine Factorization. It ends up being not a good approach in practise, but conceptually "it's nice to make the link".
We could add a scale factor to every single observed point by adding a diagonal matrix
Recover two-view geometry, triangulate points, estimate camera pose, repeat.
==TODO== is this the same as incremental SfM?
Start with one pair of images. Determine their pose from the 3d structure, extend the 3d structure, refine the resulting structure using bundle adjustment
Handle all images at once: Estimate all orientations, estimate all positions, triangulate the structure, refine it.
This is more efficient and accurate, but less robust.
Non-linear refinement given all points and camera poses.
The equation is hard, but we can make use of sparsity.
https://www.youtube.com/watch?v=mXINx-UG6iY
How is this different from SfM though?
### Disparity Varying
The approach where they slide over multiple depth hypotheses using a known and fixed offset
Also consider the
For each pixel, the disparity is how much the two images disagree given a depth hypothesis
A similar idea works to determine on which kind of plane a point lies: Hypothesize planes and see what works out best.
Yeah, this is related to the "plane sweep" idea for finding the optimal depth of a pixel, right? We also have multiple views and we compare them at different depths to the reference image. I think what I described is a bit different from that idea though in that we don't project the images to hypothesized planes, but get the disparity from solving the B/Z = d/f formula for d and then just take the pixel in the other image at that disparity and compare with that pixel. - Alexey
Possibly related: https://www.sciencedirect.com/topics/engineering/disparity-estimation
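The disparity/depth relation mentioned in the comment above as a tiny helper (my own; $B$ = baseline, $f$ = focal length in pixels, rectified stereo assumed):

```python
def depth_from_disparity(d, f, B):
    return f * B / d        # from B / Z = d / f

print(depth_from_disparity(d=8.0, f=800.0, B=0.1))   # 10.0 (same unit as B)
```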
Corners are a good feature to detect because edges are only localizable in the direction orthogonal to the edge, but corners are distinct. The basic idea is that if we are on an edge, there is a large change in intensity orthogonal to the edge. If you have two edges that meet in a corner, and a window that is big enough to contain them both, moving the window in any direction will cause a big difference.
We look at a small window around each pixel to detect corners. The sum of element-wise squared differences between a window and a slightly shifted window is what we look at. The more different this is, the better a feature it is.
$$
\begin{aligned}
\operatorname{cornerness}(x,y) &= [I(x+u, y+v) - I(x,y)]^2 \\
&\stackrel{\text{Taylor expansion}}{\approx} [I(x,y) + I_x(x, y)u + I_y(x, y)v - I(x,y)]^2\\
&=[I_xu + I_yv]^2 \\
&= \begin{bmatrix}u&v\end{bmatrix}M\begin{bmatrix}u\\v\end{bmatrix}
\end{aligned}
$$
We sum those squared differences up for all pixels in the window
If we use a gaussian with decreasing weights instead of a binary window, it becomes rotation invariant. So every pixel is technically considered a little - it just has almost no impact once it's a bit away. When we program this, we just threshold that and do not consider points that are too far away to be relevant.
$$
M = \sum_{x,y \in W} g(\sigma, x, y) \ast \begin{bmatrix}I_x^2 & I_xI_y \\ I_xI_y & I_y^2 \end{bmatrix}
$$
To compute this, we first compute the image derivatives (and their squares) simply by subtracting neighbouring pixel values.
The determinant of
In the end, we perform non-maximum suppression because we want just one corner pixel, not many, for the same corner.
All in all, this is still not scale invariant - what used to be a sharp corner becomes a line if you zoom out.
Observation: the matrix basically determines an ellipse when we set the function to some constant, and its eigenvalues are large in the direction where the ellipse is squeezed together the most. Adding a rotation in this matrix would be possible.
We can distinguish the kind of region by looking at the eigenvalues:

- corner: $\lambda_1, \lambda_2$ are both big
- flat: $\lambda_1 \approx 0 \approx \lambda_2$
- edge: one lambda is a lot bigger than the other.
We can approximate it with
We can actually compute this without explicitly computing the eigenvalues, because the determinant of the matrix is the product of its eigenvalues and the trace is its sum. So this is $\det(M) - \alpha \operatorname{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2$.
This was not mentioned in the lecture but is on the slides: $$ R = \frac{\det (M)}{\operatorname{trace}(M) + \epsilon} $$
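A minimal sketch of computing this response for a whole image, assuming scipy is available; the `sigma` and `alpha` values are just common default choices of mine, not from the lecture:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=1.0, alpha=0.05):
    """Harris response R = det(M) - alpha * trace(M)^2, computed per pixel."""
    Ix = sobel(img.astype(float), axis=1)   # derivative in x
    Iy = sobel(img.astype(float), axis=0)   # derivative in y
    # Gaussian-weighted window around each pixel, applied to the derivative products.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - alpha * trace ** 2         # threshold + non-maximum suppression afterwards
```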
We have a Corner detector that provides us with good keypoints. But they are not scale invariant. So we operate on multiple scales now, so that we can detect the feature again in a differently-scaled view.
For this we define a "signature function" that is scale invariant.
A laplacian of Gaussians (LoG) is basically a "blob detector". It has very high or very low values at blobs when convolved with the image. We take the second derivative of the gaussian in both the $x$ and $y$ direction and sum them up.
Bigger variances in the gaussian detect bigger (circular) regions.
Implementation Detail: Computing the laplacian of gaussians can also be approximated with a difference of two differently-sized gaussians instead.
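A tiny sketch of that approximation (the factor `k = 1.6` is a common choice, not a value from the lecture):

```python
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(img, sigma, k=1.6):
    """Difference of two Gaussian blurs, approximating the LoG up to a constant factor."""
    g1 = gaussian_filter(img.astype(float), sigma)
    g2 = gaussian_filter(img.astype(float), k * sigma)
    return g2 - g1   # roughly proportional to (k - 1) * sigma^2 * LoG
```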
The
The
(Source): at a reasonably sharp edge between two regions of uniform but different intensities, the LoG response will be:
- zero at a long distance from the edge,
- positive just to one side of the edge,
- negative just to the other side of the edge,
- zero at some point in between, on the edge itself.
The
The
Given keypoints, we want to build descriptors that we can recognize again. E.g. for bundle adjustment. Finding the keypoint candidates can be done e.g. by blurring the image in various scales, then computing the differences between such consecutive blurred images (to approximate the LoG) and take the points that stand out in some of those differences (extrema, detected in small window cubes).
We take a 4x4 grid of histograms of the gradients around the keypoint. The gradients are usually robust with respect to illumination changes and viewpoint changes.
Each histogram has the angle split up into 8 bins, so 45-degree pieces. We could also consider the gradient magnitude but that is possibly more affected by illumination changes - so we don't keep that. Or we could weigh the gradients in the histogram by their magnitude.
youtube: First Principles : SIFT Detector Summary
- exam 2019
After Scale-Space Extrema detection using LoG approximations, we find interest point candidates. We remove all the weak extrema and do nonmaximum suppression, so now we have the SIFT interest points.
Then we look at each of these blob centers and the pixels around it in a radius corresponding to its scale. For all of them we can compute gradients inside the circular window. The dominant gradient orientation determines the zero-degree orientation. After rotating accordingly, we compute the histogram of gradients (HoG) with some binning of angles. This gives us rotation invariance.
To achieve scale invariance, the histogram of a larger-scale blob needs to be rescaled. We can simply divide each histogram by its sigma, since we know that.
Finally these descriptors are reshaped into feature vectors and correspondences are found using some distance function to compare them - e.g. by computing the squared difference.
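A small sketch of such matching with squared differences (SSD) and mutual nearest neighbours; the descriptor sets are assumed to be numpy arrays of shape (n, 128):

```python
import numpy as np

def match_descriptors(desc1, desc2):
    """Return index pairs (i, j) where i and j are each other's SSD nearest neighbour."""
    # Pairwise SSD between all descriptors: (n1 x n2)
    d = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(axis=2)
    nn12 = d.argmin(axis=1)   # best match in set 2 for each descriptor in set 1
    nn21 = d.argmin(axis=0)   # best match in set 1 for each descriptor in set 2
    return [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]
```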
Keywords "Recursive Bayesian Filter", "Condensation Algorithm". Amounts to something similar to what a Kalman Filter is, apparently. See further below, where I summarize all the exercises.
We assume a pixel moves from
We assume that the brightness (or color) of a pixel does not change if the object remains the same (brightness constancy).
Call the local flow vector
The minus is because the change over time is opposite to the change over space: If we keep the space fixed and a bright pixel moves from left to right into our observed pixel, the gradient over time is positive. But the gradient over space (with a fixed frame) looking at the moving bright spot is negative.
This equation can be solved for
Pros: Can converge quickly, can handle different parameter spaces. Cons: Not robust to noise or large displacement
If the brightness does not change over space, we can not say whether the pixel moved or not. That is the aperture problem. You can only perceive motion in the directions in which the gradient changes.
A trick to deal with this is to look at a larger window .. but then you assume that the window moves the same, locally.
To fulfill the assumption that the motion is very small (around 1 pixel) which is rarely the case, openCV uses pyramids of varying resolution. Bigger pixels means smaller motions. And then the tracking is refined on more detailed scales.
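A minimal sketch of the per-window least-squares solve; `Ix`, `Iy`, `It` are assumed to be arrays of the spatial/temporal derivatives of the pixels inside one window:

```python
import numpy as np

def lk_flow(Ix, Iy, It):
    """Solve I_x u + I_y v = -I_t for (u, v) over all pixels of the window by least squares."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (N x 2)
    b = -It.ravel()                                  # (N,)
    # Only well-posed if A^T A is well conditioned, i.e. the window has gradient
    # in two directions -- otherwise we run into the aperture problem.
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                                      # (u, v)
```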
This approach has a constraint on smoothness in flow. It is a global approach. Again we assume brightness constancy and small motion, but this time we aim for a smooth global flow, not a constant local flow.
The energy function to be minimized is the integral over space of the brightness-constancy term (built from the gradients in space and time) plus the norm of the flow gradient. So flow that changes a lot is punished, and constant brightness is rewarded. Note that
According to Wikipedia, this algorithm produces a high density of flow vectors even in homogeneous regions (good), but it is more sensitive to noise than local methods (bad).
This section could be more detailed, but we never looked at it in more detail, except in the exercise. Which has its own summary section.
Very simple idea: Predict several possible next states, then weigh them based on how correct the predictions turn out to be. Sometimes add noise in between to make sure we keep multiple hypotheses.
Assumptions:
- Everything is gaussian
- all models are linear
If these assumptions are true, the kalman filter is optimal. They are normally not true.
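A minimal 1D predict/update sketch (my own toy example with a constant-position motion model; `q` and `r` are assumed process and measurement noise variances):

```python
def kalman_step(x, p, z, q=1e-2, r=1e-1):
    """One Kalman cycle for a scalar state x with variance p and new measurement z."""
    # Predict: the model says the state stays the same, but uncertainty grows.
    x_pred, p_pred = x, p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new
```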
Difference between Kalman and Particle Filter
Performs some "linearization" using taylor expansion. So models don't have to be linear - just differentiable. It may need some "stabilizing noise" to avoid a drift of the estimated covariance matrix. I believe the extended Kalman Filter is out of the scope of this lecture though.
Instead of considering every possible sliding window, we want to only check a few. Those are called proposals, obtained by some heuristic. For example segmenting the image into rough patches and then only considering their bounding boxes.
Each object proposal candidate is warped to same size, then run through a ConvNet. They learn high dimensional features and then those are put into SVMs to classify.
In the end, refine the bounding box using the features found.
`Filter` (keras) aka `Channel` (pytorch) is basically the depth of the data that is passed through the network. We usually start with three of them when working with images - R, G, B.
`Dense Layers` (keras) aka `Linear Layers` (pytorch) simply interconnect all the inputs with all the outputs. In Convolutions we care about the shapes though, since geometric nearness matters. That is part of why they are useful for images or video data.
We usually need to choose `input_channels`, `output_channels`, `kernel_size`, `padding`, `stride`. The formula is intimidating but it's actually not that hard to just imagine. Talking about 2d Convolutions:
- The kernel size, e.g. $(3\times3)$, defines a rectangle that moves its center over all pixels in the input. There must be a pixel exactly in the middle, so the numbers must be odd.
- So if the border pixels should be considered, the input needs some `padding`. Otherwise the kernel would consider inexistent pixels.
- The `stride` says how far the kernel moves. Default is `1`, so every pixel is once the middle. If it is `2`, then of course the image will be roughly half as wide. "Roughly", because you need to consider the padding too.
- The `num_input_channels` is the size of the third input dimension. These channels are independent of each other. The `num_output_channels` is the size of the third dimension in the output. For each output channel, there is one convolution applied to each input channel, and those are summed up. source
$$
\label{mlfiltereq}
\text{Assuming input } (W_{in}\times H_{in}), \text{ padding } p, \text{ kernel size } k, \text{ stride } s: \\
W_{out} = \left\lfloor\frac{W_{in} + 2p - k}{s}\right\rfloor + 1, \qquad
H_{out} = \left\lfloor\frac{H_{in} + 2p - k}{s}\right\rfloor + 1
$$
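A quick way to sanity-check the formula with PyTorch (the concrete numbers here are arbitrary example values):

```python
import torch
import torch.nn as nn

# Shapes are (batch, channels, H, W).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)   # torch.Size([1, 8, 16, 16]) since (32 + 2*1 - 3)//2 + 1 = 16
```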
The
There is also the concept of Deconvolutions
Applies a 2D transposed convolution operator over an input image composed of several input planes.
This module can be seen as the gradient of Conv2d with respect to its input. It is also known as a fractionally-strided convolution or a deconvolution (although it is not an actual deconvolution operation as it does not compute a true inverse of convolution). For more information, see the visualizations here and the Deconvolutional Networks paper.
I did not fully grasp how they would differ from just convolutions.
In conclusion, the deconvolution layer is the same as the convolution in LR with r channel output where d d is the spatial dimension of the data arxiv
They are used for upsampling, where a normal convolution with the same parameters would downsample. But if I ever want to use it, it is better to do some reading up first - seems like a normal convolution in low-res space would do the same trick.
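A small PyTorch example of the shape effect; the layer parameters are chosen so the transposed convolution undoes the downsampling of the strided convolution (shape-wise only, not value-wise):

```python
import torch
import torch.nn as nn

down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1, output_padding=1)
x = torch.randn(1, 1, 16, 16)
print(down(x).shape)       # torch.Size([1, 1, 8, 8])
print(up(down(x)).shape)   # torch.Size([1, 1, 16, 16])
```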
I hope we don't have to manually compute backprop at the exam
The basic idea: compute how the loss function changes when we modify each weight. By making use of the chain rule, the gradient is propagated from the loss backwards through the layers, which also gives how the loss changes when each layer's input changes. Then gradient descent on the weights to find a (local) optimum of the loss.
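A tiny hand-sized example of that chain-rule bookkeeping for a single linear neuron with squared loss (my own toy numbers, not an exam example):

```python
# Forward/backward pass for y = w*x + b with loss L = (y - t)^2.
x, t = 2.0, 2.0
w, b = 0.5, 0.0
y = w * x + b                 # forward: y = 1.0
L = (y - t) ** 2              # loss: 1.0
dL_dy = 2 * (y - t)           # -2.0
dL_dw = dL_dy * x             # chain rule: -4.0
dL_db = dL_dy * 1.0           # -2.0
lr = 0.1
w, b = w - lr * dL_dw, b - lr * dL_db   # gradient descent step -> w = 0.9, b = 0.2
print(w * x + b)              # 2.0, so the loss on this sample is now zero
```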
By having a non-polynomial activation function, a network can approximate any function pretty well
bad model -> bad results.
Some models are too crude. Others are very detailed but those details are not observed in the picture. So we just model the visible detail - shape, texture.
Just using pose parameters doesn't look good when applied rigidly, so we use linear blend skinning (recall from Mathfound): every point is a linear combination of close body parts.
If you don't apply the linear blend skinning and just rotate based on the angle you got, you break some body part connections. With Linear Blend Skinning you consider "all" the body parts.
The problem with Linear Blend Skinning, though, is that it produces some artifacts that don't exist in the human body.
Then people use a "Blend Shape", to get rid of those artifacts. Meaning the original shape in the rest pose is corrected first. Those blend shapes can be learned (not covered now).
The pipeline is called the SMPL Body Model.
so the "Shape Blend Shape" is the identity of this instance and the "Pose Blend Shape" is changing based on pose.
They use this model to reconstruct a 3d human pose from a 2d image (by posing the model in the best way. ransac-like, I guess).
We had a non-linearly-separable dataset. The linear classifier was not very good, unsurprisingly.
Takeaways:
- nonlinear activation function -> model can learn higher-dimensional representations
- floats are bad
- num params is the sum of all weights and biases. For a conv layer it can be computed as $C_{in} \cdot K_x \cdot K_y \cdot C_{out} + C_{out}$ (see the sketch after this list)
- Confusion Matrix is a table of "prediction" $\times$ "truth", counting for each cell how often that happened.
We implemented the Mean-Shift algorithm to segment an image into clusters. For this each point is considered and gets a new mean computed based on its weighted neighbours within some radius. We used the color distance.
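A minimal sketch of one such mean-shift step on color vectors; the bandwidth name `h` and the Gaussian weighting are assumptions of mine and only roughly match the exercise:

```python
import numpy as np

def meanshift_step(X, h=1.0):
    """X: (N x 3) color vectors. Every point moves to the weighted mean of its neighbours."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared color distances
    W = np.exp(-0.5 * d2 / h ** 2)                            # Gaussian weights
    return (W @ X) / W.sum(axis=1, keepdims=True)

# Iterate until the points stop moving; points that end up (almost) at the same
# location form one cluster/segment.
```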
Takeaways
- loops are slower than numpy einsum. But torch.sum is faster than just sum. And torch.einsum has some speed issues.
- Implementing a gaussian myself works better than the one from sklearn
We also implemented a Seg-Net. The architecture was conv2d, batchnorm, ReLU, MaxPool. Then in the end unpooling with the same indices. We didn't learn why the architecture was chosen this way.
Roughly covered in SfM. We got many images and point correspondences but no camera positions. We reconstructed the camera positions and the 3d Point locations, while trying to minimize the geometric reprojection error. We did not do bundle adjustment.
Summary: The first two images give rise to an essential matrix. We decompose that to get the rotation and translation of one image, fixing the other to no rotation nor translation. We backproject all points from the images to the 3d space using the correspondences and our now-known camera poses.
Takeaways:
- There is a big difference between $[R^T|t]$ and $[R^T | -R^T t]$. The latter is correct, because the translation is applied after rotating. The trickiness comes in because if $R$ is the rotation from cam1 to cam2, then the correct projection matrix for cam2 (if cam1 is the origin) contains $R^T$: if the camera turns right, the point is now to the left of it. Others have fixed that simply by swapping the camera poses. The translation should be negative too, but then also rotated, so it needs the same $R^T$, because the translation is applied in the already-rotated space.
- Nothing special about Ransac
- MVS was done with deep learning, so nothing interesting there either
We split the image into cells, and then consider patches that consist of $4 \times 4$ cells.
We handle the zero-gradient by arbitrarily assigning it the angle
Each cell has hence a descriptor that is the HOG. And each patch has 16 cells. All descriptors concatenated yields the feature vector of the patch. We lose geometric adjacency, but this could be learned if it is useful.
Using the described process, we extract a feature vector from each training image. We then choose some
For each test image, we also construct the bag of words and then treat it as a vector, searching for the nearest neighbour in euclidean distance in the training set.
If we choose
Classify the image by scoring it in multiple categories: How certainly does it depict an airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck? We simply do some convolutions and then predict one score for each class.
Recursive Bayesian Filter / Condensation algorithm
We track a ball using some features like color or gradients. We have a model of how the ball moves, e.g. with constant velocity, and in each step we update a few predictions for the next step. Doing so, we also add some noise either there or in the observation step to randomize a bit. Once the next frame's color values are known, we propagate every hypothesis, but weigh them based on how good they are (compared to the color that we are looking for). Wrong hypotheses are forgotten since they keep losing weight over time.
The weighted mean is what is used as an actual prediction.
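A minimal condensation-style sketch for a 1D state (e.g. the ball's x position); the `likelihood` scoring function, the random-walk motion model and the noise level are my own assumptions, not the exercise code:

```python
import numpy as np

def condensation_step(particles, weights, likelihood, noise_std=2.0):
    """particles: (N,) hypotheses; weights: (N,) normalized; likelihood(p) scores a hypothesis."""
    # Resample according to the weights: good hypotheses survive, bad ones are forgotten.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    # Propagate with the motion model plus some noise to keep multiple hypotheses alive.
    particles = particles + np.random.normal(0, noise_std, size=len(particles))
    # Re-weight by how well each hypothesis explains the new observation.
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    estimate = np.sum(weights * particles)   # weighted mean as the actual prediction
    return particles, weights, estimate
```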
The objectives of this course are:
- To introduce the fundamental problems of computer vision.
- To introduce the main concepts and techniques used to solve those.
- To enable participants to implement solutions for reasonably complex problems.
- To enable participants to make sense of the computer vision literature.
- Camera Models
- Calibration
- Invariant Features
- Multiple-view Geometry
- Model Fitting
- Stereo Matching
- Segmentation
- 2D Shape matching
- Shape from Silhouettes
- Optical Flow
- Structure From Motion
- Tracking
- Object recognition
- Object Category Recognition
- lecture 1, 2
- lecture 3, deep learning, lecture 04 convolutions
- lecture 04 corner detection (region detection, local descriptors)
- lecture 05, 06, image segmentation
- lecture 07: Projections, MVG
- watch lecture 08: thursday
- summarize important stuff of lecture 08: thursday
- watch lecture 09: wed and thu
- 10, 11,
- lecture 12, not much to be learned here...
- notes of lecture 12_tracking
- notes of 13_1
- notes of 13_2
- old exams
- 2020
- 2019
- 2016
- 2015-2016
- read feedback for tasks 1-4
- read feedback for tasks 4+
- review practical exercises
todo next: read my own reports
- ex1
- ex2
- ex3
- ex4
- ex5
- ex6
- my exam notes below
- SfM
- Machine Learning Stuff
- Recap everything, especially the homogeneous stuff and how to project things that are not points from 2d to 3d and back. E.g. $\text{Eq. }\ref{eqbackpropline}$.
- read up in discord starting at https://discord.com/channels/910462893387030569/910462893387030572/934777917626548275
- SVM
- graph cuts (although they were in lecture 13, which is not part of the exam according to a moodle post)
- What is the Hungarian Matching algorithm mentioned shortly in the lecture slides?
- disjoint paths solutions
- "the effect of baseline on depth estimation" in the MVS lecture
- MRF, graph cuts (image segmentation)
- why are $E$ and $F$ not of full rank? see https://discord.com/channels/910462893387030569/910462893387030572/932674716399984670
- SLAM (if covered?)
- all the missing deep learning stuff
- Mihai: For the ML part, there will be questions regarding the different algorithms (e.g., pseudo-code, importance of different steps, when one clustering is more adequate than another). For the Deep Learning part, we mostly expect you to understand how the layers work, what the parameters of the different layers are, etc.
- Kalman Filter: We have a prior distribution reflecting how much a variable we're interested in (e.g. the position of whatever we're tracking) changes between steps (e.g. a prior distribution over the speed of the tracked object). Based on that, we make a prediction for the next state of the variable and use that prediction in whatever calculations we need to perform now. When we actually get to the next state, we obtain a measurement of that variable. But the measurement is noisy, e.g. due to imperfections in the algorithm that detects where the tracked object is now. So we update the parameters of our prior distribution with a weighted combination of the current parameters and the newly measured ones. The weight reflects how much we trust our measurements.
- Kinds of Transformations
- How to apply homographies to various objects. Actually not too hard to rederive during the exam.
- Which element in the calibration matrix means what
- Properties of the fundamental matrix: determinant, rank, # DoFs
- RANSAC formula derivation
- Properties of the essential matrix: same questions as for $F$. Also look at $\text{Eq. }\ref{eqwiki1}$
- Harris and SIFT invariances. SIFT: not invariant to affine transformations, but invariant to rotation, translation, scale. Harris: rotation- and translation-invariant, but not scale invariant unless you add laplacians / scale pyramids.
- equation for AdaBoost
- Formulas for EM-Algorithm
- Equations for Essential and Fundamental matrix.
- $\text{Eq. }\ref{brightnessconstancyconstraint}$, the Optical Flow brightness constancy constraint, and also how the next two equations follow. Also $\text{Eq. }\ref{hornschunck}$ about Horn-Schunck OF and its advantages (enforces smooth flow, not constant flow; advantage: dense flow even without dense samples).
- $\text{Eq. }\ref{mlfiltereq}$: filter-size equation for deep learning
- SVM equation $\text{Eq. }\ref{svmeq}$: $\min_{w, w_0} \frac 1 2 \|w\|^2 \quad\text{s.t. } y_i(w^Tx_i+w_0) \geq 1$
Topics I'd like to look at but are probably too far away from the lecture topics
- [Deep Learning] how to guarantee that embedding space is continuous and can be interpolated?
- Mahalanobis Distance to mix world-error and projected-error
- Total Least squares for line fitting. Apparently that can be done using svd? See also here for the clearest description:
- Lagrange Multipliers and Euler-Lagrange Equations
- Lecture videos / slides of course M. Pollefeys, 2021
- Hartley & Zisserman - Multiple View Geometry in Computer Vision
- Epipolar Geometry and the Fundamental Matrix
- Lucas Kanade Optical Flow
- Kalman Filter - 5 Minutes with Cyrill
- Homogeneous math stuff https://engineering.purdue.edu/kak/computervision/ECE661Folder/
Notes from solving the 2020 exam for studying
- What is "SSD mutual nearest neighbour matching"?
- did we have the "MSER Maximally Stable Extremal Regions" feature descriptor?
- is sift invariant to affine? No. But MSER is
- look at exact definition of SIFT again - what is part of it and what not? E.g. apparently scale-space extrema detection and keypoint localization are part of it.
- how to directly compute epipolar line again?
- with pure sideways translation, are the epipolar lines for all points verticals?? seems wrong... but the math looks like it to me.
- is a homogeneous line with zero in the first component even valid? I guess... yes. It just says it is independent of $x$.
- apparently if we know the camera calibration and the relative motion of the camera but not the absolute 3d world coordinate frame, "up to which transformation can the scene be reconstructed in 3d" is only rotation and translation, not also scale. So euclidean, not similarity. Why??
- what is the "infinity homography"
- "estimate the optimal disparity per pixel using loops and numerical operations and the winner-takes-all technique"
- how does increasing the window size effect this?
- what is the graph-cut technique for computing disparitymap?
- Pros and Cons of Bag of Words? See solution 6.b)
- mean-shift algo: meaning of window $h$ and convergence threshold $\epsilon$.
- gradient: do they define it positive as pointing right and down when the right and/or down side pixel is larger? I think so.
- Scalar trick for finding constraint matrix $A$ allows for taking the $\vec X$ in front of the $p$: $$ PX = \begin{bmatrix}p^1_{row}X \\ p^2_{row}X \\ p^3_{row}X\end{bmatrix} = \begin{bmatrix}X^T {p_{row}^{1}}^T \\ X^T {p_{row}^2}^T \\ X^T {p_{row}^3}^T\end{bmatrix} $$
In that exam's case:
Note: The