VINS code¶
Analysis the structure and the details of VINS code [1] . Take the android implement as example. As the original code is based on ROS(robot opeartion system), changes are made to create virtual ROS message, and manual call the ros callback functions. It has two main threads: vins_estimator and loop_fusion.
Top Node¶
In naive-lib.cpp all the intereface plugin functions are defined. In the android version code, there are lots of management of memory(mutex), query and buffer, and lots of global variables. It is the result of the absent of ROS, we need to take great consideration of these parts.
The main node in VINS android is : Estimator_node. Where many fronter functions are defined : functions to run data set test, functions to run in real time. In the init() function , it starts three threads : process, loop_detection, pose_graph. The main thread in ROS node is process it calls two essential functions receive_IMU and receive_img, they will call img_callback, imu_callback, and feature_callback respectively. And it is also responsible for loop deteciton and loop closure.
call from other¶
receive_img (called in java plugins, used to show image in android application)
-> img_callback : call this fcn when receive image
-> proprecessing : gestion of frequence / error rejcetion judgements
-> process the image (seperate the process of stereo camera and mono camera)
-> calculate feature points ( FeatureTracker )-> call feature_callback (push the feature to buffer)
-> draw track messages to image (tracked points in green, not tracked points in red, etc)
-> draw_mainui : call DrawResult class, to draw AR rendering(drawAR) or draw trajectory with map points(Reprojection)
-> add more debug infomation to shown image
receive_IMU (called in Phone Sensor)
-> imu_callback :
-> add to process query
-> predict the current state (position, quaternion and velocity) by intergration (mean value).
Details of mean value integration (in estimator_node -> predict()), where all the values are vectors:
Queation : these results are never used, it is real necessary??
Answer : This is used as output between two camera frames, to achieve IMU-rate performance. This high-frequence state estimates can be utilized as state feedback for closed loop closure.
Process¶
- call sendIMU -> estimator.processIMU
- call estimator. processImage
loop closure there are also loop closure process in this thread.
Queation we also have loop closure in process_loop_detection thread. Is is redundant??
process_loop_detection¶
- Take the first image from the keyframe buffer to process.
- Add the keyframe to KeyFrameDatabase.
- Extract Brief descriptors of the keyframe features.
- Start Loop Closure (LoopClosure class). if success, receive the looped kerframe’s index.
- if far enough, add this loop to process query<int> (optimize_posegraph_buf) : calculate the matches between current processing keyframe and the looped keyframe, then Pnp to get pose.
- if too many keyframe in database, downsample erase some keyframes.
process_pose_graph¶
- if we have loop in the query (optimize_posegraph_buf, as we may add loop in process_loop_detection thread).
- do optimize4DoFLoopPoseGraph (cerse solver, 4DOF as VINS has set the gravity direction to be vertical)
- update infomation for visualization
VINS estimator¶
Method called above in “top node” : estimator.processIMU, estimator.processImage, estimator.retrive_data_vector . Its basic idea is to manage a slide window , make imu preintegration and imu observation, also marginalization, etc.
preintegration¶
The system preintegation and the system state function can be write as :
processIMU¶
A IntegrationBase class is made for pre-intergration management and calculation.
IntegrationBase¶
- push back a new measurment : timestamp, gyrocope measure, and accelerometer measure. Add them to the buffer and propagate the system.
- midPointIntegration : basic it is the same expression as above, about we are doing integration for the error term of preintegration here (as a result, n gravity term here). (in the VINS source code, they note p, v, and q, however I found it being misleading, so I note them as alpha , beta and gamma as in [2] ).
- Jacobian update : (it is optinal, it is set true, However when considering First Estimation Jacobian , we should not update Jacobian) three matrix are calculated before to fasten. Noise is seen as gaussian. And the F matrix(15*15) and the error term propagation matrix V (15*18) are calculated. (remember to normalize quaternion). In the end, two 15*15 matrix : Jacobian and Covariance are calculated.
Jacobian is (noted as F), F here is actual (I+F) in the original article:
The noise term matrix is (noted as V):
This is a iteration process, as we can see below, as a result, the Jacobian is the acculumation of F.
- evaluate : calcuates the residual (15*1 vector)
- also have checkJacobian : to check the calculation of jacobian of the system; offer an option of eulerIntegration (however it is less precise than mid point integration); and compare the results of mid point integration and euler integration.
Integration¶
In the final part of processIMU, the integration terms of the real world physics variables are calculated as below, where j indicates ith window, k indicates kth imu data (between two received image).
processImage¶
Pipeline:
- addFeatureCheckParallax check the image simliarity, to choose whether marginalize the oldest image in the window(to make space for the new coming , and the current image is treated as new keyframe) or the last image in the window (if the recent images are similar).
- create new image frame, and create the image pre-integration base.
- option : ( ESTIMATE_EXTRINSIC == 2 ) calibrate the extrinsic parameters.
- (solver_flag == INITIAL) -> fill the slide window and try to initialize initialStructure.
- (solver_flag == NON_LINEAR) -> initialize success, manage the slide window.
initialStructure¶
- This is the main part of the initialization process of VINS, and it is realized by a global SFM. So I will cover its details in GLOBAL SFM part.
solveOdometry¶
- f_manager.triangulate
- optimization()
slideWindow¶
- slideWindowOld : (solver_flag == NON_LINEAR ? true : false) f_manager.removeBackShiftDepth, f_manager.removeBack
- slideWindowNew : f_manager.removeFront
optimization¶
use ceres to optimize : CauchyLoss
- add pose local parameter block (of the slide window)
- add current frame pose block
- add residual of imu preintegrations (of the slide window)
- add feature residual (ESTIMATE_TD option)
- marginalization_info->addResidualBlockInfo of the upper resiudal
linear_solver_type set to ceres::DENSE_SCHUR, trust_region_strategy_type set to ceres::DOGLEG.
Slide window marginalization.
- marginalization_info->preMarginalize();
- marginalization_info->marginalize();
Marginalization¶
ResidualBlockInfo¶
Evaluate : evaluate the ceres loss evaluate. In ceres the evaluate result is a three-deminsion vector , where r is the squared norm. And the loss function is a costume defined function.
And VINS uses an factor alpha to control its jacobian.
MarginalizationInfo¶
- std::unordered_map<long, int> parameter_block_size; //global size
- std::unordered_map<long, int> parameter_block_idx; //local size
preMarginalize : retrive ResidualBlockInfo->cost_function->parameter_blocks
marginalize: from viki page and CSDN we can learn about marginalize. When a key frame is delete from the slide window, we should not directly delete all its parameters, as it will lead to infomation lose. The solution is to use marginalzation algorithm, in which way to keep part of the old infomation to the current state, while delete these old variables.
we can rewrite the system state as :
As the result the system function can be rewrite as :
Then we can rewrite the function to the form:
In VINS source code, A and b are defined as follow:
In our non linear optimization we have :
SVD:We can express an matrix by its singular value decomposition (SVD) :
where U,V are orthogonal matrices and Lambda is a diagonal matrix that is compose of multiple singular values arranged in decreasing order. We can further use these eigen values (elements of Lambda) to decomprose the image into multiple rank 1 matrices :
We can calculate the Jacobian by SVD, and also make it positive defined at the same time. And Eigen::SelfAdjointEigenSolver is used to calculate the eigen values of A. And set the negative values of these eigenvalues (by selecting the elements smaller than eps=1e-8) set them to be zero. This is to choose the positive eigen values to make jacobian positive defined .
Then linearized jacobian and linearized residual are defined :
This is similar to filter based image processing methods, such as in fourier transform filter or as we have seen before in Image blury . We can say,that it will keep most of the original infomation.
MarginalizationFactor¶
It is derivated from ceres::CostFunction. The Evaluate function is redefined here.
point position
camera pose
sum
update jacobian
set the elements related to the old frame to be zeros. And set the rest elements by the linearized jacobian (as calculated above in MarginalizationInfo).
Feature Manager¶
- list<FeaturePerId> feature
- vector<FeaturePerFrame> feature_per_frame
Global SfM¶
This is a simplified Global SfM method, reduced lots of algorithm details (outlier rejection, retriangulation, multiply global BA, etc). However, it works well for a real time SLAM application, and we can try multiply times to initialize.
Main process loop¶
- check IMU state. where Delta V is the result of preintegration between two frames in integration base, Delta t is the time interval between frames. To make sure the IMU data in the window have enough variance, so that it may give enough triangulation possibility. The main process is realized by :
- if Var < 0.25 : “IMU excitation not enouth!” (that is to say , IMU hasn’t enough variance). But in real AR application test, we found this part may not be necessary.
- Initialize a sfm features vector by FeatureManager .
- Check the relative pose, if not enough features or parallax, ask to move the device.
- GlobalSFM process, this is the main part of the whole process. And global SFM process will return a value “l”, which indicates the reference frame index. As a result, the lth frame will be our world frame.
- If global sfm succeed, solve PnP for all frames. To calcuate all the pose with respect to the lth frame.
- fcn visualInitialAlign which is to optimize and refine the IMU parameters.
Notice that, in Estimator::initialStructure(), the system will solve PnP again, after this Global SfM process. Why?
Their are several port to call this process:
- Initialization state (when the system begin).
- When failure detected. They are some criterions to tell that the system may fail (for example, un normal movement). This part should be paied enough attention. (In real application, it is mostly the estimated bias of accelerator too large, which is un norml)
triangulate point¶
Given two corresponding points image pixel positions and corresponding camera poses, result in its 3d position. (notice: the image pixel poistion is the homo-pose in camera reference, which is [x/z, y/z] ). VINS uses DLT method to solve the problem as explained in section triangulation in chapter SFM.
solveFrameByPnP¶
In the SFM Feature class, saved all the observation of this point. Retrieve these observation, find the processing frame id to collect 3d and 2d infomation. Then use opencv sove pnp method to solve pose (none RANSAC).
triangulateTwoFrames¶
Use SFM Feature to collect matching infomation. Triangulate all the correspoding points between two frames, and set these 3d values to a vector of SFM Feature (by overwriting).
construct¶
Main process of this class. Use global struction from motion method to initialize the map. (“l” is actually “l+1” in the source code, to simplify keep “l”)
init:
- Create array to save all camera poses (boost::shared_array for armeabi-v7a ndk to not getting any error).
- Initialize two camera view: the lth and the last. Set the lth camera pose as identity, and set the last pose by giving relative transform(current frame to lth frame) as input.
triangulate points and solve PnP for frames :
- Triangulate lth frame with the current frame.
- Solve PnP lth to (current-1) frames, and triangulate each of them with the current frame.
- Triangulate all other frames with the 1st frame.
- Solve PnP for the 1st frame to (l-1)th frame, and triangulate each of them with the lth frame.
- Triangulate all other points (by the first and the last observation)
- if any of the upper PnP failed, return false (initialization failed)
Global BA:
- Fully BA : fail -> initialization failed; success -> assign quaternion and translation.
VisualIMUAlignment¶
IMU calibration process is done here by visual and IMU alignment.
The first order approximation of \(\alpha, \beta, \gamma\) with respect to biases can be write as:
Gyroscope Bias Calibration¶
In function solveGyroscopeBias. For all consecutive frames in the Global SfM optimized slide window (\(b_{i},b_{i+1}\)). The preintegration between these two frames can be got from upper steps (\(\alpha, \beta, \gamma\)). We have their poses (\(q_{b_{i}}, q_{b_{i+1}}\)) from Global SfM process, their relative pose is:
To calibration the gyroscope, we try to minimize the following cost function:
Take the inverse of the upper function:
Add the first order approximation w.r.t bias of gyroscope:
As the \(\delta b_{gyro}\) part is very small, we can rewrite the function as :
For a quaternion, its “w” term is always one, so we delete this part from the function, only consider the vector part in the following part, as a result the upper problem becomes:
The problem can be rewrite as :
Reorder the problem, and times \(A_{k}^{T}\) in both sides (as \(A^{T}A\) is a positive definite matrix, make it for LDLT solver).
Then LDLT (Robust Cholesky decomposition of a matrix with pivoting) will be used to solve. Then \(\delta b_{gyro}\) will be added to the original bias to update. After the gyroscope bias updated, repropagation step will be done to update all IMU preintegration terms.
TangentBasis¶
VINS will set the gravity direction to be the z axis, this function will be used to calculate the other two axises. The result b and c vector is shown below:
LinearAlignment¶
Velocity, Gravity Vector and Metric Scale Initialization (The gravity scale will not be traited as a pre-defined value in this process)
Accelerometer bias, as the author claimed, the accelerometer bias is coupled with gravity, and due to the large magnitude of the gravity vector comparing to platform dynamics, and the relatively short during of the initailization phase, these bias terms are hard to observe.
As a result, therefor the state variable should be :
where \(\mathbf{v}_{b_{i}}^{b_{i}}\) is the velocity in body frame while taking the ith image, \({g}^{c_{0}}\) is the gravity vector, and s the scale factor to metric units.
We can rewrite the system function of two images in slide window (in preintegration section), adding the scale factor:
Rerange the upper functions into the form below:
As a result the system can be rewrite as:
Turn the upper z into our relative preintegration in the body frame:
The problem becomes :
where \(\hat{z}_{b_{k+1}}^{b_{k}}\) can be obtained from preintegration.
We can use IDIT to solve it (Same as before, times the transpose of H in both sides, to make a positive definite matrix). After this linear alignment , gravity will be refined.
RefineGravity¶
The gravity vector obatined from the previous linear initialization step can be refined by constraining the magnitude. As a result the gravity will remain 2 degree of freedom. Therefore the gravity will be re-parameterized with two variables on its tangent space (by TangentBasis function).
This new gravity form will be substituted into the upper linear alignment process. The new state will be:
The system function should be rewrite as :
Use the IDLT as the same before to solve it. Then update the new gravity re-parameterizition.
This process will be done four iterations (not as the article said : until gravity converges).
Factors¶
In factors, we have the residual calculations and corresponding jacobian calculations.
Projection Factor¶
Projection factor corresponding to visual reprojection error which can also be called as visual measurement residual. Considering a feature first seen in the ith frame, and analysis the residual for the feature observation in the jth frame. The camera model can be seen as a ray casting from the camera optical center, so the 3d matched point vector (to the jth camera center) and the 2d feature vector should be in the same direction. As a result, the residual is evaluated within the tangant plane.
where [ \(\bar{X}_{c_{j}} , \bar{Y}_{c_{j}} , \bar{Z}_{c_{j}}\)] is a unit vector in the jth camera frame (\(\pi^{-1}\) realize this operation using camera intrinsic parameters).
Which is a chaine of transform : ith camera \(\rightarrow\) ith body \(\rightarrow\) world \(\rightarrow\) jth body \(\rightarrow\) jth camera (VINS system use the inverse depth rather than 3d pose as variable, so lambda appears here as the inverse of depth). Vector in red in the upper image can be expressed as :
Finally, the residual can be write as (the square sum of the residual is actually the length of d):
DT optimization¶
Here we talk about the optimization of the error of timestamp td (the error of timestamp of the camera sensor and the IMU sensor). It is mostly introduced by the explosion time of camera, and also influenced by other hardware processing. It is included as a part of VINS-mono, but is removed in VINS-Fusion.
This is realized by rewrite point poisition in the reprojection error factor with :
Where, TR is the rolling camera parameter (the time of each rolling), for global shutter camera TR equals zero.
Map Reuse¶
The objective of this part is to solve such problem:
Given a prebuilt map, use VINS to achieve better real time properity (faster processing and higher accuracy).
The key to solve this problem is the reuse of map. To fulfill this objective, we need to study mainly the loop closure thread of VINS. This chapter will start from the VINS moblie ‘s implementation of loop closure, followed by my changement of the system to allow better reuse of high accuracy prebuilt map.
Keyframe¶
Firstly, VINS introduced “keyframe” and “keyframe database”. As the BOW matching and loop found all based on keyframes. However, VINS does not have a well-defined keyframe selection algorithm, which may cause it less good.
Loop Detection¶
VINS uses DBOW2 for the detection of loops. The basic pipeline and some important structures can be seen in the figure below.
The processing in the detectloop function is :
- Transform the descriptor into BowVector and FeatureVector (which are types of DBOW).
- If the query id is large enough. Query the transformed two vectors into the bag-of-words database to find matches.
- Add the two vectors to the bag-of-words database anyway.
- Delete the candidates with low scores, after which, initialize the match index as the candidate with the highest score if exists.
- And will transform the query result into a vector of “islands” (that is, divide all the candidates into blocks in ascending order of ids)
- Find the island with the highest score.
- Check geometrically consistent by isGeometricallyConsistent_DI.
- Use the island with the highest score (we have found above) to check
- Find the feature matches by the descriptor distance (use bag-of-words’s tree structure to accelerate)
- Use fundamental matrix geometry properity to check.
- Save the keypoints, descriptors (for geometry check), and this BowVector(for NSS normalization).
- And VINS processor thread will go find the keyframe with the candidate index. To use PnP solver to find relative pose, and pass the loop to optimization thread.
Map save¶
As a result, if I want to save the map. I need :
- DBOW2 Bag-of-words database.
- All the history keypoints and descriptors. (If you want to enable the geometry check)
- VINS processor thread need to load all the history frame too.
As DBOW2’s database has its own load/save support, this will be a easy work. As for the keypoints and descripotrs we need to save them in correct format in binary file (txt/json/xml/etc will work too, but binary is the fastest choice).
Summary¶
In summary, we need to save the system state infomation and change the optimization logicial.
Save Data
- Save the BOW database, all the keyframe keypoints and their corresponding descriptors.
- Save all the keyframes for KerFrameDatabase.
- Load the data:
- Loop detector needs all the saved file.
- VINS’s main thread needs all the keypoints and descriptors.
New Relocalization Process
- PnP RANSAC (2d pts from current frame, 3d pts from old frame) to solve the current pose.
- Update the estimation of inlier points depth.
- Set the point as a child of this current frame. (As we are not sure about its original parent’s pose)
- Assign inverse depth.
- Implementation can be seen in my blog .
- Add this loop pair constrain to the optimization thread, however the old frame’s pose should not change.
- The keyframe dataset of VINS estimator will not update (will not take in new keyframe).
- Or we can save the new keyframes into another queue. Process the loop closure differently. But I prefer to drop them, as VINS’s tracking is very robust, we don’t depends much on the relocalzaiton process, but we need to correct accumulated error by accurate loop closing.

