This is the second part of my two-part explanation of what will be done in this project. This part goes into more detail on how multiple Kinect 2 cameras can be used to film a person in such a way that we get a kind of “holographic” recording of that person, or in other words an animated 3D-model.
A good way to see what I mean is by looking at this video that someone else has made: Q3D: True 3D Video with multiple Kinects in VR
If you just stumbled upon this blog and would like to start from the beginning, here is the overview and here the first part (the latter might be especially helpful if you are not very familiar with the Kinect 2 or the Oculus Rift).
Creating content this way is a big step forward from what has been done before. In a world where we have usable VR glasses, and where we can (and want to) view content from any side and angle, we need to capture that content in a new way. The possible applications are wide-ranging: filming your children or parents so that, years later, you can really feel like you are there with them, sitting right next to them, walking around them; or creating filmed content for video games, instead of going the traditional (and time-consuming) route of building 3D-models by hand and animating them either manually or with very expensive motion-capture technology.
This could also make chatting over long distances more immersive. Video chat has come a long way, but we still don’t feel like we are in the same room with the other person. With holographic real-time recordings filmed from every angle, we could be one step closer to solving the long-distance problem in a world that is only going to get more global.
There are a lot of things that can go wrong, however, when we use more than one Kinect to create a 3D-model. First we have to decide which pixels belong to the person being filmed. And if we use two cameras, for instance one from the front and one from the back, we have to decide which points in our 3D-space belong to the person and, in overlapping areas, which camera recorded the better (more accurate) image.
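As a rough illustration of that per-point decision, here is a minimal sketch. It uses a simple distance heuristic (Kinect 2 depth accuracy degrades with range), which is an assumption for illustration, not the project’s actual algorithm; a real pipeline would also weigh the viewing angle against the surface normal:

```python
import numpy as np

def pick_better_sample(point, cam_positions):
    """For a 3D point seen by several cameras, pick the camera that is
    closest to it, since depth accuracy degrades with distance.
    Simplified heuristic sketch only."""
    dists = [np.linalg.norm(point - c) for c in cam_positions]
    return int(np.argmin(dists))

# Hypothetical setup: two cameras, front and back, 2 m from the origin.
cams = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, -2.0])]
p = np.array([0.0, 1.2, 0.5])   # a point on the person's front
print(pick_better_sample(p, cams))  # -> 0, the front camera is closer
```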
One of the big problems is the overlap area (the fringe of the person as seen from each camera). There the accuracy is quite bad, and two cameras are not enough to create a good recording. This is why this project will use four cameras in total, placed in the corners of a square and pointing towards its center.
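To merge the four recordings, each camera’s point cloud has to be transformed into one shared world frame using the camera’s known position and orientation. The sketch below builds a standard look-at rotation for cameras in the corners of a hypothetical 4 m × 4 m square; the axis conventions and dimensions are assumptions, and a real setup would calibrate the extrinsics rather than derive them from ideal placement:

```python
import numpy as np

def look_at_extrinsics(cam_pos, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Build a camera-to-world rotation for a camera at cam_pos looking
    at target (standard look-at construction; axis conventions assumed)."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, forward], axis=1)  # columns: x, y, z axes

def to_world(points_cam, R, cam_pos):
    """Transform an (N, 3) point cloud from camera space into the shared
    world frame so the four clouds can be merged."""
    return points_cam @ R.T + cam_pos

# Four cameras in the corners of a 4 m x 4 m square, all aimed at the center.
half = 2.0
corners = [np.array([ half, 1.0,  half]),
           np.array([-half, 1.0,  half]),
           np.array([-half, 1.0, -half]),
           np.array([ half, 1.0, -half])]

R0 = look_at_extrinsics(corners[0])
print(to_world(np.zeros((1, 3)), R0, corners[0]))  # camera origin -> its world position
```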
This means we will get good data for almost all areas. One area that will be missing (but could be covered by adding a fifth camera) is the top of the head; for several reasons it is not advisable to mount the cameras so high that they could, in theory, also see the top of each person’s head. Even with four cameras there will be “shadow” areas that no camera covers.
Think of a person holding their hands as if cradling a child (forming an O-shape in front of their body): a camera pointed at that person from the front cannot see the part of the body that is covered by an arm. We could add more cameras at higher and lower positions, but that would still leave uncovered areas and would drive the budget up quite a bit.
Therefore we can either leave that “shadow” space as it is, or try to map a previously created 3D-model of the person onto the current skeletal position of the person’s body and fill in the missing information from that. Of course, that means we need to create this 3D-model first.
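The core of that fill-in idea can be sketched very roughly: model points bound to a skeleton joint follow that joint between the pose at scan time and the current frame. Everything here is hypothetical illustration data, and a real solution would use per-bone rotations (skinning), not just translation:

```python
import numpy as np

def fill_shadow_points(model_points, bound_joint_rest, bound_joint_now):
    """Minimal sketch: pre-scanned model points bound to one skeleton
    joint are shifted by that joint's displacement between the rest
    pose (at scan time) and the current frame. A full solution would
    apply per-bone rotations (linear blend skinning or similar)."""
    offset = bound_joint_now - bound_joint_rest
    return model_points + offset

# Hypothetical data: two pre-scanned forearm points bound to the elbow joint.
rest_elbow = np.array([0.3, 1.1, 0.0])
now_elbow  = np.array([0.35, 1.0, 0.1])
forearm = np.array([[0.3, 1.1, 0.0], [0.4, 1.05, 0.0]])
print(fill_shadow_points(forearm, rest_elbow, now_elbow))
```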
Here are some caveats about what could go wrong with the project:
- The 30 fps of the recording is too slow for 3D-models (this is actually not a concern for these recordings: even 24 fps is fine for film. The content IN the virtual world can be animated more slowly, as long as the head motion is fluid)
- Automatically fusing the 3D-space of four Kinect cameras will leave many artefacts in the recording (this is a major concern, as it can be very hard to determine automatically where the body ends and how to smooth the edges)
- The resolution of the depth-image is too low for a good 3D-model (this could be a problem, and it should be clear that not every eyelash will be reproduced by such a low-resolution camera over a longer distance; but the model will have enough depth-resolution as long as we don’t expect the viewer to get very close. It should also be possible to enhance the result with smoothing and tessellation).
- The resolution of the color-image is too low for a good texture for the 3D-model (this is also a concern and cannot easily be helped with algorithms, as missing details are simply missing in the data. One solution could be to place 4K cameras alongside the Kinect cameras and use their images for the texture instead. However, since this project is supposed to produce a demo first, the Kinect’s color image will be more than sufficient, and other cameras can still be added later).
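The smoothing mentioned for the depth-image can be illustrated with a minimal sketch, here a hand-rolled 3×3 median filter (a common, but here assumed, choice for suppressing single-pixel depth noise; the Kinect 2 reports missing depth as 0, which the sketch leaves untouched):

```python
import numpy as np

def smooth_depth(depth, ksize=3):
    """Hand-rolled median filter for a depth image: a cheap way to
    suppress flying-pixel noise before meshing. Invalid pixels
    (value 0 = missing depth) are kept as-is and excluded from medians."""
    pad = ksize // 2
    padded = np.pad(depth, pad, mode="edge")
    out = depth.astype(float).copy()
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            if depth[y, x] == 0:
                continue  # keep invalid pixels untouched
            window = padded[y:y + ksize, x:x + ksize]
            valid = window[window > 0]
            out[y, x] = np.median(valid)
    return out

# Tiny example: one noisy spike in an otherwise flat surface at 1.5 m.
d = np.full((5, 5), 1500.0)
d[2, 2] = 3000.0  # flying-pixel artefact
print(smooth_depth(d)[2, 2])  # -> 1500.0
```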
All in all it should be said that the underlying algorithms can be used with other resolutions as well. So if the next generation of depth-cameras comes out with higher resolution and accuracy, the technology will already be there – waiting.