In applications that require the locations of human subjects, such as human-computer interfaces, video conferencing, and security surveillance, localization is often performed using a single sensing modality. These single-modality techniques, such as microphone-array beamforming and video-based localization, are often prone to errors. In this chapter, a modular multimodal localization framework is constructed by combining multiple single-modality localizers using a Bayesian network. As a case study, a joint audio-video talker localization system for video conferencing is presented. The results show that the proposed multimodal method outperforms single-modality methods that rely only on audio or only on video, in terms of both accuracy and robustness.