More experiments by Avi Singh are available online (https://avisingh599.github.io/deeplearning/visual-qa/), where several models are compared: a simple bag-of-words encoding for language combined with a CNN for images, an LSTM-only model, and an LSTM+CNN model similar to the one discussed in this recipe. The blog post also discusses the training strategy used for each model.
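The bag-of-words baseline from that comparison is easy to picture: the question is reduced to a fixed-size word-count vector, the image to a precomputed CNN feature vector, and the two are simply concatenated and passed to a classifier over candidate answers. A minimal NumPy sketch follows; the vocabulary, feature sizes, and random weights are made up purely for illustration (in a real model the image features would come from a pretrained CNN and the classifier weights would be learned):

```python
import numpy as np

# Toy vocabulary for the bag-of-words question encoding (illustrative only)
vocab = ["what", "color", "is", "the", "cat", "dog"]
word_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(question):
    """Encode a question as a word-count vector over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    for word in question.lower().split():
        if word in word_index:
            vec[word_index[word]] += 1
    return vec

# Stand-in for CNN image features (in practice, e.g. 4096-dim fc activations)
image_features = np.random.rand(8)

# Fuse language and vision by simple concatenation
q_vec = bag_of_words("What color is the cat")
joint = np.concatenate([q_vec, image_features])

# A linear classifier over candidate answers (weights would be learned)
num_answers = 3  # e.g. ["black", "white", "brown"]
W = np.random.rand(num_answers, joint.shape[0])
scores = W @ joint
predicted = int(np.argmax(scores))
print(predicted)
```

The LSTM+CNN variant replaces the bag-of-words encoder with an LSTM over word embeddings, but the fusion step (concatenating or merging the two feature vectors before the classifier) is the same idea.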
In addition, interested readers can find online (https://github.com/anujshah1003/VQA-Demo-GUI) a nice GUI built on top of Avi Singh's demo that lets you interactively load images and ask questions about them. A YouTube video is also available (https://www.youtube.com/watch?v=7FB9PvzOuQY).