The global pooling step you describe is where a lot of the expressiveness gets compressed. One thing worth experimenting with is differentiable pooling (DiffPool) in place of mean/max global pooling when the graph structure varies significantly across images: rather than collapsing all nodes in one shot, it learns a soft assignment of nodes to clusters and coarsens the graph in stages. The gain from message passing over plain CNN feature maps tends to be largest on images with irregular spatial structure, like satellite imagery or medical scans, where the grid assumption behind convolutions is weakest.
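In case it helps make the DiffPool suggestion concrete, here is a rough pure-Python sketch of a single coarsening step. It only shows the pooling arithmetic (softmax over assignment logits, then `S^T X` and `S^T A S`). In DiffPool proper the logits and node embeddings come from two separate GNNs and there are auxiliary losses, none of which is shown here. The function names, the toy graph, and the hand-picked logits are all made up for illustration.

```python
import math


def softmax_rows(logits):
    """Row-wise softmax: each node gets a distribution over clusters."""
    out = []
    for row in logits:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def diffpool_step(adj, feats, assign_logits):
    """One DiffPool-style coarsening step (illustrative, no learned weights).

    adj:           n x n adjacency matrix
    feats:         n x d node features (stand-in for GNN embeddings)
    assign_logits: n x k unnormalised cluster scores (stand-in for a pooling GNN)
    """
    s = softmax_rows(assign_logits)           # n x k soft assignment
    st = transpose(s)                         # k x n
    pooled_feats = matmul(st, feats)          # k x d  (S^T X)
    pooled_adj = matmul(matmul(st, adj), s)   # k x k  (S^T A S)
    return pooled_feats, pooled_adj, s


# Toy example: a 4-node path graph 0-1-2-3, pooled down to 2 clusters.
adj = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Hand-picked logits (an assumption): nodes 0,1 lean to cluster 0, nodes 2,3 to cluster 1.
assign_logits = [[5.0, -5.0], [5.0, -5.0], [-5.0, 5.0], [-5.0, 5.0]]

pooled_feats, pooled_adj, s = diffpool_step(adj, feats, assign_logits)
```

Because every row of the assignment matrix sums to 1, the total feature mass and total edge weight are preserved through the step, which is part of why this loses less than a single mean/max over all nodes. Stacking a few of these steps before the final readout is the usual way to use it.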