That's the classic hash map: lookups are O(1) on average, which is cool, but it still costs O(n) space because every key has to be stored. Maybe you could take a look at a Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
It can give you false positives (but never false negatives), so the trade-off is the simplicity and exactness of the hash map against the much smaller memory footprint of the filter.
If you don't have to care about memory, the hash map is a good solution. If you don't need to be exact and the data gets very large, the Bloom filter is an elegant solution.
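To make the Bloom filter idea a bit more concrete, here is a minimal sketch in Python. The class name `BloomFilter`, the parameters `expected_items` and `false_positive_rate`, and the choice of SHA-256 with double hashing are all my own assumptions for illustration, not something a particular library prescribes. The sizing uses the standard formulas from the Wikipedia article.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and hashing scheme are assumptions."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen", False means "definitely not seen"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage would look roughly like `seen = BloomFilter(1_000_000)`, then `seen.add(url)` and `if url in seen: ...`. With these defaults the filter needs about 10 bits per expected item, no matter how long the URLs are, which is where the space saving over a hash map comes from.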
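The next thing you could do is store MD5 sums of the pages or links and check whether they have changed, so you only parse the versions that are actually different :) And if you take the domain + TLD into consideration as part of the key, hash collisions should practically never happen, because two different pages would have to share both the key and the hash :)
That still takes a lot of space, though :)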
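The MD5 check could be sketched like this in Python. The function name `has_changed` and the in-memory dict `seen_digests` are hypothetical names I made up for the example; a real crawler would probably keep the digests in a database, and would key them by whatever URL normalisation it already uses.

```python
import hashlib

# Hypothetical in-memory store: URL -> hex digest of the last body we parsed
seen_digests: dict[str, str] = {}


def has_changed(url: str, body: bytes) -> bool:
    """Return True if this body differs from the last one stored for the URL."""
    digest = hashlib.md5(body).hexdigest()
    if seen_digests.get(url) == digest:
        return False  # same content as last time, skip parsing
    seen_digests[url] = digest  # new or changed page, remember it
    return True
```

Each entry costs the URL plus a 32-character hex digest, so this is exact but still grows linearly with the number of pages. MD5 is fine here only because it is used for change detection, not for anything security related.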