If you build a web crawler, how do you check whether a URL has been crawled before?
I'm building a web crawler for one of our customers. They run a news website and publish about 1,000 new articles per day, so I figured I could use Redis to store the URLs in a set, with a key schema like "domain:1":
redis.sadd("domain:1", url_string)
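To make the dedup check concrete, here is a minimal sketch of that approach. `SADD` returns 1 when the member was newly added and 0 when it was already in the set, so one round trip both records the URL and tells you whether it was seen before. The function name `is_new_url` and the `FakeRedis` stand-in (included only so the snippet runs without a Redis server) are my own illustrations, not part of redis-py:

```python
def is_new_url(r, domain_id: int, url: str) -> bool:
    """Record url in the per-domain set and report whether it is new.

    SADD returns 1 if the member was added, 0 if it already existed,
    so a single call both checks and records the URL. With a real
    client you would pass e.g. r = redis.Redis(host="localhost").
    """
    return r.sadd(f"domain:{domain_id}", url) == 1


class FakeRedis:
    """Tiny in-memory stand-in for redis.Redis, for local testing only."""

    def __init__(self):
        self._sets: dict[str, set[str]] = {}

    def sadd(self, key: str, member: str) -> int:
        s = self._sets.setdefault(key, set())
        if member in s:
            return 0
        s.add(member)
        return 1
```

If you only want to test membership without inserting, `SISMEMBER` does that, but the `SADD` return value saves a round trip when you intend to record the URL anyway.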
It works well enough for me now, but I expect the set will get unwieldy after a month or so of growth.

Is there a better solution for this? Any hints?