With Python, Beautiful Soup works very well for correcting broken HTML. I used it many years ago and I'm sure it has improved a lot since.
With Java I'm using Jsoup, with the same great experience and ease of use. A Java scraper that fetches many web pages with HttpClient is also much easier to build, thanks to Java's threading model.
The last time I used Python for heavy-duty scraping, I had to dig deep to find thread-safe classes so that it could scrape many pages at the same time. I got it working, and it scraped 7-10 GB of BBC.co.uk content per night (this was 7 years ago, so it was considered a great scraper given the limitations of the server it was running on). It was somewhat messy, though, because Python didn't have that many thread-safe classes at the time.
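
For anyone trying the same thing in Python today, the standard library's `concurrent.futures` module takes care of most of the thread pool bookkeeping. Below is a minimal sketch of fetching many pages at once. It is not the original scraper: `fetch`, `scrape_all`, the `fetch_page` parameter, and the worker count of 8 are all made up for illustration, and a real crawler would also want retries, rate limiting, and robots.txt handling.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, fetch_page=fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads.

    Returns a dict mapping each URL to its bytes, or to the exception
    raised while fetching it, so one bad page does not stop the run.
    """
    urls = list(urls)  # iterated twice below, so materialize it first

    def safe_fetch(url):
        try:
            return fetch_page(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Calling `scrape_all(list_of_urls)` blocks until every page has been fetched or has failed. Each worker only touches its own URL and its own result, so no shared state needs locking here.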