Email Harvester

Also known as: Email scraper, Address harvester

A bot or script that crawls the web to scrape email addresses from public pages, directories, and leaked databases, building target lists for spam and phishing.

What is an email harvester?

An email harvester is an automated program that crawls the public internet for anything that looks like an email address, pattern-matching strings that contain an @ followed by a plausible domain, and compiles the results into target lists. The lists are then sold, leaked, or used directly to send spam and phishing. Harvesters are one of the oldest forms of abusive web crawling and a major reason many professional websites no longer put raw mailto: addresses on contact pages.
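
The pattern-matching step can be turned around and used defensively, to audit your own pages for addresses a harvester would pick up. The sketch below is a minimal illustration, not a full address parser: the regular expression, the `find_exposed_addresses` helper, and the sample page are all assumptions made for the example.

```python
import re

# Deliberately simple pattern: local part, "@", then a dotted domain.
# It approximates what a naive scraper matches; it is not an RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_exposed_addresses(html: str) -> list[str]:
    """Return the unique email-like strings in a page, in first-seen order."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.finditer(html):
        # Lowercase so Press@example.com and press@example.com count once.
        seen.setdefault(match.group(0).lower(), None)
    return list(seen)


# Hypothetical page content used only for illustration.
sample_page = """
<p>Contact <a href="mailto:Press@example.com">press@example.com</a></p>
<p>Support: support [at] example [dot] com</p>
"""

print(find_exposed_addresses(sample_page))  # ['press@example.com']
```

Only the raw mailto address is reported; the bracketed "[at]" form slips past this simple pattern, which is the effect basic obfuscation relies on.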

Where harvesters pull addresses from

  • Published contact pages and staff directories — especially university, government, and corporate sites
  • Mailing list archives and forum post footers
  • WHOIS records for domain registrants (one reason many registrars now redact registrant contact details by default)
  • Leaked breach databases redistributed on underground forums
  • Social media profiles that expose contact info
  • Git commit metadata on public hosts such as GitHub, where each commit records the author's email address unless a no-reply address is configured

How it looks in server logs

Harvester traffic is a specific flavor of web crawler abuse: a single IP or small set of IPs pulling thousands of pages per minute, ignoring robots.txt, focusing on pages likely to contain contact info, and sometimes identifying itself with a forged User-Agent that mimics a legitimate search engine. Heavy harvester activity can overwhelm small sites just through bandwidth, even without any downstream abuse.
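
One way to spot the request-rate pattern described above is to count requests per IP per minute in an access log. The sketch below assumes Common or Combined Log Format; the `busiest_clients` name and the default threshold of 1000 requests per minute are illustrative choices, not recommended values.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape (Common/Combined Log Format), for example:
# 203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /staff HTTP/1.1" 200 512
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "')


def busiest_clients(lines, threshold=1000):
    """Return {(ip, minute): count} for every IP that reaches `threshold`
    requests within a single clock minute."""
    per_minute = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        ip, stamp = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        per_minute[(ip, when.strftime("%Y-%m-%d %H:%M"))] += 1
    return {key: n for key, n in per_minute.items() if n >= threshold}
```

A typical use would be `busiest_clients(open("access.log"))`, where the file name is hypothetical. A high count alone does not prove harvesting; it is a starting point for checking which pages the client requested and whether it honored robots.txt.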

Defense

Common countermeasures: obfuscate email addresses on public pages (render them with JavaScript, or display them as name [at] example [dot] com), offer a contact form instead of a published address, enforce per-IP rate limits, and block the IPs of known harvesters at the CDN or WAF layer. Running unfamiliar crawler IPs through an IP abuse report checker can show whether they belong to known harvester infrastructure.
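
Per-IP rate limiting is usually configured at the CDN, WAF, or web server, but the underlying idea is small enough to sketch. The class below is a minimal in-process sliding-window limiter; the `SlidingWindowLimiter` name and the default of 60 requests per 60 seconds are assumptions for illustration, not a recommended policy.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client IP within `window` seconds."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock              # injectable so behavior can be tested
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over budget: the caller might answer HTTP 429
        recent.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window=10.0)
print([limiter.allow("203.0.113.7") for _ in range(3)])  # [True, True, False]
```

This sketch keeps state in one process and never evicts idle IPs, so a production deployment would more likely rely on the rate-limiting features of the CDN or WAF already in place.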

Frequently Asked Questions

To find out whether your address has already been harvested, check it against breach-aggregation services like Have I Been Pwned (`haveibeenpwned.com`) and IntelX. They maintain searchable indexes of email addresses that have appeared in known breaches, paste sites, and underground market dumps. If your address shows up in even one breach, assume it has been resold across spammer lists by now. The practical defense is per-service email aliases (Apple Hide My Email, SimpleLogin, Firefox Relay), so that a leaked alias can be retired without affecting the others.
Obfuscating an address helps only modestly. Simple obfuscation, such as rendering the address with JavaScript or writing "name [at] example [dot] com", defeats naive regex scrapers that just look for `@` patterns. More sophisticated harvesters run JavaScript, parse `at`/`dot` decoy text, and even decode common Unicode tricks. Obfuscation therefore reduces harvest volume but does not eliminate it. A contact form behind a CAPTCHA is much stronger protection than any mailto-encoding scheme.
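
The limits of text obfuscation can be checked against your own pages with a few lines of code. The sketch below is illustrative only: `obfuscate`, `normalize`, and the regular expression are hypothetical helpers written for this example, and real scrapers vary widely in what they handle.

```python
import re

# Same style of naive pattern a basic scraper might use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def obfuscate(address: str) -> str:
    """Rewrite name@example.com as 'name [at] example [dot] com'."""
    local, _, domain = address.partition("@")
    return f"{local} [at] {domain.replace('.', ' [dot] ')}"


def normalize(text: str) -> str:
    """Undo bracketed at/dot spellings, as a less naive scraper might."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    return re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)


shown = obfuscate("name@example.com")
print(shown)                               # name [at] example [dot] com
print(EMAIL_RE.findall(shown))             # []
print(EMAIL_RE.findall(normalize(shown)))  # ['name@example.com']
```

The naive pattern finds nothing in the obfuscated text, but one normalization pass recovers the address, which is why obfuscation lowers harvest volume without eliminating it.
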
Replacing a published address with a contact form helps significantly. A contact form with CAPTCHA, rate limiting, and anti-spam token verification (Turnstile, hCaptcha, reCAPTCHA) blocks both harvesting (your real address never appears in the page) and the follow-up spam (bots can't easily submit through the form). It also lets you add filtering and triage on the receiving side. The downside is friction for legitimate contacts; pairing the form with a single visible address for transparency-required cases (legal, press) is a common compromise.
Most harvested addresses now come from breach data, by a wide margin. Modern harvesters spend more time aggregating leaked databases than crawling the live web, because breach corpora contain hundreds of millions of validated addresses with associated context (names, passwords, forum activity) that makes targeting much more effective. Live-web scraping is now mostly used for niche targeting (pulling staff directories from a specific company before a BEC attack) rather than bulk list-building.
Removing old addresses from your website is worthwhile for any address that no longer needs to be public, especially mailto links for individual staff members who have since left, generic addresses on archive pages, and addresses in WHOIS records (most registrars now default to redacted WHOIS, but legacy registrations may still be exposed). Once an address is on a spammer list, removing it from the website doesn't undo that, but it does prevent future scrapers from adding it to fresh lists.