DIYgod


Elegantly using Cloudflare WARP to deal with the anti-crawling challenge of RSSHub

🕊️ This article is dedicated to a more open internet.


This started when I saw @geekbb's tweet introducing WARP. WARP has been out for a long time, but I had little use for it: for protecting IP privacy it is less useful than iCloud Private Relay, and I don't need it to get around network restrictions. Then I realized that I do have a need to hide my IP after all.

Over the years of developing RSSHub, I have found that very few sites provide public APIs, and many enforce strict anti-crawling controls to restrict access to their content. Some sites block an IP that sends too many requests, while others block the IP ranges of common cloud server providers outright. As a result, even fetching the latest few updates from a site has become very difficult.

This situation calls for proxies, but dedicated crawler proxies are usually expensive and poor value for money. It would be great if RSSHub could draw on the unlimited traffic and abundant IP resources of Cloudflare WARP. RSSHub already supports the common proxy protocols, so as long as WARP can be wrapped as an ordinary proxy, RSSHub can use it.

The official client is not convenient to use directly in a command-line environment, but an idea this obvious must already have been implemented by someone. Sure enough, I found a prepackaged Docker image on GitHub.

Then, to run the proxy, just add a service like this to RSSHub's docker-compose.yml:

warp-socks:
    image: monius/docker-warp-socks:latest
    privileged: true
    volumes:
        - /lib/modules:/lib/modules
    cap_add:
        - NET_ADMIN
        - SYS_ADMIN
    sysctls:
        net.ipv6.conf.all.disable_ipv6: 0
        net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
        test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
        interval: 30s
        timeout: 10s
        retries: 5

Finally, add a PROXY_URI environment variable to the RSSHub service:

PROXY_URI: 'socks5h://warp-socks:9091'
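
The scheme in this URI matters: by curl's convention, the trailing `h` in `socks5h` asks the proxy to resolve hostnames, so DNS lookups go through WARP as well. As a minimal sketch (this is not RSSHub's own code, and `describe_proxy` is a hypothetical helper name), the following Python snippet splits such a URI into its parts:

```python
from urllib.parse import urlsplit


def describe_proxy(uri: str) -> dict:
    """Split a proxy URI into scheme, host, port and a remote-DNS flag."""
    parts = urlsplit(uri)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"proxy URI needs a host and a port: {uri!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # By curl's convention, socks5h and socks4a let the proxy resolve hostnames.
        "remote_dns": parts.scheme in ("socks5h", "socks4a"),
    }


info = describe_proxy("socks5h://warp-socks:9091")
print(info)
```

The host part, `warp-socks`, is simply the service name from docker-compose.yml, which Docker's internal DNS resolves for containers on the same network.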

For testing, I chose a hotukdeals route (the UK version of Dealabs) that I use often. The site blocks all DigitalOcean IPs, so the route had been returning 403 all along.

With WARP, I can access it without any problem.

I also noticed that WARP reports a new IP every time it is restarted. I haven't had time to verify it, but I suspect the IP also changes automatically and fairly often, which would be good news for getting past anti-crawling measures.
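
One way to check this would be to record the `ip=` field of Cloudflare's trace endpoint (the same URL the healthcheck uses) before and after a restart. Here is a minimal sketch in Python, assuming the endpoint returns plain `key=value` lines. The sample responses and IP addresses are made up (taken from documentation ranges), and `parse_trace` and `ip_changed` are hypothetical helper names.

```python
def parse_trace(text: str) -> dict:
    """Parse the key=value lines of a cdn-cgi/trace response into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            fields[key.strip()] = value.strip()
    return fields


def ip_changed(before: str, after: str) -> bool:
    """Return True if two trace responses report different exit IPs."""
    return parse_trace(before).get("ip") != parse_trace(after).get("ip")


# Made-up sample responses; the IPs come from documentation ranges.
sample_before = "ip=203.0.113.7\nwarp=on\n"
sample_after = "ip=198.51.100.23\nwarp=on\n"

print(parse_trace(sample_before))
print(ip_changed(sample_before, sample_after))
```

The real response text would come from requesting the trace URL through the proxy. Logging the parsed `ip` value at intervals would also show whether it rotates without a restart.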

You can also customize the WireGuard configuration further, for example by using the paid WARP+ tier or custom endpoints, which may give better results.
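
A WireGuard configuration is an INI-style file with an `[Interface]` section and a `[Peer]` section, so switching to a custom endpoint is a one-line change. Here is a minimal sketch in Python, assuming a config with a single `[Peer]` section and no repeated keys. The keys and endpoint addresses are placeholders, and `set_endpoint` is a hypothetical helper. How the edited file is handed to the container depends on the image and is not shown.

```python
import configparser
import io


def set_endpoint(config_text: str, endpoint: str) -> str:
    """Return config_text with the [Peer] Endpoint replaced by endpoint."""
    # Only "=" is a delimiter, so values containing ":" stay intact;
    # interpolation is off so "%" in values is taken literally.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep WireGuard's key casing (e.g. "PrivateKey")
    parser.read_string(config_text)
    parser["Peer"]["Endpoint"] = endpoint
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Placeholder config: the keys and addresses are not real.
sample = """[Interface]
PrivateKey = <private-key>
Address = 172.16.0.2/32

[Peer]
PublicKey = <public-key>
AllowedIPs = 0.0.0.0/0
Endpoint = old.example.invalid:2408
"""

updated = set_endpoint(sample, "new.example.invalid:2408")
print(updated)
```

A config that repeats a key (for example, two `Address` lines) would need a different approach, because `configparser` rejects duplicate keys by default.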

To generate the WireGuard configuration file, you can use the tool linked in the original post (the embedded GitHub repository could not be loaded here).

To accumulate WARP+ traffic quota and screen endpoints, there are tools as well (the link did not survive in this copy).

Some reports say WARP+ is not noticeably faster than WARP (see "WARP, WARP+ Speed Comparison, and WARP Speed Limit"), but whether it makes any difference against anti-crawling measures still needs to be verified.

If everything goes well, many routes on the official RSSHub instance that are currently blocked by strict anti-crawling measures should become usable again. I will verify this and post an update here in a few days.
