Discussion

plastic041: > Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

And it doesn't care about robots.txt.

Flux159: This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first, then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, i.e. tables or other information being lost?

Then it uses langchain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update that to Gemini 3 Flash in the examples.
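To make the preprocessing question concrete, here is a rough sketch of why stripping markup before the LLM call shrinks the input, and how table cells can survive it. This is not turndown and not the library's actual pipeline: `TextExtractor` and `html_to_text` are hypothetical names, the converter uses only Python's standard `html.parser`, and the output is only markdown-like.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for an HTML-to-markdown step (not turndown).

    Keeps visible text, drops tags, attributes, scripts, and styles,
    and writes table rows as pipe-separated lines so cells are not lost.
    """

    SKIPPED = {"script", "style"}
    BLOCKS = {"h1", "h2", "h3", "p", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "tr":
            self.parts.append("\n|")  # start a new table row
        elif tag in self.BLOCKS:
            self.parts.append("\n")  # start a new text block

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in ("td", "th"):
            self.parts.append(" |")  # close the current cell

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(" " + data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


html = (
    '<div class="product-card" data-id="42"><style>.x{color:red}</style>'
    '<h2 class="title">Widget</h2><table><tr><th>Size</th><th>Price</th></tr>'
    "<tr><td>S</td><td>$5</td></tr></table></div>"
)
text = html_to_text(html)
print(text)
print(len(html), "->", len(text), "characters")
```

Character count is only a proxy for tokens, and a real converter has to deal with nested tables, `colspan`, links, and images, which is where the information loss asked about above is most likely to show up.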

zx8080: Robots.txt anyone?
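For anyone who wants to add that check in front of a scraper, a minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt content, the `price-bot` user agent, and the `shop.example` URLs are made-up examples, and `is_allowed` is a hypothetical helper; nothing here is part of the library under discussion.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /products/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "price-bot", "https://shop.example/products/1"))
print(is_allowed(ROBOTS_TXT, "price-bot", "https://shop.example/checkout/cart"))
```

In a real crawler the file would be fetched once per host (for example with `set_url()` and `read()`) and cached rather than parsed from a literal string.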

sheept: > LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing tag helps the model keep track of where it is during inference, so it is less error-prone.

andrew_zhong: Yeah, that's a good observation. XML's closing tags give the model structural anchors during generation, so it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting, the more likely the model loses track of brackets.

We see this especially with arrays of objects where each object has optional nested fields. The model will get 18 items right and then drop a closing bracket on item 19. That's why we put effort into the repair/recovery/sanitization layer: validate field by field and keep what's valid rather than throwing everything out.
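The "keep what's valid" idea can be sketched roughly as below. This is an illustration under stated assumptions, not the library's actual repair layer: `close_brackets`, `salvage`, and `SCHEMA` are hypothetical names, the schema format is invented for the example, and the bracket repair only covers the simplest failure, where the output stops before the final closers are written.

```python
import json


def close_brackets(text):
    """Append closers for any brackets/braces still open at the end of text.

    Only handles output that was cut off at the end; a bracket dropped in the
    middle of the document needs a more tolerant parser.
    """
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return text + "".join(reversed(stack))


# Hypothetical schema: field name -> (accepted types, required?)
SCHEMA = {
    "name": (str, True),
    "price": ((int, float), True),
    "tags": (list, False),
}


def salvage(items, schema):
    """Validate each item field by field; keep valid items, report the rest."""
    kept, dropped = [], []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            dropped.append((index, "not an object"))
            continue
        clean, problem = {}, None
        for field, (types, required) in schema.items():
            value = item.get(field)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, types) and not isinstance(value, bool):
                clean[field] = value
            elif required:
                problem = f"missing or invalid required field '{field}'"
                break
            # An invalid optional field is dropped; the item itself survives.
        if problem:
            dropped.append((index, problem))
        else:
            kept.append(clean)
    return kept, dropped


# Model output that stops before the closing "}]".
truncated = (
    '[{"name": "A", "price": 9.5}, {"name": "B", "price": "n/a"}, '
    '{"name": "C", "price": 3, "tags": "oops"'
)
items = json.loads(close_brackets(truncated))
kept, dropped = salvage(items, SCHEMA)
print(kept)
print(dropped)
```

Here item B is rejected because its required `price` is not a number, while item C is kept with its malformed optional `tags` removed, so one bad field does not discard the whole batch.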

andrew_zhong: Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated, with things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring: comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

dmos62: What's your experience with not getting blocked by anti-bot systems? I see you have custom patches for that.

andrew_zhong: The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated: fixing CDP leaks, removing automation flags, etc. For sites behind Cloudflare or DataDome, that alone usually isn't enough, and you'll need residential proxies and proper browser fingerprints on top. For those cases, the library supports proxy configuration and connecting to remote scraping browsers via WebSocket.

GitHub - lightfeed/extractor: Using LLMs and AI browser automation to robustly extract web data · GitHub

github.com
