
Impact of log4j CVE-2021-44228 on heritrix3? · Issue #451 ... - GitHub
Dec 10, 2021 · This is an issue to track the impact of a recent log4j remote exploit (CVE-2021-44228) in the context of heritrix3. My brief read of the situation is that log4j versions 2.0.x through 2.14.x (see …
GitHub - internetarchive/heritrix3: Heritrix is the Internet Archive's ...
Heritrix is designed to respect the robots.txt exclusion directives and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, …
Heritrix frontier files manipulation tool. - GitHub
Heritrix frontier files manipulation tool. Contribute to internetarchive/strainer development by creating an account on GitHub.
warctools/README.md at master - GitHub
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) - warctools/README.md at master · internetarchive/warctools
heritrix3/README.md at master - GitHub
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. - heritrix3/README.md at master · internetarchive/heritrix3
GitHub - internetarchive/openlibrary-bots: A repository of cleanup …
Open Library is an open, editable library catalog, building towards a web page for every book ever published. This repository contains cleanup bots implementing the openlibrary-client which allow …
Re-instate biblio.com affiliate link(s) · Issue #960 - GitHub
May 11, 2018 · Some of them seemed to simply exploit OL links to harvest user data/drop cookies, etc. Not sure if that was the case with biblio. There ought to be a less invasive sort of affiliation possible …
Summarize web archive capture index (CDX) files. - GitHub
Summarize local CDX files or remote ones over HTTP; handle gz and bz2 compression seamlessly; handle CDX data piped to STDIN; support Internet Archive Petabox web item …
GitHub - internetarchive/emularity-bios: archive.org software emulation
archive.org software emulation. Contribute to internetarchive/emularity-bios development by creating an account on GitHub.
GitHub - internetarchive/tarb_soft404: Soft-404 detection system for …
This repository is a comprehensive toolset for soft 404 detection, encompassing data scraping, model training, web user interfaces, and inference capabilities. It utilizes tree-based models and …