  1. Impact of log4j CVE-2021-44228 on heritrix3? · Issue #451 ... - GitHub

    Dec 10, 2021 · This is an issue to track the impact of a recent log4j remote exploit (CVE-2021-44228) in the context of heritrix3. My brief read of the situation is that log4j versions 2.0.x through 2.14.x (see …

  2. GitHub - internetarchive/heritrix3: Heritrix is the Internet Archive's ...

    Heritrix is designed to respect the robots.txt exclusion directives and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, …

  3. Heritrix frontier files manipulation tool. - GitHub

    Heritrix frontier files manipulation tool. Contribute to internetarchive/strainer development by creating an account on GitHub.

  4. warctools/README.md at master - GitHub

    Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) - warctools/README.md at master · internetarchive/warctools

  5. heritrix3/README.md at master - GitHub

    Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. - heritrix3/README.md at master · internetarchive/heritrix3

  6. GitHub - internetarchive/openlibrary-bots: A repository of cleanup …

    Open Library is an open, editable library catalog, building towards a web page for every book ever published. This repository contains cleanup bots implementing the openlibrary-client which allow …

  7. Re-instate biblio.com affiliate link(s) · Issue #960 - GitHub

    May 11, 2018 · Some of them seemed to simply exploit OL links to harvest user data/drop cookies, etc. Not sure if that was the case with biblio. There ought to be a less invasive sort of affiliation possible …

  8. Summarize web archive capture index (CDX) files. - GitHub

    Summarize local CDX files or remote ones over HTTP; handle gz and bz2 compression seamlessly; handle CDX data input to STDIN from pipe; support Internet Archive Petabox web item …

  9. GitHub - internetarchive/emularity-bios: archive.org software emulation

    archive.org software emulation. Contribute to internetarchive/emularity-bios development by creating an account on GitHub.

  10. GitHub - internetarchive/tarb_soft404: Soft-404 detection system for …

    This repository is a comprehensive toolset for soft 404 detection, encompassing data scraping, model training, web user interfaces, and inference capabilities. It utilizes tree-based models and …