Package: boilerpipeR 1.3.2
boilerpipeR: Interface to the Boilerpipe Java Library
Generic extraction of the main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
Authors:
boilerpipeR_1.3.2.tar.gz
boilerpipeR_1.3.2.zip (r-4.7), boilerpipeR_1.3.2.zip (r-4.6), boilerpipeR_1.3.2.zip (r-4.5)
boilerpipeR_1.3.2.tgz (r-4.6-any), boilerpipeR_1.3.2.tgz (r-4.5-any)
boilerpipeR_1.3.2.tar.gz (r-4.7-any), boilerpipeR_1.3.2.tar.gz (r-4.6-any)
boilerpipeR_1.3.2.tgz (r-4.6-emscripten)
Install 'boilerpipeR' in R:

    install.packages('boilerpipeR', repos = c('https://mannau.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/mannau/boilerpiper/issues
- content: WordPress-generated web page (retrieved from the Quantivity blog <https://quantivity.wordpress.com>). The page is stored as a character vector and is ready to be passed to an extractor.
Last updated from: 5cbc092cac. Checks: 7 NOTE, 2 OK. Indexed: yes.
| Target | Result | Time |
|---|---|---|
| linux-devel-x86_64 | NOTE | 123 |
| source / vignettes | OK | 174 |
| linux-release-x86_64 | NOTE | 112 |
| macos-release-arm64 | NOTE | 148 |
| macos-oldrel-arm64 | NOTE | 150 |
| windows-devel | NOTE | 80 |
| windows-release | NOTE | 91 |
| windows-oldrel | NOTE | 60 |
| wasm-release | OK | 100 |
Exports: ArticleExtractor, ArticleSentencesExtractor, CanolaExtractor, DefaultExtractor, Extractor, KeepEverythingExtractor, LargestContentExtractor, NumWordsRulesExtractor
Dependencies: rJava
Readme and manuals
Help Manual
| Help page | Topics |
|---|---|
| Extract the main content from HTML files | boilerpipeR-package boilerpipe |
| A full-text extractor which is tuned towards news articles. | ArticleExtractor |
| A full-text extractor which is tuned towards extracting sentences from news articles. | ArticleSentencesExtractor |
| A full-text extractor trained on the 'krdwrd' Canola corpus (see <https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf>). | CanolaExtractor |
| WordPress-generated web page (retrieved from the Quantivity blog <https://quantivity.wordpress.com>). The page is stored as a character vector and is ready to be passed to an extractor. | content |
| A quite generic full-text extractor. | DefaultExtractor |
| Generic extraction function which calls boilerpipe extractors | Extractor |
| Marks everything as content. | KeepEverythingExtractor |
| A full-text extractor which extracts the largest text component of a page. | LargestContentExtractor |
| A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block). | NumWordsRulesExtractor |
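
The help topics above can be tried together in a short session. The following is a minimal sketch, not a definitive recipe: it assumes the package and a working Java runtime (needed by rJava) are installed, that the bundled `content` dataset loads via `data(content)`, and that `Extractor()` takes the extractor name as its first argument and the HTML text as its second. The variable names are illustrative only.

```r
library(boilerpipeR)

# Load the bundled example page (HTML stored as a character vector).
data(content)

# Run the news-tuned extractor and the generic extractor on the same page.
article_text <- ArticleExtractor(content)
default_text <- DefaultExtractor(content)

# Assumed calling convention for the generic entry point:
# the extractor is selected by name, followed by the HTML text.
via_generic <- Extractor("ArticleExtractor", content)

# Compare how much text each extractor keeps.
print(nchar(article_text))
print(nchar(default_text))
```

Comparing the lengths of the outputs is a quick way to see how aggressively each extractor strips boilerplate on a given page; which extractor works best depends on the site template.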
