Package 'boilerpipeR'

Title:	Interface to the Boilerpipe Java Library
Description:	Generic extraction of main text content from HTML files; removal of ads, sidebars and headers using the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. The extraction heuristics from boilerpipe perform robustly across a wide range of web site templates.
Authors:	See AUTHORS file.
Maintainer:	Mario Annau
License:	Apache License (== 2.0)
Version:	1.3.2
Built:	2025-01-21 03:24:25 UTC
Source:	https://github.com/mannau/boilerpiper

Help Index

boilerpipeR-package	Extract the main content from HTML files
ArticleExtractor	A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor	A full-text extractor which is tuned towards extracting sentences from news articles.
CanolaExtractor	A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).
content	WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as character and is ready to be extracted.
DefaultExtractor	A quite generic full-text extractor.
Extractor	Generic extraction function which calls boilerpipe extractors
KeepEverythingExtractor	Marks everything as content.
LargestContentExtractor	A full-text extractor which extracts the largest text component of a page.
NumWordsRulesExtractor	A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Extract the main content from HTML files

Description

boilerpipeR provides an interface to the boilerpipe Java library, created by Christian Kohlschutter (https://github.com/kohlschutter/boilerpipe). It implements robust heuristics to extract the main content from HTML files, removing unnecessary elements like ads, banners and headers/footers.

Author(s)

Mario Annau mario.annau@gmail

Examples

## Not run: 
data(content)
extract <- DefaultExtractor(content)
cat(extract)

## End(Not run)

A full-text extractor which is tuned towards news articles.

Description

For news articles, it achieves higher accuracy than DefaultExtractor.

Usage

ArticleExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- ArticleExtractor(content)

A full-text extractor which is tuned towards extracting sentences from news articles.

Description

A full-text extractor which is tuned towards extracting sentences from news articles.

Usage

ArticleSentencesExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- ArticleSentencesExtractor(content)

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

Description

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

Usage

CanolaExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- CanolaExtractor(content)

WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as character and is ready to be extracted.

Description

WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is saved as character and is ready to be extracted.

Author(s)

Mario Annau

References

https://quantivity.wordpress.com

Examples

#Data set has been generated as follows:
## Not run: 
library(RCurl)
url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
content <- getURL(url)
content <- iconv(content, "UTF-8", "ASCII//TRANSLIT")
save(content, file = "content.rda")

## End(Not run)

A quite generic full-text extractor.

Description

A quite generic full-text extractor.

Usage

DefaultExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- DefaultExtractor(content)

Generic extraction function which calls boilerpipe extractors

Description

This is the workhorse function that directly calls the boilerpipe Java library. It is typically called through the wrapper functions listed under the exname argument.

Usage

Extractor(exname, content, asText = TRUE, ...)

Arguments

`exname`	character specifying the extractor to be used. It can take one of the following values:
	`ArticleExtractor`: A full-text extractor which is tuned towards news articles.
	`ArticleSentencesExtractor`: A full-text extractor which is tuned towards extracting sentences from news articles.
	`CanolaExtractor`: A full-text extractor trained on the 'krdwrd' Canola corpus.
	`DefaultExtractor`: A quite generic full-text extractor.
	`KeepEverythingExtractor`: Marks everything as content.
	`LargestContentExtractor`: A full-text extractor which extracts the largest text component of a page.
	`NumWordsRulesExtractor`: A quite generic full-text extractor solely based upon the number of words per block.
`content`	Text content or URL as character
`asText`	logical; if TRUE (the default), content is treated as the actual text to be extracted. If FALSE, content is treated as a URL from which the HTML document is first downloaded and then extracted.
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

References

https://github.com/kohlschutter/boilerpipe
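
As an illustration of the arguments above, the following sketch calls Extractor directly with each documented exname value on the bundled content data set. It assumes that boilerpipeR and a working Java setup are installed and skips the calls otherwise. The variable names exnames and texts are illustrative and not part of the package. The URL call is left commented out because it needs network access.

```r
# Extractor names as documented for the `exname` argument.
exnames <- c("ArticleExtractor", "ArticleSentencesExtractor",
             "CanolaExtractor", "DefaultExtractor",
             "KeepEverythingExtractor", "LargestContentExtractor",
             "NumWordsRulesExtractor")

# Run only if boilerpipeR (which needs Java) is available.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)
  data(content)

  # Call the workhorse directly, once per extractor name.
  texts <- lapply(exnames, function(nm) Extractor(nm, content))
  names(texts) <- exnames

  # Number of characters each extractor kept.
  print(sapply(texts, function(x) sum(nchar(x))))

  # With asText = FALSE, `content` is treated as a URL (needs network access):
  # url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
  # extract <- Extractor("ArticleExtractor", url, asText = FALSE)
}
```

Calling Extractor by name is mainly useful when the extractor is chosen at run time; for a fixed choice, the wrapper functions such as ArticleExtractor(content) read more clearly.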

Marks everything as content.

Description

Marks everything as content.

Usage

KeepEverythingExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- KeepEverythingExtractor(content)
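
Because this extractor marks everything as content, its output can plausibly serve as a baseline for how much text another extractor discards. The following is a minimal sketch under that assumption. The helper retained_fraction is hypothetical and not part of boilerpipeR, and the package calls run only if boilerpipeR and Java are available.

```r
# Hypothetical helper: share of baseline characters kept by an extractor.
retained_fraction <- function(extracted, baseline) {
  sum(nchar(extracted)) / sum(nchar(baseline))
}

if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)
  data(content)

  everything <- KeepEverythingExtractor(content)  # all text, boilerplate included
  main_text  <- DefaultExtractor(content)         # main content only

  # Fraction of the page text that DefaultExtractor kept.
  print(retained_fraction(main_text, everything))
}
```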

A full-text extractor which extracts the largest text component of a page.

Description

For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor.

Usage

LargestContentExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- LargestContentExtractor(content)
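
The comparison in the description can be checked informally on the bundled content data set. The following is a small sketch, assuming boilerpipeR and Java are installed. The helper preview is hypothetical and only shortens the output for display. Note that the bundled page is a blog post rather than a news article, so the ranking described above may not show up here.

```r
# Hypothetical helper: first `n` characters of a text, on a single line.
preview <- function(text, n = 80) {
  substr(gsub("\\s+", " ", text), 1, n)
}

if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)
  data(content)

  results <- list(
    DefaultExtractor        = DefaultExtractor(content),
    LargestContentExtractor = LargestContentExtractor(content),
    ArticleExtractor        = ArticleExtractor(content)
  )

  # Show the size and the beginning of each extraction.
  for (nm in names(results)) {
    cat(nm, ":", nchar(results[[nm]]), "characters\n")
    cat("  ", preview(results[[nm]]), "\n")
  }
}
```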

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Description

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Usage

NumWordsRulesExtractor(content, ...)

Arguments

`content`	Text content as character
`...`	additional parameters

Value

extracted text as character

Author(s)

Mario Annau

Examples

data(content)
extract <- NumWordsRulesExtractor(content)
