Extract the main content from HTML files
Description
boilerpipeR interfaces with the boilerpipe Java library, created by Christian
Kohlschütter (https://github.com/kohlschutter/boilerpipe). It implements robust heuristics
to extract the main content from HTML files, removing unnecessary
elements like ads, banners and headers/footers.
Author(s)
Mario Annau mario.annau@gmail
See Also
Extractor
DefaultExtractor
ArticleExtractor
Examples
## Not run:
data(content)
extract <- DefaultExtractor(content)
cat(extract)
## End(Not run)
A full-text extractor which is tuned towards news articles.
Description
On news articles it achieves higher accuracy than DefaultExtractor.
Usage
ArticleExtractor(content, ...)
Arguments
content    Text content as character
...        Additional parameters
Value
extracted text as character
Author(s)
Mario Annau
See Also
Extractor
Examples
data(content)
extract <- ArticleExtractor(content)
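To get a rough feel for how ArticleExtractor and DefaultExtractor differ on the same input, the following sketch counts the characters each one returns. The helper compare_extractors is hypothetical and not part of boilerpipeR; it only assumes that each extractor takes a character string and returns a single character string, as documented above. Character counts are a crude proxy and say nothing about accuracy by themselves.

```r
# Hypothetical helper (not part of boilerpipeR): apply several extractor
# functions to the same HTML string and report how many characters each
# one returns. `extractors` must be a named list of functions.
compare_extractors <- function(html, extractors) {
  texts <- lapply(extractors, function(f) f(html))
  data.frame(
    extractor = names(extractors),
    chars = vapply(texts, nchar, integer(1), USE.NAMES = FALSE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Only run the comparison when boilerpipeR (and its Java backend) loads.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)
  data(content)
  print(compare_extractors(content, list(
    DefaultExtractor = DefaultExtractor,
    ArticleExtractor = ArticleExtractor
  )))
}
```

Inspecting the extracted text itself, for example with cat(), remains the more reliable way to judge which extractor suits a given page.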
A full-text extractor which is tuned towards extracting sentences from news articles.
Description
A full-text extractor which is tuned towards extracting sentences from news articles.
Usage
ArticleSentencesExtractor(content, ...)
Arguments
content    Text content as character
...        Additional parameters
Value
extracted text as character
Author(s)
Mario Annau
See Also
Extractor
Examples
data(content)
extract <- ArticleSentencesExtractor(content)
WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com).
The content is stored as a character string, ready to be extracted.
Description
WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com).
The content is stored as a character string, ready to be extracted.
Author(s)
Mario Annau
References
https://quantivity.wordpress.com
Examples
# The data set was generated as follows:
## Not run:
library(RCurl)
url <- "https://quantivity.wordpress.com/2012/11/09/multi-asset-market-regimes/"
content <- getURL(url)
content <- iconv(content, "UTF-8", "ASCII//TRANSLIT")
save(content, file = "content.rda")
## End(Not run)
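The iconv() step in the recipe above transliterates the downloaded page to plain ASCII. The sketch below shows that step in isolation on a short made-up string, using base R only. The exact replacement characters depend on the platform's iconv implementation, and the fallback with sub = "?" is an assumption of this sketch, not part of the original recipe.

```r
# A short sample string with non-ASCII characters, written with \u escapes
# so that the script itself stays ASCII-only.
x <- "Kohlsch\u00fctter \u2013 caf\u00e9"

# Transliterate to ASCII. The characters chosen as replacements are
# platform-dependent, and iconv() may return NA if no mapping exists.
ascii <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Fallback (assumption of this sketch): replace each unconvertible byte
# with "?" instead of ending up with NA.
if (is.na(ascii)) {
  ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "?")
}

print(ascii)
```

Transliteration is lossy, so keeping the original UTF-8 string is preferable when accented characters matter for later analysis.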