Installation
Install the package the usual way.
> install.packages("rvest", dependencies = TRUE)
Once it is installed, load it with library(). It is convenient to load magrittr as well, since the examples use the pipe operator %>%.
> library(rvest)
> library(magrittr)
Parsing an HTML page
read_html(x, encoding)
Argument | Description |
---|---|
x | URL or local file path |
encoding | Character encoding of the document |
> url <- "https://www.ncbi.nlm.nih.gov/pubmed/?term=p53"
> html <- read_html(url)
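Since x can also be a local path, the same call can be tried without network access. The following is a minimal sketch that writes a small sample page to a temporary file and parses it; the page content and the variable names are made up for illustration.

```r
library(rvest)  # rvest re-exports the %>% pipe

# Write a tiny sample page (made up for this sketch) to a temporary file
path <- tempfile(fileext = ".html")
writeLines(
  "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>",
  path
)

# Parse the local file exactly as one would parse a URL
page <- read_html(path)

# Pull out the <title> text to confirm the parse worked
title <- page %>% html_node("title") %>% html_text()
print(title)

unlink(path)  # remove the temporary file
```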
Selecting nodes
Nodes are specified with a CSS selector or an XPath 1.0 expression. Note that XPath 2.0 is not supported.
html_node(x, css, xpath)
html_nodes(x, css, xpath)
Argument | Description |
---|---|
x | A node set or a single node |
css | CSS selector that identifies the nodes |
xpath | XPath expression that identifies the nodes |
Selecting by tag
> html_nodes(html, "head")
{xml_nodeset (1)}
[1] <head xmlns:xi="http://www.w3.org/2001/XInclude">\n<meta http-equiv="Cont ...
Selecting by tag and class
Join the tag and the class with a dot (.).
> html_nodes(html, "div.rprt")
{xml_nodeset (20)}
[1] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[2] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[3] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[4] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[5] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[6] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[7] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[8] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[9] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[10] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[11] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[12] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[13] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[14] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[15] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[16] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[17] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[18] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[19] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
[20] <div class="rprt">\n<div class="rprtnum nohighlight">\n<label for="UidCh ...
Selecting by tag and id
Join the tag and the id with a hash (#).
> html_node(html, "h1#main")
Selecting nodes by attribute value
> html_nodes(html, 'script[type="text/javascript"]')
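The selector forms above can be compared side by side on a small inline document. This is only a sketch: the HTML snippet is invented for illustration, and the XPath line shows one possible equivalent of the div.rprt CSS selector for elements whose class attribute is exactly "rprt".

```r
library(rvest)

# Small sample document (invented for this sketch)
page <- read_html('<html><head><title>Demo</title>
<script type="text/javascript">var x = 1;</script></head>
<body><h1 id="main">Results</h1>
<div class="rprt">one</div><div class="rprt">two</div>
<div class="other">three</div></body></html>')

by_tag   <- html_nodes(page, "div")                             # tag only
by_class <- html_nodes(page, "div.rprt")                        # tag + class
by_id    <- html_node(page, "h1#main")                          # tag + id
by_attr  <- html_nodes(page, 'script[type="text/javascript"]')  # attribute value
by_xpath <- html_nodes(page, xpath = '//div[@class="rprt"]')    # XPath 1.0

print(length(by_tag))
print(length(by_class))
print(html_text(by_id))
```

html_node() returns the first match only, while html_nodes() returns every match as a node set.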
Extracting attribute values, tags, and other content from HTML
Extracting text
html_text(x, trim = FALSE)
> html_nodes(html, "p.title") %>% html_text
[1] "A tetrameric protein scaffold as a nano-carrier of antitumor peptides for cancer therapy."
[2] "Cytomorphologic and molecular analyses of fallopian tube fimbrial brushings for diagnosis of serous tubal intraepithelial carcinoma."
[3] "Hedgehog pathway proteins SMO and GLI expression as prognostic marker in head and neck squamous cell carcinoma - a retrospective immunohistochemical study."
[4] "Cellular Senescence as a Mechanism and Target in Chronic Lung Diseases."
[5] "In vitro study of the Polo-like kinase 1 inhibitor volasertib in non-small cell lung cancer reveals a role for the tumor suppressor p53."
[6] "Evaluation of clinical utility of P53 gene variations in repeated implantation failure."
[7] "From pathogenesis of acne vulgaris to anti-acne agents."
[8] "Prognostic Impact of Immunohistochemical p53 Expression in Bone Marrow Biopsy in Higher Risk MDS: a Pilot Study."
[9] "Genomic analyses of microdissected Hodgkin and Reed-Sternberg cells: mutations in epigenetic regulators and p53 are frequent in refractory classic Hodgkin lymphoma."
[10] "p53 β-hydroxybutyrylation attenuates p53 activity."
[11] "Loss of SET reveals both the p53-dependent and the p53-independent functions in vivo."
[12] "Hepatotoxicity induced by Isoniazid/Lipopolysaccharide through ERS-, autophagy- and apoptosis pathway in zebrafish."
[13] "Molecular characterization of the cyclin-dependent protein kinase 6 in whitefish (Coregonus lavaretus) and its potential interplay with miR-34a."
[14] "Desmocollin 3 has a tumor suppressive activity through inhibition of AKT pathway in colorectal cancer."
[15] "Circular RNA hsa_circ_0055538 regulates the malignant biological behavior of oral squamous cell carcinoma through the p53/Bcl-2/caspase signaling pathway."
[16] "p53-targeted lincRNA-p21 acts as a tumor suppressor by inhibiting JAK2/STAT3 signaling pathways in head and neck squamous cell carcinoma."
[17] "Proanthocyanidins Alleviates AflatoxinB₁-Induced Oxidative Stress and Apoptosis through Mitochondrial Pathway in the Bursa of Fabricius of Broilers."
[18] "p53 Signaling in Cancers."
[19] "Structural Basis for S100B Interaction with its Target Proteins."
[20] "Anticancer activity, dual prooxidant/antioxidant effect and apoptosis induction profile of new bichalcophene-5-carboxamidines."
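The effect of the trim argument is easiest to see on text with surrounding whitespace. A minimal sketch, using an inline snippet written for illustration rather than the live PubMed page:

```r
library(rvest)

# Inline sample with padded text inside the first <p> (made up for this sketch)
page <- read_html('<p class="title">  p53 Signaling in Cancers.  </p>
<p class="title">Another title</p>')

titles <- html_nodes(page, "p.title")

raw_text     <- html_text(titles)               # whitespace kept (trim = FALSE)
trimmed_text <- html_text(titles, trim = TRUE)  # leading/trailing whitespace removed

print(raw_text)
print(trimmed_text)
```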
Extracting attribute values
html_attr(x, name)
> html_attr(node, "href")
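As a sketch of how html_attr() behaves on a node set, the example below collects href values from a few links. The markup and the link paths are hypothetical; the third link deliberately has no href, which yields NA.

```r
library(rvest)

# Sample markup with hypothetical link targets; the last <a> has no href
page <- read_html('<p class="title"><a href="/pubmed/111">First</a></p>
<p class="title"><a href="/pubmed/222">Second</a></p>
<p class="title"><a>No link</a></p>')

# Select every <a> inside p.title, then read its href attribute
links <- page %>%
  html_nodes("p.title a") %>%
  html_attr("href")

print(links)  # a missing attribute is returned as NA
```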
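Extracting tables
Parse a table into a data frame.
html_table(x, header, trim)
Argument | Description | Default |
---|---|---|
x | Node that identifies the table | |
header | Whether to use the first row as the header | NA |
trim | Whether to strip leading and trailing whitespace from each cell | TRUE |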
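A minimal sketch of html_table() on an inline table follows. The table contents are invented for illustration; depending on the rvest version the result is a plain data frame or a tibble, so the code only relies on behavior common to both.

```r
library(rvest)

# Small sample table (values invented for this sketch)
page <- read_html('<table>
  <tr><th>gene</th><th>count</th></tr>
  <tr><td>TP53</td><td>12</td></tr>
  <tr><td>MDM2</td><td>7</td></tr>
</table>')

# Select the <table> node and convert it, treating the first row as the header
tbl <- page %>%
  html_node("table") %>%
  html_table(header = TRUE)

print(tbl)
```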
Troubleshooting
Error: there is no package called ‘Rcpp’ / Error: package ‘xml2’ could not be loaded
The Rcpp package is missing for some reason, so install it with install.packages() and then load rvest again.
> install.packages("Rcpp")
> library(rvest)
Forms
Extracting forms from a node
html_form(x)
> read_html("https://google.com/") %>% html_form
[[1]]
<form> 'f' (GET /search)
<input hidden> 'ie': ISO-8859-1
<input hidden> 'hl': ja
<input hidden> 'source': hp
<input hidden> 'biw':
<input hidden> 'bih':
<input text> 'q':
<input submit> 'btnG': Google 検索
<input submit> 'btnI': I'm Feeling Lucky
<input hidden> 'gbv': 1
Submitting a form to the server
submit_form(session, form, submit)
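Submitting requires a live session, so the sketch below stops at inspecting a form offline. It parses a hypothetical search form (the action URL and field names are made up) and lists the fields that would be sent on submission.

```r
library(rvest)

# Hypothetical search form; the action URL and field names are made up
page <- read_html('<form id="f" method="get" action="https://example.com/search">
  <input type="hidden" name="hl" value="ja">
  <input type="text" name="q">
  <input type="submit" name="btnG" value="Search">
</form>')

forms <- html_form(page)   # one entry per <form> found in the document
form  <- forms[[1]]

field_names <- names(form$fields)  # the inputs that belong to the form
print(form$method)
print(field_names)
```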
Character encoding
Guessing the character encoding
guess_encoding(x)
> read_html("https://hogehoge/12345/") %>%
+ html_nodes("h1") %>%
+ html_text %>%
+ guess_encoding
encoding language confidence
1 UTF-8 1.00
2 Big5 zh 0.32
3 windows-1252 no 0.24
4 UTF-16BE 0.10
5 UTF-16LE 0.10
6 Shift_JIS ja 0.10
7 GB18030 zh 0.10
8 windows-1253 el 0.08
9 KOI8-R ru 0.03
10 windows-1250 ro 0.02
11 windows-1251 ru 0.02
12 IBM420_ltr ar 0.02
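Once a likely encoding has been identified, it can be passed to read_html() through its encoding argument. The sketch below assumes a page stored as Latin-1 (ISO-8859-1); the sample content is made up, and the file is written locally so the example runs without network access.

```r
library(rvest)

# Build a sample page and store it as Latin-1 bytes (content made up)
html_utf8 <- "<html><body><h1>caf\u00e9</h1></body></html>"
latin1_bytes <- charToRaw(iconv(html_utf8, from = "UTF-8", to = "latin1"))

path <- tempfile(fileext = ".html")
writeBin(latin1_bytes, path)

# Tell read_html() which encoding the file uses
heading <- read_html(path, encoding = "ISO-8859-1") %>%
  html_node("h1") %>%
  html_text()

print(heading)
unlink(path)  # remove the temporary file
```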