zero list #53

Open
ehsannu opened this issue Sep 28, 2018 · 5 comments

Comments

ehsannu commented Sep 28, 2018

I want to download the links and titles of all papers from the web page below using rvest. I used the following script, but the resulting list is empty. Any suggestions?

library(rvest)

# Download the HTML and parse it into an XML document with read_html()
Papers <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Extract specific nodes with html_nodes()
Titles <- html_nodes(Papers, "span.optClickTitle")

gueyenono commented Sep 28, 2018

library(rvest)

webpage <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")


title <- webpage %>%
	html_nodes(".optClickTitle") %>%
	html_text()

links <- webpage %>%
	html_nodes(".optClickTitle") %>%
	html_attr("href")

info <- data.frame(title, links)
info
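
A likely reason the original selector returned nothing: `span.optClickTitle` only matches `<span>` elements, whereas the titles appear to be links (the code above reads an `href` from them). The sketch below illustrates the difference on a small hand-written snippet. The markup and the example.com URLs are assumptions for illustration, not the actual SSRN page source.

```r
library(rvest)

# Hypothetical markup: assumes each title is an <a class="optClickTitle"> link
snippet <- read_html('
  <div>
    <a class="optClickTitle" href="https://example.com/paper1">Paper one</a>
    <a class="optClickTitle" href="https://example.com/paper2">Paper two</a>
  </div>')

# "span.optClickTitle" requires a <span> element, so no node matches here
span_matches <- html_nodes(snippet, "span.optClickTitle")
length(span_matches)

# ".optClickTitle" matches any element carrying that class
class_matches <- html_nodes(snippet, ".optClickTitle")
html_text(class_matches)
html_attr(class_matches, "href")
```

If the real page uses a different element for the titles, the same check (comparing `length()` for each selector) should show which one matches.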

ehsannu commented Sep 28, 2018

Thanks a lot! It works, but it only scrapes the records from the first page. Any suggestions?

@gueyenono

You want everything for all 219 pages?

ehsannu commented Sep 28, 2018

Yes!

gueyenono commented Sep 28, 2018

Be aware that the code will take quite a while to run. I recommend exporting the resulting data frame right away as a CSV file or whatever format you prefer.

library(rvest)
library(purrr)

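# Scrape the titles and links of all papers listed on one results page
scrape_paper_info <- function(link){
	
	node_of_interest <- read_html(link) %>%
		html_nodes(".optClickTitle")
	
	data.frame(
		title = html_text(node_of_interest),
		link = html_attr(node_of_interest, "href"),
		stringsAsFactors = FALSE  # keep text as character so pages bind cleanly
	)
}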

links <- paste0("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=",
					 1:219,
					 "&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

paper_info <- map_dfr(links, scrape_paper_info)
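
Since the run covers 219 requests, one failed page would otherwise stop the whole loop. Below is a minimal sketch of a more defensive variant, assuming you want a pause between requests and want failed pages skipped. The names `scrape_page_politely`, `safe_scrape`, the `pause` argument, and the file name `ssrn_papers.csv` are made up for illustration.

```r
library(rvest)
library(purrr)

# Hypothetical helper: same extraction as above, plus a pause before each request
scrape_page_politely <- function(link, pause = 1) {
  Sys.sleep(pause)  # wait between requests to avoid hammering the server
  nodes <- html_nodes(read_html(link), ".optClickTitle")
  data.frame(
    title = html_text(nodes),
    link = html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}

# possibly() returns an empty data frame instead of stopping when a page fails
safe_scrape <- possibly(scrape_page_politely, otherwise = data.frame())

# Usage (commented out because it performs 219 network requests):
# paper_info <- map_dfr(links, safe_scrape)
# write.csv(paper_info, "ssrn_papers.csv", row.names = FALSE)
```

Pages that fail contribute zero rows, so it may be worth comparing the number of rows against what you expect before relying on the exported file.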
