Thursday, March 23, 2017

SmeegeScrape: Text Scraper and Custom Word List Generator

Click Here to Download Source Code

Customize your security testing with SmeegeScrape! It's a simple Python script that scrapes text from various sources, including local files and web pages, and turns the text into a custom word list. A customized word list has many uses, from web application testing to password cracking; having a specific set of words to use against a target can increase efficiency and effectiveness during a penetration test. I realize there are other text scrapers publicly available; however, I feel this script is simple, efficient, and specific enough to warrant its own release. It can read almost any file containing cleartext that Python can open, and I have also included support for file formats such as pdf, html, docx, and pptx.


Usage: {-f file | -d directory | -u web_url | -l url_list_file} [-o output_filename] [-s] [-i] [-min #] [-max #]

One of the following input types is required: (-f filename), (-d directory), (-u web_url), (-l url_list_file)

-h, --help show this help message and exit
-f LOCALFILE, --localFile LOCALFILE Specify a local file to scrape
-d DIRECTORY, --fileDirectory DIRECTORY Specify a directory to scrape the inside files
-u URL, --webUrl URL Specify a url to scrape page content (correct format: http(s)://...)
-l URL_LIST_FILE, --webList URL_LIST_FILE Specify a text file with a list of URLs to scrape (separated by newline)
-o FILENAME, --outputFile FILENAME Specify output filename (default: smeegescrape_out.txt)
-i, --integers Remove integers [0-9] from all output
-s, --specials Remove special characters from all output
-min # Specify the minimum length for all words in output
-max # Specify the maximum length for all words in output

Scraping a local file: -f Test-File.txt
This is a sample text file with different text.
This file could be different filetypes including html, pdf, powerpoint, docx, etc.  
Anything which can be read in as cleartext can be scraped.
I hope you enjoy SmeegeScrape, feel free to comment if you like it!


Each word is separated by a newline. The options -i and -s can be used to remove any integers or special characters found. Also, the -min and -max arguments can be used to specify desired word length.
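As a rough illustration, the filtering behavior of -i, -s, -min, and -max could be re-implemented like this (a hypothetical sketch, not the script's actual code):

```python
import re

def filter_words(text, no_integers=False, no_specials=False,
                 min_len=None, max_len=None):
    """Turn raw text into a unique word list, mirroring the
    -i, -s, -min, and -max options described above."""
    words = set()
    for word in text.split():
        if no_integers:
            word = re.sub(r'[0-9]', '', word)          # -i: strip digits
        if no_specials:
            word = re.sub(r'[^a-zA-Z0-9]', '', word)   # -s: strip specials
        if not word:
            continue
        if min_len is not None and len(word) < min_len:
            continue
        if max_len is not None and len(word) > max_len:
            continue
        words.add(word)
    return sorted(words)

# One word per line, as in the real output file:
print('\n'.join(filter_words("I hope you enjoy SmeegeScrape!", no_specials=True)))
```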

Scraping a web page: -u -si

To scrape web pages, the script uses the Python urllib2 module. The format of the URL is checked via regex, and it must be in the correct format (e.g. http(s)://...).
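The original script uses Python 2's urllib2; a rough Python 3 sketch of the same check-then-fetch idea (the regex here is my own approximation, not the script's) might be:

```python
import re
import urllib.request

# Approximate format check: the URL must start with http:// or https://
URL_RE = re.compile(r'^https?://\S+$')

def fetch_page(url):
    """Validate the URL format, then return the page body as text."""
    if not URL_RE.match(url):
        raise ValueError('URL must be in http(s):// format: %r' % url)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')
```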

web scrape output

Scraping multiple files from a directory: -d test\ -si -min 5 -max 12

The screen output shows each file which was scraped, the total number of unique words found based on the user's desired options, and the output filename.

directory scrape output

Scraping multiple URLs: -l weblist.txt -si -min 6 -max 10

The -l option reads a list of URLs from a text file and scrapes each one. Each scraped URL is displayed on the screen, along with a total count of words scraped.
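A minimal sketch of the -l input handling: read a newline-separated file and keep only well-formed URLs (the function name and exact filtering are my own illustration):

```python
def load_url_list(path):
    """Read a newline-separated list of URLs, skipping blank lines
    and anything not in http(s):// format."""
    urls = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(('http://', 'https://')):
                urls.append(line)
    return urls
```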

url list scraping url list scraping 2

This weblist option is excellent to use with Burp Suite to scrape an entire site. To do this, proxy your web traffic through Burp and discover as much content on the target site as you can (spidering, manual discovery, dictionary attacks on directories/files, etc.). After the discovery phase, right-click on the target in the site map and select "Copy URLs in this host" from the drop-down list. In this instance, even for a small blog like mine, over 300 URLs were copied. Depending on the size of the site, the scraping could take a little while, so be patient!

burp copy URLs in host

Now just paste the URLs into a text file and run that file as input with the -l option: -l SmeegeScrape-Burp-URLs.txt -si -min 6 -max 12
final output after parsing

So, very easily, we just scraped an entire site for words with the specific attributes (length and character set) we want.

As you can see, there are many different possibilities with this script. I tried to make it as accurate as possible; however, the script sometimes depends on modules such as nltk, docx, etc., which may not always work correctly. In situations where the script is unable to read a certain file format, I suggest converting the file to a more readable type, or copying the text into a plain text file, which can always be scraped.

The custom word list dictionaries you create are up to your imagination, so have fun with it! This script could also be easily modified to extract phrases or sentences, which could be used for cracking passphrases. Here are a couple examples I made:

Holy Bible King James Version of 1611: -f HolyBibleDocx.docx -si -min 6 -max 12 -o HolyBible_scraped.txt

Shakespeare's Romeo and Juliet: -u -si -min 6 -max 12 -o romeo_juliet_scraped.txt
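The phrase-extraction idea mentioned above could be sketched like this (a hypothetical modification, not part of the released script):

```python
import re

def extract_phrases(text, min_words=2, max_words=4):
    """Split text into sentences, then emit every run of consecutive
    words between min_words and max_words long as a candidate passphrase."""
    phrases = set()
    for sentence in re.split(r'[.!?]+', text):
        words = sentence.split()
        for size in range(min_words, max_words + 1):
            for i in range(len(words) - size + 1):
                phrases.add(' '.join(words[i:i + size]))
    return sorted(phrases)
```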

Feel free to share your scraped lists or ideas on useful content to scrape. Comments and suggestions welcome, enjoy!

Thursday, December 15, 2016

Pentesting Rsync

Pentesting rsync... is what I googled when I first saw it reported as an open service by Nessus. I hadn't seen it much, and most of the available documentation about it was just a short usage manual. Rsync (remote sync) is an open-source utility that provides fast incremental file transfer. It copies files either to or from a remote host, or locally on the current host. It is commonly found on *nix systems and functions as both a file synchronization and file transfer program.

According to the rsync documentation, there are two different ways for rsync to contact a remote system: using a remote-shell program as the transport (such as ssh or rsh) or contacting an rsync daemon directly via TCP. The remote-shell transport is used whenever the source or destination path contains a single colon (:) separator after a host specification. Contacting an rsync daemon directly happens when the source or destination path contains a double colon (::) separator after a host specification, or when an rsync:// URL is specified.
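Those colon rules can be captured in a small helper (my own sketch, just mirroring the rules above):

```python
def rsync_transport(path):
    """Classify how rsync will contact the target for a given
    source or destination path, per the rules described above."""
    if path.startswith('rsync://'):
        return 'daemon'        # rsync:// URL -> daemon via TCP
    if '::' in path:
        return 'daemon'        # host::module -> daemon via TCP
    if ':' in path:
        return 'remote-shell'  # host:path -> ssh/rsh transport
    return 'local'             # no host specification at all
```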

So how do we detect rsync and take advantage of it during a pentest? During a recent test, one of the Nessus results was plugin 11389, rsync service detection. Furthermore, each of the hosts in the "Hosts" section had a list of rsync modules with their names, descriptions, and access rights.

The default port an rsync daemon typically runs on is 873, and potentially also 8873. If you aren't using Nessus, a simple nmap scan of those ports will tell you whether either is open. Once you have determined an rsync service is running, you can use the Metasploit module auxiliary/scanner/rsync/modules_list, which lists the module names the same way the Nessus plugin does.

Alternatively you can also use the nmap script rsync-list-modules to get a list of rsync modules.

nmap --script=rsync-list-modules <ip> -p 873

Once you have the list of modules you have a few different options depending on the actions you want to take and whether or not authentication is required. If authentication is not required you can copy all files to your local machine via the following command:

rsync -av rsync://<ip>/module_name_1 /data/tmp

This recursively transfers all files from the directory "module_name_1" on the remote machine into the /data/tmp directory on the local machine. The files are transferred in "archive" mode, which ensures that symbolic links, devices, attributes, permissions, ownerships, etc. are preserved in the transfer. Happy dumpster diving!

But… what if authentication is required? Some modules on the remote daemon may require authentication; if so, you will receive a password prompt when you connect. As a pentester, you still have options! There is an NSE script called rsync-brute which performs brute-force password auditing against the rsync protocol, for example:

nmap -p 873 --script rsync-brute --script-args 'rsync-brute.module=module_name_1' <ip>


Tuesday, February 2, 2016

Burp Suite Extension: Burp Importer

Burp Importer is a Burp Suite extension written in Python which allows users to connect to a list of web servers and populate the sitemap with successful connections. Burp Importer also has the ability to parse Nessus (.nessus), Nmap (.gnmap), or text files for potential web connections. Have you ever wished you could use Burp's Intruder to hit multiple targets at once for discovery purposes? Now you can with the Burp Importer extension. Use cases for this extension include web server discovery, authorization testing, and more!

Click here to download source code


  1. Download the Jython standalone JAR:
  2. In the Extender>Options tab point your Python Environment to the Jython file.
  3. Add Burp Importer in the Extender>Extensions tab.

General Use

Burp Importer is easy to use, as it's fairly similar to Burp's Intruder tool. The first section of the extension is the optional file load, used to parse Nessus (.nessus), Nmap (.gnmap), or newline-separated URL (.txt) files. Nessus files are parsed for the HTTP Information plugin (ID 24260). Nmap files are parsed for open common web ports from a predefined dictionary. Text files should contain a list of URLs in http(s):// format, separated by newlines. After the files are parsed, the generated URLs are added to the URL List box.
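I'll illustrate the .gnmap parsing idea with a standalone sketch; the port-to-scheme mapping and function name here are my assumptions, not the extension's actual code:

```python
import re

# Common web ports and the scheme to try on each (an assumed mapping).
WEB_PORTS = {80: 'http', 443: 'https', 8080: 'http', 8443: 'https'}

def gnmap_to_urls(gnmap_text):
    """Extract candidate URLs for open web ports from nmap
    greppable (.gnmap) output."""
    urls = []
    for line in gnmap_text.splitlines():
        host_match = re.search(r'Host: (\S+)', line)
        if not host_match or 'Ports:' not in line:
            continue
        host = host_match.group(1)
        # Each port entry looks like: 80/open/tcp//http//
        for port, state in re.findall(r'(\d+)/(\w+)/tcp', line):
            port = int(port)
            if state == 'open' and port in WEB_PORTS:
                urls.append('%s://%s:%d/' % (WEB_PORTS[port], host, port))
    return urls
```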

The URL List section is almost identical to Burp Intruder's Payload Options. Users have the ability to paste a list of URLs, copy the current list, remove a URL from the list, clear the entire list, or add an individual URL. A connection to each item in the list is attempted using Burp's makeHttpRequest method.

Gnmap file parsed example:

Nessus file parsed example:

The last section of the extension provides a checkbox to follow redirects, a button to run the list of URLs, and a run log. Redirects are determined by 301 and 302 status codes and the 'Location' header in the response. The run log displays the same output shown in the Extender>Extensions>Output tab: basic data for each run, such as successful connections, the number of redirects (if enabled), and a list of URLs which are malformed or have connectivity issues.
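The redirect check described above might look something like this (a simplified sketch of the logic, not the extension's actual code):

```python
def redirect_target(status_code, headers):
    """Return the Location a response redirects to, or None if the
    response is not a 301/302 redirect."""
    if status_code not in (301, 302):
        return None
    # Header names are case-insensitive in HTTP.
    for name, value in headers.items():
        if name.lower() == 'location':
            return value
    return None
```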

Running a list of hosts:

Items imported into the sitemap:

Use Case – Discovery

One of the main motivations for creating this extension was to help with the discovery phase of an application or network penetration test. Parsing through network or vulnerability scan results can be tedious and inefficient, which is why automation is a vital part of penetration testing. This extension can be used as just a file parser which generates valid URLs for other tools, or to gain quick insight into the web application scope of a network. There are many ways to utilize this tool from a discovery perspective, which include:

  • Determine the web scope of an environment via successful connections added to the sitemap.
  • Search or scrape for certain information from multiple sites. An example of this would be searching multiple sites for e-mail addresses or other specific information.
  • Determine the low-level vulnerability posture of multiple sites or pages via spidering then passive or active scanning.

Use Case – Authorization Testing

Another way to use this extension is to check an application for insecure direct object references, which occur when an application fails to restrict objects or pages to the users authorized to access them. Testing this requires at least one set of valid credentials, or more depending on how many user roles are being tested. Also, session handling rules must be set to use cookies from Burp's cookie jar with Extender.

The following steps can then be performed:

  1. Authenticate with the highest privileged account and spider/discover as many objects and pages as possible. Don't forget to use Intruder to search for hidden directories, and to convert POST requests to GET, which can also be used to access additional resources (if allowed by the server, of course).
  2. In the sitemap right click on the host at the correct path and select ‘Copy URLs in this branch.’ This will give you a list of resources which were accessed by the high privileged account.
  3. Logout and clear any saved session information.
  4. Login with a lower privileged user which could also be a user with no credentials or user role at all. Be sure you have an active session with this user.
  5. Open the Burp Importer tab and paste the list of resources retrieved from the higher privileged account into the URL List box. Select the ‘Enable: Follow Redirects’ box as it helps you know if you are being redirected to a login or error page.
  6. Analyze the results! A list of ‘failed’ connections and the number of redirects will automatically be added to the Log box. These are a good indicator if the lower privileged session was able to access the resources or if they were just redirected to a login/error page. The sitemap should also be searched to manually verify if any unauthorized resources were indeed successfully accessed. Entire branches of responses can be searched using regex and a negative match for the ‘Location’ header to find valid connections.
Here we can see that the requests made to the DVWA application while logged in as 'admin' were unable to connect and were redirected to the login page after the original administrative session was logged out and killed. In this test, the DVWA application was not vulnerable to insecure direct object references.

There are many other uses for this extension just use your imagination! If you come up with any cool ideas or have any comments please reach out to me.

Tuesday, May 19, 2015

Cross-Site Request Forgery Detection with Burp and Regex

Cross-Site Request Forgery (CSRF) is an attack where a malicious person tries to force an authenticated user to execute some action. The attack can use a GET or POST request where the server doesn't validate that the request was created by the correct authenticated user. To prevent CSRF, OWASP suggests using CSRF tokens, stating: "The ideal solution is to only include the CSRF token in POST requests and have actions with state changing affect to only respond to POST requests. If sensitive server-side actions are guaranteed to only ever respond to POST requests, then there is no need to include the token in GET requests." During web application penetration tests it is very common for only POST requests to modify state, so my goal is usually to find POST requests which execute a state-changing action and don't have any CSRF protection, such as a unique token. It's also very possible that even if CSRF protection is implemented, it's done incorrectly or incompletely. I thought of a nice little trick using Burp search and regular expressions (regex) which I think can be very useful in quickly determining whether an application is potentially vulnerable.

Inefficient Detection

Burp Suite Proxy does have CSRF detection as an option in the active scanner engine, however I have found it to be inaccurate at times. Burp also has a ‘Generate CSRF PoC’ function which I do use after my regex search, however in a large application it isn’t realistic to manually look at all POST requests and generate a PoC for each one. A third option is Burp’s pro extension ‘CSRF Scanner’ which almost does what I need, as it lets me passively scan a branch in the site map for requests with a negative match on specific strings (such as unique token names). The biggest downside to this is it does not allow me to specify only a POST request, so I end up with a list of hundreds of requests in the Scanner Results tab.

Efficient Detection with Burp Search and Regex

Burp’s built-in Search is decent but lacking in a lot of areas. It doesn’t let users specify things like type of requests (GET, POST, etc.), ignore duplicate resources, multiple strings in one request, etc. However, Burp’s search does allow us to use regex which means a lot of this functionality we can do and it will help us detect potential CSRF.
Starting with a simple regex search to get an idea of how it works, we see an expression which matches an entire request/response pair:

So let’s create a regex which will help us find potential CSRF. We want a list of all POST requests which don’t have a unique token called ‘CSRF_Token’. To do this we use the regex ^POST((?!CSRF_Token).)*$

Excellent! We now have a comprehensive list of POST requests without a 'CSRF_Token' parameter. This will be sufficient for most applications. But what if multiple parameters or cookies are used to protect against CSRF? Just make a simple modification to the regex and use ^POST((?!CSRF_Token|Cookie: CSRF_Cookie=).)*$. We can add as many keywords to negatively match as we want; just add a pipe '|' character followed by the keyword. To wrap this up, here are some quick steps you can follow to use this method:

  1. Map out an application by spidering as much of it as you can.
  2. Use Burp’s automated form submission spider option or manually submit forms throughout the application to get as many POST requests in your site map as possible.
  3. Figure out which CSRF defenses are being implemented. Usually I just see one POST parameter as a unique token but sometimes cookies, headers, viewstate, etc. are also used.
  4. Search your target from the sitemap:
  5. Select the regex checkbox and input your desired expression based on step 3. Here is an example format:
  6. Look at the list of requests to confirm they are candidates for CSRF. Find a request with a high impact (ex: Add an administrative user) and use Burp’s ‘Generate CSRF PoC’ function in Engagement tools.
  7. Open and submit the CSRF PoC as the authenticated victim and confirm if the action completed successfully.

Done! Hopefully this post helps you identify potential CSRF candidates. If you have any questions or comments please feel free to share.