Sub Crunch

General Sep 22, 2023

During a normal workday, I discovered two URLs that exposed PII. There was a clear pattern in the format of those URLs, so I decided to see whether other links existed and what other nuggets of information I could find. This can be accomplished in several different ways: you can manually brute force the URL string, try Google Dorking, or do a little bash scripting and automate the discovery process. Below I will show the steps I took to generate a list of possible URLs matching the discovered pattern, and then how I checked that massive list of possible sites. These steps were completed using a Kali VM.

Step 1: The List

Kali comes with a wordlist-generation tool called crunch. I recognized that in the previously discovered URLs, only the last six digits differed. With that information, I used crunch to generate a list of every possible combination of six digits and saved it to a file for future reference.

Generate every combination from 000000 to 999999 and save it to a file (6digi.txt).

crunch 6 6 1234567890 >> 6digi.txt

The list comprises 1,000,000 lines, as confirmed with the following command:

cat 6digi.txt | wc -l

The next step is generating a list of possible URLs. This can easily be done with a short bash script, which we'll call gen.sh.
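
Here is a minimal sketch of what gen.sh can look like; the base URL is a placeholder, so substitute the actual pattern you discovered:

#!/bin/bash
# gen.sh - build the list of candidate URLs from the crunch wordlist
# Usage: ./gen.sh 6digi.txt
# NOTE: https://example.com/files/ is a placeholder for the real URL pattern
for name in $(cat $1)
do
    echo "https://example.com/files/$name.pdf" >> sites.txt
done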

The script above takes the crunch list (6digi.txt) as an argument and, for each line in the list, appends the line's contents to the end of the echoed URL ($name.pdf becomes 000000.pdf). Each completed URL is appended to sites.txt.

Don't forget to make the script executable and then run it.

chmod +x gen.sh
./gen.sh 6digi.txt

sites.txt should have the same number of lines as 6digi.txt (1,000,000), since it generates one URL for each variation listed in 6digi.txt.
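
You can confirm the line count the same way as before:

cat sites.txt | wc -l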

Now that we have our list of possible sites, it is time to check which ones are real and return data.

Step 2: The Check

I opted to create a bash one-liner to run a quick curl against each URL. A return of 200 OK indicates the link is real and has data.
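
The one-liner looks something like this (a sketch, assuming the sites.txt generated above):

for word in $(cat sites.txt); do curl -s -o /dev/null -I -w "%{http_code} " $word; echo $word; done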

The script above iterates through the sites.txt file and curls each URL. We are using the following curl options:

-s : Silent, no progress bar

-o : Output the data to a file instead of stdout, in this case, /dev/null

-I : Grab the header info

-w : Write out

%{http_code} : The numerical response code found in the last retrieved HTTP(S) or FTP(S) transfer

And finally we echo the URL next to the code it produced.
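
Each line of output will therefore look something like the following (with a placeholder URL standing in for the real pattern):

200 https://example.com/files/000000.pdf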

Please note, executing curl against 1,000,000 sites and retrieving HTTP response codes will take some time. When I conducted this activity, I decided to leave my computer on overnight to run the script, and I added output redirection to the command below to feed the results into a new text file, giving me current and future flexibility:

for word in $(cat sites.txt); do curl -s -o /dev/null -I -w "%{http_code} " $word; echo $word; done >> response.txt

In a new terminal window, you can run the following command to follow output progress and check for 200s while the above code is actively running:

tail -f Desktop/response.txt | grep '200 http'

We search for '200 http' so we don't get returns where 200 merely appears in the URL itself, e.g. 1249480201111200.pdf.


After observing live results from tail -f, you will have a brand new response.txt file waiting for you in the morning with possible nuggets to investigate.
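
Once the run completes, the same grep will pull the hits out of the saved file:

grep '200 http' response.txt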


