We Go Deep: Data-Mining in Pornography

We Go Deep seems to be the title of a porn movie; I caught it while my crawler worked its way through the whole set at cavr.com—a pornographic movie database that lists ratings, actors and actresses, and often a description. The descriptions on cavr are especially interesting for data-mining purposes, because they do not consist of complete natural-language sentences but instead just feature the core keywords. This makes it easier to analyze the contents of pornography.

The intention of this article is to give a short overview of some statistical facts about a business that is rarely analyzed and discussed, even though more than half of our population (almost all males plus a part of the females, according to a little survey I have done) knows it. I’m talking about the porn industry.

Sometimes there is criticism, but most of the time the pornography industry leads a pretty easy-going life, whereas other industries (e.g. currently the web industry with regard to privacy) have to justify themselves.

Moreover, I want to show how you can create such statistics from data you have.

Most popular first names

Let’s start with some very basic analysis that can be created easily. How about the first names of porn actresses? Which ones are used most often?

Technical approach

Doing this is pretty easy if you have a list of actresses’ names. Just take all the names, split off the first word at the first space and sum up all occurrences. Of course you could argue that some actresses have double names, but as these are stage names, I guess the first name would still be the more important and common one. I even doubt that many people choose double first names as stage names.

We use the Natural Language Toolkit (NLTK), which includes features for a lot of natural language processing tasks. This is the first project I am doing with it. NLTK comes with a utility for frequency distributions: we can throw a list into it and get back an object we can plot directly.

In Python the code looks like this:

import nltk

# Database access uses MongoDB, find() returns all actresses
firstnames = [star['name'].split()[0] for star in stars.find()]
fdist = nltk.FreqDist(firstnames)
fdist.plot(50)

Results

The most frequently used first name is Nikki (40 actresses call themselves that), followed by Vanessa and Victoria (both with more than 35 occurrences). Then come Kelly, Ashley, Jessica, Angel, Samantha, Tiffany, Michelle and so on.

Frequency distribution of actresses’ first names in pornography

Most commonly used words in porn titles

Let’s do something more complicated. Now we want to see which words are most commonly used in the titles of American pornography productions. There are some interesting results, but first let’s again have a look at how we can produce such an analysis.

Technical approach

We already have a database of American porn movies. We read all the titles and again split them by spaces. But this time we have to consider stopwords. These are the little words like me, at, with etc. that appear in every text, so they are not statistically relevant. As a lot of titles include numbers (because they are series), we also exclude those. Special characters like & are not relevant either.

To be able to group similar words together we also use stemming. This reduces a word to its stem (which is a bit similar to the grammatical stem of a word, but not necessarily the same).
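For example, with NLTK’s Porter stemmer (this is just a quick illustration of what stemming does; the two example words are ones that will show up again below):

import nltk

porter = nltk.PorterStemmer()
print porter.stem('screwing')  # prints 'screw'
print porter.stem('dirty')     # prints 'dirti'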

In Python the code for keyword generation can look like this:

import nltk
from nltk.corpus import stopwords
porter = nltk.PorterStemmer()
stopwords_english = stopwords.words('english')
all_keywords = []

for movie in movies.find():
	keywords = [porter.stem(keyword.lower()) # use stemming and lowercase
	            for keyword in movie['name'].split()
	            if not keyword.lower() in stopwords_english # exclude stopwords
	            and keyword.isalpha()] # check for real words
	all_keywords = all_keywords + keywords

Then we throw all these keywords into a frequency distribution. We also reduce the number of displayed items in our plot to the 100 most frequent, because there are just too many different words to display them all in one plot.

fdist = nltk.FreqDist(all_keywords)
fdist.plot(100)

Results

And already we can see the result: as the pornography industry does not use many clothes, skin color plays a huge role. Black (about 800 occurrences) and white (about 300) are among the most frequently used words in porn titles. The next most frequent ethnicity keyword is asian (about 180).

The documentary “9 to 5 – Days in Porn” is also confirmed: there is a hell of a lot of anal sex in the American pornography industry. Ass, anal and butt all belong to the most important words. If you summed them up, they would exceed the term girl, which is mentioned in almost 1000 titles. Later we will see what percentage of pornography really includes anal sex in the scenes (as the title only shows a specialization on that topic).

The pornography industry also likes to use derogatory words like slut, whore or bitch, paired with dirti (the stem of dirty). On the other hand, purity and youth play a large role: teen, young, first, angel, virgin.

It is interesting that there are so many episodes of Barely Legal that the series even made it into the top keywords (check the stems bare and legal in the graph).

Most commonly used words in porn titles

Finding collocations in description texts

Let’s get on to the description text of our porn movies. You might want to know if there are any terms that are usually used together with regards to pornography. In non-pornographic text one example could be the term “United States”, which is two distinct words, but one term.

Technical approach

So how do we do this? With the NLTK package it’s actually quite easy; all the details come from an answer on Stack Overflow. We already used the frequency distribution utility; now we will use it to count both single words and bigrams. A bigram is just a tuple of two words that follow each other. In the text “I am a man” we have three bigrams: (I, am), (am, a), (a, man).

What we will do now is check which words follow each other most frequently. We have to count the occurrences of single words to normalize the whole calculation. If there are 20 occurrences of “I”, 20 occurrences of “me” and only 5 occurrences of “hello”, then of course I and me will follow each other more frequently, but that does not mean they follow each other unusually frequently. Thus we need to know how often each word occurs.

These two distributions are then thrown into a class BigramCollocationFinder which does all the calculating for us.

To avoid finding collocations across dots, we have to ensure that the words before and after a dot cannot be seen as a bigram. In most normal circumstances this is not important, because sentence endings and beginnings differ too much to be seen as collocations, but as we only use mini sentences of mostly 2-3 words, and many of them begin with she, we have to watch out. So we split off dots as distinct tokens with the WordPunctTokenizer and then filter special characters out. We also want to filter stopwords, as collocations with them are not really useful.

In source code the whole action looks like this:

import nltk
from nltk.corpus import stopwords

stopwords_english = stopwords.words('english')

tokens = nltk.WordPunctTokenizer().tokenize(fulltext)
bigram_measures = nltk.collocations.BigramAssocMeasures()

word_fd = nltk.FreqDist(tokens)
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))

finder = nltk.BigramCollocationFinder(word_fd, bigram_fd)
finder.apply_word_filter(lambda w: w in stopwords_english or not w.isalpha())

# Print the 50 most frequent bigrams, these might be collocations
print sorted(finder.nbest(bigram_measures.raw_freq, 50), reverse=True)

You might have noticed that there is still an unknown variable, fulltext. It holds all values from our scene analysis. How you build this depends on how the data is stored, but it could look like this:

all_scenes = [scene['description'] for scene in scenes.find()]
fulltext = " ".join(all_scenes)

Results

Let’s see what collocations are found in all scene descriptions:

[(u'various', u'positions'), (u'toe', u'sucking'), (u'titty', u'sucking'), (u'titty', u'screwing'), 
(u'titty', u'play'), (u'solo', u'fingering'), (u'side', u'ways'), (u'side', u'saddle'), (u'sexy', u'outfits'),
(u'sexy', u'outfit'), (u'sexy', u'black'), (u'self', u'titty'), (u'screwing', u'side'), (u'screwing', u'reverse'),
(u'screwing', u'rev'), (u'screwing', u'doggy'), (u'screwing', u'doggie'), (u'screwing', u'cowgirl'),
(u'safe', u'screwing'), (u'reverse', u'cowgirl'), (u'rev', u'cowgirl'), (u'open', u'mouth'),
(u'mouth', u'facials'), (u'mouth', u'facial'), (u'many', u'positions'), (u'les', u'eating'), (u'hands', u'bj'),
(u'guys', u'stroke'), (u'face', u'sitting'), (u'dual', u'bj'), (u'dp', u'rev'), (u'dp', u'doggie'),
(u'doggy', u'position'), (u'doggie', u'style'), (u'doggie', u'position'), (u'dildo', u'solo'), (u'dildo', u'play'),
(u'dildo', u'bj'), (u'deep', u'throat'), (u'cream', u'pie'), (u'cowgirl', u'riding'), (u'clit', u'solo'),
(u'black', u'guy'), (u'bj', u'clean'), (u'ball', u'sucking'), (u'anal', u'solo'), (u'anal', u'side'), (u'anal', u'rev'),
(u'anal', u'doggie'), (u'anal', u'dildo')]

Of course collocation finding is always a bit tricky, and if you can, you should check the results manually. Note that the list above is not ordered by number of occurrences: the code picks the 50 most frequent pairs and then simply prints them in reverse alphabetical order.

Skimming through the words, you might notice a lot of real collocations, but sometimes something seems to be missing. These are probably trigrams (three words following each other). Look for example at “self titty”: what is that supposed to mean? Very probably it belongs together with “titty play” to the trigram “self titty play”. With a method similar to the code above (only changing bigram to trigram), we can find the strongest trigrams.
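A minimal sketch of that trigram variant could look like this. I use NLTK’s TrigramCollocationFinder with its from_words shortcut here instead of building the frequency distributions by hand; tokens and stopwords_english are the same variables as in the bigram code above.

trigram_measures = nltk.collocations.TrigramAssocMeasures()

finder = nltk.TrigramCollocationFinder.from_words(tokens)
finder.apply_word_filter(lambda w: w in stopwords_english or not w.isalpha())

# Print the 50 most frequent trigrams, these might be collocations
print sorted(finder.nbest(trigram_measures.raw_freq, 50), reverse=True)

The strongest trigrams then look like this: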

[(u'sexy', u'white', u'outfit'), (u'sexy', u'teaser', u'opening'), (u'sexy', u'red', u'outfit'),
(u'sexy', u'pink', u'outfit'), (u'sexy', u'lowcut', u'dress'), (u'sexy', u'blue', u'outfit'), (u'sexy', u'black', u'outfit'), 
(u'sexy', u'black', u'dress'), (u'self', u'titty', u'play'), (u'screwing', u'side', u'ways'), (u'screwing', u'side', u'saddle'), 
(u'screwing', u'reverse', u'cowgirl'), (u'screwing', u'rev', u'cowgirl'), (u'screwing', u'doggy', u'position'), (u'screwing', u'doggie', u'style'), (u'screwing', u'doggie', u'position'), 
(u'screwing', u'cowgirl', u'riding'), (u'safe', u'screwing', u'reverse'), (u'safe', u'screwing', u'rev'), (u'safe', u'screwing', u'doggy'), (u'safe', u'screwing', u'doggie'),
(u'safe', u'screwing', u'cowgirl'), (u'rubbing', u'boxes', u'together'), (u'reverse', u'cowgirl', u'anal'),
(u'open', u'mouth', u'facials'), (u'open', u'mouth', u'facial'), (u'les', u'titty', u'sucking'), (u'lee', u'stone', u'ii'), 
(u'large', u'back', u'tattoo'), (u'guy', u'gets', u'bj'), (u'glass', u'dildo', u'solo'), (u'fingers', u'anal', u'solo'),
(u'dual', u'titty', u'sucking'), (u'dual', u'open', u'mouth'), (u'dp', u'reverse', u'cowgirl'), (u'dp', u'rev', u'cowgirl'), 
(u'dp', u'doggy', u'position'), (u'dp', u'doggie', u'position'), (u'circle', u'jerk', u'bjs'), (u'bj', u'rev', u'cowgirl'), (u'anal', u'solo', u'fingering'), (u'anal', u'side', u'ways'),
(u'anal', u'side', u'saddle'), (u'anal', u'reverse', u'cowgirl'), 
(u'anal', u'rev', u'cowgirl'), (u'anal', u'doggy', u'position'), (u'anal', u'doggie', u'position'), (u'anal', u'dildo', u'solo'), 
(u'anal', u'dildo', u'play'), (u'anal', u'cream', u'pie')]

There are some interesting findings in these trigram collocations compared with the bigram collocations. First of all, the trigram lee stone ii sucks pretty much, but we can exclude it manually. Another method would be to check for possible names automatically. More interestingly, we can see that most positions that appear as a bigram can also be extended to an anal and to a safe variant. We could also extract sexual positions from these trigrams, because in the trigram collocations they begin with screwing (if the position itself consists of two words).

What are the most common actions in pornography?

Let’s go on and find out which actions are most frequently part of porn movies. This is not hard either, but we can improve it with the collocations.

At first we just want to see which words are mostly used. This will give us a quick overview over the data we have.

Technical approach

For this we just use a frequency distribution again and fill it with all the words we have:

import nltk
from nltk.corpus import stopwords

porter = nltk.PorterStemmer()
stopwords_english = stopwords.words('english')
fdist = nltk.FreqDist()
 
for scene in scenes.find():
	keywords = [porter.stem(keyword.lower())
			for keyword in scene['description'].split() # the scene text is in its 'description' field
			if not keyword.lower() in stopwords_english
			and keyword.isalpha()]
	fdist.update(keywords)
 
fdist.plot(100)

Results

Pornography action frequency distribution

And if we only count each occurrence in a movie once:
Pornography keyword description frequency distribution

Next thing we want to do is find out more specifically which positions are often shown.

Technical approach

This requires some manual work, but in natural language processing you will often have to do things manually to improve them. As we already saw before, we can use the collocations to get an overview of the positions, but we have to remove clothing terms like sexy red dress first. We also remove some collocations that do not provide more information than the single word (e.g. doggie position means the same as doggie, but dp doggie is something different). Then we combine the collocation positions with some keywords from the most frequent words and check how common each of them is (if we find a collocation, we remove it, so that it will not be counted as a single word again).
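A simple substring-based sketch of this counting step could look like the following. The list of position terms is hypothetical (picked from the collocations above), and counting substrings is cruder than the token-based approach used so far, so treat this as an illustration only.

# hypothetical list of position terms: two-word collocations and single keywords
position_terms = ['reverse cowgirl', 'rev cowgirl', 'side saddle', 'side ways',
                  'face sitting', 'doggie', 'doggy', 'cowgirl']

position_counts = {}
for scene in scenes.find():
	text = scene['description'].lower()
	# match longer terms first and remove them, so their parts are not counted again
	for term in sorted(position_terms, key=len, reverse=True):
		count = text.count(term)
		if count:
			position_counts[term] = position_counts.get(term, 0) + count
			text = text.replace(term, '')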

Results

Sexual position frequency distribution in pornography

Maybe you also think, like me, that this is not good enough yet. Many terms appear twice, because we did not use stemming on the composite terms or because the website uses abbreviations (rev and reverse). So let’s create display groups. All we have to do is create a Python dictionary with the old positions as keys and the names they shall be mapped to as values.
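Sticking with the hypothetical position_counts dictionary from the sketch above, the grouping could look like this:

# hypothetical mapping from raw terms to display groups
position_groups = {
	'rev cowgirl': 'reverse cowgirl',
	'doggie': 'doggy',
	'side ways': 'sideways',
	'side saddle': 'sideways',
}

grouped = {}
for term, count in position_counts.items():
	name = position_groups.get(term, term)  # unknown terms keep their own name
	grouped[name] = grouped.get(name, 0) + count

for name, count in sorted(grouped.items(), key=lambda x: x[1], reverse=True):
	print name, count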

The improved result might look like this:
Position frequency distribution (improved)

Further Questions

Of course you can always extend such questions to get even more interesting results.

When I began this article, I intended to later analyze questions like “Comparing popular and not-so-popular porn actresses, is there any difference in what actions they perform?”

However, it has been a long time since I worked on this article and I do not have my raw data anymore. Maybe you want to continue my work?

How to crack a small C program with Assembler

What do we want to do today? We want to learn how to crack a (very easy) program with Assembler. What we will use for this:

  • C programming language for our program
  • gcc for compiling the program
  • objdump for disassembling
  • a hexeditor for editing the binary file

The C program

What we will program is essentially a program that asks for a password and then either displays secret information or denies access. Admittedly it sucks a bit, because you could just read the secret string from the compiled binary, but imagine the program does a rights check and then performs complex multiplications (which you could not do without having the rights and without this program).

So this is our C code:

#include <stdio.h>
 
int main() {
        int code;
        scanf("%d", &code);
 
        if (code == 5) {
                printf("Very secret information\n");
        } else {
                printf("You have no access\n");
        }
        return 0;
}

Compile it with gcc and test it (e.g. gcc access.c, if you saved the source as access.c; gcc produces a.out by default):

You will get something similar to these two test runs:

brati ~ $ ./a.out 
6
You have no access
brati ~ $ ./a.out 
5
Very secret information

But what if we had not seen before that the password is 5?

Disassemble with objdump

This is where objdump kicks in, a program that can dump the assembler instructions for you. Usually it’s used for debugging, and manually reading its output only works for small programs, but that’s enough for our purpose here.

With the option -d you can tell objdump to disassemble the code for you. So let’s do this and drop it to a file:

objdump -d a.out > access_disass

It will give you the different sections that contain code (or are expected to contain code). Among them should be the .text section. With -j .text you could have told objdump to export only the .text section, but it doesn’t really matter in this case. Somewhere in the disassembly should be the main label, which denotes the main function of our program.

You should now study the assembler code (the mnemonics in the right-hand column) carefully and search for a place where a password check could happen. Have you found it? If not, it should be the following instructions, in this order:

  • cmp (compare)
  • jne (jump not equal)

These compare our input to the correct password (which you could read off here, but that would be boring) and, if it is not right, jump to the code that displays the error message. If it is right, the program goes on with the secret part.

We now want to make the program accept a wrong password. For this, we will just replace the jne instruction with a je (jump equal) instruction, which jumps to the error section if the password is right (meaning we will get to the restricted section if the password is wrong). We could also replace it with jmp (unconditional jump), but then we would have to adjust the jump length, too. Otherwise the program would jump to the error message even if the password was right ;) So je is equally effective, but requires less editing. You can still try out jmp later for training.

On the left-hand side you can find the opcode belonging to each instruction. In my case it’s 75 for jne and 74 for je (both followed by one byte giving the relative offset to jump).

Replacing the opcodes in the binary file

If we replace jne with je, the program will jump to the secret area if the password is wrong. To do this, you can open the binary file (a.out is the gcc default) with a hex editor (e.g. hexedit on Linux). Then search for the desired bytes, in my case “75 0e”.

Replace the 75 (jne) with 74 (je). Of course, if your architecture uses other opcodes, you have to adjust them accordingly. After saving the file, you can just rerun it. Isn’t it a strange feeling to run a binary file you have just edited with your bare hands? :)
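If you prefer scripting to a hex editor, a small Python sketch can apply the same patch. It replaces the first occurrence of the byte pair 75 0e (jne plus its jump offset, as found in my binary) with 74 0e; the byte values and the file name a.out are assumptions here, so check your own disassembly first.

# patch the first "jne <offset>" (75 0e) into "je <offset>" (74 0e)
with open('a.out', 'rb') as f:
	data = f.read()

patched = data.replace(b'\x75\x0e', b'\x74\x0e', 1)  # only the first match

with open('a.out', 'wb') as f:
	f.write(patched)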

Now you should get an output similar to this, which shows we can get the secret information with wrong passwords. Only the right password will not be accepted anymore.

brati ~ $ ./a.out 
3
Very secret information
brati ~ $ ./a.out 
5
You have no access

Training with jmp

Now try editing the binary file so that it always accepts the input, no matter if it’s correct or wrong. Don’t forget you have to adjust the jump length.

The test run should then look like this:

brati ~ $ ./a.out 
4
Very secret information
brati ~ $ ./a.out 
5
Very secret information

Using Hiragana, Katakana and Kanji for tokenizing Japanese

People working in international web development (and maybe also in other sectors) often run into the problem of tokenization. This usually happens when you want to create a search engine. As already mentioned, I am currently working on a library to analyze texts with PHP. This library should help everybody who has to implement a search field.

The complex method of tokenizing in Japanese

So let’s look at how we can tokenize Japanese. If you have searched for this topic, you have probably found out that it is not an easy one, nothing you can quickly write down yourself. If you want to implement it really well, you need probabilistic heuristics and advanced techniques like Hidden Markov Models or similar.

A simpler approach: Japanese has several writing systems

So let’s think about an easier approach that somebody without an M.Sc. can implement or create, somebody who cannot rely on C++ code (e.g. KyTea) or similar.

If you think about the structure of Japanese, you will recognize that it has three writing systems, sometimes even four (the fourth occurs on the web pretty often):

  • Hiragana
  • Katakana
  • Kanji
  • Romaji (latin characters)

We can use these different writing systems for tokenizing Japanese texts. If you have read Japanese text, you might know that the switch between hiragana and kanji fairly often happens at word boundaries. Of course verbs mostly consist of both kanji and hiragana, but as said, we want to keep it all very simple; we can still make it more complicated and accurate later. The point is: even if we have a verb consisting of kanji and hiragana, the kanji will carry the meaning. Nouns, on the other hand, are very often delimited by hiragana, because particles are written in hiragana.

Example

Let’s look at this easy sentence (which I just remembered from some song or anime):

約束を忘れないでください。

There we have 約束 (promise, meeting) and then a hiragana particle, because this is what is being talked about. Then we have a form of 忘れる, where the kanji carries the meaning of “to forget”. And then we have the remaining hiragana, which carry the meaning of “please do not”.

Now, why shouldn’t we just tokenize the words whenever the writing system changes? That’s exactly what I thought and what I will also do in my current project. It would give us these tokens:

  • 約束
  • を
  • 忘
  • れないでください。

Of course it is not perfect, but it is better than not splitting the words at all, or only splitting them into whole sentences (marked by 。 in Japanese). As a small improvement, we can still treat 。 as a sort of writing system of its own (a “none of the above” class), so that it will be considered a border too.
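If you want to try the idea quickly before looking at the PHP class below, here is a rough Python sketch of the same approach. It classifies characters by Unicode block ranges (an assumption on my part; the PHP class below uses explicit character lists instead) and splits whenever the writing system changes. Note that the sketch treats 。 like kanji, so it ends up as its own token, similar to the improvement just mentioned.

# -*- coding: utf-8 -*-

def writing_system(char):
	# classify a character by Unicode block; everything unknown counts as kanji
	code = ord(char)
	if 0x3040 <= code <= 0x309F:
		return 'hiragana'
	if 0x30A0 <= code <= 0x30FF:
		return 'katakana'
	return 'kanji'

def tokenize(text):
	tokens = []
	current = ''
	current_system = None
	for char in text:
		system = writing_system(char)
		if current_system in (None, system):
			current += char         # same writing system, the token continues
		else:
			tokens.append(current)  # writing system changed, close the token
			current = char
		current_system = system
	if current:
		tokens.append(current)          # do not lose the last token
	return tokens

print tokenize(u'約束を忘れないでください。')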

My current class

This is the code I have written so far. It still lacks recognition of Romaji (which I did not think of at first, even though it can be seen often on the web), and I do not handle 。, 、 or other special characters yet (but a pure whitespace tokenizer does not do that either).

Feel free to work with it. I will also improve it while working on my tokenizing-stemming-searching-PHP-library.

<?php
 
require_once 'Skoch/Tokenizer.php';
 
/**
 * The hiragana (katakana) tokenizer will tokenize words according to their
 * Japanese writings. It will split when the writing system changes (e.g. from
 * hiragana to katakana or from katakana to kanji).
 */
class Skoch_Tokenizer_Hiragana implements Skoch_Tokenizer {
	private $hiragana = array(
		'あ', 'い', 'う', 'え', 'お',
		'か', 'き', 'く', 'け', 'こ',
		'さ', 'し', 'す', 'せ', 'そ',
		'た', 'ち', 'つ', 'て', 'と',
		'な', 'に', 'ぬ', 'ね', 'の',
		'は', 'ひ', 'ふ', 'へ', 'ほ',
		'ま', 'み', 'む', 'め', 'も',
		'や',      'ゆ',      'よ',
		'ら', 'り', 'る', 'れ', 'ろ',
		'わ', 'ゐ',      'ゑ', 'を',
		                    'ん',
		'が', 'ぎ', 'ぐ', 'げ', 'ご',
		'ざ', 'じ', 'ず', 'ぜ', 'ぞ',
		'だ', 'ぢ', 'づ', 'で', 'ど',
		'ば', 'び', 'ぶ', 'べ', 'ぼ',
		'ぱ', 'ぴ', 'ぷ', 'ぺ', 'ぽ',
 
		'ぁ', 'ぃ', 'ぅ', 'ぇ', 'ぉ',
	);
	private $katakana = array(
		'ア', 'イ', 'ウ', 'エ', 'オ', 
		'カ', 'キ', 'ク', 'ケ', 'コ', 
		'サ', 'シ', 'ス', 'セ', 'ソ', 
		'タ', 'チ', 'ツ', 'テ', 'ト', 
		'ナ', 'ニ', 'ヌ', 'ネ', 'ノ', 
		'ハ', 'ヒ', 'フ', 'ヘ', 'ホ', 
		'マ', 'ミ', 'ム', 'メ', 'モ', 
		'ヤ',      'ユ',      'ヨ', 
		'ラ', 'リ', 'ル', 'レ', 'ロ', 
		'ワ', 'ヰ',      'ヱ', 'ヲ', 
		                    'ン',
		'ガ', 'ギ', 'グ', 'ゲ', 'ゴ', 
		'ザ', 'ジ', 'ズ', 'ゼ', 'ゾ', 
		'ダ', 'ヂ', 'ヅ', 'デ', 'ド', 
		'バ', 'ビ', 'ブ', 'ベ', 'ボ', 
		'パ', 'ピ', 'プ', 'ペ', 'ポ', 
 
		'ァ', 'ィ', 'ゥ', 'ェ', 'ォ', 
		'ー',
	);
 
	const HIRAGANA = 0x1;
	const KATAKANA = 0x2;
	const KANJI = 0x4;
 
	/**
	 * Initialize a new hiragana katakana tokenizer.
	 */
	public function __construct() {
 
	}
 
	/**
	 * Tokenize the given input string.
	 * 
	 * @param string $string The string to tokenize.
	 * @return array The tokens.
	 */
	public function tokenize($string) {
		// ensure that we have utf-8 used
		$beforeEncoding = mb_internal_encoding();
		mb_internal_encoding("utf-8");
 
		$tokens = array();
 
		$currentSystem = null;
		$currentToken = '';
 
		for ($i = 0; $i < mb_strlen($string); $i++) {
			$character = mb_substr($string, $i, 1);
 
			if (in_array($character, $this->hiragana)) {
				$system = self::HIRAGANA;
			} elseif (in_array($character, $this->katakana)) {
				$system = self::KATAKANA;
			} else {
				$system = self::KANJI;
			}
 
			// First string did not have a starting system
			if ($currentSystem == null) {
				$currentSystem = $system;
			}
 
			// if the system still is the same, no boundary has been reached
			if ($currentSystem == $system) {
				$currentToken .= $character;
			} else {
				// Write ended token to tokens and start a new one
				$tokens[] = $currentToken;
				$currentToken = $character;
				$currentSystem = $system;
			}
		}

		// make sure the last token that was still being built is not lost
		if ($currentToken !== '') {
			$tokens[] = $currentToken;
		}

		// reset encoding
		mb_internal_encoding($beforeEncoding);
 
		return $tokens;
	}
}