dhealy05/semen_and_semantics: Using language embeddings to analyze the evolution of pornography.

I downloaded snapshots of the Pornhub homepage from the Internet Archive, converted the titles to embeddings, and evaluated the results.
Here is the average title for the first full year of Pornhub’s existence, 2008:
- “Hot blonde girl gets fucked”
And here is the average title for 2023:
- Familyxxx – “I can’t stop my steps big juicy ass” (Mila Monet)
Change is a constant in Pornhub’s history. Violent content has become more common:
![]() |
We see three separate clusters of titling conventions: 2008-2009, 2010-2016, and 2017-present:
![]() |
If we plot sexually violent terms such as “woman raped”, “incest”, and “pornography”, we see:
![]() |
What happened?
Some of the effect is pure SEO: videos are labeled with sexually violent language even when they are not actually violent. But the overall picture reflects actual video content, which has become more intense to cater to the tastes of the highest-spending, most engaged consumers.
This is largely due to professionalization: a transition from amateur, YouTube-style uploads to professional studios with an interest in the bottom line. Interestingly, it mirrors the evolution of YouTube itself. The broad, internet-wide shift to monetization may play out differently elsewhere, but in the porn domain it has been a race to the bottom toward sexual violence.
Political efforts may have contributed to locking in some aspects of this monetization trend. FOSTA-SESTA and similar efforts by payment processors to limit exposure to pornography may have helped keep videos of minors and of rape from being uploaded. That’s good! But it had the unintended consequence of shifting that demand to the mainstream: professional studios began to emphasize youth and violence.
See more detailed data and methods below.
Download the repo and run “pip install” to install dependencies.
The downloaded data is located in the “snapshot” folder. Pornhub data goes back to 2007, but the analysis begins in 2008, when the format becomes more consistent. We have a folder for each month of each year, with snapshots at an almost weekly cadence. For each date there are two files, for example “20080606.html”, the raw HTML file, and “20080606.json”, with the video titles. The JSON files are arrays of dictionaries such as:
{"title": "SUICKIE OF CAR?", "url": "/ View_PHPYSEY =", "duration": 79}
The “embedding” field is the “title” value converted to an OpenAI text embedding. The URL format changes a bit over time.
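A minimal sketch of reading one snapshot, assuming each JSON file is a list of dictionaries with a lowercase “title” key as described above. The helper names “snapshot_date” and “titles_from_json” are hypothetical and are not taken from the repo.

```python
import json
from datetime import datetime
from pathlib import Path


def snapshot_date(filename):
    """Parse the capture date from a name such as "20080606.json"."""
    return datetime.strptime(Path(filename).stem, "%Y%m%d").date()


def titles_from_json(text):
    """Return the non-empty titles from one snapshot's JSON text.

    Assumes a list of dictionaries with a "title" key (an assumption
    based on the description above, not verified against the repo).
    """
    entries = json.loads(text)
    return [entry["title"] for entry in entries if entry.get("title")]


# Placeholder data standing in for the contents of one snapshot file.
sample = '[{"title": "Example title", "duration": 79}, {"title": ""}]'
print(snapshot_date("20080606.json"), titles_from_json(sample))
```

For a real file, the text could be read with “Path(path).read_text(encoding="utf-8")” before being passed to “titles_from_json”.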
From the 4416 available snapshots, we ended up with 772 weekly snapshots. Generally, we group them by year to form readable segments.
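How the weekly snapshots were chosen is not spelled out here. One plausible approach, shown purely as an assumption, is to keep the earliest capture in each ISO week; the function name “one_per_week” is hypothetical.

```python
from datetime import date


def one_per_week(days):
    """Keep the earliest date in each ISO (year, week) bucket."""
    chosen = {}
    for day in sorted(days):
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        chosen.setdefault(key, day)  # first (earliest) date wins
    return sorted(chosen.values())


captures = [date(2008, 6, 4), date(2008, 6, 2), date(2008, 6, 9)]
print(one_per_week(captures))
```

Here 2 June and 4 June 2008 fall in the same ISO week, so only the earlier of the two is kept alongside 9 June.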
To download additional data, run “fetch_snapshots.py” in the “Data_Retrieval” directory. You can change the website by editing the Python file.
To work with embeddings, you need an OpenAI API key. Set it in your environment with export OPENAI_API_KEY={…}.
We calculate the representative title embedding for a year like this:
- Get the average embedding for each day
- Given each day’s average embedding, take the average of the averages
This gives us a “centroid” that is our representative embedding for the year. We compute daily averages first to moderate the impact of changes within the year.
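The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's implementation: embeddings are plain lists of floats grouped by day, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def yearly_centroid(embeddings_by_day):
    """Average each day's embeddings, then average the daily averages."""
    daily_means = [mean_vector(day) for day in embeddings_by_day.values() if day]
    return mean_vector(daily_means)


# Toy 2-D example: one day has three titles, the other has one.
toy_year = {
    "20080606": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "20080613": [[0.0, 1.0]],
}
print(yearly_centroid(toy_year))
```

Averaging the daily averages gives each snapshot equal weight. In the toy example a flat mean over all four vectors would be pulled toward the busier day ([0.75, 0.25]), while the mean of daily means is [0.5, 0.5].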
We start by looking at how different each year’s centroid is from every other centroid, as seen below:
![]() |
We can see 3 stages emerging: 2008-2009, 2010-2016, and 2017-2023.
Run “run_centroids” to replicate.
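A heatmap of this kind can be built from a pairwise distance matrix between yearly centroids. The sketch below assumes cosine distance (1 minus cosine similarity); the metric actually used by “run_centroids” is not stated here, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def centroid_distance_matrix(centroids):
    """Return (years, matrix) where matrix[i][j] is the cosine distance
    between the centroids of years[i] and years[j]."""
    years = sorted(centroids)
    matrix = [
        [1.0 - cosine_similarity(centroids[row], centroids[col]) for col in years]
        for row in years
    ]
    return years, matrix


# Toy 2-D centroids; real embeddings have many more dimensions.
toy = {2008: [1.0, 0.0], 2009: [0.0, 1.0], 2010: [1.0, 0.0]}
years, matrix = centroid_distance_matrix(toy)
for year, row in zip(years, matrix):
    print(year, [round(value, 3) for value in row])
```

The resulting matrix can be handed to any plotting library to draw the heatmap.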
We can do the same thing with t-SNE:
![]() |
The trends are similar to the heatmap: 2008 and 2009 are near, but not part of, the 2010-2016 cluster, with the late cluster starting afterwards. There are at least two distinct phases of video titling in Pornhub’s history.
Next, to find the representative video title for each year, we take the centroid and find its nearest neighbor within that year: for 2010, say, which title is closest to the “average”? They are as follows:
- 2008: The hot blonde girl gets fucke …
- 2009: big tit blonde fucklut …
- 2010: Latina starlet weight
- 2011: Hot Brunette experienced anal
- 2012: Many breasted anal fuck in a garage
- 2013: More boobed brunette fucked
- 2014: Jessica Jaymes POV
- 2015: Hot anal Madison
- 2016: MyBabysIttersClub – blonde blonde babysitter helped me with cum
- 2017: Great Tits Blasian Teen Anal Creampie Casting
- 2018: Leaded MILF creams across 4K 4K Congregation (Full of)
- 2019: Beautiful Busty Teen Loves a Hard Dick – Hard Fucking Vol 2
- 2020: Slutty daughter sends you a video from his dorm
- 2021: Hot College Babe Pangoga and Fucked Good In Multiple Orgasms – Bleast Raw – EP IX
- 2022: Trouble Fuck & Creampie
- 2023: Familyxxx – “I can’t stop my steps big juicy ass” (Mila Monet)
This sheds light on our earlier findings:
- 2008 and 2009 may differ only because of truncation: the ellipsis indicates that Pornhub snapshots at that time stored only a limited number of characters.
- The earlier titles seem shorter and less descriptive, focusing on a few attributes: we see hair color (“blonde”, “brunette”) and acts (“anal”).
- The later titles are longer, and we start to see incest (“daughter”, “steps”) and violence (“hard fucking”, “light fuck”).
- Note that capitalization practices also change, seemingly starting a little earlier, in 2013.
Run “run_nearest_neighbors” to replicate; increase the value of k (the number of neighbors) to see more titles.
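The nearest-neighbor lookup can be sketched as ranking a year's titles by similarity to that year's centroid. This assumes cosine similarity as the ranking metric, which is not confirmed here; “nearest_titles” and the placeholder data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def nearest_titles(centroid, titled_embeddings, k=1):
    """Return the k titles whose embeddings are most similar to the centroid.

    titled_embeddings is a list of (title, embedding) pairs for one year.
    """
    ranked = sorted(
        titled_embeddings,
        key=lambda pair: cosine_similarity(centroid, pair[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Placeholder titles and 2-D vectors, for illustration only.
year_items = [("A", [1.0, 0.9]), ("B", [0.0, 1.0]), ("C", [-1.0, 0.0])]
print(nearest_titles([1.0, 1.0], year_items, k=2))
```

Raising k returns more of the titles that sit close to the yearly “average”.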
These results are suggestive but not conclusive. Let’s look at the trends.
We track keyword trends as follows:
- We create a reference embedding, such as “Latina”
- We compute the cosine similarity against every title in the dataset
- We convert the cosine similarities to normalized z-scores
- We take the top 10% of similarity scores from the entire set
- We count how many of the top 10% fall in each year
- We adjust for the number of titles in each year: if 2010 has 100 titles and 2020 has 200, then 10 matches in 2010 and 20 in 2020 are equivalent
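The steps above can be sketched as follows. This is a hedged illustration, not the code behind “run_trend”: the function name “keyword_trend”, the input layout (a list of (year, embedding) pairs), and the choice to normalize by the overall match rate are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def keyword_trend(reference, rows, top_fraction=0.10):
    """Return {year: (matches, total, rate, normalized)}.

    rows is a list of (year, embedding) pairs covering the whole dataset.
    """
    sims = [cosine_similarity(reference, emb) for _, emb in rows]
    mu = mean(sims)
    sigma = pstdev(sims) or 1.0  # avoid dividing by zero if all scores match
    z_scores = [(s - mu) / sigma for s in sims]

    # Keep the top fraction of z-scores across the entire set.
    keep = max(1, round(len(rows) * top_fraction))
    order = sorted(range(len(rows)), key=lambda i: z_scores[i], reverse=True)
    top = order[:keep]

    matches = Counter(rows[i][0] for i in top)
    totals = Counter(year for year, _ in rows)
    overall_rate = keep / len(rows)
    result = {}
    for year in sorted(totals):
        rate = matches[year] / totals[year]
        result[year] = (matches[year], totals[year], rate, rate / overall_rate)
    return result


# Toy data: 10 titles per year, two 2020 titles match the reference exactly.
toy_rows = [(2010, [0.0, 1.0])] * 10 + [(2020, [1.0, 0.0])] * 2 + [(2020, [0.0, 1.0])] * 8
print(keyword_trend([1.0, 0.0], toy_rows))
```

The z-score is a monotonic rescaling, so it does not change which titles land in the top 10%; it mainly makes scores comparable across different reference terms.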
If we do this for eg “Latina” we get:
Year | Matches | Total | Rate | Normalized |
---|---|---|---|---|
2008 | 18 | 114 | 0.158 | 1.58x |
2009 | 18 | 126 | 0.143 | 1.43x |
2010 | 12 | 126 | 0.095 | 0.95x |
2011 | 33 | 258 | 0.128 | 1.28x |
2012 | 36 | 312 | 0.115 | 1.16x |
2013 | 40 | 306 | 0.131 | 1.31x |
2014 | 29 | 306 | 0.095 | 0.95x |
2015 | 43 | 306 | 0.141 | 1.41x |
2016 | 15 | 282 | 0.053 | 0.53x |
2017 | 41 | 294 | 0.139 | 1.40x |
2018 | 27 | 264 | 0.102 | 1.02x |
2019 | 14 | 264 | 0.053 | 0.53x |
2020 | 18 | 288 | 0.062 | 0.63x |
2021 | 27 | 306 | 0.088 | 0.88x |
2022 | 18 | 312 | 0.058 | 0.58x |
2023 | 26 | 294 | 0.088 | 0.89x |
which looks like this:
![]() |
“Latina” as a descriptor is losing market share over time.
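As a check on the “Latina” table above, the last column appears to be each year's rate divided by the overall rate across all years (415 matches out of 4158 titles, roughly 0.0998) rather than by a flat 0.10. That is inferred from the numbers in the table, not stated in the text. The sketch below recomputes the columns from the match and title counts.

```python
# (year, matches, total titles), copied from the table above.
ROWS = [
    (2008, 18, 114), (2009, 18, 126), (2010, 12, 126), (2011, 33, 258),
    (2012, 36, 312), (2013, 40, 306), (2014, 29, 306), (2015, 43, 306),
    (2016, 15, 282), (2017, 41, 294), (2018, 27, 264), (2019, 14, 264),
    (2020, 18, 288), (2021, 27, 306), (2022, 18, 312), (2023, 26, 294),
]

total_matches = sum(matches for _, matches, _ in ROWS)
total_titles = sum(total for _, _, total in ROWS)
overall_rate = total_matches / total_titles


def normalized(matches, total):
    """A year's match rate relative to the overall match rate (assumed formula)."""
    return (matches / total) / overall_rate


for year, matches, total in ROWS:
    print(f"{year}  rate={matches / total:.3f}  normalized={normalized(matches, total):.2f}x")
```

A value above 1.00x means the term is over-represented in that year relative to the dataset as a whole.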
As a soft control, let’s look at the word “orthogonal”, which should be unrelated.
![]() |
The 2016 jump may indicate an overall increase in the complexity of titles over time. This mirrors what we see with the clusters, in which 2016 is a transition year.
Finally, let’s look at sexual violence, with incest and rape:
![]() |
For both, there is an obvious jump and a sustained increase. Incest leads rape, as we saw from the “step” titles and their variants.
![]() |
Plotted again for better visibility and a more uniform scale.
Run “run_trend” with as many words as you like to run your own analysis.
We return to t-SNE to examine some new clusters. As with our keywords, we create reference embeddings. This time, I created groups of three terms intended to cluster together, to determine how categories relate to the early and late periods. We can then compare how close each cluster sits to the others.
“Brunette”, “blonde”, “redhead”
![]() |
Having observed that hair color often appears in early titles, we include some here, but we see they are not especially close to the centroid clusters.
“Maximus Thrust”, “Ivana is happy”, “Johnny Deep” (fictional names, courtesy of ChatGPT)
![]() |
Porn star names are closer to the early years, but we also see them in the late period.
“Murder”, “Suicide”, “Death”
![]() |
Violence forms its own cluster. Perhaps titles drift toward violence over time.
“Woman dancing”, “Woman cooking”, “woman eating breakfast”
![]() |
“Woman doing activity” is a common format for titles, and we see some closeness.
“Men who dug channels”, “Men who shine later”, “men who hiking hills”
![]() |
Men are farther away; we might infer that the subject performing the action is less relevant than the subject receiving it.
“African American”, “Latino”, “Asian”
![]() |
Race categories are closer than men, because they are often attached to titles.
“aircraft factory”, “blue collar”, “Manufacturing”
![]() |
“Manufacturing” is meant as a pure control, unrelated to sex in general. But it is actually much closer than the men or race groups.
“People love”, “healthy relationship”, “moral habits”
![]() |
The benign terms are meant to provide a contrast to sexual violence. They are actually relatively close, and follow the same late-period trend as violence.
“Woman raped”, “incest”, “porn hurt”
![]() |
We see a direct hit. Our sexually violent terms almost fully overlap our late-period titles: the two have become synonymous.
Here they are all at once:
![]() |
Run “run_tsne” to visualize your own reference groups. By default, the script first generates the maps and then displays:
- The map of years
- The map of years with each concept cluster in turn
- All clusters together with the map of years
For a simpler animated analysis, show or hide different clusters to observe how the “average” moves over time:
![]() |
TODO:
- Analyze trends through “minutes watched” by weighting views by length (views × length); this is likely more a heuristic for content being made than for actual viewing time
2025-02-27 20:42:00