dhealy05 / semen_and_semantics: Using language embeddings to analyze the evolution of pornography.

I downloaded snapshots of the Pornhub homepage from the Internet Archive, converted the video titles to embeddings, and evaluated the results.

Here is the average title for the first full year of Pornhub’s existence, 2008:

  • “Hot blonde girl becomes fucked”

And here is the average title for 2023:

  • Familyxxx – “I can’t stop my steps big juicy ass” (Mila Monet)

The change is broadly representative of Pornhub’s trajectory. Incest-themed and violent content becomes more common:

[chart]

We see three distinct clusters of titling conventions: 2008-2009, 2010-2016, and 2017-present:

[chart]

If we plot sexually violent terms such as “woman raped”, “incest”, and “porn hurt”, they land on the most recent cluster:

[chart]

What happened?

Some of the effect is pure SEO: videos are labeled with sexually violent language even when the content itself is not violent. But the overall trend likely reflects the actual video content, which has become more intense to cater to the tastes of the highest-spending, most engaged consumers.

This is largely due to professionalization: a transition from amateur, YouTube-style uploads to professional studios with an interest in the bottom line. Interestingly, it mirrors the evolution of YouTube itself. A broad, internet-wide shift toward monetization may be harmless in other places, but in the porn domain it turned into a race to the bottom of sexual violence.

Policy efforts may have contributed to locking in some aspects of this monetization trend. FOSTA-SESTA, and parallel efforts by payment processors to limit their exposure to pornography, may have helped keep actual depictions of minors and rape from being uploaded. That’s good! But it had an unintended consequence for how the underlying demand is served: professional studios began to emphasize youth and violence.

See the more detailed data and methods below.

Download the repo and run “pip install” to install the dependencies.

The downloaded data is located in the “snapshot” folder. Pornhub data goes back to 2007, but the analysis begins in 2008, when the format becomes more consistent. There is a folder for each month of each year, with snapshots at roughly weekly intervals. For each date there are two files, for example “20080606.html”, the raw HTML file, and “20080606.json”, with the video titles. The JSON files are a list of dictionaries such as:

{"title": "SUICKIE OF CAR?", "url": "/ View_PHPYSEY =", "duration": 79, "embedding": […]}

where the “embedding” field is the “title” value converted with an OpenAI text embedding model. The URL format changes a bit over time.
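As a rough illustration of reading one of these files, here is a minimal sketch. It assumes lowercase “title” keys and a file name that starts with the date; the helper names `load_titles` and `snapshot_year` are hypothetical and not taken from the repo.

```python
import json
from pathlib import Path


def load_titles(json_path):
    """Return the video titles stored in one snapshot JSON file."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)  # assumed layout: a list of dictionaries
    # Skip entries without a usable title instead of failing on them.
    return [entry["title"] for entry in entries if entry.get("title")]


def snapshot_year(json_path):
    """Read the year from a file name such as 20080606.json."""
    return int(Path(json_path).stem[:4])
```

Calling `snapshot_year("20080606.json")` returns `2008`, which is enough to bucket snapshots by year.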

From 4,416 available snapshots, we ended up with 772 weekly snapshots. Generally, we segment them by year for the analysis.

To download additional data, run “fetch_snapshots.py” in the “Data_Retrieval” directory. You can change the website by editing the Python file.

To work with embeddings, you need an OpenAI API key. Set it with export OPENAI_API_KEY={…}.
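A script can check for the key up front instead of failing partway through a run. This is only a sketch: it assumes the key is read from the `OPENAI_API_KEY` environment variable, and `get_openai_api_key` is a hypothetical helper rather than something the repo is known to define.

```python
import os


def get_openai_api_key():
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before computing embeddings."
        )
    return key
```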

Calculating yearly centroids

We calculate the representative embedding for a year like this:

  • Get the average embedding for each day
  • Given each daily average embedding, take the average of the averages

This gives us a “centroid” that is our representative embedding for the year. We calculate daily averages first to moderate the impact of variation within the year.
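The two steps above can be sketched in plain Python. The input layout (a dictionary from a “YYYYMMDD” date string to that day's title embeddings) and the function names are assumptions made for illustration, not the repo's actual code.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def yearly_centroids(embeddings_by_date):
    """Map {'YYYYMMDD': [embedding, ...]} to {year: centroid}."""
    daily_means = defaultdict(list)
    for date, vectors in embeddings_by_date.items():
        if vectors:  # ignore empty snapshots
            daily_means[date[:4]].append(mean_vector(vectors))
    # Average the daily averages so that every day carries equal weight.
    return {year: mean_vector(means) for year, means in daily_means.items()}
```

With this approach a day with many titles counts no more than a day with few, which is the moderating effect described above.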

We start by looking at how different each year’s centroid is from every other year’s centroid, as seen below:

[heatmap]

We can see 3 stages emerging: 2008-2009, 2010-2016, and 2017-2023.

Run “run_centroids” to reproduce.
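For readers who want to see the comparison spelled out, here is a minimal sketch of a year-by-year distance matrix. It assumes cosine distance as the measure, which the text does not confirm, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid_distance_matrix(centroids):
    """Return (years, matrix) where matrix[i][j] is 1 - cosine similarity."""
    years = sorted(centroids)
    matrix = [
        [1.0 - cosine_similarity(centroids[a], centroids[b]) for b in years]
        for a in years
    ]
    return years, matrix
```

The resulting matrix can be passed to any plotting library to draw a heatmap; blocks of small values along the diagonal correspond to the stages described above.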

We can do the same thing with t-SNE:

[t-SNE chart]

The trends are similar to the heatmap: 2008 and 2009 sit near, but not quite inside, the 2010-2016 cluster, and a separate cluster begins after it. There are at least two distinct eras of video titling in Pornhub’s history.

To find the representative video title for a year, we can take that year’s centroid and find its nearest neighbor among that year’s titles: for 2010, say, which title is closest to the “average”? They are as follows:

  • 2008: The hot blonde girl gets fucke …
  • 2009: big tit blonde fucklut …
  • 2010: Latina starlet weight
  • 2011: Hot Brunette experienced anal
  • 2012: Many breasted anal fuck in a garage
  • 2013: More boobed brunette fucked
  • 2014: Jessica Jaymes POV
  • 2015: Hot anal Madison
  • 2016: MyBabysIttersClub – blonde blonde babysitter helped me with cum
  • 2017: Great Tits Blasian Teen Anal Creampie Casting
  • 2018: Leaded MILF creams across 4K 4K Congregation (Full of)
  • 2019: Beautiful Busty Teen Loves a Hard Dick – Hard Fucking Vol 2
  • 2020: Slutty daughter sends you a video from his dorm
  • 2021: Hot College Babe Pangoga and Fucked Good In Multiple Orgasms – Bleast Raw – EP IX
  • 2022: Trouble Fuck & Creampie
  • 2023: Familyxxx – “I can’t stop my steps big juicy ass” (Mila Monet)

This sheds light on our earlier findings:

  • 2008 and 2009 may differ only because of truncation: the ellipses indicate that Pornhub snapshots at that time stored only a limited number of characters.
  • The earlier titles appear shorter and less descriptive, focusing on a few attributes: we see hair color (“blonde”, “brunette”) and acts (“anal”).
  • The later titles are longer, and we start to see incest (“daughter”, “steps”) and descriptions of intensity (“hard fucking”, “light fuck”).
  • Note that capitalization conventions also change, which seems to start a little earlier, in 2013.

Run “run_nearest_neighbors” to reproduce; increase the value of k (the number of neighbors) to see more titles.
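The nearest-neighbor lookup can be sketched as follows. It assumes cosine similarity as the closeness measure and parallel lists of titles and embeddings for a single year; `nearest_titles` is a hypothetical name, not the repo's function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_titles(centroid, titles, embeddings, k=1):
    """Return the k titles whose embeddings are most similar to the centroid."""
    ranked = sorted(
        zip(titles, embeddings),
        key=lambda pair: cosine_similarity(centroid, pair[1]),
        reverse=True,  # most similar first
    )
    return [title for title, _ in ranked[:k]]
```

With k=1 this returns a single representative title per year; larger values of k show how tightly the year's titles cluster around the centroid.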

These results are suggestive but not conclusive. Let’s look at the trends.

Next, we look at keyword trends as follows:

  1. We create a reference embedding, such as “Latina”
  2. We compute the cosine similarity between the reference and each title in the dataset
  3. We convert the cosine similarities to standard Z-scores
  4. We take the top 10% of similarity scores from the entire set
  5. We count how many of the top 10% fall in each year
  6. We adjust for the number of titles in each year: if 2010 has 100 titles and 2020 has 200, then 10 matches in 2010 and 20 matches in 2020 count as the same share
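The six steps can be condensed into one function. This is a sketch under assumptions: rows are `(year, embedding)` pairs, the reference embedding is already computed, and the final normalization divides each year's share by the overall share. The name `keyword_trend` is hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_trend(reference, rows, top_fraction=0.10):
    """Return {year: (matches, total, share, normalized)} for (year, embedding) rows."""
    sims = [cosine_similarity(reference, emb) for _, emb in rows]
    mu, sigma = mean(sims), pstdev(sims)
    # Z-scores keep the ranking but put different keywords on a common scale.
    z = [(s - mu) / sigma if sigma else 0.0 for s in sims]
    n_top = max(1, round(len(rows) * top_fraction))
    top = sorted(range(len(rows)), key=lambda i: z[i], reverse=True)[:n_top]
    matches = Counter(rows[i][0] for i in top)
    totals = Counter(year for year, _ in rows)
    overall_share = n_top / len(rows)
    result = {}
    for year in sorted(totals):
        share = matches[year] / totals[year]
        result[year] = (matches[year], totals[year], share, share / overall_share)
    return result
```

A normalized value above 1 means the keyword is over-represented in that year relative to the whole dataset; below 1 means under-represented.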

If we do this for, e.g., “Latina”, we get:

Year  Matches  Total  Share  Normalized
2008  18       114    0.158  1.58x
2009  18       126    0.143  1.43x
2010  12       126    0.095  0.95x
2011  33       258    0.128  1.28x
2012  36       312    0.115  1.16x
2013  40       306    0.131  1.31x
2014  29       306    0.095  0.95x
2015  43       306    0.141  1.41x
2016  15       282    0.053  0.53x
2017  41       294    0.139  1.40x
2018  27       264    0.102  1.02x
2019  14       264    0.053  0.53x
2020  18       288    0.062  0.63x
2021  27       306    0.088  0.88x
2022  18       312    0.058  0.58x
2023  26       294    0.088  0.89x

which looks like this:

[chart]

“Latina” as a descriptor is losing market share over time.
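The last column of the table appears consistent with dividing each year's share by the overall share. Summing the table's columns gives 415 matches out of 4,158 titles, just under 10%. That reading is inferred from the numbers rather than stated in the source, so treat this worked example as a sketch; `normalized_share` is a hypothetical helper.

```python
def normalized_share(matches, total, all_matches, all_total):
    """Return (share, share relative to the overall share)."""
    share = matches / total
    overall_share = all_matches / all_total
    return share, share / overall_share


# 2008 row: 18 matches out of 114 titles; 415 of 4158 titles match overall.
share, normalized = normalized_share(18, 114, 415, 4158)
print(f"{share:.3f} {normalized:.2f}x")  # prints: 0.158 1.58x
```

The same calculation for 2012 (36 of 312) gives about 1.16x, matching the table, whereas dividing by a flat 0.10 would give 1.15x.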

As a loose control, let’s look at the word “orthogonal”, which should be unrelated.

[chart]

The jump in 2016 may indicate an overall increase in the complexity of titles over time. This mirrors what we see with the clusters, in which 2016 is a transition year.

Finally, let’s look at sexual violence, with incest and rape:

[chart]

For both, there is an obvious jump followed by a sustained increase. Incest leads rape, as we would expect from the “step” titles and their variants.

[chart]

The chart above is redrawn for better visibility.

Run “run_trend” with whatever words you like to run your own analysis.

We return to t-SNE to examine some new clusters. As with our keywords, we create reference embeddings. This time, I created groups of three terms, intended to cluster together, to determine how categories relate to the early and late periods. We can read the distance between clusters as a rough measure of similarity.

“Brunette”, “blonde”, “redhead”

[chart]

Having observed that hair color often appears in the early titles, we include some hair colors here, but we see that they are not especially close to the centroid clusters.

“Maximus Thrust”, “Ivana is happy”, “Johnny Deep” (fictional porn-star names, courtesy of ChatGPT)

[chart]

Porn-star names are closer to the early years, but we also see them near the late period.

“Murder”, “Suicide”, “Death”

[chart]

Violence forms its own cluster. If anything, titles drift toward violence over time.

“Woman dancing”, “Woman cooking”, “woman eating breakfast”

[chart]

“Woman doing an activity” is a common format for titles, and we see some closeness.

“Men who dug channels”, “Men who shine later”, “men who hiking hills”

[chart]

Men are farther away; we might infer that the subject performing the action is less relevant than the subject receiving it.

“African American”, “Latino”, “Asian”

[chart]

Racial categories are closer than men, because they often appear in titles.

“aircraft factory”, “blue collar”, “Manufacturing”

[chart]

“Manufacturing” is meant as a pure control, unrelated to sex in general. But it is actually much closer than men or racial groups.

“People love”, “healthy relationship”, “moral habits”

[chart]

The benign terms are meant to provide a contrast to sexual violence. They are actually relatively close, and follow the same late-period trend as violence.

“Woman raped”, “incest”, “porn hurt”

[chart]

We see a direct hit. Our sexually violent terms overlap almost completely with the late-period titles: the two have become nearly synonymous.

Here they are all at once:

[chart]

Run “run_tsne” to visualize your own reference groups. By default, the script first generates the maps and then displays:

  1. The years map
  2. The years map with each concept cluster in turn
  3. All clusters together with the years map

For a simple animated analysis, show or hide different clusters to observe how the “average” moves over time:

[chart]

To do:

  • Analyze trends by “minutes watched”, estimated as views × duration; this is more likely a heuristic for what content gets made than a measure of actual viewing time

https://opengraph.githubassets.com/638b12883d76af665a275c796c9d39232c7ea296f5241d3d086102cd4cd3ae72/dhealy05/semen_and_semantics

2025-02-27 20:42:00
