Data Science Tutorial

Sahil Goel, Salma Khairat, and Saima Ahmed

Drake is one of the most influential artists of our generation. Although billions of people stream his music, there is little public data about who his audience is and who finds his music relatable. Despite this scarcity, we attempt to figure out whether his music could be relatable to everyone by analyzing the themes and words in his lyrics. The main motivation for this is curiosity, but it could also be helpful to artists who want to understand how their music is perceived based on their lyrics. The albums we chose are some of Drake’s most popular, and they span 2011 to 2021 so that we include data from different eras of his musical career. Our null hypothesis assumes that Drake’s lyrics cater more towards a female audience: 75% of the 130 songs that we analyze will be classified as “for women”. We test this hypothesis in hopes of supporting or refuting this sentiment.

Note To Readers: As is common in hip-hop and rap culture, there is explicit language in the lyrics we have obtained. This explicit language shows up in our output and may also appear in our analysis. We felt that we should include these words uncensored as they reflect the culture of these genres of music in today's society and removing them would pose a significant barrier to an accurate analysis. We do not condone the use of this explicit language.

Preliminary Step

Obtaining Drake's Song Lyrics

To access the lyrics for each album we want to analyze, we can use Genius, a website that holds official lyrics for popular songs and artists. Genius offers an API that we can access through an individually generated token (https://docs.genius.com/#/getting-started-h1). LyricsGenius is a Python module built on the Genius API that provides pre-made functions for finding artists, songs, and lyrics. We decided to use LyricsGenius and the Genius API because Genius is a verified website with accurate references and information. Since these tools already exist and are well documented, they were easier to implement and troubleshoot than a lesser-known source, where we would have had to scrape the data ourselves page by page. Also, since we are working with well-known lyric data and Genius is a certified website, we did not encounter any issues with missing data.

We first selected the albums we wanted to collect data from and retrieved them using the built-in search_album function. To get the song names of each album, we created a function that takes an album object and loops through its individual songs. Once we had the song names, we found the song objects using the built-in search_song function, and then obtained the lyrics as strings from the lyrics field of each song object.
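The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for this tutorial: the function name album_lyrics is hypothetical, the album name and token are placeholders, and the offline stand-in objects exist only so that the sketch runs without a Genius token. It assumes the LyricsGenius album object exposes its songs through album.tracks, with each track carrying a song attribute.

```python
from types import SimpleNamespace


def album_lyrics(genius, album_name, artist):
    """Return {song title: lyrics string} for one album."""
    album = genius.search_album(album_name, artist)
    if album is None:  # the search found nothing
        return {}
    lyrics_by_title = {}
    for track in album.tracks:  # loop through the album's songs
        title = track.song.title
        song = genius.search_song(title, artist)
        if song is not None:  # skip songs the search could not find
            lyrics_by_title[title] = song.lyrics
    return lyrics_by_title


# With the real client (needs the lyricsgenius package and an API token):
#   import lyricsgenius
#   genius = lyricsgenius.Genius("YOUR_TOKEN")      # placeholder token
#   lyrics = album_lyrics(genius, "ALBUM_NAME", "Drake")

# Offline stand-in that mimics only the attributes used above.
# The titles and lyrics are placeholders, not real data.
_songs = {"Song A": "placeholder lyrics a", "Song B": "placeholder lyrics b"}
fake_album = SimpleNamespace(
    tracks=[SimpleNamespace(song=SimpleNamespace(title=t)) for t in _songs]
)
fake_genius = SimpleNamespace(
    search_album=lambda name, artist: fake_album,
    search_song=lambda title, artist: SimpleNamespace(
        title=title, lyrics=_songs[title]
    ),
)
demo = album_lyrics(fake_genius, "ALBUM_NAME", "Drake")
```

Passing the client in as an argument keeps the looping logic separate from the network calls, which makes it easy to try out without hitting the API.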

Cleaning and Organizing the Data

Now that we have the lyrics for each song on the 6 albums, we want to represent the data in a way that makes our analysis effective. In Python, pandas is a common library for storing tabular data in the form of a data frame. Because many of the functions we will be using take data frames or lists as arguments, we decided to store our song lyrics in multiple data frames. We will be analyzing all of the songs as a whole as well as the songs grouped by album, so we create data frames to store each of these subsets of data.
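One possible layout for these data frames is sketched below. The column names, album names, and lyrics are placeholders chosen for illustration, not the actual data or schema used in this tutorial.

```python
import pandas as pd

# Placeholder records; in practice each row would hold one real song.
records = [
    {"album": "Album 1", "title": "Song A", "lyrics": "placeholder lyrics a"},
    {"album": "Album 1", "title": "Song B", "lyrics": "placeholder lyrics b"},
    {"album": "Album 2", "title": "Song C", "lyrics": "placeholder lyrics c"},
]

# One data frame holding every song.
all_songs = pd.DataFrame(records)

# One data frame per album, keyed by album name.
by_album = {
    name: group.reset_index(drop=True)
    for name, group in all_songs.groupby("album")
}
```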

Tokenizing the Song Lyrics

We used the sent_tokenize function from NLTK to tokenize each string of lyrics. The Natural Language Toolkit (NLTK) is a collection of Python libraries for working with human language data that we discussed in class. This problem involves working with lyrics and analyzing language and context, so we explored NLTK and found that its tokenizing tools would help us separate the lyrics into individual words. In our implementation, each array of tokens/words in our data frame is represented as a sentence. We then preprocessed the text by removing non-alphabetic characters, lowercasing, and removing stop words. The toolkit offers a set of stop words: common words such as “the”, “it”, and “and” that do not contribute to overarching themes or meaning. Because we are dealing with song lyrics, and the Genius website includes words such as “lyrics” and “chorus” in its text, we manually added those words to the stop word set. The stop word set also includes male and female pronouns, which we manually removed from it. We wanted to analyze the frequency of gendered pronouns because that statistic helps us determine who the artist is referring to or singing about (a male or female audience).
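The preprocessing described above can be sketched with the standard library alone. This is an illustrative stand-in, not the NLTK-based code used for this tutorial: the small BASE_STOP_WORDS set is a hypothetical substitute for NLTK's English stop word list (which could be loaded with nltk.corpus.stopwords.words("english") once the corpus is downloaded), and the function name preprocess is an assumption.

```python
import re

# Tiny stand-in for NLTK's English stop word list (assumption for the demo).
BASE_STOP_WORDS = {"the", "it", "and", "a", "i", "to", "she", "her", "he", "him"}

# Words that come from the Genius page layout rather than the song itself.
GENIUS_WORDS = {"lyrics", "chorus"}

# Gendered pronouns we want to keep so their frequencies can be analyzed.
GENDERED_PRONOUNS = {"she", "her", "hers", "he", "him", "his"}

STOP_WORDS = (BASE_STOP_WORDS | GENIUS_WORDS) - GENDERED_PRONOUNS


def preprocess(lyrics):
    """Lowercase, drop non-alphabetic characters, and remove stop words."""
    lowered = lyrics.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [word for word in letters_only.split() if word not in STOP_WORDS]


tokens = preprocess("[Chorus] She said it, and I heard her!")
# tokens == ["she", "said", "heard", "her"]
```

Building the final stop word set with set union and difference keeps the two manual adjustments (adding Genius words, keeping pronouns) visible in one line.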

Examining Word Frequencies By Album (In Chronological Order)

Within rap and poetry, artists decide which themes they want to emphasize in a piece. One way to create this emphasis is to repeat certain words to get the point across. Analyzing the word frequencies of Drake’s lyrics revealed certain themes that he discusses in his albums. Because our hypothesis is focused on determining who can relate to Drake’s music, we decided to put the frequent words into three categories: words that describe men and their interests, words that describe women and their interests, and words that are neutral (words that apply to both or neither).

In rap culture, stereotypes about men’s and women’s interests shape how lyrics are meant to be perceived. Because of this, we use these stereotypes (even though we don’t condone them) to interpret the lyrics the way they are intended. These stereotypes are mere generalizations and do not apply to a lot of people. That being said, because we are analyzing information within rap culture, we have to view this data through its lens to obtain an accurate result. For this analysis, we looked at the words that were mentioned 35 times or more in each album. We chose this threshold arbitrarily because we believed that it was high enough to show us the top words that Drake wanted to emphasize while getting rid of filler words that do not have meaning on their own.
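The frequency cutoff described above can be sketched with collections.Counter. The function name frequent_words and the toy token list are illustrative assumptions; the real input would be the cleaned tokens of one album.

```python
from collections import Counter


def frequent_words(tokens, min_count=35):
    """Return {word: count} for words used at least min_count times,
    ordered from most to least frequent."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.most_common() if n >= min_count}


# Toy tokens standing in for one album's cleaned lyrics.
tokens = ["love"] * 40 + ["girl"] * 36 + ["man"] * 35 + ["city"] * 10
top = frequent_words(tokens)
# top == {"love": 40, "girl": 36, "man": 35}
```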

For this album, we found 20 words with frequencies of 35 or more. Among them are words like “love”, “her”, and “girl”, so we group those and similar words into the women category. There are also words like the n word, which we place in the men category. With that in mind, the women category had 3 words, the men category had 2 words, and the neutral category had 15 words.

For this album, there are 11 words that were used 35 times or more. Of those words, 3 directly talk about or describe men (including the n word and the word “man”), whereas only 1 describes a woman: the word “girl”. The remaining 7 words are neutral, which again makes neutral words the majority.

There are 9 words in this album with frequencies of 35 or more. Of those words, one skews towards the women category: the word “feel”, because it is socially more acceptable for women to express how they feel. So 1 word leans towards women, 0 lean towards men, and the remaining 8 are neutral.

There are only 5 words in this album with frequencies of 35 or more. Of these, the word “man” falls into the men category and the rest are neutral. So 1 word is in the men category and 4 are neutral.

There are 9 words in this album that were mentioned 35 times or more. Of these words, 1 could be categorized towards men and 1 towards women. The remaining 7 are neutral words that don’t apply to either category.

With this album, there are 14 words that have frequencies of 35 times or more. Of these words, 5 could be categorized towards women and 2 could be categorized towards men. The rest would be neutral.

When analyzing the total word frequencies across all the albums, we notice that 38 words occur 100 times or more. Of those words, 3 fall into the men category since they describe men, and 4 fall into the category that could describe women. Neutral words were by far the most common (31 words), which suggests that the themes Drake raises in these albums are general ones that could relate to both men and women. Even within each individual album, the neutral words outnumbered the words that would stereotypically relate to men or women.
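The tallying used throughout this section can be sketched as below. The category sets contain only example words named in this section; the actual grouping was done by judgment, so these sets, the function name categorize, and the sample word list are illustrative assumptions rather than the full lists used.

```python
# Example words named in the text above; not the complete category lists.
WOMEN_WORDS = {"love", "her", "girl", "feel"}
MEN_WORDS = {"man"}


def categorize(words):
    """Count how many frequent words fall into each category."""
    tally = {"women": 0, "men": 0, "neutral": 0}
    for word in words:
        if word in WOMEN_WORDS:
            tally["women"] += 1
        elif word in MEN_WORDS:
            tally["men"] += 1
        else:
            tally["neutral"] += 1
    return tally


# Hypothetical list of one album's frequent words.
tally = categorize(["love", "girl", "man", "know", "time"])
# tally == {"women": 2, "men": 1, "neutral": 2}
```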

Word Embeddings on All 6 Albums (using Word2Vec)

Word2Vec is a word embedding technique, available in Python through libraries such as Gensim, that learns associations between words. In our case, we wanted to find out how the words we found relate to the term “man”. Word2Vec uses a neural network to represent each word as a vector, so that words used in similar contexts end up with similar vectors. The score assigned to each word indicates the strength of its relationship to the term we modeled against: the higher the score, the closer the association. We used 300-dimensional vectors. We chose a dimensionality in the range of 100 to 400 so that the vectors would be large enough to show varied results.

To visualize the results, we had to reduce the dimensionality to something comprehensible. So, we projected the vectors onto a 2-dimensional plane using t-SNE, which stands for t-distributed Stochastic Neighbor Embedding. It is available as the TSNE class in scikit-learn (sklearn) and is used to visualize high-dimensional data. It generates an X and Y value for each word vector, which we plotted in a scatter plot.
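To make the association score more concrete, here is a minimal sketch of cosine similarity, assuming the scores in question are cosine similarities between word vectors (this is what Gensim's most_similar reports). The 3-dimensional vectors are made up for illustration and are not output from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for learned 300-dimensional embeddings.
vectors = {
    "man": [1.0, 0.0, 0.0],
    "guy": [0.9, 0.1, 0.0],
    "city": [0.0, 1.0, 0.0],
}

# Score every other word against the target term "man".
scores = {
    word: cosine_similarity(vectors["man"], vec)
    for word, vec in vectors.items()
    if word != "man"
}
```

In this toy example "guy" points in nearly the same direction as "man" and scores close to 1, while "city" is orthogonal to it and scores 0.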

Linear Support Vector Machine on All 6 Albums

A linear support vector machine is a supervised machine learning model. The scikit-learn module provides an implementation that allows us to train a model on labeled data and then make predictions on new data. With this model, we can attempt to classify Drake’s songs as being geared more towards men or women.

One question that comes up with this model is what data to use as a training set. We could manually label all of Drake’s songs as geared towards men or women, randomly select a subset of songs to train on, and test the model on the remaining held-out songs. However, we felt that this could lead to confirmation bias, since we are the ones trying to figure out who Drake’s music appeals to. As a result, we decided to find websites that listed songs that were either empowering to women or the opposite. In doing so, we found more songs that were empowering to women than songs that were misogynistic. To train the model well, we wanted to give it an equal number of songs from these two categories, which are extreme opposites, so that Drake’s music could be labeled accurately.

We also noticed that most of the misogynistic songs were rap songs. Most of Drake’s music is rap as well, so we didn’t want the model to match Drake's songs with the misogynistic ones just because the language used in the rap genre is similar. So, from the songs that empower women, we randomly removed songs that were not rap until the two categories were the same size. Many of the songs that empower women were also rap. Since both categories contain some rap songs and some songs from other genres, the model should not separate Drake’s songs based solely on genre.
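A minimal sketch of this kind of classifier is shown below. It assumes TF-IDF word features, which this tutorial does not specify, and it uses made-up placeholder phrases rather than real lyrics; the label strings are also illustrative. LinearSVC, TfidfVectorizer, and make_pipeline are scikit-learn's own classes and helpers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up placeholder texts standing in for the labeled training songs.
train_lyrics = [
    "queen strong independent woman shine",
    "respect her strength she rises",
    "money cars chains crew hustle",
    "block grind cash crew rivals",
]
# Balanced labels: two songs per category.
train_labels = ["for women", "for women", "not for women", "not for women"]

# Turn each text into TF-IDF features, then fit a linear SVM on them.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_lyrics, train_labels)

# Classify new, unlabeled texts (these would be Drake's songs).
new_lyrics = ["she is a strong queen", "cash and chains with the crew"]
predictions = list(model.predict(new_lyrics))

# Proportion of the new songs classified as "for women".
share_for_women = predictions.count("for women") / len(predictions)
```

The proportion computed on the last line is the kind of statistic that a hypothesis test on the classifier's output would use.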

Hypothesis Testing

We can construct a hypothesis test using the results from the linear support vector machine above. The null hypothesis is that Drake makes most of his music to appease women in an effort to paint himself in a positive light. Therefore, the proportion of Drake’s songs that are classified as “for women” is 0.75. The alternative hypothesis is that Drake does not make music that primarily serves to appease women, and therefore the proportion of Drake’s songs that are classified as “for women” is less than 0.75.

Null Hypothesis: proportion of Drake’s songs for women = 0.75

Alternative Hypothesis: proportion of Drake’s songs for women < 0.75

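Type of test: Our binary data lends itself to a one-sided z test for a proportion. Each song is either classified as “for women” or not, so under the null hypothesis the number of songs for women follows a binomial distribution. With the assumption that the distribution of our proportion statistic is approximately normal (by the central limit theorem), we can use a normal distribution to approximate the binomial distribution and test our hypothesis. We will use a significance level (alpha) of 0.05, as this is standard practice. The mean and standard deviation are the two parameters that define a normal distribution, so we calculate them using the formulas we learned in class: for n songs and a null proportion p, the mean count is np and the standard deviation is sqrt(np(1 - p)). With n = 130 and p = 0.75, this gives a mean of 97.5 songs and a standard deviation of about 4.94 songs.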

Using a cumulative distribution function as we did in class, we can calculate the probability of observing a proportion of songs for women that is less than or equal to our observed proportion of 0.423, assuming that Drake makes a proportion of 0.75 of his music to appease women. This observed proportion corresponds to a count of 55 songs, since 0.423 * 130 ≈ 55.

Using the norm_below function obtained from class, we find that the probability of such an event is extremely small. Since this p value is approximately 0, which is less than our alpha of 0.05, we reject the null hypothesis. We conclude that the proportion of Drake’s songs that are “for women” is significantly less than 0.75. In non-statistical terms, far fewer of Drake’s songs were classified as being for women than the claim predicts, so the data do not support the idea that his music is heavily skewed towards women.
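The calculation can be reproduced with the standard library. This sketch does not show the class-provided norm_below function, whose definition is not included here; normal_cdf is a stand-in written for illustration, and no continuity correction is applied.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and sd.

    erfc is used instead of 1 + erf so that very small tail
    probabilities are not rounded to exactly zero.
    """
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))


n = 130          # number of songs analyzed
p0 = 0.75        # proportion under the null hypothesis
observed = 55    # songs classified as "for women"
alpha = 0.05     # significance level

mean = n * p0                        # 97.5
sd = math.sqrt(n * p0 * (1 - p0))    # about 4.94
z = (observed - mean) / sd           # about -8.61

p_value = normal_cdf(observed, mean, sd)
reject_null = p_value < alpha
```

The observed count sits more than 8 standard deviations below the mean expected under the null hypothesis, which is why the p value is effectively zero.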

Conclusion

With Drake being a popular artist, we wanted to know if his music is truly for everyone. There is controversy over his target audience, and we wanted to help settle that dispute. Some people think that his music is created solely to appease women, so we decided to look into that and see what the data could tell us. First, we used a website called Genius to obtain Drake’s lyrics from his most popular albums. We then organized and parsed the data so that it could be analyzed. After that, we used the data to generate statistics such as the word frequencies of all the albums. These did not support the null hypothesis, because most of the frequent words did not appeal to women over men. Then, we took the analysis a step further by creating a model that classifies Drake’s songs into one of two categories: songs that are empowering to women and songs that are not. This model was trained on songs found online that we sorted into the two categories. To test our hypothesis formally, we used a one-sided z test based on a normal approximation to the binomial distribution. We found that, if 75% of Drake’s songs really were for women, a proportion as low as the one we observed would be extremely unlikely. So, we concluded that Drake’s music is not only for women.

To improve our model, more songs can be added to the training set. We noticed that, with the sample we had, the model was fairly sensitive to any additions to or removals from the training set. A larger sample would give the model more information to work with and, in turn, produce a more accurate model that classifies lyrics without that sensitivity. The model can also be implemented with categories other than gender so that artists can see how different demographics or topics interact with the lyrics they are producing. Also, word frequencies do not necessarily capture the full context of what Drake could be emphasizing. A better approach may be to look at the themes of whole sentences in his lyrics, which could give a more well-rounded view of what his lyrics are actually trying to convey.