Methodology

Our research examines word usage in Japanese books published between 1920 and 1960. We do this using sentiment analysis, the programming language Python, and a lot of trial and error. We did not use any form of AI at any point in our research. The books come from Aozora Bunko (hereafter simply "Aozora"), a website that hosts the full text of many Japanese books.


Step 1: The dataset

We wanted every book on Aozora in our dataset. Naturally, this covers much more than the period we are looking into, 1920 to 1960, but once we have everything from the website, filtering by year is easy enough. At first we tried to scrape the entire website ourselves, but as we kept running into walls, we looked for another approach. After some searching, we found a website explaining how someone had scraped Aozora before us. We emailed its author, Molly Des Jardin, who referred us to a GitHub page documenting her steps. Following every step closely, we finally had our dataset downloaded locally. This was especially helpful because it let us continue our work without an internet connection.


Step 2: Cleanup

The dataset from the first step was quite extensive, and many of its columns were unimportant for our purposes. The one we were most interested in is the column giving the first release date of each book. The dataset took everything from Aozora itself, so dates were written in the Japanese style (e.g. 1923(大正12)年5月15日), which is hard to work with later on. With the program OpenRefine, cleanup was relatively easy using simple regular expressions (regex). Because 1920 to 1960 spans two Japanese eras, Taisho and Showa, we needed to run the following two expressions:

  1. value.replace(/\(昭和\d+\)/,"").replace(/\(大正\d+\)/,"")
  2. value.replace("年","-").replace("月","-").replace("日","")

The first expression removes the Japanese era annotation, and the second replaces the kanji for year (年) and month (月) with a dash (-) and removes the kanji for day (日). This leaves us with dates in the form YYYY-M-D, which are much easier to work with (the date from before becomes 1923-5-15).
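For readers who prefer to stay in Python, the same clean-up can be sketched as follows. This is an illustration only, not the code we ran: the name normalise_date is made up, and we assume that only Taisho and Showa annotations occur and that the brackets around them may be either ASCII or full-width.

```python
import re

# Matches an era annotation such as "(大正12)" or "（昭和3）".
# Assumption: only Taisho and Showa occur in the 1920-1960 window.
ERA_PATTERN = re.compile(r"[(（](?:大正|昭和)\d+[)）]")


def normalise_date(raw: str) -> str:
    """Turn a date like '1923(大正12)年5月15日' into '1923-5-15'."""
    without_era = ERA_PATTERN.sub("", raw)
    # Year and month markers become dashes, the day marker is dropped.
    return without_era.replace("年", "-").replace("月", "-").replace("日", "")


print(normalise_date("1923(大正12)年5月15日"))  # 1923-5-15
```

Dates that do not follow this pattern (for example a missing day) pass through partly converted, so it is worth checking a sample of the output by hand.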

OpenRefine can also filter on specific dates and then export a new dataset with that filter applied. This was extremely helpful, as it allowed us to narrow the dataset down to exactly the years we needed. In the end, our dataset contains 1492 books published between 1920 and 1960.
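The year filter itself is simple enough to express in a few lines of Python as well. The sketch below is not what we ran: the helper name in_timeframe is hypothetical, and we assume cleaned dates of the form 1923-5-15 with both 1920 and 1960 included.

```python
def in_timeframe(date: str, first_year: int = 1920, last_year: int = 1960) -> bool:
    """Return True if a 'YYYY-M-D' date falls inside the period of interest."""
    year_text = date.split("-")[0]
    # Rows without a usable year (empty or malformed dates) are dropped.
    return year_text.isdigit() and first_year <= int(year_text) <= last_year


dates = ["1919-12-1", "1923-5-15", "1960-1-1", "1961-3-3", ""]
kept = [d for d in dates if in_timeframe(d)]
print(kept)  # ['1923-5-15', '1960-1-1']
```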


Step 3: Words words words

Disclaimer: We did not use any form of AI. Everything we wrote was with the power of the human mind, and normal internet searches.

Another thing we got from Step 1 is the full text of all the books on Aozora. The dataset has a column with the filename of each book, so we iterated over that column and read each file to get its text into our program. This was essential for extracting the words of each work.
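As a rough illustration of that loop, here is a minimal sketch using only the standard library. The function name iter_book_texts, the file names and the sample text are made up for the demonstration; only the column name "Tokenized Filename" comes from our dataset.

```python
import csv
import tempfile
from pathlib import Path


def iter_book_texts(csv_path, text_dir, column="Tokenized Filename"):
    """Yield (filename, full text) for every row of the metadata CSV."""
    with open(csv_path, encoding="utf8", newline="") as handle:
        for row in csv.DictReader(handle):
            filename = row[column].strip()
            yield filename, (Path(text_dir) / filename).read_text(encoding="utf8")


# Tiny self-contained demonstration with a made-up book file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "book1.txt").write_text("吾輩 は 猫 で ある", encoding="utf8")
    (folder / "meta.csv").write_text("Tokenized Filename\nbook1.txt\n", encoding="utf8")
    books = list(iter_book_texts(folder / "meta.csv", folder))

print(books)  # [('book1.txt', '吾輩 は 猫 で ある')]
```

Reading the filename straight from the row avoids any string surgery on the column label.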

To get the "score" of each word, we first needed a separate dataset containing these scores. The one we used for this project is from this website. The way its creators arrived at the scores is quite complicated, and we do not understand it ourselves. The most important part is that we have an extensive dataset which is easy to use.

The first thing we did was read this dataset using the pandas and os Python libraries:

import pandas as pd  
import os

path_dic = os.path.sep.join(['dic', 'pn_ja.dic'])
df = pd.read_csv(path_dic, encoding="utf8", sep=":", names=["lemma", "reading", "pos", "score"])
df.to_csv("pn_ja_withColumnNames.dic", index=False)

This way, we saved the dataset with column names. This isn't necessary, but it makes the dataset easier to work with later on: it let us import only the columns we needed (namely "lemma" and "score"), which made the code clearer and more comfortable to work with.

The main libraries we used were MeCab, a library designed for working with Japanese text; pandas, mentioned above; and Counter from collections, which counts how often elements occur in a list. You can find the code we wrote below. If you're not interested in the nitty-gritty, feel free to skip ahead.

```python
import MeCab as mc
import pandas as pd
import numpy as np
import os
from collections import Counter

# Import the dataset with scores
path_dic = os.path.sep.join(['dic', 'pn_ja_columnNames.dic'])
df_pn = pd.read_csv(path_dic, encoding="utf8", sep=",", usecols=['lemma','score'])

# Setting up MeCab
tokenizer = mc.Tagger("unidic-kindai-bungo-v202512")
mecab = mc.Tagger()

# The dataset to which we want to add the scores at the end
nd = pd.read_csv("wpc-only-needed-timeframe.csv")

# Dataframe with only the column of filenames.
# Useful to iterate over to get the link to the files with full texts.
filenames = pd.read_csv("wpc-only-needed-timeframe.csv", encoding="utf8", usecols=["Tokenized Filename"])

# Empty arrays which will later be added to the nd dataframe
scores_nouns = []
scores_verbs = []
scores_adjectives = []

# The link to the folder with the files of the books
link = "C:/Users/Ariana/wpc/database adding/tokenized/"

# Variable to iterate over the rows
i = 0

# Start of the for loop that iterates over all the rows of the filenames dataframe
for row in filenames.iterrows():
    # Extract the filename on location i. Need to make this into a string,
    # and replace a part of it to have only the filename left over.
    filename = filenames.loc[i].to_string().replace("Tokenized Filename ", "")
    textFile = link + filename

    # Open the file with link textFile, and replace spaces with nothing.
    sent = open(textFile, encoding="utf8").read().replace(" ", "")

    # Parse the text with MeCab, and write it to a variable node.
    node = tokenizer.parseToNode(sent)

    # Empty arrays that will eventually be added to the arrays defined before this loop.
    nouns_toadd = []
    verbs_toadd = []
    adjectives_toadd = []
    nscores = []
    vscores = []
    ascores = []

    # For every element in node
    while node:
        # Look for nouns, verbs, and adjectives respectively,
        # and add them to their respective arrays
        if node.feature.split(",")[0] == u"名詞":
            nouns_toadd.append(node.surface)
        elif node.feature.split(",")[0] == u"動詞":
            verbs_toadd.append(node.feature.split(",")[7])
        elif node.feature.split(",")[0] == u"形容詞":
            adjectives_toadd.append(node.feature.split(",")[7])
        # Go to the next node.
        node = node.next

    # Get the scores for each word found in each of the arrays,
    # and append them to the scores array.
    for noun in nouns_toadd:
        if noun in df_pn['lemma'].values:
            score = df_pn.loc[df_pn['lemma'] == noun, 'score'].iloc[0]
            nscores.append(score)
    for verb in verbs_toadd:
        if verb in df_pn['lemma'].values:
            score = df_pn.loc[df_pn['lemma'] == verb, 'score'].iloc[0]
            vscores.append(score)
    for adjective in adjectives_toadd:
        if adjective in df_pn['lemma'].values:
            score = df_pn.loc[df_pn['lemma'] == adjective, 'score'].iloc[0]
            ascores.append(score)

    # Append the 30 most common scores to the arrays defined outside of this loop
    scores_nouns.append(Counter(nscores).most_common(30))
    scores_verbs.append(Counter(vscores).most_common(30))
    scores_adjectives.append(Counter(ascores).most_common(30))

    # Print to keep track of where we are in the loop, and increment i by 1
    print(i)
    i += 1

# Add the arrays with most common scores to the original dataset, and save that to a new one.
nd["Noun Scores"] = scores_nouns
nd["Verb Scores"] = scores_verbs
nd["Adjective Scores"] = scores_adjectives
nd.to_csv("ds_with_scores.csv", index=False)
```

In the end, this added new columns to our existing dataset containing, for each book, the 30 most common scores and how often each occurred. Naturally, running this took quite some time: around a full day.
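Part of that run time probably comes from the check `noun in df_pn['lemma'].values`, which scans the whole score column for every single word. A plain dictionary would likely be much faster. The sketch below uses made-up lemmas and scores and is not the code we ran; it keeps the first score listed for each lemma, as `.iloc[0]` does in our script, and shows the shape of what `most_common` returns.

```python
from collections import Counter

# Hypothetical miniature of the score dataset: (lemma, score) pairs in file order.
score_rows = [("愛", 0.9), ("死", -0.99), ("愛", 0.5)]

# Keep the first score seen for each lemma, mirroring `.iloc[0]`.
first_score = {}
for lemma, score in score_rows:
    first_score.setdefault(lemma, score)

# Words extracted from one (made-up) book; "猫" has no score and is skipped.
words = ["愛", "死", "愛", "猫"]
scores = [first_score[word] for word in words if word in first_score]

# A list of (score, how often it occurred) pairs, most frequent first.
top_scores = Counter(scores).most_common(30)
print(top_scores)  # [(0.9, 2), (-0.99, 1)]
```

Each dictionary lookup takes roughly constant time, so the cost per book grows with the number of words rather than with the number of words times the size of the score dataset.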

One thing to note is that the sentiment dataset contains quite a few words that appear multiple times with different scores. This is because a word written with the same characters can have different readings, each with a slightly different meaning. Our code disregards this aspect entirely (it simply takes the first score listed for a word), so results could differ slightly if you took it into account. We didn't because we weren't sure how to handle it effectively. The creator of the dataset also has a couple of webpages describing how to use it correctly, which you can find here, but when we tried to adapt that code to our circumstances, we couldn't get it to work. For that reason we decided to write our own, less optimal code. This is certainly an aspect that could be improved upon in the future.
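To give an idea of what taking this into account could look like, the sketch below first finds the words whose readings disagree and then averages their scores. Averaging is purely our assumption for illustration, not the method recommended by the dataset's creator, and the entries (including their scores) are made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up entries in the (lemma, reading, score) shape of the score dataset.
entries = [
    ("上", "うえ", 0.5),
    ("上", "かみ", -0.25),
    ("山", "やま", 0.75),
]

# Collect every score listed for a lemma, whatever its reading.
scores_by_lemma = defaultdict(list)
for lemma, _reading, score in entries:
    scores_by_lemma[lemma].append(score)

# Lemmas whose readings disagree on the score.
ambiguous = {lemma for lemma, scores in scores_by_lemma.items() if len(set(scores)) > 1}

# One possible fallback: average the scores of all readings.
averaged = {lemma: mean(scores) for lemma, scores in scores_by_lemma.items()}

print(ambiguous)       # {'上'}
print(averaged["上"])  # 0.125
```

A more faithful approach would match on the reading that MeCab reports for each token, but that depends on the readings in both resources being written the same way, which we have not verified.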


Step 4: Visualisation

For the visualisation, we used the Python library matplotlib, both for plotting the data and for animating it as a gif. We chose an animation simply because plotting everything for each year would produce too many graphs (120 in total: 40 years for each of the three word classes). The process is relatively straightforward. First, because of the way our code was written, the new columns in the dataset weren't split up nicely. This can be fixed easily with OpenRefine:

value.replace("np.float64","").replace(", ((","* ").replace("(","").replace(")","").replace("[","").replace("]","")
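The same split can also be done in Python with a regular expression. This is a minimal sketch rather than the route we took: the name split_scores is hypothetical, and we assume each cell looks like the string form of a list of (np.float64(score), count) tuples, as written by our script.

```python
import re

# One "(np.float64(<score>), <count>)" pair; the score may use exponent notation.
PAIR_PATTERN = re.compile(
    r"\(np\.float64\((-?\d+(?:\.\d+)?(?:e[-+]?\d+)?)\),\s*(\d+)\)"
)


def split_scores(cell: str):
    """Split a stringified list of (score, count) tuples into two lists."""
    pairs = PAIR_PATTERN.findall(cell)
    scores = [float(score) for score, _ in pairs]
    counts = [int(count) for _, count in pairs]
    return scores, counts


cell = "[(np.float64(-0.5), 12), (np.float64(0.25), 7)]"
print(split_scores(cell))  # ([-0.5, 0.25], [12, 7])
```

Values such as nan or inf are not matched by this pattern and would be silently skipped.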

For the plotting, we first read the csv file with pandas, so that we can pass its columns straight to matplotlib and plot them against each other.

```python
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.animation as animation
import pandas as pd

# Read the dataset with necessary columns only
df = pd.read_csv("dataset-with-seperated-scores.csv", encoding="utf8",
                 usecols=["底本初版発行年1", "Noun Scores", "Noun Score count",
                          "Verb Scores", "Verb Score count",
                          "Adjective Scores", "Adjective Score count"])

# The starting year of our timeframe of interest
START_YEAR = 1920

# Make an array of titles, needed to have a unique title for each frame in the animation
titles = ["year {}".format(frame) for frame in range(1920, 1960)]


# Function we will call for the animation. The animation will apply this function for each frame.
def func(frame, ax, a, titles):
    # Derive the year from the frame index rather than from a global counter:
    # FuncAnimation can call this function more than once for the first frame,
    # which would push a counter out of step with the titles.
    year = START_YEAR + frame

    # Clear the axis so that each frame draws a new plot
    ax.cla()

    # Get the rows for the specific year, and from that new dataframe,
    # take the columns we wish to plot
    b = a.loc[a["底本初版発行年1"] == year]
    ac = b["Noun Scores"]
    dc = b["Noun Score count"]

    # Set the title from the array of titles, and plot them in a bar plot
    ax.set_title(titles[frame])
    ax.bar(ac, dc, width=0.05)
    return ax


# Setting up for plotting
fig = plt.figure()
ax = fig.add_subplot(111)

# How many frames the animation will have
frames = range(40)

# Run the animation, and write it to a file
ani = animation.FuncAnimation(fig, func, frames, interval=1000, repeat_delay=1000,
                              blit=False, fargs=(ax, df, titles))
ani.save(filename="nouns.gif", writer="pillow")
```

This was only for the nouns, but we wanted the same for the adjectives and verbs. To get those, we simply ran the code again after editing the necessary parts (for example, taking the verb columns instead of the noun columns). After three runs in total, we had the following gifs:

*Nouns gif*

*Verbs gif*

*Adjectives gif*

And we're done!

As you can see, the axes of the plots change constantly. This is because each year has a different number of books, and therefore a different amount of data. We did it this way because otherwise the outliers would dominate the graph and make it much harder to read the years with few publications. Below, you can find a failed experiment that shows why we went with changing axes instead.

*Prototype nouns gif*