Introduction

In this post we will discuss, step by step, how we used AI to help us with our project. For our project we wanted to look at the parasocial relationships fans have with Vtubers (virtual video creators). For our quantitative research we wanted to scrape the YouTube comments under all of the videos of Pekora Ch. 兎田ぺこら (https://www.youtube.com/@usadapekora), one of the most popular Vtubers right now. We decided that Python was going to be the best way to do this. We also want to preface this by saying that none of us have any prior experience using Python and that we had basically no outside help, other than Professor Coppens helping with an installation error and showing us how to use certain scripts more efficiently using Notepad.

Tools

Here are the tools we mainly used for our data scraping:

- Python (and the command prompt): The programming language we used to run all our data-scraping scripts. Python also relies on libraries that were necessary for some scripts; these were downloaded through the command prompt with pip install.
- ChatGPT: The AI chatbot we used, which did 95% of all of the work for us. ChatGPT is developed by OpenAI, in which Microsoft is a major investor. ChatGPT is very adept at and useful for writing programming scripts and is already widely used in most programming courses.
- Google Cloud for our API key: As it is against the terms of service of Google and YouTube to scrape data from their sites directly, we had to request a key from Google Cloud and enable the YouTube Data API v3 for it. This key was necessary in almost all of our scripts to gather data from YouTube.
- YouTube: The video-sharing website on which our target Vtuber uploads and livestreams. She has garnered 2.4 million subscribers and reaches an average of 28 thousand live viewers.
- KomodoEdit: A program we learned to use last semester in the course 'L-Dataverwerking'. We used it to edit large amounts of text and open large text files (which would otherwise crash Notepad).
- Notepad: A program in which we stored our Python scripts for easier execution in the command prompt.

Understanding Python

As stated before, none of us have any experience using Python, but thanks to AI we were able to figure out how to properly use Python as a program and execute a simple script. When we first asked for a script, ChatGPT responded as shown in screenshot 1. This can be quite overwhelming for someone who has no experience in coding, but luckily that is no problem, since you can always ask the AI for further instructions or to explain certain terms, as in screenshots 2 to 5.

Step by step

In the following section we list, in chronological order, the most prominent scripts we used to scrape data with Python.

Step 1:

We tested our newfound knowledge on a simple script, one that ChatGPT itself suggested, using the aforementioned API key and a randomly chosen video ID (an ID linked to a video, which can be found in its URL). After entering both into the script and saving it as a .py file, we could run it through the command prompt (screenshots 6 and 7).
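
As a hedged sketch of the kind of first test script this was (not a reproduction of the screenshots): it looks up the title and view count of one video through the YouTube Data API v3, assuming the google-api-python-client library and placeholder values for the key and video ID.

    # Sketch of a first test script: look up one video through the YouTube Data API v3.
    # Requires: pip install google-api-python-client
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"      # placeholder: the key requested through Google Cloud
    VIDEO_ID = "SOME_VIDEO_ID"    # placeholder: the ID at the end of a video URL

    youtube = build("youtube", "v3", developerKey=API_KEY)

    # Ask for basic metadata and statistics of the chosen video
    response = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

    video = response["items"][0]
    print("Title:", video["snippet"]["title"])
    print("Views:", video["statistics"]["viewCount"])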

Step 2:

Now that we had our first success using Python, it was time to move on to actually scraping comments. We asked ChatGPT for a script that scrapes all comments under a video using an API key, and with some fiddling around we were able to gather all of the comments under one video. This did not yet include certain elements such as replies to comments, the number of likes on a comment and the number of replies under a comment (screenshot 8).
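
As a hedged sketch rather than our exact script: the YouTube Data API v3 exposes top-level comments through commentThreads.list, and a minimal version of such a script could look like this (placeholder key and video ID).

    # Sketch: fetch top-level comments under one video with commentThreads.list.
    # A single request returns at most 100 comment threads (see "Problems and Solutions").
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"
    VIDEO_ID = "SOME_VIDEO_ID"

    youtube = build("youtube", "v3", developerKey=API_KEY)

    response = youtube.commentThreads().list(
        part="snippet",
        videoId=VIDEO_ID,
        maxResults=100,           # the API's per-request maximum
        textFormat="plainText",
    ).execute()

    for item in response["items"]:
        snippet = item["snippet"]["topLevelComment"]["snippet"]
        print(snippet["authorDisplayName"], ":", snippet["textDisplay"])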

Step 3:

Our next step towards our goal was to be able to scrape multiple videos at once, so we asked just that: ChatGPT modified the existing code to accept multiple video IDs. The amount of comments we were scraping was becoming quite large, so we also asked ChatGPT to write the scraped comments to a single .txt file called "comments". These were the results (screenshot 9):
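
Sketched below, under the same placeholder assumptions as before: the script loops over a list of video IDs and appends every comment to one output file. For brevity this version only fetches the first page of comments per video; the paginated version is sketched under "Problems and Solutions".

    # Sketch: scrape comments for several videos and collect them in comments.txt.
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"
    VIDEO_IDS = ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]   # placeholders

    youtube = build("youtube", "v3", developerKey=API_KEY)

    with open("comments.txt", "w", encoding="utf-8") as outfile:
        for video_id in VIDEO_IDS:
            response = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                maxResults=100,
                textFormat="plainText",
            ).execute()
            for item in response["items"]:
                text = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                outfile.write(text + "\n")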

Step 4:

Now that we were able to scrape as many comments as we liked, as long as we had the right video IDs, the next step was to write a different script to gather all the IDs under one YouTube channel (in this case Pekora Ch.). So once again we asked ChatGPT to write a new script to do just that. Now we had all the video IDs in one list, and the only thing left before the final step was to reformat the list (we wanted to put every ID on one line, between apostrophes and separated by a comma and a space) so we could copy-paste it into the script of step 3. We did this in KomodoEdit, using ChatGPT to help us write a regular expression: find \r and replace it with ', ' (screenshot 10).
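
As explained under "Problems and Solutions", the ID-gathering script ended up using search.list because the channel's uploads playlist was hidden. A hedged sketch of that approach (placeholder API key; the channel ID is the old-style one mentioned further below):

    # Sketch: collect every video ID uploaded by a channel via search.list.
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"                  # placeholder
    CHANNEL_ID = "UC1DCedRgGHBdm81E1llLhOQ"   # Pekora Ch.'s old-style channel ID

    youtube = build("youtube", "v3", developerKey=API_KEY)

    video_ids = []
    params = {
        "part": "id",
        "channelId": CHANNEL_ID,
        "type": "video",
        "maxResults": 50,     # search.list returns at most 50 results per page
        "order": "date",
    }
    while True:
        response = youtube.search().list(**params).execute()
        for item in response["items"]:
            video_ids.append(item["id"]["videoId"])
        token = response.get("nextPageToken")
        if not token:
            break
        params["pageToken"] = token

    # One ID per line, ready to be reformatted in KomodoEdit
    with open("video_ids.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(video_ids))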

Step 5:

As you might guess, the final step was simply to copy-paste the freshly reformatted ID list into the aforementioned script (step 3) to scrape all the comments under one channel. The result? An 18 MB text file consisting of 303K lines. This is a very messy document that still had to be cleaned before we could conduct our quantitative research.

Problems and Solutions

Here are some small problems and hurdles we encountered and how we solved them, with and without AI:

- Installation error with Python: Even though we downloaded Python through its website, it did not work properly. With some outside help from Professor Coppens, we decided to uninstall Python and redownload it through the Microsoft Store. This solved the problem.
- Only being able to scrape 100 comments at a time: Because Google restricts data scraping, we were only able to scrape 100 comments per request. This was solved by asking ChatGPT whether it was possible to fetch an unlimited number of comments, after which it modified the script to keep making 100-comment requests until no comments were left (a sketch of that paginated loop follows after this list).
- Channel URL: YouTube recently changed the way it uses channel URLs. YouTube used to put a channel ID at the end of a URL; now it uses a sort of @ handle to identify creators. Take Pekora Ch. for example: it used to be https://www.youtube.com/channel/UC1DCedRgGHBdm81E1llLhOQ, now it is https://www.youtube.com/@usadapekora. This is important because only the old URL worked in our script. We could easily find the older URL through social media links and the inspect element of the YouTube channel.
- YouTube video ID script: The first script searched for an uploads playlist on the selected channel to scrape all the IDs, but because the selected creator had set their uploads playlist to hidden, executing the script always resulted in an error. We asked ChatGPT for an alternative way to scrape all the IDs, so the AI rewrote the whole script to use search.list instead to find all the uploaded content of the channel. Note, however, that this doesn't distinguish between regular videos, past livestreams and shorts (short-style videos like the ones on TikTok).
- Data cleaning through ChatGPT: We tried to clean the YouTube IDs using AI, but this turned out to be a fruitless effort, as ChatGPT either skipped some of the listed IDs or kept generating endlessly, giving us random IDs. This goes to show that AI can't really do everything (yet).
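
To illustrate the fix for the 100-comment limit mentioned above: the API returns a nextPageToken with every page, and the loop simply keeps requesting pages of 100 comments until that token runs out. A hedged sketch (placeholder key and video ID):

    # Sketch: keep requesting pages of 100 comment threads until no nextPageToken remains.
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"
    VIDEO_ID = "SOME_VIDEO_ID"

    youtube = build("youtube", "v3", developerKey=API_KEY)

    comments = []
    params = {
        "part": "snippet",
        "videoId": VIDEO_ID,
        "maxResults": 100,
        "textFormat": "plainText",
    }
    while True:
        response = youtube.commentThreads().list(**params).execute()
        for item in response["items"]:
            comments.append(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
        token = response.get("nextPageToken")
        if not token:
            break                 # no more pages: every available comment has been collected
        params["pageToken"] = token

    print(f"Scraped {len(comments)} comments")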

Data Cleaning

Now we had all our scraped comments in one document, but the document itself consisted of very messy and unworkable data. Therefore we used OpenRefine to clean it. This was quite complicated because of GREL, the expression language used by OpenRefine. We used AI to write regular expressions, looked on the internet for existing JSON scripts, or tinkered with existing scripts and regexes. In this way we were able to remove all romaji, emoji and other special symbols until only Japanese kana and kanji remained. This cleaning was very frustrating, as GREL is not used by a lot of people and information about certain regexes is very scarce. Furthermore, the AI tools sometimes gave us wrong regexes that produced errors and didn't work.
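
The heart of the cleaning is a character-class filter that keeps only Japanese script. As an illustration in Python rather than GREL (so not the exact expression we ran in OpenRefine), the same idea looks roughly like this:

    # Sketch: strip everything that is not hiragana, katakana or kanji from a comment.
    import re

    # Unicode ranges: hiragana (3040-309F), katakana (30A0-30FF),
    # CJK unified ideographs (4E00-9FFF), plus the iteration mark 々 (3005)
    JAPANESE_ONLY = re.compile(r"[^\u3040-\u30FF\u4E00-\u9FFF\u3005]")

    def keep_japanese(text: str) -> str:
        return JAPANESE_ONLY.sub("", text)

    print(keep_japanese("Pekora is so cute!! ぺこらちゃんかわいい😂"))
    # -> ぺこらちゃんかわいい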

Data Analysis

As the final step in our quantitative research, we tried using a parser to analyse the Japanese grammar in the scraped comments and to determine what type of informal or formal language was used. We first tried the Python version of MeCab and tinkered with the parser, but to no avail, and with the deadline approaching we chose an alternative: Voyant. Voyant is a website that counts the frequency of words in an uploaded document. Through Voyant we were able to determine the most used words and kanji in the comments, and we could conclude that most of the words that were used signified a familiar register and also had a positive connotation. There were of course still problems with Voyant: it counted particles as words, and made-up words like Pekora, konpeko and the like were split up and counted as different words, which muddled our frequency list. That is why we still had to make an Excel document with our selected words and visualize them with a graph. These could be used for the conclusion of our project.
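
For completeness, a minimal sketch of the kind of MeCab parsing we were attempting, assuming the mecab-python3 and unidic-lite packages (one common way to set MeCab up in Python, and not necessarily the configuration we tried):

    # Sketch: morphological analysis of one comment with MeCab.
    # Assumes: pip install mecab-python3 unidic-lite
    import MeCab

    tagger = MeCab.Tagger()    # falls back on the unidic-lite dictionary if installed
    print(tagger.parse("ぺこらちゃんかわいい"))
    # Each output line lists a token with its part of speech, which is what we hoped
    # to use to distinguish formal from informal register.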