Penn State research team tested OpenAI’s GPT-2 for plagiarism.
The use of language models such as ChatGPT to generate responses to user prompts has raised concerns about plagiarism: the models may inadvertently reuse material from their training data without crediting the original source.
Before relying on chatbots to complete their assignments, students should be aware of a study conducted by a research team led by Penn State. The study examined the question directly and found that language models commit several forms of plagiarism when generating text in response to user prompts.
Dongwon Lee, professor of information sciences and technology at Penn State, explains that “plagiarism can take different forms”. The study set out to determine whether language models not only copy and paste but also, without realizing it, resort to more sophisticated forms of plagiarism.
The researchers focused on identifying three forms of plagiarism: verbatim plagiarism, which is directly copying and pasting content; paraphrase plagiarism, which is rewording and restructuring content without citing the original source; and idea plagiarism, which is reusing the central idea of a text without attribution. To detect these, they built an automated plagiarism-detection pipeline and evaluated it against OpenAI’s GPT-2. They chose that model because its training data is available online, allowing them to compare generated texts against the 8 million documents used to pre-train it.
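To make the simplest of these categories concrete, the sketch below shows one naive way to flag verbatim reuse: measuring word n-gram overlap between a generated text and a candidate source document. This is an illustration only, not the researchers’ detector; the n-gram length and the example strings are arbitrary choices for demonstration.

```python
# Naive verbatim-overlap check between a generated text and a candidate
# source. Illustrative only; the study's actual pipeline is more involved.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also appear in the source.

    A high score suggests verbatim copying. Paraphrase and idea plagiarism
    reword or abstract the content, so exact n-gram matching misses them.
    """
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(source, n)) / len(gen)

if __name__ == "__main__":
    src = "language models may memorize and reproduce passages from their training data verbatim"
    gen = "studies show language models may memorize and reproduce passages from their training data"
    print(f"verbatim overlap: {verbatim_overlap(gen, src, n=5):.2f}")
```

Because this kind of exact matching only catches copy-and-paste reuse, detecting the other two forms requires semantic comparison between texts.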
To identify plagiarism in both the pre-trained language model and fine-tuned models specialized in specific topics, the researchers analyzed 210,000 generated texts. They fine-tuned three language models on scientific documents, scholarly articles related to COVID-19, and patent claims. For each generated text, an open-source search engine retrieved the ten most similar training documents, and a modified text-alignment algorithm then flagged instances of verbatim, paraphrase, and idea plagiarism.
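The two-stage shape of that pipeline, retrieval followed by alignment, can be sketched as follows. This is a simplified stand-in, assuming TF-IDF cosine similarity in place of the study’s open-source search engine and Python’s difflib in place of its modified alignment algorithm; the function names and the tiny example corpus are hypothetical.

```python
# Two-stage sketch: (1) retrieve the training documents most similar to a
# generated text, then (2) align the text against each candidate. TF-IDF and
# difflib are simple proxies for the study's search engine and aligner.
from difflib import SequenceMatcher

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(generated: str, corpus: list[str], k: int = 10) -> list[int]:
    """Indices of the k training documents most similar to the generated text."""
    vectorizer = TfidfVectorizer().fit(corpus + [generated])
    scores = cosine_similarity(
        vectorizer.transform([generated]), vectorizer.transform(corpus))[0]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

def longest_shared_span(generated: str, source: str) -> str:
    """Longest contiguous word sequence shared by both texts.

    Catches verbatim reuse only; paraphrase and idea plagiarism require the
    semantic comparison a real alignment algorithm performs.
    """
    gen, src = generated.split(), source.split()
    m = SequenceMatcher(None, gen, src).find_longest_match(0, len(gen), 0, len(src))
    return " ".join(gen[m.a:m.a + m.size])

if __name__ == "__main__":
    corpus = [
        "patent claims describe the scope of an invention in precise language",
        "scholarly articles on COVID-19 accumulated rapidly during the pandemic",
    ]
    generated = "the scope of an invention in precise language is set by its claims"
    for idx in retrieve_top_k(generated, corpus, k=1):
        print(idx, "->", longest_shared_span(generated, corpus[idx]))
```

Restricting alignment to the top retrieved documents keeps the comparison tractable: checking every generated text against all 8 million training documents directly would be prohibitively expensive.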
The team found that the language models engaged in all three types of plagiarism, and that plagiarism became more frequent as the training dataset and the model’s parameter count grew. Fine-tuned models showed less verbatim plagiarism but more paraphrase and idea plagiarism, and all three forms could expose individuals’ private information. The team is scheduled to present their findings at the ACM Web Conference in Austin, Texas, from April 30 to May 4, 2023.
According to Jooyoung Lee, a doctoral student in Penn State’s College of Information Sciences and Technology, people pursue ever larger language models because generation ability improves with size. A key finding of the study is that, at the same time, this pursuit jeopardizes the originality and creativity of the content in the training corpus.
The researchers say the study underscores the need for further research into text generators and the ethical and philosophical questions they raise.
Thai Le, an assistant professor of computer and information science at the University of Mississippi who began the project as a doctoral candidate at Penn State, cautioned that appealing output and usefulness for certain tasks do not make language models practical. The ethical and copyright concerns text generators raise, he emphasized, must be addressed before they can be put to real use.
Although the study’s findings apply only to GPT-2, the automatic plagiarism-detection pipeline the researchers established can be applied to newer language models such as ChatGPT to assess how often they plagiarize training content. Such testing, the researchers noted, depends on developers making their training data openly available.
The scientists say the present research can help AI researchers build more robust, reliable, and responsible language models in the future. For now, they urge individuals to exercise caution when using text generators.
Jinghui Chen, an assistant professor of information sciences and technology at Penn State, notes that while researchers work to make language models more capable and robust, many people already use them daily for tasks that boost productivity. Using them as a search engine or to debug code may be fine, Chen warns, but for other purposes the models may produce plagiarized content, with potentially negative consequences for the user.
To Dongwon Lee, the plagiarism comes as no surprise.
“As stochastic parrots, we have trained language models to imitate human writing without instructing them how to avoid plagiarism,” he explained. “Now, our objective is to teach them to write more properly, but we have a long journey ahead of us.”