Python offers efficient ways to handle text files and process data. One common task is to check for and remove duplicate lines across files. This article will guide you through a Python script designed for this purpose.
Objective
The goal is to write a Python function, remove_duplicate_lines, that compares two text files: a source file and a checking file. It removes from the checking file any line that duplicates a line in the source file, then writes the result to an output file.
The Python Script
Function Definition
def remove_duplicate_lines(source_file, checking_file, output_file):
    # Your code here
This function takes three arguments:
source_file: the file to compare against.
checking_file: the file from which duplicate lines are removed.
output_file: the file where the result will be saved.
Reading the Source File
with open(source_file, "r", encoding="utf-8") as f_source:
    source_lines = {line.strip().lower() for line in f_source}
This code block reads source_file and uses a set comprehension to store each line; a set gives fast membership lookups in the filtering step that follows. line.strip().lower() removes leading/trailing whitespace and converts each line to lowercase, making the comparison case-insensitive.
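As a quick illustration (not part of the script), all of the following variants normalize to the same key under this scheme:

# Illustrative only: every variant maps to the same normalized key
for raw in ["Hello World\n", "  hello world  ", "HELLO WORLD"]:
    print(repr(raw.strip().lower()))  # 'hello world' every time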
Processing the Checking File
with open(checking_file, "r", encoding="utf-8") as f_checking:
    unique_lines = [line for line in f_checking if line.strip().lower() not in source_lines]
Here, checking_file is read line by line. A line is kept in unique_lines only if its normalized form is not found in source_lines; note that it is the original, unmodified line (including its trailing newline) that gets stored.
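This filter removes lines that duplicate the source file, but it keeps repeated lines within checking_file itself. If those should be dropped too, one possible extension is to track lines already seen (a sketch using the same normalization; not part of the original script):

# Sketch: also skip lines repeated within checking_file itself
seen = set(source_lines)  # start from the source file's normalized lines
unique_lines = []
with open(checking_file, "r", encoding="utf-8") as f_checking:
    for line in f_checking:
        key = line.strip().lower()
        if key not in seen:
            seen.add(key)
            unique_lines.append(line)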
Writing to the Output File
with open(output_file, "w", encoding="utf-8") as f_output:
    f_output.writelines(unique_lines)
The unique lines are written to output_file. This file now contains every line from checking_file that was not a duplicate of a line in source_file. Since writelines() adds no separators of its own, the lines' original newline characters are preserved.
Calling the Function
source_file = "txt1.txt"
checking_file = "txt2.txt"
output_file = "txt/txt-checked.txt"
remove_duplicate_lines(source_file, checking_file, output_file)
Finally, the function is called with the names of the source file, checking file, and output file.
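One caveat: opening a file in "w" mode does not create missing directories, so if the txt folder in the output path does not exist yet, the call raises FileNotFoundError. A small standard-library addition placed before the call handles this (a sketch, not part of the original script):

import os

# Create the output directory first, if it is missing (no-op when it exists)
os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
remove_duplicate_lines(source_file, checking_file, output_file)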
Conclusion
This script demonstrates a straightforward and efficient way to remove duplicate lines from text files using Python. It’s a practical solution for data cleaning and processing tasks where duplicate entries in text data need to be identified and eliminated.
Remember, Python’s standard library provides robust file handling capabilities, making it an excellent choice for such text processing tasks.
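For instance, a caller might guard against missing input files rather than letting the traceback surface (a sketch; adapt the error policy to your needs):

try:
    remove_duplicate_lines(source_file, checking_file, output_file)
except FileNotFoundError as exc:
    print(f"File not found: {exc.filename}")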
The Complete Script

def remove_duplicate_lines(source_file, checking_file, output_file):
    # Read the source file and store its lines in a set for faster lookups
    with open(source_file, "r", encoding="utf-8") as f_source:
        source_lines = {line.strip().lower() for line in f_source}

    # Read the checking file and keep only the lines not found in the source
    with open(checking_file, "r", encoding="utf-8") as f_checking:
        unique_lines = [line for line in f_checking if line.strip().lower() not in source_lines]

    # Write the unique lines to the output file
    with open(output_file, "w", encoding="utf-8") as f_output:
        f_output.writelines(unique_lines)

# Call the function with the appropriate file names
source_file = "txt1.txt"
checking_file = "txt2.txt"
output_file = "txt/txt-checked.txt"
remove_duplicate_lines(source_file, checking_file, output_file)