Python offers efficient ways to handle text files and process data. One common task is to check for and remove duplicate lines across files. This article will guide you through a Python script designed for this purpose.
Objective
The goal is to write a Python function, remove_duplicate_lines, that compares two text files: a source file and a checking file. It removes from the checking file any line that duplicates a line in the source file, then writes the result to an output file.
The Python Script
Function Definition
def remove_duplicate_lines(source_file, checking_file, output_file):
    # Your code here
This function takes three arguments:
source_file: the file to compare against.
checking_file: the file from which duplicate lines are removed.
output_file: the file where the result will be saved.
Reading the Source File
with open(source_file, "r", encoding="utf-8") as f_source:
    source_lines = {line.strip().lower() for line in f_source}
This code block reads source_file and uses a set comprehension to store each line; a set gives fast membership lookups in the filtering step that follows. line.strip().lower() removes leading/trailing whitespace and converts each line to lowercase, making the comparison case-insensitive.
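As a quick illustration (not part of the script), all of the following variants normalize to the same key under this scheme:

# Illustrative only: every variant maps to the same normalized key
for raw in ["Hello World\n", "  hello world  ", "HELLO WORLD"]:
    print(repr(raw.strip().lower()))  # 'hello world' every time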
Processing the Checking File
with open(checking_file, "r", encoding="utf-8") as f_checking:
    unique_lines = [line for line in f_checking if line.strip().lower() not in source_lines]
Here, checking_file is read line by line. A line is kept in unique_lines only if its normalized form is not found in source_lines; note that it is the original, unmodified line (including its trailing newline) that gets stored.
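This filter removes lines that duplicate the source file, but it keeps repeated lines within checking_file itself. If those should be dropped too, one possible extension is to track lines already seen (a sketch using the same normalization; not part of the original script):

# Sketch: also skip lines repeated within checking_file itself
seen = set(source_lines)  # start from the source file's normalized lines
unique_lines = []
with open(checking_file, "r", encoding="utf-8") as f_checking:
    for line in f_checking:
        key = line.strip().lower()
        if key not in seen:
            seen.add(key)
            unique_lines.append(line)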
Writing to the Output File
with open(output_file, "w", encoding="utf-8") as f_output:
    f_output.writelines(unique_lines)
The unique lines are written to output_file. This file now contains every line from checking_file that was not a duplicate of a line in source_file. Since writelines() adds no separators of its own, the lines' original newline characters are preserved.
Calling the Function
source_file = "txt1.txt"
checking_file = "txt2.txt"
output_file = "txt/txt-checked.txt"
remove_duplicate_lines(source_file, checking_file, output_file)
Finally, the function is called with the names of the source file, checking file, and output file.
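One caveat: opening a file in "w" mode does not create missing directories, so if the txt folder in the output path does not exist yet, the call raises FileNotFoundError. A small standard-library addition placed before the call handles this (a sketch, not part of the original script):

import os

# Create the output directory first, if it is missing (no-op when it exists)
os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
remove_duplicate_lines(source_file, checking_file, output_file)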
Conclusion
This script demonstrates a straightforward and efficient way to remove duplicate lines from text files using Python. It’s a practical solution for data cleaning and processing tasks where duplicate entries in text data need to be identified and eliminated.
Remember, Python’s standard library provides robust file handling capabilities, making it an excellent choice for such text processing tasks.
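For instance, a caller might guard against missing input files rather than letting the traceback surface (a sketch; adapt the error policy to your needs):

try:
    remove_duplicate_lines(source_file, checking_file, output_file)
except FileNotFoundError as exc:
    print(f"File not found: {exc.filename}")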
The Complete Script

def remove_duplicate_lines(source_file, checking_file, output_file):
    # Read the source file and store its lines in a set for faster lookups
    with open(source_file, "r", encoding="utf-8") as f_source:
        source_lines = {line.strip().lower() for line in f_source}

    # Read the checking file and keep only the lines not found in the source
    with open(checking_file, "r", encoding="utf-8") as f_checking:
        unique_lines = [line for line in f_checking if line.strip().lower() not in source_lines]

    # Write the unique lines to the output file
    with open(output_file, "w", encoding="utf-8") as f_output:
        f_output.writelines(unique_lines)

# Call the function with the appropriate file names
source_file = "txt1.txt"
checking_file = "txt2.txt"
output_file = "txt/txt-checked.txt"
remove_duplicate_lines(source_file, checking_file, output_file)