Removing Duplicate Lines from Text Files in Python

Python offers efficient ways to handle text files and process data. One common task is to remove lines from one file that duplicate lines in another. This article walks you through a Python script designed for this purpose.

Objective

The goal is to write a Python function remove_duplicate_lines that compares two text files: a source file and a checking file. It removes lines from the checking file that are duplicates of any line in the source file, and then writes the result to an output file.

The Python Script

Function Definition

This function takes three arguments:

  • source_file: The file to compare against.
  • checking_file: The file from which duplicates are to be removed.
  • output_file: The file where the result will be saved.
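A sketch of the signature, as described above (the body is developed step by step in the sections that follow):

```python
def remove_duplicate_lines(source_file, checking_file, output_file):
    """Remove lines from checking_file that duplicate any line in
    source_file, and write the remaining lines to output_file."""
    ...  # body filled in below
```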

Reading the Source File

This code block reads the source_file, using a set comprehension to collect its lines into a set, which allows fast membership tests in the next step. line.strip().lower() removes leading/trailing whitespace and converts each line to lowercase, ensuring that the comparison is case-insensitive.
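The reading step might look like this (a minimal sketch; the sample filename, file contents, and UTF-8 encoding are assumptions made for demonstration):

```python
# Create a small sample source file for demonstration (assumed content)
with open("source.txt", "w", encoding="utf-8") as f:
    f.write("Apple\n  Banana  \n")

# Read the source file into a set; strip() drops surrounding whitespace
# and lower() makes later comparisons case-insensitive
with open("source.txt", "r", encoding="utf-8") as f:
    source_lines = {line.strip().lower() for line in f}
```

After this step, source_lines holds the normalized forms, e.g. {"apple", "banana"} for the sample file above.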

Processing the Checking File

Here, the checking_file is read. The script iterates through each line and keeps it in unique_lines only if its normalized form (stripped and lowercased in the same way) is not found in source_lines.
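The filtering step could be sketched as follows (sample filenames and contents are illustrative assumptions):

```python
# Sample input files for demonstration (assumed content)
with open("source.txt", "w", encoding="utf-8") as f:
    f.write("Apple\nBanana\n")
with open("checking.txt", "w", encoding="utf-8") as f:
    f.write("apple\nCherry\nBanana\nDate\n")

# Set of normalized lines from the source file
with open("source.txt", "r", encoding="utf-8") as f:
    source_lines = {line.strip().lower() for line in f}

# Keep only lines whose normalized form is absent from source_lines
with open("checking.txt", "r", encoding="utf-8") as f:
    unique_lines = [line for line in f
                    if line.strip().lower() not in source_lines]
```

With the sample data above, "apple" and "Banana" are dropped as duplicates, while "Cherry" and "Date" survive.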

Writing to the Output File

The unique lines are written to output_file. This file now contains all lines from checking_file that were not duplicates of lines in source_file.
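The final write might look like this (a sketch assuming unique_lines holds the surviving lines with their newlines intact):

```python
# Lines that survived the duplicate check (illustrative values)
unique_lines = ["Cherry\n", "Date\n"]

# writelines() writes the strings exactly as given, so each line
# must already end with its newline character
with open("output.txt", "w", encoding="utf-8") as f:
    f.writelines(unique_lines)
```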

Calling the Function

Finally, the function is called with the names of the source file, checking file, and output file.
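Putting the pieces together, a complete sketch with a sample run (filenames, file contents, and the UTF-8 encoding are illustrative assumptions):

```python
def remove_duplicate_lines(source_file, checking_file, output_file):
    # Normalized lines from the source file, for fast membership tests
    with open(source_file, "r", encoding="utf-8") as f:
        source_lines = {line.strip().lower() for line in f}

    # Lines from the checking file that are not duplicates
    with open(checking_file, "r", encoding="utf-8") as f:
        unique_lines = [line for line in f
                        if line.strip().lower() not in source_lines]

    # Write the surviving lines to the output file
    with open(output_file, "w", encoding="utf-8") as f:
        f.writelines(unique_lines)

# Sample input files for demonstration
with open("source.txt", "w", encoding="utf-8") as f:
    f.write("Apple\nBanana\n")
with open("checking.txt", "w", encoding="utf-8") as f:
    f.write("apple\nCherry\nBANANA\nDate\n")

remove_duplicate_lines("source.txt", "checking.txt", "output.txt")
```

For this sample input, output.txt ends up containing only "Cherry" and "Date", since "apple" and "BANANA" match source lines once case and whitespace are normalized.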

Conclusion

This script demonstrates a straightforward and efficient way to remove duplicate lines from text files using Python. It’s a practical solution for data cleaning and processing tasks where duplicate entries in text data need to be identified and eliminated.

Remember, Python’s standard library provides robust file handling capabilities, making it an excellent choice for such text processing tasks.