Skip to content

Manasjain/Deduplication

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

Deduplication

Deduplicates dataset using csvdedupe.

1.) Clone this repository to your machine.

2.) Create a python virtualenv for your project.

 Setting up virtualenv refer: http://docs.python-guide.org/en/latest/dev/virtualenvs/ 

3.) Activate virtualenv using below commands:

source project_name/bin/activate.

4.) Change the directory to your cloned repository.

5.) Csvdedupe takes a messy input file or STDIN pipe and identifies duplicates. Thus, provide an input file, field names, and output file:

csvdedupe Deduplication/dataset.csv \  --field_names ln dob gn fn \    --output_file output.csv

6.) Output.csv contains an additional column with cluster id.

About

Deduplicates dataset using csvdedupe.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published