In the rapidly evolving landscape of data science and machine learning, data versioning has become a crucial practice. As datasets grow in size and complexity, keeping track of changes, ensuring reproducibility, and maintaining data integrity are essential tasks. This article delves into the concept of data versioning, its importance, use cases, and applications, focusing on how it can be efficiently managed using Python.
What is Data Versioning?
Data versioning is the process of tracking and managing changes to datasets over time, similar to how version control systems manage source code. It involves creating snapshots of datasets at different points, including metadata like timestamps and change logs, to maintain a historical record and allow users to revert, compare, and understand data evolution.
Applications of Data Versioning
Why is Data Versioning Necessary?
Reproducibility: Reproducibility is a cornerstone of scientific research and data analysis. Data versioning ensures that datasets used in experiments, analyses, or machine learning models can be precisely replicated. By maintaining versions of datasets, researchers can trace back to the exact data that produced specific results, enabling others to validate findings and build upon previous work.
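To make the snapshot idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not how DVC is implemented. The function names (`snapshot_dataset`, `verify_snapshot`) and the `manifest.json` change log are assumptions made for illustration. Each snapshot is stored under a hash of its contents along with a timestamp and a message, so a result can later be traced back to the exact data that produced it.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_dataset(data_file: Path, store: Path, message: str) -> dict:
    """Copy data_file into store under its content hash and log the version."""
    store.mkdir(parents=True, exist_ok=True)
    checksum = file_sha256(data_file)
    shutil.copy2(data_file, store / checksum)  # identical content is stored once
    entry = {
        "file": data_file.name,
        "sha256": checksum,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    }
    manifest = store / "manifest.json"  # hypothetical change log
    history = json.loads(manifest.read_text()) if manifest.exists() else []
    history.append(entry)
    manifest.write_text(json.dumps(history, indent=2))
    return entry


def verify_snapshot(data_file: Path, entry: dict) -> bool:
    """Check that data_file still matches the version recorded in entry."""
    return file_sha256(data_file) == entry["sha256"]
```

Saving the returned entry next to an experiment's results is one way to record which version of the data was used. `verify_snapshot` then tells you whether the file on disk is still that version.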
Let’s Get Started
Prerequisites
Pushing Data from local to remote
Open your terminal and install DVC:
pip install dvc
Then initialize Git and DVC, in that order:
git init
dvc init
Keep all the data used by the project in one folder. This tutorial assumes the folder is named data.
To add a data file to DVC, run the command below, where .file stands for the file's extension:
dvc add <path/to/data.file>
As soon as you run this command, a .dvc file is created next to the data file. Stage it in Git together with the .gitignore in the data folder:
git add <path/to/data.file>.dvc <path/to/data/folder>/.gitignore
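For intuition about what the .dvc file tracks: DVC identifies a tracked file by a hash of its contents (MD5 by default) and records that hash in the small .dvc file, which is what Git versions instead of the data itself. The sketch below computes such a hash in plain Python. `file_md5` and `has_changed` are hypothetical helpers, not DVC functions, and DVC's exact hashing details vary between versions, so treat this as an illustration rather than a reproduction of the value DVC writes.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_md5: str) -> bool:
    """True if the file no longer matches the hash recorded for it."""
    return file_md5(path) != recorded_md5
```

Because the hash depends only on the file's bytes, any edit to the data yields a different hash, which is how a changed dataset can be detected without comparing files line by line.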
The git add command above stages these changes in Git. Next, the data itself goes to separate remote storage, known as a DVC remote. Set up a remote for DVC just as you would set up an origin for Git.
[Figure: How to find your key]
Note: The key must be kept secret.
To add a DVC remote named origin, use the following syntax:
dvc remote add --default origin gdrive://<key of the Google Drive folder where you want to store the data>
This syntax applies only to Google Drive storage. For other kinds of storage, check the reference below.
After you have added the remote, run the command below as it is:
git commit .dvc/config -m "Configured Remote"
The next step is to push the data to the remote. To authenticate with Google Drive, first install the following dependency:
pip install dvc-gdrive
Then push the data with the following command:
dvc push
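Since the article's focus is Python, the tracking and pushing steps can also be scripted. The following is a rough sketch, assuming the DVC remote is already configured and that `dvc` and `git` are on your PATH. The function names, the `dry_run` flag, and the extra `git commit` of the staged .dvc file are assumptions for illustration, not part of the original steps.

```python
import shlex
import subprocess
from pathlib import PurePosixPath
from typing import List


def track_and_push_commands(data_file: str, message: str) -> List[List[str]]:
    """Build, in order, the commands for tracking one file and pushing it."""
    gitignore = str(PurePosixPath(data_file).parent / ".gitignore")
    return [
        ["dvc", "add", data_file],                      # creates <data_file>.dvc
        ["git", "add", f"{data_file}.dvc", gitignore],  # stage the .dvc file and ignore rule
        ["git", "commit", "-m", message],               # assumption: commit what was staged
        ["dvc", "push"],                                # upload data to the default remote
    ]


def run_all(commands: List[List[str]], dry_run: bool = True, runner=subprocess.run) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            runner(command, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # "data/train.csv" is a hypothetical path; dry_run=True only prints.
    run_all(track_and_push_commands("data/train.csv", "Track train.csv"))
```

Running with the default `dry_run=True` only prints the commands, which is a safe way to check them before passing `dry_run=False`.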
The first time you push, DVC prompts you to authenticate with Google in your default browser. Make sure to choose the same account that owns the folder. After authentication, the data is pushed. You can cross-check this by looking inside the Google Drive folder.
Pulling / Cloning Data from remote to local
Now suppose you want to clone the data into a new project named Data Versioning Clone. Initialize Git and DVC, in that order:
git init
dvc init
To pull data from a Git repository named project, where data is the folder containing the data files you need, run the following. The -o flag sets where the data is stored in the local repository after pulling.
dvc get git@<host>:username/project.git data/da.txt -o data/
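Because the point of versioning is being able to retrieve a particular version, it can help to pin the revision when fetching. `dvc get` accepts a `--rev` option (a Git branch, tag, or commit) that the command above does not use. The sketch below wraps the call in Python; `dvc_get_command` and `fetch` are hypothetical helper names, and the tag `v1.0` in the usage comment is an assumed example.

```python
import subprocess
from typing import List, Optional


def dvc_get_command(repo_url: str, path: str, out: str,
                    rev: Optional[str] = None) -> List[str]:
    """Build a `dvc get` command, optionally pinned to a Git revision."""
    command = ["dvc", "get", repo_url, path, "-o", out]
    if rev is not None:
        command += ["--rev", rev]  # branch, tag, or commit of the source repo
    return command


def fetch(repo_url: str, path: str, out: str, rev: Optional[str] = None) -> None:
    """Download one tracked file or folder; raises if dvc exits with an error."""
    subprocess.run(dvc_get_command(repo_url, path, out, rev), check=True)


# Example (not executed here; requires dvc and access to the repository):
# fetch("<git-repo-url>", "data/da.txt", "data/", rev="v1.0")
```

Pinning `rev` means a rerun months later fetches the same data, even if the repository's default branch has since moved on.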
Referred: https://www.geeksforgeeks.org