Data with GitLab Duo

A guide to generative AI for data teams

Generative AI for Data Teams: GitLab Duo & Snowflake CoPilot Guide

This guide introduces GitLab Duo and Snowflake CoPilot, AI-powered tools designed to optimize workflows for data teams. By leveraging natural language processing, these tools enable efficient script generation, automated data insights, and streamlined data transformations. This document provides an overview of the tools, setup instructions, quality guardrails, and best practices to help maximize their potential while ensuring governance and security.

Unlocking the Power of GitLab Duo with Data

“AI can significantly boost productivity for individuals and teams. By automating repetitive tasks, creatives can focus on more critical aspects of their work.”

GitLab Duo offers transformative capabilities for developers working with data pipelines, analytics, and ETL tasks. Whether you’re looking to streamline your development processes or improve existing code, Duo provides intuitive solutions that make working with data simpler, faster, and more effective. Here are a few key ways Duo can enhance your data workflows:

  • Automated Documentation: Duo can generate dbt (data build tool) documentation for data pipelines, reducing the manual work involved and ensuring comprehensive records of each step in your data transformations.
  • Code Optimization: By reviewing your existing code, Duo identifies opportunities to refactor and streamline, improving efficiency and readability.
  • Enhanced Query & Transformation Suggestions: Duo can create complex SQL queries or transformations based on your requirements, optimizing your ETL processes and saving development time.
  • Customizable Code Generation: With natural language prompts, you can generate code snippets for data manipulation, CI/CD pipelines, or even generate SQL to help with database migrations or transformations.

These features make Duo a valuable companion for both beginner and advanced data developers, helping ensure high-quality results while enhancing productivity.

Key Features and Functionalities

GitLab Duo

  • Automated Code & Pipeline Generation: Quickly generate scripts, CI/CD pipelines, and tests using natural language prompts.
  • AI-Powered Code Reviews: Duo assists in reviewing code for adherence to best practices and standards, flagging issues and offering improvements.
  • Enhanced Documentation: Duo can auto-generate documentation, improving clarity and continuity across projects.

Learn more about GitLab Duo

Snowflake CoPilot

  • SQL Query Generation: Generate complex SQL queries and transformations by simply describing your data requirements.
  • Data Summarization: Get AI-generated summaries and insights from large datasets, turning raw data into actionable knowledge.
  • ETL & Transformation Guidance: CoPilot provides suggestions for data cleaning and transformation tasks to streamline ETL processes.

Learn more about Snowflake CoPilot

Drawbacks and Considerations

“Artificial intelligence is not a substitute for human intelligence; it is a tool to amplify human creativity and ingenuity.” – Fei-Fei Li, Co-Director of the Stanford Institute for Human-Centered Artificial Intelligence.

While GitLab Duo and Snowflake CoPilot offer powerful advantages, they also come with some risks and drawbacks. It’s essential to understand these limitations to mitigate potential issues:

  • Erroneous Data Creation: AI-generated code or queries may produce unexpected or inaccurate results, leading to data integrity issues.
  • Non-Repeatable Work: The same prompt may yield different outputs on different runs, which makes AI-generated solutions hard to reproduce and complicates version control and debugging.
  • Unknown Results: AI-generated solutions may include steps or transformations that are difficult to verify, increasing the risk of erroneous conclusions.
  • Extra Reviewer and Approver Workload: Due to the AI’s potential to produce incorrect results, there’s an increased need for thorough review and validation by team members, adding time and effort to workflows.
  • Concealed Steps: AI may automate steps that should be explicitly documented and understood, leading to a lack of transparency in critical parts of data processes.
  • Contextual Misinterpretations: Generative AI tools, while sophisticated, may misinterpret specific business or technical contexts. AI-driven code generation without an in-depth understanding of project specifics can produce suboptimal results or incorrect assumptions, particularly in nuanced areas like business logic or security requirements.
  • Performance Impact: The code generated by AI might not be optimized for performance, leading to slower query times or excessive resource consumption. Poorly performing code can impact data processing efficiency and may require manual intervention to optimize.
  • Dependency Risk: Relying on AI for critical functions can lead users to trust AI outputs without understanding the underlying logic. This can create knowledge gaps and skill degradation, as team members rely more on AI-generated solutions than on their own coding or problem-solving skills.

While GitLab Duo and Snowflake CoPilot can streamline workflows, these risks underscore the importance of diligent review and documentation. Regular audits and peer reviews should be conducted to catch and correct potential errors before production deployment.
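As one illustration of such a review step, the sketch below checks whether a query returns the same rows on repeated runs, and whether two differently worded versions of a query (for example, two generations from the same prompt) agree. It is a minimal sketch, not an official GitLab or Snowflake utility: the helper names `result_fingerprint`, `is_repeatable`, and `same_results` are hypothetical, and an in-memory SQLite database stands in for a non-production warehouse.

```python
import hashlib
import sqlite3


def result_fingerprint(conn, sql):
    """Return an order-independent SHA-256 fingerprint of a query's result rows."""
    rows = conn.execute(sql).fetchall()
    # Sort the textual form of each row so that row order does not affect the hash.
    canonical = "\n".join(sorted(repr(row) for row in rows))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_repeatable(conn, sql, runs=3):
    """True if the query yields the same set of rows on every run."""
    fingerprints = {result_fingerprint(conn, sql) for _ in range(runs)}
    return len(fingerprints) == 1


def same_results(conn, sql_a, sql_b):
    """True if two queries return the same rows, ignoring row order."""
    return result_fingerprint(conn, sql_a) == result_fingerprint(conn, sql_b)


# Hypothetical sample data; SQLite is only a stand-in for a test environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5), (3, 4.5)]
)

stable_sql = "SELECT id, amount FROM orders WHERE amount > 5"
reordered_sql = "SELECT id, amount FROM orders WHERE amount > 5 ORDER BY id DESC"
unstable_sql = "SELECT id, random() AS r FROM orders"  # non-deterministic column

print(is_repeatable(conn, stable_sql))
print(same_results(conn, stable_sql, reordered_sql))
print(is_repeatable(conn, unstable_sql))
```

A check like this only compares result sets. It does not show that either result is correct, so it complements peer review rather than replacing it.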

Guardrails for Quality, Governance, and Compliance

Approved Use with Internal Data for Duo and Snowflake Copilot

GitLab Duo and Snowflake CoPilot have been approved for use with GitLab’s internal data. These tools have undergone a review process to meet our security, privacy, and compliance standards. Users should rely solely on GitLab Duo and Snowflake CoPilot for generative AI applications within our data workflows.

Important: Do not use other third-party AI tools to process or analyze GitLab proprietary or sensitive data. Only GitLab Duo and Snowflake CoPilot are sanctioned for these tasks. These tools ensure that our internal data is processed securely and in compliance with GitLab’s data governance policies.

Quality Assurance

  • Code & Query Review: Conduct peer reviews of AI-generated code and queries to ensure accuracy and adherence to coding standards.
  • Automated Testing: Set up automated tests within GitLab CI/CD to validate all scripts and transformations.
  • Data Validation: Test AI-generated queries in a non-production environment and compare outputs to known benchmarks before deployment.
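The data validation step above could look roughly like the following sketch, which runs a candidate query against a small test database and compares its rows with a hand-verified benchmark. The function `validate_against_benchmark`, the `sales` table, and the benchmark values are all hypothetical, and SQLite stands in for whatever non-production environment the team uses.

```python
import sqlite3
from collections import Counter


def validate_against_benchmark(conn, sql, expected_rows):
    """Run sql and return a list of mismatches; an empty list means it passed."""
    actual = Counter(conn.execute(sql).fetchall())
    expected = Counter(expected_rows)
    # Counter subtraction keeps only positive counts, so duplicate rows are respected.
    missing = expected - actual
    unexpected = actual - expected
    problems = [f"missing row: {row!r}" for row in missing.elements()]
    problems += [f"unexpected row: {row!r}" for row in unexpected.elements()]
    return problems


# Hypothetical test fixture in a non-production (in-memory) database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 10.0), ("EMEA", 20.0), ("AMER", 12.5)],
)

# Benchmark totals verified by hand for the fixture above.
benchmark = [("EMEA", 30.0), ("AMER", 12.5)]

good_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
# A plausible AI-generated mistake: an extra filter that silently drops rows.
bad_sql = "SELECT region, SUM(amount) FROM sales WHERE amount > 10 GROUP BY region"

print(validate_against_benchmark(conn, good_sql, benchmark))
print(validate_against_benchmark(conn, bad_sql, benchmark))
```

A function like this can be called from an automated test so that a pipeline job fails when the returned list is not empty.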

Governance and Compliance

  • Role-Based Access Control: Limit tool access to necessary team members based on project roles and data sensitivity.
  • Data Privacy Checks: Use Snowflake’s data masking and GitLab’s access controls to anonymize and secure sensitive data fields.
  • Comprehensive Documentation: Require detailed documentation for any production-ready code or queries generated by Duo and CoPilot.

Security and Risk Management

  • Audit Trails: Enable audit logging for all actions performed in GitLab Duo and Snowflake CoPilot to maintain traceability and accountability.
  • Sensitive Data Redaction: Configure Duo and CoPilot to prevent access to sensitive information, such as PII.
  • Regular Model Reviews: Schedule periodic evaluations of AI-generated outputs to ensure alignment with data team standards.
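As a complement to tool-level configuration, a team might also scan text for obvious PII before pasting it into a prompt. The sketch below is a hypothetical helper, not a feature of Duo or CoPilot, and it does not configure either tool. The two patterns (email addresses and US-style SSNs) are illustrative assumptions and are nowhere near a complete PII detector.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace pattern matches with placeholders.

    Returns the redacted text and a dict of how many matches each pattern had.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED_{label}]", text)
        counts[label] = n
    return text, counts


sample = "Contact jane.doe@example.com, SSN 123-45-6789, order 42."
redacted, counts = redact(sample)
print(redacted)
print(counts)
```

Non-zero counts can be treated as a signal to stop and review the text rather than as proof that the redacted version is safe to share.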

Getting Started

Prerequisites & Setup

  1. Access Permissions: GitLab Duo is available to Premium or Ultimate users with GitLab Duo Pro, and to Enterprise users. It is also available to internal team members.
  2. Environment Configuration: Follow the GitLab Duo Chat documentation to use Duo with tools like Web IDE, VS Code, DataGrip, and more.

Best Practices for Managing AI-Generated Content

  1. Effective Prompting: Develop a list of consistent prompts to ensure accurate and uniform outputs across projects.
  2. Change Management & Version Control: Use GitLab’s version control to track all AI-generated code revisions, maintaining a clear record of changes and reasons.
  3. Regular Audits and Cleanup: Schedule routine reviews to remove outdated AI-generated scripts, reducing clutter and mitigating security risks.
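One way to keep prompts consistent across projects is to store them as named templates under version control. The sketch below uses Python's standard `string.Template`. The template names, wording, and field names are hypothetical examples rather than prompts recommended by GitLab or Snowflake.

```python
from string import Template

# Hypothetical shared prompt library; wording and field names are examples only.
PROMPTS = {
    "dbt_docs": Template(
        "Write dbt documentation for the model $model. "
        "Describe each of these columns: $columns."
    ),
    "sql_review": Template(
        "Review the following $dialect query for correctness and performance:\n$sql"
    ),
}


def render_prompt(name, **fields):
    """Fill in a named template; raises KeyError if the name or a field is missing."""
    return PROMPTS[name].substitute(**fields)


prompt = render_prompt(
    "dbt_docs", model="fct_orders", columns="order_id, ordered_at, amount"
)
print(prompt)
```

Because `substitute` raises `KeyError` for a missing field, an incomplete prompt fails loudly instead of being sent with a blank in it.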

Example Workflows and Use Cases

  • Automated Pipeline Creation: Generate a CI/CD pipeline with GitLab Duo from clear natural language prompts, then deploy it.
  • Data Insights Summarization: Use Snowflake CoPilot to analyze and summarize large data sets, helping teams derive actionable insights.
  • End-to-End Data Project: Combine Duo and CoPilot to create, validate, and deploy a full data transformation and reporting pipeline, optimizing both speed and accuracy.

For a more comprehensive list of ideas, visit our Duo Data Inspiration Hub.