What should the Position Data look like?
1. Define Objectives and Requirements
- Business Objectives: Clarify the purpose of the data pipeline. What business problems is it solving? What are the expected outcomes?
- The purpose of the data pipeline is to build a knowledge base by extracting relevant information from BJJ techniques: actions, positions, counters, and conditions ⇒ FRAMEWORK
- The population of the position tuples: (Family, State, Upper configuration, Lower configuration); see the data-model sketch at the end of this section
- 7 families, each with numerous states; upper- and lower-body configurations are state-dependent
- Common vulnerabilities
- Transitions catalogued as Actions mapping one position tuple to another
- Nature of a transition: opponent-initiated action to a less dominant position, progress towards dominance, etc.
- Common counters
- Data Requirements: Identify the types of data needed, data sources, data volume, and data velocity.
- Stakeholders: Identify who will use the data and their specific needs.
- Developers building the UI and database.
- Data needs to be representable as a graph layout, not clustered.
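A minimal sketch of the data model above in Python. The field names (`family`, `state`, `upper_cfg`, `lower_cfg`) and the `Transition` record are assumptions derived from the tuple description, not a settled schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One node in the position graph: (Family, State, Upper, Lower configuration)."""
    family: str     # one of the 7 families (e.g. "guard"; example value, assumed)
    state: str      # family-dependent state (e.g. "closed"; assumed)
    upper_cfg: str  # state-dependent upper-body configuration
    lower_cfg: str  # state-dependent lower-body configuration
    vulnerabilities: tuple[str, ...] = ()  # common vulnerabilities of this position


@dataclass(frozen=True)
class Transition:
    """One directed edge: an Action mapping one position tuple to another."""
    action: str   # e.g. "scissor sweep" (example value, assumed)
    src: Position
    dst: Position
    nature: str   # e.g. "opponent-initiated", "progress towards dominance"
    counters: tuple[str, ...] = ()  # common counters to this action
```

Because every `Transition` is a (src, dst) pair, the whole catalogue loads directly into a directed graph (e.g. `networkx.DiGraph`), which gives the non-clustered graph layout the developers need.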
2. Data Ingestion
- Source Identification: Define the sources of data (databases, APIs, flat files, third-party services).
- YouTube transcript text for now, pulled via the YouTube API; video data will be mapped onto the respective elements of the framework above (see the ingestion sketch at the end of this section)
- Reddit API for bjj subreddit
- Ingestion Methods: Choose between batch, stream, or real-time ingestion based on the nature of the data and business requirements.
- Batch. No need for stream or real-time ingestion for scraping.
- Data Validation: Establish standards for data quality checks at the ingestion point.
- Manual for now; can't think of an automated check yet.
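A batch-ingestion sketch, assuming the third-party `youtube-transcript-api` and `praw` packages (both assumed choices, not settled; credentials and video IDs are placeholders, and the transcript call matches the pre-1.0 `youtube-transcript-api` interface):

```python
# pip install youtube-transcript-api praw
from youtube_transcript_api import YouTubeTranscriptApi
import praw


def fetch_transcript(video_id: str) -> str:
    """Fetch one video's transcript and flatten it to plain text."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)  # [{"text", "start", "duration"}, ...]
    return " ".join(seg["text"] for seg in segments)


def fetch_bjj_posts(limit: int = 100) -> list[str]:
    """Batch-fetch recent post bodies from r/bjj."""
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="bjj-kb-scraper/0.1",     # placeholder
    )
    return [post.selftext for post in reddit.subreddit("bjj").new(limit=limit)]
```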
3. Data Transformation
- ETL/ELT Processes: Define the steps for Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT).
- Data Cleaning: Specify how to handle missing values, duplicates, and inconsistencies (see the sketch after this list).
- Data Enrichment: Outline how to enhance the data with additional information (e.g., merging with other datasets).
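A sketch of the cleaning step for raw transcripts; the rules (caption-noise removal, whitespace collapse, dedup by an assumed `id` field) are illustrative, not a fixed spec:

```python
import re


def clean_transcript(text: str) -> str:
    """Normalize a raw transcript before mapping it onto the framework."""
    text = re.sub(r"\[(?:music|applause)\]", " ", text, flags=re.IGNORECASE)  # strip caption noise
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def dedupe(records: list[dict]) -> list[dict]:
    """Drop duplicate documents by source id, keeping the first occurrence."""
    seen: set[str] = set()
    kept = []
    for rec in records:
        if rec["id"] not in seen:  # "id" is an assumed schema field
            seen.add(rec["id"])
            kept.append(rec)
    return kept
```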
4. Data Storage
- Storage Solutions: Determine whether to use data lakes, data warehouses, or hybrid solutions.
- Scalability: Ensure the storage solution can scale with data growth.
- Accessibility: Define how users will access the stored data (e.g., through SQL queries, APIs).
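One way to keep the data SQL-queryable while still graph-loadable: a relational schema whose tables mirror the framework (positions as nodes, transitions as directed edges). Table and column names are assumptions; shown with SQLite for simplicity:

```python
import sqlite3

# Assumed schema mirroring the (Family, State, Upper, Lower) position tuple.
SCHEMA = """
CREATE TABLE IF NOT EXISTS position (
    id        INTEGER PRIMARY KEY,
    family    TEXT NOT NULL,
    state     TEXT NOT NULL,
    upper_cfg TEXT NOT NULL,
    lower_cfg TEXT NOT NULL,
    UNIQUE (family, state, upper_cfg, lower_cfg)
);
CREATE TABLE IF NOT EXISTS transition (
    id     INTEGER PRIMARY KEY,
    action TEXT NOT NULL,
    src_id INTEGER NOT NULL REFERENCES position(id),
    dst_id INTEGER NOT NULL REFERENCES position(id),
    nature TEXT
);
"""

conn = sqlite3.connect("bjj_kb.db")  # file name is a placeholder
conn.executescript(SCHEMA)
conn.commit()
```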
5. Data Processing and Analysis
- Processing Frameworks: Choose the appropriate tools and frameworks (e.g., Apache Spark, Hadoop) for processing large datasets.
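Spark is probably overkill for transcript-scale text, but as a sketch of the option: a minimal PySpark job that token-counts transcripts stored as JSON lines (file path and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bjj-kb").getOrCreate()

# Assumed input: one JSON record per line with "id" and "text" fields.
df = spark.read.json("transcripts.jsonl")
token_counts = (
    df.select(F.explode(F.split(F.lower(F.col("text")), r"\s+")).alias("token"))
      .groupBy("token")
      .count()
      .orderBy(F.desc("count"))
)
token_counts.show(20)
```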