What should the Position Data look like?
1. Define Objectives and Requirements
- Business Objectives: Clarify the purpose of the data pipeline. What business problems is it solving? What are the expected outcomes?
- The purpose of the data pipeline is to build a knowledge base by extracting relevant information from BJJ techniques: actions, positions, counters, and conditions ⇒ FRAMEWORK
- The population of the position tuples: (Family, State, Upper configuration, Lower configuration); see the data-model sketch at the end of this section
- 7 families, each with numerous states; upper- and lower-body configurations are state-dependent
- Common vulnerabilities
- Transitions catalogued as Actions mapping one position tuple to another
- Nature of a transition: opponent-initiated action to a less dominant position, progress towards dominance, etc.
- Common counters
- Data Requirements: Identify the types of data needed, data sources, data volume, and data velocity.
- Stakeholders: Identify who will use the data and their specific needs.
- Developers building the UI and database.
- Data needs to be representable as a graph layout, not clustered.
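A minimal sketch of the data model above in Python. The field names (`family`, `state`, `upper_cfg`, `lower_cfg`) and the `Transition` record are assumptions derived from the tuple description, not a settled schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """One node in the position graph: (Family, State, Upper, Lower configuration)."""
    family: str     # one of the 7 families (e.g. "guard"; example value, assumed)
    state: str      # family-dependent state (e.g. "closed"; assumed)
    upper_cfg: str  # state-dependent upper-body configuration
    lower_cfg: str  # state-dependent lower-body configuration
    vulnerabilities: tuple[str, ...] = ()  # common vulnerabilities of this position


@dataclass(frozen=True)
class Transition:
    """One directed edge: an Action mapping one position tuple to another."""
    action: str   # e.g. "scissor sweep" (example value, assumed)
    src: Position
    dst: Position
    nature: str   # e.g. "opponent-initiated", "progress towards dominance"
    counters: tuple[str, ...] = ()  # common counters to this action
```

Because every `Transition` is a (src, dst) pair, the whole catalogue loads directly into a directed graph (e.g. `networkx.DiGraph`), which gives the non-clustered graph layout the developers need.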
2. Data Ingestion
- Source Identification: Define the sources of data (databases, APIs, flat files, third-party services).
- YouTube transcript text for now, pulled via the YouTube API; video data will be mapped onto the respective elements of the framework above (see the ingestion sketch at the end of this section)
- Reddit API for bjj subreddit
- Ingestion Methods: Choose between batch, stream, or real-time ingestion based on the nature of the data and business requirements.
- Batch. No need for stream or real-time ingestion for scraping.
- Data Validation: Establish standards for data quality checks at the ingestion point.
- Manual for now; can't think of an automated check yet.
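A batch-ingestion sketch, assuming the third-party `youtube-transcript-api` and `praw` packages (both assumed choices, not settled; credentials and video IDs are placeholders, and the transcript call matches the pre-1.0 `youtube-transcript-api` interface):

```python
# pip install youtube-transcript-api praw
from youtube_transcript_api import YouTubeTranscriptApi
import praw


def fetch_transcript(video_id: str) -> str:
    """Fetch one video's transcript and flatten it to plain text."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)  # [{"text", "start", "duration"}, ...]
    return " ".join(seg["text"] for seg in segments)


def fetch_bjj_posts(limit: int = 100) -> list[str]:
    """Batch-fetch recent post bodies from r/bjj."""
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="bjj-kb-scraper/0.1",     # placeholder
    )
    return [post.selftext for post in reddit.subreddit("bjj").new(limit=limit)]
```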
3. Data Transformation
- ETL/ELT Processes: Define the steps for Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT).
- Data Cleaning: Specify how to handle missing values, duplicates, and inconsistencies (see the sketch after this list).
- Data Enrichment: Outline how to enhance the data with additional information (e.g., merging with other datasets).
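A sketch of the cleaning step for raw transcripts; the rules (caption-noise removal, whitespace collapse, dedup by an assumed `id` field) are illustrative, not a fixed spec:

```python
import re


def clean_transcript(text: str) -> str:
    """Normalize a raw transcript before mapping it onto the framework."""
    text = re.sub(r"\[(?:music|applause)\]", " ", text, flags=re.IGNORECASE)  # strip caption noise
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def dedupe(records: list[dict]) -> list[dict]:
    """Drop duplicate documents by source id, keeping the first occurrence."""
    seen: set[str] = set()
    kept = []
    for rec in records:
        if rec["id"] not in seen:  # "id" is an assumed schema field
            seen.add(rec["id"])
            kept.append(rec)
    return kept
```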
4. Data Storage
- Storage Solutions: Determine whether to use data lakes, data warehouses, or hybrid solutions.
- Scalability: Ensure the storage solution can scale with data growth.
- Accessibility: Define how users will access the stored data (e.g., through SQL queries, APIs).
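One way to keep the data SQL-queryable while still graph-loadable: a relational schema whose tables mirror the framework (positions as nodes, transitions as directed edges). Table and column names are assumptions; shown with SQLite for simplicity:

```python
import sqlite3

# Assumed schema mirroring the (Family, State, Upper, Lower) position tuple.
SCHEMA = """
CREATE TABLE IF NOT EXISTS position (
    id        INTEGER PRIMARY KEY,
    family    TEXT NOT NULL,
    state     TEXT NOT NULL,
    upper_cfg TEXT NOT NULL,
    lower_cfg TEXT NOT NULL,
    UNIQUE (family, state, upper_cfg, lower_cfg)
);
CREATE TABLE IF NOT EXISTS transition (
    id     INTEGER PRIMARY KEY,
    action TEXT NOT NULL,
    src_id INTEGER NOT NULL REFERENCES position(id),
    dst_id INTEGER NOT NULL REFERENCES position(id),
    nature TEXT
);
"""

conn = sqlite3.connect("bjj_kb.db")  # file name is a placeholder
conn.executescript(SCHEMA)
conn.commit()
```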
5. Data Processing and Analysis
- Processing Frameworks: Choose the appropriate tools and frameworks (e.g., Apache Spark, Hadoop) for processing large datasets.
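Spark is probably overkill for transcript-scale text, but as a sketch of the option: a minimal PySpark job that token-counts transcripts stored as JSON lines (file path and column names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bjj-kb").getOrCreate()

# Assumed input: one JSON record per line with "id" and "text" fields.
df = spark.read.json("transcripts.jsonl")
token_counts = (
    df.select(F.explode(F.split(F.lower(F.col("text")), r"\s+")).alias("token"))
      .groupBy("token")
      .count()
      .orderBy(F.desc("count"))
)
token_counts.show(20)
```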