Dear Tom, I would like to understand infrastructure sizing for a vector database in a large enterprise for a RAG solution (that is, how to size the infrastructure required for a vector DB).
E.g.:
Document Repository Size - 2-3 PB or more, at one of the biggest Indian oil companies, in the Exploration and Drilling Department.
Document Types -
1. Technical reports (exploration, drilling, production data)
2. Contracts, MoUs, compliance docs
3. Engineering drawings, geospatial surveys (seismic, GIS data)
4. Annual reports, DPRs, feasibility studies
5. Internal manuals, policies, operational guidelines
6. Emails, meeting notes, project documentation
Kindly guide.
Regards
Ashutosh Pathak
ashutosh@slashcurate.com
You can do a back-of-the-envelope estimate as follows:
- Estimate the number of documents you have, say, 1M.
- The size of each vector embedding is given by the embedding model you use. For example, 768 dimensions.
- Each dimension is usually 4 bytes (a 32-bit float).
=> Size needed is ~3 GB (1M x 768 x 4 bytes, assuming one embedding per document). Add some overhead for indexes and metadata, say 5 GB in total.
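The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration, not a sizing tool: the function name, the `chunks_per_document` parameter, and the 1.6x overhead factor are assumptions made for this example (the factor is picked only so the result lands near the 5 GB figure above). If documents are split into chunks before embedding, the vector count is the number of chunks, not the number of documents.

```python
def estimate_vector_storage_bytes(
    num_documents: int,
    dimensions: int,
    bytes_per_dimension: int = 4,   # float32, as in the estimate above
    chunks_per_document: int = 1,   # hypothetical: one embedding per document by default
) -> int:
    """Raw size of the embeddings only, without index or metadata overhead."""
    num_vectors = num_documents * chunks_per_document
    return num_vectors * dimensions * bytes_per_dimension


# Numbers from the estimate above: 1M documents, 768 dimensions, 4 bytes each.
raw_bytes = estimate_vector_storage_bytes(1_000_000, 768)

# Assumed overhead factor for indexes and metadata; real overhead depends on
# the database and index type.
OVERHEAD_FACTOR = 1.6
total_bytes = raw_bytes * OVERHEAD_FACTOR

print(f"raw vectors:   {raw_bytes / 1e9:.2f} GB")
print(f"with overhead: {total_bytes / 1e9:.2f} GB")
```

With these inputs the raw figure is about 3.07 GB. Setting `chunks_per_document=10`, for instance, multiplies it by ten, which is why the chunking strategy matters as much as the document count.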
https://substack.com/@althreecs
What happens if any of the documents have more than three words?