The Document sObject has an IsBodySearchable flag that controls whether its body can be full-text searched via SOSL.
ContentVersion has no such flag but I know that it is searchable in certain cases. However, documentation around this is frustratingly hard to find.
Does anyone know what the limitations are, and when ContentVersion records are or are not searchable?
Bonus question: assuming we have several hundred thousand documents in SF (approximately 2 TB), can global search and/or SOSL still work without timing out?
The only global limit I see anywhere is the following:
Searching document content supports multiple file types and has file size limits. The contents of documents that exceed the maximum sizes are not searched; however, the document fields are still searched. Only the first 1,000,000 characters of text are searched. Text beyond this limit is not included in the search.
I am unclear on whether this is a per-doc or global limit and what it means when text is not included in the search.
Does that mean that text is not part of the index? Or is it just that any given search will go through up to a million characters of text to try and find a match and time out if no match is found?
You are hitting one of the main limitations of Salesforce: its search capabilities. I don’t think there is an easy way to search versions of content. On top of that, search over a large dataset is likely to time out, and the number of results is capped at 2,000, if I remember correctly. There are also other major limitations on what you can achieve with Apex and SOSL (e.g. you cannot access results beyond the first page).
The best way to get around this issue is to externalise the search.
I believe there are some apps on the AppExchange that do that, like KonaSearch for example.
If your needs are more specific, you could technically implement an external app on Heroku that fetches the data via the APIs and stores it in a search index such as WebSolr or Elasticsearch. I’ve done it using WebSolr in the past and it worked very well, but I am afraid it’s not a simple thing.
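To give a rough idea of the moving parts, here is a minimal Python sketch of the fetch-then-index pattern. It is not the WebSolr setup mentioned above: the in-memory inverted index only stands in for a real search engine, and `fetch_documents`, the record ids and the sample text are all made up for illustration. A real version would pull the file text out of Salesforce through its APIs instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fetch_documents():
    """Hypothetical stand-in for fetching file text from Salesforce via its APIs."""
    return [
        {"id": "doc-1", "title": "Master services agreement",
         "body": "Payment terms are net thirty days."},
        {"id": "doc-2", "title": "Onboarding guide",
         "body": "New hires receive a laptop on day one."},
        {"id": "doc-3", "title": "Invoice policy",
         "body": "Invoices follow the payment terms in the agreement."},
    ]


def build_index(documents):
    """Map each word to the set of document ids whose title or body contains it."""
    index = defaultdict(set)
    for doc in documents:
        for token in tokenize(doc["title"] + " " + doc["body"]):
            index[token].add(doc["id"])
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


index = build_index(fetch_documents())
print(search(index, "payment terms"))
```

Because the index lives outside Salesforce, query time depends on the external engine rather than on SOSL, which is the point of externalising the search. The hard part in practice is keeping that index in sync with the org.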
I hope it helps.