More evidence that I didn't pay enough attention in stats class. Also, I'm not at liberty to say what the actual data is, which is why I'm making up silly examples about the content of these tables using food.
Also, I'm trying to find a good textbook on this subject but all I'm finding are generic stats books. What sort of keywords should I be searching for, or which authors/researchers should I be reading?
1) Someone gives me a correlation table of a population's favorite recipes that is sparse, non-diagonal, and has all the negatives filtered out. They claim it is non-diagonal because they filtered out the positive correlations that made the matrix non-sparse, and that it has no negatives because negatives are useless for our purposes (which is true; we don't care about negatives).
Q1A) Other than being non-diagonal, are there any other properties I should assume this table has or does not have? Does all the filtering make it no longer a "real" correlation table?
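To make Q1A concrete: one property I do remember is that a genuine correlation matrix has to be positive semidefinite, and I suspect filtering can break that. Here is a minimal sketch of the check for a made-up 3x3 case. The three correlation values and the function names are hypothetical, not from the real table.

```python
def det3_unit_diagonal(a, b, c):
    """Determinant of [[1, a, b], [a, 1, c], [b, c, 1]]."""
    return 1 + 2 * a * b * c - a * a - b * b - c * c

def is_psd_3x3(a, b, c):
    """A symmetric 3x3 matrix with unit diagonal is positive semidefinite
    iff every principal minor is non-negative: the 1x1 minors are all 1,
    the 2x2 minors are 1 - r^2, and the 3x3 minor is the determinant."""
    two_by_two_ok = all(1 - r * r >= 0 for r in (a, b, c))
    return two_by_two_ok and det3_unit_diagonal(a, b, c) >= 0

# Made-up correlations: cake~coffee, cake~milk, coffee~milk.
full = (0.9, 0.9, 0.7)
# Same table after the weakest entry has been filtered out (set to zero).
filtered = (0.9, 0.9, 0.0)

print(is_psd_3x3(*full))      # the unfiltered toy table passes the check
print(is_psd_3x3(*filtered))  # the filtered toy table fails it
```

In this toy case, dropping only the weakest entry is enough to make the table fail the check, which is part of why I'm asking whether the filtered table is still a "real" correlation table.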
Q1B) Other than performance / memory limitations, why would I prefer a sparse table over a dense table? Why not keep all the weak correlations then filter by correlation level as I use the table?
(e.g., give me only the top 10 things that go with cake, or give me everything, no matter how weak, that goes with coffee.)
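What I have in mind for filtering at query time looks roughly like the sketch below. It assumes the dense table fits in memory as a dict of dicts; the food names, values, and function names are all made up for illustration.

```python
import heapq

# Hypothetical dense table: table[a][b] is the correlation between a and b.
table = {
    "cake":   {"coffee": 0.62, "milk": 0.55, "bread": 0.08, "tea": 0.31},
    "coffee": {"cake": 0.62, "milk": 0.20, "bread": 0.05, "tea": 0.02},
}

def top_k(table, item, k):
    """Return the k strongest (other_item, correlation) pairs for `item`."""
    return heapq.nlargest(k, table[item].items(), key=lambda pair: pair[1])

def above(table, item, threshold=0.0):
    """Return every pair with correlation above `threshold`, strongest first."""
    pairs = [(other, r) for other, r in table[item].items() if r > threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(top_k(table, "cake", 2))   # strongest two matches for cake
print(above(table, "coffee"))    # everything for coffee, however weak
```

With this approach the cutoff is chosen per query rather than baked into the table, which is why I don't see what pre-filtering buys me beyond memory.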
Q1C) How valid is this table? Why filter out a correlation between a popular food, say, bread, and half the table? If everyone who likes bread also likes half of the table, why would we remove that and why doesn't that invalidate the table?
2) I use a machine learning tool to build a correlation table between types of food based on their typical ingredients. For example, sheet cake would correlate strongly with cupcakes (mostly the same ingredients) but only slightly with a BLT sandwich (due to the flour in the bread). This matrix is diagonal and has only positive correlations.
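I can't share the actual tool, but as a stand-in, here is a minimal sketch that assumes the score is cosine similarity between non-negative ingredient-weight vectors. That assumption, the ingredient weights, and the function name are all mine for illustration. It shows why every value ends up non-negative: no ingredient weight is below zero, so no dot product can be either.

```python
import math

# Hypothetical ingredient weights per food (arbitrary units).
ingredients = {
    "sheet cake": {"flour": 2.0, "sugar": 1.5, "butter": 1.0, "egg": 1.0},
    "cupcake":    {"flour": 1.0, "sugar": 1.0, "butter": 0.5, "egg": 0.5},
    "blt":        {"flour": 1.0, "bacon": 1.0, "lettuce": 0.5, "tomato": 0.5},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(name, 0.0) for name, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Build the full pairwise table.
foods = list(ingredients)
similarity = {
    a: {b: cosine(ingredients[a], ingredients[b]) for b in foods}
    for a in foods
}

for a in foods:
    print(a, {b: round(s, 2) for b, s in similarity[a].items()})
```

In this toy version, sheet cake and cupcake score close to 1, while sheet cake and the BLT share only flour and score much lower but still above zero.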
Q2A) Do I want or need negative correlations? If so, how would I create them? Would I normalize the table so that instead of being entirely positive it ranges from, say, -0.5 to 0.5? Or would I change my tool to generate negative correlations based on some distance function between the ingredient vectors?
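To illustrate the two options as I understand them, here is a sketch with made-up numbers. Option one subtracts a constant from every score; option two swaps in Pearson correlation on ingredient vectors over a shared vocabulary (an assumption for illustration, not what my tool currently does), which mean-centers each vector before comparing.

```python
import math

def shift(scores, offset=0.5):
    """Option 1: relabel scores from [0, 1] to [-0.5, 0.5] by subtraction."""
    return [s - offset for s in scores]

def pearson(x, y):
    """Option 2: Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical vocabulary: flour, sugar, butter, egg, bacon, lettuce, tomato.
sheet_cake = [2.0, 1.5, 1.0, 1.0, 0.0, 0.0, 0.0]
cupcake    = [1.0, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0]
blt        = [1.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.5]

print(shift([0.99, 0.44]))      # same ranking, just moved down by 0.5
print(pearson(sheet_cake, cupcake))
print(pearson(sheet_cake, blt)) # comes out negative for these toy vectors
```

Shifting only relabels the scores and leaves the ranking untouched, whereas the centered version produces a negative value for cake versus BLT because they mostly use different ingredients. My question is which of these, if either, is the meaningful kind of "negative".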
Q2B) Similar to Q1C, is there a reason to go through and filter out weak correlations, or to normalize the correlations to the range 0.0 to 1.0? I've got RAM and CPU to burn; is there a reason not to keep all the data?