-
Notifications
You must be signed in to change notification settings - Fork 3
/
DevelopLog
367 lines (289 loc) · 15 KB
/
DevelopLog
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
README
----------------------
1. All data should go into Experiments folder, each exp has its separate folder
2. The input database file is named "inputDB"
3. Program will produce a file called "predicateDB"
4. Frequent itemset mining algorithm will produce frequentPredicates
5. All tunable parameter goes into Config.java
6. The output is a set of minimal DCs
******************************************
IMPORTANT:
1. Need a way to assess the quality of derived DC
Maybe this is a good point where we need space pruning algorithm.
2. Need an artificial dataset to carefully examine the behavior of the algorithm
3. Threshold turning is another cirtical aspect
*****************************************
----------------------
Oct 25: Build a working prototype!
1. Build all the infrastructure
2. Understand how apriori works, and how to plug in our itemsets, later can try out cfpgrowth
3 In PredicateDB.java, now predicates can only do "=" and "!=", and attributes can only compare with same columns,
and we forbid comparison of attributes within one tuple
---------------------
Oct 26: Build a naive algorithm for deriving DCs from frequent predicate set
1. Make clear of what does frequent mean exactly?
Need examples to make it clear
2. The derivation algorithm may be integrated with frequent itemset mining algorithm,
so study up Apriori and FP Growth algorithm
---------------------
Oct 28: A Naive way of deriving DCs
1. for not(P1,...Pn) to be a valid DC,
all P1...Pi-1, Pi+1,...Pn should be frequent
P1,... Pn should not be frequent
Actually this is integrated with Apriori algorithm
2. Elimination of trivial DCs
e.g. not(P1,!P1) is a trivial DC
3. Elimination of not meaningful DCs
---------------------
Oct 29: Find Problems through exp
1. Even for Hospital dataset with 19 attributes, Apriori algorithm is very slow
Try database with fewer number of attributes, and see if it can give reasonable result
2. Try out FP-Growth algorithm, and it is much much faster than apriori.
So see if DC deriving algorithm can be integrated FPgrowth
(Not done yet, preliminary study shows can't, may need another step to drive that)
3. The difference between FPGrowth and CFPGrowth is CFP can define a minSupport for every item
---------------------
Oct 30: Find Problems through exp
1. I tried Hospital dataset, there seems to be a lot of DCs with IQ,
so i added another condition, which is there in a DC, the predicates can't all be IQ
Even so, we have a lot of redundancy
2. It seems like that we need a handcrafted dataset to test the resulting denial constraints
For DataGenerator1, it seems like we need to set miniSup to be very low for dcs to be found, or the other way to look at it is that the minisup
should be set w.r.t. the cardinality of a particular predicate.
e.g. minisup(EQ(t1.A,t2.A)) = |A| * percentage * n (n-1)
3. Start Implement space pruning algorithm
--------------------
Oct 31: Implement space pruning algorithm
1. Why don't we just examine the DB, and list all the violating DCs, and the remaining ones are the ones that are valid
Because the input DB can be dirty, and we want the for a dc to hold, we just want its coverage to be above a threshold
2. In Table.java, finish the method "testDC"
in CFD pattern tableaux discovery, they use "support, confidence and parsimony"
Support: many matches
confidence: few violations
3. In SpacePruning.java, enumerate all the space
4. Implement rules in the space pruning
(NOT YET DONE, don't know what's the best way to implement)
------------------
Nov 1: Read papers to seek more intuitions
1. Support: for CFD, # of tuples to match the antecedent
support(DC) = # of tuples that match all equality predicates/N
Confidence: many definitions, see CFD pattern tableaux discovery paper
the smallest # of tuples to be removed to eliminate all violations
Confidence(DC) = 1 - the smallest # of tuples to be removed to eliminate all violations / # of tuples that match all equality predicates
-----------------
Nov 2:
-----------------
Nov 3:
Question: how to measure the interestingness of a discovered DC
Reference: Miller and Divesh
1. in CFD case, there is pattern tuple, so we can count the the number of tuples that match the pattern tuple.
in DC, there is no pattern tuple, how can we do the match?
Possible solution: count the number of paired tuples who match all equality predicates in DC / N^2
how about confidence?
2. Keep in mind that the measure used for testing the validity of DC should support the correctness of three pruning rules
TO BE PROVEN
---------------
Nov4:
What is missing in space pruning method?
1. There seems to be two kinds of pruning criteria besides those axioms in the play
A: interestingness: a rule is interesting if it is statistically significant in some way and meaningful. But interestingness can only get you exact constraint
B: approximate: a rule is approximate if it does not hold on all data, meaning we allow certain percentage of violations of that rule
2. A smart way to evaluate a candidate DC against DB
In FD/CFD discovery, they are using partition method to quickly verify.
In DC, currently we do that in a most naive way, which is for every two tuple, verify if DC violates or not
Possible solution:
for each predicate, maintain a list of satisfying tuple pairs
For a DC, if the intersecting set of all predicates is all the violations of that particular DC
3. How to efficiently implement those axioms in space pruning, i.e. what data structure to use?
What is missing in constraints mining method?
1. Any other way to derive DC from frequency predicate besides the current one?
Need to evaluate the goodness of current deriving algorithm?
2. Get clear on what exactly does frequency mean?
It is predicate dependent, and it should be derived from the profiling of the predicate
3. A different data summary:
for each predicate, maintain the list, and see what can we do??
------------------
Nov 5:
Experiment with AB. A->B, but B not ->A
with Constraint mining, we get A->B
With space pruning, coverage is supposed to tolerate noise, and it will prune B-A, but still get A->B
, and interestingness should prune (TO BE DONE)
not(EQ(t1.B,t2.B),EQ(t1.A,t2.A) )
not(EQ(t1.A,t2.A) )
For both cases, we have axioms to get a minimal cover....
Directions:
1. quality of discovered DCs
frequency: A: find a function mapping from profiling to frequency
B: devise an algorithm to handle different frequency
deriving algorithm: current is most conservative
interestingness: used in constraint mining as a post process, used in space pruning when evaluate each candidate
2. efficiency:
evaluate a candidate DC against DB
axioms in space pruning
-------------------
Nov 6:
Take direction 1:
frequency: we have a naive frequency test
examine the output:
whether a dc is expected
interestingness
consistency within the output, this is a very interesting discussion, but exps show that this really doesn't help very much!
----------------
Nov 7:
Space pruning is extremely slow, find out why?
So the java profiling suggests that predicate.check() is taking a lot of time, and that getNumVios() is also taking a lot of time
so solution: use space to trade speed
1. no matter what, for each predicate, store the tuple pairs that it satisfies in a hashset
2. for each predicate, maintain a set of tuple pairs that have been checked, and update the set each time you encounter a new pair
This is going to help, of course!
Tyson frozen food database request is filed...
interestingness of a DC constraints can be measured by the dependencies amongst predicates?
So now the question is how to define the dependencies?
Keep in mind the generality of space pruning (i.e. how does it include FD/CFD discovery)
TODO List in descending order of importance
1. Adding axioms to space pruning ( see how much gain do we get in efficiency)
2. Define interestingness for DC
3. Get a clear idea of how FD did experiments by reading experiment sections of previous papers
specifically, what does it mean by scalability in terms of # of tuples, # of attributes
4. Efficiency in data mining algorithm, i.e. is Apriori is the best shot?
FP growth is better in finding frequent itemsets of course, but how about kFreK-1Not fre
5. Need a DC experiment where predicate is not within a single column
---------------
Nov 8:
Implement the space pruning method:
noiseTolerance level has an effect on the result of the pruning rules
I believe if noise tolerance = 0, then Rule 1,2,3 will not affect the result at all!
with the increase of the noise tolerance, the more the rules, you will miss some dcs. But theose missed ones are not necessary good ones,
however, the improvement in efficiency is substantial
We should have another contribution to the paper...
1. be better than other papers(FD/CFD)
2. Have a constraint verification phase, data examples? user involvement?
3. Combine two approach, use either pruning rule/mining algo to complement each other, a hybrid approach
4. Adding Constant...
I have a problem with space! We are taking up too much space, storing for each predicate, the tuple pairs that it satisfy in memory
Move to postgres!
1. loading original table and predicateDB table into database is extremely slow, so use bulkload, looks like this method has failed!
2. build index on all columns, this is very slow
I now have the big data problem!
In space pruning, if we don't store the evidence expclitly in DB, then every time we need to check a dc, we do the check. That is in essence,
space pruning is using time to trade for space
In data mining, we need enough memory to store the entire evidence space, which can be too big, and then minig.
there is going to be a problem for data mining, because the squared data is too big to fit into memory
-----------
Nov 9:
Think about the big data problem
Now I use CSV file to load, it is still slow + building index is very slow + every validation is very slow(SQL query)
Right now, i don't see any advantage of using Spacepruning1 > space pruning 2 (Need discussion with Paolo)
In Constraint Mining, if using Apriori algorithm, there is still a candidate generate and test process, what is the difference between
those two methods? If they are not different, we surly should consider a different algorithm, which is FP growth.
If using FPGrowth method, the compression is huge due to the FP tree. But the thing is now deriving algorithm is going to be harder, because
we don't have the advantage of integrating with mining algorithm...
1. Switch to FP Growth algorithm to reduce memory to store the evidence...(invent a deriving algorithm)
2. Stick to Apriori, but think of it as a big data problem, and use Map Reduce to solve the problem (study up)
Our data is different from market basket data
Think about the interestingness measure
Thank about a more interesting dataset, where predicate is across columns
-----
Nov 10:
Concern:
There are actually existing algorithms for rare itemset mining, so our second approach seems to be direct applications of a data mining algorithm.
Except we transform our database into a predicate database first. If we really want to make it stand out, then we need special techniques that make the mining
in our case fast by exploiting the specific properties of items in our setting which include:
1. # of items in our case is small, i.e. # of different predicates
2. But the transaction is very large, due to squared, this makes storing them in memory infeasible. Need compressions (FP tree is a way)
3. The # of items in each transaction is set to be #(P)/2, where #(P) is the number of predicates
4. We want k non-frequent k-1 frequent itemset
5. The items are not totally independent, this information can be exploited somehow.
For example, P and !P are mutually exclusive, #of evidence containing P + # of evidence containing !p = n(n-1)
6. Our transactions database consist of pairwise checking of the original database, is there anything we can do to take advantage of that?
-----
Nov 11:
Think about another contribution:
1. A hybrid discovery... seem to be difficult at the moment
2. The output of constraint discovery algorithm can have not only a set of denial constraints, but also for each dc,
a set of possible repairs, ranking by the probability of applicability.
For a DC: not(P1,...,Pn), the output could include # of violations, and among them # of violations for each predicate.
for P1,...Pn, which one is more likely to be wrong
Then, the MVC could become the weighted MVC.
Further, we can optimize the repairing algorithm
suppose not(X,Y) is a valid DC found.
The repairing rules have a very similar structure to that of association rules. X=>Y
Actually is the called the repairing rule mining
3. Interestingness of a DC
---
Nov 12,13:
Revisit the theoretic foundation of DCs
--
Nov 15:
Yet another constraint mining algorithm
---
Nov 18:
Ensure the correctness of the second mining algorithm
--
Nov 20:
Why the ordering matters so much? for UIS, if no dynamic ordering, it is taking forever
---
Nov 26:
Dealing with constants, constant predicates are provided by the users now!
How about predicates, such as EQ(t1.A,4), EQ(t2.B,2)
---
Nov 27:
Adding approximate DC output, where approximate means that we allow a certain amout of violations.
--
Nov 28:
Allow approximate DC output,
----
Dec 10:
javaplot in place, starting to run experiments.
Seems like opimizaing FASTDC is an urgent task.
Find reasons for all those scalability for different datasets.
---
Dec 23:
Established that FASTFD with DFS is the way to go
--
Dec 24:
Make Dynamic ordering work:
Dynamic ordering coupled with effective pruning technique.
--
Dec 25: TODO List
1. Post-Processing algorithm-revisited, implication testing is a linear algorithm
subset pruning and minimal cover are still hard to analyze
2. Adding constants (frequent predicate mining algorithm)
the algorithm is done, but code needs to be examied about the correctness.
Especially, the impilcation testing algorithm involving constants will need to change
3. Finding datasets (tax data generator)
----
Dec 26: Find a good running example
---
Dec 27:
Ensure, under a certain definition of minimal DCs, there is no subset pruning needed in the post-processing
For order predicate, it is slow, but we try to remove >=,<=,!= for numerical values.
--
Dec 28:
--
Dec 29:
Bug fix for linear implication testing algorithm. ----done
Make three system extensions work:
1. Adding constants
Elimination of trivial constant DCs, t1.A!=t2.A, t1.A=a,t2.A=a --- done
2. Approximate version
3. Cross columns
4. Rank them by interestingness! ----done
---
Dec 30:
Testing the correctness of the algorithm:
1. approximate version
2. adding constants
doing one pass of pair-wise comparisons, instead of several passes
3. cross columns
How should we decide if two columns are joinable
Think about the current way of dealing with order predicate, why does it work
---
Jan 2:
For each constant predicates, donnot re-do the pair wise comparison
---
Jan 3:
I want to try a differnt way of dealing with constant predicates,
--
Jan 4:
I have confirmed that wasted work