Deal with null entries in post-crawl data analysis #121
Yes, that works! We can deal with the null entries at the data analysis step. We do not need to create figures for the null failures, but we should be able to say that x% of a given crawl had all null values, just as we are able to say that y% of sites had a certain error. For both null values and any errors, we should not include these sites in any analysis statistics or figures we present, as those do not have meaning for the analysis. Do you have a thought on what causes the null entries? If we did a second attempt, would that lead to fewer null entries (i.e., applying the same approach as for errors)?
Got it! I'll be working on creating the code to do the data analysis for the percentage of null entries.
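A minimal sketch of what that calculation might look like in pandas; the file name and the `site_url`/`error` column names are assumptions, with the remaining columns treated as the crawled data:

```python
import pandas as pd

# Hypothetical file and column names -- the real Colab may differ.
crawl = pd.read_csv("crawl_june.csv")
data_cols = [c for c in crawl.columns if c not in ("site_url", "error")]

# Sites where every data column is null and no error was flagged.
all_null = crawl[data_cols].isna().all(axis=1) & crawl["error"].isna()
# Sites where the crawler recorded an explicit error (HumanCheckError, etc.).
errored = crawl["error"].notna()

print(f"All-null sites: {all_null.mean():.2%}")
print(f"Error sites:    {errored.mean():.2%}")

# Exclude both groups from any statistics or figures we present.
analysis_df = crawl[~(all_null | errored)]
```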
Based on my manual observation of the list of sites with null entries, I don't think there is one specific cause that generalizes across all of them. However, as discussed in the previous issue, we concluded that yelp.com was a VPN issue (potentially because we have been accessing the site from the same LA VPN IP address): when I did another crawl after the June crawl with the same VPN it still failed, but it succeeded when I changed to another LA VPN with a different IP address. While this is the case for yelp, we can't say for sure that the cause is the same for all the other null sites. Nevertheless, looking at the null entries that have been consistently present in the previous crawls, especially some of the big names like meta.com and apple.com, I suspect the cause is something along the same lines -- that is, they recognize or block the VPN IP address. It is curious, though, that they just give a blank page instead of an explicit 'Access Denied' page, as discussed in issue 51.

Since I have successfully collected the sites that gave empty entries in June (a slightly shorter list than the April list), I will also do another crawl on just this list of sites that gave null entries; last time I only tried this for yelp when troubleshooting what was causing the null entries. Since I tried yelp with a different LA VPN IP address to check whether my hypothesis was true, I will follow that same methodology for the re-crawl today.
Sounds good! We do not necessarily need to figure out the reason for the null entries for sure. But it is nice for the paper to be able to say that VPN IP address blocking is the reason for at least some of them. As we discussed, maybe you will find that a different LA VPN address for the next crawl results in fewer null entries. We may also update the crawl protocol slightly by doing a second crawl for all null entries (just like we do for other errors) with a different LA VPN.
I finished the re-crawl specifically for the sites that output null values for all entries and manually looked through the results. Some important observations:
- I double-checked the conclusion that null entries behave like errors by looking through our results from previous crawls for sites with specific identified errors like HumanCheckError: all of their entries also give null outputs; only the error column is filled with the identified error (see the sketch after this list).
- I think the cause of the null entries is an amalgamation of possible reasons. For example, yelp.com seems to be an explicit VPN issue because of the experiment I did a few weeks ago and reconfirmed with this crawl (it went from a blank page to being accessible after changing the IP address). There are other forms of such VPN errors; for instance, an "Access Denied" page could also be a result of the VPN IP address being blocked.
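A minimal sketch of that double-check, using the same hypothetical file name, `error` column, and data-column layout as the earlier sketch:

```python
import pandas as pd

# Hypothetical file and column names -- the real Colab may differ.
crawl = pd.read_csv("crawl_june.csv")
data_cols = [c for c in crawl.columns if c not in ("site_url", "error")]

# Rows the crawler flagged with an explicit error (e.g. HumanCheckError).
flagged = crawl[crawl["error"].notna()]

# Confirm that flagged rows carry nothing but nulls in their data columns,
# i.e. only the error column is filled.
all_data_null = flagged[data_cols].isna().all(axis=1)
print(f"{all_data_null.mean():.2%} of error-flagged sites have all-null data columns")
```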
Outcome: We wanted to explicitly identify the sites with only null entries. We noticed that these null entries are present in the previous crawls with a roughly similar number of sites, and that they behave similarly to errors. After doing a re-crawl of the sites with previously null entries, we found that the null entries can indicate a precursor to an error: in the re-crawl, our crawler identified and flagged some of these sites with "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", or "TimeoutError", which may have prevented us from accessing these sites' data and thus returned null entries. For some of the other sites, it could be a VPN error. For instance, after doing the re-crawl with a different VPN IP address, we managed to get data for the previously empty entries for yelp.com. At the same time, we also noticed sites that still blocked access, potentially because they recognized our VPN IP address.
Well said, @franciscawijaya! For the future we will:
(If any of these warrant more discussion, feel free to open a new issue. But it is also OK to address these points here if the answers are straightforward.)
Update: I have added the code in the Colab to calculate the percentage of sites with all null values and have also made sure the figures for the monthly data analysis do not use any of the null-value and/or error sites in their calculations. (For June, these null-value and/or error sites made up 8.47% of our crawl list of 11708 sites.)

Misc. notes: Examples of these two different natures of null values, for reference:
I also counted these two types of null-value/error sites in my percentage calculation accordingly and hence excluded them from the figures.

Next: I'll be working on updating the code for Crawl_Data_Over_Time (though this might take more time, as I'm still working on fully understanding this Colab file).
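A minimal sketch of that exclusion step, again with the hypothetical column names from the earlier sketches, reporting the combined share against the full crawl list:

```python
import pandas as pd

# Hypothetical file and column names -- the real Colab may differ.
CRAWL_LIST_SIZE = 11708
crawl = pd.read_csv("crawl_june.csv")
data_cols = [c for c in crawl.columns if c not in ("site_url", "error")]

# Type 1: all data columns null without any flagged error.
null_only = crawl[data_cols].isna().all(axis=1) & crawl["error"].isna()
# Type 2: an explicit error was flagged by the crawler.
error_flagged = crawl["error"].notna()

excluded = null_only | error_flagged
print(f"Excluded from figures: {excluded.sum() / CRAWL_LIST_SIZE:.2%} of the crawl list")

# Only the remaining sites feed the monthly figures.
figures_df = crawl[~excluded]
```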
As we discussed today, at this point this issue is purely one of data processing after the crawl. @franciscawijaya mentioned that Crawl_Data_Over_Time still needs to be updated. Both @franciscawijaya and @natelevinson10 will work on adapting this and the other scripts as necessary.
@eakubilo will take the lead on this issue, explore it a bit more, and report findings.
As we have identified from the April and June crawls, there have always been sites with empty entries (No Data). For the data from these two months, there are around 900+ empty entries.
I have been brainstorming about what to do with these entries, which have been present in past crawls as well.
I was initially thinking of including them in the error page, but I'm not sure that would be the best move, given that the null entries for one site could be caused by a different reason than those for another site (i.e., it is not a definite and given HumanCheckError, InsecureCertificateError, or WebDriverError).
Right now, I am thinking of just showing these empty results in our data analysis, maybe adding a column and creating figures for the percentage of sites in our data set that give empty entries for the month's crawl?