To-do 13: Poking at Big Data (Yelp)

Na-Rae the fake student:

  • My 5-year-old ThinkPad laptop did ok! It has ... RAM, which worked ok with this much data, etc. Grepping through the whole of xxx json file took...

Wilson:

  • My laptop has 16 GB of RAM (although WSL2 only has access to 8 GB of it), a 6-core Intel i7-8750H, and a 1 TB SSD (with 223 GB free). 100k lines were handled fine, taking only 6.4s to process, but 1 million was too much; it terminated (due to maxing out the RAM) after about a minute. The fact that every line was being read into a dataframe seemed like the culprit, so I tried re-implementing the same functionality using streaming (with ijson), which actually halved execution time and was able to process the entire file in just under 4 minutes. Now I know that if I hit resource limits doing something like this, I can try refactoring my code so it processes the data line by line as a stream rather than loading it all into a dataframe (a rough sketch of the idea follows below).
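
A minimal sketch of that streaming idea, using only the standard library rather than ijson specifically, and assuming the newline-delimited review.json with a text field and simple whitespace tokenization:

```python
import json
from collections import Counter

token_counts = Counter()

# Read the reviews one line at a time instead of building a dataframe,
# so memory use stays flat regardless of how large the file is.
with open("review.json", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)                # each line is one JSON object
        token_counts.update(review["text"].lower().split())

print(token_counts.most_common(20))
```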

Varun:

  • I tried it on my MacBook Pro with 8 GB of RAM. It did not take it well when I attempted to read it all in at once: my computer fan started being extra noisy, and my terminal went unresponsive, as did the rest of the computer, even to keyboard interrupts. I managed to kill it with the classic CTRL+\, but had to restart my computer because nothing worked. Upon restarting, my wallpaper had changed for some reason. Part of the problem might be the 100 Chrome tabs that I will never close (I need all of them). I tried again, specifying chunksize (see the sketch below), and it took a few minutes but worked.
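
A minimal sketch of chunked reading with pandas, assuming the newline-delimited review.json and an arbitrary chunk size of 100,000 lines (not necessarily the value used here):

```python
import pandas as pd

# With lines=True and chunksize set, read_json returns an iterator of
# DataFrames, so only one chunk is held in memory at a time.
reader = pd.read_json("review.json", lines=True, chunksize=100_000)

n_reviews = 0
for chunk in reader:          # each chunk is an ordinary DataFrame
    n_reviews += len(chunk)   # ...do the real per-chunk processing here...

print(n_reviews, "reviews processed")
```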

Camryn:

  • My laptop’s handling of this dataset is about what I would expect. I have a 2019 MacBook Pro with a 2.4 GHz Quad-Core Intel Core i5 and 8 GB of RAM. Like most Macs, it gets overwhelmed just opening multiple tabs at once. I tested the script in increments of 10,000, then increments of 100,000, and then one jump to 1,000,000 just to see. With line counts below 100,000, the script ran incredibly fast and did not have much of an effect on my computer. Between 100,000 and 1,000,000, the script started taking longer to run, but nothing more than a few seconds. Approaching 1,000,000 is when it started to slow down further, and at 1,000,000 it took around 3 minutes and 3 seconds to run. For fun, I tried running the script with 2,000,000 reviews, and after around 8 minutes I gave up and terminated the process. To run this in its entirety successfully and quickly, a computer with better hardware would definitely help.

Mack:

  • I have a 2019 MacBook Pro with a 1.4 GHz Quad-Core Intel Core i5 and 8 GB of RAM. Running grep to look at 'horrible' and 'scrumptious' reviews took a lot of CPU resources and ran for a minute or two. I tried 10, 100, 500, 1,000, 5,000, 7,500, 10,000, 20,000, and 50,000 lines, then accidentally jumped up to 750,000, but even 750k ran in about a minute. The others all ran instantly or within a few seconds, with very little effort from my CPU. Running 750k took a lot out of the CPU, but no major issues or noises came up. I then ran 1,000,000 lines, and it finished in a minute and a half. That last test took up over 8 GB of memory, so a computer with more memory would likely improve processing speed on this data. I am sure I could process more lines, but I did not want to push my computer further than a million.

Moldir:

  • I use a 2017 MacBook Pro 14.2 with a 3.1 GHz Dual-Core Intel Core i5 and 8 GB of RAM. Running the egrep searches in Step 1 for the 'horrible', 'scrumptious', and 'delicious' tokens in the reviews file took a minute or so; checking word counts for each file took longer. In Step 2, I couldn't run the Python script from the command line on the first attempt, and it took a while to figure out how to reconfigure my .bash_profile so it would execute Python scripts. I tried 1K, 10K, 100K, and 1M samples, and everything worked just fine up to 1M. The last sample took more than 5 minutes to run and used almost 92% of the CPU.

Ashley:

  • I am using a 2020 MacBook Air with a 1.1 GHz Quad-Core Intel Core processor. It has 500 GB of disk space and 8 GB of RAM, which, as I have discovered while working on my semester project, is not that much. When processing the reviews file (roughly the sample-then-count workflow sketched below), I tried 1,000, 10,000, and 50,000 lines without noticing much of a difference in processing time (all went very quickly, within a few seconds). 100,000 lines took a few extra seconds, and at around 500,000 lines it took a few extra seconds just to write the FOO.json file, and processing the file took a bit longer too. Finally, I tried 1,000,000 just for fun: it took about 2 minutes to create the dataframe, and then another 2-3 minutes to print the most common word tokens. During this, my computer's RAM usage spiked: Python alone was using over 5 GB of memory, and my computer had used about 7 GB of swap, so I closed a lot of apps and tabs running in the background. Knowing my computer's less-than-stellar performance with large files, I decided to stop there. More RAM would be essential for processing larger data, and I suspect a more modern machine would be more efficient overall.
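
The actual to-do script isn't reproduced here, but a minimal sketch of the sample-then-count workflow described above, assuming the review.json filename, a sample size of 100,000 lines, and simple whitespace tokenization, might look like:

```python
import io
from collections import Counter
from itertools import islice

import pandas as pd

N = 100_000  # number of review lines to sample

# Take just the first N lines of the newline-delimited review file...
with open("review.json", encoding="utf-8") as f:
    sample = "".join(islice(f, N))

# ...load that sample into a dataframe...
df = pd.read_json(io.StringIO(sample), lines=True)

# ...and print the most common word tokens in the review text.
tokens = Counter()
for text in df["text"]:
    tokens.update(text.lower().split())

print(tokens.most_common(20))
```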

Sen:

  • My laptop is from 2020; it has an AMD Ryzen 5 processor, 500 GB of disk space (348 GB free), and 8 GB of RAM (7.40 GB usable). It's not a very powerful machine (compared to my other PC and computers in general!), and I only use it for things like school and work, so I did not expect much from it. Indeed, it took quite a while just to extract the files from the archive. The first time I ran the Python script, I (accidentally) ran it on the entire review.json file; my memory use shot up to 99%, but miraculously I noticed and terminated the process before my computer crashed or something. 1K- and 10K-line samples processed near-instantly, while 50K and 100K lines took a few seconds more. A 500K-line sample is when my laptop started to lag a bit, using quite a lot of memory and about 30 seconds to run. Finally, running the script on a 1M-line sample took about 2 minutes and often spiked my memory and disk use over 90%, though my CPU utilization remained quite low, between 10% and 15%.

Alex:

  • I have a 2021 ROG Zephyrus M16 with a 2.50 GHz Intel Core processor and 16 GB of RAM. Since I have a relatively beefy machine, I hoped I would be able to push pretty far before hitting any limits. The grep searches took about five seconds each. I processed 1,000, 5,000, 10,000, 50,000, and 100,000 lines almost instantaneously. 500,000 lines took a few seconds and peaked at a bit over 4 GB of memory, but was also not an issue. 750,000 lines took a bit less than 30 seconds to run. 1,000,000 lines took about 30 seconds and about 9.5 GB of memory at peak, which was a little worrying because, on top of all my other processes, that brought me to 96% memory use. Lastly, I tried pushing to 1,250,000 lines, which took about the same amount of time as a million lines and 9.7 GB of memory at peak. I did hit 99% memory use, so I feel wary of pushing farther, but I do think my machine is capable of more if I ended the background processes. It was also interesting to see that this never had much impact on my CPU; the highest I saw was 17% CPU usage. From what I saw, the most limiting factor is definitely having enough RAM to actually load the file, so the more RAM a machine has, the easier it will be to run this process on larger and larger samples.

Soobin:

  • I bought my laptop 3 years ago; it has an Intel Core i7-10510 with 8 CPUs, 16 GB of memory, and a 1 TB SSD, so I was not very worried about hitting any limits. Grepping for words on the command line finished in a blink, so I decided to move on to the Python script. 100,000 lines ran instantly, with no delay at all. Then I tried 1,000,000 lines, and the memory usage surged up to 80%; it took 43 seconds to get the result, which I guess is not so bad. I wanted to try a higher number of lines, but I held off for fear of crashing.