Or my first open source project on GitHub
I’m usually content to make use of the software found around the net. I recently wanted a copy of the Have I Been Pwned (HIBP) password database to do offline analysis. The owner of HIBP was kind enough to write his own downloader in .NET and link to other implementations in C, Python, Rust and even a bash/curl version.
I really liked the oschonrock/hibp version because it produced a binary file that was very fast to search and 48% smaller than the downloaded files. It also skipped the flat file step. A few design choices bugged me, though: its server is noisy and doesn’t return JSON, and it’s distributed as a .deb, which makes it pretty opaque. So I attempted to make something compatible with it using PHP.
It turned out to be a pretty interesting problem.
First, it’s a huge amount of data. Working with the files, bandwidth and time scales involved is difficult. The whole database is 50GB and the binary file is 28GB. Working with files that big takes time, memory and storage, so it ended up being a bit of a juggling act.
It’s the first time I’ve used multi_curl (PHP’s curl_multi functions) on a project. There are a lot of wrong ways to use it. An early version leaked connections, and the next one was slow because it didn’t reuse open connections (the TCP handshake is a thing). My dinky home router’s NAT was running out of ports because of how quickly the script opened and closed connections. Eventually I discovered that keep-alive worked well in this case.
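Here’s a minimal sketch of the pattern that avoids both problems. It is not the code from the repo. The helper names (rangeUrl, fetchRanges), the callback shape and the window size of 8 are all made up for illustration; the URL is HIBP’s public range endpoint. The idea is to keep a single multi handle alive for the whole run so finished connections can be reused, cap the connections per host, and always detach each finished handle.

```php
<?php
// Hypothetical helper: build the URL for one 5-character hash prefix.
function rangeUrl(string $prefix): string
{
    return 'https://api.pwnedpasswords.com/range/' . strtoupper($prefix);
}

/**
 * Fetch many ranges with a fixed-size window of in-flight requests.
 * $onDone receives ($prefix, $body), with $body === null on failure.
 */
function fetchRanges(array $prefixes, callable $onDone, int $window = 8): void
{
    // One multi handle for the whole run: its connection cache lets the
    // next request reuse a connection that has just finished (keep-alive).
    $mh = curl_multi_init();
    // Cap sockets per host so a small router's NAT table is not flooded.
    curl_multi_setopt($mh, CURLMOPT_MAX_HOST_CONNECTIONS, $window);

    $queue = array_values($prefixes);
    $active = 0; // easy handles currently attached to $mh

    while ($queue || $active > 0) {
        // Top the window back up.
        while ($queue && $active < $window) {
            $prefix = array_shift($queue);
            $ch = curl_init(rangeUrl($prefix));
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_PRIVATE => $prefix, // remember which range this is
                CURLOPT_TIMEOUT => 30,
            ]);
            curl_multi_add_handle($mh, $ch);
            $active++;
        }

        curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // nothing to wait on yet; avoid a busy loop
        }

        // Collect every transfer that has finished.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $prefix = curl_getinfo($ch, CURLINFO_PRIVATE);
            $ok = $info['result'] === CURLE_OK
                && curl_getinfo($ch, CURLINFO_RESPONSE_CODE) === 200;
            $onDone($prefix, $ok ? curl_multi_getcontent($ch) : null);

            // Skipping this step is how handles (and connections) leak.
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch); // required on PHP 7, a no-op on PHP 8
            $active--;
        }
    }
    curl_multi_close($mh);
}
```

Calling it would look something like fetchRanges(['00000', '00001'], function ($prefix, $body) { /* store $body */ }); and a real run would also want retries for the failed ranges.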
When was the last time you had to worry about unpacking binary data with PHP? You’ve got to count the bytes and get the type and the byte order right too (big endian vs little endian).
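To show what that looks like, here’s a small sketch. The record layout is an assumption for the sake of the example (20 raw SHA-1 bytes followed by a 32-bit little-endian count, 24 bytes in total), so check the real format before relying on it. RECORD_SIZE and decodeRecord are made-up names.

```php
<?php
// Assumed layout (hypothetical): 20 raw SHA-1 bytes + 32-bit LE count.
const RECORD_SIZE = 24;

function decodeRecord(string $record): array
{
    if (strlen($record) !== RECORD_SIZE) {
        throw new InvalidArgumentException('expected ' . RECORD_SIZE . ' bytes');
    }
    // H40 = 40 hex digits (20 bytes), high nibble first, lowercase output.
    // V   = unsigned 32-bit integer, little endian ("N" would be big endian).
    return unpack('H40hash/Vcount', $record);
}

// Build a sample record so the example does not need the real file.
$record = pack('H40', sha1('password')) . pack('V', 42);
$decoded = decodeRecord($record);
echo $decoded['hash'], ' ', $decoded['count'], PHP_EOL;

// The same four bytes read with the wrong byte order: 42 becomes 704643072.
$wrong = unpack('Ncount', substr($record, 20))['count'];
echo $wrong, PHP_EOL;
```

Get the byte order wrong and nothing errors out. You just get believable-looking garbage, which is why it pays to test against a hash whose count you already know.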
It’s the first time I’ve needed a binary search. It’s surprising how quick that is, even with PHP: 5-15ms per request to look up something in a 28GB file. I also briefly added a table of contents, but it didn’t really help the speed and it made each request use more memory.
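A rough sketch of why it’s so quick: each step halves the search range, so the number of reads grows with the logarithm of the record count. If the records were 24 bytes each (an assumption, along with the layout and the lookupCount name below), a 28GB file would hold a bit over a billion of them, which works out to roughly 30 seek-and-read steps per lookup. This version assumes 64-bit PHP 7.1 or newer.

```php
<?php
// Assumed layout (hypothetical): records sorted by hash, each being
// 20 raw SHA-1 bytes followed by a 32-bit little-endian count.
const HASH_BYTES = 20;
const RECORD_BYTES = 24;

/** Return the count stored for $sha1Hex, or null if the hash is absent. */
function lookupCount(string $path, string $sha1Hex): ?int
{
    $needle = hex2bin($sha1Hex);
    if ($needle === false || strlen($needle) !== HASH_BYTES) {
        throw new InvalidArgumentException('expected 40 hex characters');
    }
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("cannot open $path");
    }
    try {
        $low = 0;
        $high = intdiv(filesize($path), RECORD_BYTES) - 1;
        while ($low <= $high) {
            $mid = intdiv($low + $high, 2);
            fseek($fh, $mid * RECORD_BYTES);
            $record = fread($fh, RECORD_BYTES);
            // strcmp() is binary safe and compares byte by byte.
            $cmp = strcmp(substr($record, 0, HASH_BYTES), $needle);
            if ($cmp === 0) {
                return unpack('V', $record, HASH_BYTES)[1];
            }
            if ($cmp < 0) {
                $low = $mid + 1;   // record sorts before the needle
            } else {
                $high = $mid - 1;  // record sorts after the needle
            }
        }
        return null;
    } finally {
        fclose($fh);
    }
}

// Demo on a tiny temporary file instead of the real 28GB one.
$counts = [sha1('123456') => 111, sha1('password') => 222, sha1('qwerty') => 333];
ksort($counts, SORT_STRING); // lowercase hex order matches raw byte order
$data = '';
foreach ($counts as $hex => $count) {
    $data .= hex2bin($hex) . pack('V', $count);
}
$tmp = tempnam(sys_get_temp_dir(), 'hibp');
file_put_contents($tmp, $data);

$found = lookupCount($tmp, sha1('password'));
$missing = lookupCount($tmp, sha1('not in the list'));
var_dump($found, $missing);
unlink($tmp);
```

Only one 24-byte record is in memory at a time, which fits with the table of contents costing memory without buying much speed.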
I beat it up with ApacheBench to see how fast it was. At that scale it was interesting to see how things like having a log file impacted performance. Writing the log was more expensive than I thought it would be, so it’s now off by default.
Check it out, https://github.com/derak-kilgo/hibp-download