GitHub has a lot of code to search — more than 200 million repositories — and says last November's beta version of a search engine optimized for source code has caused a "flurry of innovation."
In a recent blog post, GitHub engineer Timothy Clem described Blackbird, the code search engine built in Rust that is used to scour just under a quarter of those repos.
Blackbird currently provides access to almost 45 million GitHub repositories, which together amount to 115 TBytes of code and 15.5 billion documents.
Using ripgrep on an 8-core Intel CPU to run an exhaustive regular expression query on a 13-GByte file held in memory, Clem explained, takes about 2.769 seconds, or roughly 0.6 GByte/second/core. "We can see pretty quickly that this really isn't going to work for the larger amount of data we have," he said.
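The arithmetic behind that conclusion can be checked with a short back-of-the-envelope sketch. It uses only the figures quoted above (13 GBytes, 2.769 seconds, 8 cores, 115 TBytes). The function names are illustrative, and the assumptions that 1 TByte equals 1,000 GBytes and that throughput scales linearly with data size are ours, not Clem's.

```python
# Back-of-the-envelope check of the brute-force scan figures quoted above.
# Assumptions (not from the source): decimal units and linear scaling.
GB_PER_TB = 1000


def per_core_throughput(file_gb: float, seconds: float, cores: int) -> float:
    """GBytes scanned per second per core for one exhaustive query."""
    return file_gb / seconds / cores


def full_scan_seconds(corpus_tb: float, file_gb: float, seconds: float) -> float:
    """Seconds for one such machine to scan the whole corpus at the same rate."""
    machine_gb_per_second = file_gb / seconds
    return corpus_tb * GB_PER_TB / machine_gb_per_second


rate = per_core_throughput(13, 2.769, 8)
scan = full_scan_seconds(115, 13, 2.769)

print(f"{rate:.2f} GByte/second/core")
print(f"{scan / 3600:.1f} hours per query on one 8-core machine")
```

Under these assumptions, a single exhaustive query over all 115 TBytes would take just under seven hours on one such machine, which is why scanning everything at query time does not scale.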
So GitHub front-loaded much of the work into precomputed search indices. Even so, these indices are too large to fit in memory. GitHub Code Search is presently in beta testing.
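To illustrate what front-loading work into a precomputed index means, here is a minimal sketch of one common technique, an inverted index keyed on three-character substrings (trigrams). The summary above does not say which index structure Blackbird uses, so the trigram choice, the function names, and the sample documents are all assumptions made for illustration. Unlike GitHub's indices, this toy version fits comfortably in memory.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every three-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Precompute a map from each trigram to the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index: dict[str, set[str]], docs: dict[str, str], query: str) -> list[str]:
    """Find documents containing query as a literal substring."""
    grams = trigrams(query)
    if grams:
        # Only documents that contain every trigram of the query can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries shorter than three characters cannot use the index.
        candidates = set(docs)
    # Confirm each candidate, since shared trigrams do not guarantee a match.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample documents.
docs = {
    "a.rs": "fn main() {}",
    "b.py": "def main(): pass",
    "c.txt": "hello",
}
index = build_index(docs)
print(search(index, docs, "main"))
```

The index is built once, ahead of time. At query time, only the few candidate documents need to be scanned, rather than every document in the corpus.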
From The Register