Shrimper — A Small Search Engine Crafted in Rust

image by some robot

INTRODUCTION

Creating a search engine in Rust is an excellent way to start exploring the language’s strengths in performance and safety.

This project carries familiar indexing and searching concepts over into Rust’s ecosystem, which is challenging but rewarding because of Rust’s distinctive syntax and paradigms.

We’ll start by setting up the Rust environment, including the essential tools and dependencies. Then we’ll define a data model with a struct and bring in two crates: tantivy for indexing and searching, and serde for serialization. By implementing a basic search engine, you’ll learn to manage indexing and execute search queries!

  • You’ll need Rust and Cargo (Rust’s package manager and build system) installed. If they aren’t, you can install them from the official Rust website.

Create a new Rust project:

cargo new shrimp_engine

cd shrimp_engine

You might need a few crates (Rust libraries) to help with parsing and data handling. For example:

  • tantivy for indexing and searching text (similar to Lucene in the Java world).
  • serde and serde_json for JSON parsing if your data is in JSON format.

Add these to your Cargo.toml file:

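[package]
name = "shrimp_engine"
version = "0.1.0"
edition = "2021"

[dependencies]
tantivy = "0.17"
# The "derive" feature provides #[derive(Serialize, Deserialize)]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"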

Decide on the structure of the documents you’ll be indexing. For a basic example, consider a simple struct representing documents with a title and body.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct Document {
    title: String,
    body: String,
}

Using tantivy, create an index schema based on your data structure, and then add documents to the index.

use tantivy::{doc, schema::*, Index};

fn create_index() -> tantivy::Result<()> {
    // Define the schema: both fields are tokenized and indexed,
    // but only the title is stored for retrieval
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // Create the index in memory (nothing is written to disk)
    let index = Index::create_in_ram(schema.clone());

    // Get the index writer; the argument is its memory budget in bytes
    let mut index_writer = index.writer(50_000_000)?;

    // Look up the field handles by name
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // Example document
    let doc = doc!(title => "Example Title", body => "This is the body of the document.");
    index_writer.add_document(doc)?;

    // Commit the documents to the index
    index_writer.commit()?;
    Ok(())
}

Implement a function to search the index.

You’ll need to create a searcher and query parser.

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;

fn search_index(index: &Index, query_str: &str) -> tantivy::Result<()> {
    // A reader hands out searchers over the committed index
    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Look up the fields that unqualified query terms should search
    let schema = index.schema();
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // `index` is already a reference, so it is passed as-is
    let query_parser = QueryParser::for_index(index, vec![title, body]);
    let query = query_parser.parse_query(query_str)?;

    // Collect the ten best-scoring documents and print their stored fields
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{:?}", retrieved_doc);
    }
    Ok(())
}

Now, let’s combine the indexing and searching into a main function, where we can modify the documents, the index, and queries:

use serde::{Deserialize, Serialize};
use tantivy::{collector::TopDocs, doc, query::QueryParser, schema::*, Index, TantivyError};

// Declared for later use; this example builds its document with `doc!` instead
#[allow(dead_code)]
#[derive(Serialize, Deserialize, Debug)]
struct Document {
    title: String,
    body: String,
}

fn create_index() -> Result<Index, TantivyError> {
    // Define the schema: only the title is stored for retrieval
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // Create the index in memory and get a writer for it
    let index = Index::create_in_ram(schema.clone());
    let mut index_writer = index.writer(50_000_000)?;

    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // Add one example document and commit it
    let doc = doc!(title => "Example Title", body => "the body of the document.");
    index_writer.add_document(doc)?;
    index_writer.commit()?;

    Ok(index)
}

fn search_index(index: &Index, query_str: &str) -> Result<(), TantivyError> {
    let reader = index.reader()?;
    let searcher = reader.searcher();

    let schema = index.schema();
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // `index` is already a reference, so it is passed as-is
    let query_parser = QueryParser::for_index(index, vec![title, body]);
    let query = query_parser.parse_query(query_str)?;

    // Collect the ten best-scoring documents and print their stored fields
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{:?}", retrieved_doc);
    }

    Ok(())
}

fn main() -> Result<(), TantivyError> {
    println!("Hello, Shrimp!");

    // Create the index and keep it for searching
    let index = create_index()?;

    // Search within the created index
    search_index(&index, "Example")?;

    Ok(())
}

Let’s break down the crucial components and their roles in the system:

Serde

  • serde::{Serialize, Deserialize}: These traits allow for the easy conversion of Rust structs to and from a format suitable for saving (like JSON), which is essential for working with data that needs to be indexed or retrieved.

Tantivy

  • tantivy::{schema::*, Index, doc, query::QueryParser, collector::TopDocs, TantivyError}:

The components from the tantivy crate are used for building the search engine’s core functionality, from creating an index to querying it.

Document Struct

  • Document Struct: Represents the data structure for documents to be indexed. Each document has a title and a body, mimicking a simple webpage or document in a real-world search engine.
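In the listing above, the struct is declared but the indexed document itself is built with the doc! macro. One way to connect the two is a small helper that pairs each field name with the document’s text, which a loop could then feed to an indexer. This is only a sketch: the fields method is a hypothetical name, not part of serde or tantivy, and the serde derives are left off so the snippet compiles with the standard library alone.

```rust
/// Same shape as the article's `Document` struct, minus the serde derives
/// so that this sketch needs no external crates.
struct Document {
    title: String,
    body: String,
}

impl Document {
    /// Hypothetical helper: pair each schema field name with this document's text.
    fn fields(&self) -> [(&'static str, &str); 2] {
        [("title", self.title.as_str()), ("body", self.body.as_str())]
    }
}
```

Calling fields() on a document with the title "Example Title" yields the pair ("title", "Example Title") followed by the matching pair for the body.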

The Schema

The schema defines the structure of the index, specifying which fields (here, title and body) are indexed and how: both are tokenized as text, and only title is also stored so that it can be returned in results. An in-memory index is created, and documents are added to it through an index writer. Note that this example builds its document with the doc! macro; the Document struct is declared but not yet wired into indexing. Committing the writer makes the added documents searchable.
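To get a feel for what indexing a text field involves, here is a toy inverted index written with the standard library only. It is a sketch under simplifying assumptions, not tantivy’s actual data layout: ToyIndex, tokenize, and lookup are hypothetical names, and the tokenizer is a crude stand-in for a real text analyzer.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical toy index: maps each lowercase token to the ids of the
/// documents that contain it. This is not tantivy's internal layout.
#[derive(Default)]
struct ToyIndex {
    postings: HashMap<String, BTreeSet<usize>>,
    doc_count: usize,
}

/// Split text on non-alphanumeric characters and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl ToyIndex {
    /// Index the title and body of one document and return its assigned id.
    fn add_document(&mut self, title: &str, body: &str) -> usize {
        let id = self.doc_count;
        self.doc_count += 1;
        for token in tokenize(title).into_iter().chain(tokenize(body)) {
            self.postings.entry(token).or_default().insert(id);
        }
        id
    }

    /// Ids of the documents containing `term`, in ascending order.
    fn lookup(&self, term: &str) -> Vec<usize> {
        self.postings
            .get(&term.to_lowercase())
            .map(|ids| ids.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

Looking up a term is then a single map access instead of a scan over every document, which is the main reason search engines build an index up front.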

1- Index Reader and Searcher

To search the index, an index reader is instantiated, creating a searcher capable of executing queries against the index.
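A useful mental model is that a searcher works on a fixed view of the index as of a commit, so documents added later only become visible to searchers created after the next commit. The sketch below models that idea with the standard library; ToyStore and ToySearcher are hypothetical types, and the real reader has reload behavior that this sketch does not attempt to reproduce.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an index: committed documents live in an
/// immutable snapshot, pending ones wait in a buffer.
struct ToyStore {
    committed: Arc<Vec<String>>,
    pending: Vec<String>,
}

/// A searcher keeps the snapshot that was current when it was created.
struct ToySearcher {
    snapshot: Arc<Vec<String>>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore { committed: Arc::new(Vec::new()), pending: Vec::new() }
    }

    /// Buffer a document; searchers cannot see it yet.
    fn add_document(&mut self, text: &str) {
        self.pending.push(text.to_string());
    }

    /// Publish the buffered documents as a new snapshot.
    fn commit(&mut self) {
        let mut next = self.committed.to_vec();
        next.append(&mut self.pending);
        self.committed = Arc::new(next);
    }

    /// Hand out a searcher over the latest committed snapshot.
    fn searcher(&self) -> ToySearcher {
        ToySearcher { snapshot: Arc::clone(&self.committed) }
    }
}

impl ToySearcher {
    /// Number of documents visible to this searcher.
    fn num_docs(&self) -> usize {
        self.snapshot.len()
    }
}
```

A searcher obtained before a commit keeps reporting the old document count, while one obtained afterwards sees the new documents.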

2- Query Parsing and Execution

A query parser interprets a query string, transforming it into a query object based on the defined schema. The searcher then uses this query to find and rank relevant documents.
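The parse-then-match flow can be sketched in a few lines of plain Rust. The assumptions here are deliberate simplifications: terms are combined with OR, every term is searched in both title and body, and the score is just the number of distinct query terms a document contains, which is far cruder than a real relevance score. ToyDoc, parse_query, and search are hypothetical names.

```rust
/// Hypothetical two-field document, mirroring the title/body schema.
struct ToyDoc {
    title: String,
    body: String,
}

/// Split text on non-alphanumeric characters and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Turn a query string into sorted, de-duplicated terms. A real parser also
/// handles operators, phrases, and field prefixes, which this sketch ignores.
fn parse_query(query: &str) -> Vec<String> {
    let mut terms = tokenize(query);
    terms.sort();
    terms.dedup();
    terms
}

/// Return (doc id, score) for every document matching at least one term,
/// best score first; ties keep the lower doc id first.
fn search(docs: &[ToyDoc], query: &str) -> Vec<(usize, u32)> {
    let terms = parse_query(query);
    let mut hits: Vec<(usize, u32)> = Vec::new();
    for (id, doc) in docs.iter().enumerate() {
        let mut tokens = tokenize(&doc.title);
        tokens.extend(tokenize(&doc.body));
        // Score = number of distinct query terms present in either field
        let score = terms.iter().filter(|t| tokens.contains(*t)).count() as u32;
        if score > 0 {
            hits.push((id, score));
        }
    }
    hits.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    hits
}
```

This version rescans every document per query to keep the example short; a real engine would consult its index rather than the raw text.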

3- Retrieving and Displaying Results

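The top matching documents (up to a limit of 10 here) are retrieved and printed, which lets you review indexed content in response to a search query. Because only title is marked STORED in the schema, the printed documents contain the title but not the body.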
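Collecting the best matches up to a limit does not require sorting every hit. The sketch below keeps a small list ordered by descending score and drops anything that falls off the end. It is one simple way to do this, not a description of how TopDocs is implemented; top_docs is a hypothetical function, and scores are assumed to be ordinary non-NaN floats.

```rust
/// Keep the `limit` best (score, doc id) pairs, highest score first.
/// Among equal scores, the pair seen earlier stays ahead.
/// Assumes scores are never NaN.
fn top_docs(scored: &[(f32, usize)], limit: usize) -> Vec<(f32, usize)> {
    let mut best: Vec<(f32, usize)> = Vec::new();
    for &(score, doc_id) in scored {
        // First position whose score is strictly lower than the candidate's
        let pos = best.partition_point(|&(s, _)| s >= score);
        if pos < limit {
            best.insert(pos, (score, doc_id));
            best.truncate(limit);
        }
    }
    best
}
```

With a limit of 3 and the scores 0.5, 2.0, 1.5, 2.0, 0.1 for documents 0 to 4, the function returns documents 1, 3, and 2 in that order.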

The main function ties everything together, first creating an index with at least one document and then performing a search within this index.

Simple as it is, this setup is a working search engine capable of indexing and searching text 🍤

  • The use of tantivy for indexing and searching provides a Rust-centric approach to text search, which offers high performance and safety.
  • serde lets complex data structures be serialized and deserialized easily within the Rust ecosystem, although this minimal example does not exercise it yet.
  • This example serves as a foundational framework, illustrating how Rust can be used to build search solutions 🍤

This example is intended to give you a starting point in search engine construction. Rust’s ownership and concurrency model, along with its type system, provide a robust foundation for building more complex and high-performance search engines.

We can expand this project by adding features like real-time indexing, advanced text processing, and custom scoring algorithms. Expect those features in the series of articles dedicated to search engines and information retrieval — all in Rust 🍤
