How do I efficiently build a vector and an index of that vector while processing a data stream?

Question

I have a struct Foo:

struct Foo {
    v: String,
    // Other data not important for the question
}

I want to process a data stream, save the results into a Vec<Foo>, and build an index for this Vec<Foo> on the field Foo::v.

I want to use a HashMap<&str, usize> for the index, where the keys will be &Foo::v and the value is the position in the Vec<Foo>, but I'm open to other suggestions.

I want to do the data stream handling as fast as possible, which requires not doing obvious things twice.

For example, I want to:

  • allocate a String only once per read from the data stream
  • not search the index twice, once to check that the key does not exist and once to insert the new key
  • not increase the run time by using Rc or RefCell

The borrow checker does not allow this code:

let mut l = Vec::<Foo>::new();
{
    let mut hash = HashMap::<&str, usize>::new();
    //here is loop in real code, like: 
    //let mut s: String; 
    //while get_s(&mut s) {
    let s = "aaa".to_string();
    let idx: usize = match hash.entry(&s) { //a
        Occupied(ent) => {
            *ent.get()
        }
        Vacant(ent) => {
            l.push(Foo { v: s }); //b
            ent.insert(l.len() - 1);
            l.len() - 1
        }
    };
    // do something with idx
}

There are multiple problems:

  1. hash.entry borrows the key, so s must have a "bigger" lifetime than hash
  2. I want to move s at line (b), while I have a read-only reference at line (a)

So how should I implement this simple algorithm without an extra call to String::clone or calling HashMap::get after calling HashMap::insert?

Answer

In general, what you are trying to accomplish is unsafe and Rust is correctly preventing you from doing something you shouldn't. For a simple example of why, consider a Vec<u8>. If the vector has one item and a capacity of one, adding another value will cause a reallocation and a copy of all the values in the vector, invalidating any references into the vector. All of the keys in your index would then point to arbitrary memory addresses, leading to undefined behavior. The compiler prevents that.
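The growth described above can be observed directly. This is a minimal sketch that is not part of the original answer; the helper name capacity_around_full_push is made up for illustration. It fills a Vec<u8> to capacity and pushes once more, which forces the vector to grow. Whether the buffer's address actually changes is up to the allocator, so the sketch only reports the capacities.

```rust
/// Hypothetical helper: fills a Vec<u8> exactly to its capacity, pushes one
/// more element, and returns the capacity before and after that push.
fn capacity_around_full_push() -> (usize, usize) {
    let mut v: Vec<u8> = Vec::with_capacity(1);
    // with_capacity may reserve more than requested, so fill up to the
    // capacity the Vec actually reports.
    while v.len() < v.capacity() {
        v.push(0);
    }
    let before = v.capacity();

    // A `let first = &v[0];` taken here could not be used after the push
    // below: the borrow checker rejects that, because the push may move
    // every element to a new buffer.
    v.push(1); // no spare room left, so the Vec has to grow
    let after = v.capacity();

    (before, after)
}

fn main() {
    let (before, after) = capacity_around_full_push();
    println!("capacity before: {}, after: {}", before, after);
}
```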

In this case, there are two extra pieces of information that the programmer knows but the compiler does not:

  1. There's an extra indirection: String is heap-allocated, so moving the pointer to that heap allocation isn't really a problem.
  2. The String will never be changed. If it were, it might reallocate, invalidating the referred-to address.
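The first point can be checked with a small experiment. The following sketch is an illustration only, and the helper name string_bytes_address_across_moves is hypothetical. It records the address of a String's bytes, moves the String into a Vec, forces that Vec to grow, and then reads the address again.

```rust
/// Hypothetical helper: returns the address of a String's bytes before and
/// after the String has been moved into a Vec that then reallocates.
fn string_bytes_address_across_moves() -> (*const u8, *const u8) {
    let s = String::from("alice");
    let before = s.as_ptr(); // address of the heap buffer holding "alice"

    let mut v: Vec<String> = Vec::new();
    v.push(s); // moves the String's pointer, length and capacity, not its bytes

    // Push many more elements so the Vec itself has to grow several times.
    for i in 0..1_000 {
        v.push(i.to_string());
    }

    let after = v[0].as_ptr();
    (before, after)
}

fn main() {
    let (before, after) = string_bytes_address_across_moves();
    println!("same address: {}", before == after);
}
```

The two addresses are equal because only the String's small header is copied when it moves. The second point is the one that needs discipline: pushing to or otherwise mutating the String could reallocate its buffer.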

In cases like this, it is OK to use unsafe code, so long as you properly document why it's not unsafe.

use std::collections::HashMap;
use std::mem;

#[derive(Debug)]
struct Player {
    name: String,
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let mut players = Vec::new();
    let mut index = HashMap::new();

    for &name in &names {
        let player = Player { name: name.into() };
        let idx = players.len();

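        // SAFETY: the &str points into the String's heap buffer, which stays
        // put when `player` is moved into `players` and when `players`
        // reallocates. The names are never mutated or removed while `index`
        // is alive, and `index` is dropped before `players`.
        let stable_name: &str = unsafe { mem::transmute(&*player.name) };

        players.push(player);
        // Key the index by name so it maps a name to its position in `players`.
        index.insert(stable_name, idx);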
    }

    for (k, v) in &index {
        println!("{:?} -> {:?}", k, v);
    }

    for v in &players {
        println!("{:?}", v);
    }
}

However, my guess is that you don't want this code in your main method but want to return it from some function. That will be a problem, as you will quickly run into "Why can't I store a value and a reference to that value in the same struct?".

Honestly, there are styles of code that don't fit well within Rust's limitations. If you run into these, you could:

  • decide that Rust isn't a good fit for you or your problem.
  • use unsafe code, preferably thoroughly tested and only exposing a safe API.
  • investigate alternate representations.
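To make the second option concrete, here is one possible sketch of hiding the unsafe trick behind a safe API, applied to the Foo from the question. It is not from the original answer: the names Indexed, insert and get are hypothetical, and the soundness argument rests on the assumptions spelled out in the comments (Foo::v is never mutated, items are never removed, and the 'static keys never leave the struct). Treat it as a starting point to review and test, not as a vetted implementation.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Debug)]
pub struct Foo {
    v: String, // kept private: it must never be mutated after insertion
}

impl Foo {
    pub fn v(&self) -> &str {
        &self.v
    }
}

/// Hypothetical wrapper owning both the items and the index.
#[derive(Default)]
pub struct Indexed {
    // Declared before `items` so that it is dropped first.
    index: HashMap<&'static str, usize>,
    items: Vec<Foo>,
}

impl Indexed {
    /// Returns the position of `s`, inserting it if it is new.
    /// Performs one index lookup and never clones the String.
    pub fn insert(&mut self, s: String) -> usize {
        // SAFETY (assumptions of this sketch): the bytes of `s` live on the
        // heap and do not move when `s` is moved into `items` or when
        // `items` reallocates. `Foo::v` is never mutated and items are
        // never removed, so the bytes outlive every key in `index`.
        // The fake 'static lifetime is never handed out to callers.
        let key: &'static str = unsafe { &*(s.as_str() as *const str) };
        match self.index.entry(key) {
            // Duplicate: `key` was only used for the lookup and is not
            // stored; `s` is dropped when this function returns.
            Entry::Occupied(e) => *e.get(),
            Entry::Vacant(e) => {
                let idx = self.items.len();
                self.items.push(Foo { v: s });
                e.insert(idx);
                idx
            }
        }
    }

    pub fn get(&self, key: &str) -> Option<&Foo> {
        self.index.get(key).map(|&idx| &self.items[idx])
    }
}

fn main() {
    let mut foos = Indexed::default();
    let inputs = ["aaa", "bbb", "aaa"];
    for &s in &inputs {
        let idx = foos.insert(s.to_string());
        println!("{} -> {}", s, idx);
    }
    println!("{:?}", foos.get("bbb"));
}
```

Because both fields are private and every method hands out only references tied to &self, callers cannot invalidate the keys. Any method added later that mutates or removes a Foo would have to update the index as well.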

For example, I'd probably rewrite the code to have the index be the primary owner of the key:

use std::collections::BTreeMap;

#[derive(Debug)]
struct Player<'a> {
    name: &'a str,
    data: &'a PlayerData,
}

#[derive(Debug)]
struct PlayerData {
    hit_points: u8,
}

#[derive(Debug)]
struct Players(BTreeMap<String, PlayerData>);

impl Players {
    fn new<I, S>(iter: I) -> Self
        where I: IntoIterator<Item = S>,
              S: AsRef<str>,
    {
        let players = iter.into_iter()
            .map(|name| (name.as_ref().to_string(), PlayerData { hit_points: 100 }))
            .collect();
        Players(players)
    }

    fn get<'a>(&'a self, name: &'a str) -> Option<Player<'a>> {
        self.0.get(name).map(|data| {
            Player {
                name: name,
                data: data,
            }
        })
    }
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let players = Players::new(&names);

    for (k, v) in &players.0 {
        println!("{:?} -> {:?}", k, v);
    }

    println!("{:?}", players.get("eustice"));
}
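The same idea can be mapped back onto the streaming loop from the question. The sketch below is an assumption-laden illustration, not the answerer's code: FooData, FooIndex, position and the hits field are hypothetical names. The map owns each String, the Vec holds only the remaining data, and the entry API does the lookup and the insertion in a single search without cloning.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical stand-in for the "other data" of the question's Foo.
#[derive(Debug, Default)]
struct FooData {
    hits: u32,
}

/// The index owns the keys; the Vec holds everything else.
#[derive(Debug, Default)]
struct FooIndex {
    index: HashMap<String, usize>,
    data: Vec<FooData>,
}

impl FooIndex {
    /// Returns the position for `s`, creating a new slot on first sight.
    /// `s` is moved into the map, so there is no clone and only one lookup.
    fn position(&mut self, s: String) -> usize {
        match self.index.entry(s) {
            Entry::Occupied(e) => *e.get(),
            Entry::Vacant(e) => {
                let idx = self.data.len();
                self.data.push(FooData::default());
                e.insert(idx);
                idx
            }
        }
    }
}

fn main() {
    let mut foos = FooIndex::default();
    let inputs = ["aaa", "bbb", "aaa"];
    for &s in &inputs {
        let idx = foos.position(s.to_string());
        foos.data[idx].hits += 1; // "do something with idx"
    }
    println!("{:?}", foos);
}
```

The trade-off is that an element of the Vec no longer carries its own key, so going from a position back to the string requires a separate structure or a scan of the map.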

Alternatively, as shown in "What's the idiomatic way to make a lookup table which uses field of the item as the key?", you could wrap your type and store it in a set container instead:

use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
struct Player {
    name: String,
    hit_points: u8,
}

#[derive(Debug, Eq)]
struct PlayerByName(Player);

impl PlayerByName {
    fn key(&self) -> &str {
        &self.0.name
    }
}

impl PartialOrd for PlayerByName {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for PlayerByName {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.key().cmp(&other.key())
    }
}

impl PartialEq for PlayerByName {
    fn eq(&self, other: &Self) -> bool {
        self.key() == other.key()
    }
}

impl std::borrow::Borrow<str> for PlayerByName {
    fn borrow(&self) -> &str {
        self.key()
    }
}

#[derive(Debug)]
struct Players(BTreeSet<PlayerByName>);

impl Players {
    fn new<I, S>(iter: I) -> Self
        where I: IntoIterator<Item = S>,
              S: AsRef<str>,
    {
        let players = iter.into_iter()
            .map(|name| PlayerByName(Player { name: name.as_ref().to_string(), hit_points: 100 }))
            .collect();
        Players(players)
    }

    fn get(&self, name: &str) -> Option<&Player> {
        self.0.get(name).map(|pbn| &pbn.0)
    }
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let players = Players::new(&names);

    for player in &players.0 {
        println!("{:?}", player.0);
    }

    println!("{:?}", players.get("eustice"));
}

Guessing about performance characteristics without performing profiling is never a good idea. I honestly don't believe that there'd be a noticeable performance loss from incrementing or decrementing an integer, as Rc does with its reference count, when a value is cloned or dropped. If the problem required both an index and a vector, then I would reach for some kind of shared ownership.
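As a rough illustration of what shared ownership could look like here, the following sketch stores the key as an Rc<str> in both the vector and the index. It is one possible shape under assumed names (Indexed, intern), not the answerer's own code. A hit costs one lookup and no allocation; a miss costs one allocation and a second hash for the insert, so whether this beats the other options is something to measure.

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Debug)]
struct Foo {
    v: Rc<str>,
}

/// Hypothetical container: the Vec and the index share each key via Rc.
#[derive(Debug, Default)]
struct Indexed {
    items: Vec<Foo>,
    index: HashMap<Rc<str>, usize>,
}

impl Indexed {
    /// Returns the position of `s`, inserting it if it is new.
    fn intern(&mut self, s: &str) -> usize {
        // Rc<str> implements Borrow<str>, so the lookup needs no allocation.
        if let Some(&idx) = self.index.get(s) {
            return idx;
        }
        let key: Rc<str> = Rc::from(s); // the only string allocation
        let idx = self.items.len();
        // Rc::clone only increments the reference count.
        self.items.push(Foo { v: Rc::clone(&key) });
        self.index.insert(key, idx);
        idx
    }
}

fn main() {
    let mut foos = Indexed::default();
    let inputs = ["aaa", "bbb", "aaa"];
    for &s in &inputs {
        let idx = foos.intern(s);
        println!("{} -> {}", s, idx);
    }
    println!("{:?}", foos.items);
}
```

Since intern takes a &str, the reading loop can reuse a single String buffer for every record, as the commented-out get_s(&mut s) loop in the question suggests.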

10-30 09:06