I will likely be involved in a project where an important component is a storage for a large number of files (in this case images, but it should just act as a file storage).

The number of incoming files is expected to be around 500,000 per week (averaging around 100 KB each), peaking at around 100,000 files per day and 5 per second. The total number of files is expected to reach tens of millions before reaching an equilibrium where files are being expired, for various reasons, at the input rate.

So I need a system that can store around 5 files per second at peak hours, while reading around 4 and deleting 4 at any time.

My initial idea is that a plain NTFS file system with a simple service for storing, expiring and reading should actually be sufficient. I could imagine the service creating sub-folders for each year, month, day and hour to keep the number of files per folder at a minimum and to allow manual expiration in case that should be needed.

A large NTFS solution has been discussed here, but I could still use some advice on what problems to expect when building a storage with the mentioned specifications, what maintenance problems to expect and what alternatives exist. Preferably I would like to avoid a distributed storage, if possible and practical.

Edit: Thanks for all the comments and suggestions. Some more bonus info about the project:

This is not a web application where images are supplied by end users. Without disclosing too much, since this is in the contract phase, it's more in the category of quality control. Think production plant with conveyor belt and sensors. It's not traditional quality control, since the value of the product is entirely dependent on the image and metadata database working smoothly.

The images are accessed 99% of the time by an autonomous application in first-in, first-out order, but random access by a user application will also occur. Images older than a day will mainly serve archive purposes, though that purpose is also very important.

Expiration of the images follows complex rules for various reasons, but at some date all images should be deleted. Deletion rules follow business logic dependent on metadata and user interactions.

There will be downtime each day, during which maintenance can be performed.

Preferably the file storage will not have to communicate image locations back to the metadata server. Image location should be uniquely deducible from metadata, possibly through a mapping database, if some kind of hashing or distributed system is chosen.

So my questions are:

Which technologies will do a robust job?
Which technologies will have the lowest implementation costs?
Which technologies will be easiest to maintain by the client's IT department?
What risks are there for a given technology at this scale (5-20 TB of data, 10-100 million files)?

Solution

Here are some random thoughts on implementation and possible issues, based on the following assumptions: an average image size of 100 KB and a steady state of 50M (5GB) images. This also assumes users will not be accessing the file store directly, but will go through software or a web site.
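For scale, here is a quick back-of-envelope on those numbers (just a sketch plugging in the figures stated above, nothing measured):

```python
# Back-of-envelope sizing from the figures quoted above.
AVG_IMAGE_BYTES = 100 * 1024        # ~100 KB per image
FILES_PER_WEEK = 500_000            # average ingest rate
PEAK_FILES_PER_SECOND = 5           # stated peak
STEADY_STATE_FILES = 50_000_000     # roughly where the store levels off

peak_write = PEAK_FILES_PER_SECOND * AVG_IMAGE_BYTES    # bytes/second at peak
weekly_ingest = FILES_PER_WEEK * AVG_IMAGE_BYTES        # bytes ingested per week
steady_state = STEADY_STATE_FILES * AVG_IMAGE_BYTES     # total bytes held

print(f"peak write rate:   {peak_write / 1024:.0f} KB/s")      # ~500 KB/s
print(f"weekly ingest:     {weekly_ingest / 1024**3:.0f} GB")   # ~48 GB
print(f"steady-state size: {steady_state / 1024**4:.1f} TB")    # ~4.7 TB (the question's 5-20 TB range)
```

So the raw bandwidth is modest; it's the file count rather than the throughput that needs the careful layout discussed next.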
Storage medium: The image sizes you give amount to rather paltry read and write speeds; I would think most common hard drives wouldn't have an issue with this throughput. I would put them in a RAID 1 configuration for data security, however. Backups wouldn't appear to be too much of an issue, since it's only 5 GB of data.

File storage: To prevent issues with too many files in a single directory, I would take a hash of the file (MD5 at minimum; it's the quickest but also the most collision-prone, and before people chime in to say MD5 is broken: this is for identification, not security. An attacker could mount a second-preimage attack and replace all the images with goatse, but we'll consider that unlikely) and convert that hash to a hexadecimal string. Then, when it comes time to stash the file in the file system, take the hex string in blocks of two characters and create a directory structure for the file based on that. E.g. if a file hashes to abcdef, the root directory would be ab, under that a directory called cd, and under that you would store the image with the name abcdef. The real name is kept somewhere else (discussed below).

With this approach, if you start hitting file system limits (or performance issues) from too many files in a directory, you can just have the file storage part create another level of directories. You could also store with the metadata how many levels of directories the file was created with, so if you expand later, older files won't be looked for in the newer, deeper directories.

Another benefit here: if you hit transfer speed issues, or file system issues in general, you could easily split off a set of files to other drives. Just change the software to keep the top-level directories on different drives. So if you want to split the store in half, put 00-7F on one drive and 80-FF on another.

Hashing also nets you single-instance storage, which can be nice. And since hashes of a normal population of files tend to be random, this should also net you an even distribution of files across all directories.

Metadata storage: While 50M rows seems like a lot, most DBMSs are built to scoff at that number of records, given enough RAM, of course. The following is written with SQL Server in mind, but I'm sure most of it applies to other systems. Create a table with the hash of the file as the primary key, along with columns such as the size, format, and level of nesting. Then create another table with an artificial key (an int IDENTITY column would be fine for this), the original name of the file (varchar(255) or whatever), the hash as a foreign key back to the first table, and the date it was added, with an index on the file name column. Also add any other columns you need to figure out whether a file is expired or not. This lets you keep the original names when people put the same file in under different names (the files being otherwise identical, since they hash the same).
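To make the file storage scheme above concrete, here is a minimal sketch in Python. The drive roots, the two-way split of the keyspace, and the function names are all hypothetical illustrations of the idea, not anything prescribed by the answer:

```python
import hashlib
import os
import shutil

# Hypothetical layout: the hash keyspace split across two drives,
# 00-7F on one and 80-FF on the other, as suggested above.
DRIVE_ROOTS = [
    (0x00, 0x7F, r"D:\imagestore"),
    (0x80, 0xFF, r"E:\imagestore"),
]
NESTING_LEVELS = 2  # number of 2-character directory levels; record this with the metadata


def file_hash(path: str) -> str:
    """MD5 of the file contents as a lowercase hex string (identification, not security)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()


def root_for(hex_hash: str) -> str:
    """Pick a drive root based on the first byte of the hash."""
    first_byte = int(hex_hash[:2], 16)
    for low, high, root in DRIVE_ROOTS:
        if low <= first_byte <= high:
            return root
    raise ValueError("unmapped hash prefix")


def storage_path(hex_hash: str, levels: int = NESTING_LEVELS) -> str:
    """Build <root>/ab/cd/abcdef... for a file whose hash starts with abcdef."""
    parts = [hex_hash[i * 2:(i + 1) * 2] for i in range(levels)]
    return os.path.join(root_for(hex_hash), *parts, hex_hash)


def store(src_path: str) -> tuple[str, str]:
    """Copy a file into the hashed layout and return (hash, destination path)."""
    h = file_hash(src_path)
    dest = storage_path(h)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # single-instance storage: identical files land on the same path
        shutil.copyfile(src_path, dest)
    return h, dest
```

Bumping NESTING_LEVELS later only affects newly stored files, provided the level in use is recorded with each file's metadata as suggested above.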
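And a sketch of the two-table metadata layout just described. SQLite is used here purely as a stand-in so the snippet is runnable; the answer assumes SQL Server, where the artificial key would be an int IDENTITY column and the types would differ. All table and column names are my own illustration:

```python
import sqlite3

conn = sqlite3.connect("imagestore-metadata.db")

# One row per stored (hashed) file.
conn.execute("""
    CREATE TABLE IF NOT EXISTS stored_file (
        hash           TEXT PRIMARY KEY,    -- hex MD5 of the file contents
        size_bytes     INTEGER NOT NULL,
        format         TEXT,
        nesting_levels INTEGER NOT NULL     -- directory depth used when the file was stored
    )""")

# One row per logical file name that points at a stored file.
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_name (
        id            INTEGER PRIMARY KEY,  -- artificial key (an int IDENTITY in SQL Server)
        original_name TEXT NOT NULL,
        hash          TEXT NOT NULL REFERENCES stored_file(hash),
        date_added    TEXT NOT NULL,
        expires_at    TEXT                  -- plus whatever columns the expiry rules need
    )""")
conn.execute("CREATE INDEX IF NOT EXISTS ix_file_name_name ON file_name(original_name)")

# The maintenance pass described below then boils down to: any stored_file row (and the
# file on disk) whose hash is no longer referenced by any file_name row can be deleted.
orphaned = [row[0] for row in conn.execute(
    "SELECT hash FROM stored_file WHERE hash NOT IN (SELECT hash FROM file_name)")]
```

The expiry column here is only a placeholder; the question says the real deletion rules are driven by business logic, metadata and user interactions.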
Maintenance: This should be a scheduled task. Let Windows worry about when your task runs; that's less for you to debug and get wrong (what if you do maintenance every night at 2:30 AM and you're somewhere that observes summer/daylight saving time? 2:30 AM doesn't happen during the spring changeover). The service then runs a query against the database to establish which files are expired (based on the data stored per file name, so it knows when all references that point to a stored file are expired; any hashed file that is not referenced by at least one row in the file name table is no longer needed). The service would then go and delete those files.

I think that's about it for the major parts.

EDIT: My comment was getting too long, so I'm moving it into an edit:

Whoops, my mistake; that's what I get for doing math when I'm tired. In this case, if you want to avoid the extra redundancy of adding RAID levels (51 or 61, e.g. mirrored across a striped set), the hashing nets you the benefit of being able to slot five 1 TB drives into the server and then have the file storage software span the drives by hash, as mentioned at the end of the file storage section. You could even RAID 1 the drives for added security.

Backing up would be more complex, though the file system creation/modification times would still hold for doing this (you could have it touch each file to update its modification time whenever a new reference to that file is added).

I see a two-fold downside to going by date/time for the directories. First, it is unlikely the distribution would be uniform, which will cause some directories to be fuller than others; hashing distributes evenly. As for spanning, you could monitor the space on the drives as you add files and start spilling over to the next drive when space runs out. I imagine part of the expiry is date related, so older drives would start to empty as newer ones fill up, and you'd have to figure out how to balance that.

The metadata store doesn't have to be on the storage server itself. You're already storing file-related data in the database: as opposed to referencing the path directly from the row where it is used, reference the file name key (the second table I mentioned) instead.

I imagine users use some sort of web site or application to interface with the store, so the smarts to figure out where a file lives on the storage server would live there, and you'd just share out the roots of the drives (or do some fancy stuff with NTFS junctions to put all the drives under one subdirectory). If you're expecting to pull down a file via a web site, create a page on the site that takes the file name ID, performs the lookup in the DB to get the hash, breaks the hash up to whatever level is configured, requests the file over the share from the server, and streams it back to the client. If you expect a UNC to access the file, have the server just build the UNC instead.

Both of these methods make your end-user app less dependent on the structure of the file system itself, and will make it easier for you to tweak and expand your storage later.
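As a last sketch, here is the lookup flow from the final paragraph: given a file name ID, fetch the hash and nesting level from the metadata database, rebuild the relative path, and hand back a UNC. The share name is hypothetical, and the table names are the ones from the earlier schema sketch:

```python
import os
import sqlite3

UNC_ROOT = r"\\storage01\images"   # hypothetical share exported by the storage server


def resolve_unc(conn: sqlite3.Connection, file_name_id: int) -> str:
    """Turn a file_name.id into a UNC path using only metadata (no directory scanning)."""
    row = conn.execute("""
        SELECT s.hash, s.nesting_levels
        FROM file_name AS n
        JOIN stored_file AS s ON s.hash = n.hash
        WHERE n.id = ?
    """, (file_name_id,)).fetchone()
    if row is None:
        raise KeyError(f"no file with id {file_name_id}")
    file_hash, levels = row
    parts = [file_hash[i * 2:(i + 1) * 2] for i in range(levels)]
    return os.path.join(UNC_ROOT, *parts, file_hash)

# A web front end would do the same lookup, open the resulting path over the share,
# and stream the bytes back to the client instead of exposing the UNC directly.
```

Either way, the client application never depends on the on-disk layout, which is the point the answer closes on.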