LDraw.org Discussion Forums

Full Version: Compressed LDraw files
Hi all,

I'm thinking about supporting zipped LDraw files in LDCad without the need for users to unzip them first. The idea comes mainly from the fact that my new deform functionality tends to create quite large files. These files compress extremely well (1.5 MB becomes a couple of KB, etc.).

It would be almost trivial for me to implement an unzip before parsing and a zip before saving files in my editor. But I was thinking maybe other software authors would like to provide the same support.

So a semi-serious public discussion about how to name/handle compressed LDraw files might be in order. That way we could all use the same scheme.

I was thinking about using zlib (other authors could use any other zip supporting library) and simply prefixing "gz-" to the file extension, so e.g. "someBigModel.mpd" becomes "someBigModel.gz-mpd". The contents of the mpd itself stay unchanged.

I would recommend against using "gz" in the name, simply because most Windows users have no idea what it means. I'd say simply appending a 'z' would be better, e.g. mpdz or ldrz.

But are big files really an issue in this day and age? As far as I know, even downloads from most sites on the internet are transparently compressed (I could be wrong).

It sounds like a good idea, but you need to be careful with your terminology. Used alone, the term "zip" is generally interpreted to mean a PKZIP compatible multi-file archive (with .zip extension), and nothing else.

I personally agree that using zlib to create a gzip-compatible file is a good route to take, and would probably add support for this to a future LDView version if it became at all popular. However, if you go this route, make sure that the files you produce are actual gzip files (with valid gzip header at the beginning, plus checksum and uncompressed size at the end). The zlib library contains sample code for creating .zip files, but I don't feel these would be a good idea.
I agree that simply appending z to the file extension is better (and done in a number of other file formats that support having a gzipped version).
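For illustration, the "actual gzip file" requirement above (magic header at the start, CRC-32 checksum and uncompressed size at the end) is exactly what any gzip wrapper around zlib produces; a minimal Python sketch with made-up model content:

```python
import gzip

# Hypothetical file content; a real deform-generated MPD would be far larger.
ldraw_text = (b"0 FILE someBigModel.mpd\n"
              + b"1 16 0 0 0 1 0 0 0 1 0 0 0 1 3001.dat\n" * 1000)

# gzip.compress emits a complete gzip stream: 10-byte header starting with
# the 0x1f 0x8b magic bytes, the deflate body, then CRC-32 and size trailer.
compressed = gzip.compress(ldraw_text)
assert compressed[:2] == b"\x1f\x8b"  # gzip magic, so archive managers recognize it

restored = gzip.decompress(compressed)
assert restored == ldraw_text
```

Any tool linking zlib directly gets the same result via `gzopen`/`gzwrite`, which is what makes the files interoperable with OS archive managers.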
zip/gzip/zlib might be confusing indeed.

I'm using wxWidgets classes for handling zip files already (the .sf seed files), but the wx library also supplies similar classes for zlib-related compression. I'm assuming those classes take care of all the header and trailer details, so I was hoping to use them in a similar manner.

I agree it should be a proper gzip file so most os/archive managers know how to handle them automatically (most windows archive managers will offer to (right click) extract them even when the extension is non standard). I will have to make sure wxWidgets' class generates such files, otherwise I'm going to use zlib directly.
If z at the end is more common I'm OK with that, no need to introduce even more confusion :)

On the drive-size issue Tim raised: I agree with you in general, but having bought an SSD recently I also hate to see (any) space wasted. And because I prefer quality over smaller file sizes, (optional) compression seemed like a nice all-round solution.
Don't get me wrong, I think it's not a bad idea if it takes off and would like to see support for it. Can always uncompress and edit.

However I'd definitely recommend against making it the default setting.

May I suggest to use concatenated extensions?

so the model mymodel.ldr, when compressed, becomes mymodel.ldr.gz or mymodel.ldr.zip or mymodel.ldr.7z etc.

this way, the files can be normally opened using the normal compression programs.
also, e.g. the Windows shell will treat them properly.

your tool might recognize these double-extensions and directly handle them.

the second portion indicates the compression standard that has been used.
The problem with double extensions is that they have essentially zero support in any operating systems. The operating system sees the final extension, and nothing else, so the files become solely compressed files. I think Roland wants the files to be "compressed LDraw" files. That way, when a user double-clicks on one, it opens LDCad automatically. It's not appropriate for LDCad to claim to open all .gz files, because it can't do that; it can only open these specific ones. So if double extensions were used, LDCad could not be reasonably made the proper handler for them.
My 0.02: I like the .mpz idea - a changed suffix to indicate to casual users that:
1. This isn't the .mpd format you are used to - don't open it in WordPad and
2. Your favorite editor might not understand it.
It's cute for developers that the files are 'secretly' standard compression formats and thus very easy to process, but that's a power user feature.

(With X-Plane we went the opposite route and allowed 7z archives with _no_ change of suffix. When the system handles them transparently it's great, but when an old tool fails to understand why the file is filled with junk, the user becomes confused because the extension doesn't look different.)
I admit I am skeptical of format proliferation. If this isn't saving many megabytes, I don't think it's worth it. The cost of not being able to share files with other software is high.

But if you must do it, please use only one extension. And make it ldrz. (No datz, no mpdz, etc.) The fact that LDraw files already use four extensions for the exact same content is crazy. Given that we just made MPD support mandatory and re-affirmed that all extensions are valid for all files, it's time to let .ldr shine as the branding vehicle it was meant to be.

While we're here talking about compression, has anybody tested the performance of using complete.zip directly to load official parts, instead of looking for them on the file system? My gut instinct is that it could actually be faster on modern systems, but I'm not sure I want to put in the effort necessary to test this when it's such an open question.

Note: I'm not asking this as a way to save disk space, but as a way to improve performance on a non-SSD drive.
Latest version of LeoCAD defaults to a zipped LDraw library, but can be configured to use a normal folder. And it IS faster with the zipped library.
Also, some antivirus/malware-scanning (SMTP) software will complain about double-extension files/attachments.

edit: on a side note I might support opening ldr.gz anyway but only from a manual file open.
I would actually prefer using it with the normal extension, but that would break backwards compatibility and would probably even crash a couple of programs if they tried to load it as a normal text file.
Allen Smith Wrote:But if you must do it, please use only one extension. And make it ldrz. (No datz, no mpdz, etc.) The fact that LDraw files already use four extensions for the exact same content is crazy.

Didn't even think about that; I was just thinking to add an extra something to the extension, so you could easily extract the original name, whatever it might be.

The main reason for this is that other files will still reference the compressed file using its 'normal' filename; that way you can change compression (on/off) for any file without breaking tons of dependencies. The downside is an extra locate operation (also testing for the z variant if the normal one is not found in a search location).
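That extra locate operation could be sketched like this; the helper name and lookup order are assumptions for illustration, not anyone's actual implementation:

```python
import os
import tempfile

def resolve_file(search_dirs, name, z_suffix="z"):
    """Hypothetical lookup order: try the plain filename first, then the
    compressed variant (extension + 'z', e.g. car.mpd -> car.mpdz) in each
    search location, so type 1 references never need to change."""
    for d in search_dirs:
        plain = os.path.join(d, name)
        if os.path.isfile(plain):
            return plain, False            # found uncompressed
        if os.path.isfile(plain + z_suffix):
            return plain + z_suffix, True  # found compressed variant
    return None, False

# Demo: only the compressed variant exists on disk.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "someBigModel.mpdz"), "wb").close()
found, is_compressed = resolve_file([tmp], "someBigModel.mpd")
```

The reference in the parent file stays "someBigModel.mpd" either way, so toggling compression never breaks dependencies.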
Oh please... learn from my fail. :-) Compressing with no suffix change was definitely a mistake in X-Plane, not just because of the compatibility loss, but because of the confusion that ensues when a compatibility problem crops up.
We'll get a speed win on a large collection of files with old HDs by using an archive format like PKZip to cut seek time when opening many small files.

But that also increases the potential scope of supporting compression... the scoping rules for locating parts vs. MPD vs. the file system aren't particularly well specified now, and adding archives would further complicate the situation. Does a program have to search archive tables of contents to find any LDraw file, and how would the program know which archive to search?
Actually, scoping won't be such a big problem; it's very similar to MPD and just adds to the stack of locations to search.

But compressing multiple files into a single archive is a completely different thing from what I was (initially) suggesting in this thread. Although it might solve the issue Tim raised (allowing only a single new extension), it wouldn't be a trivial thing to implement anymore, though :)

I think I'm going to push support for any compressed files to LDCad 1.3 for now, have to figure out the best way to go first (compressed container and or single compressed files).

Although I'm actually already leaning towards supporting the full-blown 'container' approach; it would be very cool for things like Datsville etc. (LDCad currently can't multithread-load an MPD, so many ldr's in a zip would be better).
By changing the file format, you automatically break searching done by any non-compression-savvy program. Since it's a breaking change, you get to define whatever search path logic makes sense. Thus this:

example.mpd ➔ search for "example.ldrz"

is just as valid as this:

example.mpd ➔ search for "example.mpdz"
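The single-extension replacement rule is trivial to sketch (the helper name here is hypothetical):

```python
import os

def compressed_name(ref):
    """Sketch of the single-extension scheme: whatever extension the type 1
    line references (.ldr, .dat, or .mpd), the compressed variant searched
    for is always .ldrz."""
    stem, _ext = os.path.splitext(ref)
    return stem + ".ldrz"

print(compressed_name("example.mpd"))  # example.ldrz
print(compressed_name("example.ldr"))  # example.ldrz
```

Since the format change is breaking anyway, the mapping is free to collapse all source extensions onto one.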
Now this is a *very* interesting idea. Are there any off-the-shelf libraries to make it easy?

The standard zlib library comes with minizip as sample code in the contrib/minizip directory of the source distribution. The zlib license is extremely permissive. LDView 4.2 will be using this on the Mac instead of a system call to unzip. Take a look at the section inside #ifdef HAVE_MINIZIP of the following C++ class from LDView:


The above unzips complete.zip onto the file system, but to do that it scans the file list from the zip file and stores the results in a std::map object. If you look at scan() and extractFile(), you should have relatively clear sample code that you can convert to Objective C for loading files from a zip directly into memory. (My code uses a fixed size buffer, which is incrementally filled with the file contents and immediately written to disk, but it's trivial to modify it to read an entire file into an NSMutableArray or BYTE *.)
As far as I know zip files are not seekable, so you have to decompress all files to get to a certain one. The only way it would give any performance boost is if you keep the entire library in memory, like a ramdrive.

Taking this into account, I highly doubt it would be faster than setting up a decent file-list cache and only loading files upon request. In short, I'm wondering if the 20MB decompress and huge memory overhead 'wins' over smart file-list caching (so you don't have to chat with the OS for every file-location resolve, and essentially only have the file-read overhead).
That is incorrect. The table of contents data for the zip is at the end of the file, which means that if you have a partial file you're out of luck. However, the table of contents of the zip includes offsets to the compressed data for every file in the zip. My code above uses this offset to "seek" to the right place for each file (using the unzSetOffset() from minizip), then decompresses the file. I have written code for other apps that unzips any arbitrary file from a zip directly into memory.
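The random access described above is easy to see with any zip library; a minimal Python sketch using the standard zipfile module (the archive and member names are made up, standing in for complete.zip):

```python
import io
import zipfile

# Build a small in-memory archive with 100 "part files".
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    for i in range(100):
        z.writestr(f"parts/{i}.dat", f"0 part {i}\n" * 50)

# Reopening reads only the central directory at the end of the archive;
# each entry records the offset of its compressed data, so one member can
# be decompressed directly without touching the other 99.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()           # the table of contents
    data = z.read("parts/42.dat")  # random access to a single member
```

This mirrors what minizip's `unzSetOffset` does in C: seek to the stored offset, then inflate just that one member.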
Hmm, this seems to be an advantage of (pk)zip (or some option that's off by default in other formats) then, because whenever I need to extract a single file from a (huge) rar or 7zip archive the archive manager always takes forever, decompressing everything up to the wanted file.

Might also be the reason zip's are larger though.
One reason that zips are larger than compressed tar files is that each file in the zip is individually compressed. For this reason, a gzipped tar file will usually be smaller than a zip of the tar file's contents. A tar.gz or tar.bz2 is just a full tar file that has been run through gzip or bzip2 compression, so cross-file compression works there.

I'm not sure how 7-zip behaves with respect to cross-file compression, but I would think cross-file compression could in theory be possible without requiring all the intervening files to be decompressed in order to get to a specific file in the archive. Doing so might require a great deal more work, though, and might increase the size of the archive.
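The per-file vs. cross-file difference can be demonstrated with zlib alone; a Python sketch with made-up, highly similar part content (zip-style compresses each member separately, tar.gz-style compresses the concatenation):

```python
import zlib

# Two hypothetical near-identical part files.
a = b"1 16 0 0 0 1 0 0 0 1 0 0 0 1 3001.dat\n" * 500
b = b"1 16 0 8 0 1 0 0 0 1 0 0 0 1 3001.dat\n" * 500

# zip-style: each member compressed on its own, no sharing between files.
per_file = len(zlib.compress(a, 9)) + len(zlib.compress(b, 9))

# tar.gz-style: one stream over the concatenation, so the second file can
# reference redundancy from the first.
cross_file = len(zlib.compress(a + b, 9))
```

With redundant inputs like these, `cross_file` comes out smaller than `per_file`, at the cost of losing the per-member random access discussed above.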

I just ran a test with zip test code I have at work, and it took around 100 msec to scan the table of contents from the 2012-03 complete.zip and store the information in a std::map. This only needs to be done once at program startup (and again any time the user changes their part library to a different one, of course).
I looked at compression efficiency when we were selecting a compression format for X-Plane scenery data. When maxing out compression (not always a good idea as it increases encode time significantly) I found that from worst to best compression ratio the list went:

pkzip, bzip2, rar, and 7zip.

Rar is heavily license-entangled, so it's not really an option. We used 7zip because it had the best ratio, but for LDraw, I'd recommend pkzip because of the wide availability of simple, clean, portable compression/decompression utils. (7zip's quite a bit messier.)

I also found that compression ratios were very sensitive to source content. Our file format is a heavily bit-packed binary format, and it has optional RLE encoding in some cases. I found that with 7zip, the RLE encoding significantly worsened 7zip's compression ratio... the moral of the story was: let the experts do the compression.

So...I suspect that the compression ratio differences are coming from iteratively superior compression algorithms (no surprise the oldest tech does the worst) but YMMV based on file contents.

Honestly, the more I read this discussion, the less valuable compression support seems relative to other possible development work. My LDraw installation is 138 MB on disk. By comparison, LDD is using 562 MB on disk, iTunes 253 MB, Keynote 562 MB, and Firefox 184 MB. My MOC airport is only 6.6 MB on disk for 22,000 bricks and months of my life.

I'm not saying that this is okay or that I'm happy that my web browser is approaching nearly 200 MB in size, but it seems to me that the data sizes the LDraw software suite deals with are quite tame. :-)

Anyway, sorry for the rant...my take-away point is that given multiple options, I would suggest we optimize for developer convenience and not storage efficiency, since that last few % improvement from pkzip to 7zip aren't going to turn into a huge user experience win, but I think we are limited on volunteer dev time.
Allen Smith Wrote:By changing the file format, you automatically break searching done by any non-compression-savvy program. Since it's a breaking change, you get to define whatever search path logic makes sense.

Sorry, missed this post.

Yes, but by using the file-extension addition and leaving the type 1 lines unchanged, non-supporting apps will simply report a missing part, whereas referring to the z variant in the type 1 line might potentially crash apps when they expect text files.

So the 'not breaking dependencies' is more in the sense that a supporting app will find the file, compressed or not, without having to change all references every time you switch between compressed and plain.
You are right; as a result of this discussion I've already pushed the compression issue to my 'nice to have' list.

Plenty of more urgent things to do for me, like getting a public test version of LDCad with the flexible stuff done (something I hoped to finish weeks ago :) )
Allen's point was that instead of your LDraw parser appending a z to the filename in the LDraw file when searching for a compressed version of a missing file, it would always replace the extension included in the LDraw file code with ldrz, and then there will only be one new extension. I agree with Allen that adding z to the existing extensions (dat, mpd, ldr) is a bad idea, and you should always use a single one (probably ldrz).

As an additional note, the gzip format supports including the original filename inside the header of the gzip file, so you could make it so that any time you decompress the file and save that to the disk, it would revert to its original filename.
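The FNAME field mentioned above is easy to sketch with Python's standard gzip module (the model name here is made up); the original filename lands in the gzip header, where a decompressor can recover it:

```python
import gzip
import io

buf = io.BytesIO()
# Passing filename= stores that name in the gzip FNAME header field.
with gzip.GzipFile(filename="someBigModel.mpd", mode="wb", fileobj=buf) as f:
    f.write(b"0 FILE someBigModel.mpd\n")

raw = buf.getvalue()
assert raw[3] & 0x08  # FLG byte has the FNAME bit set
# FNAME is a zero-terminated string immediately after the 10-byte header.
stored = raw[10:raw.index(b"\x00", 10)]
assert stored == b"someBigModel.mpd"
```

So a `someBigModel.ldrz` could decompress back to its original `someBigModel.mpd` name without any filename convention at all.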
In light of this discussion I did a little tweaking on LDMakeList to make it easier to adapt to work with .zip files (and also to make it a little less ad hoc).

If we ever do decide to start accessing e.g. Complete.zip rather than expanded directories, I'll do my best to support it.

ok, good point, I retract the suggestion
good idea, +1
Accessing the library files inside the compressed file in memory could, on modern systems, be faster than file-system access.