Activeloop (YC S18) Is Hiring Senior Back End and AI Search Engineers(Onsite, MV) (careers.activeloop.ai)

I have used s5cmd in a professional setting and it works wonderfully. I have never attempted to test performance to confirm their claims, but as an out of the box client, it is (anecdotally) significantly faster than anything else I have tried.

My only headache was that I was invoking it from python, and it does not have bindings, so I had to write a custom wrapper to call out to it. I am not sure of the difficulty of adding native support for Python, but I assume its not worth the squeeze and just calling out to a subprocess will work for most user's needs.

therealmarv · 1d ago

Very interesting. And this is an amazing graph for small file uploading/download speed improvement. I have the feeling that all cloud drives are really not optimised for many small files like smaller than 1Mbytes in average.

https://raw.githubusercontent.com/peak/s5cmd/master/doc/benc...

I've implemented at work once a rudimentary parallel uploading of many small files to S3 in Python and with boto3 (was not allowed to use a third party library or tool at that time) because it's soooo slow to upload many small files to S3. It really takes ages and even if you just upload 8 small files in parallel it makes a huge difference.

random_savv · 1d ago

Most S3 systems are terrible with small files (~50KB). 1MB can still be ok though

rsync · 1d ago

If you don't want to install, or maintain, s5cmd yourself:

  ssh user@rsync.net s5cmd
  
  Password:
  NAME:
  s5cmd - Blazing fast S3 and local filesystem execution tool
  
  USAGE:
  s5cmd [global options] command [command options] [arguments...]

If you move data between clouds, you don't need to use any of your own bandwidth ...

rdg42 · 1d ago

s5cmd is not listed here https://www.rsync.net/resources/howto/remote_commands.html ? :-)

vasco · 1d ago

It either was added in the meantime or it already was there

> Backup Tools

> attic, borg (usage notes), rclone, unison, s5cmd and git are installed (on our server side).

> You would most likely call these commands by running these tools locally and connecting to rsync.net with them.

Galanwe · 1d ago

> For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s)

I'm surprised by these claims. I have worked pretty intimately with S3 for almost 10 years now, developed high performance tools to retrieve data from it, as well as used dedicated third party tools for performant file download tailored for S3.

My experience is that individual S3 connections are capped over the board at ~80MB/s, and the throughput of 1 file is capped at 1.6GB/s (at least per ec2 instance). At least I have never managed myself nor seen any tool capable of going beyond that.

My understanding is then that this benchmark's claims of 4.3GB/s are across multiple files, but then it would be rather meaningless, as it's free concurrency basically.

nckslvrmn · 1d ago

I cant verify 40Gbps because I have never had access to a pipe that fast, but when I ran this tool on an AWS instance with a 20Gbps connection, it saturated that easily and maintained that speed for the duration of the transfer.

Galanwe · 1d ago

I just spawned a r6a.16xlarge with a 25gbps NIC, created a 10GB file which I uploaded to an S3 bucket in the same region, through a local S3 VPC endpoint.

Downloading that 10GB file to /dev/shm with s5cmd took 24s, all while spawning 20 or so threads which were all idling for IO.

The same test using a Python tool (http://github.com/NewbiZ/s3pd) with the same amount of workers took 10s.

Cranking up the worker count of the latter library until there is no more speedup, I can reach 6s with 80 workers. That is, 10/6 = 1.6GB/s, which seems to confirm my previous comment.

What am I doing wrong ?

Galanwe · 23h ago

Okay I found the trick, buried in the benchmark setup of s5cmd.

The claimed numbers are _not_ reached with S3, but rather from a custom server emulating the S3 API, hosted on the client machine.

I think this is very misleading, since these benchmark numbers are not reachable in any real life scenario. It also shows that there is very little point in using s5cmd compared to other tools, since beyond 1.6GB/s the throttling will be from S3, not from the client, so any tool able to saturate 1.6GB/s will be enough.

bob1029 · 1d ago

I pushed a very large file to S3 the other day from my local machine and I was barely seeing 90 megabytes/s with 5gbps of upstream available.

I wonder what the limit is for uploads inside AWS. I've never been able to afford EC2 instances with the requisite networking or I/O performance.

diroussel · 1d ago

The S3 API allows requests to read a byte range of the file (sorry , object). So you could have multiple connections each reading a different byte range. Then the ranges would need to be written to the target local file using a random access pattern.

Galanwe · 1d ago

I know that already... and it is exactly what I tested and confirmed here https://news.ycombinator.com/item?id=44249137

You can spawn multiple connections to S3 to retrieve chunks of a file in parallel, but each of these connections is capped at 80MB/s, and the whole of these connections, while operating on a single file, to a single EC2 instance, is capped at 1.6GB/s.

beastman82 · 1d ago

It works with single files

Also you should up your chunk size to limit tps

Source: running s5cmd in prod for 1yr+

hknlof1 · 1d ago

It is definitely possible. Depends on how many 503s you might need to hit and when there heuristic model allows you to increase throughput. As a customer, you can also push for higher dedicated throughput.

jeffbee · 1d ago

I think that's the point of this tool, that pipelining and parallelism are necessary to get high performance out of S3.

quodlibetor · 1d ago

I recently wrote a similar tool focused more on optimizing the case of exploring millions or billions of objects when you know a few aspects of the path: https://github.com/quodlibetor/s3glob

It supports glob patterns like so, and will do smart filtering at every stage possible: */2025-0[45]-*/user*/*/object.txt

I haven't done real benchmarks, but it's parallel enough to hit s3 parallel request limits/file system open file limits when downloading.*

BlackLotus89 · 1d ago

If you want parallel upload/download over s3 and other protocols (GDrive, ftp, WebDAV, ...) you can use rclone.

For s3 mounts I would use geesefs.

Have to later take a look at s5cmd as well...

StackTopherFlow · 1d ago

Awesome work. I love seeing a community project step up and make a better solution than a multi trillion dollar company.

remram · 1d ago

What is meant by "filesystem execution" here? Is it just an S3 file transfer tool?

nodesocket · 1d ago

I posted a link to s5cmd in the other post about s3mini. Really want to try it out migrating a few TBs of s3 data.

Roundtable (YC S23) Is Hiring a President / CRO (ycombinator.com)

Roame (YC S23) Is Hiring (ycombinator.com)

GauntletAI (YC S17): All expenses paid AI training and guaranteed $200k+ job (gauntletai.com)

SchemeFlow (YC S24) Is Hiring an Engineer (London) to Speed Up Construction (ycombinator.com)

Shaped (YC W22) Is Hiring (ycombinator.com)

Spice Data (YC S19) is hiring a software engineer – back end (ycombinator.com)

Onlook (YC W25) Is Hiring an engineer in SF

OneText (YC W23) Is Hiring a DevOps/DBA Lead Engineer (jobs.ashbyhq.com)

Gander (YC F24) Is Hiring Founding Engineers and Interns (ycombinator.com)

Ziina (YC W21) the Series A fintech is hiring product engineers (ziina.notion.site)

Onyx (YC W24) – AI Assistants for Work Hiring Founding AE (ycombinator.com)

Great Question (YC W21) Is Hiring a Director of Customer Success (ycombinator.com)

Deepnote (YC S19) is hiring engineers to build an AI-powered data notebook (deepnote.com)

Converge (YC S23) Well-capitalized New York startup seeks product developers (runconverge.com)

CircuitHub (YC W12) is hiring full-stack robotics engineers (workatastartup.com)

AtoB (YC S20) – Stripe for Transportation – is hiring engineers (jobs.ashbyhq.com)

PromptArmor (YC W24) Is Hiring in San Francisco (ycombinator.com)

Depot (YC W23) is hiring an enterprise support engineer (UK/EU) (ycombinator.com)

Patched (YC S24) Is Hiring SWEs in Singapore (ycombinator.com)

Activeloop (YC S18) Is Hiring Senior Back End and AI Search Engineers(Onsite, MV) (careers.activeloop.ai)

Morph (YC S23) Is Hiring a ML Engineer

Spark AI (YC W24) Is Hiring a Full Stack Engineer in San Francisco (ycombinator.com)

Demodesk (YC W19) Is Hiring Rails Engineers (demodesk.com)

Piramidal (YC W24) Is Hiring a Senior Full Stack Engineer (ycombinator.com)

AccessOwl (YC S22) is hiring an AI TypeScript Engineer to connect 100s of SaaS (ycombinator.com)

StackAI (YC W23) Is Looking for SWR and Tailwind Wizards (ycombinator.com)

Weave (YC W25) is hiring a founding engineer (ycombinator.com)

Infisical (YC W23) Is Hiring Full Stack Engineers (TypeScript) in US and Canada (ycombinator.com)

GoGoGrandparent (YC S16) is hiring Back end Engineers

Roundtable (YC S23) Is Hiring a Member of Technical Staff (ycombinator.com)

Diligent (YC S23) Is Hiring a Founding AI Engineer (ycombinator.com)

S5cmd: Parallel S3 and local filesystem execution tool

Comments (23)