AWS/EBS/C5 class instance: cannot create file: Read-only file system
Is AWS a sane choice of a Cloud for mission-critical infrastructure
The Problem — read only file system on a C5 class instance.
So here is what happened earlier today.
One of our C5 instances suddenly became read only.
Since most services write to disk, the instance essentially became completely useless.
I desperately searched for answer on AWS forums... to no avail.
Despite finding several threads describing my exact problem, I quickly realized that NONE of the threads contained a solution!!!!! Now single customer who reported the problem said on the forum — "Yay!", it worked!
Because there were no suggestions besides increasing nvme io_timeout from 30 seconds to some arbitrary large number. I did that, rebooted the instance, and nothing changed. And the io_timeout was reset back to 30 seconds.
Well, that was a flop. Nice try AWS. Next time — try a bit harder, will you?
Now, if you are experiencing the same problem I suggest you run the following command:
$ mount -l | grep nvme /dev/nvme0n1p1 on / type ext4 (ro,relatime,data=ordered) [cloudimg-rootfs]
What you see here is that the primary EBS volume was mounted as
ro — meaning read only.
If you issue the same command on a healthy machine, you should see
rw, instead of
ro, meaning, of course, "read-write".
OK, so why would a file system on Linux become read-only? Could it be some data corruption? Bad blocks?
That's a possibility. Unlikely, since they are supposed to be using SSDs that are more reliable than HDDs. At least that's what SSD manufacturers want you to believe.
Anyway, from my early Linux days (I installed my first Linux in 1996, I think, from 80 floppy disks, really), I remember the über helpful
fsck command. If you type
fsck<TAB><TAB> you will see that there are a bunch of them:
$ fsck<TAB><TAB> fsck fsck.cramfs fsck.ext3 fsck.ext4dev fsck.minix fsck.nfs fsck.xfs fsck.btrfs fsck.ext2 fsck.ext4 fsck.fat fsck.msdos fsck.vfat
OK, so we just need to figure out the file system — and then run it. Right above, where we did
mount -l you will notice that the file system type is
ext4. Alrighty then. We have our
Let's run it and see what options does it have:
fsck.ext4 Usage: fsck.ext4 [-panyrcdfvtDFV] [-b superblock] [-B blocksize] [-I inode_buffer_blocks] [-P process_inode_size] [-l|-L bad_blocks_file] [-C fd] [-j external_journal] [-E extended-options] device Emergency help: -p Automatic repair (no questions) -n Make no changes to the filesystem -y Assume "yes" to all questions -c Check for bad blocks and add them to the badblock list -f Force checking even if filesystem is marked clean -v Be verbose -b superblock Use alternative superblock -B blocksize Force blocksize when looking for superblock -j external_journal Set location of the external journal -l bad_blocks_file Add to badblocks list -L bad_blocks_file Set badblocks list
Alright — I love seeing something called "automatic repair". Since this machine is dead in the water, what am I going to loose?
Let's run this sucker.
$ fsck.ext4 -p /dev/nvme0n1p1 cloudimg-rootfs contains a file system with errors, check forced. cloudimg-rootfs: Deleted inode 1567791 has zero dtime. FIXED. cloudimg-rootfs: ***** REBOOT LINUX ***** cloudimg-rootfs: 568214/5120000 files (0.1% non-contiguous), 3412081/10485499 blocks
OMG!! Something got fixed?
This command run very quickly and found a bad inode, apparently with zero dtime, whatever that means. I am not about to go into the details of file systems, but this output looks promising to me.
So fuck it, let's reboot. See what happens.
sync; sync; reboot
NOTE: this how I recommend you always reboot you instances. Double
reboot. Why this is so is outside the scope of this post.
The box rebooted very quickily and some 30 seconds later I was able to SSH into the machine and vola!
No more read-only root partition, all services boot, and everything is back to normal. Whoohooo!
How Not to Run Support Forums
Now, I think to myself, I can share this wisdom on AWS Forums! And other people can benefit from this!
Not so fast, says AWS (allegorically, not literaly) they spell out for me:
Dear Konstantin, we are going to fuck you in a few more ways before you can be "helpful" to our customers. They really don't need your help, we got it covered. Really, fuck off.
What they actually said verbatim is this:
Your account is not ready for posting messages yet. See the following article for details: https://aws.amazon.com/premiumsupport/knowledge-center/error-forum-post/.
So now I am sitting on a solution, but I can't even share it with other AWS customers, because apparently my account is not activated. It says something stupid like "you have to login to activate it". I am already logged in! OK, fine, I'll log out and log back in again. Nope, same problem.
This is infuriating beyond belief. Hence, I take it to my own blog to bitch about it, and in the process, perhaps help a few people.
I personally think AWS is a behemoth not worth wasting your time on.
I believe in clouds that are flexible and do not lock you into their proprietary and expensive solutions.
I believe in a cloud that helps their customers and quickly resolves their problems.
I used to be on such a cloud, and it was a paradise. But my current employer was already on AWS, and so I deal with it.
Are you curious about what cloud does not screw with their customers, offers incredibly competitive pricing, and is the only cloud that can run Docker Containers natively on the bare metal?
It's Joyent — now a subsidiary of Samsung, little known but the most amazing cloud I've worked with.
But even if I wasn't on Joyent, I would probably choose Google Cloud. I just trust Google a whole lot more than I trust Amazon after my experience.
You can decide what's best for you, but don't forget, AWS Forums are not very helpful it turns out.
Please leave a comment if you found the solution here, and not on AWS support site. I would really appreciate it!