Migrating 1TB of data from SharePoint to S3

I was recently tasked with archiving ~1TB of files from a hosted Microsoft SharePoint site to Amazon S3. At that size, manually copying files was out of the question. I don't know much about SharePoint, so I went searching for ways to download the files via an API.

Attempt #1: SharePoint REST API (multi-threaded)

I ended up on Office365-REST-Python-Client and asked the SharePoint administrator to set up app-only access and generate a client ID and secret I could use to authenticate.

After a few false starts, I got a script working that could iterate the directories and download the files, using multiple threads so it could fetch more than one file at a time. I spun up a temporary EC2 instance and ran the script in a screen session.

The next day, I checked the progress. The script had crashed, and a number of the files were empty. Eventually I discovered that Office365-REST-Python-Client is not thread-safe, so downloading files from multiple threads was never going to work.

Attempt #2: SharePoint REST API (no threads)

I adjusted the script to remove the multi-threading and added a check that each downloaded file's size matched the size reported by the SharePoint API (the Length field). I ran this for a bit and got better results, but noticed some file sizes still didn't match. Some internet searching suggested that SharePoint might include metadata in its file size calculation, so I wrote off that issue and let the downloader run overnight again.
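
The core of that single-threaded loop looked roughly like this. This is a sketch, not my exact script: the folder path and mismatch handling are placeholders, the method names follow Office365-REST-Python-Client's API as I understand it, and download_folder needs network access plus an authenticated ClientContext to actually run.

```python
import os

def size_matches(local_path, reported_length):
    # Compare the downloaded file's size on disk to the Length the API reported.
    return os.path.getsize(local_path) == int(reported_length)

def download_folder(ctx, server_relative_url, local_dir):
    # ctx is an authenticated Office365-REST-Python-Client ClientContext.
    # Downloads every file in one folder, one at a time (no threads).
    folder = ctx.web.get_folder_by_server_relative_url(server_relative_url)
    files = folder.files.get().execute_query()
    for f in files:
        local_path = os.path.join(local_dir, f.name)
        with open(local_path, "wb") as fh:
            f.download(fh).execute_query()
        if not size_matches(local_path, f.length):
            # Placeholder: log it, re-download it, or flag it for review.
            print(f"size mismatch: {local_path}")
```

Recursing into subfolders works the same way via the folder's folders collection; I've left that out to keep the sketch short.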

In the morning it had crashed again because of failed API calls. I added some retry logic and still hit the same issue. It turns out SharePoint throttles API calls and eventually starts returning 429 (Too Many Requests) status codes. You can implement logic to catch these and honor the Retry-After header before retrying, but at this point I started to think I was taking the wrong route.
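
If you do want to go down that road, the retry logic can be sketched like this. Throttled here is a stand-in for however your HTTP client surfaces a 429 (it is not part of any SharePoint library), and the exponential fallback kicks in when the server doesn't send a usable Retry-After value.

```python
import time

class Throttled(Exception):
    # Stand-in for a 429 response; carries the server's Retry-After hint in seconds.
    def __init__(self, retry_after=0):
        self.retry_after = retry_after

def call_with_backoff(fn, max_attempts=5, sleep=time.sleep):
    # Retry fn on throttling, sleeping for the Retry-After hint when present,
    # otherwise falling back to exponential backoff (1s, 2s, 4s, ...).
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the failure
            sleep(exc.retry_after if exc.retry_after else 2 ** attempt)
```

The injectable sleep argument is just there to make the backoff testable without actually waiting.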

Attempt #3: rclone

I've used rclone with a lot of success in the past. I had looked at it before trying the API, but it required a user account on the SharePoint instance, which I didn't have at the time. I went back to the admin and requested one. With that set up, I was able to configure WebDAV access to SharePoint. The configuration looked like this:

type = webdav
url = https://client.sharepoint.com/sites/SiteName
vendor = sharepoint
user = pbaumgartner@example.com
pass = ***

Keep in mind the pass value is not stored in plain text; rclone obscures it when you set it via the rclone config command.

I tried testing it via rclone ls sharepoint-webdav: which failed for me with:

2022/09/14 22:22:16 ERROR : : error listing: couldn't list files: 403 FORBIDDEN: 403 Forbidden
2022/09/14 22:22:16 Failed to ls with 2 errors: last error was: couldn't list files: 403 FORBIDDEN: 403 Forbidden

Eventually I figured out that my configuration was correct, but I had to specify a directory to list. When I ran rclone ls sharepoint-webdav:DirectoryName, it worked.

So I ran this overnight:

rclone copy --verbose --ignore-size --ignore-checksum \
 sharepoint-webdav:DirectoryName /path/to/local/DirectoryName

The next morning, I saw more errors. It turns out SharePoint's rate limiting is fairly sensitive, and by default rclone issues a fair amount of concurrent requests (4 simultaneous transfers and 8 checkers).

Attempt #4: rclone in shackles

To minimize throttling, I passed some additional flags to keep rclone from hitting the SharePoint API too aggressively:

rclone copy --verbose --ignore-size --ignore-checksum --checkers 1 --transfers 1 --update --user-agent 'NONISV|OrgName|rclone/1.53.3' \
 sharepoint-webdav:DirectoryName /path/to/local/DirectoryName

...and it worked!

With the files copied to the temporary instance, I could then sync them up to S3 with the aws CLI. It's worth noting that I could have done this all in one go with rclone, but I tried that once before when copying from an SFTP endpoint to S3, and it was considerably slower than running rsync and aws separately. I assume it had to do with the remote checks rclone performs to determine whether each file already exists, but I didn't bother to troubleshoot. Having the files locally as an intermediate stop also made it easier to spot-check and verify them before shipping them off to S3.
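
I used the aws CLI for that last hop, but for completeness, the same upload could be scripted with boto3. The bucket and prefix below are placeholders, and upload_tree needs AWS credentials and network access to actually run.

```python
import os

def local_to_s3_key(local_root, path, prefix):
    # Map a local file path to an S3 key under the given prefix,
    # normalizing OS path separators to "/".
    rel = os.path.relpath(path, local_root)
    return "/".join([prefix.rstrip("/")] + rel.split(os.sep))

def upload_tree(s3, local_root, bucket, prefix):
    # s3 is a boto3 S3 client, e.g. boto3.client("s3").
    # Walks local_root and uploads every file it finds.
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            path = os.path.join(dirpath, name)
            s3.upload_file(path, bucket, local_to_s3_key(local_root, path, prefix))
```

Unlike aws s3 sync, this sketch re-uploads everything unconditionally; for an archive job run once, that's usually fine.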