I was recently tasked with archiving ~1TB of files from a hosted Microsoft SharePoint site to Amazon S3. At that size, manually copying files was out of the question. I don't know much about SharePoint, so I went searching for ways to download the files via an API.
Attempt #1: SharePoint REST API (multi-threaded)
After a few false starts, I got a script working that could iterate the directories and download the files, using multiple threads so it could fetch more than one file at a time. I spun up a temporary EC2 instance and left the script running.
The next day, I checked the progress. The script had crashed and a number of the files were empty. Eventually I discovered that Office365-REST-Python-Client is not thread-safe, so attempting to download files from multiple threads wouldn't work.
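The directory iteration itself was the easy part. Here's a minimal sketch of the recursive walk, with the SharePoint client calls abstracted behind two callables so the traversal logic doesn't depend on the (not thread-safe) library; these names are my own, not from Office365-REST-Python-Client:

```python
# Sketch of a depth-first walk over a remote folder tree. The
# SharePoint-specific calls are injected as callables, so this can be
# exercised against a fake tree without any network access.

def walk_files(folder, list_subfolders, list_files):
    """Yield every file path under `folder`, depth-first.

    list_subfolders(folder) -> list of subfolder paths in that folder
    list_files(folder)      -> list of file paths in that folder
    """
    for path in list_files(folder):
        yield path
    for sub in list_subfolders(folder):
        yield from walk_files(sub, list_subfolders, list_files)
```

In the real script, the two callables would wrap the library's folder-listing calls, and the consumer of the generator would download each yielded path.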
Attempt #2: SharePoint REST API (no threads)
I adjusted the script to remove the multi-threading and added a check to verify that each downloaded file's size matched the size reported by the SharePoint API (the Length field). I ran this for a bit with better results, but noticed some file sizes still didn't match. Some internet searching suggested that SharePoint might include metadata in its file size calculation, so I wrote that issue off and let the downloader run overnight again.
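The check itself is trivial; a sketch of what I mean, assuming you've already pulled the Length value from the file's metadata:

```python
import os


def size_matches(local_path, expected_length):
    """Compare the on-disk size to the Length reported by the SharePoint
    API. Returns False for missing files so callers can re-queue them."""
    if not os.path.exists(local_path):
        return False
    return os.path.getsize(local_path) == int(expected_length)
```

A mismatch (or a missing file) would flag the path for re-download on the next pass.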
In the morning it had crashed again because of failed API calls. I added some retry logic and still hit the same problem. It turns out SharePoint throttles API calls and eventually starts returning 429 (Too Many Requests) status codes. You can implement logic to catch these and work out when to retry, but at this point I started to think I might be taking the wrong route.
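If you do go that route, the usual shape is to honor the Retry-After header when the server sends one, and fall back to capped exponential backoff otherwise. A sketch of that logic (not the code I ended up using):

```python
import random


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to sleep before retrying a throttled (429) request.

    Prefers the server's Retry-After header value when present;
    otherwise uses capped exponential backoff with a little jitter
    so parallel clients don't retry in lockstep.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 0.5)
```

The caller would sleep for `retry_delay(attempt, response.headers.get("Retry-After"))` and then reissue the request.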
Attempt #3: rclone
I've used rclone with a lot of success in the past. I looked at it prior to using the API, but it required a user account on the SharePoint instance, which I didn't have at the time. I went back to the admin and requested one. With that set up, I was able to configure WebDAV access to SharePoint. The configuration looked like this:
```
[sharepoint-webdav]
type = webdav
url = https://client.sharepoint.com/sites/SiteName
vendor = sharepoint
user = email@example.com
pass = ***
```
Keep in mind that pass is not stored in plain text; I set it via the rclone config command.
I tried testing it via rclone ls sharepoint-webdav:, which failed for me with:
```
2022/09/14 22:22:16 ERROR : : error listing: couldn't list files: 403 FORBIDDEN: 403 Forbidden
2022/09/14 22:22:16 Failed to ls with 2 errors: last error was: couldn't list files: 403 FORBIDDEN: 403 Forbidden
```
Eventually I figured out that my configuration was correct, but I had to specify a directory to list. rclone ls sharepoint-webdav:DirectoryName worked fine.
So I ran this overnight:
```
rclone copy --verbose --ignore-size --ignore-checksum \
  sharepoint-webdav:DirectoryName /path/to/local/DirectoryName
```
The next morning, I saw more errors. It turns out SharePoint has fairly sensitive rate limiting, and by default rclone does a fair amount of concurrent requests (4 simultaneous transfers and 8 checkers).
rclone in shackles
To minimize rate limiting, I passed some additional flags to rclone in an attempt to keep it from hitting the SharePoint API too aggressively:

```
rclone copy --verbose --ignore-size --ignore-checksum \
  --checkers 1 --transfers 1 --update \
  --user-agent 'NONISV|OrgName|rclone/1.53.3' \
  sharepoint-webdav:DirectoryName /path/to/local/DirectoryName
```
...and it worked!
With the files copied to the temporary instance, I could then sync them up to S3 using aws. It's worth noting that I could have done this all in one go with rclone, but I tried that once before when copying from an SFTP endpoint to S3, and it was considerably slower than copying the files locally and then running aws separately. I assume it had to do with some of the remote checking rclone does to determine whether a file already exists, but I didn't bother to troubleshoot. Having the files locally as an intermediate stop also made it easier to spot-check and verify them before shipping them off to S3.
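The final hop was just a sync from the instance's disk. It looked something like this, with the bucket name being a placeholder:

```shell
# Sync the locally verified files up to S3. Bucket and prefix are
# placeholders; running with --dryrun first is a cheap sanity check.
aws s3 sync /path/to/local/DirectoryName s3://archive-bucket/DirectoryName
```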