Broken GitHub and Azure Data Factory Link

Lately I’ve been using Azure Data Factory for a few data movement tasks. Like a good developer, I linked my factory to GitHub. I like to know I can recover from big mistakes somewhat easily. However, I ran into issues when something got out of whack.

I’ll address two issues here:

  1. Azure Data Factory reporting it doesn’t have access to the repository, even though it should.
  2. The “Publishing Error” that ensues after fixing the first problem.

All was working well until our organization made some authentication changes in the GitHub account. Suddenly my factories could no longer access GitHub. I thought “No big deal, I’ll just remove Git and re-authenticate.”

Problem 1: Re-authentication Failed

After removing Git, I tried to reconnect it and import the repository. Unfortunately, it didn’t work. The GitHub login screen would simply flash and disappear. As a workaround, I opened a private browser window and went through the same process. This time the login screen worked as it should.

However, even after authenticating, Azure Data Factory still complained that it didn’t have access to the repository. I didn’t capture the error message. Sorry (for real). Just know that it told me, in big red words, that it didn’t have permission to read the repository. That was quite absurd considering I’m an admin on the account.

The repository list did show public repositories. But the private ones were nowhere to be seen.

Fixing the Disconnect

Fixing the problem feels like a hackish workaround. Because it is. It is simple, but annoying. Side note: I contacted Microsoft support before getting to this point. They couldn’t solve the issue. This isn’t a put-down to the support rep. I was actually very happy with how he handled the issue. I just managed to solve it on my own after the call.

To fix it, create a brand new data factory. Yes. A new data factory. Fill out all the details, make sure you check “Enable Git,” and fill in the Git details as well.


Ideally this will take care of it. For me, it didn’t. I ran into an additional problem that was also a pain in the neck to solve. This leads us to:

Problem 2: Non-publishable Factory

I wanted to simply run my pipeline. It wouldn’t run. I tried to publish my changes first. Instead of the “success” message I expected (this was a known, good, working factory after all), I was greeted with a failure message.

Publishing error: The publish branch is out of sync with the collaboration branch.

My first response was “Um, no. It hasn’t been published outside of Git mode.” As all dutiful people do, I clicked the “Git troubleshooting guide” link. It was absolutely worthless. If you’re curious where it goes, this is the URL: https://docs.microsoft.com/en-us/azure/data-factory/source-control#troubleshooting-git-integration. It only deals with a stale publish branch, and I found it confusing and unclear.

Besides that, it didn’t help fix the problem.

Fixing the Non-publishable Factory

This is yet another hacky workaround. The process is not fun. But the result is a working factory again.

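Step 1: Create a new GitHub repository. Initialize it with an empty directory where the factory will live. Git does not track truly empty directories, so the folder needs at least one placeholder file (a README works) before it can be pushed. If you use “master” as the collaboration branch, push it there.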

Step 2: Disassociate your data factory from GitHub. Instructions just in case you aren’t sure how to do that:
1. Go to the main Data Factory section (the “Data Factory” button).
2. Click “Git repo settings.”
3. Click “Remove Git.”

Step 3: Reconnect the factory to GitHub and point to the new, empty repository. Make sure the “Root folder” points to the empty folder you created in step 1.

Step 4: From the data factory interface, create a new working branch.

Step 5: On your local computer, clone the new repository and check out the working branch from step 4.

Step 6: Clone the original repository to your local computer and check out the master branch (or whatever your collaboration branch is). Note: don’t use the adf_publish branch! Otherwise you might be back at square one.

Step 7: Copy everything from the original repository to the new repository on your local computer. In the new repository, commit and push changes. If you have triggers, you may not want to copy those yet.

Step 8: Back in the Data Factory interface, click the Refresh button. Everything should show up.

Step 9: You may need to edit your connections to make sure they connect. Mine did not connect at first. All I had to do was fill in the credentials again.

Step 10: Close Data Factory. In fact, just close the whole browser. Then you can open it again and test your pipelines. If they seem to get stuck, try clearing the browser cache as well, or use a private browser window. Before I did this, my pipeline seemed to fail to connect to the database without reporting any error (aside from a timeout).
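
If you would rather script the copy in step 7 than drag folders around by hand, here is a minimal sketch in Python. It is not something I ran during the original fix, and it rests on a few assumptions: both repositories are already cloned locally (steps 5 and 6), the two paths you pass in are the factory root folders, and triggers live in a top-level folder named “trigger” (check your own repository, since the name here is an assumption). The function name copy_factory_files is made up for this example.

```python
import shutil
from pathlib import Path


def copy_factory_files(original_root, new_root, skip_triggers=True):
    """Copy factory files from the old clone into the new clone.

    Returns the sorted relative paths of the files that were copied.
    """
    original_root = Path(original_root)
    new_root = Path(new_root)

    # Assumption: ".git" holds Git's own metadata and must never be copied;
    # "trigger" is assumed to be the top-level folder holding triggers.
    skipped = {".git"}
    if skip_triggers:
        skipped.add("trigger")

    copied = []
    for source in sorted(original_root.rglob("*")):
        relative = source.relative_to(original_root)
        # Skip directories themselves and anything under a skipped folder.
        if not source.is_file() or relative.parts[0] in skipped:
            continue
        target = new_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(relative.as_posix())
    return copied
```

Call it with the two local folders, review the returned list, then commit and push from the new clone as usual. When you are ready to bring the triggers over, run it again with skip_triggers=False.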

At this point, everything should be working again. You can decide what to do with the old repository, and the old factory. I won’t be needing mine.