Chapter 2

[VMW] WS1 Access

Articles on installing, configuring, and troubleshooting Workspace ONE Access (formerly Identity Manager).

Subsections of [VMW] WS1 Access

Access Installation

Connected Articles

Components

Identity Manager Appliance

  • The Service; User portal, Built-in AuthN and IdP (TCP443)
  • Certificate Proxy Service (TCP5262)
  • Kerberos Key Distribution Center (KDC) (TCP/UDP 88)
  • User database (vPostgres OR external MS SQL)
  • OS = PhotonOS
  • Main WS1 Access service = horizon-workspace

Enterprise Connector

  • ❗️Old connector version 1811 needed for ThinApp publishing
  • The modern Connector is a Java microservices app with 4 microservices: User Auth, Kerberos Auth and Virtual App (1 GB RAM each), and Directory Sync (4 GB RAM)

Increasing memory of services on connector:

  1. Log in to the Windows server in which the Workspace ONE Access enterprise service is installed.
  2. Navigate to the INSTALL_DIR\Workspace ONE Access\serviceName folder.
  3. Open the serviceName.xml file in a text editor.
  4. Change the Xmx1g entry to Xmxng where n is the maximum heap memory you want to allocate. Example: Xmx5g
  5. Save file, restart service.

Network ports for the connector:

  • Inbound & outbound TCP 443 (many uses)
  • Outbound TCP 389, 636, 3268, 3269 (LDAP)
  • Outbound TCP/UDP 88, TCP 464, TCP 135, TCP 445 (Kerberos, Directory Sync)
  • Outbound TCP/UDP 53 (DNS)
  • Outbound TCP 5555 (RSA SecurID)
  • Outbound UDP 514 (Syslog)

Intelligent Hub is a single client for both WS1 UEM and WS1 Access (corporate app marketplace + Hub Services: people search, notifications).

Old components

Identity Manager on Windows Server (Old, uses only Cloud KDC)

  • The Service; User portal, Built-in AuthN and IdP (TCP443)
  • The Connector; AuthN and User, ThinApp and Horizon Sync (TCP8443)
  • Certificate Proxy Service (TCP5262)

❗️For IDM on Windows, do NOT use non-English localized Windows versions. Workaround: change the regional number setting for the decimal symbol to use a period “.” instead of a comma “,”.
❗️For IDM on Windows, shut down IIS to free up port TCP 80. IIS is not used, but port 80 is needed for the IDM install.

Workspace ONE client mobile app bundle ID: com.air-watch.appcenter

Subsections of Access Installation

Managing Certificates

External links:

General Commands

The most common task is building a PFX file from PEM files:


openssl pkcs12 -export -out certificate.pfx -inkey privkey.pem -in fullchain.pem -certfile cert.pem

WS1 Access always has to be signed with corporate or trusted certificates. If Access is clustered, sign the load-balanced name with an external trusted certificate and the three nodes with certificates from the corporate CA.

  • Go to WS1 Access web console
  • On the top, click the Appliance Settings tab.
  • On the left, click the VA Configuration node.
  • On the right, click Manage Configuration. You will be redirected to a separate portal.
  • Log in with the admin account.
  • On the left, click Install TLS Certificates.
  • On the right, in the upper box, delete the certificate and key that are currently displayed.
  • Paste in the new PEM certificate and RSA private key. Paste every certificate in the chain: server + intermediate + root. Click Save.

❗️The order of certificates is important! First server, then intermediate, then root.
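A quick way to confirm the order before pasting is to print the subject and issuer of every certificate in the bundle, in file order; a minimal sketch, assuming the chain was concatenated into fullchain.pem as in the example above:

openssl crl2pkcs7 -nocrl -certfile fullchain.pem | openssl pkcs7 -print_certs -noout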

Certificate Requests


######### The cert request in idm01.domain.local.inf file

[Version]
Signature= "$Windows NT$" 
 
[NewRequest]
Subject = "CN=idm01.domain.local,OU=IT,O=Horn_n_hooves,L=Moscow,S=Moscow,C=RU"
KeySpec = 1
KeyLength = 2048
Exportable = TRUE
MachineKeySet = TRUE
UseExistingKeySet = FALSE
ProviderName = "Microsoft RSA SChannel Cryptographic Provider"
ProviderType = 12
RequestType = PKCS10
KeyUsage = 0xa0
FriendlyName = "vdm" ; needed for Horizon Connection Server only!
 
[EnhancedKeyUsageExtension]
OID=1.3.6.1.5.5.7.3.1
 
[RequestAttributes]
CertificateTemplate = WebServerExportable2008
 
[Extensions]
; If your client operating system is Windows Server 2008, Windows Server 2008 R2, Windows Vista, or Windows 7
; SANs can be included in the Extensions section by using the following text format. Note: 2.5.29.17 is the OID for a SAN extension.
 
2.5.29.17 = "{text}"
_continue_ = "dns=idm01.domain.local&dns=idm01&dns=idm02.domain.local&dns=idm02&dns=idm03.domain.local&dns=idm03&dns=idm.domain.local"
##################################################################################

Bat script to submit certificate request:


set srvname=idm01.domain.local
cd C:\temp

Certreq -New -f %srvname%.inf %srvname%.req
Certreq -submit %srvname%.req %srvname%.crt
Certreq -accept %srvname%.crt
Certutil -exportpfx -p 12345 %srvname% "%srvname%.pfx"

In order to copy and paste the private key from the PFX certificate for vIDM, you need a decrypted version of the key. Use OpenSSL to obtain it:


openssl pkcs12 -in idm01.domain.local.pfx -nocerts -out idm01.domain.local_encrypted.key
openssl rsa -in idm01.domain.local_encrypted.key -out idm01.domain.local_decrypted.key

Open idm01.domain.local_decrypted.key with a text editor and copy the key from there. After inserting certificates, click OK to restart WS1 Access web service.
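Before clicking OK, you can optionally check that the decrypted key really matches the certificate in the PFX; the two digests below should be identical. A minimal sketch reusing the file names from the commands above:

openssl pkcs12 -in idm01.domain.local.pfx -clcerts -nokeys | openssl x509 -noout -modulus | openssl md5
openssl rsa -noout -modulus -in idm01.domain.local_decrypted.key | openssl md5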

Password Reset

External Links:

❗️The admin user password must be at least 12 characters in length.

SSH User Password Change

Connect to WS1 Access Appliance by SSH, run command:


passwd sshuser

Web Console Admin Password Reset

  1. Log in to the URL as a root user: https://ApplianceFQDN:8443/cfg/changePassword

  2. Alternatively, connect to WS1 Access using SSH (sshuser password needed), run su, and use one of these commands to reset the password:

  • Admin console site:

/usr/sbin/hznAdminTool setOperatorPassword --pass newsecretpassword
  • Configurator page:

/usr/sbin/hznAdminTool setSystemAdminPassword --pass newsecretpassword

Root User Password Reset

WS1 Access works on VMware PhotonOS variation of Linux. It has a Single User Mode.

  1. Go to vCenter Server list of VMs;
  2. Right-click the affected WS1 Access OVA and click Open Console;
  3. Under the VM menu, click Power->Shut Down Guest;
  4. When the shutdown completes, Power On the WS1 Access Appliance;
  5. When the GNU GRUB menu displays, press p and enter the configured bootloader password = ⭐️ H0rizon! ⭐️ ; ❗️GRUB menu appears for a few seconds. If you miss it, reboot and try again;
  6. Use the up and down arrow keys to navigate to the first entry, and press e to edit the relevant boot parameters;
  7. Use the arrow keys to navigate to the line beginning with kernel and press e to edit;
  8. The cursor is at the end of the line, type a space and then append init=/bin/bash to the line;
  9. Press Enter to confirm the changes;
  10. Press b to execute the boot;
  11. After boot, you have ROOT access and you are able to set new passwords, for root: passwd and for sshuser: passwd sshuser; ❗️Follow password policies on new passwords!
  12. Shutdown by using command: shutdown -h -P now; ❗️VMware Tools do NOT work in Single User Mode, so shutdown from vCenter will NOT work;
  13. Start WS1 Access Appliance.

SQL Preparation

Attachments

Manual

Internal PostgreSQL

To log in to the DB, first connect to the appliance console with SSH and get the PostgreSQL password:


cat /usr/local/horizon/conf/db.pwd

Copy password, then login with it:


psql saas horizon

External MS SQL Preparation

❗️The database schema name must be ‘saas’; it cannot be changed.
❗️The collation must be ‘Latin1_General_CS_AS’; it can be changed, but changing it is not recommended.
❗️The server role used to grant server-wide security privileges is set to public. The database role membership is db_owner.

Microsoft SQL Database Using Local SQL Server Authentication Mode for Workspace ONE Access (replace values in brackets < > ):


CREATE DATABASE saas
COLLATE Latin1_General_CS_AS;
ALTER DATABASE saas SET READ_COMMITTED_SNAPSHOT ON;
GO

BEGIN
CREATE LOGIN <loginusername> WITH PASSWORD = N'<password>';
END
GO

USE <saasdb>; 
IF EXISTS (SELECT * FROM sys.database_principals WHERE name=N'<loginusername>')
DROP USER [<loginusername>]
GO

CREATE USER [<loginusername>] FOR LOGIN [<loginusername>]
WITH DEFAULT_SCHEMA=saas;
GO

CREATE SCHEMA saas AUTHORIZATION <loginusername>
GRANT ALL ON DATABASE::saas TO <loginusername>;
GO

ALTER ROLE [db_owner] ADD MEMBER <loginusername>;
GO

JDBC URLs

SQL local user:
jdbc:sqlserver://<DB_VM_IP_ADDR>;DatabaseName=saas
jdbc:sqlserver://<DB_VM_IP_ADDR>\INSTANCE_NAME:PORT;DatabaseName=saas (you can omit the instance name if it is the default)

AD domain user:
jdbc:jtds:sqlserver://<DB_VM_IP_ADDR>:1433/saas;integratedSecurity=true;domain=LAB.LOCAL;useNTLMv2=true

Multi-site, SQL Always On:
jdbc:sqlserver://;DatabaseName=saas;multiSubnetFailover=true

Troubleshooting Issues

Troubleshooting JDBC URL Wizard page

Back and continue buttons become greyed out and unclickable.

  • Use Web Admin tools of the Browser (Firefox, Chrome);
  • Right click and “Inspect Element” on the disabled button. Find the ID tag for “nextButton”;
  • Within this line of text there is a value = “is-disabled”. Remove “is-disabled” from the line by clicking and typing into the inspector;
  • Return to the web page; the button should now work. Click it and proceed.

Database Locked Error

See resolution link

Check the result of the following DB query. If the DB is locked, you will see an entry with the locked value set to 1 or TRUE.

SELECT * FROM DATABASECHANGELOGLOCK

If locked=TRUE, run the UPDATE statement below to release/reset the DB lock.

USE [DBNAME]
GO

UPDATE [saas].[DATABASECHANGELOGLOCK]
SET [LOCKED] = 0,
    [LOCKGRANTED] = NULL,
    [LOCKEDBY] = NULL
WHERE ID = 1
GO

Restart horizon-workspace service on each node one after another. ❗️DO NOT restart all nodes of WS1 Access cluster simultaneously.
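A minimal rolling-restart sketch for one node, assuming the health endpoint described later in this article is reachable on localhost; wait for the node to report healthy before moving to the next one:

service horizon-workspace restart
# after a few minutes, confirm the node is back before touching the next node:
/usr/local/horizon/bin/curl -k https://localhost/SAAS/API/1.0/REST/system/health/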

Important Queries

Update the Connector table if more than one connector has directory sync enabled and one should be used for authentication only


UPDATE Connector SET isDirectorySyncEnabled=false WHERE host=<auth_connector_hostname>
	

Check if IDM certificate is present in the console from DB side


SELECT * from dbo.coreuser WHERE isactivedirectoryuser=0  
SELECT * from dbo.certificate WHERE certificatethumbprint like '%%'  
SELECT * from dbo.UserLink WHERE coreuserid = < username >
SELECT * from dbo.role WHERE roleid =3
	

List of UUID with Super Admin access


SELECT strUsername, strFirstName, strLastName, strExternalId, strEmail, uuid FROM slesdb.saas.Users U (nolock) INNER JOIN
(SELECT * FROM slesdb.saas.ACS_RuleSetAssociation (nolock)
WHERE ruleSetId LIKE
(SELECT id FROM slesdb.saas.ACS_RuleSet (nolock)
WHERE name LIKE 'Super Admin')) AS A
ON U.uuid = A.SubjectUUID
ORDER BY strUsername

Update email address of the user or admin


UPDATE slesdb.saas.Users
SET strEmail = 'value'
WHERE strUsername LIKE 'value'

Update the value of the Identity Provider with correct connector name


UPDATE slesdb.saas.IdentityProviders
SET strDescription = (
    SELECT host FROM slesdb.saas.Connector WHERE id = (
        SELECT idConnector FROM slesdb.saas.IdpJoinConn WHERE idIdentityProvider = (
            SELECT id FROM slesdb.saas.IdentityProviders WHERE strFriendlyName LIKE '%Workspace%')))

Update the attribute column to make it mandatory or non-mandatory


SELECT id, * FROM slesdb.saas.userattributedefinition WHERE idorganization IS NOT NULL AND ownerUuid IS NULL AND strName LIKE '<attribute to be updated>'

Then run:


UPDATE slesdb.saas.userattributedefinition SET bIsRequired = 0 WHERE id = <id identified above>
	

Connector Sync Validation


SELECT id, uuid, tenantID, host, domainJoined, createdDate, oAuth2ClientId, isDirectorySyncEnabled FROM saas.Connector;

SELECT idSyncProfile, directoryConfigId, syncConnectors FROM saas.DirectorySyncProfile
	

⭐️ In saas.Connector, isDirectorySyncEnabled = 1 marks the connector that is set as the sync connector. If the wrong connector is set, you can update it:


UPDATE saas.connector SET isDirectorySyncEnabled=0 WHERE id=1;
	

⭐️ In saas.DirectorySyncProfile, syncConnectors holds the uuid value of the corresponding saas.Connector row. You can update it if needed as well:


UPDATE saas.connector SET isDirectorySyncEnabled=0 WHERE id=1;
UPDATE saas.DirectorySyncProfile SET syncConnectors='["12345678-abcd-1234-1234-0123a678b78"]' WHERE idSyncProfile=1;

AD Connection Removal

Supported removal

  • Directory cannot be removed, because an active Connector is attached to it;
  • Connector cannot be removed because an Identity Provider (IdP) is associated with it (BuiltIn IdP);
  • BuiltIn IdP cannot be dissociated, because it is locked to the Connector.
  • If you need to delete the Connector, you have to delete User Directory;
  • If you need to delete User Directory, then Built-in IDP should not be associated with it.

Steps:

  • Check the Built-in Identity Provider and take note of the access policies associated with the Built-in Identity Provider;
  • Un-associate all the policy rules. Edit Identity & Access Management → Policies tab (Setup mode) - remove “Password (Cloud Directory)” from policies, put “Password (Local Directory)” instead;
  • Delete the Built-in IdP:
    • Remove the connector associated with the Built-in IDP and save it. Uncheck the box for Password (Cloud deployment) in connector properties and save it;
    • Click on red cross to delete Connector in Built-In properties;
    • Click on Delete IDP. Click OK to delete IdP.
  • Now you can go to Directories and delete the User Directory;
  • Wait until the directory is deleted. Check by refreshing directory page;
  • Delete the Connector using the red delete button on the right of its name in the list;
  • Go via RDP to the server with Enterprise Connector, uninstall it.

Unsupported removal

Access the saas database of IDM and delete the association of the IdP with the Connector.

For PostgreSQL

If the IDM database is the internal PostgreSQL, use SSH to connect to the primary appliance. ❗️The database on the primary appliance is in read/write mode.

Connect to database:


su - postgres /opt/vmware/vpostgres/current/bin/psql vcac

Change schema to saas:


SET SCHEMA 'saas';

For Microsoft SQL Server

Use SQL Management Studio to connect to DB server, open saas database.

Find Connector table (example shown for MS SQL Server 2014) and ‘disconnect’ it from AD:


SELECT * FROM saas.saas.Connector;

-- Check the ID of your AD connector

UPDATE saas.saas.Connector
SET domainJoined = NULL
WHERE id = N; -- set N to the ID of your AD connector

❗️You cannot delete the Connector in SQL - it has foreign-key dependencies. But all you need to do is find the ID of the BuiltIn IdP and disconnect it:


SELECT * FROM saas.saas.IdentityProviders;
SELECT * FROM saas.saas.IdpJoinConn;

-- Check the ID of your IdP before doing this!!

DELETE FROM saas.saas.IdpJoinConn
WHERE idIdentityProvider = N; -- set N to the ID of your IdP

  • After this, you can delete the Connector in Identity Manager Console. Then delete the AD Directory;
  • Reinstall Identity Manager Connector and activate it.

❗️After adding a new AD, follow the guide backwards: do not forget to add “Password (Cloud Directory)” back to the policies!

❗️The BuiltIn IdP must be associated back with the new Connector. This can be done using the console, or it can also be done in SQL:


SELECT * FROM saas.saas.IdpJoinConn;

-- Double-check the ID of your IdP and NEW Connector before doing this!!

INSERT INTO saas.saas.IdpJoinConn
VALUES (X, Y) -- set X to the ID of your IdP, set Y to the ID of Connector

Elasticsearch

Background

Elasticsearch in VMware Workspace ONE Access is used for storing records on:

  • Audit Events and Dashboards (all events)
  • Sync Logs (LDAP)
  • Search data, Peoplesearch (users, groups, apps)
  • Entitlements (apps)

These records are collected on the local service node and sent to the RabbitMQ instance on that node once per second. They are then picked up and sent to the Analytics Service on the local node, which then stores the records in Elasticsearch. RabbitMQ acts as a buffer, so that if Elasticsearch is down, the records are not lost. It also decouples the generation rate from the storage rate, so if records are coming in faster than they can be stored, they can be queued up and left in RabbitMQ.

RabbitMQ previously WAS clustered, so records generated on one node could be picked up and processed by any other node in the cluster. This meant that if the local Elasticsearch instance was having issues, another node could grab the messages and process them. Now RabbitMQ is NOT clustered (it had complex node stop/start order requirements to maintain a stable cluster, and maintaining a RabbitMQ cluster was more hassle than it was worth), so it is possible for the records generated on one node to start piling up in RabbitMQ. This is less robust, but makes it easier to spot when a specific node is having a problem.

Services connected to Elasticsearch:

  • elasticsearch
  • horizon-workspace
  • hzn-dots
  • hzn-network
  • hzn-sysconfig
  • rabbitmq-server
  • thinapprepo (optional, only if ThinApp is used)
  • vpostgres (optional, only for internal DB)

Elasticsearch Indices and Shards

Elasticsearch is a distributed, replicated data store, which means each node does not have a full copy of the data. In a 3-node cluster, each node will have 2/3 of the data, so that all 3 nodes combined provide 2 full copies, and you can lose one node without losing any data. This distribution and replication is achieved by splitting each index into 5 shards, where each shard has a primary copy and a replica (backup) copy. All ten shards for an index are then distributed across all three nodes so that no single node has both the primary and replica copy of any one shard, or all the shards for one index.
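You can see this distribution directly: the _cat/shards endpoint lists every primary (p) and replica (r) shard and which node holds it.

curl 'localhost:9200/_cat/shards?v'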

Audit data and sync logs are stored in a separate index for each day. The index is versioned based on the document structure so that you can seamlessly migrate from an old format to a new format. The audit indexes have the format "v<version>_<year>-<month>-<day>". For example, version 4 of the document structure will be "v4_2022-03-04".

Search data (users, groups, applications) used by the admin UI for search and autocomplete, and the PeopleSearch feature, is stored in a separate index called "v<version>_searchentities", and currently uses version 2 of that document structure.

⭐️ The indices can be seen in the /db/elasticsearch/horizon/nodes/0/indices directory or listed with useful stats via: curl localhost:9200/_cat/indices?v

Elasticsearch Node Discovery

A cluster has a master node which has the additional job of coordinating the distribution of data and farms out the search and document retrieval requests to the other nodes based on which node has which shard. Any node can be the master node, the nodes choose the master themselves based on the makeup of the cluster. If the master node disappears, one of the remaining nodes is chosen as the new master.

Elasticsearch uses a list to determine how many nodes it should be able to see before a master can be elected and the cluster formed (and which other nodes it should be reaching out to). This is done to prevent a split-brain situation, where due to network issues not all the nodes can see each other and they form independent clusters, which requires resetting the errant node and losing its data. So when WS1 Access tells Elasticsearch there should be 3 nodes, it knows it must see at least 2 nodes to form a cluster.

❗️If Elasticsearch is up before WS1 Access (horizon-workspace service), there will be timeout failures in the logs, which can result in the cluster getting stuck. If the other nodes in the cluster are not up, there will be error messages about trying to add them to the cluster and then removing them again because they did not respond.

Data Retention

❗️Due to rules and regulations in some countries and companies around data retention of audit data, by default historical data is not deleted.

Elasticsearch maintains a very complex series of lookup tables in memory that allows it to determine which documents match a particular query in just a couple of milliseconds, whether there is one document to query or a billion.

By default, to reduce memory footprint, the number of days (indices) allowed for Elasticsearch to keep in memory (or open) is limited to 90 days.
Older indices are closed to remove them from memory and are no longer searchable. Once a day, just after midnight (00:30am), the retention policy is executed, and any open index that is older than the 90 days is closed.

Also the open indices that are no longer being written to (any open index older than the current day) are optimized: As documents come in, Elasticsearch writes them to lots of small files on disk. This lets it store documents quickly, but slows down retrieval and consumes a huge amount of open file descriptors. When an index is no longer being written to (for example, because it is for yesterday and no dated records for yesterday should be coming in anymore to write to it), Elasticsearch takes all those little files and combines them into one big file. This reduces the number of file descriptors needed and speeds up document retrieval for older documents.

When the retention policy runs, there will be messages in the analytics-service log saying it has started, and for each open index, whether it was closed because it is now past the cut-off date, or was optimized (the optimization is blindly optimizing all open, older-than-today indices, because if it is already optimized, Elasticsearch does nothing).

com.vmware.idm.analytics.elasticsearch.ElasticSearchHttpStorageAdapter - Executing analytics retention policy, cutoff date is 2019-04-07
com.vmware.idm.analytics.elasticsearch.ElasticSearchHelper - Closed index: v4_2019-04-06
com.vmware.idm.analytics.elasticsearch.ElasticSearchHelper - Optimizing index v4_2019-05-02 
com.vmware.idm.analytics.elasticsearch.ElasticSearchHelper - Optimizing index v4_2019-06-28 
com.vmware.idm.analytics.elasticsearch.ElasticSearchHttpStorageAdapter - Analytics retention policy completed.

Elasticsearch and RabbitMQ Health Status

Time skew

Elasticsearch will be out of sync when the WS1 Access nodes have different times. The easy way to check is the System Diagnostics Page -> Clocks; it should show all the clock times.

Check using CLI on all nodes: watch -n 1 date

Disk space

Due to the default policy of never deleting old data, the /db filesystem usage will continue to grow until it runs out of space.

  • Elasticsearch will stop writing documents when it gets 85% full (set via the cluster.routing.allocation.disk.watermark.low setting).
  • RabbitMQ will stop working when it gets down to just 250MB free (set via the disk-free-limit setting).

Check the free disk space for the /db filesystem with the command: df -h
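A small sketch that flags when /db crosses the 85% Elasticsearch watermark mentioned above (threshold and path taken from this section; adjust as needed):

usage=$(df -P /db | awk 'NR==2 {gsub("%","",$5); print $5}')
[ "$usage" -ge 85 ] && echo "WARNING: /db is ${usage}% full (Elasticsearch low watermark reached)"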

RabbitMQ Queues

An indicator that something is wrong is if the number of pending documents in the analytics queue is growing. RabbitMQ is used for other types of messages, which go in their own queues, but the analytics queue is the only one that does not have a TTL, so it is the only one that can grow. This is done so that its messages never get deleted and the audit records never get lost. Typically the analytics queue should have just 1-2 messages waiting to be processed, unless something is wrong or there is a temporary spike in records being generated (during a large directory sync the queue might temporarily grow if the system cannot keep up).

Check queue size on each node with CLI: rabbitmqctl list_queues | grep analytics

If the number reported is larger than 100, it's time to investigate why.

⭐️ This is also reported as the AuditQueueSize value in WS1 Access health API: /SAAS/API/1.0/REST/system/health/
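A hedged one-liner to pull just that value from the health API on a node (assumes the JSON field name shown above):

/usr/local/horizon/bin/curl -k https://localhost/SAAS/API/1.0/REST/system/health/ | grep -o '"AuditQueueSize"[^,}]*'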

General Elasticsearch Checks

Overall cluster health:

curl http://localhost:9200/_cluster/health?pretty

Check that the node counts (number_of_nodes and number_of_data_nodes) equal the size of the cluster and the state is green (except for single nodes, whose state will always be yellow).

GREEN = good, there are enough nodes in the cluster to ensure at least 2 full copies of the data spread across the cluster;

YELLOW = ok, there are not enough nodes in the cluster to ensure HA (single-node cluster will always be in the yellow state);

RED = bad, unable to query existing data or store new data, typically due to not enough nodes in the cluster to function or out of disk space.

Check that nodes agree on which ones are actually in the cluster and who the master is (to ensure you don’t have errant nodes, or split-brain with two clusters instead of one).

Run this on each node and verify the output is the same for every node:

curl http://localhost:9200/_cluster/state/nodes,master_node?pretty

This information is also reported in the “ElasticsearchNodesCount”, “ElasticsearchHealth”, “ElasticsearchNodesList” values in the health API, but that will only be coming from one of the nodes as picked by the gateway: /SAAS/API/1.0/REST/system/health/

Other useful commands:

curl http://localhost:9200/_cluster/health?pretty=true
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

curl http://localhost:9200/_cluster/state/master_node,nodes?pretty
curl http://localhost:9200/_cluster/stats?pretty=true
curl http://localhost:9200/_nodes/stats?pretty=true
curl http://localhost:9200/_cluster/state?pretty=true

curl http://localhost:9200/_cat/shards | grep UNASSIGNED 
curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

Proactive support

Avoiding disk space issues

Determine the growth rate of the data the nodes store. Use that to either schedule a disk size increase or determine how many days of data the nodes can keep at their current usage.

After a few weeks of running the system, look at the size of each index and determine the average day’s size using:

curl localhost:9200/_cat/indices?v
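To estimate the total on-disk size (and from it the average per-day growth), the _cat API can also return just the sizes in bytes; a minimal sketch (column parameters may vary slightly between Elasticsearch versions):

curl -s 'localhost:9200/_cat/indices?h=index,store.size&bytes=b' | awk '{sum+=$2} END {printf "total: %.1f GB\n", sum/1024/1024/1024}'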

Extrapolate how many days the nodes have space for. For safety, use 50-60% of the space as the benchmark. You can change the data retention policy to automatically delete old data so that the filesystem usage is capped. This will not interrupt service and is done by a configuration change that must be made on each node, setting maxQueryDays to the number of days to keep data. The new policy will take effect at 00:30 am:

  • Edit /usr/local/horizon/conf/runtime-config.properties on each node to add/modify the following lines:
analytics.deleteOldData=true   (default is false, which only closes the older indices instead of deleting them)
analytics.maxQueryDays=90      (default is 90)
  • Restart the workspace service after making the change. To avoid any interruption in service, wait for the service to be fully back up before moving on to the next node:
service horizon-workspace restart
  • When all nodes have been updated and restarted, pick one of the nodes and tell the cluster to re-open any old, closed indices so that the new policy will find and delete them:
curl -XPOST http://localhost:9200/_all/_open

Avoid full cluster restart issues

WS1 Access can take a long time to come up, because the first node has the default configuration of only expecting one node in the cluster and decides it is the master. Then it checks what indices it has and discovers a bunch of shards it cannot find. When the second node comes up, it is told by the first node that the first node is the master, but the first node is stuck because of the missing shards, so it refuses to act like a master. The same thing happens when the third node comes up. This situation can be fixed by restarting Elasticsearch on the first node. It can be completely avoided by changing the configuration, so that on restart all the nodes know to expect more nodes in the cluster, so they wait and only elect a master when the second node comes up.

On each node, edit /opt/vmware/elasticsearch/config/elasticsearch.yml to add these lines, which tell the cluster to not form until at least 2 nodes are in communication, and to wait up to 10 minutes for a third node to come online before deciding to start copying shards between the two nodes that are there to ensure HA. It will also prevent accidentally starting Elasticsearch more than once on a node:

discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 2
gateway.expected_nodes: 3
gateway.recover_after_time: 10m
node.max_local_storage_nodes: 1

❗️❗️ NOTES: ❗️❗️

  • Make sure there are no leading spaces in front of the entries, otherwise elasticsearch will not start (no “initializing” message in the log).
  • This should only be done for a 3-node cluster; it will break single nodes.
  • A restart after applying this configuration is NOT needed; the config is only applicable when a restart is happening, so it will take effect the next time the cluster restarts.

Elasticsearch Manual Reindex

. /usr/local/horizon/scripts/hzn-bin.inc && /usr/local/horizon/bin/curl -v -k -XPUT -H "Authorization:HZN <cookie>" -H "Content-Type: application/vnd.vmware.horizon.manager.systemconfigparameter+json" https://localhost/SAAS/jersey/manager/api/system/config/SearchCalculatorMode -d '{ "name": "SearchCalculatorMode", "values": { "values": ["REINDEX"] } }'

Troubleshooting Elasticsearch and RabbitMQ

Elasticsearch cluster health returns MasterNotDiscoveredException

When you try to check the elasticsearch cluster health you get:

{
  "error" : "MasterNotDiscoveredException[waited for [30s]]",
  "status" : 503
}

You can enable additional logging by editing /opt/vmware/elasticsearch/config/logging.yml and adding one of the following lines after the “org.apache.http: INFO” one:

com.vmware: DEBUG
com.vmware: TRACE

Causes:

  1. Stale/incorrect node entries stored in the DB being reported to Elasticsearch. Look at the contents of the “ServiceInstance” DB table. If there are any wrong entries, manually delete them. Check that the “status” value is 0. Any other value indicates an issue with that node that needs to be looked at (1 = inactive, 2 = error, 3 = invalid master key store, 4 = pending removal). When the issues with the ServiceInstance table have been addressed, within 30 seconds the elasticsearch plugin on all the nodes should get just the correct records and sort itself out. If it has not done that within a few minutes, restart elasticsearch on each node (service elasticsearch restart).

  2. Mis-configured multi-DC setup, resulting in the service instance REST API reporting incorrect nodes to elasticsearch. On the affected nodes check the output of:

/usr/local/horizon/bin/curl -k https://localhost/SAAS/jersey/manager/api/system/clusterInstances

If it does not show the entries you expect, look in the “ServiceInstance” table in the DB and check that the “datacenterId” column value is correct for the nodes (so they are identified as part of the correct per-DC cluster). Check that the “status” value is 0. Any other value indicates an issue with that node that needs to be looked at (1 = inactive, 2 = error, 3 = invalid master key store, 4 = pending removal).
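For a deployment using the internal PostgreSQL database, the ServiceInstance table can be inspected directly from the appliance (password from /usr/local/horizon/conf/db.pwd); a hedged sketch - exact table/column quoting may need adjusting:

psql saas horizon -c 'SELECT * FROM saas."ServiceInstance";'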

Elasticsearch unassigned shards

❗️ DO NOT DELETE UNASSIGNED SHARDS
You need to address the underlying cause, then the unassigned shards will be resolved by the cluster on its own. It is extremely rare for unassigned shards to be a real issue needing direct action (eg. filesystem corruption, accidental data deletion).

Elasticsearch cluster health yellow

One of the nodes is down, unreachable or broken. Find out which one is in trouble by getting the list of nodes and seeing which one is missing:

curl http://localhost:9200/_cluster/state/nodes,master_node?pretty

Do not worry about unassigned shards in the health API output, getting the node back in the cluster will allow the master to sort out who has what shard and within minutes the unassigned shards will be resolved and the cluster green.

Causes:

  1. If the node is down, bring it back up.

  2. Check the node is reachable from the other nodes in the cluster. If it isn’t, fix the networking/firewall issue so that it can rejoin the cluster.

  3. On the missing node, check the disk space is ok. If it’s full, either:

    a. increase the size of the filesystem, OR

    b. free up disk space to get the cluster back to green by manually deleting the older indices, based on their date (see the sketch after this list). To do that, first stop Elasticsearch on all the nodes (service elasticsearch stop). Then delete the directories for the oldest indices in the /db/elasticsearch/horizon/nodes/0/indices directory on the node with disk space issues. When there is at least 500MB-1GB free space again on that node, delete the exact same ones on the other two nodes. Finally, restart Elasticsearch on each node (service elasticsearch start) and the cluster should go back to green; then you can adjust the retention policy.
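A minimal sketch for finding the oldest (and largest) index directories on the node with the disk space issue, using the paths from this article:

# index directories sorted oldest-first by name (ignore the *_searchentities index)
ls -1 /db/elasticsearch/horizon/nodes/0/indices | sort | head
# sizes of each index directory
du -sh /db/elasticsearch/horizon/nodes/0/indices/*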

Elasticsearch cluster health red

At least two of the nodes are broken, the analytics queue in RabbitMQ will start to grow.

Causes:

  1. Check for the same causes as the yellow health.
  2. All nodes are up but the master node is misbehaving. You may see messages like this in the logs of the non-master nodes where they try to connect to the master, but are refused:
[discovery.zen ] [Node2] failed to send join request to master [Node1], 
reason [RemoteTransportException[[Node1]][internal:discovery/zen/join]]; nested: 
ElasticsearchIllegalStateException[Node [Node1]] not master for join request from [Node2]

Determine the master node from the log (e.g. Node1 in the example above) or with: curl http://localhost:9200/_cluster/state/nodes,master_node?pretty

Restart Elasticsearch on the master node (service elasticsearch restart). One of the other two nodes should immediately become the master (re-run the curl on the other nodes again to verify) and the old master will re-join the cluster as a regular node when it comes up.

RabbitMQ analytics queue is large

The analytics queue size in RabbitMQ should usually only show 0-5 messages when the system is functioning correctly. If you see larger values, this means messages were not able to be delivered to elasticsearch.

Causes:

  1. The issue is with sending messages due to infrastructure issues. Look in the logs for error messages from “AnalyticsHttpChannel” to see if there is an issue sending (bad cert or hostname).
  2. The issue is with connecting to RabbitMQ. Look in the logs for messages that contain “analytics” and “RabbitMQMessageSubscriber”.
  3. The Elasticsearch status is yellow or red. Follow the steps above to remedy.
  4. It's an old message. If the Elasticsearch status is green, look in the analytics-service log for details on why the message is unable to be processed. If you see a bulk store failure message containing “index-closed-exception”, then it's because the message contains a record with a timestamp older than maxQueryDays. Allow the message to get processed by opening all the indices (the nightly data retention policy will then close any that need it again): curl -XPOST http://localhost:9200/_all/_open

You can use following command to check the indices status:

curl http://localhost:9200/_cat/indices

The messages should now be able to get processed and the queue size should drop.

  5. It's a bad message. If the Elasticsearch status is green, look in the analytics-service log for details on why the message is unable to be processed. If you see a bulk store failure message containing "parse_exception", then there is a malformed document.
  • You can clear the entire queue by executing rabbitmqctl purge_queue <queue>.
    Note: This will cause the loss of any other audit events, sync logs and search records that were also queued up, so the search index must be rebuilt to ensure the search/auto-complete functionality continues to work fully.

  • You can also use the management UI to remove just the bad message. Go to the queue in the Management UI and select the “Get messages” option with re-queue set to false to pull just the first message off the queue (which should be the bad message). Once the bad message is cleared, the others should get processed successfully.

  6. If the queue is for replicating to a secondary DC but its entries are stale (DNS name change, or IP addresses (other than 127.0.0.1) were used instead of FQDNs), then the queue must be manually deleted from each node to prevent it from continuing to accumulate messages, by doing the following on each node:

a. Enable firewall access to the RabbitMQ management UI:

  • Edit /usr/local/horizon/conf/iptables/elasticsearch to add “15672” to the ELASTICSEARCH_tcp_all entry, => ELASTICSEARCH_tcp_all="15672"
  • Re-apply the firewall rules by executing the /usr/local/horizon/scripts/updateiptables.hzn script.

b. You can only access the RabbitMQ Management UI using the default credentials of guest/guest on localhost, so you will have to create a new admin user in RabbitMQ. To create a user called “admin” with the password “s3cr3t”:

       rabbitmqctl add_user admin s3cr3t
       rabbitmqctl set_user_tags admin administrator
       rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"

c. In a browser, access the RabbitMQ Management UI by going to http://host-or-ip-of-node:15672

  • log in with the newly created user
  • click on the “queues” tab
  • select the bad queue from the list
  • select delete from the “Delete / purge” option at the bottom of the page.

d. For security, re-disable remote access to the RabbitMQ Management UI by re-doing the a) step, but removing the 15672 port: ELASTICSEARCH_tcp_all="", then re-apply the firewall rules with the updateiptables.hzn script.

Diagnostics Page or Health API shows “Messaging Connection Ok: false”

This means RabbitMQ is in a bad state. Causes:

  • Out of disk space. Check the RabbitMQ status and compare the "disk_free_limit" and "disk_free" values to verify:
rabbitmqctl status

See cause #3 in the “Elasticsearch cluster health yellow” section for the steps to free up disk space.

  • Recovery file corruption, possibly due to an abrupt power off/failure on the VM. See Article

RabbitMQ is not able to start, i.e. rabbitmqctl status gives a “nodedown” error and you see this error in /db/rabbitmq/log/startup_log:

      BOOT FAILED ... {error, {not_a_dets_file, 
      "/db/rabbitmq/data/rabbitmq@<YOUR_HOSTNAME>/recovery.dets"}}}

This will require deleting the recovery file and starting RabbitMQ again:

a. Stop all RabbitMQ processes by doing kill -9 on all processes found with ps -ef | grep -i rabbit, until none remain.

b. rm /db/rabbitmq/data/rabbitmq@<YOUR_HOSTNAME>/recovery.dets

c. rabbitmq-server -detached &

Once RabbitMQ is running again, workspace should be able to connect to it. Note that it may take up to 10 minutes before re-connection is attempted. Look in horizon.log for messages from RabbitMQMessageSubscriber and RabbitMQMessagePublisher. If you see messages indicating they were shut down instead of connected, or there are ProviderNotAvailableExceptions in the log, then workspace must be restarted (service horizon-workspace restart) to recreate them.
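A quick way to check for those messages on a node; a minimal sketch assuming the default log location used elsewhere in this article:

grep -E 'RabbitMQMessageSubscriber|RabbitMQMessagePublisher|ProviderNotAvailableException' /opt/vmware/horizon/workspace/logs/horizon.log | tail -n 20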

New/changed users or groups are not showing up in the search or assign to application boxes

  1. Ensure the cluster is healthy and all the nodes agree on who is in the cluster and who the master node is by following the steps in “Elasticsearch Cluster Health” above;
  2. Ensure RabbitMQ queue size is not large, indicating messages containing the changes are undelivered to elasticsearch;
  3. If the problem persists after addressing any issues found with the previous checks, trigger a re-index of the search data.

Reset Everything (NUCLEAR BUTTON)

❗️❗️ DO NOT USE UNLESS NOTHING ELSE HELPS ❗️❗️

Option 1

Stop elasticsearch on all nodes, delete the data in elasticsearch and RabbitMQ, restart the elasticsearch process on each node, and trigger a re-index. This will result in the loss of the historic sync logs and audit data.

NOTE: If this is a Multi-Site setup:

  • Only steps 1-6 should be done on the DR site first (do not do step 7, the reindex; it should ONLY be done on the primary site, and the results will then get replicated over to the DR site).
  • When the DR site is OK, perform all 7 steps on the PRIMARY site.

Step 1. Stop elasticsearch on every node in the site: service elasticsearch stop

Step 2. When stopped on all nodes, clear the data, by doing the following on each node:

rm -rf /db/elasticsearch/horizon
rm -rf /opt/vmware/elasticsearch/logs 
# so we start with fresh logs to make it easier to see what is happening

Get the analytics queue name(s) and purge them. On the primary site there should be two analytics RabbitMQ queues that need purging, one for 127.0.0.1 ("-.analytics.127.0.0.1") and one for the DR site (as configured via the analytics.replication.peers setting in the runtime-config.properties file, “-.analytics.dr.site.com”):

  rabbitmqctl list_queues | grep analytics
  rabbitmqctl purge_queue <analytics-queue-name>

Step 3. Verify the data retention policy configuration on each node is the same by looking at the /usr/local/horizon/conf/runtime-config.properties file and the “analytics.maxQueryDays” and “analytics.deleteOldData” values. If any node is different, then change them to match and restart WS1 Access on the node that did not match with: service horizon-workspace restart

Wait for WS1 Access to be back up before restarting another and before going on to step 4.

Step 4. Add the additional config to Elasticsearch from the above chapter “Avoid full cluster restart issues” preventative measure to help it recover from a full cluster restart, by adding these lines to /opt/vmware/elasticsearch/config/elasticsearch.yml on each node if missing:

   discovery.zen.minimum_master_nodes: 2
   gateway.recover_after_nodes: 2
   gateway.expected_nodes: 3
   gateway.recover_after_time: 10m

Step 5. Restart elasticsearch on each node (preferably quickly) service elasticsearch start

Step 6. Verify the cluster is functioning fully by running the following on each node and making sure the results match (ie, nodes are 3, same master and the node list is the same) and health is green:

  curl http://localhost:9200/_cluster/state/nodes,master_node?pretty
  curl http://localhost:9200/_cluster/health?pretty

If the output of those two commands does not match on all nodes, stop; do not proceed to step 7. Gather the logs in /opt/vmware/elasticsearch/logs for each node.

STEP 7. Trigger a re-index of the search data.

To manually force a re-index, choose either the zero-down time option if you’re comfortable finding the HZN cookie value, or the down-time option by modifying the value directly in the DB.

The re-index should only take a few minutes. You can verify it started by looking in the workspace horizon.log for a message like:

com.vmware.horizon.search.SearchCalculatorLogic - 
Keep existing index. Search calculator mode is: REINDEX

Zero down-time

  • Log in to VIDM as the operator/first admin.
  • Use the developer tab in your browser to find the cookies (eg for Firefox, select Web Developer→Network. Then load a page and select the first request. Then select the “Cookies” tab and scroll down to the HZN cookie).
  • Copy the value of the HZN cookie.
    
  • SSH into a node and make the following REST API call, replacing <cookie_value> with the HZN cookie value obtained from the browser.
    
. /usr/local/horizon/scripts/hzn-bin.inc && /usr/local/horizon/bin/curl -k -XPUT -H "Authorization:HZN <cookie_value>" -H "Content-Type: application/vnd.vmware.horizon.manager.systemconfigparameter+json" https://localhost/SAAS/jersey/manager/api/system/config/SearchCalculatorMode -d '{ "name": "SearchCalculatorMode", "values": { "values": ["REINDEX"] } }'

After initiating the REINDEX, you can check the log to confirm it has started:

cat /opt/vmware/horizon/workspace/logs/horizon.log | grep REINDEX

With down-time

Stop horizon-workspace on all nodes: service horizon-workspace stop

Edit the DB:

UPDATE "GlobalConfigParameters" SET "strData"='REINDEX' WHERE "id"='SearchCalculatorMode';

Restart horizon-workspace on all nodes: service horizon-workspace start

Verify progress

The reindex should start within a couple of seconds. Verify the reset was successful by looking for the message “Forcing a REINDEX of the search index” in the logs.

Reindexing can take hours, depending on the size of the data. You can monitor its progress by looking in the logs for “SearchCalculatorLogic” statistic messages, which show the counts of the objects it is currently processing (typically 5000 at a time when doing a re-index). When all the counts reach zero, it has finished. E.g.:

2020-02-28T05:54:08,208 WARN (pool-37-thread-2) [ACME;-;-;-] 
com.vmware.horizon.common.datastore.BaseCalculator - ++SearchCalculatorLogic$
$EnhancerBySpringCGLIB$$3e4c8f8c [ 0] 0.000000/Object ([ 0] 0.000000/Group, [ 0] 0.000000/Users, 
[ 0] 0.000000/ResourceDb)

Option 2

Re-doing the Elasticsearch index.

Symptom: the user/group search is not working properly.

Resolution: Stop workspace service (on all nodes):

service horizon-workspace stop

  1. Run the following updates. For User Entitlement:

update saas.GlobalConfigParameters set "strData"=-1 where "id"='LatestUserEntitlementVersion';

For Search:

update saas.GlobalConfigParameters set "strData"=-1 where "id"='LatestSearchVersion';

For User Group association:

update saas.GlobalConfigParameters set "strData"=-1 where "id"='LatestUserGroupVersion';

  2. Start the service (on all nodes):

service horizon-workspace start

Mobile SSO - iOS

External Links:

❗️❗️ Do NOT use UAG in front of WS1 Access with MobileSSO for iOS scenario!

Schema

KDC Schema is here

Configuration Logic

ADMINISTRATOR CONFIGURATION

  • Following the red lines, the Administrator is going to configure REALM DNS entries (for on-premises only).
  • The admin will configure the Certificate Authority as well as the Cert templates and get the CA Cert for uploading into AirWatch and IDM.
  • The admin will configure AirWatch with the Cert Authority configurations, including setting up the certificate template and device profiles for pushing certificates.
  • Finally the admin will configure Identity Manager via the Admin Console to setup the Built-In Kerberos adapter and KDC with Cert Authority CA Root Certs from both AirWatch admin and Certificate Authority as necessary. Additionally, the admin will configure the Built-In IDP and authentication policies within Identity Manager to properly use the CA and Device certificates delivered through AirWatch to the end device.

USER ENROLLMENT

  • Next the user will enroll the device with AirWatch.
  • The enrollment request to AirWatch will request a device certificate from the Certificate authority.
  • The certificate bundle will be delivered back to the device during the completion of the enrollment.
  • Additionally, any native AirWatch applications such as Workspace ONE and other apps will be pushed to the iOS device.

USER APP LAUNCH

  • User launches an application such as SalesForce. This request gets passed down via TCP 443 through the load balancer, and hits the Built-In IDP and Built-In Kerberos Adapter. The Authentication Policy responds with the type of authentication it is expecting to see (i.e. Built-In Kerberos). This is passed back to the iOS device within a token, which then tells the device to authenticate to a specific REALM.
  • The iOS device then does a REALM lookup against DNS and gets the specific KDC server it should authenticate against.
  • The iOS device requests authentication via TCP 88 to the KDC server, sending the Device certificate to the KDC, which then responds with an authentication approval.
  • Once IDM receives the authentication approval, the app launches on the iOS device and the user goes about their day.

iOS Login Flow

iOS Login Flow Schema is here

Login Flow Steps:

  • Managed iOS device is deployed with Kerberos SSO profile, a user certificate and a vanilla SalesForce app.
  • SalesForce.com has been configured to use Access/vIDM for authentication using SAML.
  • On launch, the salesforce app connects to salesforce.com and is redirected to VMware Access/vIDM.
  • Access/vIDM evaluates the authentication policy and determines that built-in KDC authentication should be used for the request. The built-in KDC adapter sends a SPNEGO 401 response to the device.
  • iOS on the device intercepts the response and performs Kerberos PKINIT authentication using the certificate. An OCSP check can optionally be configured. This results in the device receiving a Kerberos service ticket which is submitted to the IDM built-in KDC adapter in an Authorization header.
  • The built-in KDC adapter validates the ticket and completes the authentication on IDM. The UUID is extracted from the ticket and optionally verified with WOne UEM / AirWatch. 
  • Upon successful completion of all checks, a SAML assertion is generated and sent back to the salesforce.com app. 
  • The salesforce app can then present the SAML assertion to SalesForce.com and get logged in.

DNS

Given L4 Load Balancer at 1.2.3.4 and a DNS domain of example.com and a realm of EXAMPLE.COM:

kdc.example.com.              1800 IN AAAA  ::ffff:1.2.3.4
kdc.example.com.              1800 IN A     1.2.3.4
_kerberos._tcp.example.com.        IN SRV   10 0 88 kdc.example.com.
_kerberos._udp.example.com.        IN SRV   10 0 88 kdc.example.com.

❗️ There MUST be an IPv6 AAAA entry - iPhone requires it to work correctly.

The AAAA entry may have to be converted to IPv6 format: ::ffff:102:304 ⭐️ Use a web tool: https://www.ultratools.com/tools/ipv4toipv6

Troubleshooting DNS records

To test the DNS settings, you can use the dig command (built-in to Mac and Linux) or NSLOOKUP on Windows.

Here is what the dig command looks like for DNS records:


dig SRV _kerberos._tcp.example.com
dig SRV _kerberos._udp.example.com

# You may wish to define the name server to use with dig by using the following command:
dig @ns1.no-ip.com SRV _kerberos._tcp.example.com

Checking DNS entries with tools

Use Google Toolbox to check the SRV entries:

https://toolbox.googleapps.com/apps/dig/#SRV/_kerberos._tcp.example.com

The result must contain a string like:


;ANSWER
_kerberos._tcp.example.com. 3599 IN SRV 10 0 88 krb.example.com.

Use NC to check UDP/Kerberos services:


nc -u -z kdc.example.com 88

# Answer:
#> Connection to kdc.example.com port 88 [udp/kerberos] succeeded!

Use nslookup to check the SRV entries:


nslookup -q=srv _kerberos._tcp.vmwareidentity.eu
Server:		10.26.28.233
Address:	10.26.28.233#53

Non-authoritative answer:
_kerberos._tcp.vmwareidentity.eu	service = 10 0 88 kdc.vmwareidentity.eu.

Certificates

Cert Trust Schema is here

CA Requirements (including CloudKDC)

Mobile Device Profile

The mobile device profile for the Cloud KDC feature must include the following:

  • Add KDC server root certificate from previous step to credentials
  • Add credential for Certificate Authority issued certificate
  • Configure the SSO profile
  • A user principal name that does not contain an “@” symbol, i.e., it is typically set to “{EnrollmentUser}".
  • The realm name that identifies the site where the tenant is deployed. This is the domain name for the site in upper case, e.g., VMWAREIDENTITY.COM. 
  • Specify the Kerberos principal name: {EnrollmentUser}
  • Specify the certificate credential
  • Specify the URL List: https://tenant.example.com/
  • The URLPrefixMatches value in the Kerberos dictionary in the com.apple.sso payload type must include https://tenant-vidm-url/ where “tenant-vidm-url” is the FQDN of the tenant’s WS1 Access address. Note that this FQDN may be different than the one that is used for the realm, e.g., workspaceair.com vs. vmwareidentity.com

Specify a list of app bundle ids:

  • com.apple.mobilesafari (enables access from the Safari browser)
  • com.apple.SafariViewController (enables apps that use Safari View Controller)
  • com.air-watch.appcenter (enables the Workspace ONE native app)

OCSP Requirements

The CA OCSP responder must be accessible on the Internet so that it can be accessed by the KDC. The URL for the OCSP responder is provided in the client certificate in the usual manner.

WS1 UEM has confirmed that the AW CA will meet this requirement.

KDC in vIDM Schema (old)

Port 88 must be open for both TCP and UDP. Use a packet sniffer to check packets from the device.
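A hedged sketch to confirm the Kerberos requests actually reach the appliance, assuming tcpdump is available on the node (any other sniffer works as well):

tcpdump -ni any port 88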

AirWatch-KDC Schema here

Enable KDC

For a single Workspace ONE environment or primary domain:


/etc/init.d/vmware-kdc init --realm MY-IDM.EXAMPLE.COM
# Restart horizon-workspace service:
service horizon-workspace restart
# Start KDC service:
service vmware-kdc start

Full instructions link

Reinitialize the KDC

Log in to the console (or SSH and su to root) and do the following:

a. Stop the vmware-kdc service
b. Force reinitialization of the KDC service (note the use of the '--force' switch at the end)
c. Restart the vmware-kdc service
d. Restart the horizon-workspace service

⭐️ Last 2 steps can be replaced with rebooting the appliance.


service vmware-kdc stop
/etc/init.d/vmware-kdc init --realm EXAMPLE.COM --subdomain idm.example.com --force
service vmware-kdc start
service horizon-workspace restart
#Or just type reboot to restart the appliance.

Enable KDC using subdomain

This is useful for multiple Workspace ONE environments (Prod, UAT, DEV). Remember that the realm must be all CAPS and the subdomain all lowercase.

  • /etc/init.d/vmware-kdc init --force --realm SUBDOMAIN.IDM.CUSTOMER.COM --subdomain subdomain.idm.customer.com
  • Restart the horizon-workspace service: service horizon-workspace restart
  • Start the KDC service: service vmware-kdc start

Flags:

  • --force (force init, erasing the previously initialized KDC)
  • --realm (by convention this is all caps; this is the Kerberos realm and is usually the DNS domain of the customer. It can also be a DNS sub-domain if the customer needs to have multiple Kerberos realms for multiple systems)
  • --subdomain (this is not related to a subdomain in the conventional sense; this should be the WS1 Access FQDN that the end users use)

Change KDC using subdomain

Please be aware that this procedure is unnecessary if the vmware-kdc init command is run with the correct values in the first place. If a reinitialization is necessary, then there are two better options.                                                                      

If the KDC CA cert doesn’t have to be maintained, then just run vmware-kdc init --force with the right arguments, or

If the KDC CA cert does need to be maintained (so that the WS1 UEM profile doesn't have to be updated), then copy the kdc-cacert.pem, kdc-cert.pem, and kdc-certkey.pem files from /opt/vmware/kdc/conf to a temporary area, make them readable by the horizon user, and then run vmware-kdc init --force but use the arguments that allow the certs to be passed in (--kdc-cacert, --kdc-cert, and --kdc-certkey).

How to update kdc subdomain without regenerating certificates

Navigate to the kdc-init.input file and change the subdomain there to the new value

  1. service vmware-kdc update
  2. service vmware-kdc restart
  3. Enter command /etc/init.d/vmware-kdc kadmin
  4. Next command: addprinc -clearpolicy -randkey +requires_preauth HTTP/(insertnewsubdomain)
  5. Next command: ktadd -k kdc-adapter.keytab HTTP/(insertnewsubdomain)
  6. Next command: listprincs
  7. You should see the HTTP/(newsubdomain) after the listprincs command. This confirms a successful update of the subdomain.

Old Windows-based IDM

Windows VIDM uses the cloud KDC which only requires the appliance to reach outbound on 443 and 88. The clients also need to be able to reach outbound on 443 and 88.

REALM used in WS1 UEM and in Windows VIDM is: OP.VMWAREIDENTITY.COM

Troubleshooting MobileSSO for iOS

Reasons for failure

  1. The device isn’t enrolled
  2. The device is enrolled but cannot access port 88 on the VIDM service
  3. Response from port 88 on the VIDM service cannot reach device
  4. DNS entries not configured correctly
  5. The device is enrolled but the profile has the wrong realm
  6. The device is enrolled but doesn’t have a certificate in the profile
  7. The device is enrolled but has the wrong certificate in the profile
  8. The device is enrolled but the Kerberos principal is wrong
  9. The certificate SAN values cannot be matched with the Kerberos principal
  10. The device is enrolled but the KdcKerberosAuthAdapter has not been configured with the issuer certificate
  11. The device is enrolled but the user’s certificate has been revoked (via OCSP)
  12. The device is enrolled but doesn’t send the certificate during login
  13. The device is enrolled but the app bundle id for the app isn’t in the profile
  14. Time is not synced between the device and the KDC (this will cause the TGT to be issued and a TGS_REQ to be made, but the device will reject the service ticket when asking for an “http/fqdn” ticket)
  15. Time is not synced between KDC nodes (this will cause save of adapter settings to fail)
  16. KDC subdomain value is set incorrectly (this causes the TGS_REQ to be rejected)

General checks

  • What is not working?

    1. Unable to save Mobile SSO for iOS adapter settings
      1. Is the error message “Checksum failed”?
        1. Is this a cluster?
          1. How was the data copied from one node to another?  / Was “vmware-kdc init” run on multiple nodes?
            1. Use dump/load to replicate the data to all nodes rather than running init on all nodes.
    2. Device cannot log in via Safari
    3. Device cannot log in via the Workspace ONE app
    4. Device cannot log in via some other native app
      1. Is the app bundle id included in the profile
      2. Does the app use Safari View Controller?
        1. Is com.apple.SafariViewController included in the list of app bundle ids in the profile?
  • If the device cannot log in, why is the login failing?

    1. Device makes no request at all to the KDC
      1. vIDM policy not set up correctly
      2. Device cannot reach the KDC 
        1. DNS lookup issue
        2. Networking issue with reaching the KDC
    2. AS_REQ received but certificate isn’t validated
      1. Does the certificate have acceptable SAN values?
      2. Is OCSP used? 
        1. Is the OCSP server reachable?
        2. Is the OCSP server validating the certificate?
    3. TGT issued but TGS_REQ not issued
      1. Is the subdomain set correctly?
    4. Service ticket ignored by the device
      1. Is the time on the KDC service sync’d with the device?
  • Is it failing for all devices or just a subset or one device?

  • What version of iOS is being used?

    1. iOS 9.3.5 is known not to work for Safari
    2. iOS 10.3.3 has a known issue with native apps that use SafariViewController
  • Can the iOS device reach the KDC? (use the packet sniffer from the section below to check)

    1. There may be a networking issue. Sometimes firewalls or load balancers do not pass through UDP and TCP traffic on port 88
  • Are the DNS SRV entries set up correctly? (a dig sketch follows this list of checks)

  • Is the subdomain value for the on-premises KDC set correctly?

  • If using Hybrid KDC, does registration fail to save the adapter configuration?

    1. Is the vIDM service node able to resolve the _mtkadmin._tcp. SRV entry?
    2. Is the vIDM service node able to reach the https://hybrid-kdc-admin./mtkadmin/rest/health URL?
    3. Is the vIDM service node able to reach UDP port 88 for the KDC?
    4. Are responses from the KDC with source UDP port 88 able to reach the vIDM service node?
  • Is the AirWatch tunnel being used? 

    1. Is the tunnel bypassing port 88 traffic to allow it to reach the KDC?
  • Is the REALM in the KDC and in DNS all CAPITAL LETTERS, and the subdomain all lowercase?

  • If you are using {Enrollment User} as the Kerberos principal in the device profile and {Enrollment User} is mapped to the sAMAccountName, and the sAMAccountname does not match the first portion of the user’s UPN – which is everything before the @ sign – then mobile SSO will fail.  

  • For certificate authentication to work (Android SSO included), the SAN in the user’s certificate must contain a UserPrincipalName value set to the user’s UPN or email; if the UPN is missing, the SAN must contain the user’s email address.

  • When creating the certificate template in ADCS, base it on the Kerberos Authentication template with the following modifications: change the template name, set the Subject Name to be supplied in the request, add a new “Kerberos Client Authentication” application policy (OID = 1.3.6.1.5.2.3.4), and add the AirWatch service account to the list of users that can use the certificate.

    ❗️The application policy OID must be exactly as specified
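
To verify that an issued user certificate carries the expected application policy (EKU) and SAN values, inspect it with openssl; user-cert.pem is a placeholder for an exported user certificate in PEM format, and the 1.3.6.1.5.2.3.4 OID may be printed numerically or under a PKINIT client-auth name depending on the OpenSSL build:


openssl x509 -in user-cert.pem -noout -text | grep -A 2 'Extended Key Usage'
openssl x509 -in user-cert.pem -noout -text | grep -A 1 'Subject Alternative Name'

The dig sketch referenced above for the DNS SRV check, assuming the standard Kerberos SRV record naming (_kerberos._udp/_tcp under the realm in lowercase; adjust the names to whatever records your deployment actually uses, example.com is a placeholder):


dig -t SRV _kerberos._udp.example.com +short
dig -t SRV _kerberos._tcp.example.com +short

The records should typically point at the WS1 Access FQDN on port 88.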

WS1 Logs

Built-in Kerberos adapter log: /opt/vmware/horizon/workspace/logs/horizon.log
KDC log: /var/log/messages; grep for ‘pkinit’ to see the errors returned:


grep -i 'pkinit' /var/log/messages

This may return errors like vmw_service_matches_map: “HTTP/idm.test.ws@TEST.WS” is not a match for the service_regex or Server not found in Kerberos database. In both cases, either DNS is not set up correctly or the built-in KDC server is not properly initialized.

How to Enable Kerberos iOS Device Logs to troubleshoot Mobile SSO

When troubleshooting Mobile SSO for iOS issues, install the GSS Debug Profile on the device to retrieve more verbose Kerberos logs.

  • Copy this config file to your desktop.
  • Email the config file to the device.
  • From the device, tap on the attachment, tap on Install.  Enter the device passcode when prompted.
  • Tap on Install in the upper right on the Apple Consent notification.

Link: https://developer.apple.com/services-account/download?path=/iOS/iOS_Logs/Enterprise_SSO_and_Kerberos_Logging_Instructions.pdf ❗️Action requires Apple Dev Account!

Packet Sniffing

There is no tcpdump on WS1 Access, and it cannot be installed because of dependency problems (written & tested on the SUSE Enterprise Linux version of WS1 Access).

Use the Python 2 interpreter on the appliance to run your own packet sniffer for Access/vIDM (a fresh build of the script is kept on GitHub):


# Filename = sniffer.py
# Packet sniffer script 0.3
# Made by Alexei Rybalko for vIDM-Access Server
# Based on SUSE Ent. Linux 11 with python2

# Usage (raw sockets require root):
# python sniffer.py 192.168.1.1
# Will sniff any packets going from or coming into IP=192.168.1.1, includes ping-ICMP/TCP/UDP

from __future__ import print_function
import socket
import sys
from struct import unpack

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Enter IP Address to filter packets from!")
        sys.exit(1)
    filter_ip = sys.argv[1]

    # Raw socket on all interfaces; 0x0003 = ETH_P_ALL (capture every Ethernet protocol)
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))

    while True:
        packet = s.recvfrom(65535)[0]

        # Ethernet header = first 14 bytes; the third field is the EtherType
        eth = unpack('!6s6sH', packet[:14])
        if eth[2] != 0x0800:  # not IPv4, skip
            continue

        # Fixed 20-byte part of the IPv4 header follows the Ethernet header
        iph = unpack('!BBHHHBBH4s4s', packet[14:34])
        iph_length = (iph[0] & 0x0F) * 4  # IHL field gives the real IP header length
        protocol = iph[6]
        s_addr = socket.inet_ntoa(iph[8])
        d_addr = socket.inet_ntoa(iph[9])
        # print("Source IP: " + s_addr)
        # print("Destination IP: " + d_addr)

        # IP address filter: only packets to or from the address given as the script argument
        if s_addr != filter_ip and d_addr != filter_ip:
            continue

        offset = 14 + iph_length  # start of the transport-layer header

        if protocol == 6:  # TCP
            tcph = unpack('!HHLLBBHHH', packet[offset:offset + 20])
            print("--TCP--")
            print("Source port: " + str(tcph[0]))
            print("Destination port: " + str(tcph[1]))

        elif protocol == 1:  # ICMP
            icmph = unpack('!BBH', packet[offset:offset + 4])
            print("--ICMP--")
            print("Type: " + str(icmph[0]))
            print("Code: " + str(icmph[1]))

        elif protocol == 17:  # UDP
            udph = unpack('!HHHH', packet[offset:offset + 8])
            print("--UDP--")
            print("Source port: " + str(udph[0]))
            print("Destination port: " + str(udph[1]))

        else:
            print("Unknown Protocol!")
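
The script prints every ICMP, TCP and UDP packet going to or from the given address, so running it during a Mobile SSO attempt shows whether port 88 traffic from the device actually reaches the appliance. Run it as root, and redirect the output to a file if the device is chatty (10.1.2.3 is a placeholder device IP):


python sniffer.py 10.1.2.3 > /tmp/sniff_device.log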