[00:01:20] (PS4) Alex Monk: openstack mwopenstackclients: Remove unused methods provided by designateclient [puppet] - https://gerrit.wikimedia.org/r/513911 (https://phabricator.wikimedia.org/T224708)
[00:41:19] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1109 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:22:15] Operations, Patch-For-Review, User-Ladsgroup, User-Urbanecm, Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (Ladsgroup) >>! In T218155#5228955, @Ankry wrote: > @Ladsgroup Are there any tickets describing what are the script problems that this task shou...
[01:55:13] Operations, ops-codfw, Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (herron) a:herron→Papaul Today I tried to perform OS installs on kafka-main200[345] but was not seeing DHCP requests from these hosts make it to the installNNNN hosts yet...
[02:33:29] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 167.5 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[02:38:25] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (bd808)
[04:02:05] (PS1) BryanDavis: wiki replicas: Add specialized views of the "comment" table [puppet] - https://gerrit.wikimedia.org/r/513943 (https://phabricator.wikimedia.org/T224850)
[04:33:57] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 95.3 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[04:43:17] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[04:44:41] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
https://wikitech.wikimedia.org/wiki/Mailman
[04:53:12] (PS2) Marostegui: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422)
[04:53:24] (PS2) Marostegui: db2037: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720)
[04:54:37] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[04:55:29] (Merged) jenkins-bot: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[04:56:35] (CR) jenkins-bot: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513676 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[04:56:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool es1019 T213422 (duration: 00m 51s)
[04:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:54] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422
[04:58:55] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[05:01:03] (CR) Effie Mouzeli: [C: +1] mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[05:03:03] (CR) Marostegui: [C: +2] db2037: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513573 (https://phabricator.wikimedia.org/T224720) (owner: Marostegui)
[05:04:37] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman
queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[05:04:48] !log Stop MySQL on db2037 for decommission T224720
[05:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:53] T224720: Decommission db2037 - https://phabricator.wikimedia.org/T224720
[05:05:08] !log Remove db2037 from tendril and zarcillo T224720
[05:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:39] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui) p:Triage→Normal
[05:06:47] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui)
[05:07:05] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui) a:Marostegui→RobH This host is ready for DCOPs to take over.
[05:10:00] (PS1) Marostegui: db-eqiad.php: More traffic to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513944
[05:12:03] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513944 (owner: Marostegui)
[05:12:10] (PS1) Marostegui: db1128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/513945 (https://phabricator.wikimedia.org/T222682)
[05:13:01] (Merged) jenkins-bot: db-eqiad.php: More traffic to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513944 (owner: Marostegui)
[05:13:09] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[05:13:15] (CR) jenkins-bot: db-eqiad.php: More traffic to es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513944 (owner: Marostegui)
[05:13:19] (CR) Marostegui: [C: +2]
db1128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/513945 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:14:35] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[05:15:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to es1019 T213422 (duration: 00m 46s)
[05:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:26] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422
[05:17:07] !log Upgrade MariaDB on codfw hosts in preparation for s4 master failover T217396
[05:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:14] T217396: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396
[05:41:03] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:41:31] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:41:36] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) p:Triage→Normal
[05:42:20] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:43:05] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) codfw hosts have been all upgraded to 10.1.39
[05:43:59] (PS1) Marostegui: db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513946 (https://phabricator.wikimedia.org/T213422)
[05:45:26] !log Upgrade mariadb on dbstore1004 - T224852
[05:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:32] T224852: Failover s4
primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:47:52] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) eqiad hosts that need upgrade [] db1081 [] db1084 [] db1091 [] db1097 [] db1103 [] db1121 [] db1125 [] labsdb1012 [] dbstore1004
[05:48:42] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513946 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[05:49:30] (Merged) jenkins-bot: db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513946 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[05:49:44] (CR) jenkins-bot: db-eqiad.php: Fully repool es1019 [mediawiki-config] - https://gerrit.wikimedia.org/r/513946 (https://phabricator.wikimedia.org/T213422) (owner: Marostegui)
[05:50:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool es1019 T213422 (duration: 00m 46s)
[05:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:35] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422
[05:50:41] Operations, DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui)
[05:51:28] Operations, ops-eqiad, Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (Marostegui) Host fully repooled with its original weight.
[05:55:45] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513947 (https://phabricator.wikimedia.org/T224852)
[05:56:38] (CR) Giuseppe Lavagetto: [C: +1] mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[05:57:18] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513947 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:58:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513947 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:58:22] (CR) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513947 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[06:00:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 for upgrade (duration: 00m 47s)
[06:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:52] !log Stop MySQL on db1081 for upgrade - T224852
[06:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:57] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[06:07:21] (PS1) Giuseppe Lavagetto: mediawiki::php: sync settings for all appservers [puppet] - https://gerrit.wikimedia.org/r/513949
[06:08:13] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: sync settings for all appservers [puppet] - https://gerrit.wikimedia.org/r/513949 (owner: Giuseppe Lavagetto)
[06:10:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:15:37] Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[06:15:41] Operations, Traffic, serviceops, User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (jijiki) Open→Resolved a:jijiki @Jdforrester-WMF I am marking this as resolved :D
[06:16:55] Operations, Traffic, serviceops, User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (jijiki)
[06:17:52] Operations, Traffic, serviceops, User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (jijiki)
[06:20:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:22:55] (PS1) Marostegui: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513952
[06:24:44] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513952 (owner: Marostegui)
[06:25:35] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/513952 (owner: Marostegui)
[06:26:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 after upgrade (duration: 00m 46s)
[06:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:41] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] -
https://gerrit.wikimedia.org/r/513952 (owner: Marostegui)
[06:29:30] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:33:06] (PS4) Elukey: varnishkafka webrequest: log Server: in response as 'backend' [puppet] - https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: CDanis)
[06:36:02] (PS1) Marostegui: db-eqiad.php: Slowly repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513953
[06:36:52] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513953 (owner: Marostegui)
[06:37:40] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513953 (owner: Marostegui)
[06:37:54] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513953 (owner: Marostegui)
[06:39:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 into API after upgrade (duration: 00m 49s)
[06:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:17] (CR) Marostegui: "I have slowly repooled this host.
It is fully repooled now" [mediawiki-config] - https://gerrit.wikimedia.org/r/513307 (owner: Jcrespo)
[06:41:53] (CR) Elukey: [C: +2] varnishkafka webrequest: log Server: in response as 'backend' [puppet] - https://gerrit.wikimedia.org/r/511690 (https://phabricator.wikimedia.org/T224236) (owner: CDanis)
[06:43:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:44:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:44:43] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[06:44:44] !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[06:44:44] !log kartik@deploy1001 scap-helm cxserver finished
[06:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:45:43] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[06:45:45] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[06:45:46] !log kartik@deploy1001 scap-helm cxserver finished
[06:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:59] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[06:46:01] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[06:46:01] !log kartik@deploy1001 scap-helm cxserver finished
[06:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:44] !log roll restart varnishkafka (via puppet) for a config change - T224236
[06:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:49] T224236: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236
[06:50:56] (CR) Muehlenhoff: admins: remove expired contractor account of juliaglen (merge on May 31) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: Dzahn)
[06:54:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:56:26] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:00:21] (PS1) Muehlenhoff: Remove access for julianglen [puppet] - https://gerrit.wikimedia.org/r/513954
[07:01:47] (CR) Muehlenhoff: [C: +2] Remove access for julianglen [puppet] - https://gerrit.wikimedia.org/r/513954 (owner: Muehlenhoff)
[07:04:54] (PS1) Marostegui: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513955
[07:05:39] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:08:18] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513955 (owner: Marostegui)
[07:08:52] (PS2) Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/513307
[07:09:07] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513955 (owner: Marostegui)
[07:09:09] (Abandoned) Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/513307 (owner: Jcrespo)
[07:09:21] (CR) jenkins-bot: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - https://gerrit.wikimedia.org/r/513955 (owner: Marostegui)
[07:10:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1081 into API after upgrade (duration: 00m 48s)
[07:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:13:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:18:23] Operations, Analytics, Traffic, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) Does logstash need to be changed to read the new field? Cc: @fgiunchedi
[07:20:24] (CR) Vgutierrez: [C: +2] redirects.dat: Get rid of www.*.wikipedia.[com,net,info] [puppet] - https://gerrit.wikimedia.org/r/513141 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[07:21:18] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): User[juliaglen]
[07:21:25] (PS3) Vgutierrez: redirects.dat: Get rid of www.*.wikipedia.[com,net,info] [puppet] - https://gerrit.wikimedia.org/r/513141 (https://phabricator.wikimedia.org/T224539)
[07:26:11] (PS1) Marostegui: db-eqiad.php: Depool db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/513956 (https://phabricator.wikimedia.org/T224852)
[07:27:31] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/513956 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[07:28:21] (Merged) jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/513956 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[07:28:35] (CR) jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/513956 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[07:28:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of
data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:29:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103 for upgrade (duration: 00m 47s)
[07:29:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:29:55] !log Stop MySQL on db1103 (s2 and s4) for upgrade T224852
[07:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:00] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[07:30:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:31:34] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:32:44] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (jcrespo)
[07:35:08] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103" [mediawiki-config] - https://gerrit.wikimedia.org/r/513957
[07:36:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:36:39] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1103" [mediawiki-config] - https://gerrit.wikimedia.org/r/513957 (owner: Marostegui)
[07:37:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:37:35] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103" [mediawiki-config] - https://gerrit.wikimedia.org/r/513957 (owner: Marostegui)
[07:37:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103" [mediawiki-config] - https://gerrit.wikimedia.org/r/513957 (owner: Marostegui)
[07:38:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:40:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:41:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:41:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:42:58] (Abandoned) Muehlenhoff: admins: remove expired contractor account of juliaglen (merge on May 31) [puppet] - https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: Dzahn)
[07:44:44] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1001 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[07:44:52] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1001 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[07:45:22] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1001 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[07:48:50] !log Repool db1103 after upgrade T224852
[07:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:56] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[07:49:53] Operations, Analytics, Traffic, User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (elukey) Nevermind, self answered. The field is now showing up in the 50x logstash dashboard, but it seems requiring a refresh of the index list (I hovere...
[07:57:10] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:58:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:58:38] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:58:45] !log refresh field list for logstash (via kibana Management -> Index patterns -> etc..)
[07:58:46] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:49] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Ankit-Maity) Resolved→Open ` ... cp1075 cp1075, Varnish XID 1035109169 Error: 503, Backend fetch failed at Mon, 03...
[07:58:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:59:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:59:53] (PS1) Marostegui: db1064: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513958 (https://phabricator.wikimedia.org/T223217)
[07:59:56] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (MoritzMuehlenhoff) If Tensorflow works fine without hsa-ext-rocr-dev, we also have a third option, which seems cleaner and easier: - Import the existing repository (sans hsa-e...
[08:00:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:01:07] (CR) Marostegui: [C: +2] db1064: Prepare for decommission [puppet] - https://gerrit.wikimedia.org/r/513958 (https://phabricator.wikimedia.org/T223217) (owner: Marostegui)
[08:01:30] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[08:01:45] !log Remove db1064 from tendril and zarcillo T223217
[08:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:50] T223217: Decommission db1064 - https://phabricator.wikimedia.org/T223217
[08:02:16] PROBLEM -
Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[08:02:20] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) @MoritzMuehlenhoff you are completely right, forgot about that option, at this point I am +1 on proceeding with the dummy package without waiting upstream. Going to c...
[08:02:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:02:52] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:03:00] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:03:00] !log Stop MySQL on db1064 T223217
[08:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:03:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:04:15] 10Operations, 10ops-eqiad, 10decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (10Marostegui) a:05Marostegui→03RobH This host is ready for DCOPs to take over for decommission [08:04:23] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [08:07:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:07:12] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:07:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:07:26] (03CR) 10Vgutierrez: [C: 03+1] "PCC shows the expected NOOP: https://puppet-compiler.wmflabs.org/compiler1002/16834/" [puppet] - 10https://gerrit.wikimedia.org/r/513142 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [08:07:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:08:28] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:10:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:11:08] (03CR) 10Vgutierrez: [C: 03+1] "as expected PCC shows a NOOP for existing Apache servers using compile_redirects(): https://puppet-compiler.wmflabs.org/compiler1002/16835" [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [08:11:18] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:11:26] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:11:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:12:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:12:40] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10DannyS712) Just got it again `via cp1075 cp1075, Varnish XID 1031998560 Error: 503, Backend fetch failed at Mon, 03 Jun 2... 
[08:14:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:15:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:15:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:17:11] !log manually removed phab_clean_tmp from www-data's crontab on phab1001 to reduce cronspam [08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:34] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) Refreshed also the index list on kibana, no more warnings for the backend field. [08:18:38] !log cp1077: restart varnish-be [08:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:19:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:21:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:35] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10elukey) The traffic team restarted two Varnish backends, the issue should be fixed now. Thanks a lot for the reports, plea... [08:27:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:27:22] (03CR) 10Ema: [C: 03+1] "Two nits." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [08:29:46] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [08:30:48] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [08:32:05] (CR) Ema: [C: +1] ATS: Provide a unified monitoring define [puppet] - https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [08:33:01] (CR) Ema: [C: +1] ATS: Provide a unified logs define [puppet] - https://gerrit.wikimedia.org/r/510641 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [08:34:18] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/513740 (https://phabricator.wikimedia.org/T224215) (owner: Urbanecm) [08:35:51] (CR) Ema: [C: +1] "Two nits. Also, it would be great if you could briefly describe what "parent proxy" means in the commit log." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/511869 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez) [08:39:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:39:35] (CR) Ema: [C: +1] "LGTM, please explain why the change is needed in the commit log."
[puppet] - https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: Vgutierrez) [08:40:12] (PS8) Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [08:40:56] (CR) Vgutierrez: "Comments addressed, thanks for your review :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [08:41:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:41:55] (CR) Ema: [C: +1] "One nit." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez) [08:42:46] (CR) Ema: [C: +1] ATS: Avoid using traffic_layout [puppet] - https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428) (owner: Vgutierrez) [08:51:35] (PS3) Vgutierrez: ATS: Set log mode independently of log filters [puppet] - https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) [08:55:01] (CR) Ema: ATS: Provide a TLS terminator profile (4 comments) [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez) [09:09:57] Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (Aklapper) >>!
In T222308#5222534, @Aklapper wrote: > Maybe also send a heads-up to the list itself Done: https://lists.wikimedia.org/pipermail/engineering/2019-June/000701.html [09:14:48] (03PS6) 10Vgutierrez: ATS: Provide parent proxies support [puppet] - 10https://gerrit.wikimedia.org/r/511869 (https://phabricator.wikimedia.org/T221594) [09:18:14] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [09:21:31] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10fsero) yep, it could be decomm'd. Registry has being served by new servers by two weeks and i didn't see any hiccup yet so this server can go. I can take care of the decom process but i don't know if it should... [09:25:44] mmm I don't find the alarming port for cr2-eqiad in librenms [09:28:58] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10MoritzMuehlenhoff) >>! In T224562#5229575, @fsero wrote: > yep, it could be decomm'd. Registry has being served by new servers by two weeks and i didn't see any hiccup yet so this server can go. > > I can take... 
[09:29:53] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10fsero) a:03fsero [09:33:03] !log upgrading Hadoop servers to new Java security release (will be picked up by forthcoming MDS reboots) [09:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [09:43:45] !log upgrading AQS servers to new Java security release (will be picked up by forthcoming MDS reboots) [09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:48] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:46:01] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of th [09:46:01] rzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [09:46:39] !log upgrading Druid/Kafka-Jumbo servers to new Java security release (will be picked up by forthcoming MDS reboots) [09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:15] karthoterian expected ? [09:47:32] godog: I think so. 
[09:47:38] I'm depooling codfw now [09:48:02] we didn't expect that to happen this early [09:48:03] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:48:52] !log depooled maps codfw due to lag and disk issues - T224395 [09:48:53] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-8-jdk] [09:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:58] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [09:49:47] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:50:38] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:50:44] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:50:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:51:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:51:54] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:52:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:52:27] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:52:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:53:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [09:53:20] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.13:443, 10.2.1.13:6533]) https://wikitech.wikimedia.org/wiki/PyBal [09:53:22] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [09:53:46] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.13:443, 10.2.1.13:6533]) https://wikitech.wikimedia.org/wiki/PyBal [09:54:36] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [09:56:48] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [09:58:00] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m+ffffff@2x.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:59:10] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:59:24] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:59:48] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 3.720 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:00:19] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps% [10:02:21] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=kartotherian [10:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=kartotherian [10:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at 
codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:04:06] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:04:15] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:04:20] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:04:42] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:05:00] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:05:14] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:05:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:05:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:06:18] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:06:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:09:51] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 2.798 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:09:56] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.879 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:10:06] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:10:16] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:10:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:10:52] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:11:41] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [10:11:52] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 
[10:15:54] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:16:10] (PS64) Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:16:12] (PS6) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) [10:16:14] (PS1) Vgutierrez: ATS: Include ATS tls instance in upload_ats role [puppet] - https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [10:16:44] (CR) Vgutierrez: ATS: Provide a TLS terminator profile (5 comments) [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez) [10:18:34] (PS65) Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:18:36] (PS2) Vgutierrez: ATS: Include ATS tls instance in upload_ats role [puppet] - https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [10:18:38] (PS7) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) [10:23:19] !log bmansurov@deploy1001 Started deploy [recommendation-api/deploy@5046f3c]: Update the recommendation API service [10:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:17] (PS1) Arturo Borrero Gonzalez: prometheus: pdns_exporter: use hiera keys to define prometheus nodes [puppet] - https://gerrit.wikimedia.org/r/513971 (https://phabricator.wikimedia.org/T224743) [10:26:34] !log bmansurov@deploy1001 Finished deploy [recommendation-api/deploy@5046f3c]: Update the recommendation API service (duration: 03m 15s) [10:26:38]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/compiler1001/16836/ (NOOP)" [puppet] - 10https://gerrit.wikimedia.org/r/513971 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [10:28:50] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10WMDE-leszek) Hello @akosiaris, this is a friendly ping from your favourite WMDE customers :) Any news on the deployment front?... [10:29:06] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T1030). Please do the needful. 
[10:34:37] !log upgrading Elastic servers to new Java security release [10:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:28] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513972 (https://phabricator.wikimedia.org/T128546) [10:36:02] !log Restarting php7.2-fpm in codfw in batches of 2 for 513949 [10:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:38:07] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513972 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:42] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1002 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search [10:39:00] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513972 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:08] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1002 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search [10:39:16] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513972 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:22] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1002 is OK: SSL OK - Certificate cloudelastic1001.wikimedia.org valid until 2019-09-01 06:00:17 +0000 (expires in 89 days) 
https://wikitech.wikimedia.org/wiki/Search [10:40:49] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:513972| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:55] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:41:37] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:513972| Bumping portals to master (T128546)]] (duration: 00m 47s) [10:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:11] !log upgrading prometheus-trafficserver-exporter in upload_ats ulsfo instances [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:08] (03CR) 10Gergő Tisza: "wikimedia.modular.im is up now." [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) (owner: 10Volans) [10:47:11] (03PS1) 10Arturo Borrero Gonzalez: hieradata: cloudservices: reallocate prometheus_nodes definition [puppet] - 10https://gerrit.wikimedia.org/r/513973 (https://phabricator.wikimedia.org/T224743) [10:47:29] !log upgrading WDQS servers to new Java security release [10:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: cloudservices: reallocate prometheus_nodes definition [puppet] - 10https://gerrit.wikimedia.org/r/513973 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [10:49:44] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [10:50:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:52:04] !log upgrading maps servers to new Java security release [10:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:55:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:58:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "[not a blocker] You've added sessionstore.discovery.wmnet twice to the SAN. Not a biggie but it's some useless bytes in the cert :)" [puppet] - 10https://gerrit.wikimedia.org/r/513323 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [10:58:58] (03PS9) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [10:59:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:00:04] Amir1 and Lucas_WMDE: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:26] o/ [11:00:45] o/ [11:02:25] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [11:02:39] (03PS10) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [11:02:55] I'll swat my patches! [11:02:56] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513720 (https://phabricator.wikimedia.org/T224801) (owner: 10DannyS712) [11:03:53] (03Merged) 10jenkins-bot: Add "Zerrenda" (list) namespace to VisualEditor on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513720 (https://phabricator.wikimedia.org/T224801) (owner: 10DannyS712) [11:03:57] o/ [11:04:12] (03CR) 10jenkins-bot: Add "Zerrenda" (list) namespace to VisualEditor on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513720 (https://phabricator.wikimedia.org/T224801) (owner: 10DannyS712) [11:05:54] (03PS18) 10Vgutierrez: ATS: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [11:06:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: 10DannyS712) [11:06:17] (03PS5) 10Urbanecm: Add 5 active namespaces for VisualEditor on en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: 10DannyS712) [11:06:27] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: 10DannyS712) [11:06:40] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: [[:gerrit:513720|Add "Zerrenda" (list) namespace to VisualEditor on euwiki]] (T224801) (duration: 00m 48s) [11:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:46] T224801: Enable VisualEditor at Zerrenda: (List) namespace in euwiki - https://phabricator.wikimedia.org/T224801 [11:07:12] 10Operations, 10DNS, 10Matrix, 10Traffic, and 2 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Volans) Both records are actually up now: `lang=bash $ dig +trace wikimedia.modular.im [...SNIP...] wikimedia.modular.im. 300 IN A 52.56.1... [11:07:33] (03Merged) 10jenkins-bot: Add 5 active namespaces for VisualEditor on en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: 10DannyS712) [11:07:47] (03CR) 10jenkins-bot: Add 5 active namespaces for VisualEditor on en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503680 (https://phabricator.wikimedia.org/T220881) (owner: 10DannyS712) [11:09:20] (03PS2) 10Volans: Matrix wikimedia.org IDs domain authorization [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) [11:10:22] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513740 (https://phabricator.wikimedia.org/T224215) (owner: 10Urbanecm) [11:11:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:503680|Add 5 active namespaces for VisualEditor on en.wikiversity]] (T220881) (duration: 00m 48s) [11:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:05] T220881: en.wikiversity VisualEditor Changing Active Namespaces - https://phabricator.wikimedia.org/T220881 [11:11:14] (03Abandoned) 10Ladsgroup: Add sentinel profile and role [puppet] - 10https://gerrit.wikimedia.org/r/477415 
(https://phabricator.wikimedia.org/T210580) (owner: 10Ladsgroup) [11:11:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) (owner: 10Volans) [11:12:17] (03PS2) 10Urbanecm: Add Wikiprojekti namespace to wgExtraSignatureNamespaces for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513740 (https://phabricator.wikimedia.org/T224215) [11:12:39] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "Bypassing jenkins to save time, already passed above." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513740 (https://phabricator.wikimedia.org/T224215) (owner: 10Urbanecm) [11:12:56] (03CR) 10jenkins-bot: Add Wikiprojekti namespace to wgExtraSignatureNamespaces for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513740 (https://phabricator.wikimedia.org/T224215) (owner: 10Urbanecm) [11:14:21] (03PS1) 10Ema: Add 0019-vary-stevedore-mem-leak.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513976 [11:15:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:513740|Add Wikiprojekti namespace to wgExtraSignatureNamespaces for fiwiki]] (T224215) (duration: 00m 47s) [11:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:32] T224215: Enable adding signature with Visual Editor in Finnish Wikipedia Wikiprojekti namespace - https://phabricator.wikimedia.org/T224215 [11:15:47] !log EU SWAT done [11:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:58] (03CR) 10Gergő Tisza: [C: 03+1] Matrix wikimedia.org IDs domain authorization [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) (owner: 10Volans) [11:17:49] !log Restarting php7.2-fpm in eqiad in batches of 2 for 513949 [11:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:11] (03CR) 10Volans: [C: 03+2] Matrix wikimedia.org IDs domain 
authorization [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) (owner: 10Volans) [11:22:14] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Volans) 05Open→03Resolved a:03Volans Change is live: ` L| 0 ~$ dig @ns0.wikimedia.org SRV _matrix._tcp.wikimed... [11:22:58] (03CR) 10jerkins-bot: [V: 04-1] Add 0019-vary-stevedore-mem-leak.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513976 (owner: 10Ema) [11:24:51] (03PS1) 10Ema: Add 0020-assert-error-http1_minimal_response.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513977 (https://phabricator.wikimedia.org/T224694) [11:28:45] !log reboot relforge for microcode + jvm upgrade [11:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:05] (03CR) 10jerkins-bot: [V: 04-1] Add 0020-assert-error-http1_minimal_response.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513977 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [11:33:46] (03PS1) 10Arturo Borrero Gonzalez: cloudservices: use unified prometheus_pdns_rec_exporter [puppet] - 10https://gerrit.wikimedia.org/r/513979 (https://phabricator.wikimedia.org/T224743) [11:36:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16838/ PCC as expected" [puppet] - 10https://gerrit.wikimedia.org/r/513979 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [11:39:29] (03CR) 10Gehel: [C: 03+1] "LGTM, let's see if volans has more comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [11:41:39] (03PS1) 10Arturo Borrero Gonzalez: cloudservices: use prometheus pdns exporters in all roles [puppet] - 10https://gerrit.wikimedia.org/r/513980 
(https://phabricator.wikimedia.org/T224743) [11:41:41] 10Operations, 10Maps: Maps2004 ran into disk space issues again after reimaging with new partitioning scheme - https://phabricator.wikimedia.org/T224874 (10Mathew.onipe) [11:49:28] 10Operations, 10Maps: Maps2004 ran into disk space issues again after reimaging with new partitioning scheme - https://phabricator.wikimedia.org/T224874 (10Mathew.onipe) p:05Triage→03High [11:53:14] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [11:53:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/16839/" [puppet] - 10https://gerrit.wikimedia.org/r/513980 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [11:56:44] uh wut [11:59:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Good point. Indeed it won't harm and it's probably more pain that it's worth. Merging. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/513323 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [11:59:15] (03PS2) 10Alexandros Kosiaris: Add sessionstore.discovery.wmnet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/513323 (https://phabricator.wikimedia.org/T220401) [12:07:58] (03PS2) 10DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 [12:08:00] (03PS1) 10DCausse: [cirrus] remove unused wgCirrusSearchRequestEventSampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 [12:08:19] (03Abandoned) 10DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512196 (owner: 10DCausse) [12:14:22] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:14:24] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:14:42] akosiaris: yours? ^^^ [12:15:14] (03PS1) 10Arturo Borrero Gonzalez: prometheus-pdns-exporter: add missing python dependency [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513984 [12:15:16] (03PS1) 10Arturo Borrero Gonzalez: postinst: specify nonexistent home directory [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513985 [12:15:18] (03PS1) 10Arturo Borrero Gonzalez: d/: drop upstart file [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513986 [12:15:20] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.4 stretch-wikimedia [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513987 [12:15:46] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:15:50] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:18:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus-pdns-exporter: add missing python dependency [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513984 (owner: 10Arturo Borrero Gonzalez) [12:19:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] postinst: specify nonexistent home directory [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513985 (owner: 10Arturo Borrero Gonzalez) [12:19:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/: drop upstart file [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513986 (owner: 10Arturo Borrero Gonzalez) [12:19:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.4 stretch-wikimedia [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513987 (owner: 10Arturo Borrero Gonzalez) [12:21:52] (03CR) 10Muehlenhoff: postinst: specify nonexistent home directory (031 comment) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/513985 (owner: 10Arturo Borrero Gonzalez) [12:23:14] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [12:24:41] !log add prometheus-pdns-exporter v0.4 to stretch-wikimedia (T224877) [12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:54] T224877: prometheus-pdns-exporter: add stretch support - https://phabricator.wikimedia.org/T224877 [12:26:51] (03PS1) 10Alexandros Kosiaris: kask: Add affinity/tolerations headings [deployment-charts] - 10https://gerrit.wikimedia.org/r/513988 (https://phabricator.wikimedia.org/T220401) [12:26:53] (03PS1) 10Alexandros Kosiaris: Bump kask to 0.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513989 [12:47:37] (03PS1) 10Arturo Borrero Gonzalez: d/postinst: specify /nonexistent dir as home [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513991 [12:47:39] (03PS1) 10Arturo Borrero 
Gonzalez: d/control: specify python depends [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513992 [12:47:41] (03PS1) 10Arturo Borrero Gonzalez: d/rules: prevent dh_installinit from installing sysvinit files [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513993 [12:47:43] (03PS1) 10Arturo Borrero Gonzalez: d/: drop upstart file [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513994 [12:47:45] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.7 stretch-wikimedia [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513995 (https://phabricator.wikimedia.org/T224877) [12:48:55] 10Operations, 10Maps: Maps2004 ran into disk space issues again after reimaging with new partitioning scheme - https://phabricator.wikimedia.org/T224874 (10Mathew.onipe) Looking deep into maps2004 postgres database, the problem was traced to gis.planet_osm_line table being larger than normal when compared with... 
[12:51:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/postinst: specify /nonexistent dir as home [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513991 (owner: 10Arturo Borrero Gonzalez) [12:53:14] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [12:53:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/control: specify python depends [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513992 (owner: 10Arturo Borrero Gonzalez) [12:53:32] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-new/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513996 (https://phabricator.wikimedia.org/T188327) [12:53:51] (03CR) 10Anomie: [C: 03+2] "Deploy planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513996 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [12:54:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/rules: prevent dh_installinit from installing sysvinit files [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513993 (owner: 10Arturo Borrero Gonzalez) [12:54:53] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513996 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [12:55:58] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-new/read-new on remaining wikis (T188327) (duration: 00m 48s) [12:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:06] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [12:56:37] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513996 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [12:59:23] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/: drop upstart file [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513994 (owner: 10Arturo Borrero Gonzalez) [12:59:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.7 stretch-wikimedia [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/513995 (https://phabricator.wikimedia.org/T224877) (owner: 10Arturo Borrero Gonzalez) [13:03:13] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [13:03:14] !log add prometheus-pdns-rec-exporter v0.7 to stretch-wikimedia (T224877) [13:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:19] T224877: prometheus-pdns-exporter: add stretch support - https://phabricator.wikimedia.org/T224877 [13:19:03] !log Move db2078:3321 under db2062 T220170 [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:09] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 [13:23:58] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Tgr) 05Resolved→03Open It seems the documentation is outdated and only the `.well-known` method works with Modul... 
[13:30:51] (03PS1) 10Marostegui: db-eqiad.php: Add db1138 to API in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514000 (https://phabricator.wikimedia.org/T224852) [13:34:38] (03PS2) 10Mholloway: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) [13:35:52] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Add db1138 to API in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514000 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [13:36:40] (03Merged) 10jenkins-bot: db-eqiad.php: Add db1138 to API in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514000 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [13:37:05] (03CR) 10jenkins-bot: db-eqiad.php: Add db1138 to API in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514000 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [13:37:09] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:37:58] (03CR) 10Mholloway: [C: 04-2] WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:38:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1138 into s4 API (duration: 00m 48s) [13:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:34] (03CR) 10Mholloway: [C: 04-2] "looks like i might have collided with another deployment, holding off for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:40:30] marostegui: o/ are you all done deploying for the moment? 
i was just about to deploy a quick extension config change [13:40:47] mdholloway: Yeah, all done [13:40:51] marostegui: great, thanks! [13:41:01] thanks! [13:41:01] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:42:19] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:42:36] (03CR) 10jenkins-bot: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513645 (https://phabricator.wikimedia.org/T218599) (owner: 10Mholloway) [13:44:52] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Drop caption edit counter unlock delay to 0 (duration: 00m 49s) [13:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:02] (03PS1) 10Marostegui: db2062: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514003 [13:50:02] (03PS2) 10Marostegui: db2062: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514003 [13:51:04] (03CR) 10Marostegui: [C: 03+2] db2062: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514003 (owner: 10Marostegui) [13:53:38] !log draining ganeti1001 for eventual reboot to MDS-enabled Linux kernel [13:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:53:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:55] (03PS1) 10Ema: vcl: 
temporarily block abusive User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/514005 [13:59:13] (03CR) 10CDanis: [C: 03+1] vcl: temporarily block abusive User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/514005 (owner: 10Ema) [13:59:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] vcl: temporarily block abusive User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/514005 (owner: 10Ema) [13:59:31] (03PS2) 10Ema: vcl: temporarily block abusive User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/514005 [13:59:52] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5228040, @akosiaris wrote: >>>! In T220401#5226623, @Eevans wrote: >>>>! In T220401#5226531, @... [14:00:00] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) Quick example on use cases and such, last week in {T224725} there were some artifacts that members of secteam wanted t... [14:00:14] (03CR) 10Ema: [C: 03+2] vcl: temporarily block abusive User-Agent [puppet] - 10https://gerrit.wikimedia.org/r/514005 (owner: 10Ema) [14:00:33] (03PS8) 10Rush: admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) [14:01:35] (03CR) 10jerkins-bot: [V: 04-1] admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush) [14:05:58] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) >>! In T220401#5230202, @Eevans wrote: >>>! In T220401#5228040, @akosiaris wrote: >>>>! In T220401#5226623... 
[14:10:31] (03PS6) 10Bstorm: wiki replicas: Remove reference to old user fields [puppet] - 10https://gerrit.wikimedia.org/r/510595 (https://phabricator.wikimedia.org/T223406) (owner: 10Anomie) [14:12:25] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10hashar) For some context: `wikitech-l` is the historical list for technical matters. Since we were all volunteers most every thing happened on that public mailing list As the foundation... [14:13:16] 10Operations, 10Maps: Maps2004 ran into disk space issues again after reimaging with new partitioning scheme - https://phabricator.wikimedia.org/T224874 (10MSantos) maps2004 kept track osm2pgsql script for the last 5 days, all of them ended with failures during replication due to disk space. It seems that at s... [14:14:52] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1003: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/514008 (https://phabricator.wikimedia.org/T221769) [14:16:02] (03CR) 10Andrew Bogott: [C: 03+1] cloudservices1003: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/514008 (https://phabricator.wikimedia.org/T221769) (owner: 10Arturo Borrero Gonzalez) [14:17:07] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Remove reference to old user fields [puppet] - 10https://gerrit.wikimedia.org/r/510595 (https://phabricator.wikimedia.org/T223406) (owner: 10Anomie) [14:18:56] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] kask: Add affinity/tolerations headings [deployment-charts] - 10https://gerrit.wikimedia.org/r/513988 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [14:19:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump kask to 0.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513989 (owner: 10Alexandros Kosiaris) [14:20:45] 10Operations, 10Traffic: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884 (10ema) [14:20:52] !log upgrading 
acme-chief to version 0.17 in acme-chief production instances - T220518 [14:20:52] 10Operations, 10Traffic: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884 (10ema) p:05Triage→03Normal [14:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] T220518: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 [14:23:32] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Enable SNI prevalidation for non-canonical certificates [puppet] - 10https://gerrit.wikimedia.org/r/512871 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [14:23:42] (03PS3) 10Vgutierrez: acme_chief: Enable SNI prevalidation for non-canonical certificates [puppet] - 10https://gerrit.wikimedia.org/r/512871 (https://phabricator.wikimedia.org/T220518) [14:24:22] PROBLEM - Host etcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:01] (03PS2) 10Andrew Bogott: cloudservices1003: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/514008 (https://phabricator.wikimedia.org/T221769) (owner: 10Arturo Borrero Gonzalez) [14:25:09] ^ etcd1006 is the ganeti reboot [14:25:42] RECOVERY - Host etcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:26:21] (03PS3) 10Andrew Bogott: cloudservices1003: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/514008 (https://phabricator.wikimedia.org/T221769) (owner: 10Arturo Borrero Gonzalez) [14:27:29] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1003: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/514008 (https://phabricator.wikimedia.org/T221769) (owner: 10Arturo Borrero Gonzalez) [14:29:23] 10Operations, 10serviceops, 10Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (10MoritzMuehlenhoff) Actually, this is probably entirely unused, Fabián pointed me to T212934 [14:29:39] 10Operations, 10SRE-Access-Requests, 10Security-Team, 
10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) [14:30:54] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [14:33:40] 10Operations, 10Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10chasemp) p:05Triage→03Normal [14:33:57] !log T221769 reimaging cloudservices1003 to stretch [14:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:33] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [14:34:36] 10Operations, 10Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10chasemp) [14:34:44] T221769: Upgrade cloudservices1003/1004 to stretch/mitaka - https://phabricator.wikimedia.org/T221769 [14:35:06] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [14:36:42] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:37:14] (03PS9) 10Rush: admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) [14:39:38] (03CR) 10jerkins-bot: [V: 04-1] admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush) [14:41:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:44:05] 10Operations, 10Traffic, 10netops: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ema) [14:44:12] 10Operations, 10Traffic, 10netops: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ema) p:05Triage→03Normal [14:45:27] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) 2. kafka-main200[2345] are not yet able to net boot 13:25 < papaul> herron: the problem is i added just kafka-main2001 to the DHCP file and not the othe... [14:45:43] !log deploy kask in sessionstore kubernetes namespace in eqiad, codfw T220401 [14:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:35] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [14:50:45] (03CR) 10Gilles: "Shouldn't this be on the Varnish backends instead? It looks like it would rate-limit thumbnails that are misses on the frontend but potent" [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [14:53:07] PROBLEM - Host 208.80.154.143 is DOWN: PING CRITICAL - Packet loss = 100% [14:53:24] 10Operations, 10Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10chasemp) [14:53:55] arturo: ^^ expected? [14:54:12] vgutierrez: yes [14:54:17] but andrewbogott downtimed it [14:54:31] I… did? [14:54:40] I mean, I did, so I don't know why it's alerting. [14:54:44] I'll do it again I guess! 
[14:55:45] I can't explain it, I still see the window showing the confirmation that I downtimed it :/ [14:55:57] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10jcrespo) >>! In T222418#5229410, @Ankit-Maity wrote: > Just a question: is this intermittent behaviour expected or is the... [14:56:08] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) pinged @MoritzMuehlenhoff to get feedback, esp on the list of perms for secteam-admin and he graciously agreed to look... [14:58:44] 10Operations, 10Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10chasemp) [14:59:25] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) I would expect though that the DHCP requests would make it to the install servers, with or without entries in the dhcp config file. [15:00:09] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [15:00:53] 10Operations, 10Traffic, 10netops: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10ayounsi) I agree! That's all the "transports" LibreNMS alerting can use: https://docs.librenms.org/Alerting/Transports I'm not familiar with our paging system. If any of the ab... [15:01:07] (03PS2) 10Fsero: mcrouter: page 7 days before certs got expired [puppet] - 10https://gerrit.wikimedia.org/r/511397 (https://phabricator.wikimedia.org/T221346) [15:02:23] BTW what happened to cloudcontrol1003 andrewbogott ? 
[15:02:48] arturo: it runs a maintenance script that talks to the designate API which is currently down [15:02:54] ok [15:04:34] 10Operations, 10Traffic: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10ema) [15:04:43] 10Operations, 10Traffic: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10ema) p:05Triage→03Normal [15:06:07] (03PS1) 10Ema: cache_upload: return HTTP 403 to requests violating UA policy [puppet] - 10https://gerrit.wikimedia.org/r/514017 (https://phabricator.wikimedia.org/T224891) [15:07:52] (03PS3) 10Alexandros Kosiaris: Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401) [15:08:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [15:09:17] RECOVERY - Host 208.80.154.143 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:09:51] PROBLEM - tilerator on maps2004 is CRITICAL: connect to address 10.192.48.57 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [15:09:57] PROBLEM - Check systemd state on maps2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:10:03] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:10:05] arrg! [15:10:20] yay [15:10:20] downtime has expired [15:10:22] sorry [15:10:28] the maps one? 
[15:11:01] (03PS1) 10Jbond: varnish: add thumbor Ratelimit for backend misses [puppet] - 10https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) [15:11:15] ACKNOWLEDGEMENT - EDAC syslog messages on wtp2020 is CRITICAL: 5.001 ge 4 Ayounsi https://phabricator.wikimedia.org/T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops [15:11:15] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 5.001 ge 4 Ayounsi https://phabricator.wikimedia.org/T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops [15:11:27] (03PS14) 10Jbond: varnish: ratelimit thumbor frontend [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) [15:11:54] (03PS15) 10Jbond: varnish: ratelimit thumbor - cache_upload frontend [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) [15:11:58] chaomodus: yes! 
[15:12:12] (03PS1) 10Fsero: Decommision darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/514020 (https://phabricator.wikimedia.org/T224562) [15:12:14] (03PS2) 10Jbond: varnish: ratelimit thumbor - cache_upload backend [puppet] - 10https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) [15:12:41] (03PS5) 10Jbond: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) [15:12:58] (03PS3) 10Jbond: varnish: ratelimit thumbor - cache_upload backend [puppet] - 10https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) [15:13:19] (03PS1) 10Fsero: decommision darmstadtium [dns] - 10https://gerrit.wikimedia.org/r/514024 (https://phabricator.wikimedia.org/T224562) [15:14:41] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) kafak-main2002 after power drain `/admin1-> racadm serveraction powercycle Server power operation initiated successfully /admin1-> [15:15:08] (03CR) 10Jbond: "> Patch Set 13:" [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [15:17:11] (03PS6) 10Ema: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [15:19:22] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) If you need any examples, that's what I do in: https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/507... 
[15:20:28] (03PS1) 10Alexandros Kosiaris: Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) [15:20:56] (03CR) 10jerkins-bot: [V: 04-1] Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [15:21:24] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) >>! In T223493#5230429, @herron wrote: > I would expect though that the DHCP requests would make it to the install servers, with or without entries in the dhcp config fi... [15:22:07] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) Ok, thanks! [15:22:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [15:23:36] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T224795 (10elukey) This node is part of a OOW testing cluster, we can skip replacing the disk (that it is not used atm anyway). 
[15:23:41] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [15:24:13] RECOVERY - Disk space on maps2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:24:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) [15:25:08] (03PS2) 10Bstorm: labstore: remove unused hiera yaml [puppet] - 10https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456) [15:25:21] (03Abandoned) 10Andrew Bogott: glance image sync timer: don't monitor if it's disabled [puppet] - 10https://gerrit.wikimedia.org/r/513276 (owner: 10Andrew Bogott) [15:27:29] (03CR) 10Gilles: [C: 03+1] varnish: ratelimit thumbor - cache_upload backend [puppet] - 10https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [15:27:45] (03CR) 10Gilles: [C: 03+1] varnish: ratelimit thumbor - cache_upload frontend [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [15:28:04] (03CR) 10Bstorm: [C: 03+2] labstore: remove unused hiera yaml [puppet] - 10https://gerrit.wikimedia.org/r/513702 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [15:29:48] (03PS2) 10Alexandros Kosiaris: Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) [15:29:50] (03PS1) 10Alexandros Kosiaris: sessionstore: Enable LVS paging [puppet] - 10https://gerrit.wikimedia.org/r/514028 (https://phabricator.wikimedia.org/T220401) [15:29:52] (03PS1) 10Alexandros Kosiaris: docker-registry: Page on LVS level failures [puppet] - 10https://gerrit.wikimedia.org/r/514029 [15:32:37] (03CR) 10Fsero: [C: 03+1] docker-registry: Page on LVS level failures [puppet] - 
10https://gerrit.wikimedia.org/r/514029 (owner: 10Alexandros Kosiaris) [15:32:52] (03CR) 10Fsero: [C: 03+1] "this is a leftover from the migration" [puppet] - 10https://gerrit.wikimedia.org/r/514029 (owner: 10Alexandros Kosiaris) [15:32:55] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T224795 (10fdans) p:05Triage→03Normal [15:34:42] (03PS1) 10Papaul: DHCP: Add MAC address entries for kafka-main200[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/514030 (https://phabricator.wikimedia.org/T223493) [15:34:59] RECOVERY - tilerator on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [15:35:09] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational [15:36:43] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) a:05Papaul→03herron @herron You can merge the DHCP code and you should be good. 
[15:36:46] (03PS1) 10Bstorm: labstore: switch labstore1003 to the spare:system role [puppet] - 10https://gerrit.wikimedia.org/r/514035 (https://phabricator.wikimedia.org/T187456) [15:37:29] (03PS2) 10Herron: DHCP: Add MAC address entries for kafka-main200[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/514030 (https://phabricator.wikimedia.org/T223493) (owner: 10Papaul) [15:39:09] !log T223406 labsdb1012 updated views for actor table changes [15:39:14] 10Operations, 10Analytics, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10fdans) p:05Triage→03High [15:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:23] T223406: Remove reference to fields replaced by the actor table from WMCS views - https://phabricator.wikimedia.org/T223406 [15:39:25] (03CR) 10Herron: [C: 03+2] DHCP: Add MAC address entries for kafka-main200[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/514030 (https://phabricator.wikimedia.org/T223493) (owner: 10Papaul) [15:41:18] (03CR) 10Bstorm: [C: 03+2] labstore: switch labstore1003 to the spare:system role [puppet] - 10https://gerrit.wikimedia.org/r/514035 (https://phabricator.wikimedia.org/T187456) (owner: 10Bstorm) [15:41:29] (03PS2) 10Bstorm: labstore: switch labstore1003 to the spare:system role [puppet] - 10https://gerrit.wikimedia.org/r/514035 (https://phabricator.wikimedia.org/T187456) [15:41:31] (03PS1) 10WMDE-leszek: Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 [15:46:32] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Rfarrand) Agreed, no need to keep it from my perspective [15:49:01] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:52:44] (03PS1) 10Jbond: lvm_support: Increase timeout [puppet] - 
10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T224884) [15:53:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [15:54:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [15:54:31] (03CR) 10Thcipriani: [C: 04-1] "+1 to upping the web_session memoryLimit per comments on the mailing list, -1 to 4096. Per the docs:" [puppet] - 10https://gerrit.wikimedia.org/r/513682 (owner: 10Paladox) [15:54:36] 10Operations, 10Puppet: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938 (10jbond) I looked at this a bit more today and my initial analysts was wrong. the facts do actually resolve they just take longer when there are disk issues. Further i was unable to find... [15:55:47] (03PS1) 10Ema: Add 0021-dont-test-gunzip-partial.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514043 [15:55:49] (03PS5) 10Paladox: Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 [15:56:52] (03CR) 10Thcipriani: [C: 03+1] Gerrit: Quadruple web session cache memory to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/513682 (owner: 10Paladox) [15:57:14] (03PS2) 10WMDE-leszek: Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) [15:58:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [15:59:21] (03CR) 10Jforrester: [C: 03+1] "Hmm, yes, this should never happen. Will deploy later if someone doesn't get there first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [16:01:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) a:05Bstorm→03RobH I think these are ready to hand off now. [16:03:12] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10fdans) 05Open→03Resolved [16:03:37] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, and 2 others: Decommission analytics-tool1003 (old superset host) - https://phabricator.wikimedia.org/T224023 (10fdans) 05Open→03Resolved [16:04:16] (03CR) 10jerkins-bot: [V: 04-1] Add 0021-dont-test-gunzip-partial.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514043 (owner: 10Ema) [16:05:51] PROBLEM - Host checker.tools.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 checker.tools.wmflabs.org [16:06:17] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.81 ms [16:07:25] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [16:07:25] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [16:07:56] (03CR) 10WMDE-leszek: "Note from my side: it seems to me the issue is probably somewhere else, i.e. 
I take this has not been a case in the past, so maybe some th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [16:08:21] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [16:08:43] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [16:10:05] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [16:11:15] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org [16:11:33] PROBLEM - Host paws.wmflabs.org is DOWN: check_ping: Invalid hostname/address - paws.wmflabs.org [16:15:04] (03PS2) 10Bstorm: cloudstore: enable more monitors on cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/513681 [16:16:33] (03CR) 10Bstorm: [C: 03+2] cloudstore: enable more monitors on cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/513681 (owner: 10Bstorm) [16:17:14] (03PS3) 1020after4: phabricator: Install php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) [16:18:18] (03CR) 1020after4: [C: 03+1] "This needs to merge to make puppet match reality." 
[puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [16:21:31] (03PS1) 10Bstorm: wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514063 (https://phabricator.wikimedia.org/T223406) [16:26:02] (03PS2) 10Bstorm: wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514063 (https://phabricator.wikimedia.org/T223406) [16:26:05] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) Before upgrade `Junos: 14.1X53-D47.6 JUNOS EX Software Suite [14.1X53-D47.6] JUNOS FIPS mode utilities [14.1X53-D47.6] JUNOS Online Documentation [14.1X53-D47.6] JUNOS EX 4300 Software Suite [1... [16:26:32] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10Papaul) [16:27:44] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514063 (https://phabricator.wikimedia.org/T223406) (owner: 10Bstorm) [16:29:40] <[1997kB]> [21:55] (+icinga-wm) [21:41] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org Still down? 
[16:30:40] !log T223406 depooled labsdb1010 for view updates [16:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:45] T223406: Remove reference to fields replaced by the actor table from WMCS views - https://phabricator.wikimedia.org/T223406 [16:30:51] (03CR) 10Muehlenhoff: [C: 03+1] Decommision darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/514020 (https://phabricator.wikimedia.org/T224562) (owner: 10Fsero) [16:31:53] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [16:32:14] (03Abandoned) 10EBernhardson: Re-apply defaults removed in cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490496 (owner: 10EBernhardson) [16:32:54] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5230212, @akosiaris wrote: >>>! In T220401#5230202, @Eevans wrote: >>>>! In T220401#5228040, @... 
[16:33:19] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:33:39] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [16:35:15] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [16:38:39] PROBLEM - Host paws.wmflabs.org is DOWN: check_ping: Invalid hostname/address - paws.wmflabs.org [16:40:24] ACKNOWLEDGEMENT - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org andrew bogott this should recover in a bit [16:40:24] ACKNOWLEDGEMENT - Host paws.wmflabs.org is DOWN: check_ping: Invalid hostname/address - paws.wmflabs.org andrew bogott this should recover in a bit [16:40:24] ACKNOWLEDGEMENT - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org andrew bogott this should recover in a bit [16:40:57] !log started osm-import on maps2004 - T224395 [16:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:03] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [16:42:02] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) 05Resolved→03Open a:05hashar→03None I am not sure what I have done a month ago, but on a new instance it is no more upgrad... 
[16:42:22] 10Operations, 10Discovery-Search, 10Wikimedia-Logstash, 10Epic: [Epic] Migrate log transport to kafka for Search Platform applications - https://phabricator.wikimedia.org/T224911 (10Gehel) p:05Triage→03Normal [16:43:07] (03CR) 1020after4: "git protocol v2 docs: https://git-scm.com/docs/protocol-v2" [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [16:52:07] PROBLEM - Getent speed check on cloudstore1008 is CRITICAL: CRITICAL: getent group tools.admin failed https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [16:52:25] <_joe_> uh [16:52:42] (03PS2) 10Jbond: lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) [16:52:52] is that indicative of an ldap issue there? [16:54:52] Its a brand new check that bstorm_ just turned on. She will ack in a minute when not AFK [16:55:09] ah! ok [16:55:27] (03CR) 10Jbond: [C: 03+1] "This was approved in the monday SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [17:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T1700). [17:00:14] here here [17:01:55] PROBLEM - Getent speed check on cloudstore1009 is CRITICAL: CRITICAL: getent group tools.admin failed https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [17:03:30] Ok. Back. Going to shut that check up and find out what's up with it. It's an old check on a new server, so there are a lot of possibilities. 
[17:05:03] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@9e3035c]: Blazegraph version wmf.4 [17:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:27] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [17:07:11] ACKNOWLEDGEMENT - Getent speed check on cloudstore1008 is CRITICAL: CRITICAL: getent group tools.admin failed Bstorm investigating if this check is needed and perhaps should be made working T224914 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [17:07:31] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [17:11:34] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [17:15:35] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [17:16:33] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@9e3035c]: Blazegraph version wmf.4 (duration: 11m 29s) [17:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:49] ~. 
[17:18:56] (03CR) 10Herron: Bird anycast: add anycast_healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [17:21:32] (03PS1) 10Bstorm: cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) [17:22:13] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) (owner: 10Bstorm) [17:22:30] (03PS2) 10Bstorm: cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) [17:23:23] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) (owner: 10Bstorm) [17:24:31] (03PS3) 10Bstorm: cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) [17:32:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10Ottomata) Thanks Luca! Any reason we shouldn't add the field to webrequest Hive table too? [17:33:36] (03CR) 10Bstorm: [C: 03+2] cloudstore: fix ldap check [puppet] - 10https://gerrit.wikimedia.org/r/514071 (https://phabricator.wikimedia.org/T224914) (owner: 10Bstorm) [17:35:12] RECOVERY - Getent speed check on cloudstore1009 is OK: OK: getent group returns within a second https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [17:35:24] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) @Ottomata the quorum of the A-team in Mallorca said that we shouldn't add it since it seems more a debugging info rather th... 
[17:36:00] RECOVERY - Getent speed check on cloudstore1008 is OK: OK: getent group returns within a second https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [17:37:02] (03CR) 10Ayounsi: [C: 03+2] admins: Add jeh to ops group [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [17:37:14] (03PS4) 10Ayounsi: admins: Add jeh to ops group [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) (owner: 10Jhedden) [17:37:29] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10Ottomata) K! let's leave it for now I don't mind either way. [17:37:40] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) [17:38:59] (03CR) 10Jbond: "# Facts before change" [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:41:43] (03CR) 10Jbond: "> # Facter after" [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:45:28] (03PS3) 10Jbond: lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) [17:46:04] (03CR) 10jerkins-bot: [V: 04-1] lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:54:10] (03CR) 10Volans: [C: 03+1] "Output LGTM, feel free to move the sample output from the commit message to a CR comment, no need to keep that in the history." 
[puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:54:37] (03PS4) 10Jbond: lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) [17:54:43] (03CR) 10Jbond: "# Facts before change" [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:55:21] (03CR) 10Jbond: [C: 03+2] lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) (owner: 10Jbond) [17:55:29] (03PS5) 10Jbond: lvm_support: Increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/514041 (https://phabricator.wikimedia.org/T223938) [18:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:03:40] (03PS1) 10Ayounsi: Routinator, change command line args for 0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/514088 (https://phabricator.wikimedia.org/T220669) [18:04:16] (03PS1) 10Mathew.onipe: maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) [18:07:03] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16842/" [puppet] - 10https://gerrit.wikimedia.org/r/514088 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [18:08:22] (03PS1) 10Bstorm: cloudstore: add the rest of the monitoring for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/514091 [18:08:55] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/16843/" [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [18:10:37] (03PS1) 10Elukey: Add profile::kerberos::client to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/514092 (https://phabricator.wikimedia.org/T212257) [18:11:16] (03CR) 10MSantos: [C: 03+1] "It's worth to note that the cron runs once a day. Not sure how long do you need it disabled, maybe you don't need to disable at all." 
[puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [18:12:28] (03CR) 10Elukey: [C: 03+2] Add profile::kerberos::client to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/514092 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [18:15:09] PROBLEM - Device not healthy -SMART- on restbase-dev1006 is CRITICAL: cluster=restbase_dev device=sdd instance=restbase-dev1006:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops [18:17:01] !log add routinator 0.4.0 to APT repo - T220669 [18:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:17] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [18:21:31] (03PS2) 10Ayounsi: Routinator, change command line args for 0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/514088 (https://phabricator.wikimedia.org/T220669) [18:26:33] (03PS1) 10Rush: modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) [18:26:40] 10Operations, 10Security-Team, 10Patch-For-Review: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10chasemp) [18:28:10] 10Operations, 10SRE-Access-Requests: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10ayounsi) 05Open→03Resolved a:03ayounsi You should be good, please reopen if any issues. [18:30:07] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10Ankit-Maity) 05Open→03Resolved That explanation certainly helps >>! In T222418#5230418, @jcrespo wrote: >>>! In T2224... 
[18:30:14] 10Operations, 10Traffic, 10netops: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) There is a "Nagios Compatible" transport, but it is underdocumented and seems to also only write to a local filesystem path (which is presumed to be a Nagios external com... [18:33:36] (03CR) 10Jhedden: [C: 03+2] cloudstore: add the rest of the monitoring for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/514091 (owner: 10Bstorm) [18:34:23] (03PS2) 10Bstorm: cloudstore: add the rest of the monitoring for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/514091 [18:35:03] !log drop all ICMP frag on cr1/2-eqiad - T224186 [18:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:36] !log switch most Quibble jobs to node 10 T222406 - https://gerrit.wikimedia.org/r/#/c/integration/config/+/514034/ T222406 [18:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:42] T222406: Switch quibble-based CI jobs from node6 to node10 - https://phabricator.wikimedia.org/T222406 [18:36:21] (03PS2) 10Rush: modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) [18:43:54] (03PS3) 10Rush: modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) [18:47:59] !log Add RPKI validators to all routers - T220669 [18:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:05] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [18:48:50] (03PS4) 10Rush: modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) [18:51:36] (03CR) 10Gehel: [C: 04-1] "see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514090 
(https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [18:51:46] onimisionipe: ^ [18:54:19] (03PS5) 10Rush: modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) [18:54:32] (03CR) 10Thcipriani: [C: 03+1] modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) (owner: 10Rush) [18:55:33] (03CR) 10Rush: [C: 03+2] modsec: deployment modsec rules with scap [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) (owner: 10Rush) [19:01:30] PROBLEM - Keyholder SSH agent on deploy2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [19:01:54] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[apache2modsec/apache2modsec] [19:03:16] (03PS1) 10Jhedden: passwords: add cloud-wide root key for jhedden [labs/private] - 10https://gerrit.wikimedia.org/r/514095 [19:04:16] (03CR) 10Paladox: "Im getting this error in WMCS:" [puppet] - 10https://gerrit.wikimedia.org/r/514094 (https://phabricator.wikimedia.org/T224887) (owner: 10Rush) [19:06:08] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2modsec/apache2modsec] [19:07:10] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:54] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[apache2modsec/apache2modsec] [19:18:52] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Krenair) 'install access for WMCS' struck me as odd so I asked around a bit: ` Iron has been used for cloudvirt installs in the past Normally we access new unpup... [19:19:19] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:25:13] (03PS2) 10Mathew.onipe: maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) [19:25:23] gehel: ^ [19:25:50] (03CR) 10Mathew.onipe: "> Patch Set 1: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [19:26:29] (03PS2) 10Jhedden: passwords: add cloud-wide root key for jhedden [labs/private] - 10https://gerrit.wikimedia.org/r/514095 (https://phabricator.wikimedia.org/T224192) [19:27:33] (03PS3) 10Gehel: maps: disable replication and admin cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [19:27:34] onimisionipe: ^ [19:27:52] (03PS4) 10Mathew.onipe: maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) [19:28:41] (03PS5) 10Gehel: maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [19:28:49] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [19:29:19] (03CR) 10Gehel: [C: 03+2] maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514090 (https://phabricator.wikimedia.org/T224874) (owner: 10Mathew.onipe) [19:29:55] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [19:34:19] (03PS1) 10Jhedden: onboarding: add jhedden to prod icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/514102 (https://phabricator.wikimedia.org/T224192) [19:34:44] (03PS2) 10Jhedden: onboarding: add jhedden to prod icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/514102 (https://phabricator.wikimedia.org/T224192) [19:55:13] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "Looks good, merging." [labs/private] - 10https://gerrit.wikimedia.org/r/514095 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [19:57:18] !log stop sampling from cr2-eqiad [19:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, and halfak: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T2000). [20:00:13] (03PS1) 10Herron: install_server: tweak raid10 8disk partman layout [puppet] - 10https://gerrit.wikimedia.org/r/514107 (https://phabricator.wikimedia.org/T223493) [20:03:00] (03CR) 10Herron: [C: 03+2] install_server: tweak raid10 8disk partman layout [puppet] - 10https://gerrit.wikimedia.org/r/514107 (https://phabricator.wikimedia.org/T223493) (owner: 10Herron) [20:12:35] (03PS1) 10Ayounsi: Fix eqsin network::infra v6 prefix [puppet] - 10https://gerrit.wikimedia.org/r/514109 [20:17:55] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:18:15] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:30:35] Sorry if my exports are causing issues. I think i've only timed out once. Another 8 to go. [20:37:29] (03PS4) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) [20:38:09] (03CR) 10jerkins-bot: [V: 04-1] Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [20:40:53] (03CR) 10CRusnov: "> Patch Set 3:" (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [20:41:11] (03CR) 10Herron: [C: 03+1] "From a cursory look network::infrastructure appears fairly minimally used. 
It becomes $NETWORK_INFRA within ferm via puppet template and " [puppet] - 10https://gerrit.wikimedia.org/r/514109 (owner: 10Ayounsi) [20:41:49] (03PS5) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) [20:42:31] (03CR) 10jerkins-bot: [V: 04-1] Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [20:43:18] (03PS6) 10CRusnov: Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) [20:47:33] (03CR) 10Herron: [C: 03+1] "> long as nothing in the upper half of the range" [puppet] - 10https://gerrit.wikimedia.org/r/514109 (owner: 10Ayounsi) [21:00:05] bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T2100). [21:15:21] RF1dle: what exports are you doing? [21:16:15] legoktm, all done now but a bunch of templates via Special:Export on enwiki. Only timed out on first one. Ran the rest on batches of 50. [21:16:36] ah, okay [21:16:50] Kept timing out the other day. Guessing it would come up on your logs [21:17:30] (03CR) 10Ayounsi: [C: 03+2] "Indeed, thanks. I thought it was used by more servers." [puppet] - 10https://gerrit.wikimedia.org/r/514109 (owner: 10Ayounsi) [21:20:31] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [21:20:51] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [21:29:12] (03CR) 10Herron: [C: 03+1] "Sure, why not!" 
[puppet] - 10https://gerrit.wikimedia.org/r/513310 (owner: 10Volans) [21:29:33] !log drop all ICMP frag on all routers - T224186 [21:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:24] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514170 [21:41:05] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [21:45:52] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Upgrade jenkins-debian-glue to v0.20.0 - https://phabricator.wikimedia.org/T212774 (10hashar) >>! In T212774#5049731, @hashar wrote: > Seems good so far. Thank you very much. I wil... [21:49:11] (03PS2) 10Bstorm: Revert "wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514170 [21:51:20] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514170 (owner: 10Bstorm) [21:54:09] (03PS16) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [21:55:30] (03CR) 10CRusnov: Add LibreNMS parity check report (0311 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:06:45] (03PS1) 10Bstorm: wikireplicas: depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514185 (https://phabricator.wikimedia.org/T223406) [22:08:59] !log T223406 repooled labsdb1010 after completing view updates [22:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:05] T223406: Remove reference to fields replaced by the actor table from WMCS views - 
https://phabricator.wikimedia.org/T223406 [22:12:04] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514185 (https://phabricator.wikimedia.org/T223406) (owner: 10Bstorm) [22:20:15] !log T223406 depooled labsdb1011 [22:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:21] T223406: Remove reference to fields replaced by the actor table from WMCS views - https://phabricator.wikimedia.org/T223406 [22:27:13] (03PS2) 10Volans: icinga: clarify Puppet alert message [puppet] - 10https://gerrit.wikimedia.org/r/513310 [22:28:22] (03CR) 10Volans: [C: 03+2] icinga: clarify Puppet alert message [puppet] - 10https://gerrit.wikimedia.org/r/513310 (owner: 10Volans) [23:00:04] MaxSem, RoanKattouw, and Niharika: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190603T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:07:31] (PS1) Bartosz Dziewoński: Remove unused preference 'T47877-buster' [mediawiki-config] - https://gerrit.wikimedia.org/r/514193
[23:08:21] (PS1) Jhedden: onboarding: add jhedden contact info and groups [puppet] - https://gerrit.wikimedia.org/r/514194 (https://phabricator.wikimedia.org/T224627)
[23:10:58] (Abandoned) Jhedden: onboarding: add jhedden contact info and groups [puppet] - https://gerrit.wikimedia.org/r/514194 (https://phabricator.wikimedia.org/T224627) (owner: Jhedden)
[23:13:47] (PS1) Jhedden: onboarding: add jhedden contact info and groups [puppet] - https://gerrit.wikimedia.org/r/514195 (https://phabricator.wikimedia.org/T224192)
[23:14:27] Operations, Operations-Software-Development, netbox, netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (crusnov) I agree that from the perspective of more closely modelling the devices between the various tools that the domain name...
[23:48:46] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1011 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514197
[23:54:23] (PS2) Bstorm: Revert "wikireplicas: depool labsdb1011 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514197
[23:55:13] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1011 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514197 (owner: Bstorm)