[00:02:47] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:25:13] <icinga-wm>	 PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 3 minutes ago with 11 failures. Failed resources (up to 3 shown): Service[cpjobqueue],Service[recommendation_api],Service[mobileapps],Service[mathoid]
[00:56:07] <icinga-wm>	 RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:34:27] <icinga-wm>	 PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:49:55] <icinga-wm>	 RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:58:11] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:29:03] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:31:37] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 917.99 seconds
[03:59:13] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 261.15 seconds
[04:10:22] <wikibugs>	 (PS7) Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963)
[06:13:49] <marostegui>	 !log Deploy schema change on db1067 (s1) master - T86339
[06:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:53] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[06:15:40] <marostegui>	 !log Deploy schema change on s3 codfw master (db2043) with replication - T86339
[06:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:11] <wikibugs>	 Operations, Continuous-Integration-Infrastructure (shipyard), Kubernetes: set up a test node with new version, Redis as cache, a new Swift container and export metrics over Fraphana - https://phabricator.wikimedia.org/T210076 (Joe) As far as redis goes - do we need to replicate cross-dc? if not, we c...
[06:20:20] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339)
[06:26:48] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[06:27:54] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[06:29:17] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 - T86339 (duration: 00m 48s)
[06:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:20] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[06:31:32] <marostegui>	 !log Deploy schema change dbstore1002:s3 - T86339
[06:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:03] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/475435 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[06:32:40] <marostegui>	 !log Deploy schema change db1123 - T86339
[06:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:49] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/475436
[06:50:40] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui)
[06:51:01] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Marostegui) @Cmjohnson reminder: this is RAID5 instead of 10 as noted on top of the task.
[06:57:29] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/475436 (owner: Marostegui)
[06:58:32] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/475436 (owner: Marostegui)
[06:58:46] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/475436 (owner: Marostegui)
[06:59:46] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 - T86339 (duration: 00m 49s)
[06:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:50] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[07:00:31] <marostegui>	 !log Deploy schema change db1095 - T86339
[07:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:00] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339)
[07:12:16] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[07:13:19] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[07:14:20] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 - T86339 (duration: 00m 45s)
[07:14:21] <marostegui>	 !log Deploy schema change db1078 - T86339
[07:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:23] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[07:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:06] <wikibugs>	 Operations, DBA, monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427 (Marostegui) Just to be on the safe side I have created Calendar events (personals and on Ops maintenance) as a reminder:  1 year before expiration 6 months before expiration 3...
[07:25:15] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/475438 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[07:29:13] <wikibugs>	 (CR) Urbanecm: [C: 1] "Only safe if the namespace is empty." [mediawiki-config] - https://gerrit.wikimedia.org/r/475367 (https://phabricator.wikimedia.org/T210171) (owner: Zoranzoki21)
[07:38:47] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/475439
[07:47:16] <wikibugs>	 (CR) Alexandros Kosiaris: "Again +1 from me" [puppet] - https://gerrit.wikimedia.org/r/396055 (https://phabricator.wikimedia.org/T182249) (owner: Awight)
[07:48:42] <moritzm>	 !log installing libtirpc security updates
[07:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:22] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] Introduce zoterov2 LVS endpoint [puppet] - https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611) (owner: Alexandros Kosiaris)
[07:59:29] <wikibugs>	 (PS4) Alexandros Kosiaris: Introduce zoterov2 LVS endpoint [puppet] - https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611)
[07:59:32] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Introduce zoterov2 LVS endpoint [puppet] - https://gerrit.wikimedia.org/r/473733 (https://phabricator.wikimedia.org/T201611) (owner: Alexandros Kosiaris)
[08:04:04] <wikibugs>	 (PS1) Gehel: wdqs: disable test queries on wdqs1009 [puppet] - https://gerrit.wikimedia.org/r/475440 (https://phabricator.wikimedia.org/T207665)
[08:05:42] <wikibugs>	 (CR) Mathew.onipe: [C: 1] wdqs: disable test queries on wdqs1009 [puppet] - https://gerrit.wikimedia.org/r/475440 (https://phabricator.wikimedia.org/T207665) (owner: Gehel)
[08:05:48] <wikibugs>	 Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (MoritzMuehlenhoff) The Icinga servers in production are now running 0.9-1~bpo9+1, but the Cron job still needs to be re-instated.
[08:05:53] <wikibugs>	 (CR) Gehel: [C: 2] wdqs: disable test queries on wdqs1009 [puppet] - https://gerrit.wikimedia.org/r/475440 (https://phabricator.wikimedia.org/T207665) (owner: Gehel)
[08:06:58] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 42 connections established with conf1004.eqiad.wmnet:4001 (min=43)
[08:08:44] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/475439 (owner: Marostegui)
[08:09:16] <wikibugs>	 (PS1) Muehlenhoff: Record deskana as LDAP user [puppet] - https://gerrit.wikimedia.org/r/475441
[08:09:36] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/475439 (owner: Marostegui)
[08:09:58] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 31 connections established with conf2001.codfw.wmnet:2379 (min=32)
[08:10:31] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 - T86339 (duration: 00m 46s)
[08:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:47] <stashbot>	 T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[08:11:01] <wikibugs>	 (PS2) Muehlenhoff: Record deskana as LDAP user [puppet] - https://gerrit.wikimedia.org/r/475441
[08:12:06] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 43 connections established with conf1004.eqiad.wmnet:4001 (min=43)
[08:12:09] <wikibugs>	 (PS25) Mathew.onipe: elasticsearch_cluster: Added multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[08:12:18] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Record deskana as LDAP user [puppet] - https://gerrit.wikimedia.org/r/475441 (owner: Muehlenhoff)
[08:12:32] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 42 connections established with conf1004.eqiad.wmnet:4001 (min=43)
[08:12:33] <wikibugs>	 (CR) Mathew.onipe: elasticsearch_cluster: Added multi-cluster/multi-instance support (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:18:33] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/475439 (owner: Marostegui)
[08:29:56] <akosiaris>	 pybal complaining was me
[08:32:44] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 43 connections established with conf1004.eqiad.wmnet:4001 (min=43)
[08:35:10] <wikibugs>	 (CR) DCausse: [C: 1] elasticsearch_cluster: Added multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:35:10] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 31 connections established with conf2001.codfw.wmnet:2379 (min=32)
[08:35:29] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Post hold because of "invalid headers" in wikimediacz-l - https://phabricator.wikimedia.org/T210223 (Urbanecm)
[08:37:24] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/zoterov2 on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/zoterov2
[08:37:57] <akosiaris>	 fixing ^
[08:38:14] <wikibugs>	 (PS1) Alexandros Kosiaris: conftool: Add forgotten zotero codfw data [puppet] - https://gerrit.wikimedia.org/r/475442
[08:39:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] conftool: Add forgotten zotero codfw data [puppet] - https://gerrit.wikimedia.org/r/475442 (owner: Alexandros Kosiaris)
[08:40:20] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 32 connections established with conf2001.codfw.wmnet:2379 (min=32)
[08:40:50] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/zoterov2 on puppetmaster2001 is OK: No errors detected
[08:41:01] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=.*,service=zotero,cluster=kubernetes,name=.*
[08:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:51] <wikibugs>	 (PS1) Muehlenhoff: Remove Diamond from several analytics roles [puppet] - https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454)
[08:44:18] <ema>	 akosiaris: so the alerts are because of zoterov2 being added, right? Did you restart pybal to fix them?
[08:44:28] <akosiaris>	 yup
[08:44:31] <ema>	 k
[08:44:33] <akosiaris>	 all fixed
[08:45:27] <ema>	 all but lvs2006?
[08:45:34] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 32 connections established with conf2001.codfw.wmnet:2379 (min=32)
[08:45:37] <akosiaris>	 :-)
[08:45:37] <ema>	 hehe
[08:45:53] <akosiaris>	 just icinga being slow 
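The PyBal alerts in this exchange follow a minimum-count rule: the check reports CRITICAL while the number of established etcd connections is below `min`, and recovers once the count reaches it. A minimal Python sketch of that rule, assuming a hypothetical `check_connections` helper (this is not the actual Icinga plugin, whose logic is not shown in the log):

```python
def check_connections(established: int, minimum: int, target: str) -> str:
    """Return an Icinga-style status line for a minimum-connections rule.

    Hypothetical helper for illustration only; the real check may differ.
    """
    # The check passes only when the count has reached the configured minimum.
    state = "OK" if established >= minimum else "CRITICAL"
    return (
        f"{state}: {established} connections established "
        f"with {target} (min={minimum})"
    )


# Values taken from the lvs1016 alert and recovery above.
print(check_connections(42, 43, "conf1004.eqiad.wmnet:4001"))
print(check_connections(43, 43, "conf1004.eqiad.wmnet:4001"))
```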
[08:49:27] <wikibugs>	 (CR) Elukey: [C: 1] Remove Diamond from several analytics roles [puppet] - https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:59:54] <wikibugs>	 (PS2) Muehlenhoff: Remove Diamond from several analytics roles [puppet] - https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454)
[09:00:39] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove Diamond from several analytics roles [puppet] - https://gerrit.wikimedia.org/r/475444 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:05:01] <icinga-wm>	 PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:06:09] <icinga-wm>	 RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational
[09:07:21] <wikibugs>	 Operations, Citoid, Patch-For-Review, Services (watching), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (akosiaris)
[09:07:25] <wikibugs>	 Operations, Citoid, Patch-For-Review, Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (akosiaris) Open>Resolved Finally deployed to production  Things to note:  * zotero is configured to use the same outgoing proxy (url-do...
[09:15:48] <wikibugs>	 (PS2) Volans: netbox: allow read-only access to wmf ldap group [puppet] - https://gerrit.wikimedia.org/r/475295 (https://phabricator.wikimedia.org/T208267)
[09:17:31] <wikibugs>	 (CR) Volans: [C: 2] netbox: allow read-only access to wmf ldap group [puppet] - https://gerrit.wikimedia.org/r/475295 (https://phabricator.wikimedia.org/T208267) (owner: Volans)
[09:26:40] <wikibugs>	 (PS2) Filippo Giunchedi: WIP rsyslog: udp input json_lines shim [puppet] - https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851)
[09:29:50] <wikibugs>	 (PS1) Volans: netbox: notify the uwsgi app on config change [puppet] - https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267)
[09:37:54] <wikibugs>	 Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (Volans) Added read-only access to `cn=wmf` and confirmed it works as expected allowing people to login but in read-only mode. Edit/delete/add...
[10:00:19] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267) (owner: Volans)
[10:13:08] <wikibugs>	 Operations, Wikimedia-Logstash, Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (fgiunchedi) A first iteration on this might look like https://gerrit.wikimedia.org/r/475352 in which a new udp listener is added on localho...
[10:16:27] <wikibugs>	 (CR) Volans: [C: 2] netbox: notify the uwsgi app on config change [puppet] - https://gerrit.wikimedia.org/r/475449 (https://phabricator.wikimedia.org/T208267) (owner: Volans)
[10:33:00] <wikibugs>	 Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (Volans) Open>Resolved
[10:50:02] <wikibugs>	 Operations, DBA, StructuredDiscussions, Growth-Team (Current Sprint), and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Banyek) As I plan to get involved this I read back the ticket, and put the following tldr together.  Did I...
[11:01:54] <wikibugs>	 (PS1) Alex Monk: Revert "certspotter: temporarily disable cron job" [puppet] - https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993)
[11:10:09] <wikibugs>	 Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban), cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (hashar)
[11:12:42] <wikibugs>	 (PS1) Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - https://gerrit.wikimedia.org/r/475454
[11:19:27] <wikibugs>	 (PS2) Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - https://gerrit.wikimedia.org/r/475454
[11:19:29] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13689/" [puppet] - https://gerrit.wikimedia.org/r/475454 (owner: Elukey)
[11:22:41] <wikibugs>	 Operations, cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (aborrero) hey @Bstorm any suggestions to handle labstore1006 and labstore1007 reboots? Those are for Dumps, right? cc @ArielGlenn
[11:28:21] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[11:29:31] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[11:34:01] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:34:05] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:34:09] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:34:51] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:34:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:35:01] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:35:02] * gehel is looking at wdqs
[11:35:57] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:35:59] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:07] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:19] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:36:48] <gehel>	 updater failing on all wdqs nodes
[11:37:13] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational
[11:37:29] <gehel>	 !log restarting updater on all wdqs nodes
[11:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:35] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational
[11:38:15] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational
[11:38:16] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational
[11:38:19] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational
[11:38:25] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational
[11:38:27] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational
[11:38:33] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational
[11:38:37] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational
[11:38:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational
[11:38:41] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[11:39:46] <gehel>	 lot of "Weird reference" in the logs, updater seems to crash and restart, but not much in logs
[11:41:37] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:42:59] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:43:44] <wikibugs>	 Operations, Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (Gehel)
[11:44:05] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:07] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:17] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:21] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:27] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:34] <volans>	 gehel: can I help?
[11:44:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:44:42] <gehel>	 restarting updater does not seem to be sufficient, it is probably getting data from wikidata that it does not understand
[11:44:59] <gehel>	 anyone knows of something that would have changed on the wikidata side?
[11:45:15] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:45:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:45:36] <gehel>	 volans: not an emergency yet (updater is an async process, so we'll have some lag, not good but not critical)
[11:46:10] <gehel>	 volans: if you could silence that icinga systemd check for all wdqs nodes while I dig in the logs, that would be great!
[11:46:21] <volans>	 gehel: sure, on it
[11:46:26] <gehel>	 thanks!
[11:46:37] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational
[11:47:50] <gehel>	 looks like some bot uploading some problematic data about fish
[11:47:52] <volans>	 gehel: how long the downtime?
[11:47:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational
[11:48:05] <wikibugs>	 Operations, cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (ArielGlenn) They are indeed for dumps; one is set up to be the web server, and a reboot of that means that the broader public will notice. The other provides NFS service of dumps to stat100...
[11:48:05] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational
[11:48:26] <gehel>	 volans: 4 hours at the moment, I'll either remove or extend
[11:48:30] <volans>	 ack
[11:48:33] <volans>	 I thought the same
[11:48:37] <gehel>	 wat? now it is recovering?
[11:49:09] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:49:21] <gehel>	 na... it is going to die again
[11:49:48] <volans>	 {done} on wdqs[12]*
[11:49:54] <volans>	 only for the systemd check
[11:50:01] <gehel>	 yep, that's good
[11:50:12] <gehel>	 we might get an update lag alert soon
[11:50:25] <volans>	 they are warning right now
[11:50:53] <gehel>	 volans: can you silence them as well? no need to add noise, we know that something is wrong already
[11:51:21] <volans>	 also the updater process?
[11:51:27] <gehel>	 yep
[11:52:13] <volans>	 ack
[11:52:31] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational
[11:53:21] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational
[11:53:41] <volans>	 eh, failed alert before the downtime will spam with the recovery anyway
[11:54:18] <gehel>	 ack
[11:54:33] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational
[11:54:36] <gehel>	 logs are useless, how can the updater crash without a message?
[11:54:53] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[11:55:17] <gehel>	 Oh, out of memory!
[11:55:43] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational
[11:56:36] <volans>	 gehel: any reason why high lag check has disabled notifications on wdqs2003?
[11:57:01] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational
[11:57:02] <gehel>	 volans: who disabled it and is there a comment?
[11:57:08] <gehel>	 (probably me)
[11:57:17] <gehel>	 no good reason that I know of
[11:57:31] <volans>	 who I don't know, there is an old message referring to T207817 but is on multiple host
[11:57:31] <stashbot>	 T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817
[11:57:34] <volans>	 *hosts
[11:57:42] <volans>	 this is the only one with the check disabled
[11:57:43] <wikibugs>	 Operations, Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (Gehel) Out of memory error on a bind:  ` Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space Nov 23...
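The crash was only identified once the `java.lang.OutOfMemoryError` line was spotted in the journal. A minimal Python sketch of scanning log lines for that error, assuming the syslog-style layout of the single complete line quoted in the task comment; `OOM_RE` and `find_oom` are hypothetical names, not part of any existing tooling:

```python
import re

# Field layout (timestamp, host, unit[pid]) is assumed from the quoted line.
OOM_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) (?P<unit>[\w-]+)\[(?P<pid>\d+)\]: "
    r".*java\.lang\.OutOfMemoryError: (?P<detail>.+)$"
)


def find_oom(lines):
    """Yield (host, unit, detail) for each line reporting a Java OutOfMemoryError."""
    for line in lines:
        match = OOM_RE.match(line)
        if match:
            yield match.group("host"), match.group("unit"), match.group("detail")


sample = [
    'Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]: Exception in thread "main" '
    "java.lang.OutOfMemoryError: Java heap space",
    "Nov 23 11:49:58 wdqs1005 wdqs-updater[13325]: some unrelated message",
]
print(list(find_oom(sample)))
```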
[11:57:49] <volans>	 I'll re-enable if it's ok for you
[11:57:53] <gehel>	 yep
[12:00:20] <gehel>	 and of course, heap size isn't configurable via puppet atm
[12:00:31] <volans>	 if I cannot help with much more, I'd go for lunch
[12:00:31] <volans>	 ofc
[12:00:50] <gehel>	 volans: yep, enjoy the lunch, I'm on it, we're all good!
[12:00:55] <volans>	 ack, thanks!
[12:00:57] <gehel>	 and thanks for the help!
[12:01:05] <gehel>	 bon appétit 
[12:01:22] <volans>	 merci
[12:04:32] <gehel>	 !log manually increasing wdqs-updater heap to 4G - T210235
[12:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:36] <stashbot>	 T210235: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235
[12:04:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational
[12:04:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational
[12:04:51] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational
[12:05:07] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational
[12:05:41] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational
[12:07:33] <wikibugs>	 Operations, Discovery-Wikidata-Query-Service-Sprint: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (Gehel) increasing heap stops the updater from crashing, but blazegraph refuses updates > 200M (we probably don't want to increase this limit)
[12:11:12] <wikibugs>	 (PS1) Gehel: wdqs: reduce batch size to 300 temporarily [puppet] - https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235)
[12:13:32] <wikibugs>	 (PS2) Gehel: wdqs: reduce batch size to 300 temporarily [puppet] - https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235)
[12:14:15] <wikibugs>	 (CR) Gehel: [C: 2] wdqs: reduce batch size to 300 temporarily [puppet] - https://gerrit.wikimedia.org/r/475455 (https://phabricator.wikimedia.org/T210235) (owner: Gehel)
[12:14:26] <wikibugs>	 (PS1) Muehlenhoff: Remove Diamond from Ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/475456
[12:15:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove Diamond from Ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/475456 (owner: Muehlenhoff)
[12:16:13] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::appserver: add php monitoring [puppet] - https://gerrit.wikimedia.org/r/475457 (https://phabricator.wikimedia.org/T209573)
[12:17:58] <wikibugs>	 (PS2) Muehlenhoff: Remove Diamond from Ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/475456 (https://phabricator.wikimedia.org/T183454)
[12:20:06] <wikibugs>	 Operations, Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235 (Gehel) Reducing batch size seems to work, updates are processed again.
[12:21:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13690/mw1261.eqiad.wmnet/ the patch is correct in puppet terms, I'd like someone to take " [puppet] - 10https://gerrit.wikimedia.org/r/475457 (https://phabricator.wikimedia.org/T209573) (owner: 10Giuseppe Lavagetto)
[12:21:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "NOOP per https://puppet-compiler.wmflabs.org/compiler1002/13688/, let's merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/475258 (owner: 10Dzahn)
[12:39:05] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Discovery, 10Product-Analytics, and 3 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682 (10Liuxinyu970226)
[12:39:08] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "This is wrong, you are removing hosts from being spare." [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[12:40:48] <wikibugs>	 (03CR) 10Banyek: "> This is wrong, you are removing hosts from being spare." [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[12:43:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "You are removing dbproxy1012,13,14,17 from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[12:44:48] <wikibugs>	 (03CR) 10Banyek: "> You are removing dbproxy1012,13,14,17 from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek)
[12:51:48] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team (Current): Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (10Ladsgroup)
[12:54:13] <wikibugs>	 (03PS3) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367)
[13:01:09] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup)
[13:02:18] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10Ladsgroup) p:05Triage>03Low
[13:02:43] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: Build helm charts for ORES - https://phabricator.wikimedia.org/T210269 (10Ladsgroup)
[13:18:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Add mapped IPv6 to labmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/475461
[13:18:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462
[13:22:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 031] Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 (owner: 10Muehlenhoff)
[13:32:36] <moritzm>	 !log installing confuse security updates
[13:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:55] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe)
[13:35:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465
[13:36:47] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465
[13:38:18] <wikibugs>	 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway): CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) Part of the project to migrate all CI jobs to Docker containers (#ci-slipway )
[13:38:23] <wikibugs>	 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe)
[13:41:04] <Krenair>	 I wonder what we use `confuse` for
[13:42:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add library hint for confuse [puppet] - 10https://gerrit.wikimedia.org/r/475465 (owner: 10Muehlenhoff)
[13:47:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Add mapped IPv6 to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/475462 (owner: 10Muehlenhoff)
[13:48:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/475461 (owner: 10Muehlenhoff)
[13:53:31] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 406, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:59:31] <wikibugs>	 10Operations: Upgrade Ganeti clusters to 2.15.2-7+deb9u3 - https://phabricator.wikimedia.org/T210289 (10MoritzMuehlenhoff)
[14:00:23] <wikibugs>	 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: make wdqs-updater heap size configurable from puppet - https://phabricator.wikimedia.org/T210290 (10Gehel)
[14:02:23] <gehel>	 !log restore wdqs-updater heap to 2G - T210235
[14:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:27] <stashbot>	 T210235: wdqs-updater crashing on all wdqs servers - https://phabricator.wikimedia.org/T210235
[14:14:08] <wikibugs>	 (03CR) 10Elukey: [C: 032] "Andrew: going to merge this to start testing some settings in labs, I can amend/revert/etc.. if you don't like it :)" [puppet] - 10https://gerrit.wikimedia.org/r/475454 (owner: 10Elukey)
[14:14:14] <wikibugs>	 (03PS3) 10Elukey: profile::hadoop::spark2: get security settings via hiera [puppet] - 10https://gerrit.wikimedia.org/r/475454
[14:27:19] <wikibugs>	 (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050)
[14:27:21] <wikibugs>	 (03PS1) 10Vgutierrez: tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050)
[14:28:21] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:28:33] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:29:38] <Krenair>	 vgutierrez, they've all been apache2 rather than nginx so far right?
[14:29:46] <vgutierrez>	 indeed
[14:32:11] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:32:19] <wikibugs>	 (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for tendril [puppet] - 10https://gerrit.wikimedia.org/r/475468 (https://phabricator.wikimedia.org/T207050)
[14:33:56] <Krenair>	 vgutierrez, btw I tried to clean up the phab task structure and close some tickets we're pretty much done with
[14:34:01] <Krenair>	 hopefully I didn't muck up anything too badly
[14:34:18] <vgutierrez>	 I don't think so.. that's for taking care of that :)
[14:34:33] <vgutierrez>	 s/that's/thanks/
[14:34:34] <vgutierrez>	 damn OS X...
[14:35:16] <Krenair>	 at first I was cautious and then I was like eh... it should be uncontroversial and easily-revertible and pretty much bikeshedding, so
[14:35:24] <Krenair>	 yay closed tickets
[14:37:07] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013)
[14:37:09] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[14:38:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[14:48:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 031] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:49:36] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[14:49:44] <wikibugs>	 (03PS2) 10Vgutierrez: tendril: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475469 (https://phabricator.wikimedia.org/T207050)
[14:58:51] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[14:59:33] <wikibugs>	 (03PS1) 10Vgutierrez: tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050)
[14:59:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[15:04:38] <wikibugs>	 (03PS1) 10Hashar: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479
[15:05:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 (owner: 10Hashar)
[15:10:49] <Krenair>	 herron, about? I just noticed a wikimedia wiki password reset email got DKIM fail from google. I don't think that's expected?
[15:11:04] <wikibugs>	 (03PS1) 10Hashar: Fix invalid escape sequence in a regular expression [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475480
[15:11:28] <wikibugs>	 (03PS2) 10Hashar: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479
[15:12:14] <Krenair>	 idle 68h, nope
[15:13:08] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:13:17] <wikibugs>	 (03PS2) 10Vgutierrez: tendril: Use the cercentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475478 (https://phabricator.wikimedia.org/T207050)
[15:13:35] <icinga-wm>	 PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:13:42] <Krenair>	 guess I should file a ticket
[15:14:44] <wikibugs>	 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema) p:05Triage>03Normal
[15:17:19] <wikibugs>	 (03PS1) 10Vgutierrez: tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050)
[15:19:34] <wikibugs>	 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema)
[15:21:29] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: Revert "Unvendorize wherever possible" [debs/prometheus-php-fpm-exporter] - 10https://gerrit.wikimedia.org/r/475075 (owner: 10Giuseppe Lavagetto)
[15:21:44] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013)
[15:21:46] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[15:21:54] <wikibugs>	 (03PS1) 10Vgutierrez: librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050)
[15:21:57] <wikibugs>	 (03PS1) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050)
[15:21:59] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: site.pp: move slave redises to system::spare [puppet] - 10https://gerrit.wikimedia.org/r/443789 (https://phabricator.wikimedia.org/T198220) (owner: 10Giuseppe Lavagetto)
[15:22:48] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: mediawiki: move mediawiki::web to a profile [puppet] - 10https://gerrit.wikimedia.org/r/395715 (owner: 10Giuseppe Lavagetto)
[15:22:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[15:23:57] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: profile::mediawiki::web: explicitly set log retention days [puppet] - 10https://gerrit.wikimedia.org/r/395716 (owner: 10Giuseppe Lavagetto)
[15:24:58] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[15:26:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo)
[15:30:41] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:30:43] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:33:48] <wikibugs>	 10Operations, 10Release-Engineering-Team (Backlog): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10faidon) >>! In T203003#4768791, @hashar wrote: > I guess we can close rKEYHOLDER. Seems to me keyholder code will be moved out of operations/puppet to `operations/software/keyho...
[15:34:55] <wikibugs>	 (03CR) 10Alex Monk: [C: 031] tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:35:52] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:36:07] <wikibugs>	 (03PS2) 10Vgutierrez: tendril: get rid of the old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475481 (https://phabricator.wikimedia.org/T207050)
[15:38:45] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:38:52] <wikibugs>	 (03PS2) 10Vgutierrez: librenms: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475482 (https://phabricator.wikimedia.org/T207050)
[15:39:17] <icinga-wm>	 RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[15:40:21] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez)
[15:40:30] <wikibugs>	 (03PS2) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050)
[15:40:56] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[15:41:33] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013)
[15:41:35] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[15:41:45] <wikibugs>	 (03PS1) 10Gehel: Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484
[15:42:06] <wikibugs>	 (03PS2) 10Gehel: Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484
[15:42:23] <wikibugs>	 (03PS3) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566
[15:42:45] <wikibugs>	 (03CR) 10Gehel: [C: 032] Revert "wdqs: reduce batch size to 300 temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/475484 (owner: 10Gehel)
[15:45:13] <wikibugs>	 (03PS3) 10Vgutierrez: netbox: Get rid of http-01 challenge handling configuration [puppet] - 10https://gerrit.wikimedia.org/r/475483 (https://phabricator.wikimedia.org/T207050)
[16:03:15] <wikibugs>	 10Operations, 10User-Elukey: Apply interface::rps to all the mc hosts - https://phabricator.wikimedia.org/T209489 (10elukey) @BBlack do you have any suggestion about things to check / precautions to take before enabling interface::rps to all the memcached shards? I am asking since it is a very delicate piece o...
[16:05:05] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa)
[16:12:53] <wikibugs>	 (03PS1) 10Ladsgroup: Refactor ORES uWSGI workers to use an absolute count [puppet] - 10https://gerrit.wikimedia.org/r/475487 (https://phabricator.wikimedia.org/T182249)
[16:15:59] <wikibugs>	 (03Abandoned) 10Ladsgroup: Refactor ORES uWSGI workers to use an absolute count [puppet] - 10https://gerrit.wikimedia.org/r/475487 (https://phabricator.wikimedia.org/T182249) (owner: 10Ladsgroup)
[16:16:44] <wikibugs>	 10Operations, 10Traffic: ATS path normalization - https://phabricator.wikimedia.org/T210295 (10ema) I now see that @BBlack prepped https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/407643/ to bring reality closer to theory by adding the missing characters to `{mediawiki,restbase}_encode` -- with a caveat...
[16:18:35] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema)
[16:24:17] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[16:24:35] <icinga-wm>	 PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:26:24] <wikibugs>	 10Operations, 10Traffic: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) Downtimes set, we shouldn't get cert alerts in icinga
[16:26:43] <wikibugs>	 (03PS1) 10BBlack: Remove old globalsign unified certs from config [puppet] - 10https://gerrit.wikimedia.org/r/475489 (https://phabricator.wikimedia.org/T206804)
[16:26:45] <wikibugs>	 (03PS1) 10BBlack: Remove old globalsign unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/475490 (https://phabricator.wikimedia.org/T206804)
[16:26:53] <icinga-wm>	 RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[16:27:43] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[16:27:56] <wikibugs>	 (03CR) 10BBlack: [C: 032] Remove old globalsign unified certs from config [puppet] - 10https://gerrit.wikimedia.org/r/475489 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack)
[16:28:34] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) Status u...
[16:32:29] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) This was already a...
[16:38:27] <wikibugs>	 (03CR) 10BBlack: [C: 032] Remove old globalsign unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/475490 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack)
[16:50:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498
[16:50:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: hiera: remove the role backend in production [puppet] - 10https://gerrit.wikimedia.org/r/475499
[16:50:47] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: hiera: fix the hierarchical order of lookups [puppet] - 10https://gerrit.wikimedia.org/r/475500
[16:53:42] <bblack>	 !log cleaned up remnants of globalsign-2017 unified cert (OCSP cache/config, unmanaged cert files, etc) on all cpNNNN - T206804
[16:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:46] <stashbot>	 T206804: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804
[16:55:36] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) a:05WMEE-Phabricator-List>03None [Please do not set random assignees](https://www.mediawiki.org/wiki/How_to_report_a_bug) who cannot work on this. Thanks. :)
[16:57:49] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) @aboubacarkhoraa: Could you explain what the "GIN" in "Wikimedia-GIN" means? Also see https://meta.wikimedia.org/wiki/Mailing_lists/Standardization . Thanks!
[16:58:33] <wikibugs>	 (03PS6) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[17:01:24] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Krenair) appears to be the ISO 3166-1 alpha-3 code for Guinea
[17:06:19] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) As Krenair said, GIN is the code for Guinea, which received its affiliation yesterday, and we would like to create a mailing list for the group.
[17:17:48] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[17:29:16] <wikibugs>	 (03PS8) 10Jcrespo: mariadb: Modify dump_section to allow different types of dump [puppet] - 10https://gerrit.wikimedia.org/r/475471 (https://phabricator.wikimedia.org/T210292)
[18:42:56] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aklapper) Ah, thanks for explaining! I only knew "GN".
[19:16:48] <wikibugs>	 (03PS1) 10Ema: ATS: path normalization [puppet] - 10https://gerrit.wikimedia.org/r/475508 (https://phabricator.wikimedia.org/T210295)
[19:56:58] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) Yes, usually it's GN, but the code the affiliations committee gave me is GIN, and that's why I added it here
[21:12:04] * bawolff deploying a security thing
[21:15:19] <bawolff>	 !log deploy patch for T210192
[21:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:23] <icinga-wm>	 PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[22:39:33] <icinga-wm>	 RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.023 second response time