[00:36:59] <icinga-wm>	 PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:35] <icinga-wm>	 RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[02:25:31] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott I probably caused this!
[02:56:01] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:57:05] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:58:19] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:59:21] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:59:57] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:03:03] <icinga-wm>	 PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:41] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[03:03:47] <icinga-wm>	 PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:04:55] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[03:05:57] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:29:13] <icinga-wm>	 PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:29:39] <icinga-wm>	 RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:30:25] <icinga-wm>	 RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:55:51] <icinga-wm>	 RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:04:17] <wikibugs>	 (03PS1) 10ArielGlenn: reduce further the sleep between wikis for addds-changes dumps [dumps] - 10https://gerrit.wikimedia.org/r/508164
[04:07:41] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] reduce further the sleep between wikis for addds-changes dumps [dumps] - 10https://gerrit.wikimedia.org/r/508164 (owner: 10ArielGlenn)
[04:08:42] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@b4b7733]: reduce sleep time more between wikis for incrs
[04:08:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:48] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@b4b7733]: reduce sleep time more between wikis for incrs (duration: 00m 05s)
[04:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:55] <icinga-wm>	 PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:54:38] <wikibugs>	 (03PS2) 10Marostegui: db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127)
[04:54:51] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127)
[04:59:27] <icinga-wm>	 RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:02:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[05:03:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[05:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[05:06:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (10Marostegui) Thanks @jijiki for creating the task. We are no longer creating tasks for predictive failures, we let them fail so the task gets created automatically. We track the predictive failures at {T208...
[05:06:19] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[05:06:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (10Marostegui)
[05:07:01] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[05:08:24] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some weight to db1093 (duration: 00m 58s)
[05:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:34] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[05:09:50] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) >>! In T208323#5158076, @jcrespo wrote: > T222526 db2049 (again?)  You might be confused with db2047, I don't recall db2049 having a disk replaced lately
[05:14:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo)
[05:46:30] <icinga-wm>	 PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:05] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] backups: Decommission dbstore1001, dbstore2001 and dbstore2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) (owner: 10Jcrespo)
[05:53:55] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127)
[05:57:49] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db2045 to codfw x1 master [puppet] - 10https://gerrit.wikimedia.org/r/508168 (https://phabricator.wikimedia.org/T219493)
[05:59:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[06:00:38] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[06:00:54] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[06:01:57] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some API traffic to db1093 (duration: 00m 52s)
[06:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:06] <icinga-wm>	 RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:18:45] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[06:29:02] <icinga-wm>	 PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:29:17] <wikibugs>	 10Operations, 10serviceops, 10User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (10Joe)
[06:29:20] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (10Joe) 05Open→03Resolved
[06:29:56] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[06:30:16] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:30:22] <icinga-wm>	 PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:30:34] <icinga-wm>	 PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:31:56] <icinga-wm>	 PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:14] <icinga-wm>	 PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl]
[06:32:58] <icinga-wm>	 PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-confd.conf]
[06:33:08] <wikibugs>	 (03PS1) 10Elukey: service::uwsgi: add the core_limit parameter [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697)
[06:33:10] <wikibugs>	 (03PS1) 10Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[06:34:52] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16338/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[06:35:18] <wikibugs>	 (03CR) 10Marostegui: "The commit message says from db2103 to db2120, but I only see from db2103 to db2111, is that expected? I guess this patchset is still work" [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul)
[06:37:46] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Marostegui) p:05Triage→03Normal
[06:37:58] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:38:29] <elukey>	 running puppet on netmon1002
[06:42:10] <icinga-wm>	 RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:43:56] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync
[06:44:29] <elukey>	 in 5 mins those --^ will be re-executed, uwsgi was down
[06:48:53] <wikibugs>	 (03PS3) 10Luca Mauri: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516)
[06:49:18] <wikibugs>	 (03CR) 10Luca Mauri: "> This new file needs adding to xml/index.html too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri)
[06:51:22] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:54:14] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync
[06:55:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Disable the PHP7 beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128)
[06:55:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Disable the PHP7 beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128) (owner: 10Giuseppe Lavagetto)
[06:56:20] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Disable the PHP7 beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128)
[06:56:28] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] <icinga-wm>	 RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:06] <icinga-wm>	 RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:28] <icinga-wm>	 RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:36] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:58:42] <icinga-wm>	 RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:30] <icinga-wm>	 RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:01] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) Added a couple of code reviews as attempt to add the LimitCore to the netbox's systemd unit. If this is not the idea that you guys had, please feel free to d...
[07:17:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn)
[07:19:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[07:19:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[07:23:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] "These are getting decommisioned, but it's currently blocked on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/466833/ getting merg" [puppet] - 10https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) (owner: 10Jbond)
[07:32:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch neodymium/sarin to spares [puppet] - 10https://gerrit.wikimedia.org/r/508277
[07:54:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: labmon1001 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987)
[08:03:20] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946
[08:03:21] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578
[08:03:23] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947
[08:07:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto)
[08:07:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto)
[08:07:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[08:07:52] <_joe_>	 jeez what's up with sphinx
[08:17:43] <wikibugs>	 (03CR) 10Elukey: service::uwsgi: add the core_limit parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[08:19:09] <wikibugs>	 (03PS2) 10Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[08:19:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine, but needs meeting approval." [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey)
[08:20:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Drop trusty from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/507327
[08:21:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Drop trusty from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/507327 (owner: 10Muehlenhoff)
[08:22:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508277 (owner: 10Muehlenhoff)
[08:23:25] <wikibugs>	 (03PS1) 10Ema: prometheus: add upload_ats target [puppet] - 10https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:28:16] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the 14th May" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui)
[08:30:28] <wikibugs>	 (03PS2) 10Ema: prometheus: add upload_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:35:59] <wikibugs>	 (03PS1) 10Elukey: role::deployment_server: remove analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/508286
[08:37:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, one optional comment inline." (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066 (owner: 10CRusnov)
[08:39:01] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508067 (owner: 10CRusnov)
[08:39:36] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch neodymium/sarin to spares [puppet] - 10https://gerrit.wikimedia.org/r/508277
[08:40:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch neodymium/sarin to spares [puppet] - 10https://gerrit.wikimedia.org/r/508277 (owner: 10Muehlenhoff)
[08:42:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for pushing for it" [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[08:42:42] <wikibugs>	 (03PS2) 10Elukey: role::deployment_server: remove analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/508286
[08:44:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::deployment_server: remove analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/508286 (owner: 10Elukey)
[08:44:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add upload_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[08:44:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[08:45:50] <wikibugs>	 (03PS3) 10Ema: prometheus: add upload_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:47:35] <wikibugs>	 (03CR) 10Ema: [C: 03+2] prometheus: add upload_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[08:48:05] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10jcrespo) > You might be confused with db2047, I don't recall db2049 having a disk replaced lately  //Marostegui updated the task description. Feb 12 2019, 07:40://  https://phabricator.wikimedia.or...
[08:48:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: labmon1001 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi)
[08:48:46] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: labmon1001 to prometheus v2 [puppet] - 10https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987)
[08:50:09] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) >>! In T208323#5159099, @jcrespo wrote: >> You might be confused with db2047, I don't recall db2049 having a disk replaced lately >  > //Marostegui updated the task description. Feb 12...
[08:55:32] <wikibugs>	 (03PS2) 10Elukey: service::uwsgi: add the core_limit parameter [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697)
[08:56:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] service::uwsgi: add the core_limit parameter [puppet] - 10https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[08:57:12] <wikibugs>	 (03PS3) 10Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[08:57:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - 10https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: 10Elukey)
[09:00:10] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572)
[09:00:12] <wikibugs>	 (03PS2) 10Jcrespo: backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002)
[09:03:16] <wikibugs>	 (03PS8) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150)
[09:03:18] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi)
[09:03:20] <godog>	 !log upgrade labmon1001 to prometheus 2 - T187987
[09:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:25] <stashbot>	 T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987
[09:03:54] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:05:28] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:35] <wikibugs>	 10Operations, 10serviceops, 10Beta-Feature, 10Patch-For-Review, 10User-jijiki: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (10jijiki)
[09:07:26] <icinga-wm>	 PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:08] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) (owner: 10Jcrespo)
[09:12:12] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5155909, @Ottomata wrote: > I don't think Magnus would build it into librdkafk...
[09:12:50] <icinga-wm>	 RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 98.56 ms
[09:18:39] <wikibugs>	 (03PS31) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[09:18:41] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[09:31:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: Identify trafficserver instances using the layer label (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[09:35:52] <elukey>	 !log restart netbox on netmon1002 (trying to reproduce the segfault) - T212697
[09:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:56] <stashbot>	 T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697
[09:37:13] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) 05Resolved→03Open Looks good, with the removal of udev from the component, can we please a...
[09:42:09] <wikibugs>	 (03PS4) 10Effie Mouzeli: cxserver: Clean up old scb cluster stanzas [puppet] - 10https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195)
[09:43:06] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says:  ` Core was generated by `/usr/bin/uwsgi --die...
[09:43:11] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) Eventually I have unzipped them, the reason is the log rotation is handled by python logging not by...
[09:54:35] <wikibugs>	 (03PS1) 10Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[09:57:52] <wikibugs>	 (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16342/bast4002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[09:59:12] <arturo>	 !log T222148 upgrade udev & libudev1 on cloudvirt[1001-1003,1005].eqiad.wmnet
[09:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:16] <stashbot>	 T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148
[10:01:07] <wikibugs>	 (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[10:03:40] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) 05Open→03Resolved cloudvirt[1014,1016-1017,1021-1023].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmne...
[10:05:20] <arturo>	 !log upgrade udev in cloudservices2002-dev
[10:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:38] <wikibugs>	 (03PS2) 10Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[10:11:43] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[10:12:15] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) Looks good, thanks
[10:12:45] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Missing return inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[10:14:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add glob for ATS to file_sd_configs [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[10:14:40] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946
[10:14:42] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578
[10:14:44] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947
[10:16:54] <wikibugs>	 (03PS1) 10Ladsgroup: Set $wgOresFrontendBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396)
[10:17:59] <wikibugs>	 (03CR) 10Volans: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[10:20:59] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[10:25:46] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] cxserver: Clean up old scb cluster stanzas [puppet] - 10https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195) (owner: 10Effie Mouzeli)
[10:26:27] <wikibugs>	 (03PS3) 10Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[10:27:06] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[10:27:53] <wikibugs>	 (03CR) 10Ema: [C: 03+2] prometheus: add glob for ATS to file_sd_configs [puppet] - 10https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[10:30:04] <jouncebot>	 jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1030).
[10:30:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: elastalert: new module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi)
[10:30:31] <wikibugs>	 (03PS8) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933)
[10:30:33] <wikibugs>	 (03PS8) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933)
[10:31:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[10:32:35] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[10:33:13] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546)
[10:34:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "a couple minor nitpicks, but LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli)
[10:34:43] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[10:36:24] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Some minor things inline and one potentially major if we want to be safe with the depooling." (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[10:36:59] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:38:03] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:38:19] <wikibugs>	 (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:38:25] <icinga-wm>	 PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:29] <icinga-wm>	 PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:31] <volans>	 ema: I'm having a look at the lvs3001 puppet failure
[10:38:42] <jijiki>	 the scb* alert is me 
[10:38:48] <volans>	 ack
[10:38:49] <icinga-wm>	 PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:51] <jijiki>	 will fix 
[10:38:51] <icinga-wm>	 PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:55] <icinga-wm>	 PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:59] <jijiki>	 arg
[10:39:05] <icinga-wm>	 PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:15] <icinga-wm>	 PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:29] <icinga-wm>	 PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:31] <icinga-wm>	 PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:33] <icinga-wm>	 PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:40:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] hhvm: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[10:40:31] <wikibugs>	 (03PS11) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[10:40:39] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:40:57] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:41:47] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:41:57] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:42:46] <wikibugs>	 (03PS9) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933)
[10:42:48] <wikibugs>	 (03PS9) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933)
[10:43:11] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:508302| Bumping portals to master (T128546)]] (duration: 00m 52s)
[10:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:16] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:43:26] <volans>	 ema: did puppetmerge fail on all the other hosts?
[10:43:34] <ema>	 volans: looking
[10:43:35] <volans>	 it seems your last merged change was not propagated
[10:44:03] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:508302| Bumping portals to master (T128546)]] (duration: 00m 51s)
[10:44:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:29] <ema>	 volans: yes it seems puppet-merge had issues
[10:44:33] <ema>	 https://phabricator.wikimedia.org/P8477
[10:44:42] <ema>	 error: cannot lock ref 'refs/remotes/origin/production': is at 73820b7f34685628be58c2166da5baf16c3830fe but expected da491ce740b81ddfdd43166ab3583dc646c5a89e
[10:45:01] <icinga-wm>	 RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational
[10:45:23] <icinga-wm>	 RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational
[10:45:23] <icinga-wm>	 RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational
[10:45:27] <icinga-wm>	 RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational
[10:45:37] <icinga-wm>	 RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational
[10:45:39] <volans>	 _joe_: for the conftool missing key ^^^ (ema's phaste)
[10:45:47] <icinga-wm>	 RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational
[10:45:58] <_joe_>	 what?
[10:45:59] <icinga-wm>	 RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational
[10:46:03] <_joe_>	 I miss all context
[10:46:03] <icinga-wm>	 RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational
[10:46:05] <icinga-wm>	 RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational
[10:46:40] <ema>	 _joe_: puppet-merge failed, log here: https://phabricator.wikimedia.org/P8477
[10:46:55] <_joe_>	 ema: yeah scb2005/cxserver
[10:46:55] <volans>	 and apart the puppet failure, also confctl failed at the end
[10:47:06] <_joe_>	 I guess you had some race condition?
[10:47:54] <_joe_>	 frankly, I don't think it has to do with confctl
[10:47:56] <_joe_>	 but lemme check
[10:48:57] <_joe_>	 that object doesn't exist indeed
[10:49:07] <_joe_>	 what it looks like from here is
[10:49:20] <_joe_>	 someone launched puppet-merge from two servers at the same time
[10:49:32] <jijiki>	 I was merging for sure
[10:49:42] <jijiki>	 but my changes went to all servers
[10:49:53] <ema>	 my change also got applied
[10:50:01] <jijiki>	 lovely 
[10:50:30] <jijiki>	 but I didn't get any errors
[10:50:33] <ema>	 I puppet-merged on puppetmaster1001
[10:50:40] <jijiki>	 me too 
[10:50:58] <volans>	 ema: your changes are not on all the other puppetmasters
[10:51:00] <volans>	 only on 1001
[10:51:08] <_joe_>	 so yes
[10:51:11] <_joe_>	 a race condition
[10:51:25] <jijiki>	 :/
[10:51:28] <_joe_>	 so, 1 - don't let your changed unmerged for long
[10:51:29] <jijiki>	 sorry ema 
[10:51:33] <_joe_>	 2 - we need a global lock
[10:51:48] <jijiki>	 I doubt we left unmerged changes for log 
[10:51:53] <jijiki>	 long 
[10:52:03] <arturo>	 :-S
[10:52:43] <ema>	 yeah I puppet-merged right away, I think the issue is just (2)
[10:53:12] <volans>	 btw is someone fixing it?
[10:53:47] <ema>	 I guess puppet-merging something new would be a fix?
[10:54:04] <_joe_>	 no
[10:54:16] <_joe_>	 puppet-merge $sha1 would be
[10:54:18] <_joe_>	 maybe from 2001
[10:54:25] <_joe_>	 and on the other failed servers
[10:54:31] <ema>	 ok, doing!
[10:54:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi)
[10:55:58] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge.
[10:56:43] <wikibugs>	 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10akosiaris) >>! In T221529#5143984, @jbond wrote:  > The error happened as puppet-merge was rolling out changes.  I have not looked at how puppet-merge works but this looks like it is caused by a...
[10:56:44] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge.
[10:59:20] <icinga-wm>	 RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational
[10:59:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "the code is correct but I'd prefer another format." (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac)
[10:59:53] <ema>	 !log manual puppet-merge $sha on failed puppetmasters https://phabricator.wikimedia.org/P8477 
[10:59:54] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge.
[10:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1100).
[11:00:04] <jouncebot>	 Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:10] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge.
[11:00:38] <volans>	 thanks for fixing
[11:01:42] <ema>	 volans, _joe_: done, thanks!
[11:05:50] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:06:13] <wikibugs>	 (03PS1) 10Alaa Sarhan: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439)
[11:06:41] <ema>	 volans: have you found out anything about lvs3001's puppetfail?
[11:06:58] <volans>	 ema: just the usual AH01102 :(
[11:07:34] <_joe_>	 a 503
[11:07:43] <volans>	 not a 502
[11:07:46] <volans>	 not a 503
[11:07:52] <alaa_wmde>	 anyone Swatting now?
[11:07:54] <_joe_>	 is it me or puppet (the passenger app) crashes more often than before?
[11:08:17] <_joe_>	 since a few months
[11:08:33] <volans>	 it seems so from the data chris was gathering, although that data is skewed at each puppet-merge that breaks puppet on a large number of hosts
[11:08:35] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) Highlight to:  ` #7  0x00005645dd3aaf83 in uwsgi_segfault (signum=11) at core/uwsgi.c:1839 #8  <signal handler called> #9  0x00007ffa725bac99 in uwsgi_socket...
[11:08:43] <volans>	 we have 2 open tasks for that
[11:09:09] <wikibugs>	 (03PS1) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[11:09:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto)
[11:09:20] <_joe_>	 "recheck"
[11:09:22] <_joe_>	 sigh
[11:14:51] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan)
[11:16:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "jenkins bot is not verifying this change because some issue with zuul apparently." [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[11:16:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan)
[11:17:29] <wikibugs>	 (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/16347/" [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff)
[11:17:30] <arturo>	 !log merging puppet change to the sudo module https://gerrit.wikimedia.org/r/c/operations/puppet/+/507376
[11:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:36] <wikibugs>	 (03PS5) 10Muehlenhoff: dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413
[11:17:42] <wikibugs>	 (03PS2) 10Alaa Sarhan: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439)
[11:17:57] <alaa_wmde>	 hey, we just added patch 508303 to current SWAT, if there's any chance for it to be deployed that would be so great :)
[11:19:00] <alaa_wmde>	 ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/508303 )
[11:19:07] <volans>	 arturo: that patch changes things to some prod hosts too
[11:19:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff)
[11:19:19] <arturo>	 volans: I know
[11:19:31] <arturo>	 volans: see https://puppet-compiler.wmflabs.org/compiler1002/16343/
[11:20:07] <volans>	 on mwmaint it uses sudoldap though
[11:20:12] <volans>	 is that intended?
[11:20:25] <volans>	 they include the ldap::client::includes class
[11:20:37] <arturo>	 https://puppet-compiler.wmflabs.org/compiler1002/16343/mwmaint1002.eqiad.wmnet/ ?
[11:21:12] <arturo>	 oh I see what you mean
[11:22:02] <arturo>	 volans: I guess nobody should be using sudoldap outside cloudVPS, but I may be wrong
[11:22:14] <volans>	 also, I don't see any review on that patch, and it's obviously not trivial or emergency bugfix
[11:23:49] <arturo>	 not sure if you are suggesting to revert it
[11:24:37] <volans>	 it's not clear to me the impact of it, I just had a quick look and saw that it changes some behaviour of some classes applied to prod hosts and I'm not sure if they are intended
[11:25:18] <arturo>	 yes, is inteded, we are adding a new parameter to the sudo class, sudo::user and sudo::group as well
[11:25:44] <_joe_>	 is jenkins working for puppet?
[11:25:57] <arturo>	 this patch was in fact already applied to prod without issues. It got reverted because we had issues inside cloudvps VMs
[11:26:05] <arturo>	 and this is the second attempt to merge it
[11:26:08] <arturo>	 volans: ^^^
[11:26:32] <ema>	 _joe_: sporadically I think, see the logs on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508296/
[11:27:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308
[11:27:30] <arturo>	 volans: I'm ready/open to revert if you think we should do so
[11:28:44] <_joe_>	 arturo: I'll be frank: you should've asked - an waited - for a review for a modification to the sudo module, it's quite fundamental
[11:29:13] <arturo>	 ok, give me a second, I will revert it
[11:29:23] <volans>	 arturo: as I said, I don't know, but would have expected at least a review from moritz or john
[11:29:28] <_joe_>	 I don't think that's needed
[11:29:39] <volans>	 and the part that puzzles me for mwmaint hosts is
[11:29:41] <volans>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/507376/11/modules/ldap/manifests/client/includes.pp#1
[11:29:42] <_joe_>	 but for the future, keep that in mind please :)
[11:29:43] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309
[11:29:56] <volans>	 in the sense that that file is included on those hosts
[11:30:08] <volans>	 and I'm not sure if that change in behaviour there is intented
[11:30:31] <arturo>	 I prefer to do things right and without disturbing anybody, so I'm reverting the patch right now
[11:30:49] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309
[11:31:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309 (owner: 10Arturo Borrero Gonzalez)
[11:32:17] <arturo>	 !log reverting puppet change to the sudo module
[11:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:50] <wikibugs>	 (03PS2) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[11:34:02] <Amir1>	 I'm deploying those now
[11:34:05] <Amir1>	 sorry for being late
[11:34:12] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan)
[11:34:21] <alaa_wmde>	 thank you @Amir1 
[11:35:15] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan)
[11:35:55] <icinga-wm>	 PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:36:04] <wikibugs>	 (03CR) 10jenkins-bot: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan)
[11:36:15] <icinga-wm>	 PROBLEM - puppet last run on ores2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:37:09] <icinga-wm>	 PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:37:09] <icinga-wm>	 PROBLEM - puppet last run on db2099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:37:13] <Amir1>	 alaa_wmde: it's live in mwdebug1002
[11:37:47] <wikibugs>	 (03PS23) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[11:38:55] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:39:26] <_joe_>	 it seems it was a very small spike
[11:39:33] <_joe_>	 but I'll keep an eye on that
[11:39:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:40:03] <icinga-wm>	 PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:40:03] <_joe_>	 May  6 11:32:59 ores2008 puppet-agent[35832]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Could not find declared class sudo::sudoersfile at /etc/puppet/modules/sudo/manifests/init.pp:6:5 on node ores2008.codfw.wmnet
[11:40:06] <_joe_>	 ij,
[11:40:09] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[11:40:19] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[11:40:34] <_joe_>	 uhm
[11:40:34] <arturo>	 _joe_: taking a look at ores2008
[11:40:39] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:508303|Enable Suggestion Constraint Status on Wikidata]] (duration: 00m 52s)
[11:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:57] <_joe_>	 arturo: no it's ok
[11:41:04] <_joe_>	 just a race condition in puppet-merge I guess
[11:41:16] <_joe_>	 I just re-run puppet and it was ok
[11:41:22] <arturo>	 ok then, sorry for the noise
[11:41:35] <icinga-wm>	 RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[11:42:13] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[11:42:19] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:42:29] <icinga-wm>	 RECOVERY - puppet last run on db1096 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:42:29] <icinga-wm>	 RECOVERY - puppet last run on db2099 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:44:29] <wikibugs>	 (03CR) 10Ladsgroup: "jenkins seems to have problems with my patch. Will try again later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[11:44:43] <Amir1>	 !log EU SWAT is done
[11:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:58] <_joe_>	 Amir1: CI stuck for you too?
[11:45:05] <Amir1>	 yeah 
[11:45:24] <Amir1>	 _joe_: it doesn't even start the job: https://integration.wikimedia.org/zuul/
[11:45:34] <wikibugs>	 (03PS1) 10Muehlenhoff: openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310
[11:45:56] <Amir1>	 hmm, it was just slow
[11:46:49] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[11:47:15] <_joe_>	 https://grafana.wikimedia.org/d/000000321/zuul?panelId=8&fullscreen&orgId=1&from=now-3h&to=now up to two hours :D
[11:47:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225)
[11:48:03] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[11:48:11] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[11:50:40] <wikibugs>	 (03PS2) 10Muehlenhoff: openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310
[11:52:27] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "Do you have a PCC run for this?" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[11:52:44] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Tobi_WMDE_SW) As the Engineering Manager at WMDE responsible for Adam's team, I endorse this request.
[11:52:59] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2059198 https://wikitech.wikimedia.org/wiki/Varnish
[12:01:58] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/16351/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[12:06:39] <icinga-wm>	 RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[12:07:21] <wikibugs>	 (03CR) 10Effie Mouzeli: mediawiki: if guard php72_only blocks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli)
[12:07:47] <icinga-wm>	 RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:11:47] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[12:14:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:17:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "I just ran a more wider PCC: https://puppet-compiler.wmflabs.org/compiler1001/16352/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[12:19:01] <wikibugs>	 (03PS4) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401)
[12:21:18] <wikibugs>	 (03CR) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac)
[12:24:08] <moritzm>	 !log installing golang security updates on jessie
[12:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:11] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[12:34:13] <wikibugs>	 (03PS1) 10Mathew.onipe: maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313
[12:34:31] <wikibugs>	 (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[12:35:54] <wikibugs>	 (03PS24) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[12:43:46] <moritzm>	 !log installing rsync security updates
[12:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:33] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational
[12:51:53] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:52:09] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:52:19] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:52:36] <xSavitar>	 Hi, is there anything wrong with the "test" pipeline? "recheck" doesn't work, rebasing a patch doesn't run tests etc. Cc paladox
[12:52:59] <paladox>	 apparently _joe_ filed a task about ci not working.
[12:53:05] <paladox>	 https://phabricator.wikimedia.org/T222605
[12:53:07] <xSavitar>	 Okay, thanks. 
[12:53:53] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[12:53:57] <xSavitar>	 paladox: And we subscribed same time
[12:54:01] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:54:01] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:54:02] <paladox>	 heh
[12:54:31] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:55:23] <icinga-wm>	 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[12:56:05] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:56:15] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:58:30] <wikibugs>	 (03PS7) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[12:59:15] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:59:17] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[13:00:27] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[13:00:37] <icinga-wm>	 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[13:02:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC is fine at https://puppet-compiler.wmflabs.org/compiler1002/16349/mc2026.codfw.wmnet/. There will only be ferm reload on most hosts, w" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk)
[13:09:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) 05Open→03Resolved This host recovered itself, so closing for now as nothing is to be done.
[13:11:31] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM, appart from the failing build :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond)
[13:14:34] <ema>	 both 503 spikes seem to have been triggered by api.php requests
[13:15:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk)
[13:15:03] <ema>	 varnish backends in eqiad look fine
[13:15:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk)
[13:15:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk)
[13:16:16] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "relforge and cloudelastic should also be configured." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[13:16:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk)
[13:17:11] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "Do we even need to keep 3 versions of Python supported?" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:18:41] <wikibugs>	 (03PS8) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[13:20:02] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:20:13] <wikibugs>	 (03PS3) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:20:18] <moritzm>	 !log installing unzip security updates
[13:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:21] <wikibugs>	 10Operations, 10Cloud-Services, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10chasemp) I was 6 months off on my estimate for this :)
[13:20:29] <wikibugs>	 (03PS9) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[13:22:21] <moritzm>	 !log installing audiofile security updates
[13:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:23] <wikibugs>	 (03PS4) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:24:29] <wikibugs>	 (03CR) 10Papaul: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul)
[13:25:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for audiofile [puppet] - 10https://gerrit.wikimedia.org/r/508316
[13:26:01] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] icinga: create and apply cirrus config check(recheck) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[13:26:26] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hint for audiofile [puppet] - 10https://gerrit.wikimedia.org/r/508316 (owner: 10Muehlenhoff)
[13:26:30] <wikibugs>	 (03PS10) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[13:29:53] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[13:30:43] <wikibugs>	 (03PS5) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:30:52] <wikibugs>	 (03CR) 10CDanis: "PCC looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/16361/" [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis)
[13:31:22] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[13:31:53] <wikibugs>	 (03CR) 10Marostegui: "> > Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul)
[13:32:33] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "> Patch Set 1:" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:33:39] <wikibugs>	 (03PS6) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:33:46] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:37:37] <wikibugs>	 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff)
[13:37:58] <wikibugs>	 (03PS7) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:39:55] <wikibugs>	 (03Merged) 10jenkins-bot: Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:40:10] <wikibugs>	 (03CR) 10jenkins-bot: Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans)
[13:41:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318
[13:42:26] <wikibugs>	 (03PS8) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:44:34] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 (owner: 10Muehlenhoff)
[13:44:42] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318
[13:45:11] <wikibugs>	 (03PS9) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:45:30] <wikibugs>	 (03PS4) 10Paladox: Gerrit: Enable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127
[13:46:12] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 (owner: 10Muehlenhoff)
[13:48:16] <wikibugs>	 (03PS10) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[13:48:59] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493)
[13:49:30] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "This needs to be submited after the topology changes and after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508168/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui)
[13:51:55] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[13:52:09] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[13:52:14] <hashar>	 !log CI does not run sometime for some reason ... https://phabricator.wikimedia.org/T222614  :(
[13:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:50] <moritzm>	 !log installing zziplib security updates
[13:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:51] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/staging-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: staging]
[13:56:52] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed
[13:56:52] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics finished
[13:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:22] <wikibugs>	 (03PS11) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[14:03:07] <ema>	 !log cp3038: restart varnish-be 
[14:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:50] <wikibugs>	 (03CR) 10Volans: [C: 04-1] wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[14:05:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis)
[14:05:34] <wikibugs>	 (03PS1) 10Ottomata: eventgate - fix duplicate config error_stream in config.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/508324 (https://phabricator.wikimedia.org/T218346)
[14:06:26] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - fix duplicate config error_stream in config.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/508324 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[14:06:31] <wikibugs>	 (03CR) 10Volans: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[14:06:44] <wikibugs>	 (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis)
[14:07:01] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish
[14:07:29] <wikibugs>	 (03PS5) 10Paladox: Gerrit: Enable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127
[14:09:38] <hashar>	 !log CI workflow fixed by reverting a change deployed around 10:00 UTC # T222614
[14:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:42] <stashbot>	 T222614: CI no more triggers for some/all? repositories! - https://phabricator.wikimedia.org/T222614
[14:11:14] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: staging]
[14:11:15] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed
[14:11:15] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics finished
[14:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[14:12:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, it requires a patch to Puppet to add the timeout to the configuration before deploying this." [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[14:13:03] <wikibugs>	 (03PS12) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967)
[14:13:09] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Forgot to add a note" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[14:15:13] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui)
[14:15:15] <wikibugs>	 (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1001/16369/" [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[14:15:39] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[14:15:47] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[14:16:23] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217)
[14:18:07] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[14:18:29] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[14:19:27] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[14:19:52] <ema>	 !log depool cp4026 and reimage as upload_ats T219967
[14:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:56] <stashbot>	 T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo   - https://phabricator.wikimedia.org/T219967
[14:19:57] <wikibugs>	 10Operations, 10Puppet, 10Patch-For-Review, 10User-herron: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819 (10MoritzMuehlenhoff) We can close this, given that the patch by @jbond was merged, right?
[14:20:31] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[14:23:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[14:25:08] <moritzm>	 !log installing vips security updates
[14:25:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:33] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Papaul) a:05Papaul→03jcrespo Done
[14:26:00] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4026.ulsfo.wmnet'] ` The log can be...
[14:27:29] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:29:31] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217)
[14:29:33] <wikibugs>	 (03PS7) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217)
[14:29:35] <wikibugs>	 (03PS32) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:29:37] <wikibugs>	 (03PS3) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[14:29:55] <cdanis>	 hashar: is the icinga notification above a concern?
[14:35:35] <godog>	 !log swift eqiad-prod: finish decom ms-be101[45] - T220590
[14:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:39] <stashbot>	 T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590
[14:36:16] <wikibugs>	 (03PS6) 10Ema: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles)
[14:37:51] <icinga-wm>	 PROBLEM - Host cp1083 is DOWN: PING CRITICAL - Packet loss = 100%
[14:39:38] <elukey>	 ema: working on cp1083? --^
[14:39:41] <ema>	 nope
[14:40:03] <ema>	 depooling
[14:40:19] <icinga-wm>	 RECOVERY - EDAC syslog messages on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[14:40:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::deployment_server: depend on base role [puppet] - 10https://gerrit.wikimedia.org/r/508334
[14:40:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335
[14:41:00] <wikibugs>	 (03CR) 10Andrew Bogott: "I believe we still have exactly one VM still running Trusty -- labstore1003, which is waiting on T209527." [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[14:41:05] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1083.eqiad.wmnet
[14:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:22] <ema>	 thanks for the ping elukey 
[14:42:58] <ema>	 !log powercycle cp1083
[14:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:49] <elukey>	 ema: <3
[14:43:54] <wikibugs>	 (03CR) 10Andrew Bogott: openstack: Remove support for Ubuntu/Upstart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[14:43:56] <wikibugs>	 (03CR) 10Muehlenhoff: "But it doesn't use any of the classes touched here, see https://puppet-compiler.wmflabs.org/compiler1001/16352/labstore1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[14:44:09] <icinga-wm>	 PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:13] <icinga-wm>	 PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:44:15] <icinga-wm>	 PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:44:17] <icinga-wm>	 PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:26] <ema>	 we know icinga we know
[14:44:33] <icinga-wm>	 PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:44:35] <icinga-wm>	 PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:35] <icinga-wm>	 PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:39] <icinga-wm>	 PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:44:41] <icinga-wm>	 PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:43] <icinga-wm>	 PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:51] <icinga-wm>	 PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:51] <icinga-wm>	 PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:57] <icinga-wm>	 PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:57] <icinga-wm>	 PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:44:59] <icinga-wm>	 PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:44:59] <icinga-wm>	 PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6
[14:45:07] <icinga-wm>	 PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:45:07] <icinga-wm>	 PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6
[14:45:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335
[14:45:19] <wikibugs>	 (03PS11) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[14:46:07] <icinga-wm>	 RECOVERY - Host cp1083 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[14:46:11] <icinga-wm>	 RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 36 ESP OK
[14:46:13] <icinga-wm>	 RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 36 ESP OK
[14:46:17] <icinga-wm>	 RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK
[14:46:17] <icinga-wm>	 RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 36 ESP OK
[14:46:17] <icinga-wm>	 RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK
[14:46:17] <icinga-wm>	 RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 36 ESP OK
[14:46:27] <icinga-wm>	 RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK
[14:46:27] <icinga-wm>	 RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 36 ESP OK
[14:46:44] <wikibugs>	 (03PS12) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[14:46:49] <icinga-wm>	 RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 36 ESP OK
[14:46:53] <icinga-wm>	 RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK
[14:46:55] <icinga-wm>	 RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK
[14:46:59] <icinga-wm>	 RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 36 ESP OK
[14:47:13] <icinga-wm>	 RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK
[14:47:15] <icinga-wm>	 RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 36 ESP OK
[14:47:15] <icinga-wm>	 RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 36 ESP OK
[14:47:19] <icinga-wm>	 RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK
[14:47:23] <icinga-wm>	 RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 36 ESP OK
[14:47:23] <icinga-wm>	 RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 36 ESP OK
[14:47:56] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335
[14:47:59] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "True :)" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff)
[14:49:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308 (owner: 10Muehlenhoff)
[14:49:35] <wikibugs>	 (03PS3) 10Andrew Bogott: nova: pool cloudvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 (https://phabricator.wikimedia.org/T221141)
[14:50:10] <wikibugs>	 (03PS8) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217)
[14:50:12] <wikibugs>	 (03PS33) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[14:50:14] <wikibugs>	 (03PS4) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[14:54:19] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[14:54:25] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2049 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T222622 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:54:30] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) @bblack circling back on this, do you still see any issue now after the Silverpop SSL improvements?
[14:54:33] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222622 (10ops-monitoring-bot)
[14:54:53] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 2:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[14:55:41] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 901 days) https://wikitech.wikimedia.org/wiki/Logs
[14:57:07] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Marostegui)
[14:57:09] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222622 (10Marostegui)
[14:57:25] <godog>	 !log capture strace / core for rsyslog on wezen / lithium and restart - T199406
[14:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:30] <stashbot>	 T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406
[14:58:02] <wikibugs>	 (03PS7) 10Ema: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles)
[14:58:03] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on db2049 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[14:58:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova: pool cloudvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 (https://phabricator.wikimedia.org/T221141) (owner: 10Andrew Bogott)
[14:58:39] <wikibugs>	 (03PS9) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[14:59:24] <wikibugs>	 (03CR) 10Effie Mouzeli: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1001/16353/mw1222.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli)
[14:59:30] <wikibugs>	 (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[14:59:47] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[15:01:07] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew)
[15:01:19] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[15:01:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[15:01:25] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) 05Open→03Resolved Thank you for working on all these, @Cmjohnson !
[15:01:35] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 838 days) https://wikitech.wikimedia.org/wiki/Logs
[15:01:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[15:02:01] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Andrew) 05Open→03Resolved
[15:02:03] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Andrew) 05Open→03Resolved
[15:02:06] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[15:02:11] <wikibugs>	 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Andrew) 05Open→03Resolved
[15:02:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[15:02:39] <wikibugs>	 (03PS2) 10Mathew.onipe: maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313
[15:02:44] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[15:03:51] <wikibugs>	 (03PS3) 10Papaul: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613
[15:04:25] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[15:04:40] <wikibugs>	 (03PS13) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108)
[15:04:48] <wikibugs>	 (03CR) 10Ema: [C: 03+1] prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez)
[15:07:01] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "Great stuff!" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[15:07:05] <wikibugs>	 (03PS4) 10Papaul: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613
[15:07:17] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[15:08:12] <volans>	 moritzm: related to ongoing upgrades? ^^^
[15:08:29] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] `  and were **ALL** successful.
[15:09:02] <godog>	 rsyslog on lithium is me
[15:09:14] <volans>	 ack, sorry for the ping mor.itz
[15:09:32] <volans>	 I missed your ! log, my bad
[15:09:43] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 901 days) https://wikitech.wikimedia.org/wiki/Logs
[15:09:52] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: add new attribute (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[15:10:02] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis)
[15:10:35] <wikibugs>	 (03PS4) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932)
[15:11:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[15:12:30] <godog>	 I found the issue with rsyslog itself bizzarre, taking a break and then I'll look into it
[15:12:45] <icinga-wm>	 PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4026 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd
[15:12:57] <ema>	 that's me ^
[15:12:59] <icinga-wm>	 PROBLEM - Varnish traffic logger - varnishstatsd on cp4026 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish
[15:13:07] <James_F>	 jouncebot: now
[15:13:07] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 46 minute(s)
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on cp4026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.126: Connection reset by peer Ema reimaged w/ ATS
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - Confd template for /etc/varnish/directors.backend.vcl on cp4026 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Confd
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp4026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.126: Connection reset by peer Ema reimaged w/ ATS
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - IPsec on cp4026 is CRITICAL: NRPE: Command check_IPsec not defined Ema reimaged w/ ATS
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish HTTP upload-backend - port 3128 on cp4026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1982 bytes in 0.747 second response time Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish
[15:13:19] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp4026 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish
[15:14:03] <wikibugs>	 (03PS5) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932)
[15:14:05] <wikibugs>	 (03PS25) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[15:14:31] <ema>	 !log pool cp4026 w/ ATS backend T219967
[15:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:36] <stashbot>	 T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo   - https://phabricator.wikimedia.org/T219967
[15:14:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[15:16:25] <wikibugs>	 (03PS5) 10Marostegui: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul)
[15:17:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul)
[15:20:23] <andrewbogott>	 vgutierrez:   I still want to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ — I'm not clear on if that still affects lvs hosts or if they've all moved to Buster already?
[15:21:26] <moritzm>	 andrewbogott: lvs hosts are on stretch and partly jessie
[15:23:45] <andrewbogott>	 moritzm: ok, so some things would be touched by that patch then
[15:23:50] <vgutierrez>	 pcc :)
[15:23:55] * andrewbogott wonders how they are working now
[15:24:01] <vgutierrez>	 but AFAIK that won't affect existing lvs servers
[15:24:09] <vgutierrez>	 at leat not those that are in production right now
[15:24:15] <andrewbogott>	 vgutierrez: great, I'll give the pcc a try
[15:24:23] <wikibugs>	 (03CR) 10Volans: "Looks good in general, some minor possible improvements inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi)
[15:24:23] <andrewbogott>	 Assuming you don't object in principle :)
[15:24:48] <vgutierrez>	 https://puppet-compiler.wmflabs.org/compiler1002/15648/ --> this is an old PCC run against that CR
[15:25:10] <vgutierrez>	 it shows a NOOP for all existing lvs
[15:25:13] <wikibugs>	 (03CR) 10Andrew Bogott: "pcc run in progress: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16377/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez)
[15:25:51] <wikibugs>	 (03CR) 10Ema: [C: 04-1] "18-normalize-thumbnail-url.vtc is failing. Please fix that." [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles)
[15:28:37] <wikibugs>	 (03PS6) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932)
[15:28:38] <wikibugs>	 (03PS26) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[15:29:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[15:30:04] <moritzm>	 !log updating base-files from recent stretch point release
[15:30:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:13] <wikibugs>	 (03PS1) 10RobH: setting production dns entries for db11[26-38].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/508354 (https://phabricator.wikimedia.org/T211613)
[15:30:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[15:30:28] <wikibugs>	 (03PS25) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594)
[15:31:18] <wikibugs>	 (03CR) 10RobH: [C: 03+2] setting production dns entries for db11[26-38].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/508354 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH)
[15:31:29] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui)
[15:32:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH)
[15:33:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH) a:05RobH→03Marostegui All set!
[15:35:45] <papaul>	 !log shutting down elastic2038 for DIMM swap
[15:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:07] <moritzm>	 !log updating firmware-bnx2 (from stretch point release, this is a NOP, the source package firmware-nonfree was updated for various Wifi chipsets we don't use, doublechecked by comparing check sums for old and new bnx2 firmware)
[15:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:12] <wikibugs>	 10Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) Same issue today, rsyslog was stuck on lithium and wezen, strace shows a whole lot of this:  ` 37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) 37...
[15:37:30] <wikibugs>	 (03PS7) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932)
[15:37:32] <wikibugs>	 (03PS27) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[15:41:34] <icinga-wm>	 PROBLEM - Host elastic2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:34] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:44:05] <godog>	 ^ known
[15:45:02] <wikibugs>	 (03PS1) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613)
[15:46:28] <wikibugs>	 (03PS2) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613)
[15:47:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH)
[15:47:36] <robh>	 yeah yeah jenkinsbot i know
[15:47:46] <wikibugs>	 (03CR) 10Ema: [C: 03+1] trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[15:47:56] <wikibugs>	 (03PS3) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613)
[15:48:29] <moritzm>	 !log updating firmware-bnx2x (from stretch point release, this is a NOP, the source package firmware-nonfree was updated for various Wifi chipsets we don't use, doublechecked by comparing check sums for old and new bnx2x firmware)
[15:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "> Patch Set 2:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[15:49:04] <wikibugs>	 (03CR) 10RobH: [C: 03+2] mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH)
[15:50:23] <wikibugs>	 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Papaul) @Gehel DIMM swap complete
[15:52:01] <icinga-wm>	 RECOVERY - Host elastic2038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms
[15:52:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) Thanks, now that ^ has been merged I will take over Note: db1127 is still not present on the netboot.cfg because it is not accessible yet via idrac s...
[15:52:18] <wikibugs>	 (03PS3) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390
[15:53:05] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/analytics was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[15:53:05] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/k8s-staging was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[15:53:24] <cdanis>	 sigh, fixing
[15:54:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui)
[15:54:58] <wikibugs>	 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Cmjohnson) @moborvac I haven't had a chance to get to them until this week.  I should be able to get them...
[15:54:59] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/ops was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[15:54:59] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/global was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[15:56:52] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[15:56:55] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/services was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[15:56:55] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/k8s was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[15:57:53] <hashar>	 !log CI / Zuul is being slowed down and being investigated
[15:57:55] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational
[15:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:55] <hashar>	 :(
[15:58:51] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/k8s-staging was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[15:59:41] <wikibugs>	 (03PS1) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356
[16:00:31] <icinga-wm>	 PROBLEM - Prometheus bast5001.wikimedia.org/ops was restarted on bast5001 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[16:01:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (owner: 10CDanis)
[16:01:16] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/ops was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[16:01:17] <icinga-wm>	 PROBLEM - Prometheus prometheus2004.codfw.wmnet/analytics was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[16:01:44] <wikibugs>	 (03PS2) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108)
[16:01:46] <wikibugs>	 (03Merged) 10jenkins-bot: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[16:02:26] <wikibugs>	 (03PS3) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108)
[16:02:43] <icinga-wm>	 PROBLEM - Prometheus prometheus1003.eqiad.wmnet/services was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[16:02:43] <icinga-wm>	 PROBLEM - Prometheus prometheus2004.codfw.wmnet/global was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[16:04:41] <icinga-wm>	 PROBLEM - Prometheus prometheus2004.codfw.wmnet/k8s was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[16:05:18] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis)
[16:06:35] <icinga-wm>	 PROBLEM - Prometheus prometheus2003.codfw.wmnet/analytics was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[16:06:35] <icinga-wm>	 PROBLEM - Prometheus prometheus2004.codfw.wmnet/ops was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[16:08:11] <icinga-wm>	 PROBLEM - Prometheus bast4002.wikimedia.org/ops was restarted on bast4002 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops
[16:08:29] <icinga-wm>	 PROBLEM - Prometheus prometheus2003.codfw.wmnet/global was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global
[16:08:31] <icinga-wm>	 PROBLEM - Prometheus prometheus2004.codfw.wmnet/services was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[16:08:38] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Dzahn) 05Open→03Stalled
[16:08:45] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,11 instance=db2049:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[16:10:06] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,11 instance=db2049:9100 job=node site=codfw Marostegui being worked by papaul https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[16:10:25] <icinga-wm>	 PROBLEM - Prometheus prometheus2003.codfw.wmnet/k8s was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[16:11:53] <hashar>	 !log CI queue drained. Should be working fine again now
[16:11:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:21] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/analytics was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[16:12:21] <icinga-wm>	 PROBLEM - Prometheus prometheus2003.codfw.wmnet/ops was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[16:13:43] <wikibugs>	 (03PS1) 10EBernhardson: cloudelastic: Don't write to private wikis on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357
[16:13:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10RobH) This is only under warranty until later this month, and was brought up in the SRE weekly meeting.  This needs to be high priority!  Supposedly warra...
[16:14:13] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/global was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[16:14:13] <icinga-wm>	 PROBLEM - Prometheus prometheus2003.codfw.wmnet/services was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[16:16:05] <icinga-wm>	 PROBLEM - Prometheus prometheus1004.eqiad.wmnet/k8s was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[16:20:36] <wikibugs>	 (03PS1) 10CDanis: prometheus uptime: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/508359
[16:21:24] <wikibugs>	 (03CR) 10jenkins-bot: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[16:21:33] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] prometheus uptime: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/508359 (owner: 10CDanis)
[16:26:21] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:28:09] <wikibugs>	 (03PS1) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. [puppet] - 10https://gerrit.wikimedia.org/r/508361
[16:28:11] <wikibugs>	 (03PS1) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378)
[16:28:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal)
[16:28:56] <joal>	 elukey: -- if you have a minute --^
[16:28:58] <joal>	 Arf
[16:29:21] <icinga-wm>	 RECOVERY - Prometheus prometheus2004.codfw.wmnet/analytics was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.258e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[16:29:37] <icinga-wm>	 RECOVERY - Prometheus prometheus1003.eqiad.wmnet/services was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[16:29:39] <icinga-wm>	 RECOVERY - Prometheus prometheus1003.eqiad.wmnet/k8s was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[16:29:39] <icinga-wm>	 RECOVERY - Prometheus prometheus1004.eqiad.wmnet/services was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.309e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[16:29:43] <icinga-wm>	 RECOVERY - Prometheus prometheus2003.codfw.wmnet/services was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.263e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[16:29:45] <icinga-wm>	 RECOVERY - Prometheus bast4002.wikimedia.org/ops was restarted on bast4002 is OK: (C)600 lt (W)1800 lt 5.251e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops
[16:29:51] <icinga-wm>	 RECOVERY - Prometheus prometheus1004.eqiad.wmnet/k8s was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[16:29:57] <icinga-wm>	 RECOVERY - Prometheus prometheus1003.eqiad.wmnet/analytics was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[16:29:57] <icinga-wm>	 RECOVERY - Prometheus prometheus1004.eqiad.wmnet/k8s-staging was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[16:30:07] <icinga-wm>	 RECOVERY - Prometheus prometheus1003.eqiad.wmnet/ops was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.219e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[16:30:07] <icinga-wm>	 RECOVERY - Prometheus prometheus2004.codfw.wmnet/k8s was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.259e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[16:30:09] <icinga-wm>	 RECOVERY - Prometheus prometheus2004.codfw.wmnet/ops was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.259e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[16:30:09] <icinga-wm>	 RECOVERY - Prometheus prometheus1004.eqiad.wmnet/analytics was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[16:30:09] <icinga-wm>	 RECOVERY - Prometheus prometheus2003.codfw.wmnet/ops was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.263e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[16:30:35] <wikibugs>	 (03PS2) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378)
[16:30:41] <icinga-wm>	 RECOVERY - Prometheus prometheus1003.eqiad.wmnet/k8s-staging was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.318e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging
[16:30:57] <cdanis>	 first try \o/
[16:31:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10Cmjohnson) I created a task for this with HPE.     Case ID: 5338390467 Case title: Failed BBU Severity 3-Normal Product serial number: MXQ616071T Product...
[16:31:22] <wikibugs>	 (03CR) 10Elukey: Update analytics sqoop scheduling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal)
[16:32:42] <wikibugs>	 (03CR) 10Andrew Bogott: "That run shows everything as no-op except for a couple of logstash which seem to always produce false positives.  Rechecking those, they'r" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez)
[16:33:33] <icinga-wm>	 RECOVERY - Prometheus prometheus1004.eqiad.wmnet/ops was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.294e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[16:33:41] <wikibugs>	 (03PS2) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. [puppet] - 10https://gerrit.wikimedia.org/r/508361
[16:34:05] <wikibugs>	 (03PS10) 10Andrew Bogott: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez)
[16:34:39] <icinga-wm>	 RECOVERY - Prometheus prometheus2004.codfw.wmnet/services was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.261e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[16:34:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez)
[16:34:55] <jynus>	 !log restart db2102 mysql for upgrade testing
[16:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:18] <wikibugs>	 (03CR) 10CRusnov: "Build looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov)
[16:38:34] <andrewbogott>	 !log re-imaging cloudvirt1024
[16:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:03] <icinga-wm>	 RECOVERY - Prometheus prometheus2003.codfw.wmnet/analytics was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.269e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[16:39:10] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? - https://phabricator.wikimedia.org/T222108 (10CDanis) 05Open→03Resolved a:03CDanis We now have IRC alerting based on scraping each prometheus for its `process_start_time_seconds`...
[16:39:28] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov)
[16:39:42] <wikibugs>	 (03PS3) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. [puppet] - 10https://gerrit.wikimedia.org/r/508361
[16:40:37] <icinga-wm>	 RECOVERY - Prometheus bast5001.wikimedia.org/ops was restarted on bast5001 is OK: (C)600 lt (W)1800 lt 5.252e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[16:40:57] <icinga-wm>	 RECOVERY - Prometheus prometheus2003.codfw.wmnet/k8s was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.27e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[16:41:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey @Ottomata what do you guys want to do with this?
[16:42:06] <jynus>	 !log restart db1114 mysql for upgrade testing
[16:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:59] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10elukey) @Marostegui sorry I was under the impression that we'd have needed to wait for a feedback from Chris/Rob about how to proc...
[16:46:54] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] profile::spicerack: Add timeout parameter for ganeti module. [puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov)
[16:47:48] <wikibugs>	 (03PS2) 10Elukey: admin: allow analytics-admins to sudo as the analytics user [puppet] - 10https://gerrit.wikimedia.org/r/507812 (https://phabricator.wikimedia.org/T222368)
[16:47:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Approved by the SRE team meeting" [puppet] - 10https://gerrit.wikimedia.org/r/507812 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey)
[16:49:25] <wikibugs>	 (03PS3) 10Elukey: admin: allow analytics-admins to use systemctl for all units [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368)
[16:49:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Approved by the SRE team meeting." [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey)
[16:49:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey sorry, I realised that I didn't sent the first sentence: "The errors corrected themselves and Icinga is now al...
[16:50:03] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Cmjohnson) I now have h/w log entries. I will need the server to be taken offline so I can relocate the DIMM and check to see if t...
[16:50:45] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey can you coordinate with Chris? ^
[16:51:47] <wikibugs>	 (03CR) 10Joal: Update analytics sqoop scheduling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal)
[16:51:57] <wikibugs>	 (03PS3) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378)
[16:52:14] <joal>	 Gone for dinner - Back in while
[16:52:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal)
[16:52:29] <joal>	 milimetric: Shall we deploy new aqs datasource?
[16:52:34] <joal>	 oops wrong chan sorry
[16:53:55] <wikibugs>	 (03CR) 10Dmaza: [C: 03+1] Enable Partial Blocks on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508122 (https://phabricator.wikimedia.org/T222258) (owner: 10Ammarpad)
[16:54:06] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) ` [18:50:55]  <cmjohnson1> marostegui i am confused over db1007...is there an issue or not an issue?  There is a h/w l...
[16:54:17] <wikibugs>	 (03PS4) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378)
[16:56:02] <wikibugs>	 (03PS3) 10Dzahn: Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper)
[16:57:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper)
[16:57:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "approved in today's SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper)
[16:58:27] <wikibugs>	 (03CR) 10Dzahn: "@Aklapper: just making sure, does the move_project script really not need any parameters, like the project name to be moved?" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper)
[17:00:04] <jouncebot>	 gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1700).
[17:01:03] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) @Aklapper @mmodell The request has been app...
[17:01:36] <wikibugs>	 (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri)
[17:02:01] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10mmodell) @dzahn: it has a bunch of parameters :-/
[17:02:04] <wikibugs>	 (03CR) 10Jforrester: "Presumably this should only get deployed just before I6d0215082f?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri)
[17:02:12] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) a:03Aklapper Puppet ran on phab1001. If y...
[17:02:39] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10mmodell) see T221112#5121800
[17:03:33] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) >>! In T221112#5160984, @mmodell wrote: > @...
[17:07:04] <wikibugs>	 (03CR) 10Dzahn: "this is probably not enough because you won't be allowed to add parameters" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper)
[17:11:00] <jynus>	 !log restart dbprov* hosts, in sequence, for kernel upgrade
[17:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:31] <wikibugs>	 (03CR) 10Nuria: "Nice. Super concise! Let's please make sure to test this actually works as intended." [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal)
[17:16:09] <Krinkle>	 mutante: if you have a minute, would appreciate a review on this comment-only fix https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504058/
[17:16:45] <wikibugs>	 (03PS2) 10Krinkle: mediawiki: remove comment about 'enable_profiling' [puppet] - 10https://gerrit.wikimedia.org/r/504058
[17:17:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mediawiki: remove comment about 'enable_profiling' [puppet] - 10https://gerrit.wikimedia.org/r/504058 (owner: 10Krinkle)
[17:18:24] <mutante>	 Krinkle: no problem merged. in general just add me to Gerrit for those. i will see requests in my queue
[17:19:11] <elukey>	 !log restart netbox on netmon1002 as test
[17:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:23] <wikibugs>	 (03PS10) 10Alex Monk: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi)
[17:26:24] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) Opened https://github.com/unbit/uwsgi/issues/2010
[17:28:00] <wikibugs>	 10Operations, 10Traffic: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (10Vgutierrez)
[17:28:56] <wikibugs>	 10Operations, 10Traffic: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (10Vgutierrez) p:05Triage→03Normal
[17:29:27] <wikibugs>	 (03PS11) 10Alex Monk: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi)
[17:30:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "seems to make sense since upstream does disable it by default in 3.0" [puppet] - 10https://gerrit.wikimedia.org/r/508127 (owner: 10Paladox)
[17:31:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:31:47] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:33:31] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[17:36:58] <wikibugs>	 (03PS2) 10CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/508067
[17:37:06] <wikibugs>	 (03CR) 10CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508067 (owner: 10CRusnov)
[17:37:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/508067 (owner: 10CRusnov)
[17:39:35] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:39:41] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:41:27] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[17:51:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Gerrit, 10serviceops, 10Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10CDanis) cc @mark who I know is about to start looking at hardware requests for the coming FY
[17:51:39] <wikibugs>	 (03CR) 10CRusnov: "rebuild" [puppet] - 10https://gerrit.wikimedia.org/r/508067 (owner: 10CRusnov)
[17:52:13] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging]
[17:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:36] <wikibugs>	 10Operations, 10ops-eqiad, 10Gerrit, 10serviceops, 10Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Dzahn) I expect this to be a topic in our (DP - SRE) meeting this Wednesday.
[17:53:48] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging]
[17:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:00] <wikibugs>	 (03PS3) 10CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/508067
[17:59:59] <mutante>	 @seen andre_
[17:59:59] <wm-bot>	 mutante: Last time I saw andre_ they were changing the nickname to Guest17533, but Guest17533 is no longer in channel #wikimedia-dev at 10/18/2018 4:53:24 AM (200d13h6m35s ago)
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1800).
[18:00:05] <jouncebot>	 kostajh, raynor, and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:21] <kostajh>	 I'm here
[18:00:23] <Amir1>	 o/
[18:00:40] <RoanKattouw>	 I can SWAT
[18:01:07] <wikibugs>	 (03PS1) 10Ottomata: Add eventgate-main to profile::kubernetes::deployment_server::services [puppet] - 10https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346)
[18:01:45] <raynor>	 o/
[18:02:03] <wikibugs>	 (03CR) 10Ottomata: "I believe this also needs a patch to ops/private. I see how to add them, can I generate a token or does this come from somewhere specific?" [puppet] - 10https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[18:02:37] <wikibugs>	 (03PS2) 10Catrope: Set $wgOresFrontendBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[18:02:48] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Set $wgOresFrontendBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[18:04:01] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgOresFrontendBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[18:04:16] <wikibugs>	 (03CR) 10jenkins-bot: Set $wgOresFrontendBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup)
[18:06:13] <RoanKattouw>	 Amir1: Your patch is on mwdebug1002, please test
[18:06:45] <Amir1>	 on it
[18:06:51] <Amir1>	 it's a little bit complex to test
[18:08:42] <wikibugs>	 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) >>! In T221112#5160991, @Dzahn wrote: >>>!...
[18:11:29] <wikibugs>	 (03PS1) 10Dzahn: admins: simplify sudo privs for phab-admin group [puppet] - 10https://gerrit.wikimedia.org/r/508373
[18:11:32] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/508067 (owner: 10CRusnov)
[18:11:43] <Amir1>	 RoanKattouw: It fixes the issue
[18:11:46] <Amir1>	 please process
[18:11:49] <Amir1>	 *proceed
[18:14:26] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Set $wgOresFrontendBaseUrl (T219396) (duration: 00m 51s)
[18:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:33] <stashbot>	 T219396: Fix oresBaseUrl config variable in frontend - https://phabricator.wikimedia.org/T219396
[18:15:29] <wikibugs>	 (03PS3) 10CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066
[18:17:31] <wikibugs>	 10Operations, 10WMF-Legal, 10Wikimedia-General-or-Unknown, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10Dzahn) p:05Low→03Normal The [[ https://www.mediawiki.org/wiki/Wikimedia_Engineering_Architecture_Principles | Wikimedia Engineeri...
[18:18:41] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1024: update the (now truncated) interface name [puppet] - 10https://gerrit.wikimedia.org/r/508378 (https://phabricator.wikimedia.org/T216724)
[18:19:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: update the (now truncated) interface name [puppet] - 10https://gerrit.wikimedia.org/r/508378 (https://phabricator.wikimedia.org/T216724) (owner: 10Andrew Bogott)
[18:21:01] <wikibugs>	 (03PS2) 10CDanis: swift-object-replicator: nice & ionice it [puppet] - 10https://gerrit.wikimedia.org/r/506321
[18:21:37] <wikibugs>	 (03PS2) 10Dzahn: admins: simplify sudo privs for phab-admin group [puppet] - 10https://gerrit.wikimedia.org/r/508373
[18:24:58] <jynus>	 !log restart and upgrade db1116
[18:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:43] <wikibugs>	 (03PS3) 10CDanis: swift-object-replicator: nice & ionice it [puppet] - 10https://gerrit.wikimedia.org/r/506321
[18:28:44] <RoanKattouw>	 raynor: Your patch is now (finally) on mwdebug1002, please test
[18:28:56] <raynor>	 RoanKattouw: thx, testing
[18:29:34] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] swift-object-replicator: nice & ionice it [puppet] - 10https://gerrit.wikimedia.org/r/506321 (owner: 10CDanis)
[18:29:57] <raynor>	 RoanKattouw: my patch was easy/quit to test -> it works
[18:30:01] <raynor>	 please deploy to prod
[18:30:20] <wikibugs>	 (03PS4) 10Dzahn: hhvm: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724)
[18:30:44] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'ms-be*' 'disable-puppet "cdanis rollout I369f9b29"'
[18:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:58] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10Volans) @elukey thanks a lot for deep dive and the bug upstream!
[18:32:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.34.0-wmf.3/skins/MinervaNeue/includes/menu/Definitions.php: Harden Definitions::insertCommunityPortal() method (T222407) (duration: 00m 53s)
[18:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:25] <stashbot>	 T222407: [1.34.0-wmf.3] Regression: Definitions.php: Call to a member function exists() on a non-object (null) - https://phabricator.wikimedia.org/T222407
[18:35:27] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:35:49] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:35:51] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:36:03] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:36:27] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:36:29] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:37:35] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:37:37] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[18:37:43] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:37:48] <wikibugs>	 (03CR) 10Ayounsi: "Thanks, reply inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi)
[18:38:09] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[18:38:10] <wikibugs>	 (03PS10) 10Ayounsi: Icinga: Add OSPF check to routers [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[18:38:29] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:39:45] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:39:59] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:40:25] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:40:27] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:40:43] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:43:55] <RoanKattouw>	 kostajh: Your null pageviews link patch is now on mwdebug1002, please test
[18:44:00] <kostajh>	 RoanKattouw: looking
[18:44:19] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:45:23] <kostajh>	 RoanKattouw: yep, looks good
[18:45:31] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:46:03] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[18:46:51] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[18:47:10] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/GrowthExperiments/: Remove link to pageviews tool when no data available (T222405) (duration: 00m 52s)
[18:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:16] <andrewbogott>	 vgutierrez: are you still about?
[18:47:17] <stashbot>	 T222405: Homepage: null pageviews should not be a link - https://phabricator.wikimedia.org/T222405
[18:47:28] <vgutierrez>	 err....
[18:47:35] <vgutierrez>	 cooking dinner as we speak :)
[18:47:45] <andrewbogott>	 ok — I'll wait and bug you tomorrow then
[18:53:00] <wikibugs>	 (03CR) 10Dzahn: hhvm: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[18:53:45] <kostajh>	 RoanKattouw: for the config patch, I think you can just sync it when it's ready. I have stat1004 open with kafkacat and can look at some events to verify it's working properly
[18:53:53] <RoanKattouw>	 Oh yes that's right
[18:53:57] <RoanKattouw>	 I thought I was done, oops
[18:55:31] <wikibugs>	 (03PS5) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066)
[18:55:50] <wikibugs>	 (03PS2) 10Catrope: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan)
[18:56:33] <wikibugs>	 (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite)
[18:57:48] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan)
[18:58:51] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan)
[18:58:53] <wikibugs>	 (03PS4) 10CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066
[18:59:05] <wikibugs>	 (03CR) 10jenkins-bot: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan)
[19:01:31] <wikibugs>	 (03CR) 10CRusnov: ganeti-netbox sync: Sync host status also (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066 (owner: 10CRusnov)
[19:01:36] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Begin homepage experiment on cswiki and kowiki (T221266) (duration: 00m 51s)
[19:01:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:40] <stashbot>	 T221266: Homepage: Deploy to target wikis in production - https://phabricator.wikimedia.org/T221266
[19:02:03] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2] "Latest patchset with some changes due to testing. Seems to work, mirroring the status." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066 (owner: 10CRusnov)
[19:05:37] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) a:05aaron→03None
[19:13:06] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10CDanis)
[19:20:21] <wikibugs>	 (03PS3) 10Dzahn: admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076)
[19:20:57] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 84 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:22:11] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -m async -b4 'ms-be2*' 'run-puppet-agent --enable "cdanis rollout I369f9b29"' 'systemctl systemctl restart swift-object-replicator'
[19:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:16] <librenms-wmf>	 08Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors
[19:27:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn)
[19:31:25] <mutante>	 XioNoX: the atlas.ripe net map looks kind of bad. 84 fails seems more than common
[19:31:54] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -m async -b4 'ms-be1*' 'run-puppet-agent --enable "cdanis rollout I369f9b29"' 'systemctl restart swift-object-replicator'
[19:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:16] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device cr1-codfw.wikimedia.org recovered from Inbound interface errors
[19:35:50] <wikibugs>	 (03PS3) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724)
[19:37:03] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:37:47] <cdanis>	 20 is about the baseline IIRC
[19:38:29] <wikibugs>	 (03PS3) 10Gehel: maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313 (owner: 10Mathew.onipe)
[19:39:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:40:10] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313 (owner: 10Mathew.onipe)
[19:40:23] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:40:30] <wikibugs>	 (03PS4) 10Dzahn: phragile: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:40:47] <wikibugs>	 (03CR) 10Dzahn: "labs only - affects https://tools.wmflabs.org/openstack-browser/server/phragile-pro.phragile.eqiad.wmflabs  - getting access to that insta" [puppet] - 10https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:41:09] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:41:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phragile: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:41:19] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:43:01] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:43:37] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:44:21] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:45:08] <wikibugs>	 (03CR) 10Dzahn: "on phragile-pro.phragile: rm -rf /var/lib/phragile/composer and then let puppet recreate and reclone" [puppet] - 10https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:45:19] <paladox>	 thanks mutante !
[19:46:12] <wikibugs>	 (03PS1) 10Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - 10https://gerrit.wikimedia.org/r/508391
[19:47:52] <RoanKattouw>	 !log Running recomputeNotifCounts.php  --notif-types=login-success on all Echo wikis for T220762
[19:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:57] <stashbot>	 T220762: [betalabs] Stuck cross-wiki notification  - https://phabricator.wikimedia.org/T220762
[19:48:32] <gehel>	 !log rolling restart of cassandra on maps* fro config change
[19:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:07] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:49:17] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:49:39] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:50:21] <wikibugs>	 (03PS6) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844)
[19:51:02] <wikibugs>	 (03PS2) 10Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - 10https://gerrit.wikimedia.org/r/508391
[19:51:45] <wikibugs>	 (03PS3) 10Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - 10https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324)
[19:52:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] authdns: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:53:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "merging this means having to delete the git repo checkout dir and letting puppet re-clone it.. or editing the .git/config and replacing th" [puppet] - 10https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[19:54:26] <wikibugs>	 (03PS6) 10Dzahn: openldap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, and halfak: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2000).
[20:03:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] openldap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:06:18] <wikibugs>	 (03PS2) 10Hashar: zuul: stop pinning python-pbr [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559)
[20:06:23] <wikibugs>	 (03CR) 10Dzahn: "confirmed that the python scripts using this use tempfile which should be deleted right away anyways.. and tested that cross-validate-acco" [puppet] - 10https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:08:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "on merge the existing checkout dirs should be deleted so that puppet re-creates them" [puppet] - 10https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox)
[20:08:36] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "That should be good for deployment. I have patched Zuul to no more rely on python-pbr to get its version." [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar)
[20:10:33] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "bump, see my last question" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox)
[20:13:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: stop pinning python-pbr [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar)
[20:13:21] <wikibugs>	 (03CR) 10Paladox: Gerrit: Support switching ldap servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox)
[20:13:28] <wikibugs>	 (03CR) 10Krinkle: "As next step - can we switch this to async first, and then commit to monitoring before we serve multi-dc traffic?" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz)
[20:14:07] <wikibugs>	 (03CR) 10Dzahn: "> Once this puppet patch is merged, I can handle the upgrade of python-pbr and confirm that production works fine." [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar)
[20:16:06] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "ok! could we just drop the word "custom" from all of these. it's kind of implied with any parameter that it's custom if you change it" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox)
[20:17:03] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Danke :)" [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar)
[20:18:56] <hashar>	 mutante: I will do the upgrade tomorrow morning
[20:19:03] <hashar>	 too late for me to deal with potential aftermath
[20:19:28] <wikibugs>	 (03CR) 10Dzahn: "any further comments or links why we don't need / want it anymore? adding Tyler to reviewers" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 (owner: 10Paladox)
[20:19:36] <mutante>	 hashar: sure! ok
[20:20:46] <paladox>	 mutante see the other channel :)
[20:25:42] <wikibugs>	 (03CR) 10Dzahn: "per https://phabricator.wikimedia.org/T162070 which comes after https://phabricator.wikimedia.org/T165625 there is only 1 class left using" [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox)
[20:27:07] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "we really want to delete the entire module instead" [puppet] - 10https://gerrit.wikimedia.org/r/354131 (owner: 10Paladox)
[20:50:28] <wikibugs>	 (03CR) 10Thcipriani: "> any further comments or links why we don't need / want it anymore?" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 (owner: 10Paladox)
[20:51:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "gotcha, thanks" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 (owner: 10Paladox)
[20:54:45] <wikibugs>	 (03PS1) 10Hashar: contint: bump git-daemon max connections 32 -> 48 [puppet] - 10https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661)
[21:00:04] <jouncebot>	 bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2100).
[21:02:58] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10crusnov) Just to +1 the idea of shipping javamelody to prometheus. Let me know if I can help at all.
[21:08:36] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10Paladox) @crusnov we could use your help, yup. We need to create a prometheusBearerToken [plugin.javamelody.prometheusBearerToken] https://gerrit.googleso...
[21:15:59] <godog>	 !log swift codfw-prod: push up-to-date rings, mistakenly pushed earlier an older version 
[21:16:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:33] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) 05Open→03Stalled
[21:24:31] <cdanis>	 !log experimenting with different disk scheduler on ms-be2014 -- cdanis@ms-be2014.codfw.wmnet ~ % for D in /sys/block/sd*/queue/scheduler ; echo cfq | sudo tee $D
[21:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:41] <wikibugs>	 (03CR) 10Dzahn: "per "I'm not sure this is still valid when the ongoing work is completed to allow wikitech user registration to be opened up again." ... a" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn)
[21:30:59] <wikibugs>	 (03PS1) 10CDanis: swift: mid-line comments are not a thing, apparently. [puppet] - 10https://gerrit.wikimedia.org/r/508419
[21:31:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: mid-line comments are not a thing, apparently. [puppet] - 10https://gerrit.wikimedia.org/r/508419 (owner: 10CDanis)
[21:31:46] <wikibugs>	 (03PS2) 10CDanis: swift: mid-line comments are not a thing, apparently. [puppet] - 10https://gerrit.wikimedia.org/r/508419
[21:32:25] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] swift: mid-line comments are not a thing, apparently. [puppet] - 10https://gerrit.wikimedia.org/r/508419 (owner: 10CDanis)
[21:44:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066 (owner: 10CRusnov)
[21:59:02] <mutante>	 !log LDAP - remove 'sukhe' from 'nda' and add to 'wmf' instead (T221990)
[21:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:08] <stashbot>	 T221990: LDAP access to the nda group for sukhe - https://phabricator.wikimedia.org/T221990
[22:06:37] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Dzahn) Hi @RStallman-legalteam Do we still have an NDA on file for Adam Wight or does it need a new one now that he is WMDE employee?
[22:10:50] <wikibugs>	 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10chasemp)
[22:12:07] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10RStallman-legalteam) Thanks, I am running this by our contracts attorney. I think we'll just do a quick amendment to the existing NDA to reflect t...
[22:17:28] <wikibugs>	 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) > After checking the core the engineering team has an update on what happened > “The thread that is holding the lock seem to have corrupted stack and is holding the lock for a very long time. Other t...
[22:18:55] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10Dzahn) p:05Triage→03High
[22:19:14] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Dzahn) p:05Triage→03Normal
[22:22:55] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10Dzahn) p:05Triage→03Normal
[22:32:34] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10Dzahn) other tickets where ms-be disks died with "blk_update _request: I/O error" or similar.  T184053 , T183896, T218544, T136395, T163690, T166021  Afaict, w...
[22:37:47] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076 (10Dzahn) The ability to run commands as the 'apache' user has been removed from prod admins module sudo privileges today.
[22:38:19] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] ganeti-netbox sync: Sync host status also [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/508066 (owner: 10CRusnov)
[22:42:51] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@0061190]: Deploy new version of ganeti-netbox sync.
[22:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:05] <RoanKattouw>	 !log Running refreshMessageBlobs.php on all wikis for T222539
[22:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:09] <stashbot>	 T222539: Special:RecentChanges on no.wp ignores localized MediaWiki:Rcfilters-show-new-changes message - https://phabricator.wikimedia.org/T222539
[22:46:45] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@0061190]: Deploy new version of ganeti-netbox sync. (duration: 03m 53s)
[22:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:37] <icinga-wm>	 PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:00:05] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2300).
[23:00:05] <jouncebot>	 MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:24] <MaxSem>	 I'm here
[23:00:42] <MaxSem>	 Since there's only one patch, I'll deploy it in 30 mins
[23:01:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Fix typo in comments [software/service-checker] - 10https://gerrit.wikimedia.org/r/495237 (owner: 10Alexandros Kosiaris)
[23:03:06] <wikibugs>	 (03PS2) 10MaxSem: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780)
[23:03:29] <wikibugs>	 (03PS1) 10CRusnov: puppetdb report: Exclude OFFLINE VMs from report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/508456
[23:16:27] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul)
[23:20:40] <wikibugs>	 (03CR) 10MaxSem: [C: 03+2] LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: 10MaxSem)
[23:21:42] <wikibugs>	 (03Merged) 10jenkins-bot: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: 10MaxSem)
[23:21:56] <wikibugs>	 (03CR) 10jenkins-bot: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: 10MaxSem)
[23:25:57] <logmsgbot>	 !log maxsem@deploy1001 Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/503546/ (duration: 00m 50s)
[23:26:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:22] <wikibugs>	 (03PS7) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844)
[23:33:44] <wikibugs>	 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Dzahn)
[23:36:50] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address entries for  db2[103-120] [puppet] - 10https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532)
[23:37:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entries for  db2[103-120] [puppet] - 10https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: 10Papaul)
[23:41:06] <wikibugs>	 (03PS19) 10CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto)
[23:52:08] <wikibugs>	 (03PS2) 10Papaul: DHCP: Add MAC address entries for  db2[103-120] [puppet] - 10https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532)
[23:55:26] <wikibugs>	 (03PS3) 10Dzahn: DHCP: Add MAC address entries for db2103 through db2120 [puppet] - 10https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: 10Papaul)
[23:55:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address entries for db2103 through db2120 [puppet] - 10https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: 10Papaul)
[23:59:11] <James_F>	 MaxSem: All clear?
[23:59:22] <MaxSem>	 Yup!