[00:08:02] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.59, 36.08, 31.00
[00:53:38] (CR) Aaron Schulz: "> I like the patch as it is; we still obviously lack proper" [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[01:45:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.22, 34.17, 32.16
[02:23:42] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20323992
[02:24:42] RECOVERY - Postgres Replication Lag on maps1004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 620792
[03:02:46] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 11m 09s)
[03:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:51] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 49.46, 50.37, 48.03
[03:05:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 50.31, 49.94, 48.13
[03:13:21] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[03:15:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 48.26, 47.91, 48.06
[03:22:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.71, 48.78, 48.17
[04:03:01] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 57.27, 50.68, 48.26
[04:06:01] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.18, 49.70, 48.25
[04:17:02] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 47.97, 47.50, 48.02
[04:49:22] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 45.01, 42.83, 40.30
[04:56:01] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker
[04:59:02] PROBLEM - nutcracker process on deploy1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (nutcracker), command name nutcracker
[05:12:01] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 35.86, 34.49, 32.19
[05:12:51] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 38.13, 35.24, 31.38
[05:15:52] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 40.02, 36.25, 32.37
[05:25:30] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089)
[05:26:48] <_joe_> ouch, again
[05:28:57] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:30:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:30:29] (CR) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:31:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 (duration: 00m 59s)
[05:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:54] !log Deploy schema change on db1087 with replication (this will generate lag in labs) - T187089 T185128 T153182
[05:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:01] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:34:01] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:34:01] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:35:49] <_joe_> !log depooling mw1341 to further debug the API issue
[05:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:17] !log restart hhvm on mw1226,27,32,88 - high load
[05:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:06] <_joe_> so this is very very strange
[05:38:01] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 40.40, 34.46, 32.25
[05:40:01] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 56.41, 51.18, 48.39
[05:42:04] !log Reload haproxy on dbproxy1010
[05:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:52] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[05:46:02] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 5.81, 14.27, 23.69
[05:47:01] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 5.78, 11.90, 23.58
[05:48:55] !log restart hhvm on mw1225, 1315, 1316, 1340, 1341, 1342, 1347 - high load
[05:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:01] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 6.70, 10.40, 23.98
[05:49:31] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 9.45, 15.02, 29.11
[05:55:21] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 13.86, 28.12, 34.74
[05:55:52] !log repool mw1341 after investigation
[05:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:56] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132722 (Marostegui)
[05:57:15] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (Marostegui) p:Triage>High
[05:58:02] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 11.20, 20.17, 35.73
[05:58:12] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (Marostegui) I have written a summary of the current state of debugging on the original task description, so it is easier to read instead of going thru all the co...
[05:59:57] !log restart hhvm on mw[1221,1233,1280,1347] - high load
[06:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:21] RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 9.92, 14.89, 35.56
[06:20:08] !log Drop table flow_subscription from x1 - T149936
[06:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:14] T149936: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936
[06:38:03] <_joe_> !log repooling mw1230
[06:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:39] Operations, HHVM, Patch-For-Review, User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#4132740 (Joe)
[06:39:42] Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4132736 (Joe) Open>Resolved
[06:40:22] Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4037934 (Joe) Enwiki finished it run at 14:40 UTC on saturday april 14th.
[06:43:01] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:43:51] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time
[06:45:21] !log depooled mw1230
[06:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:21] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:47:31] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:01] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:16:39] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132767 (Marostegui)
[07:27:21] !log installing perl security updates on Debian systems
[07:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:32] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:42] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:51] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:32:31] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time
[07:33:00] <_joe_> mw1224 is me
[07:35:07] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132775 (Marostegui)
[07:35:41] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:35:56] (PS1) Vgutierrez: install_server: Reimage achernar as stretch [puppet] - https://gerrit.wikimedia.org/r/426851 (https://phabricator.wikimedia.org/T187090)
[07:36:31] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time
[07:36:41] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80414 bytes in 0.086 second response time
[07:36:42] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time
[07:37:25] (CR) Vgutierrez: [C: 2] install_server: Reimage achernar as stretch [puppet] - https://gerrit.wikimedia.org/r/426851 (https://phabricator.wikimedia.org/T187090) (owner: Vgutierrez)
[07:39:23] !log Depool and reimage achernar.wikimedia.org - T187090
[07:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:29] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[07:40:43] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=achernar.wikimedia.org,service=pdns_recursor
[07:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:44:41] (PS1) Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852
[07:44:44] this may be one host misbehaving --^ ?
[07:44:51] Might be db1114 yeah
[07:45:05] checking https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor
[07:45:07] (CR) Marostegui: [V: 2 C: 2] db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852 (owner: Marostegui)
[07:45:35] marostegui: Could not wait for replica DBs to catch up to db1052 - related?
[07:45:48] most likely yeah
[07:45:52] It is depooling now
[07:45:58] super
[07:45:59] thanks :)
[07:46:24] I was running some tests on db1114
[07:46:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 59s)
[07:46:26] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132784 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` achernar.wikimedia.org ``` The log can be found in `/var/log/wmf-aut...
[07:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:44] trying to debug this: https://phabricator.wikimedia.org/T191996
[07:46:52] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132785 (Marostegui)
[07:47:11] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 80416 bytes in 0.369 second response time
[07:47:21] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time
[07:47:51] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time
[07:48:23] elukey: confirmed, it was that. fatals' graph is back to normal
[07:49:02] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: PING CRITICAL - Packet loss = 100%
[07:49:23] (CR) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852 (owner: Marostegui)
[07:49:34] marostegui: <3
[07:49:34] ^ that's achernar being reimaged
[07:49:56] !log Stop MySQL and reboot db1114 - T191996
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[07:51:12] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:2:208:80:153:42)
[07:51:42] PROBLEM - Recursive DNS on 208.80.153.42 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[07:53:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:56:30] (PS1) Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854
[07:57:51] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[07:59:06] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[08:00:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 58s)
[08:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:11] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[08:04:00] !log restart hhvm on mw[1228,1234,1281-1287,1289,1290,1312-1314,1317,1339,1343,1345,1346,1348] - more than 50% cpu usage, prevention scheme for current high load
[08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:38] (PS1) Marostegui: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856
[08:14:32] (CR) Marostegui: [C: 2] db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:14:43] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:53] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:04] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:46] (Merged) jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:17:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 00m 58s)
[08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:07] (PS3) Muehlenhoff: Enable base::service_auto_restart for zuul-merger [puppet] - https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991)
[08:18:33] (PS1) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:19:06] (CR) jerkins-bot: [V: -1] Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858 (owner: Ema)
[08:19:24] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for zuul-merger [puppet] - https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:19:33] (CR) jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:20:43] (PS2) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:21:08] (PS3) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:23:04] Hi ops-team - Just a ping about the analytics-team deploying jobs on the hadoop cluster
[08:23:39] !log joal@tin Started deploy [analytics/refinery@27416a9]: Regular weekly deploy - Mostly bugfixes from previous week huge deploy
[08:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:00] * elukey blames the analytics team
[08:24:21] :D
[08:24:21] (CR) Alexandros Kosiaris: "> yea, so i wasn't sure if i want to remove it for all or keep it for all. i wanted to be consistent though and have removed it from 2 oth" [puppet] - https://gerrit.wikimedia.org/r/425945 (owner: Dzahn)
[08:24:34] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132831 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['achernar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['achernar.wikimedia.org'] ```
[08:25:36] <_joe_> !log depooling mw1223 for investigation too
[08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:33] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:43] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:06] !log joal@tin Finished deploy [analytics/refinery@27416a9]: Regular weekly deploy - Mostly bugfixes from previous week huge deploy (duration: 05m 27s)
[08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:13] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:24] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))): 152848.15066964287 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:29:24] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))): 198049.99711815565 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:30:44] (PS4) Gehel: maps: run populate_admin() regularly [puppet] - https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605)
[08:30:54] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.635 second response time
[08:31:13] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time
[08:31:43] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80368 bytes in 0.081 second response time
[08:32:24] RECOVERY - etcd request latencies on neon is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:32:33] RECOVERY - Request latencies on neon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:33:35] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132854 (Tim_WMDE)
[08:35:24] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:53] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:04] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:08] !log pooled mw1261-mw1264 (app server canaries running stretch)
[08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:42] (PS1) Marostegui: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862
[08:42:39] Operations, Pybal, Traffic, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4132869 (Vgutierrez) Open>stalled
[08:44:49] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:45:46] (CR) Gehel: [C: 2] maps: run populate_admin() regularly [puppet] - https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) (owner: Gehel)
[08:46:02] (Merged) jenkins-bot: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:47:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1114 original main traffic weight (duration: 00m 58s)
[08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:05] (CR) jenkins-bot: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:49:46] !log first manual run of populate_admin() for maps[12]001 - T190605
[08:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:52] T190605: Some of the borders on maps.wikimedia.org are outdated - https://phabricator.wikimedia.org/T190605
[08:50:50] (PS1) Marostegui: db1063.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/426864
[08:51:55] (PS1) Gehel: maps: fixed typo in populate_admin() cron [puppet] - https://gerrit.wikimedia.org/r/426866 (https://phabricator.wikimedia.org/T190605)
[08:52:23] (CR) Gehel: [C: 2] maps: fixed typo in populate_admin() cron [puppet] - https://gerrit.wikimedia.org/r/426866 (https://phabricator.wikimedia.org/T190605) (owner: Gehel)
[08:52:52] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10931/" [puppet] - https://gerrit.wikimedia.org/r/426864 (owner: Marostegui)
[08:53:00] (PS2) Marostegui: db1063.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/426864
[09:04:51] (PS1) Jcrespo: mariadb-prometheus-exporter: Add missing section s8@dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/426869
[09:05:14] !log pooled mw1276-mw1278 (API app server canaries running stretch)
[09:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:16] (CR) Mobrovac: [C: 1] "GTG now" [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:06:43] (CR) Jcrespo: [C: 2] mariadb-prometheus-exporter: Add missing section s8@dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/426869 (owner: Jcrespo)
[09:06:47] (PS2) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:07:25] (PS3) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:07:56] !log starting rolling restart of wdqs100[35] and wdqs200[123] for kernel upgrade
[09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:50] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132917 (Aklapper)
[09:10:07] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:10:59] ^ oops, wdqs1003 is me not downtiming early enough
[09:11:27] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:12:18] (PS2) Arturo Borrero Gonzalez: labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919)
[09:13:12] (CR) Arturo Borrero Gonzalez: [C: 2] labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919) (owner: Arturo Borrero Gonzalez)
[09:16:19] (PS1) Marostegui: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872
[09:20:06] (PS4) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:20:53] (CR) Muehlenhoff: [C: 2] Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:23:35] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=achernar.wikimedia.org,service=pdns_recursor
[09:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:43] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132972 (Vgutierrez)
[09:27:50] (PS1) Vgutierrez: install_server: Reimage acamar as stretch [puppet] - https://gerrit.wikimedia.org/r/426876 (https://phabricator.wikimedia.org/T187090)
[09:29:41] (CR) Vgutierrez: [C: 2] install_server: Reimage acamar as stretch [puppet] - https://gerrit.wikimedia.org/r/426876 (https://phabricator.wikimedia.org/T187090) (owner: Vgutierrez)
[09:29:52] (CR) Giuseppe Lavagetto: Add a simple NOTES.txt template to scaffolding (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/426074 (owner: Alexandros Kosiaris)
[09:32:14] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 4.180 second response time
[09:32:53] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80394 bytes in 0.085 second response time
[09:32:53] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time
[09:40:45] !log restarting dbstore2001:s8 to increase the number of purge threads
[09:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:32] Operations, Cassandra, RESTBase-Cassandra, Services (next), and 2 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659#4132987 (fgiunchedi)
[09:43:20] !log rolling restart of wdqs100[35] and wdqs200[123] for kernel upgrade completed
[09:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:04] PROBLEM - DPKG on dbstore2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:46:04] RECOVERY - DPKG on dbstore2001 is OK: All packages OK
[09:49:54] !log Depool and reimage acamar as stretch - T187090
[09:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:00] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[09:50:08] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org,service=pdns_recursor
[09:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] (CR) Marostegui: [C: 2] db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:52:35] (Merged) jenkins-bot: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:52:50] (CR) jenkins-bot: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:54:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 00m 58s)
[09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:21] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133017 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` acamar.wikimedia.org ``` The log can be found in `/var/log/wmf-auto-...
[10:02:40] (CR) Filippo Giunchedi: [C: 1] "LGTM, see inline nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: Gilles)
[10:03:03] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:12] Operations, Performance-Team, Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4133027 (fgiunchedi) >>! In T169249#4130711, @Gilles wrote: > I don't think it's flamegraph.pl's fault, the issue is with the last line of the log...
[10:03:16] ^^ acamar being reimaged
[10:03:45] (CR) Filippo Giunchedi: [C: 1] Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:05:34] (CR) Filippo Giunchedi: [C: 1] Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:05:44] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[10:07:05] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546)
[10:07:53] ACKNOWLEDGEMENT - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100% Ema acamar being reimaged
[10:08:07] thx ema <3
[10:08:24] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call Ema acamar being reimaged
[10:09:16] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:10:29] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:10:52] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:12:25] vgutierrez: pleasure! Shouldn't those be auto-ack'ed by the reimage script?
[10:14:52] ema: the reimage script downtimes the host and all the services, but those two are not defined under the host running pdns-recursor
[10:16:38] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:426886|Bumping portals to master (T128546)]] (duration: 00m 59s)
[10:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:44] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:17:37] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:426886|Bumping portals to master (T128546)]] (duration: 00m 58s)
[10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:35] !log upload prometheus-memcached-exporter to stretch-wikimedia - T189056
[10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:41] T189056: import prometheus-memcached-exporter into wikimedia-stretch - https://phabricator.wikimedia.org/T189056
[10:21:43] (03PS1) 10ArielGlenn: show the stacktrace for errors from dump job run in most cases [dumps] - 10https://gerrit.wikimedia.org/r/426888 (https://phabricator.wikimedia.org/T191177)
[10:22:15] (03CR) 10ArielGlenn: [C: 032] show the stacktrace for errors from dump job run in most cases [dumps] - 10https://gerrit.wikimedia.org/r/426888 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn)
[10:23:34] !log ariel@tin Started deploy [dumps/dumps@4706d30]: show full stacktrace for dump job errors
[10:23:38] !log ariel@tin Finished deploy [dumps/dumps@4706d30]: show full stacktrace for dump job errors (duration: 00m 04s)
[10:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:39] (03PS1) 10Muehlenhoff: Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890
[10:29:57] 10Operations, 10Dumps-Generation, 10Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4133107 (10ArielGlenn) I have gone back through the 'no such file' errors for the past 5 months. The vast majority are stubs; a few...
[10:30:35] (03CR) 10Alexandros Kosiaris: Add a simple NOTES.txt template to scaffolding (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074 (owner: 10Alexandros Kosiaris)
[10:31:36] (03PS2) 10Muehlenhoff: Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890
[10:32:03] (03PS2) 10Alexandros Kosiaris: Add a simple NOTES.txt template to scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074
[10:32:04] (03PS2) 10Alexandros Kosiaris: Add a NOTES.txt template for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/426075
[10:32:07] (03PS2) 10Alexandros Kosiaris: Update the helm charts repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/426076
[10:32:17] (03CR) 10Muehlenhoff: [C: 032] Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890 (owner: 10Muehlenhoff)
[10:36:14] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133111 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['acamar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['acamar.wikimedia.org'] ```
[10:43:21] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162)
[10:43:52] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez)
[10:48:23] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162)
[10:50:59] !log reimaging mw1299 (job runner) to stretch
[10:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:25] (03CR) 10Arturo Borrero Gonzalez: "The change is NOOP for hosts labstore1004.eqiad.wmnet, labstore1005.eqiad.wmnet, labpuppetmaster1001.wikimedia.org, labpuppetmaster1002.wi" [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez)
[11:06:59] (03PS1) 10Marostegui: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893
[11:08:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, see improvement inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:08:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:10:05] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:10:12] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: PING CRITICAL - Packet loss = 100%
[11:10:22] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:11:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1114 original weight (duration: 00m 59s)
[11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:11] PROBLEM - Disk space on acamar is CRITICAL: Return code of 255 is out of bounds
[11:12:22] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:17:21] PROBLEM - MD RAID on acamar is CRITICAL: Return code of 255 is out of bounds
[11:19:14] (03PS4) 10Gilles: Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:19:31] (03CR) 10Gilles: Remove obsolete imagescaler logic from swift proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:20:42] PROBLEM - Bird Internet Routing Daemon on acamar is CRITICAL: Return code of 255 is out of bounds
[11:22:31] PROBLEM - CPU frequency on acamar is CRITICAL: Return code of 255 is out of bounds
[11:23:05] (03PS2) 10Gilles: Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249)
[11:23:06] (03CR) 10Gilles: Make xenon-log line-buffered (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[11:24:11] PROBLEM - configured eth on acamar is CRITICAL: Return code of 255 is out of bounds
[11:24:11] PROBLEM - Check size of conntrack table on acamar is CRITICAL: Return code of 255 is out of bounds
[11:25:52] PROBLEM - Check systemd state on acamar is CRITICAL: Return code of 255 is out of bounds
[11:25:52] PROBLEM - dhclient process on acamar is CRITICAL: Return code of 255 is out of bounds
[11:27:41] PROBLEM - Check whether ferm is active by checking the default input chain on acamar is CRITICAL: Return code of 255 is out of bounds
[11:27:41] PROBLEM - puppet last run on acamar is CRITICAL: Return code of 255 is out of bounds
[11:31:55] PROBLEM - IPMI Sensor Status on acamar is CRITICAL: Return code of 255 is out of bounds
[11:33:35] PROBLEM - Long running screen/tmux on acamar is CRITICAL: Return code of 255 is out of bounds
[11:35:45] PROBLEM - Recursive DNS on 2620:0:860:1:208:80:153:12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:35:55] PROBLEM - Nginx local proxy to apache on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:37:05] PROBLEM - MegaRAID on acamar is CRITICAL: Return code of 255 is out of bounds
[11:41:25] PROBLEM - DPKG on acamar is CRITICAL: Return code of 255 is out of bounds
[11:42:35] PROBLEM - mediawiki-installation DSH group on mw1299 is CRITICAL: Host mw1299 is not in mediawiki-installation dsh group
[11:43:25] RECOVERY - Check size of conntrack table on acamar is OK: OK: nf_conntrack is 0 % full
[11:43:25] RECOVERY - configured eth on acamar is OK: OK - interfaces up
[11:43:25] RECOVERY - DPKG on acamar is OK: All packages OK
[11:43:26] RECOVERY - Disk space on acamar is OK: DISK OK
[11:43:28] ^ reimage race, silencing
[11:43:35] RECOVERY - CPU frequency on acamar is OK: OK: CPU frequency is = 600 MHz (1199 MHz)
[11:43:36] RECOVERY - Recursive DNS on 2620:0:860:1:208:80:153:12 is OK: DNS OK: 0.086 seconds response time. www.wikipedia.org returns 208.80.154.224
[11:43:45] RECOVERY - Check whether ferm is active by checking the default input chain on acamar is OK: OK ferm input default policy is set
[11:43:45] RECOVERY - Recursive DNS on 208.80.153.12 is OK: DNS OK: 0.044 seconds response time. www.wikipedia.org returns 208.80.154.224
[11:43:55] RECOVERY - Bird Internet Routing Daemon on acamar is OK: PROCS OK: 1 process with command name bird
[11:43:56] RECOVERY - dhclient process on acamar is OK: PROCS OK: 0 processes with command name dhclient
[11:43:56] RECOVERY - Check systemd state on acamar is OK: OK - running: The system is fully operational
[11:44:26] RECOVERY - MD RAID on acamar is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[11:47:05] RECOVERY - MegaRAID on acamar is OK: OK: no disks configured for RAID
[11:48:05] RECOVERY - Long running screen/tmux on acamar is OK: OK: No SCREEN or tmux processes detected.
[11:48:05] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:48:06] RECOVERY - IPMI Sensor Status on acamar is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[11:50:33] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org,service=pdns_recursor
[11:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:55] RECOVERY - Nginx local proxy to apache on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.574 second response time
[11:55:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:55:43] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133269 (10Vgutierrez)
[11:55:46] (03PS4) 10Gehel: maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243)
[11:59:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:00:24] (03PS1) 10Vgutierrez: install_server: Reimage hydrogen as stretch [puppet] - 10https://gerrit.wikimedia.org/r/426894 (https://phabricator.wikimedia.org/T187090)
[12:00:26] (03PS1) 10Vgutierrez: Remove hydrogen from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090)
[12:01:29] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:02:11] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage hydrogen as stretch [puppet] - 10https://gerrit.wikimedia.org/r/426894 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:02:20] (03CR) 10Vgutierrez: [C: 032] Remove hydrogen from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:11:39] !log Depool and reimage hydrogen as stretch - T187090
[12:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:45] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[12:12:14] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:36] PROBLEM - Host 2620:0:861:1:208:80:154:50 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:09] ^ hydrogen being reimaged
[12:20:05] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[12:21:48] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:27:01] (03PS1) 10Ema: Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014)
[12:32:28] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:43:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:44:06] lovely
[12:46:17] so it seems that the spike was temporary
[12:46:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:47:34] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133388 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['hydrogen.wikimedia.org'] ``` Of which those **FAILED**: ``` ['hydrogen.wikimedia.org'] ```
[12:51:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:52:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:53:07] elukey: yeah, text_esams only it seems
[12:55:44] (03PS1) 10Giuseppe Lavagetto: deployment-prep: switch jobrunner02/03 [puppet] - 10https://gerrit.wikimedia.org/r/426904 (https://phabricator.wikimedia.org/T192071)
[12:56:49] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment-prep: switch jobrunner02/03 [puppet] - 10https://gerrit.wikimedia.org/r/426904 (https://phabricator.wikimedia.org/T192071) (owner: 10Giuseppe Lavagetto)
[12:57:37] (03CR) 10Filippo Giunchedi: [C: 031] maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel)
[12:58:09] (03CR) 10Alexandros Kosiaris: [C: 032] Remove all namespace directives [deployment-charts] - 10https://gerrit.wikimedia.org/r/426072 (owner: 10Alexandros Kosiaris)
[12:58:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove all namespace directives [deployment-charts] - 10https://gerrit.wikimedia.org/r/426072 (owner: 10Alexandros Kosiaris)
[12:58:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Dump all namespace definitions from manifests [deployment-charts] - 10https://gerrit.wikimedia.org/r/426073 (owner: 10Alexandros Kosiaris)
[12:58:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a simple NOTES.txt template to scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074 (owner: 10Alexandros Kosiaris)
[12:58:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a NOTES.txt template for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/426075 (owner: 10Alexandros Kosiaris)
[12:58:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update the helm charts repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/426076 (owner: 10Alexandros Kosiaris)
[13:00:21] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991)
[13:01:44] PROBLEM - MD RAID on hydrogen is CRITICAL: Return code of 255 is out of bounds
[13:03:33] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server
[13:07:03] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:08:12] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:08:18] mobrovac: could you please have a look at https://phabricator.wikimedia.org/T192198 ? It's pretty bad for us
[13:08:44] PROBLEM - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:10:21] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991)
[13:10:56] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:11:04] (03PS1) 10Jcrespo: mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906
[13:11:43] RECOVERY - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is OK: DNS OK: 0.094 seconds response time. www.wikipedia.org returns 208.80.154.224
[13:11:54] jouncebot: next
[13:11:54] In 22 hour(s) and 48 minute(s): Page Previews roll-out to enwiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1200)
[13:12:03] RECOVERY - Recursive DNS on 208.80.154.50 is OK: DNS OK: 0.011 seconds response time. www.wikipedia.org returns 208.80.154.224
[13:12:34] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.000319 secs
[13:13:04] (03PS1) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:13:17] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996)
[13:13:21] (03PS2) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:13:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910
[13:13:40] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910
[13:13:54] PROBLEM - Host 2620:0:861:1:7a2b:cbff:fe09:c21 is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:54] RECOVERY - MD RAID on hydrogen is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:14:58] (03PS1) 10Ppchelko: Revert switching the ChangeNotification job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426911 (https://phabricator.wikimedia.org/T192198)
[13:15:06] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:15:08] (03PS2) 10Jcrespo: mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906
[13:15:16] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Kibana fails to load when using short URLs to share dashboard - https://phabricator.wikimedia.org/T192279#4133460 (10Gehel)
[13:16:12] ^marostegui
[13:16:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:16:39] jynus: checking
[13:16:50] guill. change, not mine
[13:16:54] *task
[13:16:57] ah
[13:17:19] cool :)
[13:17:32] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906 (owner: 10Jcrespo)
[13:18:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 59s)
[13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:01] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991)
[13:19:05] (03PS1) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[13:19:28] (03PS5) 10Filippo Giunchedi: Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[13:19:30] (03PS3) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:19:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:19:42] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133478 (10Marostegui) After the last reboot the errors have moved from being at times like: XX:10:11 XX:20:11 XX:30:11 To: XX:04:11 XX:24:11 XX:34:11
[13:20:11] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) p:05Triage>03Normal
[13:20:42] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) SMART info about sda: ```root@hydrogen:~# smartctl -a /dev/sda smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build) Copyright (C) 2002-16, Bruce A...
[13:21:33] (03CR) 10Filippo Giunchedi: [C: 032] Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[13:21:35] (03PS1) 10Ema: Add fake trusted_proxy.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014)
[13:22:48] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133504 (10Marostegui)
[13:23:37] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, and 2 others: Upgrade deployment-prep appserver fleet to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T192071#4133505 (10Joe) All the main servers have been substituted with stretch VMs; the only one remaining turne...
[13:24:19] (03PS2) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996)
[13:25:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:26:31] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908 (owner: 10Jcrespo)
[13:27:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:27:56] (03PS4) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:28:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 54s)
[13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:54] !log roll-restart swift-proxy in codfw and eqiad - T188062
[13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:00] T188062: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062
[13:31:13] !log Stop MySQL on db1114 to reboot with another kernel - T191996
[13:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:19] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[13:31:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:32:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1017 (duration: 00m 58s)
[13:32:04] (03PS2) 10Ema: Add fake trusted_proxies.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014)
[13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:27] (03CR) 10Ema: [V: 032 C: 032] Add fake trusted_proxies.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema)
[13:41:08] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133555 (10Marostegui) As another test to discard issues - I have rebooted db1114 with an older kernel. So it is now running ``` root@db1114:~# uname -a Linux db1114 4.9.0...
[13:41:39] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996)
[13:42:11] (03CR) 10Ema: [C: 032] Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema)
[13:42:17] (03PS2) 10Ema: Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014)
[13:44:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:46:02] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:46:40] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133571 (10Niedzielski) I think this task is still being worked on but in case it helps, here's another report from the Obama page this morning: ``` Request from 73.252.38.2...
[13:47:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 58s)
[13:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:19] (03PS1) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[13:49:47] (03CR) 10jerkins-bot: [V: 04-1] Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[13:51:07] (03PS1) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014)
[13:51:11] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133579 (10Joe) @Niedzielski interstingly, When requiring the `/summary/precambrian` page, I see a successful request to the API cluster, so the error is not a 503 on the par...
[13:51:44] (03PS2) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[13:53:46] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4133584 (10Joe)
[13:53:52] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, and 2 others: Upgrade deployment-prep appserver fleet to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T192071#4133583 (10Joe) 05Open>03Resolved
[13:55:33] (03PS1) 10Jcrespo: mariadb-autoinstall: Allow reimage of es2016 and es2017 [puppet] - 10https://gerrit.wikimedia.org/r/426922
[13:56:08] (03CR) 10Jcrespo: [C: 032] mariadb-autoinstall: Allow reimage of es2016 and es2017 [puppet] - 10https://gerrit.wikimedia.org/r/426922 (owner: 10Jcrespo)
[13:59:15] (03PS1) 10Marostegui: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923
[14:01:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[14:01:52] (03PS47) 10Giuseppe Lavagetto: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz)
[14:02:16] (03Merged) 10jenkins-bot: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[14:03:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore main traffic original weight for db1114 (duration: 00m 58s)
[14:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz)
[14:04:10] (03PS4) 10Gehel: wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766)
[14:05:16] (03CR) 10Ottomata: [C: 031] "Nit: add # Note: This file is managed by puppet at the top?" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:05:34] !log restarted Jenkins for plugin upgrade T192261
[14:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:52] (03CR) 10Gehel: [C: 032] wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel)
[14:06:05] (03CR) 10Elukey: "> Nit: add # Note: This file is managed by puppet at the top?" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:06:38] !log start reimage of es2-codfw master, es2016
[14:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:59] (03PS3) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[14:07:07] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924
[14:08:33] Lydia_WMDE: looking
[14:08:48] mobrovac: thank you!
[14:09:02] seems a revert patch has been uploaded now which is awesome
[14:09:40] (03PS3) 10Gehel: wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766)
[14:11:08] (03CR) 10Gehel: [C: 032] wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel)
[14:11:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[14:12:11] _joe_: unmerged puppet change (something about mcrouter), should I merge?
[14:12:21] !log upgraded HHVM on mediawiki-deployment-09 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)
[14:12:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes reponds with 5xx - https://phabricator.wikimedia.org/T192287#4133655 (10Niedzielski)
[14:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:27] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[14:12:31] (03PS1) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925
[14:12:36] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[14:12:39] <_joe_> gehel: yeah sorry
[14:12:47] <_joe_> it's labs-only, I forgot
[14:12:47] _joe_: np
[14:13:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 00m 58s)
[14:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:46] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[14:15:18] (03PS14) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732)
[14:15:52] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans)
[14:16:07] Lydia_WMDE: Amir1: have you taken a look at T192085 perhaps? that one looks like the root cause
[14:16:07] T192085: PHP Fatal in AffectedPagesFinder::getChangedAspects - https://phabricator.wikimedia.org/T192085
[14:16:13] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133687 (10Niedzielski) Could the summary endpoint issue be a network or caching problem related to this task? I was wondering because it seems like the Node.js service is is...
[14:16:30] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991)
[14:16:57] (03PS1) 10Gehel: wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766)
[14:17:24] mobrovac: we'll have a look. thank you
[14:17:41] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:17:47] ^^^ I'll merge the unmerged puppet changes in a minute, there is an issue with the last one...
[14:17:56] (03CR) 10Vgutierrez: [C: 031] wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:18:12] (03PS2) 10Gehel: wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) [14:18:25] gehel: ok, feel free to merge my ircd auto restart patch along [14:18:26] moritzm: could you hold your puppet merge for a second? [14:18:29] sure [14:18:31] moritzm: thanks! [14:18:46] (03CR) 10Gehel: [C: 032] wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:19:08] (03PS2) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [14:19:28] moritzm: merged [14:19:47] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [14:20:00] ack [14:20:52] mobrovac: that part of the code (and practically everywhere around it) hasn't been touched for over a month now [14:21:11] it doesn't match time-wise with any deployment too [14:21:12] (03PS1) 10Marostegui: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 [14:21:31] Amir1: but the code is there and produces a fatal [14:21:46] it's more of an edge case that definitely needs to be taken care of but it can't bring down all of the jobqueue [14:22:22] the jobqueue is not down [14:22:33] does it happen on every single job insertion? 
[14:22:44] job = RC injection job [14:22:56] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 41 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:23:11] yeah, I mean all of the rc jobs [14:23:47] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 30 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:24:00] mmmmm [14:24:05] that's expected [14:24:16] a new service got configured but pybal hasn't been restarted yet [14:24:40] ah ok thanks for the explanation, the critical looks scary without context :D [14:24:49] looking Amir1 [14:25:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui) [14:25:38] !log restarting pybal on lvs2006 - T187766 [14:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:44] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:26:05] (03PS1) 10Ottomata: Re-enable dumps/other fetcher rsync job, simplify jobs [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) [14:26:16] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133745 (10Mholloway) @Niedzielski I'll look into this on the mobileapps side today. It's possible there's a problem in the config for the beta cluster. Just to be clear, i... 
[14:27:29] (03Merged) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui) [14:28:24] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10938/labstore1007.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:28:27] (03CR) 10Ottomata: [C: 032] Re-enable dumps/other fetcher rsync job, simplify jobs [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:28:32] PROBLEM - pybal on lvs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:28:32] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [14:28:57] ^^ everything under control :) [14:29:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 00m 57s) [14:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:08] PROBLEM - Host wdqs-internal.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:30:28] gehel: not yet in prod right? [14:30:41] volans: yep, new service, we're on it with vgutierrez [14:30:47] ok great [14:30:48] ack [14:30:54] nit for the next time - turn off paging [14:31:16] (03PS15) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [14:31:18] (03PS1) 10Ottomata: Remove trailing / from rsync locations [puppet] - 10https://gerrit.wikimedia.org/r/426929 (https://phabricator.wikimedia.org/T189283) [14:31:22] (and enable it when the service is working fine) [14:31:44] elukey: except the icinga check was just created... 
[14:31:56] (03CR) 10Ottomata: [C: 032] Remove trailing / from rsync locations [puppet] - 10https://gerrit.wikimedia.org/r/426929 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:32:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on baham is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs-internal.state [14:32:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on eeden is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs-internal.state [14:32:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wdqs-internal [14:33:10] gehel: IIRC there is a hiera config to avoid paging for an LVS endpoint [14:33:17] oof -- I didn't get paged [14:33:27] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133786 (10Niedzielski) Thanks @mholloway. This task is specific to the [[ https://en.wikipedia.beta.wmflabs.org/wiki/Barack_Obama | Barack Obama ]] page. The page summary is... 
[14:33:32] RECOVERY - pybal on lvs2006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:33:33] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:33:33] elukey: remind me to look into it once this is fixed :) [14:33:35] (03PS2) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:34:03] RECOVERY - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on baham is OK: No errors detected [14:34:12] RECOVERY - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on eeden is OK: No errors detected [14:34:12] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 30 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:34:12] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster1001 is OK: No errors detected [14:35:18] RECOVERY - Host wdqs-internal.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [14:35:34] lolz, sms came in just now [14:37:27] PROBLEM - PyBal connections to etcd on lvs1003 is CRITICAL: CRITICAL: 41 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:38:28] (03PS1) 10Marostegui: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 [14:38:47] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 31 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:38:49] (03PS1) 10Ottomata: Add $delete parameter to dumps::web::fetches::job [puppet] - 10https://gerrit.wikimedia.org/r/426931 (https://phabricator.wikimedia.org/T189283) [14:39:32] (03CR) 10Ottomata: [C: 032] Add $delete parameter to dumps::web::fetches::job [puppet] - 10https://gerrit.wikimedia.org/r/426931 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:39:34] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector:
cluster=wdqs-internal [14:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui) [14:41:52] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui) [14:42:00] !log restart pybal on lvs1006 - T187766 [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:42:37] RECOVERY - mediawiki-installation DSH group on mw1299 is OK: OK [14:42:47] (03PS1) 10Jcrespo: mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) [14:42:57] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:43:06] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:43:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 in API - T191996 (duration: 00m 58s) [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:44:04] (03PS2) 10Jcrespo: mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) [14:44:27] (03PS3) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [14:44:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: 
[Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4133813 (10ovasileva) p:05Triage>03High [14:45:40] (03PS3) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:45:43] (03CR) 10Jcrespo: [C: 032] mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:45:58] (03PS1) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [14:46:27] (03PS4) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:48:23] (03PS5) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:49:02] (03CR) 10Ema: [C: 032] VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema) [14:49:41] !log restart pybal on lvs2003 - T187766 [14:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:47] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:49:47] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [14:49:57] (03PS2) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [14:49:58] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time [14:49:58] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 80385
bytes in 0.085 second response time [14:53:13] !log restart pybal on lvs1003 - T187766 [14:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:17] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 31 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:57:27] RECOVERY - PyBal connections to etcd on lvs1003 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:58:49] (03CR) 10DCausse: [C: 031] kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279) (owner: 10Gehel) [14:59:04] (03PS1) 10Vgutierrez: Revert "Remove hydrogen from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/426940 (https://phabricator.wikimedia.org/T187090) [15:01:07] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=hydrogen.wikimedia.org,service=pdns_recursor [15:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:08] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[librenms/librenms] [15:03:25] (03CR) 10Vgutierrez: [C: 032] Revert "Remove hydrogen from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/426940 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [15:04:55] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133869 (10Vgutierrez) [15:04:59] (03PS1) 10Ottomata: Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) [15:05:52] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4133871 (10ema) >>! 
In T187014#4129884, @Nuria wrote: > +1 let me know when it is in place and i can help check things square again on... [15:07:15] !log start reimage of es3-codfw master, es2017 [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:59] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:09:35] (03PS2) 10Ottomata: Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) [15:09:50] (03CR) 10Ottomata: [V: 032 C: 032] Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [15:10:24] (03CR) 10Nuria: [C: 031] Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema) [15:14:03] (03PS3) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [15:14:41] (03PS3) 10Ottomata: Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) [15:17:42] (03CR) 10Jcrespo: [C: 032] mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:18:06] (03CR) 10Muehlenhoff: [C: 031] "That seems fine, I'll merge, build and import tomorrow." 
[debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: 10Hashar) [15:21:37] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4133923 (10Addshore) [15:25:12] !log upgraded HHVM on mediawiki-deployment-07 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854) [15:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:18] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [15:27:59] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:28:06] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2a720fc]: Log HTML for PHP fatal errors from MW [15:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:07] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2a720fc]: Log HTML for PHP fatal errors from MW (duration: 01m 01s) [15:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] (03CR) 10Lucas Werkmeister (WMDE): "suggestion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [15:46:01] PROBLEM - HHVM rendering on mw2262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.433 second response time [15:46:01] PROBLEM - HHVM rendering on mw2252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.480 second response time [15:46:01] PROBLEM - HHVM rendering on mw2236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.761 second response time [15:46:01] PROBLEM - HHVM rendering on mw2227 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.805 second response time [15:46:01] PROBLEM - HHVM rendering on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.065 second response time [15:46:01] PROBLEM - HHVM rendering on mw2229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.140 second response time [15:46:01] PROBLEM - HHVM rendering on mw2275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.381 second response time [15:46:02] PROBLEM - HHVM rendering on mw2284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.449 second response time [15:46:02] PROBLEM - HHVM rendering on mw2283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.699 second response time [15:46:03] PROBLEM - HHVM rendering on mw2231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.792 second response time [15:46:03] PROBLEM - HHVM rendering on mw2235 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.023 second response time [15:46:04] PROBLEM - HHVM rendering on mw2139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.099 second response time [15:46:04] PROBLEM - HHVM rendering on mw2251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.361 second response time [15:46:05] PROBLEM - HHVM rendering on mw2165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.405 second response time [15:46:10] PROBLEM - HHVM rendering on mw2288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.680 second response time [15:46:10] PROBLEM - HHVM rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.747 second response time [15:46:10] PROBLEM - HHVM rendering on mw2285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal 
Server Error - 1527 bytes in 2.017 second response time [15:46:10] PROBLEM - HHVM rendering on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.048 second response time [15:46:10] PROBLEM - HHVM rendering on mw2211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.455 second response time [15:46:11] PROBLEM - HHVM rendering on mw2254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.774 second response time [15:46:22] PROBLEM - HHVM rendering on mw2188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.090 second response time [15:46:22] PROBLEM - HHVM rendering on mw2237 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.413 second response time [15:46:23] PROBLEM - HHVM rendering on mw2196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.423 second response time [15:46:23] PROBLEM - HHVM rendering on mw2193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.698 second response time [15:46:24] PROBLEM - HHVM rendering on mw2138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.180 second response time [15:46:24] PROBLEM - HHVM rendering on mw2146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.667 second response time [15:46:25] PROBLEM - HHVM rendering on mw2287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.750 second response time [15:46:25] PROBLEM - HHVM rendering on mw2141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.969 second response time [15:46:26] PROBLEM - HHVM rendering on mw2192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.039 second response time [15:46:26] PROBLEM - HHVM rendering on mw2137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 
bytes in 1.279 second response time [15:46:27] PROBLEM - HHVM rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.374 second response time [15:46:27] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.568 second response time [15:46:28] PROBLEM - HHVM rendering on mw2143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.700 second response time [15:46:28] PROBLEM - HHVM rendering on mw2233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.839 second response time [15:46:29] PROBLEM - HHVM rendering on mw2182 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.938 second response time [15:46:29] PROBLEM - HHVM rendering on mw2286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.153 second response time [15:46:30] PROBLEM - HHVM rendering on mw2145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.249 second response time [15:46:30] PROBLEM - HHVM rendering on mw2183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.484 second response time [15:47:37] (03PS4) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [15:49:39] * akosiaris looking ^ [15:49:49] PROBLEM - HHVM rendering on mw2197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.384 second response time [15:49:52] could be my reimage [15:50:29] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [15:50:37] jynus: all those hosts ? 
[15:50:54] I am reimaging 1 content master [15:50:56] doubtful, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 list 130+ problematic HHVM [15:51:43] from /var/log/apache2/other_vhosts_access.log on mw2252 api.php is 200 but /wiki/Main_Page yields a 500 [15:53:10] !log restart hhvm on mw2252 [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:16] let's see if that helps [15:54:02] akosiaris: let me start mysql [15:54:05] and see [15:54:18] jynus: it might indeed be what you say, so yes please do [15:54:25] mediawiki-errors for mw2252 https://logstash.wikimedia.org/goto/c5d4766387219b396a04734d52c4e2a2 [15:54:29] RECOVERY - HHVM rendering on mw2138 is OK: HTTP OK: HTTP/1.1 200 OK - 80438 bytes in 4.328 second response time [15:54:30] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.262 second response time [15:54:30] RECOVERY - HHVM rendering on mw2287 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.267 second response time [15:54:30] RECOVERY - HHVM rendering on mw2143 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.273 second response time [15:54:30] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.277 second response time [15:54:30] RECOVERY - HHVM rendering on mw2145 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.284 second response time [15:54:30] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.285 second response time [15:54:31] RECOVERY - HHVM rendering on mw2286 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:31] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:32] RECOVERY - HHVM rendering on mw2183 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:54:32] RECOVERY - HHVM 
rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:33] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:54:33] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.278 second response time [15:54:34] showing some dbconnection / dbreplication errors [15:54:34] RECOVERY - HHVM rendering on mw2166 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.276 second response time [15:54:36] lol [15:54:45] RECOVERY - HHVM rendering on mw2268 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.263 second response time [15:54:45] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:54:46] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.280 second response time [15:54:46] RECOVERY - HHVM rendering on mw2167 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.293 second response time [15:54:47] jynus: and I am guessing you just did ? 
:P [15:54:47] RECOVERY - HHVM rendering on mw2271 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.257 second response time [15:54:47] RECOVERY - HHVM rendering on mw2163 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.281 second response time [15:54:48] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time [15:54:49] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.275 second response time [15:54:49] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:54:59] RECOVERY - HHVM rendering on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time [15:54:59] RECOVERY - HHVM rendering on mw2168 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time [15:54:59] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.269 second response time [15:54:59] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.288 second response time [15:55:00] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:55:00] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time [15:55:00] RECOVERY - HHVM rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.280 second response time [15:55:00] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time [15:55:01] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:55:01] RECOVERY - HHVM rendering on mw2224 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.266 second response time [15:55:02] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:55:02] RECOVERY - HHVM 
rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time
[15:55:03] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:03] RECOVERY - HHVM rendering on mw2238 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.259 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.270 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.269 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.272 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2275 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.266 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2229 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time
[15:55:20] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time
[15:55:20] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.275 second response time
[15:55:21] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time
[15:55:21] RECOVERY - HHVM rendering on mw2283 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:22] RECOVERY - HHVM rendering on mw2288 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.258 second response time
[15:55:36] akosiaris, godog: https://phabricator.wikimedia.org/T180918
[15:56:13] jynus: heh, so "known problem" :|
[15:56:24] lol, ok
[16:00:16] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time
[16:01:16] (03PS2) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[16:01:53] (03CR) 10Alexandros Kosiaris: [C: 032] kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 (owner: 10Alexandros Kosiaris)
[16:02:06] (03PS3) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[16:02:47] (03CR) 10Gehel: [C: 032] kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279) (owner: 10Gehel)
[16:03:37] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time
[16:03:37] RECOVERY - HHVM rendering on mw2258 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time
[16:04:06] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.270 second response time
[16:04:06] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time
[16:06:17] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Kibana fails to load when using short URLs to share dashboard - https://phabricator.wikimedia.org/T192279#4134072 (10Gehel) 05Open>03Resolved a:03Gehel Ugly fix is deployed. We might come back and remove it if...
[16:11:33] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4134090 (10Marostegui)
[16:14:54] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4134111 (10mforns) 05Open>03Resolved a:03mforns
[16:26:11] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#4134131 (10mforns) p:05Normal>03Low
[16:54:25] (03PS1) 10Andrew Bogott: Keystone: replace wmfkeystonehooks in Mitaka config [puppet] - 10https://gerrit.wikimedia.org/r/426956 (https://phabricator.wikimedia.org/T192304)
[16:56:29] (03CR) 10Andrew Bogott: [C: 032] Keystone: replace wmfkeystonehooks in Mitaka config [puppet] - 10https://gerrit.wikimedia.org/r/426956 (https://phabricator.wikimedia.org/T192304) (owner: 10Andrew Bogott)
[17:00:56] (03PS1) 10Muehlenhoff: Add component/zookeeper349 [puppet] - 10https://gerrit.wikimedia.org/r/426957
[17:11:43] !log upgraded HHVM on mediawiki-jobrunner03 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)
[17:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:51] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[17:12:57] (03CR) 10Elukey: [C: 031] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/426957 (owner: 10Muehlenhoff)
[17:13:49] (03CR) 10Muehlenhoff: [C: 032] Add component/zookeeper349 [puppet] - 10https://gerrit.wikimedia.org/r/426957 (owner: 10Muehlenhoff)
[17:29:11] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#4134323 (10AlexisJazz)
[17:30:25] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#1945898 (10AlexisJazz) Copied from my duplicate task: https://commons.wikimedia.org/wiki/File:Accuracy_Int...
[17:52:14] (03PS4) 10Ottomata: Blacklist mediawiki.job and change-prop from cross DC main <-> main mirror [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005)
[17:56:44] (03CR) 10Ottomata: [C: 032] Blacklist mediawiki.job and change-prop from cross DC main <-> main mirror [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata)
[18:03:24] !log restarting main <-> main DC kafka mirror maker instances to blacklist job and cp topics T190940 T167039
[18:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:30] T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039
[18:03:31] T190940: Use --new.consumer for main codfw <-> eqiad Kafka MirrorMaker - https://phabricator.wikimedia.org/T190940
[18:03:51] (03PS9) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504)
[18:03:53] (03PS7) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504)
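The change logged above blacklists mediawiki.job and change-prop topics from the cross-DC main <-> main Kafka mirror. MirrorMaker expresses such exclusions as a topic blacklist regex; the sketch below illustrates that filtering idea in Python. The pattern and topic names here are invented for illustration and are not the actual regex from the puppet patch.

```python
import re

# Hypothetical blacklist in the spirit of the change above: exclude
# mediawiki.job.* and change-prop.* topics from cross-DC mirroring.
BLACKLIST = re.compile(r"(eqiad|codfw)\.(mediawiki\.job\..*|change-prop\..*)")

def should_mirror(topic):
    """Mirror a topic only if it does not match the blacklist regex."""
    return BLACKLIST.fullmatch(topic) is None

topics = [
    "eqiad.mediawiki.job.refreshLinks",
    "eqiad.mediawiki.revision-create",
    "codfw.change-prop.retry.mediawiki.job.cdnPurge",
]
mirrored = [t for t in topics if should_mirror(t)]
# Only the revision-create topic survives the blacklist.
```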
[18:03:55] (03PS9) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504)
[18:03:58] (03PS5) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504)
[18:04:08] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:11] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:13] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:15] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:05:31] (03CR) 10Volans: "Replies inline, thanks for the review!" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:06:46] (03CR) 10Volans: "> I am not entirely sure about this approach. Django has its own" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:06:58] (03CR) 10Volans: "Reply inline" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:18:04] (03PS4) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:19:35] (03PS5) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:25:01] (03PS6) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:28:04] !log temporarily stopping puppet on kafka200[123] to apply MirrorMaker --new.consumer https://gerrit.wikimedia.org/r/#/c/424344/ T190940
[18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:11] T190940: Use --new.consumer for main codfw <-> eqiad Kafka MirrorMaker - https://phabricator.wikimedia.org/T190940
[18:29:37] (03CR) 10Ottomata: [C: 032] Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:29:42] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10942/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:39:09] (03PS1) 10Ottomata: Use --new.consumer for main codfw -> eqiad MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/426965 (https://phabricator.wikimedia.org/T190940)
[18:40:04] (03CR) 10Ottomata: [C: 032] Use --new.consumer for main codfw -> eqiad MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/426965 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:48:33] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 274809.0696474634 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:49:53] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 128174.41681260944 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[18:49:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 149952.0616071429 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:34] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:53] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:53] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:03:34] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 56753.572735590096 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[19:03:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 57918.16353383459 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:03:53] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 83493.59792027729 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:34] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:53] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:53] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:09:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134473 (10Mholloway) a:03Mholloway
[19:14:45] (03PS1) 10Ottomata: Configure jmx_exporter prometheus config for kafka main (mirror) [puppet] - 10https://gerrit.wikimedia.org/r/426971 (https://phabricator.wikimedia.org/T190940)
[19:15:39] (03CR) 10Ottomata: [C: 032] Configure jmx_exporter prometheus config for kafka main (mirror) [puppet] - 10https://gerrit.wikimedia.org/r/426971 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:23:27] (03PS1) 10Ottomata: Remove unused role::kafka::main::mirror and set up main MM alerts [puppet] - 10https://gerrit.wikimedia.org/r/426973 (https://phabricator.wikimedia.org/T190940)
[19:24:39] (03CR) 10Ottomata: [C: 032] Remove unused role::kafka::main::mirror and set up main MM alerts [puppet] - 10https://gerrit.wikimedia.org/r/426973 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:32:03] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4134513 (10mobrovac)
[19:33:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:37:45] i'm taking over tin for 10 mins to fix a UBN
[19:42:04] (03PS1) 10Ottomata: Fix main-eqiad_to_main-codfw Mirror alert [puppet] - 10https://gerrit.wikimedia.org/r/426976 (https://phabricator.wikimedia.org/T190940)
[19:42:52] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:43:40] (03CR) 10Ottomata: [C: 032] Fix main-eqiad_to_main-codfw Mirror alert [puppet] - 10https://gerrit.wikimedia.org/r/426976 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:46:03] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Use the wiki set in the JobQueue when creating the event, file 1/2 - T192198 (duration: 01m 00s)
[19:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:10] T192198: Wikidata doesn't update recentchanges - https://phabricator.wikimedia.org/T192198
[19:47:28] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Use the wiki set in the JobQueue when creating the event, file 2/2 - T192198 (duration: 00m 59s)
[19:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:55] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[19:59:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134550 (10Niedzielski)
[19:59:45] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#4134556 (10Ottomata) Hm, I just thought about this a little bit, and I'm not so sure we should do it. The hiera in...
[20:04:00] 10Operations, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4134566 (10Imarlier) a:03BBlack Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, i...
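The kubelet operational latency alerts earlier in the log compute an average latency by dividing the rate of a Prometheus summary's `_sum` counter by the rate of its `_count` counter over a 5-minute window. A minimal sketch of that arithmetic, with invented sample values (the threshold is the 15000 µs figure from the alert expression):

```python
def rate(samples, window_s):
    """Per-second increase of a monotonic counter over the window (like PromQL rate())."""
    return (samples[-1] - samples[0]) / window_s

# Counter values sampled at the start and end of a 5-minute window.
# The numbers are invented for illustration only.
latency_sum_us = [1_000_000.0, 1_900_000.0]  # ..._latency_microseconds_sum
latency_count = [500.0, 800.0]               # ..._latency_microseconds_count

window = 300  # seconds: the [5m] range in the alert expression
avg_latency_us = rate(latency_sum_us, window) / rate(latency_count, window)
# 900000/300 = 3000 µs/s of latency accrued over 300/300 = 1 op/s,
# i.e. an average of 3000 µs per operation.
critical = avg_latency_us >= 15000.0  # threshold from the kubelet alert
```

Dividing the two rates rather than the raw counters keeps the result meaningful across counter resets and gives a windowed average instead of a lifetime one.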
[20:12:06] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Revert using the wiki of the job runner, file 1/2 (duration: 00m 58s)
[20:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:18] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Revert using the wiki of the job runner, file 2/2 (duration: 00m 58s)
[20:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:04] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[20:28:36] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134626 (10Mholloway) @Niedzielski I restarted the beta cluster restbase and mobileapps servi...
[20:31:16] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134640 (10Mholloway)
[20:33:56] !log imarlier@tin Started deploy [performance/navtiming@64d9c90]: null deploy
[20:33:58] !log imarlier@tin Finished deploy [performance/navtiming@64d9c90]: null deploy (duration: 00m 02s)
[20:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:52:52] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134666 (10Mholloway) For posterity: there was a recent change to the config variable that se...
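The CirrusSearch and EventBus checks in this log fire when a given percentage of recent graphite datapoints sits above a threshold (e.g. "30.00% of data above the critical threshold [1000.0]"). A small sketch of that logic; the sample latencies are invented:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    points = [p for p in datapoints if p is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for p in points if p > threshold) / len(points)

# Ten 95th-percentile latency samples in ms (None = a null datapoint,
# as graphite often returns for the most recent interval):
samples = [420, 480, 1100, 950, 1300, 640, 880, 1050, 700, None]

crit = percent_above(samples, 1000.0) >= 30.0  # CRITICAL clause from the alert
warn = percent_above(samples, 500.0) >= 20.0   # WARNING-style clause
```

Skipping null datapoints matters: counting them in the denominator would silently dilute the percentage during gaps in metric collection.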
[21:02:29] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Use the correct way of calculating the domain from the wiki, file 1/2 - T192198 (duration: 00m 59s)
[21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:35] T192198: Wikidata doesn't update recentchanges - https://phabricator.wikimedia.org/T192198
[21:03:46] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Use the correct way of calculating the domain from the wiki, file 2/2 - T192198 (duration: 00m 58s)
[21:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:23:24] (03CR) 10Krinkle: [C: 031] Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[21:29:18] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka2001 is CRITICAL: NRPE: Command check_kafka-mirror-main-eqiad_to_main-codfw@0 not defined
[21:31:50] ottomata: ^ ?
[21:36:42] (03CR) 10Krinkle: [C: 04-1] "Assuming that the scap-deployed service will be live soon, let's roll this out via the new repository to avoid diverging the code between " [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog)
[21:37:10] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4134725 (10Pchelolo)
[21:38:16] (03PS4) 10Krinkle: webperf1001/2001 start using webperf role [puppet] - 10https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T186774) (owner: 10Dzahn)
[21:57:16] Hi XioNoX and/or paravoid! Can one of you help me with a payments-cluster-related firewall adjustment?
[21:57:25] https://phabricator.wikimedia.org/T191669
[21:58:16] Thought I'd filed the request well ahead of the deadline to switch payment API addresses
[21:58:29] but it turns out I'd put the wrong security policy on the ticket
[21:58:37] and hidden from all but fundraising :(
[22:19:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134804 (10Niedzielski) I tried about 50 links and it seems to work. Thanks (and thanks for k...
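The T192198 syncs in this log replace an incorrect assumption (using the job runner's own wiki) with deriving the event's domain from the wiki set on the job itself. The helper below is a purely hypothetical illustration of such a dbname-to-domain mapping; MediaWiki's actual resolution goes through its site configuration, not a hard-coded table like this one.

```python
# Hypothetical dbname -> domain table; wikis that don't follow the
# "<lang>wiki" -> "<lang>.wikipedia.org" pattern need explicit entries.
SPECIAL = {
    "wikidatawiki": "www.wikidata.org",
    "commonswiki": "commons.wikimedia.org",
    "metawiki": "meta.wikimedia.org",
}

def domain_from_dbname(dbname):
    """Illustrative only: derive a canonical domain from a wiki database name."""
    if dbname in SPECIAL:
        return SPECIAL[dbname]
    if dbname.endswith("wiki"):
        return dbname[: -len("wiki")] + ".wikipedia.org"
    raise ValueError(f"unmapped dbname: {dbname}")
```

The bug class the fix addresses is visible here: resolving "wikidatawiki" through the language-wiki fallback instead of its explicit entry would yield a wrong domain, which is consistent with Wikidata's recentchanges updates being misrouted.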
[22:25:13] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[22:25:17] (03CR) 10jenkins-bot: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[22:25:21] (03CR) 10jenkins-bot: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908 (owner: 10Jcrespo)
[22:25:25] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[22:25:29] (03CR) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui)
[22:25:35] (03CR) 10jenkins-bot: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui)
[22:48:58] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 42910515.973187715 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[22:50:47] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 83241901.26878615 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[22:54:33] XioNoX or paravoid: could one of you ping me when you get the chance to take a look at that ticket?
[22:54:53] As soon as the routes are open I'll deploy a settings change to the payments cluster to use the new address
[23:01:47] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:04:07] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api