[00:08:02] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.59, 36.08, 31.00
[00:53:38] (CR) Aaron Schulz: "> I like the patch as it is; we still obviously lack proper" [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[01:45:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.22, 34.17, 32.16
[02:23:42] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20323992
[02:24:42] RECOVERY - Postgres Replication Lag on maps1004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 620792
[03:02:46] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 11m 09s)
[03:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:51] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 49.46, 50.37, 48.03
[03:05:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 50.31, 49.94, 48.13
[03:13:21] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[03:15:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 48.26, 47.91, 48.06
[03:22:52] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.71, 48.78, 48.17
[04:03:01] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 57.27, 50.68, 48.26
[04:06:01] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.18, 49.70, 48.25
[04:17:02] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 47.97, 47.50, 48.02
[04:49:22] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 45.01, 42.83, 40.30
[04:56:01] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker
[04:59:02] PROBLEM - nutcracker process on deploy1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (nutcracker), command name nutcracker
[05:12:01] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 35.86, 34.49, 32.19
[05:12:51] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 38.13, 35.24, 31.38
[05:15:52] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 40.02, 36.25, 32.37
[05:25:30] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089)
[05:26:48] <_joe_> ouch, again
[05:28:57] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:30:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:30:29] (CR) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/426843 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:31:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 (duration: 00m 59s)
[05:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:54] !log Deploy schema change on db1087 with replication (this will generate lag in labs) - T187089 T185128 T153182
[05:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:01] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:34:01] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:34:01] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:35:49] <_joe_> !log depooling mw1341 to further debug the API issue
[05:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:17] !log restart hhvm on mw1226,27,32,88 - high load
[05:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:06] <_joe_> so this is very very strange
[05:38:01] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 40.40, 34.46, 32.25
[05:40:01] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 56.41, 51.18, 48.39
[05:42:04] !log Reload haproxy on dbproxy1010
[05:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:52] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[05:46:02] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 5.81, 14.27, 23.69
[05:47:01] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 5.78, 11.90, 23.58
[05:48:55] !log restart hhvm on mw1225, 1315, 1316, 1340, 1341, 1342, 1347 - high load
[05:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:01] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 6.70, 10.40, 23.98
[05:49:31] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 9.45, 15.02, 29.11
[05:55:21] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 13.86, 28.12, 34.74
[05:55:52] !log repool mw1341 after investigation
[05:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:56] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132722 (Marostegui)
[05:57:15] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (Marostegui) p:Triage>High
[05:58:02] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 11.20, 20.17, 35.73
[05:58:12] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4123759 (Marostegui) I have written a summary of the current state of debugging on the original task description, so it is easier to read instead of going thru all the co...
[05:59:57] !log restart hhvm on mw[1221,1233,1280,1347] - high load
[06:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:21] RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 9.92, 14.89, 35.56
[06:20:08] !log Drop table flow_subscription from x1 - T149936
[06:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:14] T149936: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936
[06:38:03] <_joe_> !log repooling mw1230
[06:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:39] Operations, HHVM, Patch-For-Review, User-Elukey: Provide a forward port of ICU 52 for stretch / Investigate best ICU update strategy - https://phabricator.wikimedia.org/T177498#4132740 (Joe)
[06:39:42] Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4132736 (Joe) Open>Resolved
[06:40:22] Operations, HHVM, Patch-For-Review, User-ArielGlenn, and 2 others: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4037934 (Joe) Enwiki finished it run at 14:40 UTC on saturday april 14th.
[06:43:01] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:43:51] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time
[06:45:21] !log depooled mw1230
[06:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:21] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:47:31] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:01] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:16:39] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132767 (Marostegui)
[07:27:21] !log installing perl security updates on Debian systems
[07:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:32] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:42] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:51] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:32:31] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time
[07:33:00] <_joe_> mw1224 is me
[07:35:07] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132775 (Marostegui)
[07:35:41] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:35:56] (PS1) Vgutierrez: install_server: Reimage achernar as stretch [puppet] - https://gerrit.wikimedia.org/r/426851 (https://phabricator.wikimedia.org/T187090)
[07:36:31] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time
[07:36:41] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80414 bytes in 0.086 second response time
[07:36:42] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time
[07:37:25] (CR) Vgutierrez: [C: 2] install_server: Reimage achernar as stretch [puppet] - https://gerrit.wikimedia.org/r/426851 (https://phabricator.wikimedia.org/T187090) (owner: Vgutierrez)
[07:39:23] !log Depool and reimage achernar.wikimedia.org - T187090
[07:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:29] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[07:40:43] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=achernar.wikimedia.org,service=pdns_recursor
[07:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:44:41] (PS1) Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852
[07:44:44] this may be one host misbehaving --^ ?
[07:44:51] Might be db1114 yeah
[07:45:05] checking https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor
[07:45:07] (CR) Marostegui: [V: 2 C: 2] db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852 (owner: Marostegui)
[07:45:35] marostegui: Could not wait for replica DBs to catch up to db1052 - related?
[07:45:48] most likely yeah
[07:45:52] It is depooling now
[07:45:58] super
[07:45:59] thanks :)
[07:46:24] I was running some tests on db1114
[07:46:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 59s)
[07:46:26] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132784 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` achernar.wikimedia.org ``` The log can be found in `/var/log/wmf-aut...
[07:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:44] trying to debug this: https://phabricator.wikimedia.org/T191996
[07:46:52] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4132785 (Marostegui)
[07:47:11] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 80416 bytes in 0.369 second response time
[07:47:21] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time
[07:47:51] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time
[07:48:23] elukey: confirmed, it was that. fatals' graph is back to normal
[07:49:02] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: PING CRITICAL - Packet loss = 100%
[07:49:23] (CR) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426852 (owner: Marostegui)
[07:49:34] marostegui: <3
[07:49:34] ^ that's achernar being reimaged
[07:49:56] !log Stop MySQL and reboot db1114 - T191996
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[07:51:12] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:2:208:80:153:42)
[07:51:42] PROBLEM - Recursive DNS on 208.80.153.42 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[07:53:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:56:30] (PS1) Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854
[07:57:51] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[07:59:06] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[08:00:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 58s)
[08:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:11] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426854 (owner: Marostegui)
[08:04:00] !log restart hhvm on mw[1228,1234,1281-1287,1289,1290,1312-1314,1317,1339,1343,1345,1346,1348] - more than 50% cpu usage, prevention scheme for current high load
[08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:38] (PS1) Marostegui: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856
[08:14:32] (CR) Marostegui: [C: 2] db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:14:43] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:53] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:04] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:46] (Merged) jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:17:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 00m 58s)
[08:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:07] (PS3) Muehlenhoff: Enable base::service_auto_restart for zuul-merger [puppet] - https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991)
[08:18:33] (PS1) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:19:06] (CR) jerkins-bot: [V: -1] Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858 (owner: Ema)
[08:19:24] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for zuul-merger [puppet] - https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:19:33] (CR) jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426856 (owner: Marostegui)
[08:20:43] (PS2) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:21:08] (PS3) Ema: Revert "Revert "varnish: restart backends every 3.5 days"" [puppet] - https://gerrit.wikimedia.org/r/426858
[08:23:04] Hi ops-team - Just a ping about the analytics-team deploying jobs on the hadoop cluster
[08:23:39] !log joal@tin Started deploy [analytics/refinery@27416a9]: Regular weekly deploy - Mostly bugfixes from previous week huge deploy
[08:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:00] * elukey blames the analytics team
[08:24:21] :D
[08:24:21] (CR) Alexandros Kosiaris: "> yea, so i wasn't sure if i want to remove it for all or keep it for all. i wanted to be consistent though and have removed it from 2 oth" [puppet] - https://gerrit.wikimedia.org/r/425945 (owner: Dzahn)
[08:24:34] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132831 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['achernar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['achernar.wikimedia.org'] ```
[08:25:36] <_joe_> !log depooling mw1223 for investigation too
[08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:33] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:43] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:06] !log joal@tin Finished deploy [analytics/refinery@27416a9]: Regular weekly deploy - Mostly bugfixes from previous week huge deploy (duration: 05m 27s)
[08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:13] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:24] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))): 152848.15066964287 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:29:24] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))): 198049.99711815565 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:30:44] (PS4) Gehel: maps: run populate_admin() regularly [puppet] - https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605)
[08:30:54] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.635 second response time
[08:31:13] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time
[08:31:43] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80368 bytes in 0.081 second response time
[08:32:24] RECOVERY - etcd request latencies on neon is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.40:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:32:33] RECOVERY - Request latencies on neon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:33:35] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132854 (Tim_WMDE)
[08:35:24] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:53] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:04] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:08] !log pooled mw1261-mw1264 (app server canaries running stretch)
[08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:42] (PS1) Marostegui: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862
[08:42:39] Operations, Pybal, Traffic, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4132869 (Vgutierrez) Open>stalled
[08:44:49] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:45:46] (CR) Gehel: [C: 2] maps: run populate_admin() regularly [puppet] - https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) (owner: Gehel)
[08:46:02] (Merged) jenkins-bot: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:47:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1114 original main traffic weight (duration: 00m 58s)
[08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:05] (CR) jenkins-bot: db-eqiad.php: Restore db1114 main traffic weight [mediawiki-config] - https://gerrit.wikimedia.org/r/426862 (owner: Marostegui)
[08:49:46] !log first manual run of populate_admin() for maps[12]001 - T190605
[08:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:52] T190605: Some of the borders on maps.wikimedia.org are outdated - https://phabricator.wikimedia.org/T190605
[08:50:50] (PS1) Marostegui: db1063.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/426864
[08:51:55] (PS1) Gehel: maps: fixed typo in populate_admin() cron [puppet] - https://gerrit.wikimedia.org/r/426866 (https://phabricator.wikimedia.org/T190605)
[08:52:23] (CR) Gehel: [C: 2] maps: fixed typo in populate_admin() cron [puppet] - https://gerrit.wikimedia.org/r/426866 (https://phabricator.wikimedia.org/T190605) (owner: Gehel)
[08:52:52] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10931/" [puppet] - https://gerrit.wikimedia.org/r/426864 (owner: Marostegui)
[08:53:00] (PS2) Marostegui: db1063.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/426864
[09:04:51] (PS1) Jcrespo: mariadb-prometheus-exporter: Add missing section s8@dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/426869
[09:05:14] !log pooled mw1276-mw1278 (API app server canaries running stretch)
[09:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:16] (CR) Mobrovac: [C: 1] "GTG now" [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:06:43] (CR) Jcrespo: [C: 2] mariadb-prometheus-exporter: Add missing section s8@dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/426869 (owner: Jcrespo)
[09:06:47] (PS2) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:07:25] (PS3) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:07:56] !log starting rolling restart of wdqs100[35] and wdqs200[123] for kernel upgrade
[09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:50] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132917 (Aklapper)
[09:10:07] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:10:59] ^ oops, wdqs1003 is me not downtiming early enough
[09:11:27] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:12:18] (PS2) Arturo Borrero Gonzalez: labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919)
[09:13:12] (CR) Arturo Borrero Gonzalez: [C: 2] labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919) (owner: Arturo Borrero Gonzalez)
[09:16:19] (PS1) Marostegui: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872
[09:20:06] (PS4) Muehlenhoff: Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:20:53] (CR) Muehlenhoff: [C: 2] Refresh mobrovac's SSH keys (step 2/2) [puppet] - https://gerrit.wikimedia.org/r/426007 (owner: Mobrovac)
[09:23:35] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=achernar.wikimedia.org,service=pdns_recursor
[09:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:43] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4132972 (Vgutierrez)
[09:27:50] (PS1) Vgutierrez: install_server: Reimage acamar as stretch [puppet] - https://gerrit.wikimedia.org/r/426876 (https://phabricator.wikimedia.org/T187090)
[09:29:41] (CR) Vgutierrez: [C: 2] install_server: Reimage acamar as stretch [puppet] - https://gerrit.wikimedia.org/r/426876 (https://phabricator.wikimedia.org/T187090) (owner: Vgutierrez)
[09:29:52] (CR) Giuseppe Lavagetto: Add a simple NOTES.txt template to scaffolding (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/426074 (owner: Alexandros Kosiaris)
[09:32:14] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 4.180 second response time
[09:32:53] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 80394 bytes in 0.085 second response time
[09:32:53] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time
[09:40:45] !log restarting dbstore2001:s8 to increase the number of purge threads
[09:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:32] Operations, Cassandra, RESTBase-Cassandra, Services (next), and 2 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659#4132987 (fgiunchedi)
[09:43:20] !log rolling restart of wdqs100[35] and wdqs200[123] for kernel upgrade completed
[09:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:04] PROBLEM - DPKG on dbstore2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:46:04] RECOVERY - DPKG on dbstore2001 is OK: All packages OK
[09:49:54] !log Depool and reimage acamar as stretch - T187090
[09:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:00] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[09:50:08] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org,service=pdns_recursor
[09:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] (CR) Marostegui: [C: 2] db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:52:35] (Merged) jenkins-bot: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:52:50] (CR) jenkins-bot: db-eqiad.php: Increase API weight for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/426872 (owner: Marostegui)
[09:54:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 00m 58s)
[09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:21] Operations, Traffic, Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133017 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` acamar.wikimedia.org ``` The log can be found in `/var/log/wmf-auto-...
[10:02:40] (CR) Filippo Giunchedi: [C: 1] "LGTM, see inline nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: Gilles)
[10:03:03] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:12] Operations, Performance-Team, Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4133027 (fgiunchedi) >>! In T169249#4130711, @Gilles wrote: > I don't think it's flamegraph.pl's fault, the issue is with the last line of the log...
[10:03:16] ^^ acamar being reimaged
[10:03:45] (CR) Filippo Giunchedi: [C: 1] Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:05:34] (CR) Filippo Giunchedi: [C: 1] Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:05:44] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[10:07:05] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546)
[10:07:53] ACKNOWLEDGEMENT - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100% Ema acamar being reimaged
[10:08:07] thx ema <3
[10:08:24] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call Ema acamar being reimaged
[10:09:16] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:10:29] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:10:52] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/426886 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:12:25] vgutierrez: pleasure! Shouldn't those be auto-ack'ed by the reimage script?
[10:14:52] ema: the reimage script downtimes the host and all the services, but those two are not defined under the host running pdns-recursor
[10:16:38] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:426886|Bumping portals to master (T128546)]] (duration: 00m 59s)
[10:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:44] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:17:37] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:426886|Bumping portals to master (T128546)]] (duration: 00m 58s)
[10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:35] !log upload prometheus-memcached-exporter to stretch-wikimedia - T189056
[10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:41] T189056: import prometheus-memcached-exporter into wikimedia-stretch - https://phabricator.wikimedia.org/T189056
[10:21:43] (03PS1) 10ArielGlenn: show the stacktrace for errors from dump job run in most cases [dumps] - 10https://gerrit.wikimedia.org/r/426888 (https://phabricator.wikimedia.org/T191177)
[10:22:15] (03CR) 10ArielGlenn: [C: 032] show the stacktrace for errors from dump job run in most cases [dumps] - 10https://gerrit.wikimedia.org/r/426888 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn)
[10:23:34] !log ariel@tin Started deploy [dumps/dumps@4706d30]: show full stacktrace for dump job errors
[10:23:38] !log ariel@tin Finished deploy [dumps/dumps@4706d30]: show full stacktrace for dump job errors (duration: 00m 04s)
[10:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:39] (03PS1) 10Muehlenhoff: Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890
[10:29:57] 10Operations, 10Dumps-Generation, 10Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4133107 (10ArielGlenn) I have gone back through the 'no such file' errors for the past 5 months. The vast majority are stubs; a few...
[10:30:35] (03CR) 10Alexandros Kosiaris: Add a simple NOTES.txt template to scaffolding (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074 (owner: 10Alexandros Kosiaris)
[10:31:36] (03PS2) 10Muehlenhoff: Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890
[10:32:03] (03PS2) 10Alexandros Kosiaris: Add a simple NOTES.txt template to scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074
[10:32:04] (03PS2) 10Alexandros Kosiaris: Add a NOTES.txt template for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/426075
[10:32:07] (03PS2) 10Alexandros Kosiaris: Update the helm charts repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/426076
[10:32:17] (03CR) 10Muehlenhoff: [C: 032] Reimage mw1299 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/426890 (owner: 10Muehlenhoff)
[10:36:14] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133111 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['acamar.wikimedia.org'] ``` Of which those **FAILED**: ``` ['acamar.wikimedia.org'] ```
[10:43:21] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162)
[10:43:52] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez)
[10:48:23] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162)
[10:50:59] !log reimaging mw1299 (job runner) to stretch
[10:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:25] (03CR) 10Arturo Borrero Gonzalez: "The change is NOOP for hosts labstore1004.eqiad.wmnet, labstore1005.eqiad.wmnet, labpuppetmaster1001.wikimedia.org, labpuppetmaster1002.wi" [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez)
[11:06:59] (03PS1) 10Marostegui: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893
[11:08:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, see improvement inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:08:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:10:05] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:10:12] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: PING CRITICAL - Packet loss = 100%
[11:10:22] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426893 (owner: 10Marostegui)
[11:11:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1114 original weight (duration: 00m 59s)
[11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:11] PROBLEM - Disk space on acamar is CRITICAL: Return code of 255 is out of bounds
[11:12:22] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:17:21] PROBLEM - MD RAID on acamar is CRITICAL: Return code of 255 is out of bounds
[11:19:14] (03PS4) 10Gilles: Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:19:31] (03CR) 10Gilles: Remove obsolete imagescaler logic from swift proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[11:20:42] PROBLEM - Bird Internet Routing Daemon on acamar is CRITICAL: Return code of 255 is out of bounds
[11:22:31] PROBLEM - CPU frequency on acamar is CRITICAL: Return code of 255 is out of bounds
[11:23:05] (03PS2) 10Gilles: Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249)
[11:23:06] (03CR) 10Gilles: Make xenon-log line-buffered (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[11:24:11] PROBLEM - configured eth on acamar is CRITICAL: Return code of 255 is out of bounds
[11:24:11] PROBLEM - Check size of conntrack table on acamar is CRITICAL: Return code of 255 is out of bounds
[11:25:52] PROBLEM - Check systemd state on acamar is CRITICAL: Return code of 255 is out of bounds
[11:25:52] PROBLEM - dhclient process on acamar is CRITICAL: Return code of 255 is out of bounds
[11:27:41] PROBLEM - Check whether ferm is active by checking the default input chain on acamar is CRITICAL: Return code of 255 is out of bounds
[11:27:41] PROBLEM - puppet last run on acamar is CRITICAL: Return code of 255 is out of bounds
[11:31:55] PROBLEM - IPMI Sensor Status on acamar is CRITICAL: Return code of 255 is out of bounds
[11:33:35] PROBLEM - Long running screen/tmux on acamar is CRITICAL: Return code of 255 is out of bounds
[11:35:45] PROBLEM - Recursive DNS on 2620:0:860:1:208:80:153:12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:35:55] PROBLEM - Nginx local proxy to apache on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:37:05] PROBLEM - MegaRAID on acamar is CRITICAL: Return code of 255 is out of bounds
[11:41:25] PROBLEM - DPKG on acamar is CRITICAL: Return code of 255 is out of bounds
[11:42:35] PROBLEM - mediawiki-installation DSH group on mw1299 is CRITICAL: Host mw1299 is not in mediawiki-installation dsh group
[11:43:25] RECOVERY - Check size of conntrack table on acamar is OK: OK: nf_conntrack is 0 % full
[11:43:25] RECOVERY - configured eth on acamar is OK: OK - interfaces up
[11:43:25] RECOVERY - DPKG on acamar is OK: All packages OK
[11:43:26] RECOVERY - Disk space on acamar is OK: DISK OK
[11:43:28] ^ reimage race, silencing
[11:43:35] RECOVERY - CPU frequency on acamar is OK: OK: CPU frequency is = 600 MHz (1199 MHz)
[11:43:36] RECOVERY - Recursive DNS on 2620:0:860:1:208:80:153:12 is OK: DNS OK: 0.086 seconds response time. www.wikipedia.org returns 208.80.154.224
[11:43:45] RECOVERY - Check whether ferm is active by checking the default input chain on acamar is OK: OK ferm input default policy is set
[11:43:45] RECOVERY - Recursive DNS on 208.80.153.12 is OK: DNS OK: 0.044 seconds response time. www.wikipedia.org returns 208.80.154.224
[11:43:55] RECOVERY - Bird Internet Routing Daemon on acamar is OK: PROCS OK: 1 process with command name bird
[11:43:56] RECOVERY - dhclient process on acamar is OK: PROCS OK: 0 processes with command name dhclient
[11:43:56] RECOVERY - Check systemd state on acamar is OK: OK - running: The system is fully operational
[11:44:26] RECOVERY - MD RAID on acamar is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[11:47:05] RECOVERY - MegaRAID on acamar is OK: OK: no disks configured for RAID
[11:48:05] RECOVERY - Long running screen/tmux on acamar is OK: OK: No SCREEN or tmux processes detected.
[11:48:05] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:48:06] RECOVERY - IPMI Sensor Status on acamar is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[11:50:33] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org,service=pdns_recursor
[11:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:55] RECOVERY - Nginx local proxy to apache on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.574 second response time
[11:55:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[11:55:43] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133269 (10Vgutierrez)
[11:55:46] (03PS4) 10Gehel: maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243)
[11:59:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[12:00:24] (03PS1) 10Vgutierrez: install_server: Reimage hydrogen as stretch [puppet] - 10https://gerrit.wikimedia.org/r/426894 (https://phabricator.wikimedia.org/T187090)
[12:00:26] (03PS1) 10Vgutierrez: Remove hydrogen from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090)
[12:01:29] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:02:11] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage hydrogen as stretch [puppet] - 10https://gerrit.wikimedia.org/r/426894 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:02:20] (03CR) 10Vgutierrez: [C: 032] Remove hydrogen from eqiad LVS name server config [puppet] - 10https://gerrit.wikimedia.org/r/426895 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez)
[12:11:39] !log Depool and reimage hydrogen as stretch - T187090
[12:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:45] T187090: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090
[12:12:14] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
[12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:36] PROBLEM - Host 2620:0:861:1:208:80:154:50 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:09] ^ hydrogen being reimaged
[12:20:05] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[12:21:48] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:27:01] (03PS1) 10Ema: Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014)
[12:32:28] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:43:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:44:06] lovely
[12:46:17] so it seems that the spike was temporary
[12:46:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:47:34] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133388 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['hydrogen.wikimedia.org'] ``` Of which those **FAILED**: ``` ['hydrogen.wikimedia.org'] ```
[12:51:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[12:52:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[12:53:07] elukey: yeah, text_esams only it seems
[12:55:44] (03PS1) 10Giuseppe Lavagetto: deployment-prep: switch jobrunner02/03 [puppet] - 10https://gerrit.wikimedia.org/r/426904 (https://phabricator.wikimedia.org/T192071)
[12:56:49] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment-prep: switch jobrunner02/03 [puppet] - 10https://gerrit.wikimedia.org/r/426904 (https://phabricator.wikimedia.org/T192071) (owner: 10Giuseppe Lavagetto)
[12:57:37] (03CR) 10Filippo Giunchedi: [C: 031] maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel)
[12:58:09] (03CR) 10Alexandros Kosiaris: [C: 032] Remove all namespace directives [deployment-charts] - 10https://gerrit.wikimedia.org/r/426072 (owner: 10Alexandros Kosiaris)
[12:58:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove all namespace directives [deployment-charts] - 10https://gerrit.wikimedia.org/r/426072 (owner: 10Alexandros Kosiaris)
[12:58:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mathoid: Dump all namespace definitions from manifests [deployment-charts] - 10https://gerrit.wikimedia.org/r/426073 (owner: 10Alexandros Kosiaris)
[12:58:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a simple NOTES.txt template to scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/426074 (owner: 10Alexandros Kosiaris)
[12:58:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add a NOTES.txt template for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/426075 (owner: 10Alexandros Kosiaris)
[12:58:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update the helm charts repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/426076 (owner: 10Alexandros Kosiaris)
[13:00:21] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991)
[13:01:44] PROBLEM - MD RAID on hydrogen is CRITICAL: Return code of 255 is out of bounds
[13:03:33] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: No response from NTP server
[13:07:03] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:08:12] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:08:18] mobrovac: could you please have a look at https://phabricator.wikimedia.org/T192198 ? It's pretty bad for us
[13:08:44] PROBLEM - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:10:21] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991)
[13:10:56] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - 10https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:11:04] (03PS1) 10Jcrespo: mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906
[13:11:43] RECOVERY - Recursive DNS on 2620:0:861:1:7a2b:cbff:fe09:c21 is OK: DNS OK: 0.094 seconds response time. www.wikipedia.org returns 208.80.154.224
[13:11:54] jouncebot: next
[13:11:54] In 22 hour(s) and 48 minute(s): Page Previews roll-out to enwiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180417T1200)
[13:12:03] RECOVERY - Recursive DNS on 208.80.154.50 is OK: DNS OK: 0.011 seconds response time. www.wikipedia.org returns 208.80.154.224
[13:12:34] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.000319 secs
[13:13:04] (03PS1) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:13:17] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996)
[13:13:21] (03PS2) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:13:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910
[13:13:40] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910
[13:13:54] PROBLEM - Host 2620:0:861:1:7a2b:cbff:fe09:c21 is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:54] RECOVERY - MD RAID on hydrogen is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:14:58] (03PS1) 10Ppchelko: Revert switching the ChangeNotification job. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426911 (https://phabricator.wikimedia.org/T192198)
[13:15:06] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:15:08] (03PS2) 10Jcrespo: mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906
[13:15:16] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Kibana fails to load when using short URLs to share dashboard - https://phabricator.wikimedia.org/T192279#4133460 (10Gehel)
[13:16:12] ^marostegui
[13:16:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:16:39] jynus: checking
[13:16:50] guill. change, not mine
[13:16:54] *task
[13:16:57] ah
[13:17:19] cool :)
[13:17:32] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of es1017 for upgrade to stretch/MariaDB 10.1 [puppet] - 10https://gerrit.wikimedia.org/r/426906 (owner: 10Jcrespo)
[13:18:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 59s)
[13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:01] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991)
[13:19:05] (03PS1) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[13:19:28] (03PS5) 10Filippo Giunchedi: Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[13:19:30] (03PS3) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:19:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426910 (owner: 10Marostegui)
[13:19:42] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133478 (10Marostegui) After the last reboot the errors have moved from being at times like: XX:10:11 XX:20:11 XX:30:11 To: XX:04:11 XX:24:11 XX:34:11
[13:20:11] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) p:05Triage>03Normal
[13:20:42] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280#4133479 (10Vgutierrez) SMART info about sda: ```root@hydrogen:~# smartctl -a /dev/sda smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build) Copyright (C) 2002-16, Bruce A...
[13:21:33] (03CR) 10Filippo Giunchedi: [C: 032] Remove obsolete imagescaler logic from swift proxy [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff)
[13:21:35] (03PS1) 10Ema: Add fake trusted_proxy.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014)
[13:22:48] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133504 (10Marostegui)
[13:23:37] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, and 2 others: Upgrade deployment-prep appserver fleet to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T192071#4133505 (10Joe) All the main servers have been substituted with stretch VMs; the only one remaining turne...
[13:24:19] (03PS2) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996)
[13:25:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:26:31] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908 (owner: 10Jcrespo)
[13:27:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:27:56] (03PS4) 10Jcrespo: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908
[13:28:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 54s)
[13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:54] !log roll-restart swift-proxy in codfw and eqiad - T188062
[13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:00] T188062: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062
[13:31:13] !log Stop MySQL on db1114 to reboot with another kernel - T191996
[13:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:19] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[13:31:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426909 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:32:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1017 (duration: 00m 58s)
[13:32:04] (03PS2) 10Ema: Add fake trusted_proxies.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014)
[13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:27] (03CR) 10Ema: [V: 032 C: 032] Add fake trusted_proxies.json [labs/private] - 10https://gerrit.wikimedia.org/r/426913 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema)
[13:41:08] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4133555 (10Marostegui) As another test to discard issues - I have rebooted db1114 with an older kernel. So it is now running ``` root@db1114:~# uname -a Linux db1114 4.9.0...
[13:41:39] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996)
[13:42:11] (03CR) 10Ema: [C: 032] Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema)
[13:42:17] (03PS2) 10Ema: Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014)
[13:44:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:46:02] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[13:46:40] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133571 (10Niedzielski) I think this task is still being worked on but in case it helps, here's another report from the Obama page this morning: ``` Request from 73.252.38.2...
[13:47:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 58s)
[13:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:19] (03PS1) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[13:49:47] (03CR) 10jerkins-bot: [V: 04-1] Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[13:51:07] (03PS1) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014)
[13:51:11] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133579 (10Joe) @Niedzielski interstingly, When requiring the `/summary/precambrian` page, I see a successful request to the API cluster, so the error is not a 503 on the par...
[13:51:44] (03PS2) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[13:53:46] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4133584 (10Joe)
[13:53:52] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, and 2 others: Upgrade deployment-prep appserver fleet to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T192071#4133583 (10Joe) 05Open>03Resolved
[13:55:33] (03PS1) 10Jcrespo: mariadb-autoinstall: Allow reimage of es2016 and es2017 [puppet] - 10https://gerrit.wikimedia.org/r/426922
[13:56:08] (03CR) 10Jcrespo: [C: 032] mariadb-autoinstall: Allow reimage of es2016 and es2017 [puppet] - 10https://gerrit.wikimedia.org/r/426922 (owner: 10Jcrespo)
[13:59:15] (03PS1) 10Marostegui: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923
[14:01:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[14:01:52] (03PS47) 10Giuseppe Lavagetto: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz)
[14:02:16] (03Merged) 10jenkins-bot: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[14:03:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore main traffic original weight for db1114 (duration: 00m 58s)
[14:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 (owner: 10Aaron Schulz)
[14:04:10] (03PS4) 10Gehel: wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766)
[14:05:16] (03CR) 10Ottomata: [C: 031] "Nit: add # Note: This file is managed by puppet at the top?" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:05:34] !log restarted Jenkins for plugin upgrade T192261
[14:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:52] (03CR) 10Gehel: [C: 032] wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel)
[14:06:05] (03CR) 10Elukey: "> Nit: add # Note: This file is managed by puppet at the top?" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey)
[14:06:38] !log start reimage of es2-codfw master, es2016
[14:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:59] (03PS3) 10Elukey: Ensure existence of environment conf file [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924)
[14:07:07] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924
[14:08:33] Lydia_WMDE: looking
[14:08:48] mobrovac: thank you!
[14:09:02] seems a revert patch has been uploaded now which is awesome
[14:09:40] (03PS3) 10Gehel: wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766)
[14:11:08] (03CR) 10Gehel: [C: 032] wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel)
[14:11:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[14:12:11] _joe_: unmerged puppet change (something about mcrouter), should I merge?
[14:12:21] !log upgraded HHVM on mediawiki-deployment-09 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)
[14:12:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes reponds with 5xx - https://phabricator.wikimedia.org/T192287#4133655 (10Niedzielski)
[14:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:27] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[14:12:31] (03PS1) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925
[14:12:36] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[14:12:39] <_joe_> gehel: yeah sorry
[14:12:47] <_joe_> it's labs-only, I forgot
[14:12:47] _joe_: np
[14:13:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 00m 58s)
[14:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:46] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[14:15:18] (03PS14) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732)
[14:15:52] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans)
[14:16:07] Lydia_WMDE: Amir1: have you taken a look at T192085 perhaps? that one looks like the root cause
[14:16:07] T192085: PHP Fatal in AffectedPagesFinder::getChangedAspects - https://phabricator.wikimedia.org/T192085
[14:16:13] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133687 (10Niedzielski) Could the summary endpoint issue be a network or caching problem related to this task? I was wondering because it seems like the Node.js service is is...
[14:16:30] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991)
[14:16:57] (03PS1) 10Gehel: wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766)
[14:17:24] mobrovac: we'll have a look. thank you
[14:17:41] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:17:47] ^^^ I'll merge the unmerged puppet changes in a minute, there is an issue with the last one...
[14:17:56] (03CR) 10Vgutierrez: [C: 031] wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:18:12] (03PS2) 10Gehel: wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) [14:18:25] gehel: ok, feel free to merge my ircd auto restart patch along [14:18:26] moritzm: could you hold your puppet merge for a second? [14:18:29] sure [14:18:31] moritzm: thanks! [14:18:46] (03CR) 10Gehel: [C: 032] wdqs-internal: new service discovery entry [puppet] - 10https://gerrit.wikimedia.org/r/426926 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:19:08] (03PS2) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [14:19:28] moritzm: merged [14:19:47] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [14:20:00] ack [14:20:52] mobrovac: that part of the code (and practically everywhere around it) hasn't been touched for over a month now [14:21:11] it doesn't match time-wise with any deployment too [14:21:12] (03PS1) 10Marostegui: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 [14:21:31] Amir1: but the code is there and produces a fatal [14:21:46] it's more of an edge case that definitely needs to be taken care of but it can't bring down all of the jobqueue [14:22:22] the jobqueue is not down [14:22:33] does it happen on every single job insertion? 
[14:22:44] job = RC injection job [14:22:56] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 41 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:23:11] yeah, I mean all of the rc jobs [14:23:47] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 30 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:24:00] mmmmm [14:24:05] that's expected [14:24:16] a new service got configured but pybal hasn't been restarted yet [14:24:40] ah ok thanks for the explanation, the critical looks scary without context :D [14:24:49] looking Amir1 [14:25:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui) [14:25:38] !log restarting pybal on lvs2006 - T187766 [14:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:44] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:26:05] (03PS1) 10Ottomata: Re-enable dumps/other fetcher rsync job, simplify jobs [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) [14:26:16] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133745 (10Mholloway) @Niedzielski I'll look into this on the mobileapps side today. It's possible there's a problem in the config for the beta cluster. Just to be clear, i... 
[14:27:29] (03Merged) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui) [14:28:24] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10938/labstore1007.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:28:27] (03CR) 10Ottomata: [C: 032] Re-enable dumps/other fetcher rsync job, simplify jobs [puppet] - 10https://gerrit.wikimedia.org/r/426928 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:28:32] PROBLEM - pybal on lvs2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:28:32] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [14:28:57] ^^ everything under control :) [14:29:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 00m 57s) [14:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:08] PROBLEM - Host wdqs-internal.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:30:28] gehel: not yet in prod right? [14:30:41] volans: yep, new service, we're on it with vgutierrez [14:30:47] ok great [14:30:48] ack [14:30:54] nit for the next time - turn off paging [14:31:16] (03PS15) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [14:31:18] (03PS1) 10Ottomata: Remove trailing / from rsync locations [puppet] - 10https://gerrit.wikimedia.org/r/426929 (https://phabricator.wikimedia.org/T189283) [14:31:22] (and enable it when the service is working fine) [14:31:44] elukey: except the icinga check was just created... 
[14:31:56] (03CR) 10Ottomata: [C: 032] Remove trailing / from rsync locations [puppet] - 10https://gerrit.wikimedia.org/r/426929 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:32:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on baham is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs-internal.state [14:32:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on eeden is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs-internal.state [14:32:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wdqs-internal [14:33:10] gehel: IIRC there is a hiera config to avoid paging for an LVS endpoint [14:33:17] oof -- I didn't get paged [14:33:27] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4133786 (10Niedzielski) Thanks @mholloway. This task is specific to the [[ https://en.wikipedia.beta.wmflabs.org/wiki/Barack_Obama | Barack Obama ]] page. The page summary is... 
[14:33:32] RECOVERY - pybal on lvs2006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [14:33:33] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:33:33] elukey: remind me to look into it once this is fixed :) [14:33:35] (03PS2) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:34:03] RECOVERY - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on baham is OK: No errors detected [14:34:12] RECOVERY - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on eeden is OK: No errors detected [14:34:12] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 30 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:34:12] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster1001 is OK: No errors detected [14:35:18] RECOVERY - Host wdqs-internal.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [14:35:34] lolz, sms came in just now [14:37:27] PROBLEM - PyBal connections to etcd on lvs1003 is CRITICAL: CRITICAL: 41 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:38:28] (03PS1) 10Marostegui: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 [14:38:47] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 31 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:38:49] (03PS1) 10Ottomata: Add $delete parameter to dumps::web::fetches::job [puppet] - 10https://gerrit.wikimedia.org/r/426931 (https://phabricator.wikimedia.org/T189283) [14:39:32] (03CR) 10Ottomata: [C: 032] Add $delete parameter to dumps::web::fetches::job [puppet] - 10https://gerrit.wikimedia.org/r/426931 (https://phabricator.wikimedia.org/T189283) (owner: 10Ottomata) [14:39:34] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector:
cluster=wdqs-internal [14:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui) [14:41:52] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui) [14:42:00] !log restart pybal on lvs1006 - T187766 [14:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:42:37] RECOVERY - mediawiki-installation DSH group on mw1299 is OK: OK [14:42:47] (03PS1) 10Jcrespo: mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) [14:42:57] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:43:06] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/426918 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [14:43:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 in API - T191996 (duration: 00m 58s) [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:44:04] (03PS2) 10Jcrespo: mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) [14:44:27] (03PS3) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [14:44:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: 
[Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4133813 (10ovasileva) p:05Triage>03High [14:45:40] (03PS3) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:45:43] (03CR) 10Jcrespo: [C: 032] mariadb: Move es2016 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426933 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:45:58] (03PS1) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [14:46:27] (03PS4) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:48:23] (03PS5) 10Ema: VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) [14:49:02] (03CR) 10Ema: [C: 032] VCL: use trusted_proxies netmapper database [puppet] - 10https://gerrit.wikimedia.org/r/426920 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema) [14:49:41] !log restart pybal on lvs2003 - T187766 [14:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:47] T187766: Install / configure new WDQS servers - https://phabricator.wikimedia.org/T187766 [14:49:47] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [14:49:57] (03PS2) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [14:49:58] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time [14:49:58] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 80385
bytes in 0.085 second response time [14:53:13] !log restart pybal on lvs1003 - T187766 [14:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:17] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 31 connections established with conf2001.codfw.wmnet:2379 (min=31) [14:57:27] RECOVERY - PyBal connections to etcd on lvs1003 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [14:58:49] (03CR) 10DCausse: [C: 031] kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279) (owner: 10Gehel) [14:59:04] (03PS1) 10Vgutierrez: Revert "Remove hydrogen from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/426940 (https://phabricator.wikimedia.org/T187090) [15:01:07] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=hydrogen.wikimedia.org,service=pdns_recursor [15:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:08] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[librenms/librenms] [15:03:25] (03CR) 10Vgutierrez: [C: 032] Revert "Remove hydrogen from eqiad LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/426940 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [15:04:55] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4133869 (10Vgutierrez) [15:04:59] (03PS1) 10Ottomata: Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) [15:05:52] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 5 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4133871 (10ema) >>! 
In T187014#4129884, @Nuria wrote: > +1 let me know when it is in place and i can help check things square again on... [15:07:15] !log start reimage of es3-codfw master, es2017 [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:59] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:09:35] (03PS2) 10Ottomata: Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) [15:09:50] (03CR) 10Ottomata: [V: 032 C: 032] Point refine job at 0.0.62 jar version [puppet] - 10https://gerrit.wikimedia.org/r/426943 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [15:10:24] (03CR) 10Nuria: [C: 031] Add varnish::trusted_proxies [puppet] - 10https://gerrit.wikimedia.org/r/426896 (https://phabricator.wikimedia.org/T187014) (owner: 10Ema) [15:14:03] (03PS3) 10Jcrespo: mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) [15:14:41] (03PS3) 10Ottomata: Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) [15:17:42] (03CR) 10Jcrespo: [C: 032] mariadb: Move es2017 socket location away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/426937 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:18:06] (03CR) 10Muehlenhoff: [C: 031] "That seems fine, I'll merge, build and import tomorrow." 
[debs/tidy-0.99] - 10https://gerrit.wikimedia.org/r/425257 (https://phabricator.wikimedia.org/T191771) (owner: 10Hashar) [15:21:37] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4133923 (10Addshore) [15:25:12] !log upgraded HHVM on mediawiki-deployment-07 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854) [15:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:18] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [15:27:59] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:28:06] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2a720fc]: Log HTML for PHP fatal errors from MW [15:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:07] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2a720fc]: Log HTML for PHP fatal errors from MW (duration: 01m 01s) [15:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] (03CR) 10Lucas Werkmeister (WMDE): "suggestion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [15:46:01] PROBLEM - HHVM rendering on mw2262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.433 second response time [15:46:01] PROBLEM - HHVM rendering on mw2252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.480 second response time [15:46:01] PROBLEM - HHVM rendering on mw2236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.761 second response time [15:46:01] PROBLEM - HHVM rendering on mw2227 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.805 second response time [15:46:01] PROBLEM - HHVM rendering on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.065 second response time [15:46:01] PROBLEM - HHVM rendering on mw2229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.140 second response time [15:46:01] PROBLEM - HHVM rendering on mw2275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.381 second response time [15:46:02] PROBLEM - HHVM rendering on mw2284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.449 second response time [15:46:02] PROBLEM - HHVM rendering on mw2283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.699 second response time [15:46:03] PROBLEM - HHVM rendering on mw2231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.792 second response time [15:46:03] PROBLEM - HHVM rendering on mw2235 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.023 second response time [15:46:04] PROBLEM - HHVM rendering on mw2139 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.099 second response time [15:46:04] PROBLEM - HHVM rendering on mw2251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.361 second response time [15:46:05] PROBLEM - HHVM rendering on mw2165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.405 second response time [15:46:10] PROBLEM - HHVM rendering on mw2288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.680 second response time [15:46:10] PROBLEM - HHVM rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 2.747 second response time [15:46:10] PROBLEM - HHVM rendering on mw2285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal 
Server Error - 1527 bytes in 2.017 second response time [15:46:10] PROBLEM - HHVM rendering on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.048 second response time [15:46:10] PROBLEM - HHVM rendering on mw2211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.455 second response time [15:46:11] PROBLEM - HHVM rendering on mw2254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.774 second response time [15:46:22] PROBLEM - HHVM rendering on mw2188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.090 second response time [15:46:22] PROBLEM - HHVM rendering on mw2237 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.413 second response time [15:46:23] PROBLEM - HHVM rendering on mw2196 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.423 second response time [15:46:23] PROBLEM - HHVM rendering on mw2193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.698 second response time [15:46:24] PROBLEM - HHVM rendering on mw2138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.180 second response time [15:46:24] PROBLEM - HHVM rendering on mw2146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.667 second response time [15:46:25] PROBLEM - HHVM rendering on mw2287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.750 second response time [15:46:25] PROBLEM - HHVM rendering on mw2141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.969 second response time [15:46:26] PROBLEM - HHVM rendering on mw2192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 1.039 second response time [15:46:26] PROBLEM - HHVM rendering on mw2137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 
bytes in 1.279 second response time [15:46:27] PROBLEM - HHVM rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.374 second response time [15:46:27] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.568 second response time [15:46:28] PROBLEM - HHVM rendering on mw2143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.700 second response time [15:46:28] PROBLEM - HHVM rendering on mw2233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.839 second response time [15:46:29] PROBLEM - HHVM rendering on mw2182 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.938 second response time [15:46:29] PROBLEM - HHVM rendering on mw2286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.153 second response time [15:46:30] PROBLEM - HHVM rendering on mw2145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.249 second response time [15:46:30] PROBLEM - HHVM rendering on mw2183 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1527 bytes in 1.484 second response time [15:47:37] (03PS4) 10Alexandros Kosiaris: kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 [15:49:39] * akosiaris looking ^ [15:49:49] PROBLEM - HHVM rendering on mw2197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1526 bytes in 0.384 second response time [15:49:52] could be my reimage [15:50:29] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [15:50:37] jynus: all those hosts ? 
[15:50:54] I am reimaging 1 content master [15:50:56] doubtful, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162 list 130+ problematic HHVM [15:51:43] from /var/log/apache2/other_vhosts_access.log on mw2252 api.php is 200 but /wiki/Main_Page yields a 500 [15:53:10] !log restart hhvm on mw2252 [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:16] let's see if that helps [15:54:02] akosiaris: let me start mysql [15:54:05] and see [15:54:18] jynus: it might indeed be what you say, so yes please do [15:54:25] mediawiki-errors for mw2252 https://logstash.wikimedia.org/goto/c5d4766387219b396a04734d52c4e2a2 [15:54:29] RECOVERY - HHVM rendering on mw2138 is OK: HTTP OK: HTTP/1.1 200 OK - 80438 bytes in 4.328 second response time [15:54:30] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.262 second response time [15:54:30] RECOVERY - HHVM rendering on mw2287 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.267 second response time [15:54:30] RECOVERY - HHVM rendering on mw2143 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.273 second response time [15:54:30] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.277 second response time [15:54:30] RECOVERY - HHVM rendering on mw2145 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.284 second response time [15:54:30] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 80436 bytes in 0.285 second response time [15:54:31] RECOVERY - HHVM rendering on mw2286 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:31] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:32] RECOVERY - HHVM rendering on mw2183 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:54:32] RECOVERY - HHVM 
rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:54:33] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:54:33] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.278 second response time [15:54:34] showing some dbconnection / dbreplication errors [15:54:34] RECOVERY - HHVM rendering on mw2166 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.276 second response time [15:54:36] lol [15:54:45] RECOVERY - HHVM rendering on mw2268 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.263 second response time [15:54:45] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:54:46] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.280 second response time [15:54:46] RECOVERY - HHVM rendering on mw2167 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.293 second response time [15:54:47] jynus: and I am guessing you just did ? 
:P [15:54:47] RECOVERY - HHVM rendering on mw2271 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.257 second response time [15:54:47] RECOVERY - HHVM rendering on mw2163 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.281 second response time [15:54:48] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time [15:54:49] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.275 second response time [15:54:49] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:54:59] RECOVERY - HHVM rendering on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time [15:54:59] RECOVERY - HHVM rendering on mw2168 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time [15:54:59] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.269 second response time [15:54:59] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.288 second response time [15:55:00] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time [15:55:00] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time [15:55:00] RECOVERY - HHVM rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.280 second response time [15:55:00] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time [15:55:01] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.260 second response time [15:55:01] RECOVERY - HHVM rendering on mw2224 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.266 second response time [15:55:02] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time [15:55:02] RECOVERY - HHVM 
rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time
[15:55:03] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:03] RECOVERY - HHVM rendering on mw2238 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.259 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.270 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.269 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.272 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2275 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.266 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:19] RECOVERY - HHVM rendering on mw2229 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time
[15:55:20] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.271 second response time
[15:55:20] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.275 second response time
[15:55:21] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time
[15:55:21] RECOVERY - HHVM rendering on mw2283 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.265 second response time
[15:55:22] RECOVERY - HHVM rendering on mw2288 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.258 second response time
[15:55:36] akosiaris, godog: https://phabricator.wikimedia.org/T180918
[15:56:13] jynus: heh, so "known problem" :|
[15:56:24] lol, ok
[16:00:16] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.273 second response time
[16:01:16] (03PS2) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[16:01:53] (03CR) 10Alexandros Kosiaris: [C: 032] kubeconfig: Allow setting the namespace for a context [puppet] - 10https://gerrit.wikimedia.org/r/426925 (owner: 10Alexandros Kosiaris)
[16:02:06] (03PS3) 10Gehel: kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279)
[16:02:47] (03CR) 10Gehel: [C: 032] kibana: fix short URL issue [puppet] - 10https://gerrit.wikimedia.org/r/426912 (https://phabricator.wikimedia.org/T192279) (owner: 10Gehel)
[16:03:37] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.267 second response time
[16:03:37] RECOVERY - HHVM rendering on mw2258 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.277 second response time
[16:04:06] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.270 second response time
[16:04:06] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 80446 bytes in 0.274 second response time
[16:06:17] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Kibana fails to load when using short URLs to share dashboard - https://phabricator.wikimedia.org/T192279#4134072 (10Gehel) 05Open>03Resolved a:03Gehel Ugly fix is deployed. We might come back and remove it if...
[16:11:33] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4134090 (10Marostegui)
[16:14:54] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4134111 (10mforns) 05Open>03Resolved a:03mforns
[16:26:11] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#4134131 (10mforns) p:05Normal>03Low
[16:54:25] (03PS1) 10Andrew Bogott: Keystone: replace wmfkeystonehooks in Mitaka config [puppet] - 10https://gerrit.wikimedia.org/r/426956 (https://phabricator.wikimedia.org/T192304)
[16:56:29] (03CR) 10Andrew Bogott: [C: 032] Keystone: replace wmfkeystonehooks in Mitaka config [puppet] - 10https://gerrit.wikimedia.org/r/426956 (https://phabricator.wikimedia.org/T192304) (owner: 10Andrew Bogott)
[17:00:56] (03PS1) 10Muehlenhoff: Add component/zookeeper349 [puppet] - 10https://gerrit.wikimedia.org/r/426957
[17:11:43] !log upgraded HHVM on mediawiki-jobrunner03 to a build with a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)
[17:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:51] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[17:12:57] (03CR) 10Elukey: [C: 031] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/426957 (owner: 10Muehlenhoff)
[17:13:49] (03CR) 10Muehlenhoff: [C: 032] Add component/zookeeper349 [puppet] - 10https://gerrit.wikimedia.org/r/426957 (owner: 10Muehlenhoff)
[17:29:11] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#4134323 (10AlexisJazz)
[17:30:25] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#1945898 (10AlexisJazz) Copied from my duplicate task: https://commons.wikimedia.org/wiki/File:Accuracy_Int...
[17:52:14] (03PS4) 10Ottomata: Blacklist mediawiki.job and change-prop from cross DC main <-> main mirror [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005)
[17:56:44] (03CR) 10Ottomata: [C: 032] Blacklist mediawiki.job and change-prop from cross DC main <-> main mirror [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata)
[18:03:24] !log restarting main <-> main DC kafka mirror maker instances to blacklist job and cp topics T190940 T167039
[18:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:30] T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039
[18:03:31] T190940: Use --new.consumer for main codfw <-> eqiad Kafka MirrorMaker - https://phabricator.wikimedia.org/T190940
[18:03:51] (03PS9) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504)
[18:03:53] (03PS7) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504)
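The change logged above blacklists mediawiki.job and change-prop topics from the cross-DC main <-> main Kafka mirror. MirrorMaker expresses such exclusions as a topic blacklist regex; the sketch below illustrates that filtering idea in Python. The pattern and topic names here are invented for illustration and are not the actual regex from the puppet patch.

```python
import re

# Hypothetical blacklist in the spirit of the change above: exclude
# mediawiki.job.* and change-prop.* topics from cross-DC mirroring.
BLACKLIST = re.compile(r"(eqiad|codfw)\.(mediawiki\.job\..*|change-prop\..*)")

def should_mirror(topic):
    """Mirror a topic only if it does not match the blacklist regex."""
    return BLACKLIST.fullmatch(topic) is None

topics = [
    "eqiad.mediawiki.job.refreshLinks",
    "eqiad.mediawiki.revision-create",
    "codfw.change-prop.retry.mediawiki.job.cdnPurge",
]
mirrored = [t for t in topics if should_mirror(t)]
# Only the revision-create topic survives the blacklist.
```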
[18:03:55] (03PS9) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504)
[18:03:58] (03PS5) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504)
[18:04:08] (03CR) 10jerkins-bot: [V: 04-1] First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:11] (03CR) 10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:13] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:04:15] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:05:31] (03CR) 10Volans: "Replies inline, thanks for the review!" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:06:46] (03CR) 10Volans: "> I am not entirely sure about this approach. Django has its own" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:06:58] (03CR) 10Volans: "Reply inline" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans)
[18:18:04] (03PS4) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:19:35] (03PS5) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:25:01] (03PS6) 10Ottomata: Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940)
[18:28:04] !log temporarily stopping puppet on kafka200[123] to apply MirrorMaker --new.consumer https://gerrit.wikimedia.org/r/#/c/424344/ T190940
[18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:11] T190940: Use --new.consumer for main codfw <-> eqiad Kafka MirrorMaker - https://phabricator.wikimedia.org/T190940
[18:29:37] (03CR) 10Ottomata: [C: 032] Use profile::kafka::mirror with --new.consumer for main-codfw -> main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:29:42] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10942/kafka2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/424344 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:39:09] (03PS1) 10Ottomata: Use --new.consumer for main codfw -> eqiad MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/426965 (https://phabricator.wikimedia.org/T190940)
[18:40:04] (03CR) 10Ottomata: [C: 032] Use --new.consumer for main codfw -> eqiad MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/426965 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[18:48:33] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 274809.0696474634 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:49:53] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 128174.41681260944 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[18:49:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 149952.0616071429 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:34] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:53] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:56:53] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:03:34] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 56753.572735590096 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[19:03:53] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 57918.16353383459 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:03:53] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 83493.59792027729 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:34] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:53] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:04:53] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:09:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134473 (10Mholloway) a:03Mholloway
[19:14:45] (03PS1) 10Ottomata: Configure jmx_exporter prometheus config for kafka main (mirror) [puppet] - 10https://gerrit.wikimedia.org/r/426971 (https://phabricator.wikimedia.org/T190940)
[19:15:39] (03CR) 10Ottomata: [C: 032] Configure jmx_exporter prometheus config for kafka main (mirror) [puppet] - 10https://gerrit.wikimedia.org/r/426971 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:23:27] (03PS1) 10Ottomata: Remove unused role::kafka::main::mirror and set up main MM alerts [puppet] - 10https://gerrit.wikimedia.org/r/426973 (https://phabricator.wikimedia.org/T190940)
[19:24:39] (03CR) 10Ottomata: [C: 032] Remove unused role::kafka::main::mirror and set up main MM alerts [puppet] - 10https://gerrit.wikimedia.org/r/426973 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:32:03] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4134513 (10mobrovac)
[19:33:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:37:45] i'm taking over tin for 10 mins to fix a UBN
[19:42:04] (03PS1) 10Ottomata: Fix main-eqiad_to_main-codfw Mirror alert [puppet] - 10https://gerrit.wikimedia.org/r/426976 (https://phabricator.wikimedia.org/T190940)
[19:42:52] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:43:40] (03CR) 10Ottomata: [C: 032] Fix main-eqiad_to_main-codfw Mirror alert [puppet] - 10https://gerrit.wikimedia.org/r/426976 (https://phabricator.wikimedia.org/T190940) (owner: 10Ottomata)
[19:46:03] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Use the wiki set in the JobQueue when creating the event, file 1/2 - T192198 (duration: 01m 00s)
[19:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:10] T192198: Wikidata doesn't update recentchanges - https://phabricator.wikimedia.org/T192198
[19:47:28] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Use the wiki set in the JobQueue when creating the event, file 2/2 - T192198 (duration: 00m 59s)
[19:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:55] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[19:59:22] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134550 (10Niedzielski)
[19:59:45] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927#4134556 (10Ottomata) Hm, I just thought about this a little bit, and I'm not so sure we should do it. The hiera in...
[20:04:00] 10Operations, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4134566 (10Imarlier) a:03BBlack Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, i...
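The kubelet operational latency alerts earlier in the log compute an average latency by dividing the rate of a Prometheus summary's `_sum` counter by the rate of its `_count` counter over a 5-minute window. A minimal sketch of that arithmetic, with invented sample values (the threshold is the 15000 µs figure from the alert expression):

```python
def rate(samples, window_s):
    """Per-second increase of a monotonic counter over the window (like PromQL rate())."""
    return (samples[-1] - samples[0]) / window_s

# Counter values sampled at the start and end of a 5-minute window.
# The numbers are invented for illustration only.
latency_sum_us = [1_000_000.0, 1_900_000.0]  # ..._latency_microseconds_sum
latency_count = [500.0, 800.0]               # ..._latency_microseconds_count

window = 300  # seconds: the [5m] range in the alert expression
avg_latency_us = rate(latency_sum_us, window) / rate(latency_count, window)
# 900000/300 = 3000 µs/s of latency accrued over 300/300 = 1 op/s,
# i.e. an average of 3000 µs per operation.
critical = avg_latency_us >= 15000.0  # threshold from the kubelet alert
```

Dividing the two rates rather than the raw counters keeps the result meaningful across counter resets and gives a windowed average instead of a lifetime one.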
[20:12:06] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Revert using the wiki of the job runner, file 1/2 (duration: 00m 58s)
[20:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:18] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Revert using the wiki of the job runner, file 2/2 (duration: 00m 58s)
[20:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:04] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1
[20:28:36] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134626 (10Mholloway) @Niedzielski I restarted the beta cluster restbase and mobileapps servi...
[20:31:16] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134640 (10Mholloway)
[20:33:56] !log imarlier@tin Started deploy [performance/navtiming@64d9c90]: null deploy
[20:33:58] !log imarlier@tin Finished deploy [performance/navtiming@64d9c90]: null deploy (duration: 00m 02s)
[20:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:52:52] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134666 (10Mholloway) For posterity: there was a recent change to the config variable that se...
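The CirrusSearch and EventBus checks in this log fire when a given percentage of recent graphite datapoints sits above a threshold (e.g. "30.00% of data above the critical threshold [1000.0]"). A small sketch of that logic; the sample latencies are invented:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    points = [p for p in datapoints if p is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for p in points if p > threshold) / len(points)

# Ten 95th-percentile latency samples in ms (None = a null datapoint,
# as graphite often returns for the most recent interval):
samples = [420, 480, 1100, 950, 1300, 640, 880, 1050, 700, None]

crit = percent_above(samples, 1000.0) >= 30.0  # CRITICAL clause from the alert
warn = percent_above(samples, 500.0) >= 20.0   # WARNING-style clause
```

Skipping null datapoints matters: counting them in the denominator would silently dilute the percentage during gaps in metric collection.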
[21:02:29] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/EventBus.php: Use the correct way of calculating the domain from the wiki, file 1/2 - T192198 (duration: 00m 59s)
[21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:35] T192198: Wikidata doesn't update recentchanges - https://phabricator.wikimedia.org/T192198
[21:03:46] !log mobrovac@tin Synchronized php-1.31.0-wmf.29/extensions/EventBus/includes/JobQueueEventBus.php: Use the correct way of calculating the domain from the wiki, file 2/2 - T192198 (duration: 00m 58s)
[21:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:23:24] (03CR) 10Krinkle: [C: 031] Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[21:29:18] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka2001 is CRITICAL: NRPE: Command check_kafka-mirror-main-eqiad_to_main-codfw@0 not defined
[21:31:50] ottomata: ^ ?
[21:36:42] (03CR) 10Krinkle: [C: 04-1] "Assuming that the scap-deployed service will be live soon, let's roll this out via the new repository to avoid diverging the code between " [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog)
[21:37:10] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastructure - https://phabricator.wikimedia.org/T175210#4134725 (10Pchelolo)
[21:38:16] (03PS4) 10Krinkle: webperf1001/2001 start using webperf role [puppet] - 10https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T186774) (owner: 10Dzahn)
[21:57:16] Hi XioNoX and/or paravoid! Can one of you help me with a payments-cluster-related firewall adjustment?
[21:57:25] https://phabricator.wikimedia.org/T191669
[21:58:16] Thought I'd filed the request well ahead of the deadline to switch payment API addresses
[21:58:29] but it turns out I'd put the wrong security policy on the ticket
[21:58:37] and hidden from all but fundraising :(
[22:19:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Mobile-Content-Service, 10Page-Previews, and 3 others: [Bug] Beta cluster page summary endpoint sometimes responds with 5xx - https://phabricator.wikimedia.org/T192287#4134804 (10Niedzielski) I tried about 50 links and it seems to work. Thanks (and thanks for k...
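The T192198 syncs in this log replace an incorrect assumption (using the job runner's own wiki) with deriving the event's domain from the wiki set on the job itself. The helper below is a purely hypothetical illustration of such a dbname-to-domain mapping; MediaWiki's actual resolution goes through its site configuration, not a hard-coded table like this one.

```python
# Hypothetical dbname -> domain table; wikis that don't follow the
# "<lang>wiki" -> "<lang>.wikipedia.org" pattern need explicit entries.
SPECIAL = {
    "wikidatawiki": "www.wikidata.org",
    "commonswiki": "commons.wikimedia.org",
    "metawiki": "meta.wikimedia.org",
}

def domain_from_dbname(dbname):
    """Illustrative only: derive a canonical domain from a wiki database name."""
    if dbname in SPECIAL:
        return SPECIAL[dbname]
    if dbname.endswith("wiki"):
        return dbname[: -len("wiki")] + ".wikipedia.org"
    raise ValueError(f"unmapped dbname: {dbname}")
```

The bug class the fix addresses is visible here: resolving "wikidatawiki" through the language-wiki fallback instead of its explicit entry would yield a wrong domain, which is consistent with Wikidata's recentchanges updates being misrouted.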
[22:25:13] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426915 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui)
[22:25:17] (03CR) 10jenkins-bot: db-eqiad.php: db1114, restore main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426923 (owner: 10Marostegui)
[22:25:21] (03CR) 10jenkins-bot: mariadb: Depool es1017 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426908 (owner: 10Jcrespo)
[22:25:25] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426924 (owner: 10Marostegui)
[22:25:29] (03CR) 10jenkins-bot: db-eqiad.php: Increase API traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426927 (owner: 10Marostegui)
[22:25:35] (03CR) 10jenkins-bot: db-eqiad.php: Restore original API weight db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426930 (owner: 10Marostegui)
[22:48:58] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))): 42910515.973187715 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[22:50:47] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))): 83241901.26878615 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[22:54:33] XioNoX or paravoid: could one of you ping me when you get the chance to take a look at that ticket?
[22:54:53] As soon as the routes are open I'll deploy a settings change to the payments cluster to use the new address
[23:01:47] RECOVERY - Request latencies on chlorine is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:04:07] RECOVERY - Request latencies on argon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.32.133:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api