[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T0000). [00:00:04] MatmaRex, RoanKattouw, and tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:06] gehel: ah, yea, probably needs a puppet run on icinga server and wdqs.. ACK [00:00:19] I'll SWAT [00:00:20] hi. [00:00:21] o/ [00:02:58] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:22] (03CR) 10Catrope: [C: 03+2] Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [00:03:29] Doing tgr's change first [00:04:28] (03Merged) 10jenkins-bot: Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [00:05:35] tgr: Your change is on mwdebug1002, please test [00:06:54] RoanKattouw: not worth the effort testing IMO, it's just a change of a list [00:07:00] OK, syncing [00:08:11] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Improve list of privileged groups (duration: 00m 46s) [00:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:58] thanks! [00:11:16] MatmaRex will be next once Jenkins finishes [00:13:35] alright [00:15:06] (03CR) 10jenkins-bot: Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [00:15:26] (03PS2) 10Jbond: update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 [00:19:37] (03CR) 10Dzahn: [C: 03+1] "> 'grep' the logs (mwmaint1002 & mwmaint2001 and see if there has been recent errors" [puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [00:19:58] (03PS1) 10Volans: documentation: fine-tune generated documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 [00:22:06] (03CR) 10Dzahn: [C: 03+1] "> I'm certainly up to give it a try." 
[puppet] - 10https://gerrit.wikimedia.org/r/483876 (https://phabricator.wikimedia.org/T213591) (owner: 10MarcoAurelio) [00:22:31] (03CR) 10Volans: documentation: fine-tune generated documentation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [00:27:16] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.05 seconds [00:27:17] RoanKattouw: it passed btw [00:27:22] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.37 seconds [00:27:48] Thanks for the heads up, it took a long time [00:27:56] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.41 seconds [00:28:00] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.70 seconds [00:28:00] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.34 seconds [00:28:02] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.63 seconds [00:28:04] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.56 seconds [00:28:12] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.74 seconds [00:28:16] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.74 seconds [00:28:34] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [00:28:54] MatmaRex: It's on mwdebug1002, please test [00:30:02] RoanKattouw: looking. it is taking forever to load [00:30:51] RoanKattouw: seems good! [00:33:42] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.17 seconds [00:34:35] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.12/resources/lib/ooui/oojs-ui-core.js: OOUI backport (T213544) (duration: 00m 46s) [00:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:38] T213544: Pressing 'Enter' not submitting form after OOUI v0.30.0 releases in some cases - https://phabricator.wikimedia.org/T213544 [00:36:36] (03PS3) 10Dzahn: profile::mediawiki::webserver: inline mediawiki::conftool [puppet] - 10https://gerrit.wikimedia.org/r/482791 (owner: 10Giuseppe Lavagetto) [00:37:11] (03CR) 10Dzahn: "compiler said (unrelated) Syntax error at '<<' at /srv/jenkins-workspace/puppet-compiler/14328/change/src/modules/systemd/manifests/syslog" [puppet] - 10https://gerrit.wikimedia.org/r/482791 (owner: 10Giuseppe Lavagetto) [00:37:14] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/GrowthExperiments/: Make welcome survey config use array_plus_2d (duration: 00m 46s) [00:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:46] (03PS1) 10EBernhardson: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) [00:39:12] (03PS1) 10Bstorm: toolforge: change the default proxy options to real proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/484335 (https://phabricator.wikimedia.org/T213711) [00:39:34] (03CR) 10Catrope: [C: 03+2] Welcome survey: experiment 2: A vs. C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483449 (owner: 10Sbisson) [00:39:39] (03PS4) 10Catrope: Welcome survey: experiment 2: A vs. 
C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483449 (owner: 10Sbisson) [00:39:42] (03CR) 10jerkins-bot: [V: 04-1] Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) (owner: 10EBernhardson) [00:39:46] (03CR) 10Catrope: [C: 03+2] Welcome survey: experiment 2: A vs. C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483449 (owner: 10Sbisson) [00:40:28] (03CR) 10EBernhardson: [C: 04-1] "needs Id395471 deployed first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) (owner: 10EBernhardson) [00:40:59] (03Merged) 10jenkins-bot: Welcome survey: experiment 2: A vs. C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483449 (owner: 10Sbisson) [00:41:24] (03CR) 10jenkins-bot: Welcome survey: experiment 2: A vs. C [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483449 (owner: 10Sbisson) [00:44:10] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Welcome survey experiment 2: 50% variation A, 50% variation C (duration: 00m 46s) [00:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:59] 10Operations: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (10Jdforrester-WMF) [00:48:08] SWAT's all done [00:50:13] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14329/" [puppet] - 10https://gerrit.wikimedia.org/r/482791 (owner: 10Giuseppe Lavagetto) [00:50:56] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.19 seconds [00:51:00] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.06 seconds [00:51:02] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.68 seconds [00:51:12] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.16 seconds [00:51:14] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.44 seconds [00:51:20] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.20 seconds [00:51:44] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.31 seconds [00:54:00] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.84 seconds [00:54:26] it's wikibase like earlier today, i had already left a message [00:56:37] (03CR) 10Dzahn: [C: 03+2] "noop on mw1261, mw2271 ..." [puppet] - 10https://gerrit.wikimedia.org/r/482791 (owner: 10Giuseppe Lavagetto) [01:07:32] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 281.22 seconds [01:09:08] PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100% [01:09:52] ^ https://phabricator.wikimedia.org/T203194#4878172 ... 
looking on mgmt [01:11:04] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.88 seconds [01:12:21] !log cp1078 - bnxt_en - TX timeout detected - Host cp1078 is DOWN - powercycled via mgmt (T203194) [01:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:24] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 [01:14:52] RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [01:15:15] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Dzahn) ` 20:09 <+icinga-wm> PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100% ... cp1078 login: [33059.724815] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeo... [01:19:05] (03CR) 10Dzahn: [C: 03+2] service::node: do not install nodejs-legacy if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:19:07] (03PS3) 10Dzahn: service::node: do not install nodejs-legacy if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) [01:25:46] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [01:28:45] (03CR) 10Dzahn: [C: 03+2] "noop on scb1004, sca1004, ruthenium ...one issue removed on scandium" [puppet] - 10https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:35:11] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10srishakatux) p:05Triage→03Normal [01:35:36] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10srishakatux) a:05srishakatux→03None [01:37:01] (03PS1) 10Dzahn: visualdiff: ensure git clone happens before creating pngs dir [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) [01:39:13] (03CR) 10Dzahn: "issue introduced in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/269070/" [puppet] - 10https://gerrit.wikimedia.org/r/484342 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:49:04] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused [01:49:14] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war [01:49:16] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:50:58] onimisionipe: [01:54:57] !log wdqs1009 - icinga alerts about Blazegraph process for wdqs categories. starting wdsq blazegraph,.. 
already running [01:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:59] SMalyshev: ^ [01:59:02] mutante: sorry, I'll fix it in a minute [01:59:17] SMalyshev: cool, thanks [01:59:56] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.50 seconds [01:59:56] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 [02:00:10] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war [02:00:10] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational [02:04:04] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@c920aec]: Re-deploy namespace script [02:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:30] (03PS1) 10Dzahn: parsoid: ensure /srv/deployment/parsoid exists before cloning [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) [02:07:29] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Dzahn) [02:11:40] (03PS2) 10Dzahn: parsoid: ensure /srv/deployment/parsoid exists before cloning [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) [02:12:46] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@c920aec]: Re-deploy namespace script (duration: 08m 42s) [02:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:08] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 53.52 seconds [02:13:08] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 54.15 seconds [02:13:10] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 50.19 seconds [02:13:16] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 48.40 seconds [02:13:30] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 46.12 seconds [02:13:54] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 35.28 seconds [02:13:56] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 34.32 seconds [02:14:02] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 30.91 seconds [02:17:51] (03PS7) 10Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 [02:33:48] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.001 second response time [02:39:54] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time [02:44:00] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:44:13] (03CR) 10Andrew Bogott: [C: 03+1] toolforge: change the default proxy options to real proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/484335 (https://phabricator.wikimedia.org/T213711) (owner: 10Bstorm) [02:46:31] (03PS1) 10Smalyshev: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 [02:46:34] PROBLEM - Request latencies on acrux is 
CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:46:46] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:47:40] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:49:02] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:49:12] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:50:44] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 43.23 seconds [02:55:33] (03PS1) 10Smalyshev: Switch category endpoint to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 [03:06:39] (03PS1) 10Mathew.onipe: elasticsearch_cluster: fix doc for is_green [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 [03:14:23] (03CR) 10Mathew.onipe: documentation: fine-tune generated documentation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [03:16:36] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 56.47 seconds [03:16:40] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 55.89 seconds [03:16:56] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 49.39 seconds [03:17:22] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 38.46 seconds [03:17:22] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 37.72 seconds [03:17:26] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 38.16 seconds [03:17:27] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 38.23 seconds [03:17:36] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 34.20 seconds [03:17:44] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 37.89 seconds [03:33:56] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.93 seconds [03:38:02] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 190.24 seconds [03:42:39] (03PS1) 10Smalyshev: Make cron endpoint configurable [puppet] - 10https://gerrit.wikimedia.org/r/484348 [03:43:36] (03CR) 10Smalyshev: "Since vars.sh is generated by scap, after this is merged we'd need scap deploy I presume." 
[puppet] - 10https://gerrit.wikimedia.org/r/484345 (owner: 10Smalyshev) [03:46:07] (03PS2) 10Smalyshev: Switch external category endpoint to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) [03:46:34] (03PS2) 10Smalyshev: Make cron endpoint configurable [puppet] - 10https://gerrit.wikimedia.org/r/484348 (https://phabricator.wikimedia.org/T213212) [03:47:52] (03PS3) 10Smalyshev: Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) [03:48:27] (03PS2) 10Smalyshev: Move categories namespace to second instance [puppet] - 10https://gerrit.wikimedia.org/r/484344 (https://phabricator.wikimedia.org/T213212) [03:49:06] (03PS4) 10Smalyshev: Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) [03:49:33] (03PS5) 10Smalyshev: Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) [04:01:26] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 58.32 seconds [04:01:38] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 53.55 seconds [04:01:38] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 53.69 seconds [04:01:47] (03CR) 10Mobrovac: "IMHO, it would be better to ensure this in service::deploy::gitclone as the root problem is that that define does not check for the dir's " [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [04:01:48] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 50.04 seconds [04:02:02] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 44.66 seconds [04:02:32] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 31.98 seconds [04:02:34] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 31.97 seconds [04:13:54] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
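The service::deploy::gitclone ordering problem raised in Mobrovac's review above (the define does not check that the parent directory exists before cloning into it) comes down to an explicit Puppet resource dependency. A minimal sketch of that idea, with a hypothetical resource title, owner and target path purely for illustration, not the actual patch under review:

    # Make sure the parent deployment directory is present first.
    file { '/srv/deployment/parsoid':
        ensure => directory,
        owner  => 'deploy-service',   # assumed owner, for illustration only
    }

    # The clone then declares an explicit dependency on that directory,
    # so Puppet orders the two resources correctly on a freshly imaged host.
    git::clone { 'parsoid-deploy':                       # hypothetical title
        directory => '/srv/deployment/parsoid/deploy',   # assumed target path
        require   => File['/srv/deployment/parsoid'],
    }

The same relationship could instead be expressed inside the shared define itself, which is what the review comment suggests.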
[04:14:00] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war [04:15:08] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational [04:15:14] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1007 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war [04:27:44] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [04:35:20] (03PS1) 10Krinkle: xhgui: Remove outdated clone of xhprof mirror [puppet] - 10https://gerrit.wikimedia.org/r/484351 (https://phabricator.wikimedia.org/T196406) [05:15:42] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 45.48 seconds [05:33:02] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.25 seconds [05:33:06] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.15 seconds [05:33:06] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.26 seconds [05:33:06] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.72 seconds [05:33:18] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.19 seconds [05:33:28] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.23 seconds [05:33:30] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.46 seconds [05:33:34] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.65 seconds [05:33:50] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.72 seconds [05:54:51] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10Krinkle) Assuming the above means tideways is now installed and available with PHP7 on mwdebug servers, next step is to make it work with our X-Wikimedia-Debug "Profil... 
[06:15:57] (03PS1) 10Marostegui: pc1007.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/484353 (https://phabricator.wikimedia.org/T208383) [06:18:38] (03CR) 10Marostegui: [C: 03+2] pc1007.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/484353 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:24:41] (03PS1) 10Marostegui: db-eqiad.php: Pool pc1007 into pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484354 (https://phabricator.wikimedia.org/T208383) [06:26:48] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.74 seconds [06:27:42] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.25 seconds [06:28:32] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.45 seconds [06:28:54] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.36 seconds [06:28:59] (03PS2) 10Marostegui: db-eqiad.php: Pool pc1007 into pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484354 (https://phabricator.wikimedia.org/T208383) [06:29:20] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.47 seconds [06:29:30] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.49 seconds [06:29:34] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.52 seconds [06:29:34] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.68 seconds [06:29:42] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.52 seconds [06:32:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Pool pc1007 into pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484354 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:32:44] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [06:33:04] PROBLEM - Host cp2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:06] PROBLEM - Host cp2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:06] PROBLEM - Host ms-be2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:06] PROBLEM - Host ms-be2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:16] PROBLEM - Host cp2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:16] PROBLEM - Host cp2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:33:21] uh? [06:33:36] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1007 into pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484354 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:34:09] (03CR) 10jenkins-bot: db-eqiad.php: Pool pc1007 into pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484354 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:34:36] PROBLEM - Host ms-be2027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:34:41] rack down? 
[06:34:42] PROBLEM - Host ms-be2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:35:16] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.60 seconds [06:35:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1007 in pc1 - T208383 (duration: 00m 49s) [06:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:25] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [06:35:46] PROBLEM - Host ms-be2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:36:32] PROBLEM - Host elastic2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:36:32] PROBLEM - Host elastic2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:36:53] yeah, I think the rack went down [06:38:18] not good [06:42:11] marostegui: only mgmt interfaces right? [06:42:15] not rack down [06:45:01] the rack is D7 https://netbox.wikimedia.org/dcim/racks/73/ [06:45:01] D7: Testing: DO not merge - https://phabricator.wikimedia.org/D7 [06:45:13] elukey: seems so. Yes [06:46:25] not great but better than the hosts down completely :D [06:46:34] true [06:47:33] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) Thanks for handling cp1078 @Dzahn. It looks like 4.9.144 is also affected [06:49:40] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.40 seconds [06:49:44] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.57 seconds [06:49:56] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.86 seconds [06:50:04] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.83 seconds [06:50:30] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.49 seconds [06:55:18] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.54 seconds [07:02:34] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.58 seconds [07:03:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484356 (https://phabricator.wikimedia.org/T85757) [07:04:39] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484356 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [07:06:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484356 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [07:06:31] (03PS2) 10Mathew.onipe: elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) [07:07:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 T85757 (duration: 00m 45s) [07:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:21] !log Deploy schema change on db1099:3311 - T85757 [07:07:21] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [07:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:15] 10Operations, 10DBA, 
10Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) pc1007 is now serving. I have also updated tendril and zarcillo to reflect that it is the master for pc1. pc1010, pc2007 a... [07:12:29] 10Operations, 10DBA, 10Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [07:12:36] 10Operations, 10DBA, 10Patch-For-Review: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) 05Open→03Resolved [07:13:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484356 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [07:13:24] 10Operations, 10netops: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (10elukey) p:05Triage→03High [07:20:29] !log Drop tag_summary from s5 - T212255 [07:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:32] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [07:23:36] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:24:16] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.69 seconds [07:24:20] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.47 seconds [07:24:24] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.01 seconds [07:24:40] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.53 seconds [07:24:46] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.32 seconds [07:24:48] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.61 seconds [07:25:20] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.41 seconds [07:28:16] !log Drop tag_summary from wikitech - T212255 [07:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:18] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [07:35:29] (03PS1) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) [07:36:18] (03PS2) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) [07:40:53] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Vgutierrez) so far we've seen crashes in the following servers: * cp1078 (twice) * cp1080 * cp1084 * cp1085 on the Dell community forum there is a [[ https://www.dell... 
[07:45:45] 10Operations, 10netops: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (10Peachey88) [07:47:10] 10Operations, 10monitoring, 10netops: create a test for multicast relay - https://phabricator.wikimedia.org/T82038 (10Peachey88) [07:51:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484359 [07:52:59] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484359 (owner: 10Marostegui) [07:53:16] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/14330/" [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [07:54:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484359 (owner: 10Marostegui) [07:56:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 T85757 (duration: 00m 46s) [07:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:12] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [07:56:39] (03PS1) 10Marostegui: Bug: T85757 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484360 (https://phabricator.wikimedia.org/T85757) [07:58:04] (03PS2) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484360 (https://phabricator.wikimedia.org/T85757) [07:58:44] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 19.23 seconds [07:58:50] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.50 seconds [07:58:54] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [07:59:26] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [07:59:39] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484360 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:01:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484360 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:01:21] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10MoritzMuehlenhoff) The reports in that thread are for RHEL 7, which uses 3.10 as the base layer kernel (but with backports for all kinds of drivers, so it's hard to te... 
[08:02:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 T85757 (duration: 00m 45s) [08:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:19] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:02:58] !log Deploy schema change on db1089 - T85757 [08:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:10] (03CR) 10Muehlenhoff: update the offboard-user script so that it also checks absent users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484276 (owner: 10Jbond) [08:04:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484359 (owner: 10Marostegui) [08:04:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484360 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:06:28] (03PS6) 10Alexandros Kosiaris: lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 [08:06:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 (owner: 10Alexandros Kosiaris) [08:10:28] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:10:30] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:10:46] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [08:11:52] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 5 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) p:05High→03Normal Moving from High to Normal based on the comments above. 
Can promote if this happens again [08:12:19] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 3 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) [08:19:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484362 [08:21:14] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:21:18] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484362 (owner: 10Marostegui) [08:22:17] (03PS1) 10Marostegui: install_server: Do not reimage pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/484363 (https://phabricator.wikimedia.org/T208383) [08:22:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484362 (owner: 10Marostegui) [08:23:17] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/484363 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [08:23:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1089 T85757 (duration: 00m 46s) [08:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:32] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:24:21] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [08:25:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484365 (https://phabricator.wikimedia.org/T85757) [08:26:15] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484365 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:27:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484365 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:28:05] (03PS7) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [08:28:07] (03PS2) 10Hashar: contint: delete unused doc.wikimedia.org site config [puppet] - 10https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [08:28:10] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 59.30 seconds [08:28:16] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 59.56 seconds [08:28:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 T85757 (duration: 00m 45s) [08:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:29] !log Deploy schema change on db1106 - T85757 [08:28:31] (03CR) 10Hashar: [C: 03+1] "I have amended the patch to remove a few directories that are no more used :)" [puppet] - 10https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:32] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:28:38] 
RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 55.67 seconds [08:28:44] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 55.04 seconds [08:28:50] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 55.86 seconds [08:29:00] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 53.47 seconds [08:29:03] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [08:29:06] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 52.18 seconds [08:29:21] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [08:30:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484362 (owner: 10Marostegui) [08:30:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484365 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [08:34:30] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:30] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 75841 bytes in 0.114 second response time [08:38:21] !log Stop replication on s1 on all labs hosts - T85757 [08:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [08:38:46] jouncebot: next [08:38:46] In 3 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1200) [08:38:56] (03CR) 10Hashar: "Works! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/484194 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [08:41:00] (03PS3) 10Addshore: Add WikibaseQualityConstraints configs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480535 (https://phabricator.wikimedia.org/T209922) (owner: 10Ladsgroup) [08:44:48] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 59.53 seconds [08:47:46] (03CR) 10Vgutierrez: "Thanks volans!" (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [08:47:55] (03CR) 10Vgutierrez: "recheck" [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [08:47:56] jouncebot: refresh [08:47:57] I refreshed my knowledge about deployments. 
[08:47:58] (03CR) 10Muehlenhoff: update the offboard-user script so that it also checks absent users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/484276 (owner: 10Jbond) [08:48:00] jouncebot: next [08:48:00] In 0 hour(s) and 11 minute(s): Wikidata Configuration changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T0900) [08:48:03] (03PS4) 10Elukey: Introduce role::analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) [08:49:13] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [08:51:54] 10Operations, 10netops: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (10elukey) [08:52:45] (03CR) 10Elukey: [C: 03+2] Introduce role::analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [08:52:47] (03PS1) 10Jcrespo: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) [08:55:48] (03PS1) 10Addshore: wgWBQualityConstraintsTypeCheckMaxEntities 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484370 (https://phabricator.wikimedia.org/T209504) [09:00:04] addshore: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Configuration changes deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T0900). [09:00:05] addshore: A patch you scheduled for Wikidata Configuration changes is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [09:00:12] \o [09:00:15] (03CR) 10Addshore: [C: 03+2] Add WikibaseQualityConstraints configs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480535 (https://phabricator.wikimedia.org/T209922) (owner: 10Ladsgroup) [09:00:24] (03CR) 10WMDE-Fisch: [C: 03+1] [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483905 (owner: 10MaxSem) [09:00:47] (03CR) 10WMDE-Fisch: [C: 03+1] [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483906 (owner: 10MaxSem) [09:00:58] CFisch_WMDE: I see you have some labs / beta only patches? :O [09:00:59] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 56.59 seconds [09:01:05] (03CR) 10WMDE-Fisch: [C: 03+1] [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483912 (owner: 10MaxSem) [09:01:18] (03Merged) 10jenkins-bot: Add WikibaseQualityConstraints configs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480535 (https://phabricator.wikimedia.org/T209922) (owner: 10Ladsgroup) [09:01:41] addshore: technically not "my" patches but yes [09:01:57] on the other hand I have not really time today :-/ [09:02:10] :D [09:02:17] save them for next week? 
;) [09:02:38] although, beta ones are boring ;) [09:03:01] hrhr [09:03:48] I don't know about MaxSem's plans regarding the deployment of these :-) [09:03:56] btw thanks for the cleanup [09:04:17] buuuut we want to go full beta with the FileExporter this week [09:04:51] (03PS6) 10Alexandros Kosiaris: sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (https://phabricator.wikimedia.org/T212772) [09:04:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (https://phabricator.wikimedia.org/T212772) (owner: 10Alexandros Kosiaris) [09:05:13] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (https://phabricator.wikimedia.org/T212772) (owner: 10Alexandros Kosiaris) [09:05:40] so maybe we could do that tomorrow addshore :-) [09:05:50] :) feeel free to give me a poke [09:05:59] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T209922 Add WikibaseQualityConstraints configs in testwikidatawiki (duration: 00m 47s) [09:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:02] T209922: Configure WikibaseQualityConstraints on test - https://phabricator.wikimedia.org/T209922 [09:06:20] (03CR) 10Alexandros Kosiaris: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/483134 (owner: 10Alexandros Kosiaris) [09:06:20] * CFisch_remote will poke Thiemo first, he wanted to prepare the patch ^^ [09:06:28] (03PS2) 10Alexandros Kosiaris: mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 [09:06:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 (owner: 10Alexandros Kosiaris) [09:06:41] (03PS2) 10Addshore: wgWBQualityConstraintsTypeCheckMaxEntities 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484370 (https://phabricator.wikimedia.org/T209504) [09:06:46] (03CR) 10Addshore: [C: 03+2] wgWBQualityConstraintsTypeCheckMaxEntities 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484370 (https://phabricator.wikimedia.org/T209504) (owner: 10Addshore) [09:08:06] (03Merged) 10jenkins-bot: wgWBQualityConstraintsTypeCheckMaxEntities 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484370 (https://phabricator.wikimedia.org/T209504) (owner: 10Addshore) [09:10:07] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsTypeCheckMaxEntities 300, T209504 (duration: 00m 46s) [09:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:10] (03CR) 10jenkins-bot: Add WikibaseQualityConstraints configs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480535 (https://phabricator.wikimedia.org/T209922) (owner: 10Ladsgroup) [09:10:10] T209504: Perform more constraint type checks in PHP before falling back to SPARQL - https://phabricator.wikimedia.org/T209504 [09:10:12] (03CR) 10jenkins-bot: wgWBQualityConstraintsTypeCheckMaxEntities 300 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484370 (https://phabricator.wikimedia.org/T209504) (owner: 10Addshore) [09:10:58] * addshore watches some graphs [09:18:55] !log upgrade and restart db2078 [09:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:07] jouncebot: refresh [09:20:08] I refreshed my knowledge about deployments. 
[09:20:10] !log deploy slot done [09:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:12] jouncebot: now [09:20:12] For the next 0 hour(s) and 9 minute(s): Wikidata Configuration changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T0900) [09:21:01] (03PS2) 10Jcrespo: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) [09:21:15] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [09:21:55] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10fgiunchedi) [09:23:11] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [09:24:14] (03Merged) 10jenkins-bot: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [09:25:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 46s) [09:25:36] (03PS1) 10Elukey: Configure analytics1028->41 as Hadoop Analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) [09:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:27:38] (03PS2) 10Elukey: Configure analytics1028->41 as Hadoop Analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) [09:29:23] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.14 seconds [09:30:24] (03PS3) 10Mathew.onipe: elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) [09:30:44] (03CR) 10Mathew.onipe: elasticsearch: mask default exporter service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:30:58] (03CR) 10Elukey: [C: 03+1] "Awesome thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [09:34:47] (03CR) 10Jcrespo: "Question v" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [09:35:01] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:36:11] (03CR) 10jenkins-bot: mariadb: Depool db1091 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484368 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [09:36:13] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14332/" [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:36:15] (03CR) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [09:36:17] (03CR) 10Volans: elasticsearch_cluster: fix doc for is_green (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484346 (owner: 10Mathew.onipe) [09:36:53] (03CR) 10Volans: documentation: fine-tune generated documentation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/484330 (owner: 10Volans) [09:38:21] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.33 seconds [09:38:23] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.41 seconds [09:38:25] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.57 seconds [09:38:45] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.51 seconds [09:38:45] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.70 seconds [09:38:49] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.37 seconds [09:38:55] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.89 seconds [09:40:03] (03PS3) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) [09:40:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484380 [09:41:30] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484380 [09:42:37] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484380 (owner: 10Marostegui) [09:42:39] (03PS8) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [09:43:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484380 (owner: 10Marostegui) [09:43:53] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:45:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1106 T85757 (duration: 00m 46s) [09:45:29] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:30] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:47:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484382 (https://phabricator.wikimedia.org/T85757) [09:47:38] (03PS4) 10Mathew.onipe: elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) [09:48:22] (03CR) 10Mathew.onipe: elasticsearch: mask default exporter service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:48:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484382 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [09:49:04] (03PS5) 10Gehel: elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:49:46] (03CR) 10Gehel: [C: 03+2] elasticsearch: mask default exporter service [puppet] - 10https://gerrit.wikimedia.org/r/484243 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [09:50:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484382 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [09:51:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1114 T85757 (duration: 00m 45s) [09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:04] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:51:34] !log Deploy schema change on db1114 - T85757 [09:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:49] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Nikerabbit) Chunking would p... [09:54:05] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Pablo-WMDE) @Smalyshev No reason to be sorry for asking the right questions! If we truly wanted* to boil the reason down to one sentence:... [09:54:18] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 [09:54:59] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-elasticsearch-exporter] [09:55:13] ^ oops, that's probably me, checking [09:55:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484380 (owner: 10Marostegui) [09:55:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484382 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [09:56:21] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-elasticsearch-exporter] [09:56:55] damn, it looks like logstash is still jessie and does not support mask [09:56:59] onimisionipe: ^ [09:57:09] rollign back [09:57:44] (03PS1) 10Gehel: Revert "elasticsearch: mask default exporter service" [puppet] - 10https://gerrit.wikimedia.org/r/484386 [09:58:01] gehel: not good. was just testing on elastic and it worked fine [09:58:32] onimisionipe: so back to the `exec` hack and leave a note to cleanup once logstash is on stretch [09:58:39] (03CR) 10Gehel: [C: 03+2] Revert "elasticsearch: mask default exporter service" [puppet] - 10https://gerrit.wikimedia.org/r/484386 (owner: 10Gehel) [09:58:51] gehel: alright then! [09:59:12] onimisionipe: I'm rolling back, I'll let you prepare a new patch [09:59:29] gehel: I'm doing that [09:59:33] thanks! [10:00:21] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10akosiaris) >>! In T211881#4878702, @Tgr wrote: > Are those numbers reliable? Arabic Wikipedia gets about... [10:01:08] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10Nikerabbit) I filed {T213802... [10:01:31] RECOVERY - puppet last run on logstash1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:02:04] (03PS21) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [10:02:06] (03PS23) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [10:02:08] (03PS1) 10DCausse: [cirrus] Add cirrussearch-big-indices tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484387 (https://phabricator.wikimedia.org/T210381) [10:02:10] (03PS1) 10DCausse: [cirrus] Start writing to psi & omega (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484388 (https://phabricator.wikimedia.org/T210381) [10:02:23] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-elasticsearch-exporter] [10:04:57] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[prometheus-elasticsearch-exporter] [10:05:19] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:05:28] (03PS1) 10Mathew.onipe: elasticsearch: mask default exporter [puppet] - 10https://gerrit.wikimedia.org/r/484389 (https://phabricator.wikimedia.org/T210592) [10:06:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484390 [10:10:16] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484390 (owner: 10Marostegui) [10:11:12] gehel, onimisionipe: the approach in the PS2 with the exec should also work on jessie, though [10:11:57] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484390 (owner: 10Marostegui) [10:12:16] moritzm: yeah, I was trying to be too smart with my comments :/ [10:12:59] (03PS2) 10Gehel: elasticsearch: mask default exporter [puppet] - 10https://gerrit.wikimedia.org/r/484389 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [10:13:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1114 T85757 (duration: 00m 45s) [10:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:54] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:13:55] (03CR) 10Gehel: [C: 03+2] elasticsearch: mask default exporter [puppet] - 10https://gerrit.wikimedia.org/r/484389 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [10:15:45] (03PS1) 10Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484391 (https://phabricator.wikimedia.org/T85757) [10:16:50] onimisionipe: ^^ looks good this time! [10:16:56] !log installing zeromq3 security updates on stretch (jessie/trusty not affected) [10:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484391 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:18:23] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.73 seconds [10:18:26] gehel: cool! 
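For context on the masking exchange above: on stretch-era hosts Puppet's systemd service provider can mask a unit directly, while the jessie logstash hosts need the `exec` workaround that gehel and onimisionipe settle on. A minimal sketch of both shapes, using the unit name from the alerts; this is an illustration, not the content of 484243/484389:

    # Where the systemd provider supports it, masking is just a service property.
    service { 'prometheus-elasticsearch-exporter':
      ensure => stopped,
      enable => mask,
    }

    # Jessie fallback: "systemctl mask" only creates a /dev/null symlink for the
    # unit, so that symlink can guard the exec and keep it idempotent.
    exec { 'mask-prometheus-elasticsearch-exporter':
      command => '/bin/systemctl mask prometheus-elasticsearch-exporter.service',
      creates => '/etc/systemd/system/prometheus-elasticsearch-exporter.service',
    }

The note to clean the exec up once logstash moves to stretch is what keeps the two code paths from diverging permanently.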
[10:18:29] will test [10:19:04] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484391 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:19:27] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.22 seconds [10:19:32] !log upgrade and restart db1091 [10:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:37] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.68 seconds [10:19:39] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.49 seconds [10:19:45] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.58 seconds [10:19:53] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.10 seconds [10:19:59] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.24 seconds [10:19:59] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.51 seconds [10:20:02] (03PS1) 10Muehlenhoff: Add library hint for zeromq3 [puppet] - 10https://gerrit.wikimedia.org/r/484393 [10:20:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080 T85757 (duration: 00m 45s) [10:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:07] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:20:21] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.80 seconds [10:20:50] !log Deploy schema change on db1080 - T85757 [10:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:20] (03PS2) 10Muehlenhoff: Add library hint for zeromq3 [puppet] - 10https://gerrit.wikimedia.org/r/484393 [10:23:23] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484390 (owner: 10Marostegui) [10:23:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484391 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:26:45] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for zeromq3 [puppet] - 10https://gerrit.wikimedia.org/r/484393 (owner: 10Muehlenhoff) [10:28:23] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:31:01] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:36:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484398 [10:38:28] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484398 (owner: 10Marostegui) [10:38:43] (03PS1) 10Jcrespo: mariadb: Repool db1091 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484399 (https://phabricator.wikimedia.org/T213664) [10:39:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484398 (owner: 10Marostegui) [10:39:53] (03PS2) 10Jcrespo: mariadb: Repool db1091 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484399 
(https://phabricator.wikimedia.org/T213664) [10:40:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080 T85757 (duration: 00m 45s) [10:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:53] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:43:47] (03PS4) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) [10:44:16] (03PS5) 10Volans: sre.hosts: add upgrade and reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) [10:46:05] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 [10:46:09] (03PS1) 10Jcrespo: mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) [10:46:26] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db1091 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484399 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [10:46:35] (03CR) 10Volans: [C: 03+2] sre.hosts: add upgrade and reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [10:47:29] (03Merged) 10jenkins-bot: mariadb: Repool db1091 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484399 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [10:47:37] (03PS1) 10Elukey: Apply -R 200 to memcached on mc1024 [puppet] - 10https://gerrit.wikimedia.org/r/484401 (https://phabricator.wikimedia.org/T208844) [10:47:55] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 48.25 seconds [10:47:57] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 46.59 seconds [10:48:19] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [10:48:39] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [10:48:47] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:48:49] (03Merged) 10jenkins-bot: sre.hosts: add upgrade and reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/480072 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [10:48:49] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:48:55] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [10:49:03] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [10:49:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484398 (owner: 10Marostegui) [10:49:44] (03CR) 10jenkins-bot: mariadb: Repool db1091 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484399 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [10:49:52] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 with low load (duration: 00m 45s) [10:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:01] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:51:12] 
(03CR) 10DCausse: [C: 04-1] "found some issues when testing" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:52:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484402 (https://phabricator.wikimedia.org/T85757) [10:53:06] jouncebot: next [10:53:06] In 1 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1200) [10:53:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:53:31] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484402 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:53:34] !log START - Cookbook sre.hosts.upgrade-and-reboot (volans@cumin2001) [10:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:15] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484402 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:54:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:54:47] (03CR) 10Alexandros Kosiaris: "service::deploy::common already defines this and is required by service::node. Maybe it's just a race and a require would solve it?" [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [10:55:23] (03CR) 10Alexandros Kosiaris: "scratch that, service::deploy::common defines /srv/deployment, not /srv/deployment/parsoid" [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [10:55:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:55:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484402 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [10:55:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:56:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 T85757 (duration: 00m 46s) [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:54] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:57:00] !log Deploy schema change on db1083 - T85757 [10:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484403 [10:58:26] !log END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) (volans@cumin2001) [10:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:04] (03CR) 10ArielGlenn: [C: 03+1] "I see, there's the possibility of 'format' being specified in the config directly. This seems fine to me." [dumps/dcat] - 10https://gerrit.wikimedia.org/r/484011 (owner: 10Hoo man) [11:01:39] !log run 'apt-get purge tmpreaper' on mw1297,1298,2150,2151,2244,2245 (all role spare) to avoid daily cronspam [11:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:28] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Security-Team: Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Jhernandez) @phuedx and I looked at https://www.mediawiki.org/wiki/Proton#Technical_documents and https://wikitech.wikimedia.org/wiki/P... [11:02:50] !log dropping database test on db1124:s5 with replication [11:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:05] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484402 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [11:04:57] (03PS1) 10Elukey: decommisison_appserver.sh: add a step to clean up tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/484404 [11:06:50] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, 10Security-Team: Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10Jhernandez) [11:08:25] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484403 (owner: 10Marostegui) [11:09:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484403 (owner: 10Marostegui) [11:10:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 T85757 (duration: 00m 45s) [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:33] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:10:52] (03PS3) 10Jbond: update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 [11:11:02] (03CR) 10Alexandros Kosiaris: "I am not sure we should be using nodejs from stretch-backports. 
We already have nodejs 10.x under component/node10 for stretch and unless " [puppet] - 10https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [11:11:33] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:11:42] (03PS4) 10Jbond: update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 [11:12:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. When we move away from HHVM we can also consider dropping tmpreaper entirely, it's showing it's age (T185195)" [puppet] - 10https://gerrit.wikimedia.org/r/484404 (owner: 10Elukey) [11:12:39] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 75677 bytes in 2.513 second response time [11:14:21] (03CR) 10Muehlenhoff: "Yeah, we should align on a common version, backports is also rather volatile (as it follows testing), while we have control over what chan" [puppet] - 10https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [11:15:36] (03CR) 10Muehlenhoff: [C: 03+1] update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 (owner: 10Jbond) [11:15:59] (03CR) 10Elukey: [C: 03+2] decommisison_appserver.sh: add a step to clean up tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/484404 (owner: 10Elukey) [11:16:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484403 (owner: 10Marostegui) [11:18:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "One minor omission. The profile needs to be included in the role::releases. Rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483800 (https://phabricator.wikimedia.org/T213563) (owner: 10Thcipriani) [11:23:42] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, and 2 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Jhernandez) @phuedx and I talked about this. We need some documentation about how proton inte... [11:30:16] (03CR) 10Alexandros Kosiaris: geoip::maxmind: replace deprecated validate_string functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [11:30:24] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Joe) [11:31:34] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Joe) @Jhernandez I'm happy to explain to you whatever you might want to know about our load-ba... 
[11:39:49] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 [11:41:28] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 (owner: 10Jcrespo) [11:42:22] (03PS2) 10Jcrespo: mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) [11:42:33] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 (owner: 10Jcrespo) [11:42:47] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1091 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484383 (owner: 10Jcrespo) [11:42:52] (03PS3) 10Jcrespo: mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) [11:44:04] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 fully (duration: 00m 45s) [11:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:13] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds [11:44:15] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.63 seconds [11:44:23] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.73 seconds [11:44:35] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.41 seconds [11:44:39] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.39 seconds [11:44:39] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.09 seconds [11:44:57] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.24 seconds [11:45:07] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.44 seconds [11:45:07] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.79 seconds [11:47:38] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [11:49:01] (03Merged) 10jenkins-bot: mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [11:49:07] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1103 from s2 and s4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484409 [11:50:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103 (duration: 00m 45s) [11:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:20] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10fgiunchedi) I pulled the swift logs for that file (and similar) below, looks like similarly-named versions where also uploaded and deleted. The file in question was deleted on Jan 13th. On the MW side we ha... 
[11:55:36] (03CR) 10jenkins-bot: mariadb: Depool db1103 from s2 and s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484400 (https://phabricator.wikimedia.org/T213664) (owner: 10Jcrespo) [11:55:49] jynus: ^ looks like the file got legitimately deleted by some process in mw [11:57:49] yes, but deleted means archived? [11:58:16] or maybe it got deleted but the metadata didn't? [11:58:54] and that is why it cannot be restored [11:59:46] looks like the latter yeah, deleted from swift but not the metadata [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1200). [12:00:04] revi and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] and yes deleted more or less means the same as archived, different container [12:00:16] o/ [12:00:28] hoi [12:00:35] I can SWAT [12:00:38] need 5 min to prepare laptop so do dcausse's one first [12:00:45] then you can do yours first >_< [12:00:49] (03PS3) 10Hashar: rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [12:00:51] (03PS2) 10Hashar: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) [12:00:53] (03PS1) 10Hashar: Rake: honor rubocop AllCops/Excludes [puppet] - 10https://gerrit.wikimedia.org/r/484410 [12:01:12] (03CR) 10Alex Monk: certcentral: Allow specifying authorized hosts and regex in the config (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [12:01:14] revi: ok will do one of mine [12:01:20] kk [12:01:30] (03CR) 10Hashar: "Our rake task for rubocop would not honor the exclude list. Fixed by:" [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [12:02:24] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484387 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:03:10] dcausse: I'm around, but I was hoping you'll swat :D [12:03:12] !log starting upgrading of prometheus-elasticsearch-exporter for codfw T210592 [12:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:15] T210592: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 [12:03:29] zeljkof: thanks :) [12:03:34] (03Merged) 10jenkins-bot: [cirrus] Add cirrussearch-big-indices tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484387 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:04:20] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10jcrespo) Could you try to restore it @Platonides using the wiki admin tools before trying some SQL? [12:04:44] (03CR) 10Hashar: "That fixed it on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484304/ !" 
[puppet] - 10https://gerrit.wikimedia.org/r/484410 (owner: 10Hashar) [12:05:12] (03PS4) 10Revi: Change links of wgGEHelpPanelLinks for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483996 (https://phabricator.wikimedia.org/T209467) [12:05:13] rebased [12:05:58] dcausse: ready now, fyi [12:06:01] !log upgrade and restart db1103 [12:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:57] (03CR) 10jenkins-bot: [cirrus] Add cirrussearch-big-indices tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484387 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:09:08] !log dcausse@deploy1001 Synchronized wmf-config/CommonSettings.php: [cirrus] Add cirrussearch-big-indices tag T210381 (duration: 00m 46s) [12:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:11] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [12:10:01] revi: let's do your patch [12:10:05] kk [12:10:27] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483996 (https://phabricator.wikimedia.org/T209467) (owner: 10Revi) [12:11:32] (03Merged) 10jenkins-bot: Change links of wgGEHelpPanelLinks for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483996 (https://phabricator.wikimedia.org/T209467) (owner: 10Revi) [12:12:28] revi: should be available on mwdebug1002, is it possible for you to test? [12:12:32] it is [12:12:40] testing.... [12:12:59] dcausse: confirmed on mwdebug1002! [12:13:07] great! deploying [12:14:48] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Change links of wgGEHelpPanelLinks for kowiki T209467 (duration: 00m 46s) [12:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:51] T209467: Help panel: Determine what text and links to show in the help panel on KO wikipedia - https://phabricator.wikimedia.org/T209467 [12:15:03] revi: should be available everywhere now [12:15:10] testing again... [12:15:14] !log starting upgrading of prometheus-elasticsearch-exporter for eqiad T210592 [12:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] T210592: Fix prometheus elasticsearch exporter to show all the metrics - https://phabricator.wikimedia.org/T210592 [12:15:23] yup, LGTM dcausse! [12:15:39] 10Operations, 10Citoid, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) Great, thanks! Should I be able to ssh into bastion? ` > ssh bast1001.wikimedia.org Enter passphrase for key '/home/mar... [12:16:13] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.43 seconds [12:16:25] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.08 seconds [12:16:33] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.50 seconds [12:16:33] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.25 seconds [12:16:52] s7...? 
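Stepping back to hashar's Rake/rubocop fix above (484410, "Our rake task for rubocop would not honor the exclude list"): when files are passed to rubocop explicitly, AllCops/Exclude is ignored unless exclusion is forced. One way to get the intended behaviour is rubocop's --force-exclusion flag; whether the merged patch uses exactly this mechanism is not visible in the log, and the file path below is only an example:

    # Explicitly listed files are normally linted even if .rubocop.yml excludes them:
    bundle exec rubocop modules/rsync/manifests/init.pp

    # --force-exclusion makes rubocop honour AllCops/Exclude for named files too:
    bundle exec rubocop --force-exclusion modules/rsync/manifests/init.pp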
[12:16:55] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.13 seconds [12:17:15] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.49 seconds [12:17:22] this is only codfw apparently [12:17:25] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.14 seconds [12:17:27] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.68 seconds [12:17:31] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.58 seconds [12:17:43] jynus, marostegui, is this expected? ^ [12:17:43] I was bit concerned since kowiki is on s7 :P [12:17:48] ok :) [12:17:59] (that my patch wasn't happy for them) [12:18:12] db2*** [12:18:26] revi: I doubt your patch can cause this but who knows? :) [12:18:35] yeah :I [12:18:37] check the deployments page to see if there is something someone wrote about ongoing maintenance [12:19:16] If this is my patch that made codfw unhappy, we probably should revert and wait till kostajh wakes up [12:19:29] you should read the deployments page before deploying :-) [12:20:17] jynus: what do you mean? I see migrateActors.php, does this mean that SWAT should be cancelled? [12:21:22] https://phabricator.wikimedia.org/T188327#4877257 "Running migrateActors.php on wikitech for T188327. This may cause lag in codfw." [12:21:22] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [12:21:29] I mean that as far as an0mie said, it is normal [12:21:36] but probably normal tho [12:21:55] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 56.76 seconds [12:21:56] (03CR) 10jenkins-bot: Change links of wgGEHelpPanelLinks for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483996 (https://phabricator.wikimedia.org/T209467) (owner: 10Revi) [12:21:57] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 57.64 seconds [12:22:05] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 52.44 seconds [12:22:11] ok recovering [12:22:14] * revi goes to play game now [12:22:15] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 48.64 seconds [12:22:17] revi: https://commons.wikimedia.org/wiki/File:IZBROKEIT.png [12:22:17] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 49.18 seconds [12:22:19] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 46.72 seconds [12:22:20] revi: thanks, I should have clicked on the phab link :) [12:22:30] jynus: thanks and sorry for the noise :) [12:22:30] Hauskatze: exactly [12:22:33] dcausse: https://commons.wikimedia.org/wiki/File:IZBROKEIT.png [12:22:39] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 38.35 seconds [12:22:47] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 39.21 seconds [12:22:49] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 38.17 seconds [12:22:56] actually I was playing game until I got jouncebot notice [12:22:57] LOL [12:23:07] revi: what do you play? 
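As an aside on the "please test on mwdebug1002" step that recurs in these SWATs: the debug backend is usually selected with the WikimediaDebug browser extension, but the same check can be scripted by sending the X-Wikimedia-Debug header. A rough sketch, with the header value written as it was commonly used around this time (double-check the current format against the WikimediaDebug documentation before relying on it):

    # Route one request through the debug appserver instead of the main pool.
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      'https://ko.wikipedia.org/wiki/Special:BlankPage'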
[12:23:22] ok moving forward with my patch [12:23:47] I was playing some sort of weird game titled Blood and Soul [12:23:53] Blade and Soul* [12:23:54] not Blood [12:23:55] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 38.19 seconds [12:23:58] >_> [12:23:59] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 28.49 seconds [12:23:59] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 23.89 seconds [12:24:15] (03PS5) 10Jbond: update the offboard-user script so that it also checks absent users [puppet] - 10https://gerrit.wikimedia.org/r/484276 [12:24:43] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:24:43] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [12:24:45] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [12:24:59] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.50 seconds [12:25:06] (03PS2) 10DCausse: [cirrus] Start writing to psi & omega (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484388 (https://phabricator.wikimedia.org/T210381) [12:25:08] (03PS22) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [12:25:10] (03PS24) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [12:27:04] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484388 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:27:11] (03CR) 10Jbond: [C: 03+2] update the offboard-user script so that it also checks absent users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484276 (owner: 10Jbond) [12:28:07] (03Merged) 10jenkins-bot: [cirrus] Start writing to psi & omega (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484388 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:29:27] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Broken elasticsearch-prometheus-exporter service on logstash nodes after reboot - https://phabricator.wikimedia.org/T210597 (10Mathew.onipe) @MoritzMuehlenhoff can you verify that prometheus-elasticsearch-exporter.service no longer fails and i... 
[12:33:15] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Start writing to psi & omega (take 2) (1/2) (duration: 00m 45s) [12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:18] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [12:35:12] (03CR) 10jenkins-bot: [cirrus] Start writing to psi & omega (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484388 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:36:33] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T210381: [cirrus] Start writing to psi & omega (take 2) (2/2) (duration: 00m 45s) [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:47] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [12:40:24] the last patch deployed affects CirrusSearch writes, I'm going to monitor this a bit more before closing SWAT [12:40:43] 10Operations, 10Operations-Software-Development: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03jbond [12:43:57] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.69 seconds [12:44:15] (03PS1) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 [12:44:39] (03CR) 10jerkins-bot: [V: 04-1] rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [12:47:50] !log EU SWAT done [12:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:16] (03PS2) 10Elukey: Apply -R 200 to memcached on mc1024 [puppet] - 10https://gerrit.wikimedia.org/r/484401 (https://phabricator.wikimedia.org/T208844) [12:59:33] (03CR) 10Elukey: [C: 03+2] Apply -R 200 to memcached on mc1024 [puppet] - 10https://gerrit.wikimedia.org/r/484401 (https://phabricator.wikimedia.org/T208844) (owner: 10Elukey) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1300) [13:00:47] !log restart memcached on mc1024 to pick up new settings (-R 200) - T208844 [13:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:52] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [13:10:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484418 (https://phabricator.wikimedia.org/T85757) [13:12:39] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484418 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [13:14:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484418 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [13:14:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484418 (https://phabricator.wikimedia.org/T85757) (owner: 10Marostegui) [13:15:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1119 T85757 (duration: 00m 46s) [13:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
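For readers not steeped in the depool/repool churn above: each of these is an ordinary mediawiki-config commit that adjusts replica weights in wmf-config/db-eqiad.php and is then synced out, typically with something like `scap sync-file wmf-config/db-eqiad.php 'Depool db1119 T85757'`, which is what produces the !log lines. The shape of a depool is roughly the following; the section label, host names and weights here are placeholders rather than the real topology:

    // wmf-config/db-eqiad.php -- illustrative excerpt only
    'sectionLoads' => [
        'sX' => [
            'db1234' => 0,     // section master (weight 0)
            'db1111' => 300,
            // 'db1119' => 400, // depooled for schema change - T85757
            'db1112' => 400,
        ],
    ],

Once the schema change finishes, the revert commit simply restores the commented-out line and the host is repooled with the next sync.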
[13:15:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [13:15:29] !log Deploy schema change on db1119 - T85757 [13:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] (03PS2) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 [13:16:54] (03CR) 10jerkins-bot: [V: 04-1] rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [13:18:33] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484419 [13:19:39] (03PS3) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 [13:20:25] (03CR) 10jerkins-bot: [V: 04-1] rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [13:21:53] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.08 seconds [13:24:09] 10Operations, 10DBA, 10Performance-Team: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (10Marostegui) a:05aaron→03Marostegui I have been talking to @Joe and we have decided, just be on the safe side. We will increase the TTL just by 2 days, and ma... [13:25:44] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) > The reason this is a dedicated service is the language it is written in (typescript), which was chosen because it allows us to cr... [13:29:41] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.01 seconds [13:29:43] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.36 seconds [13:29:45] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.18 seconds [13:29:53] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.12 seconds [13:30:07] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.15 seconds [13:30:11] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.99 seconds [13:30:13] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.62 seconds [13:30:56] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484419 (owner: 10Marostegui) [13:32:32] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484419 (owner: 10Marostegui) [13:33:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1119 T85757 (duration: 00m 46s) [13:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [13:35:28] (03PS1) 10Volans: cookbooks.sre.hosts: improve upgrade-and-reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/484422 (https://phabricator.wikimedia.org/T205886) [13:35:40] (03PS4) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 
10https://gerrit.wikimedia.org/r/484414 [13:36:10] 10Operations: prometheus-node-exporter - invalid group: ‘prometheus:prometheus' - https://phabricator.wikimedia.org/T167245 (10fgiunchedi) 05Open→03Invalid Fixed in {T158968} [13:36:27] (03CR) 10jerkins-bot: [V: 04-1] rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [13:38:39] (03PS5) 10Marostegui: site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) [13:39:07] (03CR) 10Hashar: "We are excluding modules/rsync in rubocop configuration ( /.rubocop.yml) but our rake task passes affected files explicitly and thus they " [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [13:39:40] (03CR) 10Marostegui: [C: 03+2] site.pp: Convert dbstore1003 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/484357 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [13:40:04] (03PS2) 10GTirloni: wmcs::nfs::misc - Remove wmcs-root from admin groups [puppet] - 10https://gerrit.wikimedia.org/r/484260 (https://phabricator.wikimedia.org/T209527) [13:40:45] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484419 (owner: 10Marostegui) [13:41:00] (03CR) 10Arturo Borrero Gonzalez: "> We are excluding modules/rsync in rubocop configuration (" [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [13:42:01] arturo: hi :) and I am pretty sure my rubocop fix up is right https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484410/ :D [13:42:12] but we probably want Giuseppe to have a look at it [13:42:31] cool, no rush, I'm polishing my code meanwhile, thanks hashar [13:44:44] arturo: we also have sync::quickdatacopy which might be similar to your code :) [13:44:46] (03PS9) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [13:44:59] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [13:45:22] arturo: we use that for copying Phabricator repositories between from the active server to the passive/hotspare one [13:45:44] hashar: but you run the command manually? [13:46:07] no, I see the cron there [13:46:11] arturo: it has a cron minutes => */10 [13:46:17] but I don't think it does it over ssh [13:46:25] I see [13:46:26] might setup a rsync daemon [13:46:32] anyway, that looks similar [13:46:34] yes, we can probably converge the code [13:46:37] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Remove wmcs-root from admin groups [puppet] - 10https://gerrit.wikimedia.org/r/484260 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [13:47:20] I like using SSH so we don't have another port open, a bit more secure auth, etc.. [13:47:49] I guess rsync::quickdatacopy just creates ferm rule but otherwise runs unauthenticated. Pure wild guess though [13:48:03] mutante would know more, he pointed me at rsync::quickdatacopy a few weeks ago [13:48:49] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) [13:50:45] mmmm I really would like to avoid having these 2 similar codes... 
but I also would really like to use SSH [13:51:13] (03PS1) 10GTirloni: Revert "wmcs::nfs::misc - Remove wmcs-root from admin groups" [puppet] - 10https://gerrit.wikimedia.org/r/484428 (https://phabricator.wikimedia.org/T209527) [13:51:39] also, I need the timer to be configurable, I will use this cronjob for ~50GB data, which can't be transferred in 10 minutes [13:53:08] !log Downtime db1115 and es1019 for 4 hours - T196726 T213422 [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 [13:53:13] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 [13:53:32] 10Operations, 10Patch-For-Review, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10fgiunchedi) [13:57:17] (03PS18) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [13:57:58] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/certcentral] - 10https://gerrit.wikimedia.org/r/484429 [13:58:07] (03PS1) 10Jbond: Ensure debdeploy exits cleanly when called without any arguments [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) [13:58:24] (03CR) 10GTirloni: [C: 03+2] Revert "wmcs::nfs::misc - Remove wmcs-root from admin groups" [puppet] - 10https://gerrit.wikimedia.org/r/484428 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [13:59:26] (03CR) 10jerkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [software/certcentral] - 10https://gerrit.wikimedia.org/r/484429 (owner: 10Hashar) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1400) [14:00:40] (03PS5) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 [14:01:09] (03CR) 10jerkins-bot: [V: 04-1] rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [14:01:15] (03Abandoned) 10Arturo Borrero Gonzalez: rsync: introduce cold-standby synchronization [puppet] - 10https://gerrit.wikimedia.org/r/484414 (owner: 10Arturo Borrero Gonzalez) [14:04:56] 10Operations, 10Citoid, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10CDanis) `bast1001` doesn't exist anymore. You should have access to `bast1002` and the others, though. https://wikitech.wikimedi... 
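A sketch of the cron-plus-rsync-over-SSH shape arturo describes above; this is an illustration only, not the abandoned 484414 change, and the class name, paths, remote user and schedule are all assumptions:

    # Push local data to a cold-standby host over SSH on a configurable schedule.
    class profile::coldstandby_sync (
      String  $dest_host = 'standby1001.example.wmnet',
      String  $src_path  = '/srv/data/',
      String  $dest_path = '/srv/data/',
      Integer $hour      = 3,   # daily; ~50GB will not reliably fit in a */10 window
    ) {
      cron { 'coldstandby-rsync':
        ensure  => present,
        user    => 'root',
        command => "/usr/bin/rsync -a --delete -e ssh ${src_path} sync@${dest_host}:${dest_path}",
        hour    => $hour,
        minute  => 0,
      }
    }

Compared with what hashar guesses rsync::quickdatacopy does (an rsync daemon plus a ferm rule), this keeps the transfer on port 22 and reuses key-based auth, at the cost of managing the SSH key pair; the interval is a parameter precisely because of the 50GB point raised above.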
[14:10:36] 10Operations: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (10CDanis) p:05Triage→03Normal [14:11:56] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/certcentral] - 10https://gerrit.wikimedia.org/r/484429 (owner: 10Hashar) [14:13:16] (03PS3) 10Arturo Borrero Gonzalez: toolforge: refactor docker registry profile [puppet] - 10https://gerrit.wikimedia.org/r/483765 (https://phabricator.wikimedia.org/T213418) [14:14:17] !log rebooting acamar [14:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:40] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:17:01] (03CR) 10Volans: "Are all subparsing groups required?" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [14:18:29] (03PS1) 10Volans: sre.host: add Icinga downtime cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) [14:20:14] (03CR) 10Volans: sre.host: add Icinga downtime cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/484432 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [14:22:26] (03CR) 10Hashar: "I am not sure what I have messed up, but it does not seem to pull images anymore :((((((((" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) (owner: 10Hashar) [14:23:41] ottomata: thanks for your input on T213081 ! I'm here if you'd like to chat more real time [14:23:41] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [14:24:55] godog: ya am here! [14:28:21] ottomata: sweet! essentially if https://gerrit.wikimedia.org/r/c/operations/puppet/+/484226 looks ok now or we should shoot for more/less partitions from the get go [14:29:04] also whether we have documentation/procedures around increasing partitions on existing topics, e.g. if it is transparent to consumers/producers or we'll need to bounce them [14:29:30] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table querycachetwo: try to repair it on query. Default database: atjwiki. [Query snipped] [14:30:20] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:30:44] marostegui: --^ [14:30:54] (03PS10) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [14:31:05] godog: if you don't need any ordering guaruntees and have multiplee consumers consuming from the same topic [14:31:09] a few more partitions won't hurt :) [14:31:25] 3 or 6, doesn't really matter I think [14:31:49] it might be easier to reason about what is leader/consumer where with 3, but 6 isn't really that much harder [14:32:09] anomie: thanks for checking, it was just a heads up just in case [14:32:20] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10User100100) This is my speculation. I think that the root cause is bug T210739. 
There is an edit history merge request in this file's (Juan Guaidó.jpg) talk page. Someone has tried to merge edit history, bu... [14:32:23] No problem [14:32:56] elukey: another breakage? [14:33:28] ottomata: yeah order not a problem ATM at least since we're ingesting into elasticsearch anyways, ok I'll change to 3 for now! [14:33:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove zoterov2 RRs [dns] - 10https://gerrit.wikimedia.org/r/482807 (owner: 10Alexandros Kosiaris) [14:34:01] (03PS2) 10Alexandros Kosiaris: Remove zoterov2 RRs [dns] - 10https://gerrit.wikimedia.org/r/482807 [14:34:56] elukey: I have fixed it [14:35:09] <3 [14:35:15] (03CR) 10Ottomata: [C: 03+1] "+1 3 or 6 partitions is fine." [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [14:35:15] we need to move to innodb asap [14:35:34] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:35:54] marostegui: ready when the dump finishes and you run your alters :) [14:36:00] <2 [14:36:02] <3 [14:36:50] (03PS1) 10Thcipriani: Merge tag 'v2.15.8' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 [14:37:38] (03PS2) 10Filippo Giunchedi: hieradata: increase default kafka partitions for logging cluster [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) [14:38:45] (03PS3) 10Filippo Giunchedi: hieradata: increase default kafka partitions for logging cluster [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) [14:39:00] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1103 from s2 and s4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484409 (owner: 10Jcrespo) [14:39:52] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: increase default kafka partitions for logging cluster [puppet] - 10https://gerrit.wikimedia.org/r/484226 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [14:40:19] godog: what do you think of https://phabricator.wikimedia.org/T213561 [14:40:20] ? [14:40:45] !log fdans@deploy1001 Started deploy [analytics/superset/deploy@408a30e]: deploying 0.26.3-wikimedia1 [14:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:47] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1103 from s2 and s4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484409 (owner: 10Jcrespo) [14:41:03] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1103 from s2 and s4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484409 (owner: 10Jcrespo) [14:41:10] 10Operations, 10Citoid, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) Whoops, thanks :). Works. >>! In T213269#4880939, @CDanis wrote: > `bast1001` doesn't exist anymore. You should have... 
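On godog's earlier question about growing partition counts on existing topics: the count can only ever be increased, existing messages are not moved, and consumers in a group pick up the new partitions at their next rebalance/metadata refresh; the main caveat is that keyed producers will start mapping keys to different partitions, which only matters if per-key ordering is relied on (it isn't here, per ottomata). With the stock tooling of that Kafka generation the operation looks roughly like this; the topic name and ZooKeeper connection string are placeholders:

    # Grow an existing topic to 3 partitions (this cannot be undone).
    kafka-topics.sh --zookeeper zk1001.example.wmnet:2181/kafka/logging-eqiad \
      --alter --topic logstash-info --partitions 3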
[14:41:19] !log fdans@deploy1001 Finished deploy [analytics/superset/deploy@408a30e]: deploying 0.26.3-wikimedia1 (duration: 00m 36s) [14:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:58] (03PS1) 10Vgutierrez: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] - 10https://gerrit.wikimedia.org/r/484438 (https://phabricator.wikimedia.org/T213820) [14:42:44] (03PS2) 10Vgutierrez: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] - 10https://gerrit.wikimedia.org/r/484438 (https://phabricator.wikimedia.org/T213820) [14:46:03] (03PS1) 10Filippo Giunchedi: profile: introduce kafka::broker num_partitions [puppet] - 10https://gerrit.wikimedia.org/r/484440 (https://phabricator.wikimedia.org/T213081) [14:47:04] ottomata: seems sensible to me, I'll comment on the task [14:47:11] (03CR) 10Ottomata: [C: 03+1] Configure analytics1028->41 as Hadoop Analytics test cluster [puppet] - 10https://gerrit.wikimedia.org/r/484374 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [14:47:17] thanks [14:48:37] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10fgiunchedi) Discovery records for kafka would come handy in the logging pipeline case too, namely during datacenter failover to move producers off a given datacen... [14:49:11] (03PS1) 10Vgutierrez: certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] - 10https://gerrit.wikimedia.org/r/484442 (https://phabricator.wikimedia.org/T213820) [14:50:08] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10MoritzMuehlenhoff) [14:54:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14340/" [puppet] - 10https://gerrit.wikimedia.org/r/484440 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [14:54:40] trivial enough, cc elukey ottomata ^ [14:54:48] my bad for not running pcc on the previous change heh [14:55:54] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 53.85 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:56:51] !log fdans@deploy1001 Started deploy [analytics/superset/deploy@9d6156a]: reverting deploy of 0.26.3-wikimedia1 [14:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:53] godog: looks good, running pcc on kafka1001 and kafka-jumbo1001 as paranoid check [14:57:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 98.09 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:57:11] (03CR) 10DCausse: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:58:53] (03CR) 10Elukey: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/14342/ - lovely" [puppet] - 10https://gerrit.wikimedia.org/r/484440 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [14:59:06] (03PS1) 10Alexandros Kosiaris: Remove utf8 char from geo-maps [dns] - 10https://gerrit.wikimedia.org/r/484445 [14:59:25] elukey: sweet, thanks! 
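The companion puppet change above ("profile: introduce kafka::broker num_partitions") targets the broker-side default instead, i.e. Kafka's num.partitions setting that applies to newly auto-created topics. In hiera that ends up as a one-line override along these lines; the key name and file path are guesses from the change subject, not copied from the patch:

    # hieradata override for the logging brokers (path and key are assumptions)
    profile::kafka::broker::num_partitions: 3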
[14:59:42] (03PS2) 10Filippo Giunchedi: profile: introduce kafka::broker num_partitions [puppet] - 10https://gerrit.wikimedia.org/r/484440 (https://phabricator.wikimedia.org/T213081) [14:59:44] (03PS8) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [15:00:16] (03CR) 10Alex Monk: [C: 03+2] certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] - 10https://gerrit.wikimedia.org/r/484438 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:00:54] (03Abandoned) 10Alex Monk: Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (https://phabricator.wikimedia.org/T207373) (owner: 10Alex Monk) [15:01:17] (03PS2) 10Zoranzoki21: Update groupOverrides for srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484187 (https://phabricator.wikimedia.org/T213679) [15:01:27] (03PS2) 10Zoranzoki21: Update groupOverrides for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484195 (https://phabricator.wikimedia.org/T213684) [15:01:42] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103 (duration: 00m 48s) [15:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:07] (03Merged) 10jenkins-bot: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] - 10https://gerrit.wikimedia.org/r/484438 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:02:12] (03CR) 10Alex Monk: [C: 03+2] certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] - 10https://gerrit.wikimedia.org/r/484442 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:02:23] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: introduce kafka::broker num_partitions [puppet] - 10https://gerrit.wikimedia.org/r/484440 (https://phabricator.wikimedia.org/T213081) (owner: 10Filippo Giunchedi) [15:02:55] !log fdans@deploy1001 Finished deploy [analytics/superset/deploy@9d6156a]: reverting deploy of 0.26.3-wikimedia1 (duration: 06m 06s) [15:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:02] (03PS6) 10Zoranzoki21: Update groupOverrides for srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055) [15:03:51] (03CR) 10jenkins-bot: certcentral: Bump acme to the latest version shipped in stretch-backports [software/certcentral] - 10https://gerrit.wikimedia.org/r/484438 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:03:57] (03Merged) 10jenkins-bot: certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] - 10https://gerrit.wikimedia.org/r/484442 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:05:41] (03CR) 10jenkins-bot: certcentral: Bump josepy to the latest version shipped in stretch-bp [software/certcentral] - 10https://gerrit.wikimedia.org/r/484442 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez) [15:06:37] (03CR) 10Vgutierrez: "replied inline" (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:09:14] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 
10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) [15:10:11] !log fdans@deploy1001 Started deploy [analytics/superset/deploy@UNKNOWN]: reverting deploy of 0.26.3-wikimedia1 [15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:43] !log fdans@deploy1001 Finished deploy [analytics/superset/deploy@UNKNOWN]: reverting deploy of 0.26.3-wikimedia1 (duration: 00m 32s) [15:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:33] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Joe) Sorry, I need some more specifics: you want to make a dns query, and get as a response the "nearest" kafka cluster in the form of a list of hostnames/ports?... [15:13:06] (03CR) 10BBlack: [C: 03+1] Remove utf8 char from geo-maps [dns] - 10https://gerrit.wikimedia.org/r/484445 (owner: 10Alexandros Kosiaris) [15:13:34] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) No no for me, all I want is an alias for the list of Kafka brokers in a given Kafka cluster. I don't need any DC failover stuff. Perhaps discovery is... [15:13:57] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) [15:14:12] (03CR) 10Alex Monk: certcentral: Allow specifying authorized hosts and regex in the config (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:19:52] (03PS9) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [15:19:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Logic overall LGTM, but see my comments about the code." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:20:13] (03CR) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:20:17] (03CR) 10Muehlenhoff: [C: 03+1] "I haven't read any of the discussions in PS1-PS33, but PS34 looks good to me. 
Maybe before merging additionally cherrypick this patch to a" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:20:52] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:38] (03CR) 10Alex Monk: [C: 03+2] certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:21:56] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 75587 bytes in 0.138 second response time [15:22:23] (03PS1) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [15:22:39] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [15:23:21] (03Merged) 10jenkins-bot: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:23:35] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484499 [15:23:37] (03PS1) 10Zoranzoki21: Update groupOverrides for srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484500 (https://phabricator.wikimedia.org/T213824) [15:23:57] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484499 (owner: 10Zoranzoki21) [15:24:04] (03PS2) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [15:24:10] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.49 seconds [15:24:22] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.58 seconds [15:24:24] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.18 seconds [15:24:30] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.12 seconds [15:24:32] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.46 seconds [15:24:46] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.31 seconds [15:24:56] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.09 seconds [15:25:02] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.14 seconds [15:25:07] (03CR) 10jenkins-bot: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [15:25:59] (03PS3) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [15:26:13] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 
(10Joe) Might I suggest that you use a SRV dns record instead? It's more appropriate for enumerating members in a cluster. We use those for etcd discovery. [15:26:15] 10Operations, 10media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (10User100100) By the way, if I'm right with my comment... currently you can delete any of Wikimedia Commons's file without any traces in MediaWiki's delete log. Of course you have to be an administrator in Wik... [15:32:16] (03PS1) 10Zoranzoki21: Update groupOverrides for srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484501 (https://phabricator.wikimedia.org/T213828) [15:33:50] PROBLEM - puppet last run on analytics-tool1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[init_superset] [15:39:02] RECOVERY - puppet last run on analytics-tool1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:45:00] (03PS19) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [15:45:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove utf8 char from geo-maps [dns] - 10https://gerrit.wikimedia.org/r/484445 (owner: 10Alexandros Kosiaris) [15:47:06] (03CR) 10Elukey: admin: allow users to be deployed without ssh keys configured (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:47:20] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) Hi Srisha, Looks like this is also a request for production shell access, so there's a little more work for you to do... [15:48:45] (03CR) 10jerkins-bot: [V: 04-1] Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [15:49:08] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) [15:49:34] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) a:03srishakatux [15:53:12] !log T210381: elastic search clusters, catching up updates since first import on new psi&omega clusters in eqiad&codfw (from mwmaint1002) [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:15] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [15:54:08] (03PS20) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [15:55:02] (03PS36) 10Elukey: admin: allow users to be deployed without ssh keys configured [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) [15:58:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall lgtm. Let's bring this to production as soon as possible." 
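(For reference on the two DNS shapes discussed for T213561 earlier in this exchange: Joe's SRV suggestion enumerates host and port per broker, while the alias Ottomata asked for is just a round-robin name resolving to every broker address. A minimal sketch of what querying each would return; the record names and broker hostnames are invented for illustration, and note that stock Kafka clients take plain host:port pairs in bootstrap.servers, so SRV answers would still have to be expanded by the producer/consumer tooling.)

    # hypothetical round-robin alias: one name returning every broker address
    dig +short A kafka-logging-eqiad.wmnet
    #   10.64.0.11
    #   10.64.16.12
    #   10.64.32.13

    # hypothetical SRV form: enumerates host *and* port per broker (as with etcd)
    dig +short SRV _kafka._tcp.logging-eqiad.eqiad.wmnet
    #   0 0 9092 broker1001.eqiad.wmnet.
    #   0 0 9092 broker1002.eqiad.wmnet.
    #   0 0 9092 broker1003.eqiad.wmnet.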
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [16:00:58] !log stop es1019 for hw maintenance T213422 [16:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:01] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 [16:01:43] (03CR) 10Elukey: "Amended the code after Joe's and Andrew's comments. New PCC runs:" [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [16:03:10] (03CR) 10Fsero: [C: 03+2] "Thanks for the review _joe_ :)" [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [16:03:19] 10Operations, 10netops: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (10ayounsi) a:03Papaul Papaul, can you please verify the status of msw-d7-codfw and its link to msw1-codfw, and replace any faulty part if necessary. Thank you. [16:04:39] (03PS1) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [16:05:09] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [16:08:04] (03PS1) 10CDanis: Add Holger Knust to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/484506 (https://phabricator.wikimedia.org/T213812) [16:09:32] (03PS2) 10CDanis: Add Holger Knust to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/484506 (https://phabricator.wikimedia.org/T213812) [16:09:38] (03CR) 10CDanis: [C: 03+2] Add Holger Knust to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/484506 (https://phabricator.wikimedia.org/T213812) (owner: 10CDanis) [16:21:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. There's a pull request for cpython to revert to the Python 2 behaviour by default, but that hasn't seen much traction..." [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/484430 (https://phabricator.wikimedia.org/T207845) (owner: 10Jbond) [16:23:08] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [16:33:32] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:35:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) Waiting for Chris to be available to fully shutdown it (as otherwise I wouldn't be able to put it back up). @Cmjohnson Please see if there... 
[16:38:22] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:39:23] 10Operations: Offboard Balazs - https://phabricator.wikimedia.org/T213703 (10jbond) [16:43:14] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:43:55] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10jcrespo) @Cmjohnson The most likely scenario is that we move the dimm and we keep detecting 96GB of ram, and then we will ask you to ask for a replacemen... [16:44:26] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:47:33] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:35] !log roll-restart kafka-logging in eqiad to apply new topic defaults - T213081 [16:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:38] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [16:52:42] !log stop db1115 for hw maintenance [16:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:46] (03PS1) 10Ottomata: Add round robin DNS records for Kafka clusters [dns] - 10https://gerrit.wikimedia.org/r/484509 (https://phabricator.wikimedia.org/T213561) [16:53:37] dbtree and tendril are likely to go down during maintenance [16:54:37] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 281 bytes in 0.111 second response time [16:54:58] (03PS1) 10Fsero: docker_registry_ha: package installation missing [puppet] - 10https://gerrit.wikimedia.org/r/484510 [16:55:02] (03PS1) 10Vgutierrez: Release 0.8 [software/certcentral] - 10https://gerrit.wikimedia.org/r/484511 (https://phabricator.wikimedia.org/T209980) [16:55:10] ^those are to be expected, they went down faster than I downtime ed [16:55:29] (03CR) 10Fsero: "also removed extra whitespaces" [puppet] - 10https://gerrit.wikimedia.org/r/484510 (owner: 10Fsero) [16:57:19] !log move cr1-eqiad:xe-3/3/1 to xe-4/1/3 - T212791 [16:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:22] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [16:58:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:58:37] expected ^ [17:00:05] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:25] <_joe_> oh good, I'm off now :P [17:01:22] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10Cmjohnson) racadm SEL root@db1115.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel Record: 1 Date/Time: 01/18/2018 11:23:25 Source: sys... [17:01:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:02:51] (03PS2) 10Fsero: docker_registry_ha: package installation missing [puppet] - 10https://gerrit.wikimedia.org/r/484510 [17:03:58] (03PS1) 10WMDE-Fisch: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) [17:04:25] RECOVERY - Host ps1-d7-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.56 ms [17:04:31] RECOVERY - Host elastic2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.68 ms [17:04:49] RECOVERY - Host elastic2053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.42 ms [17:05:45] RECOVERY - Host ms-be2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.34 ms [17:06:06] 10Operations, 10ops-eqiad, 10DBA: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues - https://phabricator.wikimedia.org/T196726 (10Cmjohnson) i swapped the dimm from a2 to b2 and cleared the log. Please put back in the rotation and let's see if and where the error occurs. [17:06:07] PROBLEM - Host db1115.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:06:19] RECOVERY - Host cp2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms [17:06:20] 10Operations, 10netops: mgmt host interfaces down for rack D7 in codfw due to ge-0/0/30 on msw1-codfw down - https://phabricator.wikimedia.org/T213790 (10Papaul) 05Open→03Resolved looks like the mgmt switch froze have to unplug and plug the power back. Switch is back up [17:06:21] RECOVERY - Host cp2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.52 ms [17:06:21] ^that is expected, hw maint going on [17:06:21] RECOVERY - Host ms-be2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.19 ms [17:06:22] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) @fgiunchedi: Thanks for updating about the ms-be systems! I see you added they can be gracefully powered down, can we just power them back up and ensure puppet runs post... [17:06:25] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10bd808) This comment is my manager level approval for these rights [17:06:41] !log move back cr1-eqiad:xe-4/1/3 to xe-3/3/1 - T212791 [17:06:43] RECOVERY - Host cp2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [17:06:43] RECOVERY - Host cp2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [17:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:44] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [17:07:31] 10Operations, 10ops-eqiad, 10Analytics, 10DBA: swap a2-eqiad PDU with on-site spare - https://phabricator.wikimedia.org/T213748 (10RobH) >>! In T213748#4881709, @RobH wrote: > @fgiunchedi: Thanks for updating about the ms-be systems! 
I see you added they can be gracefully powered down, can we just power t... [17:08:33] RECOVERY - Host ms-be2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [17:08:35] RECOVERY - Host ms-be2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [17:09:33] RECOVERY - Host ms-be2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [17:10:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:11:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:11:25] RECOVERY - Host db1115.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:13:01] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 52208 bytes in 2.862 second response time [17:13:36] !log set partitions to 3 for existing kafka-logging topics - T213081 [17:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:39] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [17:16:05] (03PS4) 10Ottomata: Add kafka-single-node chart for local development [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) [17:18:07] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Cmjohnson) @Eevans The error has not returned, I cannot say with 100% certainty that it will not return but for now please take the server back and do what... [17:19:12] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, and 2 others: Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10bmansurov) [17:21:02] (03CR) 10Thiemo Kreuz (WMDE): "I wonder if it is better to list groups of wikis instead?" 
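(The two Kafka steps logged above — a roll-restart to apply new topic defaults, then setting existing topics to 3 partitions — reflect that a broker-side default such as Kafka's num.partitions only applies to topics created afterwards; topics that already exist have to be altered explicitly. A rough sketch of that alter, assuming the stock kafka-topics CLI; the wrapper name, topic and ZooKeeper address here are placeholders:)

    kafka-topics --zookeeper zk1001.eqiad.wmnet:2181/kafka/logging-eqiad \
        --alter --topic some-logging-topic --partitions 3
    # the extra partitions only receive newly produced messages, and ordering is
    # then only guaranteed per partition -- acceptable here since the data is
    # being ingested into elasticsearch anyway (per the discussion earlier)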
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [17:21:21] !log depool logstash1007 before restarting logstash - T213081 [17:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:24] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [17:22:16] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [17:23:53] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.96 seconds [17:26:24] !log roll-restart logstash in eqiad - T213081 [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:27] T213081: Consider increasing kafka logging topic partitions - https://phabricator.wikimedia.org/T213081 [17:27:22] (03PS5) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) [17:28:16] (03PS2) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [17:28:32] (03CR) 10jerkins-bot: [V: 04-1] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [17:28:59] (03PS2) 10WMDE-Fisch: Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) [17:29:19] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.68 seconds [17:30:34] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) Scheduling the work on Tuesday January 22nd, 16:00UTC, scope is Amsterdam only. That gives us the remaining of the week to monitor for any issue. Then collect/ana... [17:33:54] (03CR) 10WMDE-Fisch: "> It seems there is also a list of wikis that prefer Commons uploads" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [17:36:09] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Deploy the FileExporter as a beta feature on all Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484513 (https://phabricator.wikimedia.org/T213425) (owner: 10WMDE-Fisch) [17:42:28] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Pchelolo) With request rate as low as this endpoint is expected to have, the Varnish hit rate... 
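(A sketch of the depool / restart / repool cycle behind the logstash1007 entries above; 'depool' and 'pool' are the usual conftool wrapper scripts on production hosts, but treat the exact sequence as an assumption rather than the documented runbook:)

    ssh logstash1007.eqiad.wmnet
    sudo depool                        # take the host out of the LVS pool
    sudo systemctl restart logstash    # pick up the new partition layout
    # wait for the consumer to catch up on its kafka partitions, then:
    sudo pool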
[17:42:55] RECOVERY - IPMI Sensor Status on db1107 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:43:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10hashar) [17:44:39] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10hashar) From the releng meeting, I have added `wikimedia/portals` to th... [17:45:25] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Jan-Mar 2019), 10Documentation: Add Srishti to analytics-privatedata-users - https://phabricator.wikimedia.org/T213780 (10CDanis) [17:45:51] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.57 seconds [17:46:37] 10Operations, 10Wikimedia-Mailing-lists: lists.wikimedia.org reporting "You must GET the form before submitting it" for all list subscription attempts - https://phabricator.wikimedia.org/T185222 (10Tomthirteen) Hi, I am using Windows 7, browsers like Chrome and Mozilla. Thank you, Tom [17:47:09] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.60 seconds [17:47:24] (03PS3) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [17:47:40] (03CR) 10jerkins-bot: [V: 04-1] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [17:47:51] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.37 seconds [17:48:03] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.39 seconds [17:48:05] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.12 seconds [17:48:17] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.79 seconds [17:48:21] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.63 seconds [17:48:37] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.74 seconds [17:50:03] PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Host cp1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Host analytics1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Host elastic1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:39] PROBLEM - Host dbproxy1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:53] PROBLEM - Host rdb1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:50:55] PROBLEM - Host graphite1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:10] 10Operations, 10netops: network device audit - https://phabricator.wikimedia.org/T213843 (10RobH) p:05Triage→03High [17:51:13] PROBLEM - Host analytics1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:15] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:51:17] 10Operations, 10netops: Juniper network device audit - all sites - 
https://phabricator.wikimedia.org/T213843 (10RobH) [17:51:23] PROBLEM - Host kubernetes1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:23] PROBLEM - Host analytics1056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:23] PROBLEM - Host analytics1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:27] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:51:31] XioNoX: working on A3? [17:51:39] PROBLEM - Juniper alarms on asw-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:51:52] elukey: no, Chris said we lost one power phase [17:51:55] PROBLEM - Host dbstore1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:55] PROBLEM - Host restbase1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:55] PROBLEM - Host restbase1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:55] PROBLEM - Host analytics1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:55] PROBLEM - Host analytics1054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:56] PROBLEM - Host analytics1052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:59] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.47 seconds [17:52:09] PROBLEM - Host analytics1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:14] and probably the mgmt switch went down [17:52:15] PROBLEM - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:33] PROBLEM - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:37] I can see [17:52:38] Jan 15 17:48:14 msw1-eqiad chassism[1399]: ifd_process_flaps IFD: ge-0/0/5, sent flap msg to RE, Downstate [17:52:40] volans: yep, that's the only impact I can see, rught? [17:53:01] PROBLEM - Host db1103.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:13] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.63 seconds [17:53:21] XioNoX: so far yes, ps1-a3-eqiad [17:53:29] (03CR) 10Gehel: [C: 03+2] wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484246 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [17:53:29] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.80 seconds [17:53:35] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [17:53:42] is the replication lag related? 
[17:53:55] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.75 seconds [17:54:05] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.90 seconds [17:54:07] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.28 seconds [17:54:11] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.61 seconds [17:54:13] PROBLEM - MariaDB Slave Lag: s2 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.66 seconds [17:54:16] it shouldn't as it's codfw [17:54:23] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.26 seconds [17:54:23] PROBLEM - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.42 seconds [17:54:23] PROBLEM - MariaDB Slave Lag: s2 on db2063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.13 seconds [17:54:25] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.62 seconds [17:54:25] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.82 seconds [17:54:27] PROBLEM - MariaDB Slave Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.41 seconds [17:54:43] PROBLEM - MariaDB Slave Lag: s2 on db2041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.92 seconds [17:54:50] the timing sure seems odd [17:55:04] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, 10Security-Team: [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10ovasileva) [17:55:19] oh those codfw slave lag alerts have been going on for a while before, they're just repeating [17:55:21] RECOVERY - Host cloudservices1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [17:55:24] afaik, no servers were harmed during that power outage [17:55:49] bblack: yep, what I was about to say :) [17:55:49] RECOVERY - Host analytics1054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [17:55:49] RECOVERY - Host analytics1057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [17:55:49] RECOVERY - Host analytics1052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [17:55:49] RECOVERY - Host analytics1053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [17:55:55] RECOVERY - Host elastic1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:55:55] RECOVERY - Host cp1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [17:55:55] RECOVERY - Host analytics1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:55:57] RECOVERY - Host dbproxy1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [17:56:01] RECOVERY - Host analytics1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:56:09] RECOVERY - Host analytics1060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [17:56:11] RECOVERY - Host rdb1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [17:56:13] RECOVERY - Host graphite1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [17:56:18] no servers went down ...just the mgmt switch ...it's not redundant [17:56:38] yeah, no big deal [17:56:41] RECOVERY - Host analytics1056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [17:56:41] RECOVERY - Host kubernetes1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [17:57:01] (03PS4) 10ArielGlenn: write header/body/footer of 
xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [17:57:13] RECOVERY - Host dbstore1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [17:57:13] RECOVERY - Host restbase1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [17:57:13] RECOVERY - Host restbase1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [17:57:20] (03CR) 10jerkins-bot: [V: 04-1] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [17:57:31] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:57:35] RECOVERY - Host dbproxy1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [17:57:51] RECOVERY - Host ganeti1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [17:58:11] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:58:19] RECOVERY - Host db1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [17:58:29] (03PS1) 10Gehel: Revert "wdqs: prometheus-blazegraph-exporter supports multi instances" [puppet] - 10https://gerrit.wikimedia.org/r/484515 [17:59:13] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational [17:59:28] (03CR) 10Gehel: [C: 03+2] Revert "wdqs: prometheus-blazegraph-exporter supports multi instances" [puppet] - 10https://gerrit.wikimedia.org/r/484515 (owner: 10Gehel) [18:00:03] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1800). 
[18:00:13] no parsoid deploy today [18:01:47] RECOVERY - MariaDB Slave Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 56.60 seconds [18:02:03] RECOVERY - MariaDB Slave Lag: s2 on db2041 is OK: OK slave_sql_lag Replication lag: 11.88 seconds [18:02:24] (03PS5) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [18:02:39] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:47] RECOVERY - MariaDB Slave Lag: s2 on db2091 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:57] RECOVERY - MariaDB Slave Lag: s2 on db2056 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:59] RECOVERY - MariaDB Slave Lag: s2 on db2063 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [18:02:59] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:08:29] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational [18:08:33] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [18:08:45] jouncebot: now [18:08:45] For the next 0 hour(s) and 51 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1800) [18:08:48] jouncebot: next [18:08:48] In 0 hour(s) and 51 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1900) [18:08:57] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:09:09] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [18:09:23] (03PS1) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484521 (https://phabricator.wikimedia.org/T213234) [18:11:20] (03PS1) 10Reedy: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) [18:13:19] (03PS2) 10EBernhardson: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) [18:13:24] (03PS2) 10Reedy: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) [18:14:09] (03CR) 10jerkins-bot: [V: 04-1] frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) (owner: 10Reedy) [18:14:35] (03PS3) 10Reedy: frwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484523 (https://phabricator.wikimedia.org/T213848) [18:16:01] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 58.07 seconds [18:16:19] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 15.29 seconds [18:16:23] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 15.15 seconds [18:16:23] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 9.19 seconds [18:16:23] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 9.27 seconds [18:16:41] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [18:16:43] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag 
Replication lag: 0.23 seconds [18:16:47] (03PS1) 10Volans: dns: fix logging message [software/spicerack] - 10https://gerrit.wikimedia.org/r/484524 [18:16:49] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [18:17:13] PROBLEM - IPMI Sensor Status on cloudservices1004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] [18:18:23] PROBLEM - IPMI Sensor Status on analytics1057 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:20:23] PROBLEM - IPMI Sensor Status on dbproxy1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] [18:23:01] hmm [18:23:06] phabs not loading for me [18:23:09] PROBLEM - IPMI Sensor Status on analytics1055 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical] [18:23:24] and now it does. [18:23:35] PROBLEM - Host elastic1030 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:35] PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:37] PROBLEM - Host elastic1031 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:55] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [18:24:43] PROBLEM - Host dbproxy1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:51] PROBLEM - Host cloudservices1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:53] PROBLEM - Host prometheus1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:27] PROBLEM - Host dbstore1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:25:27] PROBLEM - Host restbase1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:25:35] hmm, oh gerrit's not working [18:25:47] PROBLEM - debmonitor.wikimedia.org on debmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:53] PROBLEM - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:25:59] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:27] thcipriani ^^ [18:26:35] PROBLEM - IPMI Sensor Status on analytics1060 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:26:53] we're having power issues in a rack [18:27:13] PROBLEM - Host dbproxy1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:27:16] Trizek: ^ obviously not going to deploy it just atm :) [18:27:23] PROBLEM - Host elastic1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:27:23] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:27:24] I am going to switchover dbproxy1001 and 2 [18:27:31] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) 05Open→03Resolved Since @Papaul says it is all done, I'll close this. No more need to track it on our side. 
[18:27:53] PROBLEM - Host elastic1031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:28:11] PROBLEM - Host prometheus1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:28:13] PROBLEM - IPMI Sensor Status on analytics1056 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical] [18:28:49] of course gerrit won't work [18:28:53] PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:28:55] robh, papaul: ^^^ I just closed T211023 since there is a comment that it is done. Feel free to reopen if you need it for some kind of tracking [18:28:56] T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 [18:28:59] bblack: can we update dns manually? [18:29:15] gerrit will not work as it needs the db for groups. [18:29:21] gehel: dont close decom tasks assigned to me ;D [18:29:25] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 38.24 seconds [18:29:25] PROBLEM - Host cp1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:29:27] those are lease tracking returns [18:29:29] its why it was open [18:29:39] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 22.14 seconds [18:29:41] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 17.61 seconds [18:29:55] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [18:29:55] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [18:29:57] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [18:30:01] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:30:04] robh: than I'll reopen and remove search from it! [18:30:15] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [18:30:15] no worries [18:30:22] yeah, remove all but decom projects is fine [18:30:24] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) 05Resolved→03Stalled I was keepign lease return decom tasks open for now, since we're uncertain what is happening with t... [18:30:26] gehel: FYI elastic103[01] affected [18:30:39] done [18:30:47] 10Operations, 10ops-codfw, 10decommission: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) [18:30:53] 10Operations, 10ops-codfw, 10decommission: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) 05Stalled→03Open It looks like @robh still need to track this. [18:30:55] gehel: all good to be clear im not worried about it! [18:30:58] =] [18:31:07] volans: reading back... [18:31:07] ie: i hope my comment didnt sound mean! [18:31:26] normally once we unrack we would 100% close the task [18:31:37] this is only strange due to being a lease return, and we have NEVER returned a leased system yet [18:31:43] ie: its an outlier [18:31:51] robh: not at all! [18:31:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10monitoring, 10Discovery-Search (Current work): upgrade prometheus-blazegraph-exporter to python3 - https://phabricator.wikimedia.org/T213305 (10Mathew.onipe) [18:31:59] =] [18:32:13] volans: something else than the .mgmt interfaces? 
I'm lost [18:32:33] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10monitoring, 10Discovery-Search (Current work): upgrade prometheus-blazegraph-exporter to python3 - https://phabricator.wikimedia.org/T213305 (10Mathew.onipe) a:03Mathew.onipe [18:32:43] PROBLEM - IPMI Sensor Status on analytics1053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:33:01] PROBLEM - IPMI Sensor Status on analytics1059 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:33:07] gehel: yes I see the two hosts down [18:33:41] oh, right, should be no issue, but checking [18:34:06] jynus: update dns manually? [18:34:11] yes [18:34:18] no gerrit available [18:34:21] PROBLEM - IPMI Sensor Status on analytics1052 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:34:25] uh ok [18:34:27] m1-master and m2-master [18:34:29] what change do you need? [18:34:38] from 1001 and 1002 to dbproxy1006 and 7 [18:34:45] (and why is gerrit borked too, but that can be dealt with after I guess) [18:34:54] that^will fix gerrit [18:35:00] bblack because groups require the db [18:35:20] but pushing through gerrit without the ui *may* work untested though [18:36:25] ok working on things [18:36:34] bblack: if you're taking over I'll stop [18:36:50] volans: was anything done alreadY/ [18:36:53] * volans was reconstructing the steps based on wikitech and the script [18:36:54] y? :) [18:36:56] bblack: nope [18:37:03] so this is on the eqiad.wmnet zone [18:37:04] was about to :) [18:37:14] CNAME m1-master [18:37:17] and m2-master [18:38:09] 10Operations, 10Discovery-Search (Current work), 10Epic: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10Mathew.onipe) [18:38:16] yeah fwiw, direct push to git master doesn't work either [18:38:37] PROBLEM - IPMI Sensor Status on analytics1054 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical] [18:38:52] edit locally on all and fix later? [18:39:22] yes please, I assumed you knew about the script more than us (authdns-update) [18:39:51] my idea was to edit locally on all authdns and find the right script to regenerate the templates and reload gdsnd [18:39:52] or any dirty way to do it [18:40:06] and once gerrit is up, I can commit the proper fix [18:40:07] !log DNS manually updated for m1-master -> dbproxy1006 and m2-master -> dbproxy1007 [18:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:08] just not totally sure about all the steps and sequence [18:40:13] thanks, bblack! 
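(In zone terms, the manual change just logged repoints two CNAMEs in eqiad.wmnet; the diff below is only a sketch — the real entries live in the operations/dns templates and the TTL shown is an assumption — but the record data matches the !log line:)

    -m1-master   1H  IN  CNAME  dbproxy1001.eqiad.wmnet.
    -m2-master   1H  IN  CNAME  dbproxy1002.eqiad.wmnet.
    +m1-master   1H  IN  CNAME  dbproxy1006.eqiad.wmnet.
    +m2-master   1H  IN  CNAME  dbproxy1007.eqiad.wmnet.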
[18:40:20] would be nice to document that on wt lateer [18:40:26] (now waiting a bit to propagate) [18:40:29] it's "fixed" now, but it will get reverted by any authdns-update, so don't touch dns till we get past this [18:40:50] gerrit is back to me [18:41:06] volans: yeah there's always supposedly been an ability to do manual updates via local git clones an dpulling from each other, but in practice I think the fallout i smessy [18:41:07] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: connect to address 10.64.32.177 and port 9001: Connection refused [18:41:32] mmm, although ssh interface still fails [18:41:37] volans: I just literally edited the live zonefiles on the 3 servers and did "gdnsdctl reload-zones" on each server, that's the least-fallout way [18:41:50] one probably need to restart gerrit, I would guess the jdbc driver is stuck with the old IP [18:41:55] no need to run gen-zones? [18:42:01] or how is called the other one [18:42:04] per hashar [18:42:13] volans: no, I edited the template outputs, not the inputs [18:42:24] ah got it [18:42:24] hashar: can you do that? [18:42:25] as in "vi /etc/gdnsd/zones/wmnet" [18:42:35] jynus: probably? :) [18:42:37] yeah seems the less messy [18:42:57] RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.004 second response time [18:43:02] !log restarted debmonitor on debmonitor1001 [18:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:08] thanks for the restarts [18:43:09] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8737 bytes in 0.066 second response time [18:43:13] etherpad should be next, checking [18:43:15] !log Restarting Gerrit to catch up with a DNS change with the database [18:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:19] volans: in theory we can do a git commit on any authdns locally and have the others pull that around, but then we have to fix git history (and even then, something else has seemed "off" to me in that area, but we're getting way out in left field) [18:43:34] ack [18:43:35] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8188 bytes in 0.013 second response time [18:43:57] oh, I didn't touched etherpad service yet, someone did? [18:44:02] mmmh I'm still not loading debmonitor from outside... I'll have a look [18:44:06] jynus: not me [18:44:15] it's kind of on the backlog somewhere to re-investigate all of that a fix it and document a procedure that's known to work and has manageable and documented fallout/recoveyr :) [18:44:17] !log [2019-01-15 18:44:06,959] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.15.6-5-g4b9c845200 ready [18:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:23] jynus: but seems to work to me [18:44:24] volans: dns load balancing is not great :-) [18:44:34] volans: confirmed on debmonitor down fyi [18:44:48] jynus: so yeah gerrit is back. The java connector does not seem to handle dns changes :/ [18:45:18] hashar: can you review? 
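(Since bblack notes the manual path isn't documented yet, a condensed sketch of the emergency procedure he describes — edit the generated zone output on each authoritative server rather than the repo templates; the two commands are taken from the messages above, the rest is assumption:)

    # on each of the 3 authdns servers:
    sudo vi /etc/gdnsd/zones/wmnet     # edit the generated zonefile directly
    sudo gdnsdctl reload-zones         # load the edit without restarting gdnsd
    # once gerrit is back: commit the same change in operations/dns and run
    # authdns-update so the next normal deploy does not revert the live fix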
[18:45:22] it asks for my user [18:45:44] that's gone away upstream now hashar :) [18:45:58] (03CR) 10Hashar: "I can review :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484409 (owner: 10Jcrespo) [18:46:06] jynus: yes i can review stuff ^^ :) [18:46:12] (03PS1) 10BBlack: emergency m[12]-master changes [dns] - 10https://gerrit.wikimedia.org/r/484546 [18:46:24] (03CR) 10BBlack: [V: 03+2 C: 03+2] emergency m[12]-master changes [dns] - 10https://gerrit.wikimedia.org/r/484546 (owner: 10BBlack) [18:46:26] ok I did restart it too early and it got the old CNAME, should work now debmonitor [18:46:35] thanks bblack [18:46:47] that's just catching up the git repo so it doesn't revert the local changes [18:46:55] the downtime may have messed up my config [18:46:55] I'm not going to run authdns-update yet till we're a little more stable [18:47:23] oh, and review now works again [18:47:25] ah https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html "The JVM caches DNS name lookups" ... [18:47:44] but i would guess that JDBC driver has its own issue of some sort [18:47:47] * volans wished django would have logged the IP of Can't connect to MySQL server on 'm2-master.eqiad.wmnet' [18:47:50] the dns switchover was supposed to be a temporary fix, better than having nothing [18:48:02] (as happened in the past) [18:48:07] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 5 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/limn-edit-data],Exec[git_pull_analytics/limn-language-data],Exec[git_pull_analytics/limn-ee-data],Exec[git_pull_analytics/reportupdater-queries] [18:48:10] (03PS1) 10Hashar: Honor absolute paths in .dockerignore [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484547 (https://phabricator.wikimedia.org/T183546) [18:48:42] and that puppet alarm on stat1006 is related to gerrit as well I guess ;) [18:48:45] PROBLEM - IPMI Sensor Status on restbase1010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [18:48:49] so lets confirm potentially affected services [18:49:21] bacula etherpad librenms racktables rt [18:49:23] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:49:35] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [18:49:42] gerrit, debmonitor, otrs, iegreview, wikimania scholarship (from a quick grep) [18:49:54] that is matches for m2-master in puppet [18:49:54] debmonitor done fwiw [18:49:55] gerrit otrs debmonitor frimpressions iegreview scholarships [18:49:57] PROBLEM - IPMI Sensor Status on rdb1005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:50:19] PROBLEM - IPMI Sensor Status on elastic1033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [18:50:19] does anyone have otrs access and can confirm it works? [18:50:33] I don't anymore :( [18:50:41] tzatziki: ^ [18:50:48] anything to do for prometheus1003? 
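(On "the java connector does not seem to handle dns changes" and hashar's JVM-TTL link: the JVM-level knob is the networkaddress.cache.ttl security property — successful lookups are cached forever when a security manager is installed, otherwise for an implementation-defined default — though connections already open in a JDBC pool need a restart regardless. A sketch only; the file path and value vary by JDK packaging:)

    # cap positive DNS caching at 60s for the JVM
    echo 'networkaddress.cache.ttl=60' | \
        sudo tee -a /etc/java-8-openjdk/security/java.security
    sudo systemctl restart gerrit      # re-resolve m2-master under the new value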
cc godog, cdanis [18:50:51] PROBLEM - IPMI Sensor Status on relforge1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [18:50:52] frimpressions sounds like fundraising related, might want to check with them [18:51:10] hashar: I'll ping them [18:51:23] the app seems up [18:51:25] actually cwd ^^^ :) [18:51:35] I don't believe so volans, AIUI eqiad and codfw each have two redundant prometheui [18:51:37] Reedy: ? [18:51:42] but maybe something could be weird [18:51:43] tzatziki: Can you check OTRS is ok [18:51:43] may need to summarise what's going on heh [18:51:53] just in general? [18:52:00] Yeah [18:52:00] ok, since it seems to be working, I'm going to run authdns-update (which should get us back in normal sync, and leave the emergency DNS change persistent for now) [18:52:00] tzatziki: check for database errors [18:52:03] tzatziki: would you be able to confirm that OTRS is reachable and that you can act on it? Eg save a response or some action that requires a write? [18:52:04] OTRS seems to be up [18:52:08] okay [18:52:08] cdanis: yes it's coupled with prometheus1004, but I'm not sure what will happen after [18:52:13] tzatziki: try loading a message [18:52:18] if it works, everything is ok [18:52:27] I will check bacula [18:52:33] !log authdns-update for https://gerrit.wikimedia.org/r/c/operations/dns/+/484546 (make normal git stuff match manual changes already in place) [18:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:50] volans: .... yeah, good point. prometheus1003 will have some missing data. not sure if there's a good way to fix that [18:53:28] with graphite it was a mess to fix with redundancy... with prometheus I honestly don't know [18:53:30] jynus: seems ok [18:53:36] can write [18:53:37] (03PS1) 10Bstorm: toolforge: pin python3 requests to backports for jessie on proxies [puppet] - 10https://gerrit.wikimedia.org/r/484550 (https://phabricator.wikimedia.org/T213711) [18:53:40] AFAIK backfilling is not supported [18:53:40] tzatziki: thanks, that was very helpful [18:53:47] (that all went fine, effectively a no-op and still pointing at dbproxy100[67]) [18:53:48] np :) [18:54:02] bblack: thanks!
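The Gerrit restart earlier ties back to the JVM DNS caching behaviour linked above: by default the JVM can cache successful lookups for a very long time, so a CNAME flip is not picked up by a long-lived JDBC connection. A rough sketch of the usual mitigation, capping the cache TTL; the java.security path below is an example for Debian's OpenJDK packaging, not necessarily what the Gerrit host uses:

  # cap the JVM-wide positive DNS cache to 60s so CNAME moves are picked up without a restart
  echo 'networkaddress.cache.ttl=60' | sudo tee -a /etc/java-8-openjdk/security/java.security
  # a per-process alternative honoured by OpenJDK: java -Dsun.net.inetaddr.ttl=60 ...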
[18:54:27] so you can use normal workflow from here forward if you want to make further changes or reverts, assuming gerrit stays alive [18:54:29] tzatziki: if someone asks, we had a small issue with the otrs database, so some people may have experience a brief interruption [18:54:37] jynus: ah gotcha [18:54:37] bblack: it will be up [18:54:41] thanks for the heads up [18:54:47] bblack: database never crashed [18:54:51] only the proxy [18:55:07] and the current proxy is still down [18:55:51] PROBLEM - IPMI Sensor Status on db1103 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:55:55] volans: looks like backfill is intentionally unsupported in prometheus [18:56:29] cdanis: volans, in theory the redundancy will take care of that [18:56:35] not transparently, though [18:56:59] PROBLEM - IPMI Sensor Status on kubernetes1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [18:56:59] we will still have metrics on prom1004, yeah [18:57:00] other things affected, cloudservices1004 [18:57:03] it might if grafana is smart :D [18:57:14] *it might be transparent [18:57:17] cp1008 [18:57:26] elastic1030/1 [18:57:47] RECOVERY - Host dbproxy1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:57:49] volans: yeah looks like there's no support for that :) [18:57:52] although I will suppose that except cloudservices, those will be probably behind a proxy [18:57:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Tomorrow (Jan 15) we have a meeting with some SRE folks to revisit this. We've got the cloud-analytics Hadoop... [18:58:00] cdanis: that was the original design yeah :) [18:58:17] it might be interesting to add grafana datasources that hit each prometheus individually, for use by hand in such situations [18:58:41] RECOVERY - Host dbproxy1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [18:58:51] RECOVERY - Host dbproxy1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:58:53] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.26 seconds [18:59:05] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.58 seconds [18:59:09] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.54 seconds [18:59:21] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.06 seconds [18:59:23] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.71 seconds [18:59:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Some links: - https://prestodb.io/docs/current/security/ldap.html - https://prestodb.io/docs/current/connector... [18:59:25] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.25 seconds [18:59:27] cp1008 isn't important FWIW [18:59:29] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.44 seconds [18:59:34] bblack: cool! 
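Since Prometheus has no supported backfill, the gap on prometheus1003 stays, but the same series remain on its redundant pair. One way to compare the replicas by hand, along the lines of the per-instance datasource idea above, is to hit each instance's HTTP query API directly; the hostnames, port and query here are placeholders only (multiple Prometheus instances run per host, on non-default ports):

  for host in prometheus1003 prometheus1004; do   # example hosts from the discussion above
      echo "== ${host} =="
      curl -s "http://${host}:9090/api/v1/query?query=up" | head -c 300; echo
  done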
[18:59:41] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.49 seconds [18:59:44] I don't know about cloudservices1004 [19:00:00] (03CR) 10Smalyshev: [C: 03+1] Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) (owner: 10EBernhardson) [19:00:03] RECOVERY - Host dbproxy1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T1900) [19:00:08] bd808: andrewbogott: bstorm_ ^ [19:00:16] is it worrying? [19:00:27] jynus: it's secondary luckily, I let them know [19:00:32] (03CR) 10BryanDavis: [C: 03+1] "Fixes a timebomb left for us by a manual `pip install requests` that was done sometime around 2015-09-30 on tools-proxy-02 and tools-proxy" [puppet] - 10https://gerrit.wikimedia.org/r/484550 (https://phabricator.wikimedia.org/T213711) (owner: 10Bstorm) [19:00:34] they are investigating why it didn't page atm but no outage [19:00:51] PROBLEM - IPMI Sensor Status on elastic1035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [19:00:56] chasemp: that is good news [19:01:04] probably dependency on host [19:01:13] if host is down, no services alerts [19:01:14] (03CR) 10GTirloni: [C: 03+1] toolforge: pin python3 requests to backports for jessie on proxies [puppet] - 10https://gerrit.wikimedia.org/r/484550 (https://phabricator.wikimedia.org/T213711) (owner: 10Bstorm) [19:01:16] (03CR) 10Bstorm: [C: 03+2] toolforge: pin python3 requests to backports for jessie on proxies [puppet] - 10https://gerrit.wikimedia.org/r/484550 (https://phabricator.wikimedia.org/T213711) (owner: 10Bstorm) [19:01:39] (03CR) 10EBernhardson: [C: 04-2] "needs d395471 deployed first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) (owner: 10EBernhardson) [19:02:26] (03CR) 10Paladox: [C: 03+1] "LGTM" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 (owner: 10Thcipriani) [19:03:20] no error on bacula logs, but it may show up later? [19:03:33] PROBLEM - IPMI Sensor Status on restbase1011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [19:05:32] Cloudservices1004 is one of an HA pair so we're ok without it for a bit. I'm very far AFK [19:05:49] PROBLEM - IPMI Sensor Status on elastic1034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [19:06:08] Gerrit 2.16 could mean it doesn't rely on the mysql/mariadb backend anymore so Gerrit would not go down for dbproxy going down anymore [19:06:19] ^^ [19:07:15] I thought that happened already [19:07:21] PROBLEM - IPMI Sensor Status on elastic1032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] [19:07:37] however, there is a disadvantage - backups will not be as straightforward [19:08:23] afaict some things have already been moved to NoteDB but not all of them, so it still needed groups from mysql as paladox mentioned above.. but in a future version everything would move [19:08:42] i am not sure though if that is really recommended as well for "large installs" [19:08:52] jynus: some things have been moved to NoteDB, namely changes.
Groups are done in 2.16. [19:09:03] i have this task https://phabricator.wikimedia.org/T211139 [19:09:09] ah, and which version are we in? [19:09:19] 2.15.6 [19:09:34] just reading scrollback, which service might have affected frimpressions? dns switchover? [19:09:51] * cwd pretends to know what frimpressions is [19:10:03] PROBLEM - IPMI Sensor Status on dbproxy1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] [19:10:23] any plans on upgrading our gerrit install soon paladox / mutante ? [19:10:47] Hauskatze i know that we will be branching 2.16 soon, but as to upgrading im not sure. [19:11:06] cwd: eh.. does FR store impressions in something that as a mysql backend? maybe that [19:11:28] we are upgrading to 2.15.8 soon https://gerrit.wikimedia.org/r/484437 [19:11:48] mutante: yes indeed, imported from kafka [19:11:50] cwd: yes frimpressions, we changed the CNAME of the m1/m2 masters in eqiad due to a power issue to the proxies [19:12:19] PROBLEM - IPMI Sensor Status on ganeti1007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [19:12:19] so, depending on the way the connection pool is handled it might need a reload/restart to pick the new CNAME record hence IP [19:12:30] s/pool// [19:13:04] would the power issue have caused the service to flap at odd hours? [19:13:20] Hauskatze: i think a major thing to consider is that it would change the UI for quite a few users [19:13:34] cwd: I don't think so the proxy host went down a bit ago, let me get the right time [19:14:01] yes indeed, PolyGerrit would be the default UI [19:14:07] ok nm, i have been seeing weird alerts about that service i haven't been able to get to the bottom of yet [19:14:09] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:14:12] for quite a while [19:14:13] cwd: 18:24 UTC ~45m ago [19:14:18] as long as the editting thing is fully implemented, I'd have no qualms [19:14:23] volans: thanks, checking... [19:15:15] mutante the old ui has been removed upstream (so they will be only fixing bugs at this poin't i guess) [19:15:16] cwd: and the CNAME was moved at ~18:40 [19:15:25] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:15:26] PolyGerrit is even more complete from 2.16 [19:15:36] volans: which cname is it? [19:15:39] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:16:04] cwd: m1-master and m2-master eqiad [19:16:29] ty [19:20:00] gehel: anything that needs to be done now or later when elastic103[01] will be back online? 
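The checks done by hand above, confirming where the moved CNAMEs now point and which services are wired to them in puppet, could look roughly like this (the grep assumes a checkout of operations/puppet; paths are illustrative):

  dig +short CNAME m1-master.eqiad.wmnet
  dig +short CNAME m2-master.eqiad.wmnet          # should now resolve via the surviving dbproxy
  grep -rl 'm2-master' hieradata/ modules/ | sort   # candidate services to re-check or restart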
[19:20:37] (03PS1) 10Dduvall: group0 to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484556 [19:20:54] ^ just prepping fyi [19:21:28] volans: Nope, they should just reconnect [19:21:38] ok, good to know, thanks :) [19:21:44] cwd: let us know what you find ;) [19:28:50] !log depooling wdqs1008 and wdqs2004 for DB copying for T213854 [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:54] T213854: Reload database on wdq2[456] from another server - https://phabricator.wikimedia.org/T213854 [19:32:58] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Volans) FYI the host is currently down due to a partial power issue in that rack. [19:36:51] !log started copying wdqs1008->wdqs2004 for T213854 [19:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:54] T213854: Reload database on wdq2[456] from another server - https://phabricator.wikimedia.org/T213854 [19:40:58] 10Operations, 10Recommendation-API, 10Release-Engineering-Team, 10Research, and 2 others: Recommendation API improvements - https://phabricator.wikimedia.org/T213222 (10leila) [19:41:13] PROBLEM - IPMI Sensor Status on dbproxy1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [19:41:40] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10MSantos) >>! In T211881#4878426, @akosiaris wrote: >> @akosiaris, the logic in PROBLEM - IPMI Sensor Status on dbproxy1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [19:46:35] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [19:46:36] ACKNOWLEDGEMENT - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T213856 [19:46:41] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10ops-monitoring-bot) [19:46:45] 10Operations, 10ops-eqiad, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 (10jcrespo) es1019 is back up and mgmt is working. Not starting mysql though, until chris confirms everthing is done and ok (and he is understandingly b... 
[19:47:35] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 154.28, 102.30, 62.82 [19:47:39] PROBLEM - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error [19:58:07] ACKNOWLEDGEMENT - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error daniel_zahn https://phabricator.wikimedia.org/T213856 [19:58:07] ACKNOWLEDGEMENT - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 124.33, 130.73, 98.22 daniel_zahn https://phabricator.wikimedia.org/T213856 [19:58:30] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Pchelolo) Once we get it we would need to update Change-Prop, JQ-Change-Prop, EventBus-service, event streams to use the new DNS record. [19:58:47] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10Dzahn) [19:58:54] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) p:05Triage→03High [19:59:27] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10Dzahn) server is unusable, ssh to it results in: ` -bash: /usr/bin/tput: Input/output error -bash: cannot create temp file for here-document: Read-only file system -bash: %6+1: syntax error:... [20:00:04] marxarelli: Dear deployers, time to do the MediaWiki train - Americas version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190115T2000). [20:00:21] volans: i'm pretty sure frimpressions is cruft, will confirm and make a task to remove it if so [20:01:00] we do store impression in mysql but it's on a fundraising server [20:01:33] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [20:01:39] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) [20:02:45] PROBLEM - IPMI Sensor Status on graphite1003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] [20:02:57] ACKNOWLEDGEMENT - swift-container-updater on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater daniel_zahn https://phabricator.wikimedia.org/T213856 [20:04:01] cwd: ack, so no impact I guess :) [20:04:29] the power supply alert on graphite1003 is not new. it was just disabled notifications before for other reasons and i removed that [20:04:41] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 19.00, 55.77, 76.80 [20:05:14] so that we see the real picture which are PSU alerts [20:09:55] marostegui: does https://phabricator.wikimedia.org/T213858 block the train deployment? 
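For the degraded RAID and I/O errors on ms-be1016 above, the generic first look on a software-RAID host would be something like the following; this is only the usual sequence, not what was actually run (the box was barely reachable), and /dev/md0 is a placeholder device name:

  cat /proc/mdstat                          # which array is degraded and which member dropped out
  sudo mdadm --detail /dev/md0              # placeholder md device
  dmesg | grep -iE 'sdb|I/O error' | tail   # the alerts above point at /srv/swift-storage/sdb3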
[20:10:22] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) I would like to aim for Thursday at 7AM UTC [20:11:57] 10Operations, 10Wikimedia-Mailing-lists: lists.wikimedia.org reporting "You must GET the form before submitting it" for all list subscription attempts - https://phabricator.wikimedia.org/T185222 (10Framawiki) [20:14:25] (03PS3) 10EBernhardson: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T213106) [20:17:08] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) p:05Triage→03High [20:19:02] (03Abandoned) 10Hashar: Update plugins reviewers-by-blame to stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/482775 (https://phabricator.wikimedia.org/T101131) (owner: 10Hashar) [20:19:08] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:20:09] marostegui: k. i see it's scheduled for thursday. i'll roll group0 to 1.33.0-wmf.13 then [20:20:19] (03CR) 10Hashar: [C: 03+1] "Tag looks good." [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 (owner: 10Thcipriani) [20:20:30] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:20:49] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10akosiaris) >>! In T213561#4881255, @Joe wrote: > Might I suggest that you use a SRV dns record instead? It's more appropriate for enumeratin... [20:21:32] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:21:58] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @jcrespo I have created the usual failover checklist [20:23:44] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) Kafka doesn't support SRV. Hence my Round Robin DNS patch. After more discussion with @bblack, I think I've decided to abandon t... [20:24:40] (03PS2) 10Gehel: wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484521 (https://phabricator.wikimedia.org/T213234) [20:27:48] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Broken elasticsearch-prometheus-exporter service on logstash nodes after reboot - https://phabricator.wikimedia.org/T210597 (10Gehel) I can confirm that `prometheus-elasticsearch-exporter.service` is masked on logstash nodes. I have not reboot... 
[20:28:03] (03CR) 10Gehel: [C: 03+2] wdqs: prometheus-blazegraph-exporter supports multi instances [puppet] - 10https://gerrit.wikimedia.org/r/484521 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:28:07] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:28:27] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.26 seconds [20:28:41] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.36 seconds [20:28:45] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.65 seconds [20:28:52] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:28:55] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.97 seconds [20:29:01] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.61 seconds [20:29:09] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.50 seconds [20:29:48] ^ known maintenance work. it has been pointed out to me it is also listed on deployment calendar [20:30:09] !log dduvall@deploy1001 Pruned MediaWiki: 1.33.0-wmf.6 (duration: 09m 15s) [20:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:52] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Marostegui) [20:31:05] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.02 seconds [20:31:05] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.07 seconds [20:32:29] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
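The s6 lag alerts above are expected fallout of the scheduled maintenance; on any of the replicas named (db2060 is just an example) the replication threads and lag can be spot-checked directly, keeping in mind the Icinga check uses its own lag heuristic so numbers may differ slightly:

  # run on the lagging replica itself
  sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'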
[20:32:55] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 370.33 seconds [20:33:17] 10Operations, 10DBA: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) p:05Triage→03High [20:33:46] !log dduvall@deploy1001 Pruned MediaWiki: 1.33.0-wmf.8 (duration: 03m 04s) [20:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:05] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 53.53 seconds [20:35:25] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 8.94 seconds [20:35:41] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:35:45] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [20:35:51] (03PS1) 10Gehel: wdqs: fixed typo in prometheus-blazegraph-exporter systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/484565 (https://phabricator.wikimedia.org/T213234) [20:35:57] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.18 seconds [20:36:01] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [20:36:05] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:36:23] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:36:35] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.59 seconds [20:36:42] (03CR) 10Gehel: [C: 03+2] wdqs: fixed typo in prometheus-blazegraph-exporter systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/484565 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:36:59] !log dduvall@deploy1001 Started scap: testwiki to php-1.33.0-wmf.13 and rebuild l10n cache [20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:25] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:38:21] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [20:38:34] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:40:27] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:41:21] 10Operations, 10DBA: Failover dbproxy1003 to dbproxy1008 - https://phabricator.wikimedia.org/T213865 (10Marostegui) This needs to happen before Thursday 17th [20:49:23] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Gehel) For `elastic103[0-5]`, we should be fine just shutting them down. The theory is that we should be able to loose a full row and not worry too much about it. That being said, 6 serve... [20:50:47] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [20:52:35] (03CR) 10Hashar: "I had the issue with a .dockerignore containing /cache. 
Somewhere there was a broken symbolic link which caused the files copying to rais" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484547 (https://phabricator.wikimedia.org/T183546) (owner: 10Hashar) [20:52:37] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational [20:55:32] 10Operations, 10ops-eqiad: eqiad: rack a2 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:55:55] (03PS1) 10Thcipriani: Gerrit 2.15.8 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484567 (https://phabricator.wikimedia.org/T210785) [20:56:15] 10Operations, 10ops-eqiad: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:58:35] (03CR) 10Paladox: [C: 03+1] "LGTM :)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484567 (https://phabricator.wikimedia.org/T210785) (owner: 10Thcipriani) [20:58:48] 10Operations, 10ops-eqiad: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [20:59:01] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10CDanis) Dzahn, can you assign a priority for this ticket? Is 'normal' appropriate for Swift backend hosts? [21:00:57] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 63.39, 28.96, 19.45 [21:01:34] ^ likely due to scap-cdb-rebuild [21:01:57] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 10.03 seconds [21:02:01] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 5.48 seconds [21:02:11] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 33.10, 28.13, 19.96 [21:02:17] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:02:19] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:02:24] (03PS3) 10Gehel: Make cron endpoint configurable [puppet] - 10https://gerrit.wikimedia.org/r/484348 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:02:27] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [21:02:27] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [21:02:31] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:02:39] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:02:45] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:05:06] 10Operations, 10ops-eqiad: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [21:05:46] 10Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T213856 (10Dzahn) p:05Triage→03High That's a good question. It looks like ms-be 17, 18 and 19 can still handle the load but given that it's not just a regular degraded RAID but the whole server is r... 
[21:06:07] (03CR) 10Gehel: [C: 03+2] Make cron endpoint configurable [puppet] - 10https://gerrit.wikimedia.org/r/484348 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:08:06] 10Operations, 10ops-eqiad: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10Gehel) relforge1001 can also be cleanly shutdown and restarted. It will crash the relforge cluster, but that cluster is not expected to be highly available. I'll warn the search platform te... [21:09:39] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected [21:09:41] !log dduvall@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.13 and rebuild l10n cache (duration: 32m 42s) [21:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:11] (03PS3) 10Dzahn: contint: delete unused doc.wikimedia.org site config [puppet] - 10https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) [21:10:30] (03CR) 10Dduvall: [C: 03+2] group0 to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484556 (owner: 10Dduvall) [21:10:37] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.12 seconds [21:11:37] (03Merged) 10jenkins-bot: group0 to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484556 (owner: 10Dduvall) [21:13:36] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.13 [21:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:37] (03CR) 10Dzahn: [C: 03+2] contint: delete unused doc.wikimedia.org site config [puppet] - 10https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [21:15:45] (03CR) 10Gehel: Switch category endpoint config to 9990 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:19:01] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 58.62 seconds [21:19:03] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 56.61 seconds [21:19:05] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 56.38 seconds [21:19:37] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 48.69 seconds [21:19:39] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 46.57 seconds [21:19:53] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 44.92 seconds [21:19:57] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 43.12 seconds [21:20:01] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 41.44 seconds [21:20:50] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10ayounsi) Told Zayo about Equinix's test/cleanup of the X-connect. Then received a "Dispatch Charge Notification" without approval request, and the folloing Zayo updates: > "Good afternoon, we receiv... 
[21:21:46] !log doc.wikimedia.org httpd config has been removed from contint1001, is now on doc1001 [21:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:17] (03CR) 10jenkins-bot: group0 to 1.33.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484556 (owner: 10Dduvall) [21:23:38] !log contint1001 rmdir /srv/org/wikimedia/integration/coverage ; rmdir /srv/org/wikimedia/integration/logs (T137890) [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:41] T137890: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 [21:24:14] (03CR) 10Dzahn: [C: 03+2] "i deleted the 2 empty dirs, i left the (full) doc dir as is" [puppet] - 10https://gerrit.wikimedia.org/r/484321 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [21:25:17] (03PS6) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [21:27:01] (03PS1) 10Effie Mouzeli: role::eqiad::scb: Switch rdb1006 to redis::misc::master [puppet] - 10https://gerrit.wikimedia.org/r/484572 (https://phabricator.wikimedia.org/T213859) [21:27:34] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Marostegui) @Anomie I will let you know if we need to pause your migration script - as the lag in codfw would make the failover harder. Once we have agreed on a date/time we will talk to you! [21:29:30] (03PS6) 10Gehel: Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:32:07] (03PS8) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [21:33:12] (03CR) 10Smalyshev: [C: 03+1] Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:35:31] (03PS7) 10Gehel: Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:36:32] (03CR) 10Gehel: [C: 03+2] Switch category endpoint config to 9990 [puppet] - 10https://gerrit.wikimedia.org/r/484345 (https://phabricator.wikimedia.org/T213212) (owner: 10Smalyshev) [21:39:29] !log depooling wdqs2005 for T213854 [21:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:34] T213854: Reload database on wdq2[456] from another server - https://phabricator.wikimedia.org/T213854 [21:40:14] 10Operations, 10DBA: s3 master emergency failover (db1075) - https://phabricator.wikimedia.org/T213858 (10Anomie) Thanks for letting me know about the failover. It will probably kill the script anyway when the old master goes away, or at least whichever s3 wiki it happens to be processing at the time. I won't... 
[21:49:30] !log re-activate BGP to Zayo on cr1-eqiad - T212791 [21:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:33] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [21:52:59] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.128 second response time [21:53:15] (03PS3) 10Dzahn: service: ensure parent dir exists before git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) [21:53:34] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10ayounsi) Zayo tech swapped their optic and so far no more errors. I re-enabled BGP. [21:54:31] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [21:55:25] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.13 seconds [21:57:07] (03CR) 10Dzahn: "> IMHO, it would be better to ensure this in service::deploy::gitclone" [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:57:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [21:58:01] (03PS1) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [21:59:02] (03CR) 10Jeena Huneidi: [C: 03+1] Merge tag 'v2.15.8' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 (owner: 10Thcipriani) [21:59:14] (03CR) 10Hashar: "Better seen ignoring whitespace changes: git show -w" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [21:59:37] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: rack a3 pdu swap / failure / replacement - https://phabricator.wikimedia.org/T213859 (10RobH) [22:02:11] (03PS1) 10Dzahn: testreduce: use component/node10 instead of stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/484579 (https://phabricator.wikimedia.org/T201366) [22:03:25] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time [22:03:57] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.05 seconds [22:04:01] (03CR) 10Dzahn: [C: 03+2] "alright, keep in mind i did this after being told node is now available in stretch-backports finally." 
[puppet] - 10https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [22:04:05] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.53 seconds [22:04:32] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.54 seconds [22:04:33] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.83 seconds [22:04:37] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.58 seconds [22:04:51] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.22 seconds [22:04:57] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.20 seconds [22:05:09] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.47 seconds [22:14:26] (03PS1) 10Volans: decorators: make retry() DRY-RUN aware [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 [22:16:02] (03CR) 10Volans: "This is a proposal to make @retry DRY-RUN aware without adding boilerplate." [software/spicerack] - 10https://gerrit.wikimedia.org/r/484582 (owner: 10Volans) [22:17:09] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.06 seconds [22:19:35] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [22:21:59] (03PS5) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 [22:22:41] (03CR) 10jerkins-bot: [V: 04-1] geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [22:30:23] (03PS6) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string [puppet] - 10https://gerrit.wikimedia.org/r/483222 [22:32:01] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:07] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [22:34:49] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:51] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.223: Connection reset by peer [22:36:59] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:36] !log repooled wdqs1008 [22:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:22] (03CR) 10Hashar: [C: 03+2] Merge tag 'v2.15.8' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 (owner: 10Thcipriani) [22:41:29] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:29] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:31] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:41:31] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:31] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:41:39] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:39] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 
[22:41:41] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:41] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:41:45] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:01] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:05] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:05] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:05] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:07] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:07] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:07] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:07] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:07] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:11] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:15] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:17] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:18] (03CR) 10Hashar: [V: 03+2 C: 03+2] "There is no CI job for this repository, that is unacceptable :-) More seriously, seems it is all about crafting a container that gets ba" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484437 (owner: 10Thcipriani) [22:42:19] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:19] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:21] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:21] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:23] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:25] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:27] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:29] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:29] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:29] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:37] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:39] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1088_v4, cp1088_v6 [22:42:41] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1088_v4, cp1088_v6 [22:42:44] that's all a single host that has been taken down [22:42:45] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - 
ok: 38 connecting: cp1088_v4, cp1088_v6 [22:43:01] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [22:46:32] (03PS1) 10Catrope: Enable help panel in development mode in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484585 [22:46:44] (03CR) 10Catrope: [C: 03+2] Enable help panel in development mode in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484585 (owner: 10Catrope) [22:48:17] (03Merged) 10jenkins-bot: Enable help panel in development mode in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484585 (owner: 10Catrope) [22:48:30] (03CR) 10Dzahn: geoip::maxmind: add data types, rm deprecated validate_string (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [22:49:55] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10ayounsi) 05Open→03Resolved [22:50:12] (03PS1) 10Cwhite: scb: enable statsd_exporter and add matching rules [puppet] - 10https://gerrit.wikimedia.org/r/484586 (https://phabricator.wikimedia.org/T205870) [22:50:27] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/WikibaseMediaInfo/resources/filepage/CaptionsPanel.js: Hot-deploy Ibb1f763f to unbreak setting captions on WikibaseMediaInfo (duration: 00m 51s) [22:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:57] RoanKattouw: I've pulled your Beta config patch to prod staging so people don't complain. ;-) [22:51:18] (03CR) 10Hashar: "Have you pushed them to Archiva ? I am not familiar with git-fat but with this change (or the tip of the branch) I get:" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484567 (https://phabricator.wikimedia.org/T210785) (owner: 10Thcipriani) [22:52:17] James_F: what broke? :O [22:53:05] (03CR) 10jenkins-bot: Enable help panel in development mode in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484585 (owner: 10Catrope) [22:53:06] addshore: JS fatal on trying to edit captions. Not caused by you, I merged a patch without testing in some circumstances. However, this is what test-commons is for, after all. :-) [22:53:17] :D [22:53:49] !log removing one file for legal compliance [22:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:22] (03CR) 10Thcipriani: "> Have you pushed them to Archiva ? 
I am not familiar with git-fat" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484567 (https://phabricator.wikimedia.org/T210785) (owner: 10Thcipriani) [23:02:55] (03CR) 10Mobrovac: [C: 03+1] service: ensure parent dir exists before git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [23:05:56] (03CR) 10Hashar: [V: 03+2 C: 03+2] "Well mine is broken somehow :)" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484567 (https://phabricator.wikimedia.org/T210785) (owner: 10Thcipriani) [23:10:14] (03PS1) 10Thcipriani: Fix deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 [23:13:08] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14348/" [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [23:13:20] (03PS4) 10Dzahn: service: ensure parent dir exists before git cloning [puppet] - 10https://gerrit.wikimedia.org/r/484343 (https://phabricator.wikimedia.org/T201366) [23:19:34] (03CR) 10Paladox: [C: 03+1] Fix deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani) [23:21:10] mobrovac: thanks for review.. merged. it's all cool except surprised by " remote error: parsoid/deploy unavailable" heh [23:21:32] hmmm [23:21:39] that's weird [23:21:52] https://gerrit.wikimedia.org/r/parsoid/deploy [23:21:54] 404 [23:22:08] that's always 404 :) [23:22:27] the command-line used by puppet [23:22:27] /usr/bin/git clone --recurse-submodules https://gerrit.wikimedia.org/r/parsoid/deploy /srv/deployment/parsoid/deploy [23:22:32] it's using the wrong project [23:22:33] https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/parsoid/deploy [23:22:35] mutante: it's mw/services/parsoid/deploy [23:23:07] hmm.. the code was $dir = "/srv/deployment/${title}/deploy" before i touched it [23:23:25] thanks. looking [23:24:46] that's rather weird indeed [23:25:02] but this is also not it https://gerrit.wikimedia.org/r/mediawiki/services/parsoid/deploy [23:25:13] because the git::clone hasn't changed really [23:25:15] and the "admin" part in paladox' URL doesn't seem to be what i should be using either? [23:25:23] nope [23:25:33] git clone https://gerrit.wikimedia.org/r/mediawiki/services/parsoid/deploy [23:26:07] paladox: yea.. but i also get Not Found in the browser, is that normal? [23:26:08] mutante yup, that's normal [23:26:21] that's a git url not a web ui :) [23:26:25] that's anonymous http clone, so it's normal [23:27:11] i did not change the variable $repository ... [23:27:12] 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10Pchelolo) We did see the driver trying to connect decommissioned nodes for quite a while after they have left the cluster during the... [23:27:15] ok @ normal part [23:27:56] mutante mobrovac https://github.com/wikimedia/puppet/blob/c259862ab317c95115de418b9093676599e86a97/hieradata/role/common/deployment_server.yaml#L142 [23:28:00] maybe it's that?
[23:28:07] that should be mediawiki/services/parsoid/deploy [23:28:25] repository: mediawiki/services/parsoid/deploy [23:31:49] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:32:23] PROBLEM - High lag on wdqs2004 is CRITICAL: 1.43e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:32:51] service::deploy::gitclone .. repository => $repo .. $repo = "${title}/deploy", (service::node) ... service::node { 'parsoid': [23:32:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) I think there is a distinction to make here when saying "prod", as it's made of several vlans/networks, especial... [23:33:05] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational [23:35:28] ACKNOWLEDGEMENT - High lag on wdqs2004 is CRITICAL: 1.442e+04 ge 3600 Stas Malychev Reloading data, will catch up soon https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:36:21] paladox: those are scap::sources though while we use the git deployment method? [23:37:00] hmm, but dosen't scap copy what ever is on the deployment host? [23:40:48] service::deploy::gitclone uses $repository and calls git::clone using $repository as $title. service::node has a parameter $deployment and if that is set to "git" then it uses service::deploy::gitclone with $title [23:41:41] "repository => $repo" the variable $repo is now used in service::node [23:42:06] that is set to "${title}/deploy" [23:42:46] finally there are various profiles that use service::node with the service name as $title, like: [23:43:02] service::node { 'parsoid' [23:43:19] question is.. where in all this would it ever know that "mediawiki/services? has to be prefixed.. [23:43:40] it just uses $title..which is "parsoid" [23:44:35] PROBLEM - High lag on wdqs2005 is CRITICAL: 1.441e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:45:49] the Hiera values you linked to dont seem to be used unless $deployment is set to scap [23:46:15] mutante: for scap, scap::sources tells it that, but for service::deploy::gitclone we'll need to add it i guess [23:46:25] i'm wondering how this was working previously [23:47:20] yea, that's the main thing i am wondering too. i noticed on ruthenium there is /srv/deployment/parsoid/deploy-old as well.. heh [23:47:57] (03CR) 10Jeena Huneidi: [C: 03+1] Fix deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/484587 (owner: 10Thcipriani) [23:50:35] can just start to use "origin =>" with git::clone too [23:51:13] $default_url_format = $source ? { [23:51:18] 'gerrit' => 'https://gerrit.wikimedia.org/r/%s', [23:54:08] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 [23:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:11] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 [23:57:40] mutante: yeah, either that or use mediawiki/services/parsoid as the git::clone title
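To tie off the parsoid/deploy thread: the URL puppet constructs from "${title}/deploy" lacks the mediawiki/services/ prefix, which is why the clone fails while the fully-qualified project works. Roughly (target paths here are examples):

  # what puppet effectively ran (fails: project not found / 404):
  git clone --recurse-submodules https://gerrit.wikimedia.org/r/parsoid/deploy /srv/deployment/parsoid/deploy
  # what works (the real Gerrit project):
  git clone https://gerrit.wikimedia.org/r/mediawiki/services/parsoid/deploy /tmp/parsoid-deploy

Hence the two options discussed at the end: pass the full repository path through to git::clone (e.g. via origin), or use mediawiki/services/parsoid as the git::clone title.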