[00:08:23] !log twentyafterfour@deploy1001 Finished scap: syncing 1.33.0-wmf.1 refs T206655 (duration: 36m 58s)
[00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:26] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655
[00:08:27] (PS6) Dzahn: icinga: add puppet types for parameters [puppet] - https://gerrit.wikimedia.org/r/468468
[00:10:03] (CR) Dzahn: [C: 2] "noop on all 3 servers: https://puppet-compiler.wmflabs.org/compiler1002/13172/" [puppet] - https://gerrit.wikimedia.org/r/468468 (owner: Dzahn)
[00:11:15] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (tstarling) I logged a deletion on en.wikipedia.org using X-Wikimedia-Debug, you can see it in mwlog1001.eqiad.wmnet:/srv/mw-log/XWikimediaDebug.log . You...
[00:11:16] (PS1) 20after4: group0 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469352
[00:11:18] (CR) 20after4: [C: 2] group0 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469352 (owner: 20after4)
[00:12:32] (Merged) jenkins-bot: group0 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469352 (owner: 20after4)
[00:13:33] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 240 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops
[00:14:50] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (BPirkle) Thank you. I came to the same conclusion a bit ago, and am now pondering what to do about it. The offending code is: ``` $archivedRevision...
[00:16:05] (CR) Alexandros Kosiaris: [C: -1] "With the current coding of osm::planet_sync it is not possible to go to a resolution that is higher than a day. That puppet define will n" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) (owner: MSantos)
[00:17:44] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (tstarling) There's no index on ar_page_id, it needs to select by namespace and title
[00:17:45] (CR) Dzahn: [C: 2] "noop in prod as well" [puppet] - https://gerrit.wikimedia.org/r/468468 (owner: Dzahn)
[00:18:47] (CR) jenkins-bot: group0 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - https://gerrit.wikimedia.org/r/469352 (owner: 20after4)
[00:21:14] (PS4) Dzahn: icinga: allow configuring max_concurrent_checks via Hiera [puppet] - https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782)
[00:25:17] Operations, ops-eqiad, netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul)
[00:26:20] !log finished with mediawiki train for group0 refs T206655
[00:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:24] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655
[00:27:23] James_F: hi hi
[00:27:41] What does our config plan look like?
[00:27:47] Today or?
[00:28:05] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.1 refs T206655
[00:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:09] Operations, ops-eqiad, netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul) Went on site today to do my hand scan, get my access code and have an idea where things are. Got the router from shipping and racked it already.
[00:35:28] addshore: Today isn't a good idea. I want to go home (been at the office for >10 hours already), and the train just took 6 hours to deploy…
[00:35:33] (PS5) Dzahn: icinga: allow configuring max_concurrent_checks via Hiera [puppet] - https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782)
[00:35:57] addshore: Early (08:00 SF) tomorrow morning work for you?
[00:37:11] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13173/einsteinium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:41:11] (CR) Dzahn: [C: 2] "tegmen/einsteinium: only whitespace changes in icinga.cfg" [puppet] - https://gerrit.wikimedia.org/r/469253 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:43:23] (PS3) Dzahn: icinga: don't log service/host check retries [puppet] - https://gerrit.wikimedia.org/r/469317 (https://phabricator.wikimedia.org/T202782)
[00:45:11] Operations, ops-eqiad, netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (Papaul) @chris can you please update cr2-eqord Custom Fields in netbox when I am on site tomorrow I will put in the asset tag information. Or if you have the purchase task number you can just link it to t...
[00:45:32] (CR) Dzahn: "one of the few changes where i am actually also changing things on einsteinium / current prod" [puppet] - https://gerrit.wikimedia.org/r/469317 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:48:17] (CR) Dzahn: [C: 2] icinga: don't log service/host check retries [puppet] - https://gerrit.wikimedia.org/r/469317 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:49:54] Operations, MediaWiki-Page-deletion, MW-1.32-release, Patch-For-Review, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (Legoktm)
[00:52:04] (PS2) Dzahn: icinga: tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775)
[00:53:02] (CR) jerkins-bot: [V: -1] icinga: tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: Dzahn)
[00:53:43] (CR) Dzahn: "eh.. yea.. why?? "Unknown resource type: 'rsyslog::conf'"? it's used all over the place!"
[puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: Dzahn)
[01:03:27] (PS3) Dzahn: icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775)
[01:04:08] (CR) jerkins-bot: [V: -1] icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: Dzahn)
[01:06:35] (PS4) Dzahn: icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775)
[01:07:26] (CR) jerkins-bot: [V: -1] icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: Dzahn)
[01:08:49] (CR) Dzahn: "the jenkins-bot -1 doesn't seem to make sense. the repo is full of rsyslog::conf resources. since it is only affecting stretch anyways now" [puppet] - https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: Dzahn)
[01:10:38] PROBLEM - SSH on wdqs1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:11:08] PROBLEM - SSH on wdqs1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:11:19] PROBLEM - SSH on wdqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:11:19] PROBLEM - SSH on wdqs2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:02] eh..
[01:12:09] PROBLEM - SSH on wdqs1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:27] SMalyshev: are you around ^
[01:12:28] PROBLEM - SSH on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:39] PROBLEM - SSH on wdqs2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:59] PROBLEM - SSH on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:10] trying to SSH myself
[01:14:19] PROBLEM - SSH on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:14:48] PROBLEM - SSH on wdqs2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:15] Now those are really weird..
[01:15:20] ah, those are not the active ones appearing on https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1
[01:15:26] onimisionipe: oh hi :)
[01:15:28] PROBLEM - SSH on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:49] PROBLEM - SSH on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:04] i am finally getting a response from 1007.. kind of .. waiting
[01:16:13] trying mgmt for another one
[01:17:19] on console of 2005.. it shows a lot of output from wdqs-updater
[01:17:53] Getting my system up with speed...
[01:18:00] Caused by: java.time.format.DateTimeParseException:
[01:19:07] lots of exceptions.. coming from wdqs-updater
[01:20:26] Operations, LDAP-Access-Requests, Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (daniel) @addshore thanks for filing this!
[01:20:27] it looks like only the ones are affected that belong to "wdqs internal"
[01:20:33] not the other ones
[01:21:22] no, actually not true
[01:22:16] James_F: yes, 8am works for me
[01:23:20] i powercycled this one i was on.
could not login either way
[01:24:09] PROBLEM - Host wdqs2005 is DOWN: PING CRITICAL - Packet loss = 100%
[01:24:28] RECOVERY - Host wdqs2005 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[01:24:46] !log wdqs2005 - powercycled, wasn't reachable via SSH and also couldn't login on mgmt, mgmt full of java exceptions from wdqs-updater
[01:24:58] RECOVERY - SSH on wdqs2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[01:25:08] it continues with that problem right away
[01:25:49] mutante: Failed to log message to wiki. Somebody should check the error logs.
[01:25:51] what is this.. an automatic update process that isn't planned?
[01:26:23] mutante: seems powercycling works?
[01:26:39] onimisionipe: no, it just continues with the same exceptions
[01:27:09] or let's say "some exceptions" from wdqs-updater
[01:27:13] I mean SSH
[01:28:00] hmm. can you?
[01:28:09] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:38] PROBLEM - High lag on wdqs1005 is CRITICAL: 3608 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:28:39] PROBLEM - High lag on wdqs1003 is CRITICAL: 3606 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:28:46] mutante: into 2005
[01:28:48] yes
[01:28:49] PROBLEM - High lag on wdqs2004 is CRITICAL: 3601 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:03] onimisionipe: oh. ok!
good
[01:29:09] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 72447 bytes in 0.504 second response time
[01:29:09] PROBLEM - High lag on wdqs1008 is CRITICAL: 3640 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:09] PROBLEM - High lag on wdqs2001 is CRITICAL: 3623 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:18] PROBLEM - High lag on wdqs1006 is CRITICAL: 3647 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:19] PROBLEM - High lag on wdqs2002 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:28] PROBLEM - High lag on wdqs1009 is CRITICAL: 3656 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:28] PROBLEM - High lag on wdqs2005 is CRITICAL: 3637 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:29] PROBLEM - High lag on wdqs1004 is CRITICAL: 3658 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:30] PROBLEM - High lag on wdqs1010 is CRITICAL: 3663 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:38] PROBLEM - High lag on wdqs1007 is CRITICAL: 3670 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:29:48] PROBLEM - High lag on wdqs2006 is CRITICAL: 3658 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:31:04] onimisionipe: blazegraph restart maybe?
[01:31:10] can you enter any commands
[01:31:14] yes
[01:31:22] blazegraph already started..
[01:31:33] wdqs-updater is starting and failing
[01:32:41] you see that Java exception with the time.format reference: java.time.format.DateTimeParseException
[01:32:59] that's what the console log seemed to be full of
[01:33:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:33:36] arr
[01:34:13] the graph doesn't even seem to agree on that?
[01:34:30] Note, I'm in the middle of deploying a patch
[01:34:38] But I don't think that's related
[01:34:56] oh.. i bet it is
[01:35:04] i have seen it before during deploy
[01:35:45] ah ok. Scary when suddenly errors happen in the middle of deploying something
[01:36:00] en.wp seems fine to me
[01:36:28] My patch includes an i18n error message, so I'm running scap sync, as opposed to scap sync-file like I normally do
[01:36:41] This is a much longer process than I realized
[01:36:56] looking at 5xx.log on oxygen
[01:37:01] doesn't seem "fast"
[01:37:06] I think whatever is wrong at mediawiki is propagating to WDQS..
[01:37:12] !log deployed T207750
[01:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:20] scap sync just finished for me
[01:38:57] i mean, look at the grafana: https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:39:01] I don't really see any increase in fatals/exceptions in logstash, just a small bump in timeouts during the deploy that immediately went away when it was finished
[01:39:04] that does NOT look like "100% over 50" ??
[01:39:19] it actually looks low?
[01:39:37] yea..
grmbl @ grafana monitoring
[01:39:50] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:39:52] ...
[01:40:20] won't complain about that
[01:40:45] mutante: can you call Stas?
[01:41:06] but didn't that totally feel like "during deployment" (the mediawiki fatals)
[01:41:16] onimisionipe: checking office wiki..
[01:42:16] yes, on it
[01:44:19] onimisionipe: got a voice box, left a message
[01:45:29] mutante: Ok.
[01:46:55] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.26/includes/page/WikiPage.php: fix deletion performance regression T207530 (duration: 00m 55s)
[01:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:03] T207530: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530
[01:48:03] mutante: looks like some new update caused this..
[01:48:44] onimisionipe: the part above about performance regression on page delete?
[01:48:56] nope
[01:49:05] the wdqs-updater issue
[01:50:22] update of mediawiki? packages on this host didn't seem to be touched today
[01:51:50] I mean updates to wdqs-updater.
[01:52:19] ok
[01:52:30] I remember doing a deploy on monday. But there are new updates on the updater
[01:53:06] ah, *nod*, are there automatic updates without human intervention?
[01:53:13] nope
[01:53:34] someone deployed it. Maybe Stas
[01:53:37] ok
[01:53:38] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:53:53] eh.. and it's ongoing maybe?
[01:53:59] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:56:54] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.1/includes/page/WikiPage.php: T207530 (duration: 00m 53s)
[01:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:59] T207530: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530
[01:56:59] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad%2520prometheus%252Fops
[01:58:18] PROBLEM - SSH on wdqs2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:19] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:01:00] stepping back to let Matt attempt revert of the wdqs-updater update
[02:12:22] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@d4692ea]: Reverting update on wdqs1003 to fix wdqs-updater issue
[02:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:45] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@d4692ea]: Reverting update on wdqs1003 to fix wdqs-updater issue (duration: 00m 23s)
[02:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:39] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:14] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@d4692ea]: Reverting update on wdqs1003 to fix wdqs-updater issue
[02:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:18] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@d4692ea]: Reverting update on wdqs1003 to fix wdqs-updater issue (duration: 00m 03s)
[02:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:19] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:25:39] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:32:08] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:34:06] !log powercycled wdqs1009 - by request
[02:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:58] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:38] RECOVERY - Host wdqs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[02:36:57] onimisionipe: somehow that looks better now! :)
[02:36:59] RECOVERY - SSH on wdqs1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0)
[02:36:59] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational
[02:37:16] eh.. and it spits out the updater errors again though..
[02:37:36] Operations, New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (Prtksxna)
[02:38:18] Operations, New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (Prtksxna)
[02:40:31] Operations, New-Readers: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (Prtksxna) We understand that these URLs currently take one to https://es.wikipedia.org/wiki/Bienvenida (and correctly so). If neither of these is possible, let us know and we'll try to come up w...
[02:42:18] PROBLEM - High lag on wdqs1009 is CRITICAL: 8026 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:53:48] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:58:09] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected
[03:03:49] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected
[03:04:59] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:08:39] PROBLEM - SSH on wdqs1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:12:24] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Mathew.onipe)
[03:12:32] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Mathew.onipe) p:Triage>High
[03:18:09] scheduled downtime for these. determined users are not directly affected.
linked to ticket, left message, what we could do.. and out now
[03:28:48] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 834.70 seconds
[03:46:48] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational
[03:55:29] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational
[03:58:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 159.34 seconds
[04:00:09] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:10:09] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational
[04:14:58] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:26:19] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:39:21] (PS1) Catrope: Enable $wgWMEUnderstandingFirstDay on English beta labs too [mediawiki-config] - https://gerrit.wikimedia.org/r/469371
[04:44:19] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:56:38] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:10:31] marostegui: Is it OK to deploy cxserver? Seeing some CRITICAL notifications now..
[05:10:57] (not related, but just in case)
[05:11:33] kart_: I think so yes
[05:14:38] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:14:51] (CR) Catrope: [C: 2] Enable $wgWMEUnderstandingFirstDay on English beta labs too [mediawiki-config] - https://gerrit.wikimedia.org/r/469371 (owner: Catrope)
[05:15:57] (Merged) jenkins-bot: Enable $wgWMEUnderstandingFirstDay on English beta labs too [mediawiki-config] - https://gerrit.wikimedia.org/r/469371 (owner: Catrope)
[05:16:58] marostegui: thanks
[05:20:39] !log kartik@deploy1001 Started deploy [cxserver/deploy@80dc518]: Update cxserver to 9ad60d9 (T207445)
[05:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:43] T207445: CX2: Paragraph not added to the translation with MT failure message - https://phabricator.wikimedia.org/T207445
[05:24:46] !log kartik@deploy1001 Finished deploy [cxserver/deploy@80dc518]: Update cxserver to 9ad60d9 (T207445) (duration: 04m 06s)
[05:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:59] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:27:58] (CR) jenkins-bot: Enable $wgWMEUnderstandingFirstDay on English beta labs too [mediawiki-config] - https://gerrit.wikimedia.org/r/469371 (owner: Catrope)
[05:40:28] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database srwiki: database exists on query. Default database: srwiki. [Query snipped]
[05:44:48] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:48:46] I will fix that issue
[05:50:09] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:50:28] RECOVERY - MariaDB Slave SQL: s5 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:52:42] (PS1) Smalyshev: Disable Kafka update due to breakage [puppet] - https://gerrit.wikimedia.org/r/469378 (https://phabricator.wikimedia.org/T207817)
[05:53:05] onimisionipe: are you around?
[05:54:06] can anybody merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/469378 to restore wdqs updater to working state?
[05:54:53] I'm around
[05:55:17] But I can't merge
[05:55:58] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:56:18] SMalyshev: checking
[05:56:48] (CR) Mathew.onipe: [C: 1] Disable Kafka update due to breakage [puppet] - https://gerrit.wikimedia.org/r/469378 (https://phabricator.wikimedia.org/T207817) (owner: Smalyshev)
[05:57:33] SMalyshev: so this thing needs to be merged and puppet deployed on all the wdqs hosts?
[05:58:03] so diff is https://puppet-compiler.wmflabs.org/compiler1002/13175/wdqs1004.eqiad.wmnet/
[05:58:06] does it look good?
[05:58:08] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational
[05:58:12] I don't have a huge context
[05:59:05] elukey: yes, looks good. This switches wdqs updater from kafka back to RC API, so I can look into it without server getting more and more behind
[05:59:32] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13175/wdqs1004.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/469378 (https://phabricator.wikimedia.org/T207817) (owner: Smalyshev)
[05:59:50] SMalyshev: merging and running puppet then
[05:59:51] ack?
[06:00:52] onimisionipe: ?
[06:01:07] elukey: yep
[06:01:13] super
[06:01:14] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Smalyshev) p:High>Unbreak! This seems to be caused by this error: ``` Oct 24 00:28:30 wdqs1003...
[06:01:44] elukey: this seems to be generated by mediawiki, see T207817
[06:01:45] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817
[06:02:55] running puppet
[06:03:05] "domain": "www.mediawiki.org"
[06:03:16] yeah so only a few events have the weird timestamp
[06:04:02] SMalyshev: confirmed that mediawiki.org is in group 0
[06:04:06] it is the last mediawiki deployment
[06:04:23] so, should we put it as train blocker?
[06:04:30] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Smalyshev) The first error is at Oct 24 00:28:30 which matches Mediawiki deployment. Pinging @20after4...
[06:05:23] SMalyshev: it is probably a good idea, it might cause problems.. It would be great to find the change that triggered this
[06:06:05] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Smalyshev)
[06:08:07] another problem is that I can not reach most of the wdqs servers at all. onimisionipe - can you reach any of them by ssh?
[06:08:24] yeah same thing for me
[06:08:36] well, that's unfortunate
[06:08:57] could it be related to puppet running and triggering the WDQS updater?
[06:09:02] I can check from the console
[06:09:35] updater shouldn't mess with ssh...
[06:09:40] so something else is wrong
[06:09:49] sure, but it could raise network activity
[06:10:42] elukey: I can reach 2003, but looks like puppet didn't work there
[06:10:46] the config is old
[06:11:28] so this is weird, if I connect to the mgmt console (not even login as root) I can see a ton of wdqs-updater errors
[06:11:39] [4889352.412670] wdqs-updater[852]: 05:54:13.115 [main] WARN o.w.q.r.tool.change.JsonDeserializer - Data in topic eqiad.mediawiki.page-delete cannot be deserialized [{"comment": "Orphaned talk page", "database": "mediawikiwiki",
[06:11:43] etc..
[06:12:06] yeah that's the problem there... but it should stop once puppet switches it to RC API
[06:12:07] that keeps going
[06:12:31] that's bad - it should have put the service into failed state and not spam...
[06:13:16] so it seems to be going through page-delete from ~2018-10-24T00:39:17.788200+00:00
[06:14:39] elukey: which server is that?
[06:14:49] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:14:57] wdqs1004
[06:14:59] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Smalyshev) Another problem on this - for some reason, when Updater fails, systemd keeps restarting in a...
[06:15:16] I can log in to wdqs1003, I am sure that puppet ran there (it was the only successful host)
[06:15:18] elukey: yeah, can't even login there...
[06:15:26] elukey: yeah 1003 is fine
[06:15:40] but the rest seems to be still stuck, and no puppet and no ssh
[06:15:59] What is the current impact? Do we have any idea?
[06:16:05] we can reboot one at a time
[06:16:16] elukey: well, updater is broken on all hosts for now
[06:16:20] except 1003
[06:16:27] and I can't reach them
[06:16:38] sure I mean services running on those hosts, other than the updater
[06:16:52] query service itself
[06:16:55] let me check
[06:17:13] queries seem to be working fine
[06:17:20] good
[06:17:31] and the load looks fine too. so I wonder why ssh/puppet don't work?
[06:17:41] logged in to wdqs1005, same thing, can't get into root login there
[06:18:06] elukey: can it be somehow related to journald?
[06:19:05] SMalyshev: no idea, but the only thing that I can think of would be to reboot one and test
[06:19:28] elukey: 2003 seems to be responsive but puppet didn't run... can you check what's up there?
[06:19:52] SMalyshev, elukey I was able to login into 1003 and 2005
[06:20:07] 2005 because it was power cycled
[06:20:28] just ran puppet on 2003
[06:20:34] And also we powercycled wdqs1009 before we could login
[06:20:35] ok, looks good now
[06:20:51] I suggest restarting if we can't login. Via ssh
[06:20:53] Ok
[06:20:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:21:02] onimisionipe: what do you mean? Did you powercycle any host?
[06:21:36] onimisionipe: did you restore updater to latest version after reverts?
[06:22:07] onimisionipe: if not, it might be a good idea to do it now, since otherwise old updater will bring in buggy data
[06:24:06] let's sync in here before taking actions please
[06:24:06] I'll redeploy on 1003
[06:24:25] ok
[06:24:50] elukey: since 1003 is now working fine, I want to restore latest updater version there
[06:24:59] otherwise updates for lexemes will be buggy
[06:25:36] any objections?
[06:27:09] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:27:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:27:40] SMalyshev: can we wait a sec? [06:27:52] If you are ok I'd powercycle 1004 to see if it comes up fine [06:28:06] sure, though if it takes extended time I'd rather stop the updater [06:28:17] yeah let's stop it for a second [06:28:23] SMalyshev: the update revert didn't work [06:28:24] but couple of minutes is ok [06:28:44] onimisionipe: by didn't work do you mean it didn't install the revert or it didn't solve the problem? [06:28:54] onimisionipe: please sync with us before taking any action [06:29:10] <_joe_> you should stop the updater everywhere, ensure it won't start [06:29:15] <_joe_> let's do this first [06:29:19] elukey: ok.. [06:29:21] elukey: waiting for your ok [06:29:27] SMalyshev: it didn't revert [06:29:29] <_joe_> then ensure all machines are reachable [06:29:44] onimisionipe: ok then, it's good then [06:29:51] SMalyshev: you can stop the updater on the reachable machines [06:29:59] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf] [06:30:06] _joe_: well, given that I can't ssh to most of them, I can't also stop anything :) [06:30:18] <_joe_> ok, so let's stop a sec [06:30:26] <_joe_> which machines are unreachable via ssh? [06:30:48] <_joe_> I can test it myself [06:30:58] most of the wdqs, I managed to ssh and run puppet only on 1003 and 2005 [06:31:00] <_joe_> what's the name of the damn updater service? 
[06:31:04] _joe_: most of wdqs machines except 1003 and 2003 [06:31:11] wdqs-updater _joe_ [06:31:14] _joe_: wdqs-updater [06:31:28] sorry 2003 and 1003 [06:31:35] elukey: ok I stopped updater on 1003 and 2003 [06:31:44] rest of the hosts are ignoring me [06:31:49] <_joe_> (12) wdqs[2001-2002,2004-2006].codfw.wmnet,wdqs[1004-1010].eqiad.wmnet [06:31:59] SMalyshev: can we powercycle 1004 or does it need a specific depool procedure? [06:32:00] <_joe_> these machines need to be powercycled AFAICS [06:32:06] _joe_: yep [06:32:08] <_joe_> elukey: go on ffs [06:32:19] elukey: no, no procedure [06:32:23] <_joe_> those machines are dead [06:32:32] <_joe_> also pybal will depool them [06:32:35] you can cycle it and LB should figure it out [06:32:42] <_joe_> let's do one at a time for datacenter [06:32:45] <_joe_> ok? [06:32:46] _joe_: weirdly, they seem to answer queries [06:32:59] <_joe_> who is going to powercycle in eqiad? [06:33:04] <_joe_> and who in codfw? [06:33:11] <_joe_> ack here, and !log every action please [06:33:18] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:33:20] I am starting with 1004 [06:33:28] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:33:30] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) Also, I can not access most hosts by ssh. Whatever updater problem is there, it should not b... [06:33:31] !log powercycle wdqs1004 [06:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:34] SMalyshev: that freaked me and mutante out. No ssh access but queries were going through [06:33:38] <_joe_> ok [06:33:53] <_joe_> onimisionipe: can you powercycle the servers in codfw? 
or should I do it? [06:34:06] yup, no idea what is going on there... [06:34:16] let's do them one by one, so querying is not disrupted too much [06:34:35] <_joe_> SMalyshev: should not be disrupted at all if we do one-by-one [06:34:39] <_joe_> pybal will depool them [06:34:42] _joe_: I think you should do it [06:34:45] <_joe_> ok! [06:35:01] <_joe_> !log powercycling wdqs[2001-2002,2004-2006].codfw.wmnet, one at a time [06:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:24] I am following 1004's boot now [06:36:18] PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100% [06:36:28] sure sure [06:36:33] <_joe_> I am going to disable the updater once the machine comes up [06:36:55] it is up now [06:37:07] _joe_: ok, though after puppet runs the updater should be fine [06:37:08] RECOVERY - Host wdqs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [06:37:09] RECOVERY - SSH on wdqs1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [06:37:14] _joe_: name is `wdqs-updater` [06:37:31] I can log in fine [06:37:35] _joe_: I switched it to RC API which is not affected by kafka problem [06:37:52] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time [06:37:58] <_joe_> SMalyshev: still, I'd like us to restart ASAP [06:38:06] <_joe_> ok this wasn't really expected? [06:38:06] _joe_: in fact, updater runs fine now on 1004 [06:38:14] uh [06:38:17] _joe_: what wasn't expected? [06:38:20] <_joe_> looks like the pybal config is bad [06:38:27] <_joe_> the page from wdqs.svc.eqiad.wmnet [06:38:40] <_joe_> can someone phone guillame please? [06:38:53] hmm ah yes, looks like it didn't depool as it should have... 
[06:39:02] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.017 second response time [06:39:13] that was fast [06:39:17] anyway, 1004 seems to be healthy now [06:39:23] Cool [06:39:30] <_joe_> please refrain from taking further actions [06:39:35] <_joe_> until I say it's ok [06:39:35] ack [06:39:39] sure [06:40:00] <_joe_> wdqs1003.eqiad.wmnet: disabled/up/not pooled [06:40:11] <_joe_> 1 - we have only 3 machines per pool [06:40:18] RECOVERY - SSH on wdqs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [06:40:21] <_joe_> 2 - why is wdqs1003 not pooled? [06:40:59] _joe_: hmmm... good q. maybe it was depooled due to lag issues [06:41:10] _joe_: we need to repool it then [06:41:16] <_joe_> SMalyshev, onimisionipe any idea why wdqs1003 was manually depooled? [06:41:30] _joe_: ^^ probably previous lag issues [06:41:32] <_joe_> yes, can you do it? [06:41:36] sure [06:41:40] <_joe_> SMalyshev: and !log it here please [06:41:48] SMalyshev: that's weird [06:42:07] !log repooled wdqs1003 [06:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:13] The last time I repooled was Sunday after it had lag issues.. [06:42:37] it has lag issues all over, that's separate story [06:42:54] and what was wdqs1005 doing then? [06:43:02] there's something very broken going on there, but that's not for now as we have bigger fish to deal with [06:43:05] it seems pooled now, even if ssh doesn't answer [06:43:35] 1005 is answering queries I assume... but no idea what's up there [06:43:41] SMalyshev: true [06:43:56] <_joe_> elukey: all servers are pooled [06:43:56] We need to bring gehel here... [06:44:13] <_joe_> but pybal clearly checks for a static url [06:44:20] sigh [06:44:25] <_joe_> onimisionipe: I asked to phone him since 10 minutes [06:44:45] <_joe_> the only thing to do now is reboot servers one by one [06:44:46] has anybody called him? [06:44:58] _joe_ 2005 is next I suppose? 
[06:45:01] err 1005 [06:45:02] <_joe_> elukey: 2002 [06:45:05] <_joe_> for me [06:45:09] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:45:10] <_joe_> 1005 for you I guess [06:45:19] <_joe_> if only I could get into 2002 [06:45:27] onimisionipe: can you call Gehel please? [06:45:34] I just called him [06:45:38] super thanks! [06:45:39] He is coming [06:45:47] I am going to wait for 1005 then [06:45:56] <_joe_> elukey: no, go for it [06:46:02] Crap, I'm on the way to daycare with Oscar [06:46:02] <_joe_> there is no reason to wait, really [06:46:12] <_joe_> gehel: ok we can take care of things then [06:46:13] gehel: argh sorry :( [06:46:16] yeah sure [06:46:25] !log powercycle wdqs1005 [06:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:30] <_joe_> elukey: if you really want, depool the servers while you powercycle [06:46:43] <_joe_> it shouldn't be an issue if 1003 is repooled though [06:47:08] PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100% [06:47:15] I can't think of anything else than power cycle at this point [06:47:27] <_joe_> as to why it's impossible to ssh into those hosts, I guess it has to do with journald and some interactions with ssh [06:47:30] <_joe_> but no idea tbh [06:47:38] I'll hurry back home, but at least 20' [06:47:42] _joe_: yep [06:47:50] gehel: no need to, after the reboot it seems fine [06:47:51] <_joe_> gehel: take your time, we're powercycling servers now [06:47:54] please take care of oscar [06:47:59] <_joe_> +1 :P [06:48:08] RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [06:48:17] 1005 is booting [06:48:18] RECOVERY - SSH on wdqs2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [06:48:28] yeah looks like the healing powers of powercycle are in effect [06:48:51] SMalyshev: :) [06:48:57] <_joe_> SMalyshev: it's clear that the inability to log in 
is due to resource starvation [06:49:02] <_joe_> so rebooting solves it [06:49:13] <_joe_> sometimes just temporarily, but not in this case it seems [06:49:26] _joe_: can anything be done about it? I mean, one puny process shouldn't bring down the whole thing... [06:49:38] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [06:49:46] whatever went wrong there with the updater, even if it [06:49:49] RECOVERY - SSH on wdqs1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [06:49:58] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [06:50:06] is completely broken, should not kill the rest of the machine [06:50:18] I have my suspicion that journald is at fault [06:50:25] (but may be wrong) [06:50:53] let's do a good incident report on this [06:50:59] so we can verify what went bad [06:51:07] hosts come up with correct puppet so that's good [06:51:23] ok so 1005 is up [06:51:34] <_joe_> SMalyshev: I think journald interaction with ssh [06:51:37] elukey: I am recording all weird stuff into T207817 for now but there definitely will be incident report and TODOs [06:51:37] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 [06:51:46] <_joe_> 06:47 < _joe_> as to why it's impossible to ssh into those hosts, I guess it has to do with journald and some interactions with ssh [06:52:06] _joe_: ok, great minds :) [06:52:27] Lol [06:52:28] <_joe_> wdqs1003/wdqs2003 report systemd failures [06:52:32] <_joe_> anyone knows why? [06:52:37] looks like we have to get rid of journald reporting in wdqs-updater asap [06:52:49] _joe_: hmm let me check [06:53:12] it is the wdqs-updater.service [06:53:20] we have stopped it right? 
[06:53:25] it is listed as failed [06:53:43] _joe_: ah, I stopped updater there by request :) [06:53:44] yep yep just seen logs [06:53:47] I can put it back [06:53:48] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [06:53:56] Cool [06:54:14] <_joe_> SMalyshev: ok cool [06:54:30] <_joe_> sorry I wasn't sure, I didn't notice the SAL entries [06:54:59] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [06:55:00] so now wdqs-internal? [06:55:05] 1006-8 ? [06:55:14] they seem stuck as well [06:55:16] elukey: yes [06:55:29] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational [06:55:36] so they are all pooled, so I am going to depool one and powercycle it [06:55:39] RECOVERY - SSH on wdqs2004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [06:55:50] !log powercycle wdqs1006 (depool first) [06:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:50] <_joe_> elukey: no need to depool [06:57:02] <_joe_> pybal does that for you if it's one server [06:57:28] following 1006's boot [06:59:08] PROBLEM - Host wdqs1006 is DOWN: PING CRITICAL - Packet loss = 100% [06:59:28] RECOVERY - Host wdqs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [06:59:28] RECOVERY - SSH on wdqs1006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [06:59:32] 1006 back [06:59:53] !log powercycle wdqs1007 [06:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:09] RECOVERY - SSH on wdqs2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [07:00:48] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:01:20] following 1007's boot [07:01:30] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Done with CPT), and 2 others: Setup access from service to mysql 
- https://phabricator.wikimedia.org/T205452 (10mobrovac) 05Open>03Resolved It turns out SCB nodes already have access to the needed DB host. Resolving. [07:02:58] PROBLEM - Host wdqs1007 is DOWN: PING CRITICAL - Packet loss = 100% [07:03:08] RECOVERY - SSH on wdqs1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [07:03:18] RECOVERY - Host wdqs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [07:03:34] SMalyshev: can you check how the query service is doing? [07:03:46] I am seeing a warning in icinga about response time [07:03:54] probably due to reboots buuut let's check :) [07:04:41] !log powercycle wdqs1008 [07:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) The message that comes from mediawiki.org looks like: ``` {"comment": "Stabl", "database":... [07:06:35] elukey: checking [07:06:38] RECOVERY - SSH on wdqs2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [07:07:01] elukey: queries are running... [07:07:17] sure sure, just wanted to make sure that metrics looked ok [07:07:29] RECOVERY - SSH on wdqs1008 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [07:07:36] dashboards do not show any anomalous load [07:07:42] ack [07:07:52] 1008 up, eqiad reboots completed afaics [07:08:03] _joe_ --^ [07:08:04] <_joe_> elukey: I'm done with codfw [07:08:14] updater starts caching up: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&from=now-6h&to=now&panelId=8&fullscreen [07:08:42] will take some time to get back 6 hours [07:08:47] <_joe_> (2) wdqs[1009-1010].eqiad.wmnet [07:08:51] <_joe_> what are those? 
[07:09:17] one is test (1010) [07:09:21] <_joe_> SMalyshev: ack, I'd expect the post-mortem for this incident will involve multiple people from multiple teams [07:09:22] _joe_: test cluster [07:09:31] wdqs-test [07:09:33] <_joe_> ok guillame can restart those later [07:09:34] do them too [07:09:39] <_joe_> I can get back to my work [07:09:50] _joe_: ok, yeah gehel can do them [07:10:07] _joe_: thanks! [07:10:19] I will write report tomorrow... we still need to figure out what broke the timestamps [07:10:27] Joe, elukey Thanks! [07:10:31] and do about 20 other things of course [07:10:56] SMalyshev: I can imagine [07:11:04] <_joe_> SMalyshev: yes, we need to figure out what change caused your problem, why it wasn't communicated, why we didn't catch this issue before getting to production, etc [07:11:05] yeah I am curious why the timestamp changed [07:11:12] * gehel is back, reading backlog [07:11:28] gehel: T207817 [07:11:29] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 [07:11:34] ^ a good start [07:11:52] <_joe_> gehel: so, please don't powercycle wdqs1009 for now, I want to see if I can figure out why ssh doesn't work [07:13:00] gehel: if it's journald that means we need that patch for cleaning up logging asap. I still have some requests for it but I am really tired now so probably will get to it tomorrow [07:13:28] SMalyshev: yep, get some sleep, I'll keep an eye on the herd [07:14:17] I'll be around about .5 hour just in case, but then probably will go to bed [07:15:43] _joe_, elukey, SMalyshev, onimisionipe (and all others): thanks for taking care of that! 
[07:18:32] <_joe_> gehel: I'm powercycling 1009 too [07:19:02] <_joe_> !log powercycling wdqs1009 [07:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:10] <_joe_> can't log into the console either [07:19:47] Ack [07:20:13] * gehel is getting coffee and starting to dig into the mess [07:20:22] I feel it's wrong that one process can mess up environment so much that root console login is not working... maybe we're configuring something wrong there [07:20:58] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [07:21:18] RECOVERY - Host wdqs1009 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [07:21:48] RECOVERY - SSH on wdqs1009 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [07:27:33] gehel: I've created incident page here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181024-WDQS - will fill it up tomorrow [07:28:37] SMalyshev: thanks! I'll update it [07:28:57] SMalyshev: I'll probably need your help (or someone else) to track the source of that date format change [07:29:04] gehel: did you restart wdq1010? I still can't reach it [07:29:09] <_joe_> gehel: releng [07:29:14] <_joe_> ask them [07:29:16] _joe_: ok, will ping [07:29:30] SMalyshev: I did not, can do it now [07:29:34] gehel: I added it to T206655 [07:29:35] T206655: 1.33.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T206655 [07:29:40] gehel: please do [07:29:52] unless anybody needs the zombie for investigation? 
[07:30:06] but if not please cycle it [07:31:37] SMalyshev: I'll just try to see if I can see something, then powercycle [07:31:44] gehel: cool, thanks [07:32:50] can't log either, and console is too overloaded to be good for anything [07:33:26] !log powercycling wdqs1010 - T207817 [07:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:30] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 [07:34:11] (03PS1) 10Elukey: profile::eventlogging::analytics::files: reduce retention to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/469384 (https://phabricator.wikimedia.org/T206542) [07:34:38] PROBLEM - Host wdqs1010 is DOWN: PING CRITICAL - Packet loss = 100% [07:34:51] (03CR) 10jerkins-bot: [V: 04-1] profile::eventlogging::analytics::files: reduce retention to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/469384 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [07:35:18] RECOVERY - SSH on wdqs1010 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [07:35:19] RECOVERY - Host wdqs1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [07:36:01] 09:34:46 An error occurred while loading ./spec/classes/profile_base_certificate_spec.rb. [07:36:04] 09:34:46 Failure/Error: [07:36:58] gehel: is it what you are asking to releng? [07:37:06] (I didn't get it from the backscroll) [07:37:27] (03CR) 10Elukey: [V: 032 C: 032] profile::eventlogging::analytics::files: reduce retention to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/469384 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [07:37:40] elukey: the profile_base_certificate? No. 
I'll ask about the timestamp change [07:37:54] ahh sorry [07:38:12] yep lemme know what is the answer, curious [07:39:54] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10fgiunchedi) We enable backports by default nowadays, so pinning said package in puppet to backports will DTRT I think. [07:45:28] elukey: so am I! But it will take some time to track down [07:49:19] (03PS2) 10Gehel: wdqs: rate limit log sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/468979 (https://phabricator.wikimedia.org/T207656) [07:49:54] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: WDQS logging to logstash should be rate limited - https://phabricator.wikimedia.org/T207656 (10Gehel) [07:50:00] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Gehel) [07:51:04] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: WDQS logging should be rate limited - https://phabricator.wikimedia.org/T207656 (10Gehel) [07:53:39] 10Puppet, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Beta-Cluster-reproducible, 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10fgiunchedi) I'm seeing only `beta-search` cluster configured by puppet on depl... 
[08:03:34] !log fix aggregation to 'sum' for MediaWiki.RevisionSlider - T205416 [08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:37] T205416: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 [08:07:31] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10fgiunchedi) Apologies for the delay, I've now fixed the aggregation to 'sum' for `MediaWik... [08:40:02] (03PS1) 10DCausse: [deployment-prep] fix elastic config for deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) [08:45:19] (03PS1) 10Marostegui: db1092: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/469388 [08:46:40] (03CR) 10Jcrespo: [C: 031] "But to merge after restart/upgrade only." 
[puppet] - 10https://gerrit.wikimedia.org/r/469388 (owner: 10Marostegui) [08:47:42] !log Update MySQL on db1092 for upgrade and reboot [08:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:39] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 [08:48:41] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving requests with php7 [puppet] - 10https://gerrit.wikimedia.org/r/469202 [08:48:43] (03PS2) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [08:52:03] <_joe_> grrr [08:52:08] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 [08:52:10] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving requests with php7 [puppet] - 10https://gerrit.wikimedia.org/r/469202 [08:52:12] (03PS3) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [08:54:36] (03CR) 10Marostegui: [C: 032] db1092: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/469388 (owner: 10Marostegui) [08:55:19] !log Stop MySQL for upgrade and reboot on db1087 [08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:17] (03CR) 10Giuseppe Lavagetto: [C: 031] "This will only add the proxy definition with ProxySet which should be innocuous until we activate the feature flag somewhere:" [puppet] - 10https://gerrit.wikimedia.org/r/469201 (owner: 10Giuseppe Lavagetto) [09:03:17] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1092,db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469389 [09:04:19] (03CR) 10Marostegui: [C: 04-1] "Wait for db1087 to catchup and to enable notifications" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469389 (owner: 10Marostegui) 
[09:07:00] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1092,db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469389 (owner: 10Marostegui) [09:18:40] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1092,db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469389 (owner: 10Marostegui) [09:20:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1092 and db1087 (duration: 01m 05s) [09:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:18] 10Operations, 10Revision-Slider, 10TCB-Team, 10WMDE-Analytics-Engineering, 10Graphite: Fix aggregation of "MediaWiki.RevisionSlider.event.load.sum" from average to sum - https://phabricator.wikimedia.org/T205416 (10Lea_WMDE) Thanks @fgiunchedi! Just to be sure: That update won't fix already aggregated nu... 
[09:21:28] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469392 [09:23:10] (03CR) 10DCausse: [C: 04-1] [deployment-prep] fix elastic config for deployment-logstash2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) (owner: 10DCausse) [09:25:01] (03CR) 10Filippo Giunchedi: [C: 031] mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 (owner: 10Giuseppe Lavagetto) [09:26:35] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10hashar) [09:26:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10hashar) Follow up for LDAP groups: T207833 [09:28:44] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Gehel) p:05Triage>03High [09:28:58] (03PS10) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T207834) [09:29:32] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) I can investigate how difficult is this and give a better guess-estimate of the disruption to end users. I'll try the approach that @chasemp suggested, h... 
[09:30:37] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1092,db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469389 (owner: 10Marostegui) [09:32:31] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469392 (owner: 10Marostegui) [09:38:17] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469392 (owner: 10Marostegui) [09:39:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1092 (duration: 00m 54s) [09:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:48] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (10Gehel) p:05Triage>03High [09:46:14] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469392 (owner: 10Marostegui) [09:50:16] (03PS1) 10Filippo Giunchedi: puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 [09:50:55] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 (owner: 10Filippo Giunchedi) [09:51:01] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469399 [09:52:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469399 (owner: 10Marostegui) [09:53:37] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1092 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469399 (owner: 10Marostegui) [09:54:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1092 (duration: 00m 54s) [09:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:58] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/469398 (owner: 10Filippo Giunchedi) [09:58:58] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469402 [09:59:52] (03PS1) 10Ema: ATS: remove trafficserver.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/469403 [10:01:30] (03PS2) 10Ema: ATS: remove trafficserver.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/469403 (https://phabricator.wikimedia.org/T200178) [10:02:07] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469399 (owner: 10Marostegui) [10:02:53] (03CR) 10Ema: [C: 032] ATS: remove trafficserver.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/469403 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [10:05:35] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 [10:05:37] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving requests with php7 [puppet] - 10https://gerrit.wikimedia.org/r/469202 [10:05:39] (03PS4) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [10:11:11] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:11:30] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:11:31] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [10:12:00] RECOVERY - MariaDB Slave Lag: s5 
on db2038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:12:00] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:12:01] RECOVERY - MariaDB Slave Lag: s5 on db2094 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [10:12:01] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:12:01] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:16:37] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) @ayounsi What time do you want to start working on this today? I can be on site by 10:30 am Chicago time. [10:22:55] (03PS1) 10Alex Monk: Avoid mediawiki::state within labs for now. [puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) [10:23:45] (03CR) 10jerkins-bot: [V: 04-1] Avoid mediawiki::state within labs for now. [puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) (owner: 10Alex Monk) [10:26:20] well that's a weird failure [10:26:24] 10:23:42 An error occurred while loading ./spec/classes/profile_base_certificate_spec.rb. [10:26:25] [stuff] [10:26:29] 10:23:42 LoadError: [10:26:30] 10:23:42 cannot load such file -- augeas [10:26:37] (03CR) 10Alex Monk: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) (owner: 10Alex Monk) [10:26:56] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we should add a static file on the labs standalone puppetmasters instead, much like we did for the puppet compiler, see https://ge" [puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) (owner: 10Alex Monk) [10:27:10] (03CR) 10jerkins-bot: [V: 04-1] Avoid mediawiki::state within labs for now. 
[puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) (owner: 10Alex Monk) [10:27:23] <_joe_> Krenair: I think you're not the first who had that problem [10:27:37] <_joe_> it seems some issue in the bundle we use for CI? [10:27:44] <_joe_> it's pretty strange though [10:28:06] hey ops, I'm over in #wikimedia-fundraising and I see lots of icinga-wm errors but none of our fr-tech-ops guys are online. Could someone give me access or provide some details into what's being reported so I can look into it? [10:28:26] icinga-wm> PROBLEM - check_rsyslog_backlog on payments1002 is CRITICAL: CRITICAL frlog1001=27 [critical = 10] [10:28:33] <_joe_> jgleeson: access to what? [10:28:36] (03PS1) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [10:28:47] !log Compare revision table on dewiki cebwiki shwiki srwiki mgwiktionary enwikivoyage on db1100 and db2075 - T184805 [10:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:50] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [10:28:55] <_joe_> to the fundraising servers? I wouldn't know how to do that [10:29:05] <_joe_> I don't have access to FR puppet or to the machines [10:29:52] so I have access to some, but I'm not familiar with icinga. Is it set up to run checks and report the success? 
if so, it would be good to see what tests are failing to trigger the notice [10:30:21] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [10:30:30] <_joe_> the checks are running on the servers, they typically are under /etc/nagios [10:30:47] ok cool, I'll take a look [10:30:48] thanks [10:30:49] <_joe_> that's where they are configured [10:31:41] <_joe_> and the actual program to be run is usually listed inside there [10:32:07] (03PS2) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [10:32:19] (03PS2) 10DCausse: [deployment-prep] fix elastic config for deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) [10:32:50] jgleeson: check in /etc/nagios/nrpe.d if there is a file check_rsyslog_backlog [10:32:59] s/check_rsyslog_backlog/check_rsyslog_backlog.cfg/ [10:33:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469402 (owner: 10Marostegui) [10:33:07] that should tell you the local check that is run there [10:33:45] (03Abandoned) 10Alex Monk: Avoid mediawiki::state within labs for now. 
[puppet] - 10https://gerrit.wikimedia.org/r/469406 (https://phabricator.wikimedia.org/T206598) (owner: 10Alex Monk) [10:34:14] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [10:34:27] <_joe_> volans: it's a passive check, not sure that's where the check config will be [10:34:27] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469402 (owner: 10Marostegui) [10:34:33] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet error on deployment-mwmaint01 - https://phabricator.wikimedia.org/T206598 (10Krenair) 05Open>03Resolved Created deployment-puppetmaster03:/etc/conftool-state/mediawiki.yaml per @joe [10:35:13] <_joe_> Krenair: should we puppetize that maybe? [10:35:20] <_joe_> it's useful in all of labs IMHO [10:35:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1092 and starting to restore db1104 original weight (duration: 00m 54s) [10:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:48] yeah it might not be the same being NSCA, but I'm not sure how they are running them [10:35:56] before proxying them as passive checks to us [10:38:03] (03PS1) 10Ema: ATS: temporarily avoid calling 'verify_config' in ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/469408 (https://phabricator.wikimedia.org/T204232) [10:38:28] so I've found /etc/nagios_nsca.conf with a bunch of lines that look like check-type jobs [10:38:44] and I see the one being reported in the irc chan [10:39:02] although it looks like it maps to a custom check, here is the line: [10:39:14] check_rsyslog_backlog check_rsyslog_backlog_sudo [10:40:16] _joe_, hmm. what if conftool starts being used for more than just the active DC and readonly states? 
[10:40:24] guessing I need to find the contents or check behaviour associated with 'check_rsyslog_backlog_sudo' [10:40:31] yes [10:40:47] <_joe_> Krenair: it shouldn't [10:40:54] ok [10:41:04] <_joe_> Krenair: we should remove those uses, even [10:41:18] in prod? [10:42:35] * Krenair will be back later [10:44:55] found the file, thanks all [10:47:32] (03CR) 10Ema: [C: 032] ATS: temporarily avoid calling 'verify_config' in ExecReload [puppet] - 10https://gerrit.wikimedia.org/r/469408 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [10:48:44] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469402 (owner: 10Marostegui) [10:49:59] jgleeson: great, let us know if we can help further, also in a couple of hours I guess any FR-ops might start showing up [10:50:38] I've found the perl job performing the check, I just need to work out which logs it's writing to now [10:51:05] nothing obvious nagios/icinga related in /var/log [10:52:27] already tried syslog? [10:52:41] (03PS1) 10Marostegui: db-eqiad.php: Restore db1092 and db1104 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469409 [10:53:47] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13180/mw1261.eqiad.wmnet/ this is now a complete noop. 
Merging as there is consensus this" [puppet] - 10https://gerrit.wikimedia.org/r/469201 (owner: 10Giuseppe Lavagetto) [10:53:56] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::vhost: use handler to do proxying [puppet] - 10https://gerrit.wikimedia.org/r/469201 [10:53:56] yup just searching that now [10:53:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1092 and db1104 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469409 (owner: 10Marostegui) [10:55:27] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1092 and db1104 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469409 (owner: 10Marostegui) [10:56:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore db1092 and db1104 original weight (duration: 00m 52s) [10:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:33] o/ [11:00:40] no patches, no swat [11:01:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:02:36] (03PS3) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [11:03:37] so the script looks like it's using syslog, but there's nothing in syslog relating to icinga or a failure relating to rsync [11:03:52] tricky [11:04:10] (03PS4) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [11:04:26] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1092 and db1104 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469409 (owner: 10Marostegui) [11:06:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:09:18] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving requests with php7 [puppet] - 10https://gerrit.wikimedia.org/r/469202 [11:10:05] (03PS1) 10Ema: ATS: remove synthetic and mgmt ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/469412 (https://phabricator.wikimedia.org/T204232) [11:10:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:11:49] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13181/ this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/469202 (owner: 10Giuseppe 
Lavagetto) [11:13:24] (03PS2) 10Ema: ATS: remove synthetic and mgmt ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/469412 (https://phabricator.wikimedia.org/T204232) [11:14:42] (03CR) 10Ema: [C: 032] ATS: remove synthetic and mgmt ports configuration [puppet] - 10https://gerrit.wikimedia.org/r/469412 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [11:22:34] !log cp1071: upgrade trafficserver to 8.0.0-1wm1 T204232 [11:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:38] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [11:25:44] (03CR) 10Volans: "The approach looks good, few nitpick/optional things inline." (035 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [11:28:36] _joe_, so are you saying that the mediawiki::state function should not be reading stuff out of conftool like that? [11:29:00] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: CRIT: Long running SCREEN process. (user: fsalutari PID: 27549, 1733774s 1728000s). [11:29:50] <_joe_> Krenair: that the file that feeds it should be a static file on disk [11:29:56] <_joe_> and not generated by conftool [11:30:10] ok [11:34:21] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) To further clarify the point above about using percentage based threshold, here is a screenshot showing the perce... 
[11:34:48] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: fix the proxies declarations [puppet] - 10https://gerrit.wikimedia.org/r/469413 [11:44:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): increase restart interval of wdqs updater - https://phabricator.wikimedia.org/T207843 (10Gehel) p:05Triage>03High [11:53:40] (03PS1) 10Elukey: profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 [11:54:25] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 (owner: 10Elukey) [11:56:00] hashar: o/ [11:56:15] jenkins for puppet catalog compiler seems broken :( [11:56:35] <_joe_> elukey: CI for puppet you mean [11:57:06] yes [11:57:30] (03PS1) 10Ema: ATS: use YAML format for logging config file [puppet] - 10https://gerrit.wikimedia.org/r/469417 (https://phabricator.wikimedia.org/T204232) [11:58:21] 10Operations, 10monitoring, 10Patch-For-Review: newer version of nagios-nrpe-plugin nrpe (check_nrpe) with fixed logging issue on stretch icinga - https://phabricator.wikimedia.org/T207775 (10fgiunchedi) >>! In T207775#4691005, @fgiunchedi wrote: > We enable backports by default nowadays, so pinning said pac... 
[11:58:33] 13:54:23 LoadError: [11:58:34] 13:54:23 cannot load such file -- augeas [11:59:27] yeah I ran into that earlier too :( [11:59:35] running rspec for module puppetmaster iirc [11:59:59] running ./spec/classes/profile_base_certificate_spec.rb [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1200) [12:00:27] <_joe_> it's the rspec for profile that fails because it doesn't find augeas [12:04:57] (03CR) 10Ema: [C: 032] ATS: use YAML format for logging config file [puppet] - 10https://gerrit.wikimedia.org/r/469417 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [12:05:48] mmm ema got the +2? [12:06:09] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 (owner: 10Elukey) [12:06:16] (03PS2) 10Elukey: profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 [12:06:24] let's see [12:07:03] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 (owner: 10Elukey) [12:09:02] whattt [12:09:44] jerkins, omen nomen [12:11:35] no idea where to check honestly [12:11:38] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10jijiki) p:05Triage>03Normal a:03jijiki [12:12:04] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 (owner: 10Filippo Giunchedi) [12:12:12] (03PS2) 10Filippo Giunchedi: puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 [12:12:53] !log cp1072: upgrade trafficserver to 8.0.0-1wm1 T204232 [12:12:54] (03CR) 10jerkins-bot: [V: 04-1] 
puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 (owner: 10Filippo Giunchedi) [12:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:56] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [12:21:52] (03PS1) 10Elukey: eventlogging: whitelist ContentTranslationAbuseFilter [puppet] - 10https://gerrit.wikimedia.org/r/469419 [12:24:13] !log cp-ats: upgrade trafficserver to 8.0.0-1wm1 T204232 [12:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:17] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [12:24:30] (03CR) 10Joal: "LGTM !" [puppet] - 10https://gerrit.wikimedia.org/r/469419 (owner: 10Elukey) [12:24:48] (03CR) 10Amire80: [C: 031] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/469419 (owner: 10Elukey) [12:34:42] (03PS1) 10Ema: ATS: stop shipping Lua format logging config file [puppet] - 10https://gerrit.wikimedia.org/r/469421 (https://phabricator.wikimedia.org/T204232) [12:36:35] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10jijiki) a:03Vgutierrez [12:36:46] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10jijiki) p:05Triage>03Normal [12:38:20] (03CR) 10Ema: [C: 032] ATS: stop shipping Lua format logging config file [puppet] - 10https://gerrit.wikimedia.org/r/469421 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [12:39:44] (03PS2) 10Joal: Add configuration for java-logging in hive conf [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 [12:43:08] (03PS3) 10MSantos: Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) [12:43:54] (03PS1) 10Elukey: Add role::statistics::private to 
stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) [12:46:11] 10Operations, 10monitoring: monitor postgresql replication status - https://phabricator.wikimedia.org/T116580 (10jijiki) [12:46:52] (03CR) 10MSantos: "> Patch Set 2: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) (owner: 10MSantos) [12:48:41] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13185/" [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [12:50:22] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) a:05Marostegui>03jcrespo I have checked `revision` and `user` table for all the wikis in s5: ``` dewiki cebwiki shwiki srwiki mgwiktionar... [12:50:30] (03PS4) 10MSantos: Decrease OSM update Frequency [puppet] - 10https://gerrit.wikimedia.org/r/469329 (https://phabricator.wikimedia.org/T205735) [12:52:59] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [12:53:02] 10Operations, 10Traffic, 10Patch-For-Review: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) 05Open>03Resolved a:03ema Upgrade finished! [12:54:28] (03PS2) 10Elukey: Add role::statistics::private to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) [12:58:08] 10Operations, 10monitoring: monitor postgresql replication status - https://phabricator.wikimedia.org/T116580 (10Volans) I guess the description should be updated, as we have more installations in prod now, and we actually already have a check for replication, see `modules/postgresql/manifests/slave/monitoring... 
[12:58:34] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: fix the proxies declarations [puppet] - 10https://gerrit.wikimedia.org/r/469413 (owner: 10Giuseppe Lavagetto) [12:58:45] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: fix the proxies declarations [puppet] - 10https://gerrit.wikimedia.org/r/469413 [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1300) [13:03:54] 10Operations, 10Operations-Software-Development: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 (10ema) [13:04:05] 10Operations, 10Operations-Software-Development: debdeploy: show help message if invoked with no arguments - https://phabricator.wikimedia.org/T207845 (10ema) p:05Triage>03Low [13:11:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10mmodell) Do we have a patch or should we roll back group0? [13:11:56] 10Operations, 10ops-eqiad, 10netops: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) 05Open>03Resolved Seems all solved. [13:12:34] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: reorganize default settings [puppet] - 10https://gerrit.wikimedia.org/r/469425 [13:15:32] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) 11:00 Chicago time works for me, I sent you a calendar invitation. [13:15:51] 10Operations, 10ops-eqiad, 10netops: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [13:16:16] (03CR) 10GTirloni: "Just want to make sure this quick hack is fine in this case. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/468865 (https://phabricator.wikimedia.org/T184261) (owner: 10GTirloni) [13:17:40] !log begin cache hosts rolling reboots for kernel/microcode updates T203011 [13:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:20] (03PS5) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [13:18:31] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10jijiki) @Addshore could you please give us some context on this? thank you! [13:19:40] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10jijiki) @Addshore could you please give us some context on this? thank you! [13:19:59] (03CR) 10ArielGlenn: "For dumps, this looks right as far as it goes. When stat1007 is ready to take over stat1005's jobs, there's a couple more changes you'll n" [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [13:20:48] 10Operations, 10Patch-For-Review: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10Aklapper) [13:23:17] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10jijiki) @Addshore so we need to remove @daniel from `wmde` group and add them to `wmf` ? [13:24:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Ayounsi, for this ticket, shall we ask for these to be set up in the public VLAN? 
[13:33:55] (03CR) 10Filippo Giunchedi: [C: 031] wdqs: rate limit log sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/468979 (https://phabricator.wikimedia.org/T207656) (owner: 10Gehel) [13:35:17] (03CR) 10Filippo Giunchedi: [C: 031] "To Moritz's comment, I believe this can go ahead now and absent the memcached collector later." [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:35:42] !log pre-configure switch ports for labvirt1007/8/9/12:eth1 in cloud-virt-instance-trunk range on asw2-b-eqiad [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:45] (03CR) 10Filippo Giunchedi: "> What works (according to pcc) is an include below the role() in site.pp for the logstash es data hosts. Naturally wmf-style flags this," [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:39:23] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) That sounds good to me but will have @faidon doublecheck. Ideally please distribute those servers across multip... [13:42:16] (03CR) 10Filippo Giunchedi: [C: 04-1] "This is unfortunately still blocked on getting memcached-exporter on wmcs hosts, which in turn is blocked on the fact that the memcached-e" [puppet] - 10https://gerrit.wikimedia.org/r/469250 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:42:28] cmjohnson1, andrewbogott, arturo, are we still good to move the cloud servers to asw2-b-eqiad in ~15min ? [13:42:59] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10Aklapper) Which context is wanted? A more explicit "aude is not listed anymore as working for wmde", or something else? :) [13:43:29] XioNoX: yes. 
thanks for the reminder; I'll send an email now [13:44:23] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I can merge this tomorrow EU morning" [puppet] - 10https://gerrit.wikimedia.org/r/469387 (https://phabricator.wikimedia.org/T205672) (owner: 10DCausse) [13:49:42] XioNoX I am not there today, I am still not able to walk [13:49:52] sorry..i should've updated you yesterday [13:50:15] XioNoX, cmjohnson1, can we just push this out seven days? Same day/time next week? [13:50:42] that works for me [13:50:59] also, cmjohnson1, Ouch! Hope it's something that heals quickly! [13:51:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 41 probes of 322 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:53:24] (03PS3) 10Elukey: profile::analytics::refinery::job::camus: remove unused param [puppet] - 10https://gerrit.wikimedia.org/r/469415 [13:54:49] cmjohnson1: ok with the 31st? [13:55:16] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10Addshore) >>! In T207788#4691603, @jijiki wrote: > @Addshore so we need to remove @daniel from `wmde` group and an... [13:55:23] (03PS8) 10Herron: logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [13:56:18] andrewbogott: yes, that will be great..i am moving dumpsdata1001 at 1200 so right before or after [13:56:58] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 21 probes of 322 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:57:05] ok — I updated the calendar item. Thanks. 
[13:57:10] 10Operations, 10DBA: Populate the wikishared db on all dbstores - https://phabricator.wikimedia.org/T126252 (10Marostegui) I am not sure whether this still applies or not, does it? [13:57:39] (03CR) 10Ottomata: [C: 031] eventlogging: whitelist ContentTranslationAbuseFilter [puppet] - 10https://gerrit.wikimedia.org/r/469419 (owner: 10Elukey) [13:57:47] (03PS3) 10Elukey: Add role::statistics::private to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) [13:58:16] (03PS2) 10Elukey: eventlogging: whitelist ContentTranslationAbuseFilter [puppet] - 10https://gerrit.wikimedia.org/r/469419 [13:58:57] (03PS1) 10Andrew Bogott: mwyaml_backend.rb: use safe_load when loading hiera config from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/469431 (https://phabricator.wikimedia.org/T171289) [13:59:23] 10Operations, 10DBA: Populate the wikishared db on all dbstores - https://phabricator.wikimedia.org/T126252 (10jcrespo) 05Open>03Resolved a:03jcrespo This was done long time ago on dbstore1002, and doesn't apply anymore on dbstores due to multiinstance. 
[14:00:08] (03CR) 10Ottomata: "One nit, but +1 after that" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 (owner: 10Joal) [14:00:28] (03CR) 10Elukey: [C: 032] eventlogging: whitelist ContentTranslationAbuseFilter [puppet] - 10https://gerrit.wikimedia.org/r/469419 (owner: 10Elukey) [14:00:59] PROBLEM - DPKG on cp1087 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:01:08] PROBLEM - DPKG on cp1089 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:01:17] oops, that's me ^ [14:01:25] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13188/" [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [14:01:28] PROBLEM - DPKG on cp1075 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:01:49] PROBLEM - DPKG on cp1077 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:01:49] PROBLEM - DPKG on cp1083 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:01:58] (03CR) 10Ottomata: [C: 031] Add role::statistics::private to stat1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [14:01:59] PROBLEM - DPKG on cp1085 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:02:04] /o\ [14:02:08] (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: reorganize default settings [puppet] - 10https://gerrit.wikimedia.org/r/469425 [14:02:09] (03PS6) 10Giuseppe Lavagetto: beta: start using set_handler instead of the proxy passes [puppet] - 10https://gerrit.wikimedia.org/r/469203 [14:02:11] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: only include the hhvm catchall if set_handler is not set [puppet] - 10https://gerrit.wikimedia.org/r/469432 [14:02:15] (03PS2) 10Andrew Bogott: mwyaml_backend.rb: use safe_load when loading hiera config from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/469431 (https://phabricator.wikimedia.org/T171289) [14:03:19] (03CR) 10Elukey: Add 
role::statistics::private to stat1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [14:03:28] (03PS4) 10Elukey: Add role::statistics::private to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) [14:04:20] (03CR) 10Andrew Bogott: [C: 032] mwyaml_backend.rb: use safe_load when loading hiera config from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/469431 (https://phabricator.wikimedia.org/T171289) (owner: 10Andrew Bogott) [14:04:20] RECOVERY - DPKG on cp1085 is OK: All packages OK [14:04:20] RECOVERY - DPKG on cp1087 is OK: All packages OK [14:04:28] (03PS5) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [14:04:50] (03CR) 10Elukey: [C: 032] Add role::statistics::private to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [14:04:57] (03PS5) 10Elukey: Add role::statistics::private to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/469422 (https://phabricator.wikimedia.org/T205846) [14:05:22] (03CR) 10Vgutierrez: "Thx for the review volans :)" (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [14:06:28] RECOVERY - DPKG on cp1083 is OK: All packages OK [14:07:08] RECOVERY - DPKG on cp1075 is OK: All packages OK [14:07:58] RECOVERY - DPKG on cp1089 is OK: All packages OK [14:08:29] morning James_F [14:09:08] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [14:09:14] Heya. 
[14:09:48] RECOVERY - DPKG on cp1077 is OK: All packages OK [14:11:26] James_F: ready to try this thing? [14:11:28] jouncebot: now [14:11:28] For the next 0 hour(s) and 48 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1300) [14:12:37] Not yet. Need to get to the office. 🙂 [14:13:15] i guess i could just do it actually [14:14:08] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:17:09] addshore: If you’re happy to, please go ahead. [14:17:27] I’ll be in the office in ~40 minutes’ time at this rate. [14:21:50] okay [14:22:30] (03PS4) 10Addshore: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 [14:23:27] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13189/ this is a noop in production, and works well in beta." [puppet] - 10https://gerrit.wikimedia.org/r/469432 (owner: 10Giuseppe Lavagetto) [14:23:41] !log scheduled icinga downtime and disabling puppet on logstash hosts. 
deploying role::kafka::logging to logstash elasticserach data hosts [14:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] (03PS1) 10Andrew Bogott: puppet httpyaml: use safe_load when loading hiera [puppet] - 10https://gerrit.wikimedia.org/r/469435 (https://phabricator.wikimedia.org/T171289) [14:24:59] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::webserver: reorganize default settings [puppet] - 10https://gerrit.wikimedia.org/r/469425 (owner: 10Giuseppe Lavagetto) [14:25:11] (03PS3) 10Giuseppe Lavagetto: mediawiki::webserver: reorganize default settings [puppet] - 10https://gerrit.wikimedia.org/r/469425 [14:25:24] (03PS9) 10Herron: logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [14:26:09] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: only include the hhvm catchall if set_handler is not set [puppet] - 10https://gerrit.wikimedia.org/r/469432 (owner: 10Giuseppe Lavagetto) [14:26:13] jouncebot: next [14:26:13] In 1 hour(s) and 33 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1600) [14:26:19] (03PS2) 10Andrew Bogott: puppet httpyaml: use safe_load when loading hiera [puppet] - 10https://gerrit.wikimedia.org/r/469435 (https://phabricator.wikimedia.org/T171289) [14:26:25] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: only include the hhvm catchall if set_handler is not set [puppet] - 10https://gerrit.wikimedia.org/r/469432 [14:26:27] (03CR) 10Herron: [C: 032] "> > What works (according to pcc) is an include below the role() in" [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:26:29] James_F: I'm poised but think I'll wait for you to get to the office, an extra pair of eyes never hurt anyone [14:26:30] <_joe_> andrewbogott: can I merge my two changes first? 
[14:26:40] <_joe_> they're related and I'd like to do them in one go [14:26:45] (03PS10) 10Herron: logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [14:26:47] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10WMDE-leszek) As an engineering manager at WMDE, I confirm that the person behind the user name "aude" is not doing any work for WMDE any more since quite a while. Therefore, it does not see... [14:26:49] <_joe_> herron: too [14:26:51] _joe_: yep, I'll back off from the rebase button :) [14:26:54] addshore: No worries. Go get brekkie? :-) [14:27:03] <_joe_> andrewbogott: heh rebase wars :P [14:27:08] meh, i'll get brekkie after [14:27:56] _joe_: sure, all yours [14:29:11] <_joe_> andrewbogott, herron go :) [14:29:19] <_joe_> now fight between you two :D [14:29:32] lol, after you andrewbogott [14:29:33] herron: you first [14:29:35] haha [14:29:54] I'm cooking breakfast so you should definitely go first [14:30:09] ok, going [14:30:33] (03PS11) 10Herron: logstash: apply role::kafka::logging to logstash es data nodes [puppet] - 10https://gerrit.wikimedia.org/r/469246 (https://phabricator.wikimedia.org/T206454) [14:32:00] (03CR) 10Giuseppe Lavagetto: [C: 032] "Already applied in beta" [puppet] - 10https://gerrit.wikimedia.org/r/469203 (owner: 10Giuseppe Lavagetto) [14:32:24] (03PS1) 10Sbisson: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 [14:34:15] (03PS1) 10Sbisson: Enable PageTriage/Copyvio in testwiki and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469438 [14:34:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:36:42] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.107 second response time [14:39:52] PROBLEM - pdfrender on scb1003 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:09] (03PS1) 10Ema: wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 [14:40:54] (03CR) 10jerkins-bot: [V: 04-1] wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 (owner: 10Ema) [14:44:02] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.907 second response time [14:47:01] (03PS2) 10Ema: wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 [14:49:32] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:38] (03PS3) 10Andrew Bogott: puppet httpyaml: use safe_load when loading hiera [puppet] - 10https://gerrit.wikimedia.org/r/469435 (https://phabricator.wikimedia.org/T171289) [14:51:09] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Gehel) >>! In T207817#4691569, @mmodell wrote: > Do we have a patch or should we roll back group0? We... [14:52:01] (03CR) 10Andrew Bogott: [C: 032] puppet httpyaml: use safe_load when loading hiera [puppet] - 10https://gerrit.wikimedia.org/r/469435 (https://phabricator.wikimedia.org/T171289) (owner: 10Andrew Bogott) [14:52:16] (03PS3) 10Filippo Giunchedi: puppetmaster: add back post-receive hook for self-hosted master [puppet] - 10https://gerrit.wikimedia.org/r/469398 [14:52:57] addshore: Ready? [14:53:20] Yes! 
[14:53:26] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10dcausse) [14:53:27] shall I click the buttons or you?> [14:54:14] 10Puppet, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10dcausse) a:03dcausse [14:54:15] (03PS1) 10Alex Monk: horizon: Change phlogiston's enabled region to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469441 (https://phabricator.wikimedia.org/T204551) [14:54:48] James_F: ^^ [14:55:07] addshore: Go for it. [14:55:44] ack! [14:55:52] (03CR) 10Addshore: [C: 032] Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [14:57:13] (03Merged) 10jenkins-bot: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [14:59:14] okay, it is on mwdebug 1002 [15:00:07] * addshore does some testing.... [15:00:43] Original exception: [W9CI7wpAAC4AAE0ZxAkAAAAN] 2018-10-24 15:00:30: Fatal exception of type "Wikimedia\Assert\ParameterTypeException" [15:00:50] oooof [15:00:57] Hmm. [15:01:17] James_F: well, that needs fixing, lets revert and investigate I guess (that was on the enwiki main page on mwdebug).... [15:01:38] Bad value for parameter $wikiId: must be a string|boolean [15:01:49] Is that us? 
[15:01:56] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 929 bytes in 0.100 second response time [15:02:05] yes [15:02:08] (03PS1) 10Addshore: Revert "Wikibase.php, don't load wikidata repo settings on other repos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469443 [15:02:10] (03CR) 10Addshore: [C: 032] Revert "Wikibase.php, don't load wikidata repo settings on other repos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469443 (owner: 10Addshore) [15:02:10] reverting :) [15:02:19] But… how? [15:02:48] i'll look at it once I'm at the venue [15:02:55] * James_F nods. [15:03:14] i always forget, does real traffic hit the mwdebug servers? [15:03:18] No. [15:03:31] And they don't show up in fatalmonitor either. [15:03:35] ack :) [15:04:11] (03Merged) 10jenkins-bot: Revert "Wikibase.php, don't load wikidata repo settings on other repos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469443 (owner: 10Addshore) [15:05:25] (03PS1) 10Addshore: Revert "Revert "Wikibase.php, don't load wikidata repo settings on other repos"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 [15:05:32] (03CR) 10Addshore: [C: 04-2] "needs fixing first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [15:06:06] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 73718 bytes in 1.442 second response time [15:06:15] oh, it was actually the Berlin article [15:06:26] and lovely, mwdebug1002 is alive again [15:07:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Faidon asked for a diagram to help understand the data flow. Here we go! 
{F26768261} [15:07:59] (03PS2) 10Andrew Bogott: horizon: Change phlogiston's enabled region to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469441 (https://phabricator.wikimedia.org/T204551) (owner: 10Alex Monk) [15:08:01] James_F: although https://en.wikipedia.org/wiki/User:Addshore still wont load for me on mwdebug1002? [15:08:10] I get Error: 500, Internal Server Error [15:08:49] (03CR) 10Andrew Bogott: [C: 032] horizon: Change phlogiston's enabled region to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/469441 (https://phabricator.wikimedia.org/T204551) (owner: 10Alex Monk) [15:09:21] because the requests are taking too long.... [15:09:28] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) Kafka service is now running on the logstash elasticsearch data hosts, and related icinga service checks are gr... [15:09:29] (03CR) 10Jforrester: Revert "Revert "Wikibase.php, don't load wikidata repo settings on other repos"" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [15:09:52] addshore: Loads for me. [15:10:09] (03PS1) 10Bstorm: sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) [15:10:11] James_F: loading now, looks like the server may have just been slightly overloaded, my requests were taking over 60 seconds [15:10:16] * addshore goes to read your comments [15:10:32] One of them is relevant, two are irreverent. 
;-) [15:10:37] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.712 second response time [15:10:52] (03CR) 10Addshore: [C: 04-2] Revert "Revert "Wikibase.php, don't load wikidata repo settings on other repos"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [15:10:54] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:11:02] right, breakfast [15:11:11] (03CR) 10jenkins-bot: Wikibase.php, don't load wikidata repo settings on other repos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469209 (owner: 10Addshore) [15:11:13] (03CR) 10jenkins-bot: Revert "Wikibase.php, don't load wikidata repo settings on other repos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469443 (owner: 10Addshore) [15:13:57] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:32] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) [15:16:03] (03PS1) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) [15:16:49] 10Operations, 10SRE-Access-Requests: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (10sbassett) [15:20:16] (03CR) 10Jforrester: Revert "Revert "Wikibase.php, don't load wikidata repo settings on other repos"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [15:22:28] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - 
https://phabricator.wikimedia.org/T206261 (10herron) Looks good to me! Thanks much @aborrero! [15:24:48] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10aborrero) [15:25:00] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) 05Open>03Resolved I didn't detect any issue so far. Closing task now :-) [15:28:07] (03PS1) 10Gehel: wdqs: increase restart interval of wdqs-updater [puppet] - 10https://gerrit.wikimedia.org/r/469447 (https://phabricator.wikimedia.org/T207843) [15:29:27] (03CR) 10Alex Monk: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [15:30:42] (03CR) 10Alex Monk: "I'm wondering if we should report upstream too." (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [15:31:30] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10Andrew) Does this mean that we no longer need the IP aliaser in eqiad1-r? [15:31:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Ottomata) Ping @Pchelolo. This is happening because of https://gerrit.wikimedia.org/r/#/c/mediawiki/ex... 
[15:31:56] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.495 second response time [15:33:55] (03PS3) 10Gehel: wdqs: rate limit log sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/468979 (https://phabricator.wikimedia.org/T207656) [15:35:27] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:15] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Pchelolo) We could rollback https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/468482/... [15:39:01] (03CR) 10Volans: [C: 031] "LGTM, but take in mind that I'm not too familiar with the rest of the code ;)" (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [15:39:56] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) @Pchelolo I can look into how to make Jackson parse it, it's probably possible, but I'd appr... [15:41:00] (03PS2) 10Bmansurov: Stop collecting data CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465418 (https://phabricator.wikimedia.org/T191086) [15:42:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Ottomata) Maybe jackson just can't parse this microsecond stuff? Maybe milliseconds are fine? 
[15:43:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Gehel) >>! In T207817#4691885, @Ottomata wrote: > Interesting! I checked Jodatime stuff to make sure o... [15:43:49] (03CR) 10Alex Monk: certcentral: Track number of retries and apply exponential backoff (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [15:44:39] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Pchelolo) > Maybe jackson just can't parse this microsecond stuff? Maybe milliseconds are fine? Judgin... [15:45:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10mmodell) ok anyone care to give the ol' +2 to revert this on the branch? https://gerrit.wikimedia.org/... 
[15:45:47] RECOVERY - High lag on wdqs1006 is OK: (C)3600 ge (W)1200 ge 1151 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:46:22] (03PS2) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) [15:46:47] (03CR) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [15:52:27] (03CR) 10Alex Monk: [C: 032] certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [15:52:51] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) [15:53:11] (03PS2) 10Bstorm: sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) [15:53:46] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1119 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:53:52] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:54:37] !log deploying https://gerrit.wikimedia.org/r/469451 [15:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) p:05Unbreak!>03High > Maybe jackson 
just can't parse this microsecond stuff? Maybe milliseconds... [15:57:47] (03CR) 10Brian Wolff: [C: 031] "Generally I think this looks like a good direction." [puppet] - 10https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243) (owner: 10Paladox) [15:57:58] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10crusnov) 05Open>03Resolved Verified access to pwstore. Looks like this ticket is complete! [15:59:50] !log disable BGP sessions to transit/peering on cr1-eqord - T204170 [15:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:53] T204170: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:33] !log 15:59:06 Synchronized php-1.33.0-wmf.1/extensions/EventBus/: revert "Set event datetime with microsecond resolution." 
on 1.33.0-wmf.1 refs T207817 (duration: 00m 56s) [16:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:39] T207817: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 [16:04:02] !log power-off cr1-eqord - T204170 [16:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:17] RECOVERY - High lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1128 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:06:21] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Pchelolo) Judging by the quick skim of the Jackson code it supports [[ https://docs.oracle.com/javase/8/docs/ap... [16:06:57] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:07:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:07:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:07] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:41] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - 
https://phabricator.wikimedia.org/T206261 (10Andrew) Created T207859 [16:09:59] (03CR) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:13:32] (03PS2) 10Addshore: Wikibase.php, don't load wikidata repo settings on other repos (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 [16:13:35] James_F: so yes, the namespace is needed, see ^^ [16:13:48] i think we can try that one [16:13:50] jouncebot: now [16:13:50] For the next 0 hour(s) and 46 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1600) [16:14:52] OK. [16:15:26] (03CR) 10Jforrester: Wikibase.php, don't load wikidata repo settings on other repos (take 2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [16:15:41] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Smalyshev) My ideal situation for Updater logs is as follows: * Updater which is run from service... [16:16:46] 10Operations, 10Cloud-Services, 10Datasets-General-or-Unknown, 10User-ArielGlenn, 10cloud-services-team (Kanban): Adjust bandwidth/connection limits, memory settings on labstore1006,7 as appropriate - https://phabricator.wikimedia.org/T191491 (10Bstorm) a:03Bstorm [16:17:52] (03CR) 10Addshore: [C: 04-2] Wikibase.php, don't load wikidata repo settings on other repos (take 2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [16:18:03] James_F: I guess we can try in this swat window? [16:18:16] addshore: Or now? 
[16:18:24] thats what I meant, it is swta now :P [16:18:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:27] swat... [16:18:31] Oh, hah. Yes. [16:18:36] want to give it a +1? :P [16:18:54] (03CR) 10Jforrester: [C: 032] Wikibase.php, don't load wikidata repo settings on other repos (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [16:19:02] :P [16:19:24] Oh, sorry, +1. [16:19:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:19:27] You'll deploy? [16:19:28] thats fine :P [16:19:31] i can deploy [16:19:38] Kk. [16:20:06] (03Merged) 10jenkins-bot: Wikibase.php, don't load wikidata repo settings on other repos (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [16:20:27] 10Operations, 10SRE-Access-Requests: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (10Reedy) Would need/want John to +1 this request [16:20:31] it is on mwdebug 1002 [16:20:36] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:21:33] looking pretty good to me [16:22:09] is there a way to extend the ammount of time scap leaves the change on canary servers? 
[16:22:43] greg is sat next to me and says no [16:22:46] (03CR) 10jenkins-bot: Wikibase.php, don't load wikidata repo settings on other repos (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469444 (owner: 10Addshore) [16:23:26] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1172 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:24:36] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 66, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:24:40] i might write a ticket for that.... [16:25:48] * addshore is going to stop scap once it puts the change on the canary servers so that I can just watch it there for a slightly longer period [16:26:18] thcipriani: ^^ thoughts? [16:26:20] XioNoX: I am running a rsync from stat1005 to stat1007 with several terabytes of data, lemme know if it is a problem or not [16:26:21] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Gehel) My current patch is trying to put all that logic into `logback.xml`, but it is definitely s... [16:26:36] elukey: ok! 
[16:27:34] (03PS3) 10Thcipriani: Don't drop the colon between hash type/digest [software/keyholder] - 10https://gerrit.wikimedia.org/r/458229 (owner: 10Faidon Liambotis) [16:27:42] (03CR) 10Thcipriani: [C: 032] Don't drop the colon between hash type/digest [software/keyholder] - 10https://gerrit.wikimedia.org/r/458229 (owner: 10Faidon Liambotis) [16:27:57] (03PS3) 10Ema: wmf-upgrade-and-reboot: non-interactive Debian frontend [puppet] - 10https://gerrit.wikimedia.org/r/469439 [16:28:29] (03Merged) 10jenkins-bot: Don't drop the colon between hash type/digest [software/keyholder] - 10https://gerrit.wikimedia.org/r/458229 (owner: 10Faidon Liambotis) [16:29:10] addshore: You mean, manually early-terminate the scap process? [16:29:19] addshore: Sounds… messy. [16:29:34] (Ticket for future behaviour sounds good though.) [16:29:38] yup [16:29:42] well, lets just go for it [16:29:43] #tm [16:29:46] ™ [16:30:08] (03PS6) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [16:30:10] (03PS3) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) [16:30:21] syncing [16:30:31] * James_F crosses fingers. [16:30:39] except i wrote the message wong [16:30:41] * addshore tries again [16:30:46] Ha. 
[16:30:47] (03CR) 10Alex Monk: certcentral: Track number of retries and apply exponential backoff (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:30:58] syncing [16:31:06] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.340 second response time [16:31:27] on the canaries [16:31:50] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:469444]] Wikibase.php, dont load wikidata repo settings on other repos (take 2) (duration: 00m 54s) [16:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:57] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1143 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:32:16] looking good James_F [16:32:22] Agreed. [16:32:36] Time to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/469059 ? [16:32:39] yu [16:32:46] (03PS2) 10Jforrester: Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) [16:32:51] (03CR) 10Jforrester: [C: 032] Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [16:33:20] (03PS3) 10Bstorm: sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) [16:33:44] (03PS7) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) [16:33:46] (03PS4) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 
(https://phabricator.wikimedia.org/T207737) [16:34:17] (03CR) 10Alex Monk: [C: 032] certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:34:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:34:21] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [16:34:24] (03Merged) 10jenkins-bot: Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [16:34:36] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:50] James_F: lovely [16:34:52] * addshore watches beta [16:35:21] addshore: I'll sync to avoid dirty prod config. 
[16:35:29] ack [16:36:38] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:36:56] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: [Beta Cluster] Re-disable WBMI on Beta Commons for now T180981 (duration: 00m 54s) [16:36:56] (03PS5) 10Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) [16:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:01] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [16:37:33] James_F: i created https://phabricator.wikimedia.org/T207864 [16:37:34] * James_F stalks https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/ [16:37:48] Cool. [16:38:17] James_F: can you ping me once it is on beta so i can have a look around? [16:38:18] (03CR) 10jenkins-bot: certcentral: Track number of retries and apply exponential backoff [software/certcentral] - 10https://gerrit.wikimedia.org/r/469407 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:38:22] addshore: Want to comment on https://phabricator.wikimedia.org/T207767 ? I can write a stack of patches if you're OK with the concept. [16:38:27] (03CR) 10jenkins-bot: Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [16:38:47] James_F: won't have time to look today I doubt it [16:38:51] but will aim to look soonish [16:38:55] No worries. 
[16:39:41] (03CR) 10jenkins-bot: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] - 10https://gerrit.wikimedia.org/r/469446 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:40:11] (03PS1) 10Vgutierrez: Release 0.3 including the following changes: [software/certcentral] - 10https://gerrit.wikimedia.org/r/469459 (https://phabricator.wikimedia.org/T207737) [16:40:48] (03PS2) 10Vgutierrez: Release 0.3 [software/certcentral] - 10https://gerrit.wikimedia.org/r/469459 (https://phabricator.wikimedia.org/T207737) [16:42:42] (03CR) 10Alex Monk: [C: 032] Release 0.3 [software/certcentral] - 10https://gerrit.wikimedia.org/r/469459 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:44:31] (03Merged) 10jenkins-bot: Release 0.3 [software/certcentral] - 10https://gerrit.wikimedia.org/r/469459 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:46:22] (03CR) 10jenkins-bot: Release 0.3 [software/certcentral] - 10https://gerrit.wikimedia.org/r/469459 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:47:37] (03PS1) 10Vgutierrez: certcentral: Track number of retries and apply exponential backoff [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469460 (https://phabricator.wikimedia.org/T207478) [16:47:39] (03PS1) 10Vgutierrez: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469461 (https://phabricator.wikimedia.org/T207737) [16:47:41] (03PS1) 10Vgutierrez: Release 0.3 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469462 (https://phabricator.wikimedia.org/T207737) [16:48:37] (03PS4) 10Bstorm: sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) [16:49:00] (03CR) 10Alex Monk: [C: 032] certcentral: Track number of retries and apply exponential backoff [software/certcentral] 
(debian) - 10https://gerrit.wikimedia.org/r/469460 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:49:14] (03CR) 10Alex Monk: [C: 032] certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469461 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:49:19] (03CR) 10Alex Monk: [C: 032] Release 0.3 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469462 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:51:54] (03CR) 10Bstorm: [C: 032] sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [16:52:03] (03PS5) 10Bstorm: sonofgridengine: add compute node roles [puppet] - 10https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557) [16:52:40] (03CR) 10jenkins-bot: certcentral: Track number of retries and apply exponential backoff [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469460 (https://phabricator.wikimedia.org/T207478) (owner: 10Vgutierrez) [16:52:44] (03CR) 10jenkins-bot: certcentral: Avoid getting stuck on CHALLENGES_PUSHED status [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469461 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:53:37] (03PS1) 10Vgutierrez: debian: Add release 0.3 to changelog [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469464 (https://phabricator.wikimedia.org/T207737) [16:53:55] addshore: Finally live on Beta Cluster: https://commons.wikimedia.beta.wmflabs.org/wiki/Special:Version [16:54:10] James_F: okay! [16:54:15] (03CR) 10jenkins-bot: Release 0.3 [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/469462 (https://phabricator.wikimedia.org/T207737) (owner: 10Vgutierrez) [16:54:46] James_F: hmmmmmmmm [16:54:51] Hmm? 
[16:54:54] James_F: https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewItem still exists ....
[16:55:12] Yes, but it will fail to work?
[16:56:02] Yup, `"Wikibase item" content is not allowed on page Q4`.
[16:56:33] that page shouldnt appear in the first place :/
[16:56:47] public function isListed() {
[16:56:47] return $this->entityNamespaceLookup->getEntityNamespace( $this->getEntityType() ) !== null;
[16:56:47] }
[16:56:53] But "an" entity type is registered.
[16:56:56] oh wait, is listed vs is working
[16:57:00] Specifically, MediaInfo.
[16:57:04] * addshore looks to see if it is listed
[16:57:35] https://commons.wikimedia.beta.wmflabs.org/wiki/Special:ItemByTitle is listed and is pretty pointless.
[16:57:38] (CR) Vgutierrez: [C: 2] debian: Add release 0.3 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/469464 (https://phabricator.wikimedia.org/T207737) (owner: Vgutierrez)
[16:57:47] (PS2) Faidon Liambotis: Fix mgmt incongruences [dns] - https://gerrit.wikimedia.org/r/467706 (owner: Volans)
[16:58:01] Wow, https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewLexeme is available. But that extension isn't even loaded? Is there a mess of inter-dependencies?
[16:58:05] (CR) Faidon Liambotis: [C: 2] Fix mgmt incongruences [dns] - https://gerrit.wikimedia.org/r/467706 (owner: Volans)
[16:58:13] James_F: yes, as lexeme is loaded on the clients
[16:58:19] and yes, the entity types lexeme, form, sense seem to be enabled still
[16:58:24] Eurgh. Laaaaame.
[16:58:36] thanks paravoid! :)
[16:58:38] Does every Wikibase extension need splitting into client and repo modes?
[16:58:39] can we turn it off on beta again until we think about this a little more? :/
[16:58:46] * James_F sighs.
[16:58:47] Fine.
[16:58:51] * addshore sighs too
[16:59:12] (PS1) Jforrester: Revert "Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469466
[16:59:18] (CR) Jforrester: [C: 2] Revert "Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469466 (owner: Jforrester)
[16:59:27] its okay James_F , we are getting closer....
[16:59:42] Closer but no cigars.
[17:00:37] (Merged) jenkins-bot: Revert "Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now"" [mediawiki-config] - https://gerrit.wikimedia.org/r/469466 (owner: Jforrester)
[17:01:20] (PS2) Faidon Liambotis: Remove obsolete mgmt records [dns] - https://gerrit.wikimedia.org/r/467707 (owner: Volans)
[17:01:27] (CR) jenkins-bot: debian: Add release 0.3 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/469464 (https://phabricator.wikimedia.org/T207737) (owner: Vgutierrez)
[17:01:47] (PS1) Andrew Bogott: labsaliaser: don't have Python generate Lua code directly [puppet] - https://gerrit.wikimedia.org/r/469467 (https://phabricator.wikimedia.org/T207534)
[17:01:51] Operations, LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (jijiki) >>! In T207793#4691712, @WMDE-leszek wrote: > As an engineering manager at WMDE, I confirm that the person behind the user name "aude" is not doing any work for WMDE any more since...
[17:02:59] (03CR) 10Faidon Liambotis: [C: 032] Remove obsolete mgmt records [dns] - 10https://gerrit.wikimedia.org/r/467707 (owner: 10Volans) [17:03:07] !log jforrester@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [17:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:20] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10aborrero) >>! In T207536#4689648, @faidon wrote: > > Does that make sense/help? > Yes, thanks :-) the extra contex... [17:03:42] (03PS2) 10Andrew Bogott: labsaliaser: don't have Python generate Lua code directly [puppet] - 10https://gerrit.wikimedia.org/r/469467 (https://phabricator.wikimedia.org/T207534) [17:04:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: [Beta Cluster] Re-disable WBMI on Beta Commons for now T180981 (duration: 00m 54s) [17:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:20] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [17:07:01] (03PS3) 10Andrew Bogott: labsaliaser: don't have Python generate Lua code directly [puppet] - 10https://gerrit.wikimedia.org/r/469467 (https://phabricator.wikimedia.org/T207534) [17:07:21] (03PS3) 10Thcipriani: Only show tracebacks on DEBUG logging levels [software/keyholder] - 10https://gerrit.wikimedia.org/r/458230 (owner: 10Faidon Liambotis) [17:07:46] (03CR) 1020after4: [C: 032] "lgtm" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458230 (owner: 10Faidon Liambotis) [17:07:48] (03PS4) 10Andrew Bogott: labsaliaser: don't have Python generate Lua code directly [puppet] - 10https://gerrit.wikimedia.org/r/469467 
(https://phabricator.wikimedia.org/T207534) [17:08:21] (03Merged) 10jenkins-bot: Only show tracebacks on DEBUG logging levels [software/keyholder] - 10https://gerrit.wikimedia.org/r/458230 (owner: 10Faidon Liambotis) [17:09:16] (03PS2) 10Dmaza: Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 [17:09:46] (03CR) 10Andrew Bogott: [C: 032] labsaliaser: don't have Python generate Lua code directly [puppet] - 10https://gerrit.wikimedia.org/r/469467 (https://phabricator.wikimedia.org/T207534) (owner: 10Andrew Bogott) [17:10:00] andrewbogott: thanks for that! [17:10:09] we'll see if it works :) [17:10:29] (03PS3) 10Thcipriani: Respond with SSH_AGENT_FAILURE on protocol errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/458231 (owner: 10Faidon Liambotis) [17:10:53] heh [17:11:05] saw your task about possibly getting rid of it for eqiad-r1, so yay :) [17:12:31] !log rebooting cloudvirt1019 [17:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:17] PROBLEM - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:14:34] (we know, it's labtest, work in progress) [17:14:42] (03CR) 10Catrope: [C: 032] Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 (owner: 10Dmaza) [17:14:57] PROBLEM - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:15:09] that one is also labtest [17:15:37] PROBLEM - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:56] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 58.82 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:16:57] (03Merged) 10jenkins-bot: Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 (owner: 10Dmaza) [17:17:19] PROBLEM - 
Recursive DNS on 208.80.154.143 is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:17:51] uh? mmmh seems labs-recursor2.wikimedia.org [17:17:56] (03CR) 10jenkins-bot: Revert "Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469466 (owner: 10Jforrester) [17:17:58] (03CR) 10jenkins-bot: Enable partial blocks on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469332 (owner: 10Dmaza) [17:18:06] paravoid: should be unrelated to the patches just merged, but just double checking in case I'm needed [17:18:16] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 72.89 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:18:27] RECOVERY - Recursive DNS on 208.80.154.143 is OK: DNS OK: 0.066 seconds response time. www.wikipedia.org returns 208.80.154.224 [17:18:49] volans: lab*-recursor* is the andrewbogott/Krenair change above, they seem to be aware of it! [17:19:02] ack [17:19:22] yeah [17:19:28] don't worry about it [17:20:27] RECOVERY - Device not healthy -SMART- on cloudvirt1019 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [17:20:58] !log repooling all elasticsearch servers in eqiad [17:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:08] RECOVERY - Host cloudvirt1019 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:24:44] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) The latest update, HPE sent me for new ssds, I replaced the SSDs and they disks are showing up as bad in the raid cfg. Maybe the backplane change is required. 
[17:25:27] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[17:25:56] ^ known
[17:26:46] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[17:26:49] (PS6) Bstorm: sonofgridengine: add compute node roles [puppet] - https://gerrit.wikimedia.org/r/469445 (https://phabricator.wikimedia.org/T200557)
[17:27:37] sorry about the alerts, I don't know why the lag for applying this is so big
[17:27:47] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[17:29:37] PROBLEM - Recursive DNS on 208.80.154.24 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[17:30:40] Operations, Discovery-Search (Current work): Investigate reducing number of servers in the elasticsearch cluster - https://phabricator.wikimedia.org/T207724 (Gehel) Actually, already some pool counter errors with 29 nodes on eqiad. All nodes are repooled. This give us a base line where I would be mostly...
[17:30:56] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.007 seconds response time. www.wikipedia.org returns 208.80.154.224
[17:34:06] RECOVERY - Recursive DNS on 208.80.154.24 is OK: DNS OK: 0.010 seconds response time.
www.wikipedia.org returns 208.80.154.224 [17:36:06] (03PS1) 10Kosta Harlan: Fix configuration variable name for wgWMEUnderstandingFirstDay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469471 (https://phabricator.wikimedia.org/T205759) [17:39:17] (03PS2) 10Sbisson: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 [17:40:03] (03PS2) 10Sbisson: Enable PageTriage/Copyvio in testwiki and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469438 [17:40:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:48:09] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8 - OK: 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207868 [17:48:12] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T207868 (10ops-monitoring-bot) [17:48:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:50:36] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call andrew bogott Im working on this but these arent user-facing [17:50:36] ACKNOWLEDGEMENT - Recursive DNS on 208.80.153.78 is CRITICAL: CRITICAL - Plugin timed out while executing system call andrew bogott Im working on this but these arent user-facing [17:53:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: WDQS Updater ran into issue and stopped working - 
https://phabricator.wikimedia.org/T207817 (10Smalyshev) @Pchelolo I think a public notification on Kafka format change would be a great idea. Since various... [17:55:55] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10jijiki) 05Open>03Resolved [17:57:50] 10Operations, 10LDAP-Access-Requests: Remove "aude" from "wmde" LDAP group - https://phabricator.wikimedia.org/T207793 (10jijiki) a:03jijiki [17:57:55] 10Operations, 10LDAP-Access-Requests: Remove "jk" from "wmde" ldap group - https://phabricator.wikimedia.org/T207792 (10jijiki) a:03jijiki [17:58:56] 10Operations, 10LDAP-Access-Requests, 10Core Platform Team Kanban (Blocked Externally): Remove "daniel" from "wmde" LDAP group and add him to "wmf" - https://phabricator.wikimedia.org/T207788 (10jijiki) a:03jijiki [17:59:24] (03PS1) 10Ayounsi: Puppet: rename cr1-eqord to cr2-eqord [puppet] - 10https://gerrit.wikimedia.org/r/469474 (https://phabricator.wikimedia.org/T204170) [18:00:27] PROBLEM - Juniper alarms on cr1-eqord is CRITICAL: JNX_ALARMS CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:00:46] PROBLEM - Device not healthy -SMART- on cloudvirt1019 is CRITICAL: cluster=misc device={cciss,6,cciss,7,cciss,8,cciss,9} instance=cloudvirt1019:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [18:03:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:05:29] (03PS1) 10Ayounsi: DNS: rename cr1-eqord to cr2-eqord [dns] - 10https://gerrit.wikimedia.org/r/469476 (https://phabricator.wikimedia.org/T204170) [18:06:39] (03CR) 
10Smalyshev: wdqs: increase restart interval of wdqs-updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469447 (https://phabricator.wikimedia.org/T207843) (owner: 10Gehel) [18:08:10] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13190/" [puppet] - 10https://gerrit.wikimedia.org/r/469474 (https://phabricator.wikimedia.org/T204170) (owner: 10Ayounsi) [18:09:25] (03CR) 10Ayounsi: [C: 032] DNS: rename cr1-eqord to cr2-eqord [dns] - 10https://gerrit.wikimedia.org/r/469476 (https://phabricator.wikimedia.org/T204170) (owner: 10Ayounsi) [18:10:11] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Elitre) (I assumed work has started since the "expiration date" is so close. Hope that's not a problem.) [18:10:56] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:12:17] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.208 second response time [18:12:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) [18:15:47] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:56] (03PS3) 10Joal: Add configuration for java-logging in hive conf [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 [18:16:04] (03CR) 10Joal: Add configuration for java-logging in hive conf (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 (owner: 10Joal) [18:16:44] !log enable BGP sessions to transit/peering on cr2-eqord - T204170 [18:16:44] (03CR) 10Smalyshev: [C: 031] wdqs: rate limit log sent to logstash 
[puppet] - 10https://gerrit.wikimedia.org/r/468979 (https://phabricator.wikimedia.org/T207656) (owner: 10Gehel) [18:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:47] T204170: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 [18:21:30] (03CR) 10Catrope: [C: 032] "Ugh, thanks for catching this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469471 (https://phabricator.wikimedia.org/T205759) (owner: 10Kosta Harlan) [18:22:42] (03PS3) 10Catrope: Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 (owner: 10Sbisson) [18:22:46] (03Merged) 10jenkins-bot: Fix configuration variable name for wgWMEUnderstandingFirstDay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469471 (https://phabricator.wikimedia.org/T205759) (owner: 10Kosta Harlan) [18:22:57] (03CR) 10Catrope: [C: 032] Enable PageTriage/Copyvio on enwiki betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469436 (owner: 10Sbisson) [18:25:44] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.179 second response time [18:26:04] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:28:55] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:00] (03CR) 10Smalyshev: wdqs: increase restart interval of wdqs-updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469447 (https://phabricator.wikimedia.org/T207843) (owner: 10Gehel) [18:31:39] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) [18:33:04] (03CR) 10jenkins-bot: Fix configuration variable name for wgWMEUnderstandingFirstDay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469471 
(https://phabricator.wikimedia.org/T205759) (owner: 10Kosta Harlan) [18:33:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 58 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:35:35] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:22] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) a:05Cmjohnson>03Papaul [19:00:04] twentyafterfour: Your horoscope predicts another unfortunate MediaWiki train - Americas version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T1900). [19:02:45] Hallo. [19:02:54] aharoni: hello [19:03:27] I'm a bit (just a bit) confused about the status of the train this week. Catalan and Hebrew Wikipedias are supposed to get the latest version of MediaWiki extensions today, are they? [19:04:28] aharoni: correct, afaik [19:05:08] twentyafterfour: and this hasn't happened yet, right? happening in the next couple of hours? [19:05:19] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469485 [19:05:21] (03CR) 1020after4: [C: 032] group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469485 (owner: 1020after4) [19:05:31] aharoni: happening right now [19:05:46] I'm about to deploy the change as soon as ^ that patch merges [19:07:07] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469485 (owner: 1020after4) [19:13:12] (03CR) 10Ottomata: [C: 032] "Looks good! 
https://puppet-compiler.wmflabs.org/compiler1002/13191/stat1005.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469256 (owner: 10Joal) [19:14:27] twentyafterfour: thanks [19:15:07] (03PS1) 10Ottomata: Update cdh module with hive parquet logging change [puppet] - 10https://gerrit.wikimedia.org/r/469486 [19:15:30] (03CR) 10jerkins-bot: [V: 04-1] Update cdh module with hive parquet logging change [puppet] - 10https://gerrit.wikimedia.org/r/469486 (owner: 10Ottomata) [19:16:12] (03CR) 10Cwhite: [C: 031] icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [19:16:48] (03CR) 10jerkins-bot: [V: 04-1] icinga: on stretch, tell rsyslog to discard logs from check_nrpe [puppet] - 10https://gerrit.wikimedia.org/r/469337 (https://phabricator.wikimedia.org/T207775) (owner: 10Dzahn) [19:16:59] (03PS2) 10Ottomata: Update cdh module with hive parquet logging change [puppet] - 10https://gerrit.wikimedia.org/r/469486 [19:18:11] (03CR) 10Ottomata: [C: 032] Update cdh module with hive parquet logging change [puppet] - 10https://gerrit.wikimedia.org/r/469486 (owner: 10Ottomata) [19:19:08] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.1 refs T206655 [19:19:29] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.1 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469485 (owner: 1020after4) [19:19:41] (03PS3) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) [19:20:03] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.1 refs T206655 (duration: 00m 54s) [19:20:36] (03PS4) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 
(https://phabricator.wikimedia.org/T207742) [19:20:54] (03PS5) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) [19:21:57] ugh [19:22:11] lots of Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.15) [19:22:23] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [19:22:34] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:22:39] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469487 [19:22:42] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469487 (owner: 1020after4) [19:22:53] !log rolling back group1 due to high error rate [19:24:20] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469487 (owner: 1020after4) [19:25:14] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:34] PROBLEM - Nginx local proxy to apache on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:46] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.26 refs T206655 [19:25:54] PROBLEM - Apache HTTP on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:39] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.26 refs T206655 (duration: 00m 52s) [19:27:04] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [19:27:22] (03PS6) 10Zoranzoki21: New throttle rule for Johannesburg Event on 2018-10-27 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/469261 (https://phabricator.wikimedia.org/T207742) [19:27:53] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [19:28:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:28:53] RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 4.140 second response time [19:29:03] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [19:29:33] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 71689 bytes in 0.107 second response time [19:30:38] so something in the new branch is causing lock timeouts [19:31:49] !log the errors were all coming from wmf.26 but the error rate skyrocketed after deploying 1.33.0-wmf.1 to group1 so there is some query in the new branch which is holding a lock. 
T207881 [19:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:53] T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 - https://phabricator.wikimedia.org/T207881 [19:33:51] (03PS1) 10MaxSem: Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) [19:34:51] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T206655 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469487 (owner: 1020after4) [19:38:32] !log The train is now blocked by database lock contention of unknown origin [19:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:40:44] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1182 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:41:53] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 1189 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:43:06] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban): Add Lars Wirzenius to releng LDAP groups - https://phabricator.wikimedia.org/T207833 (10hashar) Thank you @jijiki ! @LarsWirzenius you should now have access to the various tools. Most importantly to releng:... 
[19:50:53] (03CR) 10Thcipriani: [C: 032] Respond with SSH_AGENT_FAILURE on protocol errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/458231 (owner: 10Faidon Liambotis) [19:51:01] (03PS3) 10Thcipriani: Switch to using Enum for SSH protocol codes [software/keyholder] - 10https://gerrit.wikimedia.org/r/458232 (owner: 10Faidon Liambotis) [19:51:41] (03Merged) 10jenkins-bot: Respond with SSH_AGENT_FAILURE on protocol errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/458231 (owner: 10Faidon Liambotis) [19:54:59] (03PS2) 10MaxSem: Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) [19:58:28] (03PS1) 10Bstorm: sonofgridengine: several changes to the compute node definitions [puppet] - 10https://gerrit.wikimedia.org/r/469496 [19:59:11] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: several changes to the compute node definitions [puppet] - 10https://gerrit.wikimedia.org/r/469496 (owner: 10Bstorm) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T2000). 
[20:03:38] (03PS2) 10Bstorm: sonofgridengine: several changes to the compute node definitions [puppet] - 10https://gerrit.wikimedia.org/r/469496 (https://phabricator.wikimedia.org/T200557) [20:05:48] (03CR) 10Bstorm: [C: 032] sonofgridengine: several changes to the compute node definitions [puppet] - 10https://gerrit.wikimedia.org/r/469496 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:18:23] (03PS1) 10Joal: Add hive parquet-log folder creation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469499 [20:19:11] ottomata: --^ [20:20:17] (03PS1) 10Bstorm: sonofgridengine: fix the sysdir variable for the HBA scripts [puppet] - 10https://gerrit.wikimedia.org/r/469500 (https://phabricator.wikimedia.org/T200557) [20:21:16] (03PS2) 10Joal: Add hive parquet-log folder creation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469499 [20:21:24] (03CR) 10Bstorm: [C: 032] sonofgridengine: fix the sysdir variable for the HBA scripts [puppet] - 10https://gerrit.wikimedia.org/r/469500 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:24:02] (03CR) 10Ottomata: [C: 032] Add hive parquet-log folder creation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/469499 (owner: 10Joal) [20:28:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Smalyshev) [20:28:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: increase restart interval of wdqs updater - https://phabricator.wikimedia.org/T207843 (10Smalyshev) [20:29:14] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 1190 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:34:38] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudvirt1019 is CRITICAL: cluster=misc device={cciss,6,cciss,7,cciss,8,cciss,9} instance=cloudvirt1019:9100 job=node 
site=eqiad andrew bogott T207868 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [20:35:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:43:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:47:02] (03CR) 10Thcipriani: [C: 032] Switch to using Enum for SSH protocol codes (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458232 (owner: 10Faidon Liambotis) [20:47:04] (03PS3) 10Thcipriani: Switch to Construct for the SSH agent protocol [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [20:47:16] (03Merged) 10jenkins-bot: Switch to using Enum for SSH protocol codes [software/keyholder] - 10https://gerrit.wikimedia.org/r/458232 (owner: 10Faidon Liambotis) [20:55:09] (03PS1) 10Ottomata: Update cdh module with fix [puppet] - 10https://gerrit.wikimedia.org/r/469515 [20:55:26] (03CR) 10Ottomata: [V: 032 C: 032] Update cdh module with fix [puppet] - 10https://gerrit.wikimedia.org/r/469515 (owner: 10Ottomata) [21:19:03] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:23:01] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/dnsrecursor/manifests/labsaliaser.pp#63 - is that making the Cron require two File resources? [21:23:18] or is something weirder going on here? [21:25:30] (I wrote it 3 years okay? It's been a while.) 
[21:25:40] years ago* [21:26:16] !log pausing replication on dbstore2002 (T204930) [21:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:22] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [21:26:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:33:27] !log compressing tables in s1@dbstore2002 (T204930) [21:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:31] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930 [21:36:34] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:37:37] (03PS1) 10Alex Monk: exim smarthosts: Allow setting helo_data on transports [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) [21:39:32] (03PS2) 10Alex Monk: exim smarthosts: Allow setting helo_data on transports [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) [21:39:51] (03CR) 10Paladox: exim smarthosts: Allow setting helo_data on transports (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/469522 (https://phabricator.wikimedia.org/T41785) (owner: 10Alex Monk) [21:41:22] paladox I don't think that's right. [21:41:41] Krenair hmm? 
[21:41:43] <% if helo_data -%> [21:41:45] yes [21:41:52] the - has special meaning [21:41:55] there's a - on the right one [21:41:58] yeh [21:41:58] yes [21:49:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 41 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:53:57] (03PS1) 10Alex Monk: Move mail_smarthost (and wikimail_smarthost) to hiera [puppet] - 10https://gerrit.wikimedia.org/r/469524 (https://phabricator.wikimedia.org/T207887) [21:54:11] (03PS1) 10Bstorm: sonofgridengine: add the banners [puppet] - 10https://gerrit.wikimedia.org/r/469525 (https://phabricator.wikimedia.org/T200557) [21:55:13] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.173 second response time [21:55:46] (03CR) 10Bstorm: [C: 032] sonofgridengine: add the banners [puppet] - 10https://gerrit.wikimedia.org/r/469525 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:58:30] (03CR) 10Alex Monk: "pros:" [puppet] - 10https://gerrit.wikimedia.org/r/469524 (https://phabricator.wikimedia.org/T207887) (owner: 10Alex Monk) [21:58:34] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:04:27] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10faidon) Cool! Cc'ing @herron and @fgiunchedi here for awareness and their input. Logstash may or may not be happy about the extra load, depending on how much that wo... 
[22:06:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 64 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:08:29] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10ayounsi) Email sent to Equinix so they update their MAC filtering. [22:08:58] ACKNOWLEDGEMENT - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Active Ayounsi https://phabricator.wikimedia.org/T204170#4693274 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:12:31] (03PS1) 10Brian Wolff: Enable csp report only on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469527 (https://phabricator.wikimedia.org/T207900) [22:15:54] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 1197 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:18:08] I'm going to deploy some CSP related stuff [22:19:08] 10Operations, 10Security-Team, 10Wikimedia-Site-requests, 10Patch-For-Review: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Bawolff) >>! In T207900#4693255, @faidon wrote: > Cool! Cc'ing @herron and @fgiunchedi here for awareness and their input. Logstash may or may... 
[22:19:54] (03CR) 10Brian Wolff: [C: 032] Enable csp report only on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469527 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [22:21:48] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [22:22:11] (03Merged) 10jenkins-bot: Enable csp report only on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469527 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [22:23:13] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 1179 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:23:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10faidon) a:03Andrew @Andrew, promethium's hostname, IP and MAC address are still referenced in a number of places in the puppet tree, including e.g. hardcoded in Python code (proxyl... [22:24:11] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [22:27:12] (03CR) 10jenkins-bot: Enable csp report only on outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469527 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [22:31:18] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10ayounsi) [22:31:21] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) 05Open>03Resolved a:03ayounsi I imported everything that was not the servers' uplinks (for the reason mentioned above). 
[22:31:33] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.870 second response time [22:32:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:33:07] !log bawolff@deploy1001 scap failed: average error rate on 8/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [22:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:21] oh, that's not good [22:34:05] ok, its entirely "Notice: Undefined variable: wmgUseCSPReportOnly in /srv/mediawiki/wmf-config/CommonSettings.php on line 3737" [22:34:12] Because i should be doing them in the other order [22:34:54] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:19] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy csp report-only to outreachwiki T207900 (duration: 00m 54s) [22:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:23] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 [22:36:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:37:42] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Smalyshev) With T207873 deployed (planned next Monday) Updater should be able to parse such timestamps, but I'd... 
[22:38:03] 10Operations, 10MediaWiki-Page-deletion, 10MW-1.32-notes, 10MW-1.32-release, and 3 others: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (10tstarling) 05Open>03Resolved a:03tstarling This should be fixed now [22:38:05] (03PS1) 10Faidon Liambotis: Remove all references to labs_metal [puppet] - 10https://gerrit.wikimedia.org/r/469532 [22:38:13] !log bawolff@deploy1001 Synchronized wmf-config/CommonSettings.php: Deploy csp report-only to outreachwiki T207900 (duration: 00m 54s) [22:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:32] (03PS1) 10Brian Wolff: Enable CSP-report-only on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469538 (https://phabricator.wikimedia.org/T207900) [22:59:10] (03CR) 10Brian Wolff: [C: 032] Enable CSP-report-only on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469538 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181024T2300). [23:00:04] MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:50] I can swat [23:00:57] (03Merged) 10jenkins-bot: Enable CSP-report-only on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469538 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [23:01:01] (03PS2) 10MaxSem: Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) [23:01:04] (03CR) 10MaxSem: [C: 032] Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [23:01:14] twentyafterfour: Is it ok if I continue doing a security merge before swat [23:01:22] bawolff: fine with me [23:01:25] I didn't realize it was almost 23:00 [23:02:36] (03Merged) 10jenkins-bot: Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [23:02:53] bawolff: I already merged a patch :O [23:03:06] That's ok [23:03:17] I git fetch'd before it merged [23:03:22] sorry, about this [23:03:39] We can sync it quickly if it's blocking you, it's noop at this point [23:04:19] Is it ok if i finish my sync first? 
I already pulled into deploy1001 [23:04:32] Yeah, I'm not in a hurry [23:05:04] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.554 second response time [23:06:25] (03CR) 10jenkins-bot: Enable CSP-report-only on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469538 (https://phabricator.wikimedia.org/T207900) (owner: 10Brian Wolff) [23:06:27] (03CR) 10jenkins-bot: Introduce new ArticleCreationWrokflow permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462040 (https://phabricator.wikimedia.org/T204016) (owner: 10MaxSem) [23:08:33] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy csp report-only to small.dblist wikis T207900 (duration: 00m 56s) [23:08:34] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:37] T207900: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 [23:09:46] MaxSem: I'm done [23:11:36] MaxSem: do you want to deploy your own? 
go ahead if you'd like [23:11:54] Yeah, I can do it myself [23:15:19] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/462040/ (duration: 00m 55s) [23:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:15] (03PS3) 10MaxSem: Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) [23:16:22] (03CR) 10MaxSem: [C: 032] Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) (owner: 10MaxSem) [23:18:17] (03Merged) 10jenkins-bot: Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) (owner: 10MaxSem) [23:18:32] Who's messing with mwdebug1001? syntax error, unexpected ';', expecting ')' in /srv/mediawiki/php-1.32.0-wmf.26/includes/resourceloader/ResourceLoader.php on line 1536 [23:23:03] MaxSem: not I [23:23:40] maybe bawolff? 
[23:23:58] No, I was testing on mwdebug1002 [23:24:00] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/469495/ (duration: 00m 54s) [23:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:08] And I don't think i did anything syntax error-y [23:24:10] (03CR) 10jenkins-bot: Undeploy RelatedSites from a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469495 (https://phabricator.wikimedia.org/T202761) (owner: 10MaxSem) [23:32:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [23:33:34] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 323 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [23:37:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [23:41:24] (03CR) 10Thcipriani: "Bunch of nits and comments inline." 
(034 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [23:48:58] (03CR) 10Faidon Liambotis: Switch to Construct for the SSH agent protocol (033 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 (owner: 10Faidon Liambotis) [23:58:42] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: increase restart interval of wdqs updater - https://phabricator.wikimedia.org/T207843 (10Smalyshev) [23:58:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Wikimedia-Logstash, and 2 others: Rate limit wdqs logs - https://phabricator.wikimedia.org/T204364 (10Smalyshev) [23:59:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Smalyshev)