[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:24] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:01:13] (PS11) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:04:08] (PS12) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:07:10] (PS13) Dzahn: planet: missing Hiera, rename params, rm scope.lookup in erb [puppet] - https://gerrit.wikimedia.org/r/397729
[00:08:14] (CR) Dzahn: [C: 2] "finally http://puppet-compiler.wmflabs.org/9678/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/397729 (owner: Dzahn)
[00:09:24] Operations, DBA, Performance-Team, Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3888608 (aaron) I fixed a stupid hostname var bug. Now I get numbers that make sense: ``` Same-DC (db2070.codfw.wmnet): stri...
[00:09:59] (CR) Dzahn: "also fixes 3 x Parameter 'languages' of class 'profile.. ' has no call to hiera" [puppet] - https://gerrit.wikimedia.org/r/397729 (owner: Dzahn)
[00:17:28] a reboot of phabricator server is imminent
[00:18:13] (PS1) Alex Monk: letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326
[00:18:36] (CR) jerkins-bot: [V: -1] letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326 (owner: Alex Monk)
[00:18:54] !log rebooting phabricator server for kernel upgrade
[00:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:42] (PS2) Alex Monk: letsencrypt: Update LE subscriber agreement URL [puppet] - https://gerrit.wikimedia.org/r/403326
[00:20:46] is it me, or is git-review broken?
[00:21:15] Krenair: i just used it
[00:21:35] must be my setup then
[00:22:04] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[00:23:08] Krenair: 1.25.0-2
[00:23:33] actually I'm having trouble pulling from origin too
[00:23:49] i touched phab but not gerrit
[00:23:51] yet
[00:23:58] though pushing that commit was fine as I just pushed to refs/for/production directly without bothering with git-review
[00:24:04] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy
[00:24:21] it's a good time because i also have to reboot gerrit, heh
[00:24:29] heh
[00:24:40] more likely the problem is on my end
[00:24:41] i mean.. better now than thinking it was related
[00:25:03] yeah
[00:26:30] Krenair: there was something recently that had to be fixed about it
[00:26:37] and then they did.. and it worked again
[00:26:47] fixed about what exactly?
[00:26:54] maybe related to the Gerrit URL with or without /r/p vs. just /p
[00:26:56] ehm..
[00:27:18] attempting to pull from https://gerrit.wikimedia.org/r/p/operations/puppet rather than my configured (ssh) origin is also just sitting there looking at me
[00:27:37] https://gerrit.wikimedia.org/p/operations/puppet is not found
[00:29:48] Krenair: it exists with /r/p/
[00:30:08] yeah except my client just does nothing
[00:30:13] i just have it configured with ssh
[00:30:31] paladox knows this :)
[00:30:54] afair
[00:31:14] meh
[00:31:18] I'll look at it some other day
[00:31:55] https://gerrit.wikimedia.org/r/#/c/403326/ is gonna need some legal review or something
[00:32:20] though the changes don't look very big, I don't know how to arrange it. I assume the reviewers do
[00:32:25] i'll have the answer tomorrow, heh
[00:32:54] you got the right reviewers, yes
[00:34:50] mutante heh
[00:34:56] Krenair update git-review :)
[00:35:04] it includes a fix for this.
[00:35:11] paladox: ^ thanks, i remembered the issue was there
[00:35:13] it's not just git-review having problems
[00:35:17] it appears to be my git client
[00:35:37] paladox: https in git config?
[00:35:45] with the /r/ and r/p thing
[00:36:18] i think my git-review is old enough to be before the issue
[00:36:27] but the latest has it fixed again
[00:36:43] i installed from distro, not pip
[00:37:23] Maybe another bug? as the one that was fixed was /changes/ but the actual fix is https://review.openstack.org/#/c/478325/
[00:37:54] oh there we go
[00:38:00] it took a while but git pull eventually worked
[00:38:01] though he says just git client by itself too
[00:38:04] ah
[00:38:21] heh
[00:38:28] Old git is sad git
[00:38:31] jouncebot: next
[00:38:31] In 13 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400)
[00:43:45] !log rebooting gerrit server for kernel upgrade
[00:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:25] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.12, 22.51, 23.88
[00:46:48] gerrit back
[00:51:24] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[00:54:33] ^ side-effect of gerrit reboot, just a sec
[00:56:24] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:23:10] (CR) Gergő Tisza: "The line looks good. Not sure where I should check (or even if I have access), https://wikitech.wikimedia.org/wiki/Cron_jobs is not very i" [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[01:32:49] (PS2) Dzahn: mariadb::tendril: move firewall to role, use profile [puppet] - https://gerrit.wikimedia.org/r/397725
[01:34:11] (Abandoned) Dzahn: mariadb::tendril: move firewall to role, use profile [puppet] - https://gerrit.wikimedia.org/r/397725 (owner: Dzahn)
[01:36:00] (PS2) Dzahn: druid: move firewall includes from site to roles [puppet] - https://gerrit.wikimedia.org/r/397726
[01:36:34] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.42, 34.88, 32.10
[01:38:34] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.40, 33.97, 32.09
[01:39:54] !log mw1226 - high load - hhvm-dump-debug > /root/hhvm-dump-debug-20170109-1739PST.log ; restart-hhvm
[01:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:00] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/9679/druid1002.eqiad.wmnet/change.druid1002.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/397726 (owner: Dzahn)
[01:42:09] (CR) Dzahn: [C: -1] "why is this even related? Error: Could not find resource 'Exec[apt-get update]' for relationship from 'Class[Profile::Cdh::Apt]'" [puppet] - https://gerrit.wikimedia.org/r/397726 (owner: Dzahn)
[01:47:34] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 9.70, 15.67, 23.87
[02:10:34] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[02:11:34] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7815375 keys, up 5 minutes 20 seconds - replication_delay is 0
[02:23:24] PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 49.04, 26.51, 20.89
[02:24:27] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 06m 02s)
[02:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:25] RECOVERY - High CPU load on API appserver on mw1201 is OK: OK - load average: 18.06, 24.89, 21.90
[03:26:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 811.39 seconds
[03:38:23] (PS1) KartikMistry: apertium-cat: New upstream and updated dependency on cg3 [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/403339 (https://phabricator.wikimedia.org/T171406)
[03:38:46] (Abandoned) KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/397223 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:39:04] (CR) jerkins-bot: [V: -1] apertium-cat: New upstream and updated dependency on cg3 [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/403339 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:41:08] (Abandoned) KartikMistry: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/397224 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[03:51:59] (PS1) KartikMistry: apertium-cat-srd: New upstream and updated dependencies [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/403340 (https://phabricator.wikimedia.org/T171406)
[03:53:38] (CR) jerkins-bot: [V: -1] apertium-cat-srd: New upstream and updated dependencies [debs/contenttranslation/apertium-cat-srd] - https://gerrit.wikimedia.org/r/403340 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[04:04:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.72 seconds
[05:01:40] (Draft2) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342
[05:02:10] (PS3) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579)
[05:05:05] (PS4) Jayprakash12345: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579)
[05:12:23] (CR) Jayprakash12345: "@SWAT, You can merge the task. Because we can't test it on mwdebug. So go ahead even if I am not around on Wikimedia-operations." [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: Jayprakash12345)
[05:23:47] (PS1) KartikMistry: apertium-srd-ita: Updated cg3 dependency [debs/contenttranslation/apertium-srd-ita] - https://gerrit.wikimedia.org/r/403344 (https://phabricator.wikimedia.org/T171406)
[05:24:20] (CR) jerkins-bot: [V: -1] apertium-srd-ita: Updated cg3 dependency [debs/contenttranslation/apertium-srd-ita] - https://gerrit.wikimedia.org/r/403344 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:28:55] (PS1) KartikMistry: apertium-swe: Updated dependency on cg3 [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/403345 (https://phabricator.wikimedia.org/T171406)
[05:29:28] (CR) jerkins-bot: [V: -1] apertium-swe: Updated dependency on cg3 [debs/contenttranslation/apertium-swe] - https://gerrit.wikimedia.org/r/403345 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:32:49] (PS1) KartikMistry: apertium-swe-dan: updated dependency on cg3 [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/403346 (https://phabricator.wikimedia.org/T171406)
[05:33:42] (CR) jerkins-bot: [V: -1] apertium-swe-dan: updated dependency on cg3 [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/403346 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:35:23] (PS1) KartikMistry: apertium-swe-nor: Updated dependency on cg3 [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/403347 (https://phabricator.wikimedia.org/T171406)
[05:35:48] (CR) jerkins-bot: [V: -1] apertium-swe-nor: Updated dependency on cg3 [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/403347 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[05:41:48] Operations, Discourse, Developer-Relations (Jan-Mar-2018), cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3888890 (Tgr)
[05:42:47] Operations, Developer-Relations, Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3888892 (Tgr)
[05:43:05] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3888893 (Tgr)
[05:58:06] (PS1) Urbanecm: Update officewiki logo, add HD logo for officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/403349 (https://phabricator.wikimedia.org/T184575)
[05:58:16] (PS3) Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[06:01:45] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: Jayprakash12345)
[06:02:37] (CR) Urbanecm: "> In dblists/all.dblist, inhwiki should come before internalwiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: Urbanecm)
[06:03:55] (CR) Urbanecm: [C: 1] "LGTM, technically." [mediawiki-config] - https://gerrit.wikimedia.org/r/395775 (https://phabricator.wikimedia.org/T182201) (owner: MarcoAurelio)
[06:04:27] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3888925 (Tgr) >>! In T180854#3882018, @Qgil wrote: > If replying via email is a wanted feature, then it should be discussed in a separate task blocking {T180853}. I will...
[06:04:45] (PS2) Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [mediawiki-config] - https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[06:06:23] (CR) Urbanecm: [C: 1] "LGTM, technically. Not sure about the EDP and everything else needed for enabling." [mediawiki-config] - https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: TerraCodes)
[06:07:36] (CR) Urbanecm: [C: 1] "Technically ok" [mediawiki-config] - https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: MarcoAurelio)
[06:13:20] (PS1) KartikMistry: apertium-tat: Updated dependency on cg3 [debs/contenttranslation/apertium-tat] - https://gerrit.wikimedia.org/r/403350 (https://phabricator.wikimedia.org/T171406)
[06:13:48] (CR) jerkins-bot: [V: -1] apertium-tat: Updated dependency on cg3 [debs/contenttranslation/apertium-tat] - https://gerrit.wikimedia.org/r/403350 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:15:49] (PS1) KartikMistry: apertium-tur: Updated dependency on cg3 [debs/contenttranslation/apertium-tur] - https://gerrit.wikimedia.org/r/403351 (https://phabricator.wikimedia.org/T171406)
[06:16:11] (CR) jerkins-bot: [V: -1] apertium-tur: Updated dependency on cg3 [debs/contenttranslation/apertium-tur] - https://gerrit.wikimedia.org/r/403351 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:17:11] !log Deploy schema change on s5 codfw master (db2052) with replication (this will generate lag on codfw) - T174569
[06:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:24] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:17:51] (PS1) KartikMistry: apertium-urd: Updated dependency on cg3 [debs/contenttranslation/apertium-urd] - https://gerrit.wikimedia.org/r/403352 (https://phabricator.wikimedia.org/T171406)
[06:18:34] (CR) jerkins-bot: [V: -1] apertium-urd: Updated dependency on cg3 [debs/contenttranslation/apertium-urd] - https://gerrit.wikimedia.org/r/403352 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:19:49] (PS1) KartikMistry: apertium-urd-hin: Updated dependency on cg3 [debs/contenttranslation/apertium-urd-hin] - https://gerrit.wikimedia.org/r/403353 (https://phabricator.wikimedia.org/T171406)
[06:20:32] (CR) jerkins-bot: [V: -1] apertium-urd-hin: Updated dependency on cg3 [debs/contenttranslation/apertium-urd-hin] - https://gerrit.wikimedia.org/r/403353 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[06:27:54] (CR) Marostegui: [C: 1] wikireplicas: Add partial index for page_props.pp_value [puppet] - https://gerrit.wikimedia.org/r/388572 (https://phabricator.wikimedia.org/T140609) (owner: BryanDavis)
[06:36:49] (PS1) Andrew Bogott: vmbuilder: include linux-image-generic in trusty base image [puppet] - https://gerrit.wikimedia.org/r/403355
[06:38:34] (CR) Andrew Bogott: [C: 2] vmbuilder: include linux-image-generic in trusty base image [puppet] - https://gerrit.wikimedia.org/r/403355 (owner: Andrew Bogott)
[06:39:54] Operations, Goal, Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3888967 (Marostegui) I believe we are good to close this task after Bryan finished with the pending Cloud Team's tasks?
[06:47:05] Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3888968 (Andrew) I've built new base images, and I'm concerned about what I'm seeing for Jessie. Trusty: ``` andrew@trusty-meltdown-image...
[06:56:08] Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3888980 (Andrew) Here are all the distros and kernels currently running: P6565
[07:26:58] (PS1) Marostegui: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807)
[07:37:08] !log Drop external_user from wikidatawiki - T184247
[07:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:21] T184247: Drop `external_user` from all databases - https://phabricator.wikimedia.org/T184247
[07:44:05] !log rebooting mw1262-mw1275 for kernel security update (along with update to HHVM 3.18.6)
[07:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:17] (PS41) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[07:55:05] Operations, Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889014 (Bawolff)
[08:13:44] !log Deploy schema change on s5 dbstore1002 - T174569
[08:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:57] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:28:13] !log contint1001: upgraded Zuul 2.5.0-8-gcbc7f62-wmf4jessie1 .. 2.5.0-8-gcbc7f62-wmf6 | T158243
[08:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:26] T158243: Update zuul to upstream master - https://phabricator.wikimedia.org/T158243
[08:29:31] there is a file on tin: modified: /srv/mediawiki-staging/wikiversions.json
[08:29:37] looking into it for marostegui
[08:29:49] thanks hashar
[08:30:47] ah that is twentyafterfour that did the deploy yesterday. T180749
[08:30:48] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749
[08:33:49] !log rebooting mw1299-mw1306 (job runners) for kernel security update (along with update to HHVM 3.18.6)
[08:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:46] (PS1) Hashar: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360
[08:36:07] (CR) Hashar: [C: 2] group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:36:54] marostegui: ^^that would fix it
[08:37:14] group0 got updated but the wikiversions.json has been left uncommitted for some reason
[08:37:19] hashar: ah great :)
[08:37:36] (Merged) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:37:48] (CR) jenkins-bot: group0 to wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/403360 (owner: Hashar)
[08:37:59] marostegui: should be good now :]
[08:38:01] hashar: it is now gone - thanks! :)
[08:38:14] !log Deploy schema change on s5 dbstore1001 - T174569
[08:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:26] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:38:40] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:40:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:40:34] (CR) jenkins-bot: db-eqiad.php: Depool db1067,db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/403357 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:41:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 and db1089 - T162807 (duration: 01m 05s)
[08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:45] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:42:41] !log Stop replication in sync on db1089 and db1067 - T162807
[08:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:25] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:53:33] (CR) Hashar: "recheck" [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:57:52] (CR) Hashar: "recheck" [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[08:59:16] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130
[09:01:19] (PS3) Hashar: Jenkins job validation (DO NOT SUBMIT) (jessie) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130
[09:01:29] Operations, ops-esams, Patch-For-Review, User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3889187 (fgiunchedi) Thanks a lot @Dzahn for taking care of this!
[09:02:48] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) (jessie) [debs/node-tunnel-agent] - https://gerrit.wikimedia.org/r/403130 (owner: Hashar)
[09:12:53] !log rebooting radium (tor relay) for kernel security update
[09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:19] !log Deploy schema change on db1051 - T174569
[09:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:29] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:18:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:21:05] Operations, Dumps-Generation: Reboot snapshot*, dumpsdata*, dataset1001, ms1001, francium - https://phabricator.wikimedia.org/T184443#3889205 (MoritzMuehlenhoff) Fixed kernels are available for trusty now, I've installed them on francium and snapshot100[1,5-7].
[09:21:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:27:26] !log stop restbase on cassandra 2 nodes - T184100
[09:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:38] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:32:02] !log Upgrade kernel on db1067
[09:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:21] (CR) Ema: "LGTM in general, I've added a couple comments inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: Gilles)
[09:39:12] !log eqiad LVSs: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267
[09:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:26] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656
[09:40:34] !log rebooting kubernetes workers (plus staging hosts) for kernel security update
[09:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:49] (CR) Giuseppe Lavagetto: [C: -1] "While the code is overall correct, I'm not convinced by its organization. I'd try to make the role/profile move first and rebase this chan" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/394966 (owner: Elukey)
[09:50:07] !log shut cassandra 2 on restbase legacy nodes - T184100
[09:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:19] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:51:23] Operations, Discourse, Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3889259 (Qgil) p:Triage>Normal
[09:53:01] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3889275 (Qgil) @Tgr indeed: {T184592}
[10:00:20] (PS1) Ladsgroup: statistics: Install php5-dom for wmde scripts [puppet] - https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463)
[10:02:59] !log rebooting tegmen for kernel security update
[10:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:45] !log rebooting analytics1035 (hadoop worker node and hdfs journal node) for kernel updates
[10:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:22] (PS1) Volans: Temporary failover Icinga to tegmen [puppet] - https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353)
[10:09:33] (PS1) Volans: Temporary failover Icinga to tegmen [dns] - https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353)
[10:11:09] Operations, ops-eqiad, Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3889328 (jcrespo) Please send us a 15 minute meeting invite, there are some things that we need to discuss regarding dbstores for you to talk to analytics and other dbstore users. T...
[10:12:42] (CR) Hashar: "check experimental" [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:13:05] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:13:49] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/403372
[10:14:00] (CR) Hashar: "check experimental" [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/403372 (owner: Hashar)
[10:14:21] (CR) Hashar: "check experimental" [puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar)
[10:14:52] (PS2) Filippo Giunchedi: decom legacy restbase/cassandra 2 cluster [puppet] - https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100)
[10:14:57] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/403373
[10:15:08] (CR) Hashar: "check experimental" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/403373 (owner: Hashar)
[10:15:26] (CR) Hashar: "check experimental" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/133435 (owner: Hashar)
[10:15:38] (CR) Alexandros Kosiaris: [C: 1] Temporary failover Icinga to tegmen [puppet] - https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[10:16:16] (CR) Hashar: "check experimental" [puppet/kafkatee] - https://gerrit.wikimedia.org/r/387571 (owner: Hashar)
[10:16:35] (CR) Hashar: "check experimental" [puppet/cdh] - https://gerrit.wikimedia.org/r/142517 (owner: Hashar)
[10:16:51] (CR) Hashar: "check experimental" [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906 (owner: Hashar)
[10:16:55] !log rebooting bast4001 for kernel security update
[10:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:11] (CR) Hashar: "check experimental" [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387572 (owner: Hashar)
[10:17:24] (CR) Hashar: "check experimental" [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:17:40] (CR) Filippo Giunchedi: [C: 2] decom legacy restbase/cassandra 2 cluster [puppet] - https://gerrit.wikimedia.org/r/403208 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[10:17:52] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar)
[10:18:00] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/133435 (owner: Hashar)
[10:18:10] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - https://gerrit.wikimedia.org/r/387571 (owner: Hashar)
[10:19:12] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/387572 (owner: Hashar)
[10:19:15] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:19:31] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/cdh] - https://gerrit.wikimedia.org/r/142517 (owner: Hashar)
[10:21:41] (PS1) Filippo Giunchedi: site: spare::system vs system::spare [puppet] - https://gerrit.wikimedia.org/r/403376
[10:21:56] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:22:02] (PS5) Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436
[10:22:23] (CR) Hashar: "check experimental" [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:22:35] (CR) Hashar: "check experimental" [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906 (owner: Hashar)
[10:22:46] (CR) Filippo Giunchedi: [C: 2] site: spare::system vs system::spare [puppet] - https://gerrit.wikimedia.org/r/403376 (owner: Filippo Giunchedi)
[10:22:49] (CR) Hashar: "check experimental" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/133435 (owner: Hashar)
[10:22:59] (CR) Hashar: "check experimental" [puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar)
[10:23:03] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/133435 (owner: Hashar)
[10:23:06] (CR) Hashar: "check experimental" [puppet/kafka] - https://gerrit.wikimedia.org/r/64434 (owner: Hashar)
[10:23:12] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar)
[10:23:14] (CR) Hashar: "check experimental" [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:23:19] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - https://gerrit.wikimedia.org/r/64434 (owner: Hashar)
[10:23:22] (CR) Hashar: "check experimental" [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/403372 (owner: Hashar)
[10:23:29] (CR) Hashar: "check experimental" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/403373 (owner: Hashar)
[10:23:40] (CR) Hashar: "check experimental" [puppet/kafkatee] - https://gerrit.wikimedia.org/r/387571 (owner: Hashar)
[10:23:44] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[10:23:55] (CR) jenkins-bot: Jenkins job validation (DO NOT SUBMIT) [puppet/kafkatee] - https://gerrit.wikimedia.org/r/387571 (owner: Hashar)
[10:29:28] (CR) Hashar: "check experimental" [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:29:32] (PS5) Faidon Liambotis: rubocop: move three ignores to .rubocop.yml [puppet] - https://gerrit.wikimedia.org/r/359485
[10:29:40] !log reimage restbase1011 to test HBA mode - T184100
[10:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:53] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[10:30:00] (CR) Faidon Liambotis: [C: 2] rubocop: move three ignores to .rubocop.yml [puppet] - https://gerrit.wikimedia.org/r/359485 (owner: Faidon Liambotis)
[10:31:49] (PS6) Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436
[10:32:00] (CR) Hashar: "check experimental" [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:32:04] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906 (owner: Hashar)
[10:32:24] (CR) Addshore: [C: 1] statistics: Install php5-dom for wmde scripts [puppet] - https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: Ladsgroup)
[10:33:02] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906
[10:33:06] (CR) Addshore: [C: -1] "this should probably be within the statistics::wmde::graphite class? That is where the requirement for php actually comes in (via the scr" [puppet] - https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: Ladsgroup)
[10:33:16] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT)... [puppet/cdh4] - https://gerrit.wikimedia.org/r/64436 (owner: Hashar)
[10:33:31] (CR) Hashar: "check experimental" [puppet/zookeeper] - https://gerrit.wikimedia.org/r/66906 (owner: Hashar)
[10:36:56] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - https://gerrit.wikimedia.org/r/64434 (owner: Hashar)
[10:36:59] (PS4) Hashar: Jenkins job validation (DO NOT SUBMIT). [puppet/kafka] - https://gerrit.wikimedia.org/r/64434
[10:37:10] (CR) Hashar: "check experimental" [puppet/kafka] - https://gerrit.wikimedia.org/r/64434 (owner: Hashar)
[10:38:29] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT).
[puppet/kafka] - 10https://gerrit.wikimedia.org/r/64434 (owner: 10Hashar) [10:38:50] (03PS1) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378 [10:40:24] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/403373 (owner: 10Hashar) [10:41:56] (03PS2) 10Ladsgroup: statistics: Install php5-dom for wmde scripts [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) [10:42:00] (03CR) 10Ladsgroup: "Thanks. Fixed." [puppet] - 10https://gerrit.wikimedia.org/r/403366 (https://phabricator.wikimedia.org/T165463) (owner: 10Ladsgroup) [10:42:35] (03PS1) 10Ema: pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380 [10:43:57] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/403378 (owner: 10Faidon Liambotis) [10:49:52] (03CR) 10Alexandros Kosiaris: [C: 031] Temporary failover Icinga to tegmen [dns] - 10https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:49:59] (03PS2) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378 [10:52:36] (03CR) 10Alexandros Kosiaris: [C: 031] pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380 (owner: 10Ema) [10:52:42] I'm about to failover the Icinga server to tegment (passive server) in about 5 minutes. If there is anything ongoing let me know and I can postpone it [10:53:04] *tegmen ofc [10:55:32] !log reboot analytics1040->43 for kernel updates [10:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:33] volans: do I need to do anything about the maintenance that I've just scheduled for --^? 
(ignorant question) [10:56:35] (03CR) 10Ema: [C: 032] pybaltest: accept RAs even if forwarding is enabled [puppet] - 10https://gerrit.wikimedia.org/r/403380 (owner: 10Ema) [10:57:15] elukey: no I will sync the files when failovering, but if you want I can wait your 3 reboots [10:57:53] nono because I need to drain those hosts first, all good [10:58:19] thanks :) [10:58:38] PROBLEM - Host wtp2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:44] anyway is a good test for the procedure, I'll check that the downtime is still there [10:59:08] PROBLEM - Host wtp2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:59:18] RECOVERY - Host wtp2002 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [10:59:19] RECOVERY - Host wtp2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [11:04:09] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [11:05:09] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.032 second response time [11:06:17] (03CR) 10Volans: [C: 032] Temporary failover Icinga to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [11:06:22] (03PS2) 10Volans: Temporary failover Icinga to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/403369 (https://phabricator.wikimedia.org/T170353) [11:07:57] !log start failovering of Icinga to tegmen - T170353 [11:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:09] T170353: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353 [11:10:36] (03CR) 10Volans: [C: 032] Temporary failover Icinga to tegmen [dns] - 10https://gerrit.wikimedia.org/r/403370 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [11:11:52] 10Operations, 10DBA, 10Performance-Team, 10Availability 
(Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3889432 (10jcrespo) So actually, that is not really that bad- query times are similar (only some small overhead), connection t... [11:12:29] !log migrating instances off ganeti2008 for subsequent reboot for kernel security update [11:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:34] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3889438 (10jcrespo) One thing I just realized is that there could be some connection overhead on db1055- I will (or you can) t... [11:18:58] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:19] that's me ^ [11:19:31] !log Icinga failover to tegmen completed - T170353 [11:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:45] T170353: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353 [11:19:48] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [11:19:50] the ACTIVE Icinga server is now tegmen [11:22:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) [11:23:28] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - T170353 - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [11:23:48] !log migrating instances off ganeti2007 for subsequent reboot for kernel security update [11:23:50] akosiaris: there you go! 
ampersends are there :D ^^^ [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:08] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [11:24:39] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms [11:24:47] still me, sorry about that (I've downtimed the hosts on the wrong icinga server hehe) ^ [11:25:14] TTL? :D [11:26:04] 10Operations, 10monitoring, 10Patch-For-Review: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3889454 (10Volans) Confirmed that on `tegmen` it works fine after failovering the active Icinga server to it. The links are properly rendered and... [11:26:19] !log reboot analytics1044->47 for kernel updates [11:26:22] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3889455 (10jcrespo) 05Open>03Resolved a:03jcrespo yes, but let's open one for followup/clean up - delete, which we will want to wait to do (leave data there for a few... 
[11:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:32:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:32:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403385 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:33:03] (03PS2) 10Elukey: Standardize Analytics jmx agent's configurations [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) [11:33:08] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [11:33:33] !log Deploy schema change on db1106 - T174569 [11:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:44] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [11:38:33] volans: I am speechless and have no idea [11:38:45] I suppose you already did a diff the configs ? [11:39:21] yes, IIRC I did diff the whole /etc/icinga, don't remeber if I did also /etc/nagios, I can redo both [11:39:33] or whole /etc :D [11:39:55] !log rebooting mw1201-mw1208 for kernel security update (along with update to HHVM 3.18.6) [11:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:18] PROBLEM - Check systemd state on restbase2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:41:18] PROBLEM - cassandra-b SSL 10.64.0.33:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:18] PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:18] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:18] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:19] PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:19] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:20] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused [11:41:20] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [11:41:21] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:21] PROBLEM - cassandra-a SSL 10.64.32.130:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:22] PROBLEM - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:22] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:23] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused [11:41:34] PROBLEM - 
cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.177 and port 9042: Connection refused [11:41:34] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [11:41:35] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:41:35] PROBLEM - cassandra-b service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:41:36] PROBLEM - cassandra-a SSL 10.192.32.143:7001 on restbase2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:36] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:41:37] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:41:37] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:38] PROBLEM - cassandra-c service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:41:38] PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:41:41] PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.205 and port 9042: Connection refused [11:41:41] PROBLEM - cassandra-c service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:41:41] PROBLEM - cassandra-a service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:41:41] PROBLEM - cassandra-b SSL 10.64.48.139:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:41] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused [11:41:41] PROBLEM - cassandra-a service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:41:42] PROBLEM - cassandra-a SSL 10.64.32.205:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:42] PROBLEM - cassandra-a CQL 10.64.32.130:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.130 and port 9042: Connection refused [11:41:54] PROBLEM - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.143 and port 9042: Connection refused [11:41:54] PROBLEM - Restbase root url on restbase2007 is CRITICAL: connect to address 10.192.16.175 and port 7231: Connection refused [11:41:55] PROBLEM - cassandra-a service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:41:55] PROBLEM - cassandra-b SSL 10.192.32.153:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:56] PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:41:56] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed 
[11:41:58] PROBLEM - cassandra-b service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:41:58] PROBLEM - cassandra-c CQL 10.192.48.70:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.70 and port 9042: Connection refused [11:41:58] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:41:58] PROBLEM - cassandra-c service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:41:59] PROBLEM - cassandra-b service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:41:59] PROBLEM - cassandra-a SSL 10.192.16.186:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:42:00] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:42:00] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:42:01] PROBLEM - cassandra-b service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:42:01] PROBLEM - cassandra-a service on restbase1015 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:42:02] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [11:42:02] PROBLEM - cassandra-c service on restbase2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:42:18] PROBLEM - Restbase root url on restbase2011 is CRITICAL: connect to address 10.192.32.151 and port 7231: Connection refused [11:42:18] PROBLEM - cassandra-b CQL 10.192.32.153:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.153 and port 9042: Connection refused [11:42:18] PROBLEM - cassandra-c service on restbase2012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:42:19] PROBLEM - Restbase root url on restbase2012 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused [11:42:36] (03PS1) 10Jcrespo: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) [11:42:59] (03CR) 10Aklapper: [C: 04-1] "Please see https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines and fix the commit message format (imperative form; length of l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403120 (owner: 10محمد شعیب) [11:45:05] !log downtime decomissioned restbase cassandra 2 hosts [11:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:08] PROBLEM - Host wtp2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:19] RECOVERY - Host wtp2001 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [11:51:21] !log migrating instances off ganeti2006 for subsequent reboot for kernel security update [11:51:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:31] (03CR) 10Elukey: [C: 032] "After a chat with Gehel I decided to proceed anyway since I don't have a ton of mbeans to inspect in my jvms. We started to collect info a" [puppet] - 10https://gerrit.wikimedia.org/r/403123 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [12:00:24] (03PS1) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [12:01:36] 10Operations, 10Patch-For-Review: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3816772 (10faidon) @Dzahn, yes, that sounds like a good idea. Please do :) [12:01:54] PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [12:02:11] ^acrux is transient [12:02:36] (03PS1) 10Filippo Giunchedi: restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100) [12:03:37] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [12:04:14] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:26] <_joe_> uh ganeti down [12:04:44] <_joe_> ah see log by moritz, ok [12:05:08] (03Merged) 10jenkins-bot: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [12:05:14] (03CR) 10Mobrovac: [C: 031] restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [12:05:14] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [12:06:12] 
yeah, for some reason my downtime had vanished [12:06:51] (03CR) 10jenkins-bot: mariadb: Repool db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403387 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [12:07:45] (03PS2) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [12:08:32] (03CR) 10Filippo Giunchedi: [C: 032] restbase: reimage restbase1011 as cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/403389 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [12:10:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 [12:11:08] !log rebooting einsteinium for kernel security update [12:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:00] (03PS3) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [12:12:55] PROBLEM - Host wtp2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:05] PROBLEM - Host wtp2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:24] RECOVERY - Host wtp2007 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [12:13:34] RECOVERY - Host wtp2004 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [12:13:43] (03CR) 10Zoranzoki21: [C: 031] "Added in deployments list for European Mid-day SWAT today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [12:17:42] (03CR) 10TerraCodes: "You forgot the other too patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [12:19:26] !log migrating instances off ganeti2005 for subsequent reboot for kernel security update [12:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:15] (03CR) 10Zoranzoki21: [C: 031] "> You forgot the 
other too patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [12:20:45] (03PS4) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [12:21:30] (03CR) 10TerraCodes: "> > You forgot the other too patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [12:23:00] (03CR) 10Zoranzoki21: [C: 031] "> > > You forgot the other too patches (I added them to the page)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [12:31:52] RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:37:52] PROBLEM - Host wtp2012 is DOWN: PING CRITICAL - Packet loss = 100% [12:38:12] PROBLEM - Host wtp2013 is DOWN: PING CRITICAL - Packet loss = 100% [12:38:42] RECOVERY - Host wtp2012 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [12:38:52] !log migrating instances off ganeti2004 for subsequent reboot for kernel security update [12:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 [12:42:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui) [12:44:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui) [12:46:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 - T174569 (duration: 01m 03s) [12:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:15] T174569: Schema change for 
refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:47:04] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403390 (owner: 10Marostegui) [12:47:31] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) [12:50:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:52:12] (03PS5) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [12:53:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:53:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403394 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [12:54:00] !log Deploy schema change on db1097:3315 - https://phabricator.wikimedia.org/T174569 [12:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T174569 (duration: 01m 03s) [12:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:56] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:55:37] !log mobrovac@tin Started deploy [restbase/deploy@a2aabfb]: API: add top-by-country, change recommendation route, fix duplicates in onthisday - T181520 T170877 T175974 [12:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:49] T175974: [BUG] On this day occasionally duplicates events - https://phabricator.wikimedia.org/T175974 [12:55:49] 
T181520: Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520 [12:55:49] T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877 [12:55:56] (03CR) 10Giuseppe Lavagetto: "This works correctly in production, as seen here https://puppet-compiler.wmflabs.org/compiler02/9684/ but I still need to fix labs before " [puppet] - 10https://gerrit.wikimedia.org/r/403388 (owner: 10Giuseppe Lavagetto) [13:03:15] PROBLEM - Host wtp2010 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:15] PROBLEM - Host wtp2017 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:37] !log mobrovac@tin Finished deploy [restbase/deploy@a2aabfb]: API: add top-by-country, change recommendation route, fix duplicates in onthisday - T181520 T170877 T175974 (duration: 08m 00s) [13:03:44] RECOVERY - Host wtp2017 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [13:03:44] RECOVERY - Host wtp2010 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [13:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:51] T175974: [BUG] On this day occasionally duplicates events - https://phabricator.wikimedia.org/T175974 [13:03:51] T181520: Add "Pageviews by Country" AQS endpoint - https://phabricator.wikimedia.org/T181520 [13:03:51] T170877: Recommendation API public end points - https://phabricator.wikimedia.org/T170877 [13:08:35] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [13:15:03] <_joe_> wat? 
[13:15:08] <_joe_> ema: ^^ [13:21:21] <_joe_> cannot reproduce it ftr [13:22:31] <_joe_> oh now I can [13:25:23] jouncebot: next [13:25:24] In 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400) [13:26:01] <_joe_> !log restarting pybal on lvs2003 [13:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:28:15] PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:44] RECOVERY - Host wtp2019 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [13:31:54] that's me ^ [13:32:21] downtime expired [13:34:13] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:32] that's me, reimaged machine [13:37:06] !log migrating instances off ganeti2003 for subsequent reboot for kernel security update [13:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:33] PROBLEM - Host elastic2008 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:16] (03PS2) 10MarcoAurelio: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) [13:44:04] RECOVERY - Host elastic2008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:44:16] I do not see the elastic ones on SAL [13:44:52] I assume it is part of yesterdays work? [13:45:17] *work started yesterday [13:45:22] jynus: yes rolling restarts usually take 2/3 days [13:45:28] thanks [13:45:32] not complaining [13:45:42] just wanted to make sure it wasn't a crash [13:45:48] sure, np [13:46:44] PROBLEM - Host elastic2007 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:37] jynus: thanks for the check! I'm checking why those were not downtimed correctly by my script... 
[13:48:04] RECOVERY - Host elastic2007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:48:46] hi brion - if https://gerrit.wikimedia.org/r/#/c/401965/ looks good to you now, can you remove your -2? [13:50:15] strange, I do have the set downtime in my logs... [13:50:28] gehel: how do you downtime them? [13:51:11] volans: icinga-downtime on einsteinium [13:51:21] eheheh [13:51:30] we failovered to tegmen today (temporarily) [13:51:45] Ah, I missed that one. That explains! [13:51:51] you should use icinga.w.o that is ofc updated [13:52:03] volans: thanks! [13:52:04] I'm sorry for the trouble, any way I can help? [13:52:58] but SSH to icinga.w.o isn't possible... or am I missing something? [13:53:20] from your local computer? [13:53:24] yep [13:53:52] yes it is if using my script that generates the right entries in the known hosts file ;) [13:54:13] PROBLEM - Host wtp2018 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:17] I should of course migrate those ugly scripts to a proper cumin tool :) [13:54:24] PROBLEM - Host wtp2015 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:25] to use the host strict check [13:54:44] RECOVERY - Host wtp2015 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [13:54:50] gehel: indeed, but I guess I'm also a blocker on that for the switchdc spinoff ;) [13:54:52] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T183896#3889763 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Thanks @Cmjohnson ! 
Disk rebuilding [13:54:53] RECOVERY - Host wtp2018 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:55:38] volans: yep you are :) (but I have plenty of other excuses for not moving forward on that, don't blame yourself) [13:56:23] thanks for sharing the blame :-P [13:57:07] but I'm a blocker for *any* of those, so it's fair I get a bigger share of the blame ;) [13:57:09] (03PS1) 10Legoktm: contint: Lower caching length on doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/403401 (https://phabricator.wikimedia.org/T184255) [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400). [14:00:04] Jayprakash12345, Zoranzoki21, and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:16] I'm here [14:00:18] o/ [14:00:33] I can SWAT today [14:02:04] gehel: FYI we'll get back to einsteinium by EOD most likely (or tomorrow morning at most) [14:02:23] volans: thanks! I'll add a check... [14:02:45] zeljkof: if the others ain't around we maybe can start with mine? [14:03:15] Hauskatze: I'll deploy the 403342 first, since there is nothing to test there [14:03:29] cook, k [14:03:51] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345) [14:05:36] Hi, I am here.. Is started swat? 
[14:05:56] !log migrating instances off ganeti2002 for subsequent reboot for kernel security update [14:06:07] Zoranzoki21: yes, you're next [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:19] OK, I am here [14:06:25] (03Merged) 10jenkins-bot: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345) [14:06:36] zeljkof: Zoranzoki21 is here now :) [14:06:42] (03CR) 10jenkins-bot: Lift the cap on IP address to create accounts on mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403342 (https://phabricator.wikimedia.org/T184579) (owner: 10Jayprakash12345) [14:07:13] Oh, I forgot name of extension for checkiing [14:07:57] OK, I found it and installed. I am now here [14:08:18] Zfilipin: Thank you for merge [14:08:31] (03PS2) 10Rush: tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) [14:08:57] Jayprakash12345: deploying it right now [14:09:20] zeljkof: I am next? [14:09:52] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:403342|Lift the cap on IP address to create accounts on mrwiki (T184579)]] (duration: 01m 04s) [14:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:06] T184579: Request to lift the cap on IP address to create accounts on wiki - https://phabricator.wikimedia.org/T184579 [14:10:10] Jayprakash12345: 403342 is deployed [14:10:31] Zoranzoki21: you are next, but I do not feel comfortable deploying your changes :( [14:10:42] Zfilipin: Thank you very much. [14:10:53] zeljkof: I can test. 
I have x-wikimedia debug [14:10:57] there is a good chance something will go wrong, and I am not familiar with the variables [14:11:10] (03PS3) 10Rush: tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) [14:11:32] zeljkof: If any reproduce problem, you can rollback patch [14:11:42] zeljkof: I think to all will be ok, without problems [14:12:34] Zoranzoki21: since Hauskatze has only one patch, and it's simpler, I will deploy it first, and then look at your patches [14:12:47] zeljkof: Ok [14:13:14] fine for me [14:13:24] let me know when you're ready and to test [14:13:27] ty [14:13:31] Zoranzoki21: the problem is that I did not see reviews from anybody that is familiar with the code on the patches [14:14:12] (03CR) 10Legoktm: [C: 031] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [14:14:47] zeljkof: Ok [14:14:55] Zoranzoki21: in order for me to merge the patches, try getting reviews from for example hasharAway, no_justification, Dereckson, anomie... [14:15:22] They are already reviewers in patch [14:15:28] But they no respond [14:15:32] Zoranzoki21: as it stands at the moment, I do not feel comfortable deploying such changes [14:15:44] Zoranzoki21: they might be reviewers, but did not provide any feedback [14:15:45] right? [14:16:02] zeljkof: They are reviewers, but did not provide feedback [14:16:07] so... [14:16:33] zeljkof: But, I no know why. They have to, if any is not ok, to tell it [14:16:41] do you get my point? 
until somebody from the phab ticket says the patches look good (silence is not approval), I will not deploy them [14:17:20] Zoranzoki21: people are busy, you have to make sure you get at least one positive review, preferably more [14:17:27] zeljkof: Ok [14:17:44] Zoranzoki21: I do not want to earn "I broke wikipedia" t-shirt [14:17:48] not yet [14:18:09] Hauskatze: reviewing your commit [14:18:22] zeljkof: Ok. If patches get positive review(s) I will add for next swat which come in it time [14:18:30] zeljkof: Is it ok? [14:18:30] Zoranzoki21: please do [14:18:34] yes [14:18:43] zeljkof: OK thank you [14:18:44] sorry for being careful, but it's my job not to break stuff :) [14:18:57] zeljkof: OK, no problems. I know it [14:19:20] PROBLEM - Host wtp2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:20] PROBLEM - Host wtp2006 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:47] Zoranzoki21: I sugest you to contract senior Member Before Deploy. [14:19:59] RECOVERY - Host wtp2003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:19:59] RECOVERY - Host wtp2006 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:20:18] Jayprakash12345: OK [14:20:28] Zoranzoki21: Changing in global Variable is very harmful. [14:20:40] Jayprakash12345: Ok, I know. I already told any [14:20:49] Hauskatze: your patch is also bigger that I like for swat :) [14:21:00] can you test it at mwdebug1002? [14:21:08] zeljkof: yes [14:21:09] what's the chance of things breaking? [14:21:38] zeljkof: minimal, as it can be tested on mwdebug and Special:ListGroupRights. If the rights don't appear, we can revert [14:21:39] how much time do you need to test it? it touches many wikis, right? 
[14:21:54] I'll do random checks on 3/4 wikis [14:21:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio) [14:22:18] Hauskatze: ok, merging, will ping you when at mwdebug [14:22:21] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3889821 (10MoritzMuehlenhoff) >>! In T184189#3888968, @Andrew wrote: > Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+... [14:22:26] the config is already on CommonSettings after all since a week or so [14:22:38] okay let me know :) [14:23:27] (03Merged) 10jenkins-bot: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio) [14:23:41] (03CR) 10jenkins-bot: translationadmin: remove configuration equal to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402780 (https://phabricator.wikimedia.org/T184314) (owner: 10MarcoAurelio) [14:26:13] Hauskatze: 402780 is at mwdebug1002 [14:26:20] ack, checking [14:27:44] checks successful so far, I'll do some more zeljkof [14:28:46] ok [14:29:23] zeljkof: revert, I missed a line for commons [14:29:40] or I can amend it really quick [14:29:55] because the change is working after all [14:30:09] PROBLEM - NTP on sca2003 is CRITICAL: NTP CRITICAL: Offset unknown [14:30:11] Hauskatze: if you can create another commit that fixes the problem, I can deploy both at the same time [14:30:27] we have 30 more minutes in the window [14:31:15] (03CR) 10Zoranzoki21: [C: 031] "Hi Zach. This is not deployed, because noone is not reviewed this patch except me. 
When any another review this, in next swat time, this s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [14:31:41] zeljkof: it's https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/CommonSettings.php;1e39c531186ca225cd7eb1efe5e059203ed366e2$2507 [14:31:49] change To From [14:32:00] so let's deploy and I can amend that commonsettings thing [14:32:29] Hauskatze: wait, I did not understand [14:32:40] I can deploy the 402780? [14:32:55] and you will create follow up commit that fixes some problem? [14:32:57] zeljkof: the patch is good and works as expected, yes; however there's a typo in CommonSettings [14:33:13] and I'm creating the follow-up right now [14:33:22] ok, so I should deploy 402780? or wait for the follow-up? [14:33:38] zeljkof: what's best to do, CS and later IS or vice-versa? [14:33:41] cc Urbanecm [14:34:08] CS? IS? [14:34:41] !log dropping wikidatawiki from dbstore2001:3315 T184599 [14:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [14:35:08] Hauskatze: IS? [14:35:27] CommonSettings and InitialiseSettings [14:35:36] (03CR) 10Rush: tools: ferm pre hook to stop kube-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [14:35:43] I'm confused [14:35:53] why not make both changes in the same changeset? 
[14:36:01] (03Draft1) 10MarcoAurelio: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 [14:36:03] (03PS2) 10MarcoAurelio: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 [14:36:10] there it is^ [14:36:27] Platonides: fact is that you should sync one first [14:37:07] https://gerrit.wikimedia.org/r/403410 is good to go, it's a typo fix [14:37:37] (03CR) 10Platonides: [C: 031] "This is indeed the right variable name" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio) [14:38:18] 10Operations, 10Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889838 (10Bawolff) [14:39:18] zeljkof: it's done [14:39:37] Platonides: wrt. same patchset, I can't given that one is already merged [14:39:39] Hauskatze: I'm confused, 402780 changes only IS, 403410 changes only CS? [14:39:57] zeljkof: CS typo prevented IS patch to fully work as expected [14:40:10] ok [14:40:19] only with regards to administrators not being able to remove the permission from themselves [14:40:24] in which order should I deploy the files? [14:40:34] IS, then CS? vice-versa? 
[14:40:37] CS then IS I'd say [14:40:44] ok [14:41:00] if the order is vice-versa, we can re-scap [14:41:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio) [14:42:36] (03Merged) 10jenkins-bot: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio) [14:42:50] (03CR) 10jenkins-bot: translationadmin: typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403410 (owner: 10MarcoAurelio) [14:42:53] !log new meltdown images are live in cloud land [14:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:31] Hauskatze: 403410 is at mwdebug1002, please confirm that things now work fine before the deployment [14:43:41] ack [14:43:50] PROBLEM - Host wtp2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:50] PROBLEM - Host wtp2016 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:44] zeljkof: it does now [14:44:49] RECOVERY - Host wtp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:44:54] Hauskatze: ok to deploy? [14:44:59] RECOVERY - Host wtp2016 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:45:01] yes from me [14:45:07] ok, deploying... [14:45:12] (03CR) 10Rush: [C: 032] tools: ferm pre hook to stop kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/403308 (https://phabricator.wikimedia.org/T182722) (owner: 10Rush) [14:45:42] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889866 (10BBlack) That looks about right (disable all hashes older than SHA256, disable RSA+DSA), although it's hard to suss exactly what th... 
[14:46:28] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:403410|translationadmin: typo fix]] (duration: 01m 03s) [14:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:46] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:402780|translationadmin: remove configuration equal to CommonSettings.php (T184314)]] (duration: 01m 02s) [14:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:00] T184314: Redundant wmf-config for translationadmin - https://phabricator.wikimedia.org/T184314 [14:48:05] _joe_: mmh what was the deal with lvs2003? [14:48:10] Hauskatze: all deployed, please check and thanks for deploying with #releng ;) [14:48:42] <_joe_> ema: some internal error in the http part after it tried to remove an alert [14:48:44] I'm checking and no issues so far [14:48:51] thanks for deploying for me [14:49:10] Hauskatze: no problem, please add the second commit to the calendar [14:49:27] logs look fine so far... 
[14:49:37] zeljkof: sure, almost forgot [14:50:04] !log dropping dewiki from dbstore2001:3318 T184599 [14:50:07] !log EU SWAT finished [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:17] T184599: s5 wikidatawiki database cleanup - https://phabricator.wikimedia.org/T184599 [14:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:47] done that [14:51:48] !log start cassandra-a on restbase1011 - T184100 [14:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:58] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [14:53:08] (03PS1) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411 [14:53:33] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889872 (10Ottomata) > Does that mean SHA1 is disabled, except in the cases that it's the root cert of a chain stored in the jdkCA's default... 
[14:54:22] zeljkof: added to calendar and also marked as not done some https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1400 [14:54:29] !log codfw LVSs: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [14:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:43] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [14:55:27] (03PS2) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411 [14:56:45] (03PS6) 10Giuseppe Lavagetto: puppetdb: refactor to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/403388 [14:58:17] (03PS3) 10Rush: tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411 [14:58:49] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3889885 (10BBlack) Yeah, seems reasonable to just set it system-wide on these systems. [14:59:06] (03CR) 10Rush: [C: 032] tools: rm source from /usr/local/sbin/ferm_restart_handler [puppet] - 10https://gerrit.wikimedia.org/r/403411 (owner: 10Rush) [15:02:54] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3889896 (10Andrew) The load-testing command I've settled on is: ``` sudo cumin --force --timeout 120 -o json "project:testlabs name:labvirt1... 
[15:05:21] Hauskatze: thanks [15:06:20] PROBLEM - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.202 and port 9042: Connection refused [15:06:20] PROBLEM - cassandra-a service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:06:29] PROBLEM - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:07:56] (03CR) 10Ema: [C: 031] contint: Lower caching length on doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/403401 (https://phabricator.wikimedia.org/T184255) (owner: 10Legoktm) [15:08:05] that's not expected, I'll take a look [15:09:20] PROBLEM - Host wtp2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:30] PROBLEM - Host wtp2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:49] RECOVERY - Host wtp2008 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:10:10] RECOVERY - Host wtp2005 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [15:12:36] (03CR) 10Jayprakash12345: [C: 04-1] "no consensus link at task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/403120 (owner: 10محمد شعیب) [15:13:14] (03PS1) 10Ottomata: Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) [15:13:29] RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active [15:14:18] !log reboot netmon1002 / netmon2001 for kernel security update [15:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:45] volans: seems like https://gerrit.wikimedia.org/r/#/c/400250/ is the culprit [15:16:30] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9686/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [15:16:34] akosiaris: for the ensure => 'present',? [15:16:53] yes [15:17:01] and I missed it in the review [15:17:17] yeah in the yaml file we switch the role::tcpircbot::ensure [15:17:30] thanks for looking into ti [15:17:32] *it [15:21:30] RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2018-08-17 16:11:12 +0000 (expires in 219 days) [15:22:29] RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.037 second response time on 10.64.32.202 port 9042 [15:23:13] (03PS1) 10Alexandros Kosiaris: Fix role::tcpircbot lookups for tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/403417 [15:24:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 [15:24:52] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 [15:30:06] RECOVERY - NTP on sca2003 is OK: NTP OK: Offset 9.244680405e-05 secs [15:32:40] !log rebooting yubico auth servers for kernel security update [15:32:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:14] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui) [15:34:26] PROBLEM - Host wtp2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:37] PROBLEM - Host wtp2014 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:56] RECOVERY - Host wtp2011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [15:34:57] PROBLEM - MD RAID on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:34:57] PROBLEM - dhclient process on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:34:57] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:06] PROBLEM - Check size of conntrack table on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:06] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:06] RECOVERY - Host wtp2014 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [15:35:06] PROBLEM - cassandra-a service on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:16] PROBLEM - Disk space on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:16] PROBLEM - configured eth on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:16] PROBLEM - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:35:17] PROBLEM - Check systemd state on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:36] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [15:35:37] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [15:35:37] PROBLEM - DPKG on restbase1011 is CRITICAL: Return code 
of 255 is out of bounds [15:35:40] hey, can this patch be merged? https://gerrit.wikimedia.org/r/#/c/403366 it's tiny [15:35:46] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:35:47] PROBLEM - Check whether ferm is active by checking the default input chain on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:47] PROBLEM - cassandra-b service on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:47] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:35:54] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui) [15:35:57] if it's not possible, let me know to put it in the puppet SWAT [15:36:01] ugh, sorry about the spam [15:36:06] PROBLEM - puppet last run on restbase1011 is CRITICAL: Return code of 255 is out of bounds [15:36:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403418 (owner: 10Marostegui) [15:37:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T174569 (duration: 01m 03s) [15:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:21] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [15:47:27] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3890015 (10Cmjohnson) [15:47:30] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3890014 (10Cmjohnson) 05Open>03Resolved [15:47:48] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - 
https://phabricator.wikimedia.org/T134476#2713121 (10Cmjohnson) [15:47:52] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator, 10hardware-requests: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3890016 (10Cmjohnson) 05Open>03Resolved [15:48:04] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2714212 (10Cmjohnson) [15:48:07] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3890019 (10Cmjohnson) 05Open>03Resolved [15:48:19] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3890021 (10Cmjohnson) 05Open>03Resolved [15:48:22] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2714241 (10Cmjohnson) [15:48:36] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3890023 (10Cmjohnson) 05Open>03Resolved [15:48:51] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3890024 (10Cmjohnson) 05Open>03Resolved [15:49:08] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2767459 (10Cmjohnson) [15:49:10] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3890025 (10Cmjohnson) 05Open>03Resolved [15:49:19] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3890027 (10Cmjohnson) 05Open>03Resolved [15:50:29] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - 
https://phabricator.wikimedia.org/T184160#3890028 (10Cmjohnson) @marostegui II have a used spare battery we can swap this out with. LMK when you want to schedule this [15:50:33] (03PS13) 10Giuseppe Lavagetto: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [15:52:01] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3890043 (10Marostegui) @Cmjohnson you want me to power off the server and we can do it now? [15:53:29] (03CR) 10Alexandros Kosiaris: [C: 032] Fix role::tcpircbot lookups for tegmen/einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/403417 (owner: 10Alexandros Kosiaris) [15:53:32] @marostegui: no, not right now. Can we do later this afternoon or tomorrow morning? [15:53:58] cmjohnson1: tomorrow morning works for me :) [15:54:23] cool! I will ping you tomorrow [15:54:27] cool thanks! [15:54:47] 10Operations, 10ops-eqiad, 10DBA: db1059 possibly BBU issues - https://phabricator.wikimedia.org/T184160#3890063 (10Marostegui) As per our chat, this will be done tomorrow [15:56:02] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: Decommission mw1180-1200 - https://phabricator.wikimedia.org/T183895#3890070 (10Cmjohnson) [15:56:16] (03PS3) 10Faidon Liambotis: wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378 [15:56:20] (03CR) 10Faidon Liambotis: [C: 032] wmflib: fix two RuboCop cops in require_package [puppet] - 10https://gerrit.wikimedia.org/r/403378 (owner: 10Faidon Liambotis) [15:57:23] JENKINS! [15:57:26] wake up! [15:57:37] the fact we got an icinga bot that is called ircecho, but an effectively echoing bot called tcpircbot ... 
[15:57:59] akosiaris: yeah that's endlessly confusing [15:58:20] the whole multitude of irc bots slightly different but equal that is [15:58:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) [15:59:38] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3890096 (10ArielGlenn) I'm adding @Nikerabbit, @demon and @hoo because they will be the main beneficiaries of this new host. How do you see... [15:59:41] (03CR) 10Ottomata: [C: 032] Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [15:59:45] (03PS2) 10Ottomata: Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) [15:59:47] (03CR) 10Ottomata: [V: 032 C: 032] Set jdk.certpath.disabledAlgorithms in java.security on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/403415 (https://phabricator.wikimedia.org/T182993) (owner: 10Ottomata) [15:59:57] !log start cassandra-a on restbase1011 [16:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:13] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3890099 (10bmansurov) Also blocked on a final review by @DarTar and project owners. 
[16:00:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:01:26] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890108 (10Andrew) I've run three load tests with the above command. The last test started at Wed Jan 10 15:51:10 UTC 2018 {F12387667} {F12... [16:01:44] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:01:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403419 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:02:43] 10Operations, 10Datasets-General-or-Unknown: Replace snapshot1001 with a proper testbed host (new hardware) - https://phabricator.wikimedia.org/T184616#3890113 (10ArielGlenn) p:05Triage>03Normal [16:02:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T174569 (duration: 01m 02s) [16:03:07] !log Deploy schema change on db1095.s5 - https://phabricator.wikimedia.org/T174569 [16:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:14] !log switched ganeti master node in codfw to ganeti2004 [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:18] !log migrating instances off ganeti2001 for subsequent reboot for kernel security update [16:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:36] !log roll-restart swift frontend in 
eqiad for kernel upgrade [16:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:25] marostegui: Should I wait on T181731 s5 or can I go ahead? I think the only real risk is if it breaks replication again. [16:14:25] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [16:15:02] PROBLEM - DPKG on ms-fe1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:15:55] anomie: It should not break replication again, as we are not running row based. Right now there is one host running the alter tables (db1096) but it is depooled, so... [16:16:02] RECOVERY - DPKG on ms-fe1005 is OK: All packages OK [16:16:24] anomie: we also fixed consistency on dewiki and wikidata, so.. :) [16:16:29] Ok, thanks [16:16:43] We cannot say it is 100% fixed of course, but it is in a lot better state now [16:17:31] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3890148 (10faidon) @ori recently sent his thoughts about this to the ops list, and I found it a very eloquent description of the issues I was thinking of too. His full ema... 
[16:17:52] PROBLEM - Host wtp1034 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:52] PROBLEM - Host wtp1040 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:02] RECOVERY - Host wtp1034 is UP: PING OK - Packet loss = 0%, RTA = 36.61 ms [16:18:11] RECOVERY - Host wtp1040 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms [16:22:46] !log restarting kafka jumbo brokers to apply java.security certpath restrictions [16:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:41] !logging Running cleanupUsersWithNoId.php on dewiki and wikidatawiki [16:26:41] To log a message, use the following format: !log [16:26:45] !log Running cleanupUsersWithNoId.php on dewiki and wikidatawiki [16:26:52] PROBLEM - Host wtp1031 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:15] I have definitely scheduled downtimes for the wtp10XX hosts.... what on earth [16:28:02] RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:28:18] RECOVERY - Host wtp1031 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:29:25] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds [16:29:51] silly me... seconds [16:30:04] godog: thumbor known ? [16:30:08] (03PS1) 10Cmjohnson: adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425 [16:30:08] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [16:30:15] (03CR) 10jerkins-bot: [V: 04-1] adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425 (owner: 10Cmjohnson) [16:30:16] akosiaris: no :( [16:30:19] I'll check [16:30:34] Bad response from pybal ? 
[16:30:38] looking [16:30:44] 10Operations, 10procurement: Give access to S4 (procurement tasks) to Erika Bjune - https://phabricator.wikimedia.org/T184617#3890177 (10Gehel) [16:30:47] (03PS2) 10Cmjohnson: adding dns entries both production and mgmt for mw1338-mw1348. [dns] - 10https://gerrit.wikimedia.org/r/403425 [16:30:52] that's the passive though [16:31:10] akosiaris: it is, yes. Earlier on today, lvs2003 had the same issue [16:31:36] yeah and we bounced pybal IIRC [16:31:59] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [16:32:11] ah there we go.. that's more like it.. it explains the page [16:32:18] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([thumbor1004.eqiad.wmnet, thumbor1002.eqiad.wmnet]) [16:32:59] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:33:03] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3890215 (10Ottomata) Oook, I've set this on all jumbo Kafka brokers. @bblack anything else? 
[16:33:08] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:33:52] http://localhost:9090/alerts and http://localhost:9090/pools were fine on lvs1006 when I checked a couple of minutes ago [16:34:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:34:29] still looking into thumbor btw [16:34:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:35:07] !log bounce thumbor-instances on thumbor1001 [16:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:35] this isn't the first pybal 500 we've had today [16:35:38] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:35:47] we must have some bug related to the depooling process here... 
[16:36:08] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [16:36:43] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 172 bytes in 18.767 second response time [16:36:49] ok I've got the 500 response body from lvs1003 [16:36:49] PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused [16:37:02] > Servers thumbor1004.eqiad.wmnet, thumbor1002.eqiad.wmnet, thumbor1003.eqiad.wmnet are marked down but pooled [16:37:09] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [16:37:15] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 80: Connection refused [16:37:19] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:37:20] how can I help? [16:37:21] heh, also I'm pretty sure ms-fe is ok, I was rolling-restart its backends though [16:37:29] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [16:37:34] did we merged anything related recently? [16:37:47] so all false positives or just no impact because pool state? 
[16:37:50] the failing "LVS HTTP" check above is real, though [16:38:05] volans: nope, but I've rebooted all LVSs in eqiad and codfw today [16:38:09] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:38:10] the one that says: PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refuseda [16:38:37] I'm repooling ms-fe1008 [16:38:39] ema: ack, and the etcd connection is ok [16:38:39] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [16:38:45] ? [16:38:55] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1008.eqiad.wmnet [16:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] volans: that's cache_upload reqs failing due to ms-fe.svc outage [16:39:21] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1006.eqiad.wmnet [16:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:44] bblack: yeah, my question mark was for my previous sentence ;) [16:39:54] RECOVERY - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.170 second response time [16:40:12] I got a connection refused too on 10.2.2.27:443 [16:40:15] RECOVERY - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.141 second response time [16:40:20] but not anymore [16:40:28] there is a small spike before a large spike [16:41:04] so, if we assume the intended depool plan was sane (didn't depool more than threshold to do reboots or whatever), then there's something wrong on the pybal end here [16:41:31] 10Operations, 10procurement: Give access to S4 
(procurement tasks) to Erika Bjune - https://phabricator.wikimedia.org/T184617#3890252 (10RobH) 05Open>03Resolved Added! @EBjune please be aware that any task with 'Operations Procurement' in the title (in the S4 space) are now visible to you. Please do NOT... [16:41:53] it would be nice to have not only deployments, but also conftool changes on logstash :-) [16:41:53] bblack: it was, though I thought ms-fe1008 was pooled and it wasn't [16:42:02] well either way there's something wrong on the pybal end if it's throwing a 500 I think [16:42:16] for sure [16:42:31] scary [16:42:41] maybe we should take a pause on the depools/reboots and figure that part out first [16:42:52] this is what the 500 from pybal looked like: https://phabricator.wikimedia.org/P6568 [16:42:54] but I bet it's related to depool_threshold [16:43:05] yup I'll hold the rolling restart [16:43:23] !log wtp* rolling restarts for meltdown finished [16:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] 10Operations, 10ORES, 10Graphite, 10Patch-For-Review, and 2 others: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3890264 (10Halfak) OK great. I'll go +1 :) [16:44:06] because I don't think we ever really resolved the depool threshold issue yet, even in the latest versions. and we may have changed something about it. [16:44:56] (the old general-case issue being that if a server going down crosses the threshold mark, some state is lost about that situation without separate concepts of "wants-to-be-depooled" vs "is-depooled") [16:45:09] (03CR) 10Halfak: [C: 031] ""keep_days" is a scary parameter name. I've confirmed with Filippo that this means "delete_files_not_modified_since_days". 
So it looks g" [puppet] - 10https://gerrit.wikimedia.org/r/401917 (https://phabricator.wikimedia.org/T169969) (owner: 10Filippo Giunchedi) [16:45:33] 10Operations, 10Ops-Access-Requests: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890280 (10RobH) p:05Triage>03Normal [16:46:30] (03CR) 10Thcipriani: [C: 031] "Couple of inline comments. Seems fine overall." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402803 (owner: 10ArielGlenn) [16:46:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: CRITICAL - kafka_broker_under_replicated_partitions is 14 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1003 [16:47:04] as far as the configurations and pool-sets go: swift-fe only has 4x servers per DC, and depool threshold is 0.5 [16:47:17] so depooling a 3/4 puts us in that state [16:47:50] thumbor is the same (4/DC, threshold = 0.5) [16:48:07] it looks like we had 0 servers pooled at a certain point? https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=swift_80 [16:49:01] (03CR) 10Elukey: "Don't have a lot of context about puppetdb to fully review this but code looks sane and pcc is fine! 
https://puppet-compiler.wmflabs.org/c" [puppet] - 10https://gerrit.wikimedia.org/r/403388 (owner: 10Giuseppe Lavagetto) [16:49:23] yeah, it's possible there was some operation sequence issue there and we actually did depool > threshold [16:49:50] the secondary issue is: I don't think pybal handles depools>threshold sanely (it never did, but now it does something differently-bad in newer code) [16:51:39] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [16:51:50] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:51:58] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10Tonina_Zhelyazkova_WMDE) [16:52:40] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [16:52:43] (03PS1) 10RobH: Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582) [16:52:45] (03CR) 10Elukey: "Looks sane from https://puppet-compiler.wmflabs.org/compiler02/9689/nitrogen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [16:52:52] so in terms of sequence I started by assuming all 4x ms-fe machines were pooled, and started depooling 1005, reboot, repool [16:53:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: 
Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [16:53:25] then moved onto 1006, depool, reboot [16:53:36] didn't get to repool before things went sideways [16:53:59] can I reboot some analytics hadoop worker nodes? (no pybal involved) [16:54:47] elukey: yes [16:54:50] <3 [16:55:02] !log reboot analytics1047->50 for kernel updates [16:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:51] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10Platonides) I guess your manager at WMDE should confirm here that you are indeed a WMDE developer? [16:56:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890319 (10RobH) [16:56:08] godog: right so you were expecting all 4 machines being pooled, but the graph above shows that only 3 hosts where pooled today [16:57:06] ema: yeah, and the three pooled is likely since yesterday when I did another roll-restart, before realizing the kernel wasn't upgraded [16:57:20] yesterday's roll restart was fine though, a machine at a time [16:57:25] right, so 1/4 was already gone and not noticed, then 2x more depools -> threshold [16:57:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: OK - kafka_broker_under_replicated_partitions is 4 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1003 [16:57:55] (03PS1) 10Zoranzoki21: Add throttle rule for Paris University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) [16:58:02] and then 
pybal seems to at least not handle threshold-limited depools in its HTTP outputs [16:58:14] yup [16:58:17] yes though the 2x depools weren't (supposed to be) overlapping [16:58:19] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890327 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [16:58:21] and then I guess we don't know without more digging what caused the conn-refused on the LVS service [16:58:38] it could just be the remaining 1 (or 2?) servers actually couldn't handle the connection load [16:58:45] the fact that no hosts were pooled for the service I guess [16:58:49] (03PS2) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) [16:59:00] or it could be that pybal screwed up ipvs state and blocked connections even though it should've kept 2x pooled due to threshold [16:59:30] yeah I wouldn't be surprised if 1x ms-fe can't handle the load, 2x I'm not sure [16:59:30] (or are manual depools supposed to be able to exceed thresholds?) [16:59:57] I'm checking the hosts to see if some got obviously overloaded [17:00:02] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890343 (10RobH) [17:00:30] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890357 (10Marostegui) Thanks - will close once it has finished: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete) physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Rebuilding)... [17:02:11] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3889014 (10RobH) @EBjune: Please comment with your approval of this expansion of access rights (as @bawolff's manager.) Thanks! 
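Editorial sketch of the depool-threshold idea discussed above (4 servers per DC, threshold 0.5). It is a simplified model, not PyBal's actual code: the `Server` class, the `apply_depool_threshold` function and the choice to count the threshold against all configured servers are assumptions made for illustration. It keeps "wants to be depooled" and "is depooled" as separate fields, which is the distinction described in the 16:44:56 message. Whether manual depools should also be limited this way is left open in the discussion, and the sketch does not settle it.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    wants_pooled: bool      # desired state (healthy and administratively enabled)
    is_pooled: bool = True  # actual state in the load balancer


def apply_depool_threshold(servers, depool_threshold):
    """Return the names of servers that stay pooled.

    Hypothetical model: at least ceil(total * depool_threshold) servers
    must remain pooled, even if more than that want to be depooled.
    """
    min_pooled = math.ceil(len(servers) * depool_threshold)

    # First pass: honour every server's desired state.
    for server in servers:
        server.is_pooled = server.wants_pooled
    pooled = sum(1 for server in servers if server.is_pooled)

    # Second pass: if we fell below the floor, keep servers pooled
    # against their wish. wants_pooled is left untouched, so the
    # information "this server should really be out" is not lost.
    for server in servers:
        if pooled >= min_pooled:
            break
        if not server.is_pooled:
            server.is_pooled = True
            pooled += 1

    return [server.name for server in servers if server.is_pooled]


if __name__ == "__main__":
    # Host names taken from the log; the desired states are illustrative.
    pool = [
        Server("ms-fe1005", wants_pooled=False),
        Server("ms-fe1006", wants_pooled=False),
        Server("ms-fe1007", wants_pooled=True),
        Server("ms-fe1008", wants_pooled=False),
    ]
    print(apply_depool_threshold(pool, 0.5))
```

With three of four servers wanting out and a threshold of 0.5, the model keeps two pooled, one of them against its desired state.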
[17:04:41] I'm looking at ms-fe1* network graphs and indeed looks like at 16:34 pooled servers went to 0 and hosts stopped receiving traffic [17:07:57] and ms-fe1007 at ~16:25 went to 100% cpu, probably under the swings of traffic moving around [17:08:21] godog: like the repool didn't actually repool it? [17:08:37] <_joe_> that's not the case [17:08:44] <_joe_> if you go look at pybal's logs [17:08:52] volans: no, 1007 stayed pool the whole time afaik [17:09:25] also we've got an icinga check for that, which didn't trigger (check_pybal_ipvs_diff) [17:09:38] <_joe_> ema: a check for what? [17:10:08] _joe_: for what I think volans mentioned, a repool that didn't actually repool [17:10:18] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890300 (10RobH) So the L2 actually won't get you any WMF LDAP flags. We actually need an NDA on file with WMF legal and a few other things: [] - have a signed WM... [17:10:57] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:10:58] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:10:59] <_joe_> ema: is that the case? I don't see that in the logs [17:11:38] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:11:57] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[create_user-replication@netmon2001],Exec[create_user-netbox@netmon2001],Exec[create_user-netbox@localhost],Exec[create_user-prometheus@localhost] [17:12:02] _joe_: right, that's what I'm saying. 
If that were the case, check_pybal_ipvs_diff would have alerted [17:12:24] <_joe_> oh ok I didn't understand :) [17:12:43] <_joe_> so from what I see, there was an issue fetching data from swift at 16:34:06 [17:13:14] does it say from what host? [17:13:22] <_joe_> ms-fe1005 [17:13:28] <_joe_> is what I'm looking at now [17:13:31] <_joe_> btw [17:13:41] <_joe_> it's still failing [17:14:25] <_joe_> and ms-fe1007 [17:14:25] with what error? [17:14:49] <_joe_> sorry, it's not failing anymore, it stopped at 16:49:50 [17:15:36] yeah that's general recovery I'd say, what was the error from 1005 ? [17:15:39] <_joe_> while ms-fe1007 went down just a bit before (16:33:28) and came back earlier (16:44:26) [17:15:55] <_joe_> from both the error is [17:16:07] <_joe_> WARN: ms-fe1005.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 5.007 s [17:16:16] <_joe_> ProxyFetch failing and taking more than 5 seconds [17:17:03] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): tools.iabot is using 1.3T of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183953#3890483 (10bd808) [17:17:06] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#3890484 (10bd808) [17:17:09] 10Operations, 10Cloud-VPS, 10monitoring, 10cloud-services-team (Kanban): remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3890485 (10bd808) [17:17:10] <_joe_> so you disabled 1006 while ms-fe1005 and 1007 were failing [17:17:19] <_joe_> causing probably an overload of 1008 too [17:17:22] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3890490 (10bd808) [17:17:27] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3890492 (10bd808) [17:17:30] 
10Operations, 10Cloud-Services, 10hardware-requests, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3890493 (10bd808) [17:17:32] no, 1008 wasn't pooled so 1007 got overloaded [17:17:33] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3890494 (10bd808) [17:18:04] likely 1005 too, so traffic swung too fast among too few machines [17:18:18] <_joe_> godog: looks like it [17:20:24] (03PS1) 10Giuseppe Lavagetto: base::resolving: remove useless "else" clause [puppet] - 10https://gerrit.wikimedia.org/r/403439 [17:20:26] (03PS1) 10Giuseppe Lavagetto: base::resolving: explicitly pass arguments [puppet] - 10https://gerrit.wikimedia.org/r/403440 [17:20:50] ok, so in the root cause there's for sure my mistake of shuffling (de)pools too fast I'd say, and there were three hosts instead of four pooled [17:21:17] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:33] I'll write an incident report about it, maybe there's followup we can do [17:21:39] (03CR) 10jerkins-bot: [V: 04-1] base::resolving: explicitly pass arguments [puppet] - 10https://gerrit.wikimedia.org/r/403440 (owner: 10Giuseppe Lavagetto) [17:22:08] godog: has there actually been at any point 0 hosts pooled? That's what the grafana board suggests, it would be good to find out if it's reliable or not :) [17:22:34] ema: when all frontends were overloaded I guess there were yeah, but not intentionally [17:22:57] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:01] <_joe_> ema: I think 1008 was still pooled [17:23:07] <_joe_> can someone look into pdfrender? [17:23:14] <_joe_> why are they tying in sequence? 
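Editorial sketch of a pre-depool check for a rolling restart, following the root cause stated above (a host already out of the pool unnoticed, then further depools). The function name `safe_to_depool`, the `pool_state` dictionary and the `min_pooled` parameter are hypothetical and do not correspond to any conftool or PyBal API; how the real pool state would be fetched is deliberately left out, and backend health is not modelled.

```python
def safe_to_depool(pool_state, candidate, min_pooled):
    """Return True if depooling `candidate` still leaves `min_pooled` hosts.

    pool_state maps host name -> True if currently pooled. The point is
    to check the observed state before each step instead of assuming
    that every host is pooled.
    """
    remaining = sum(
        1 for host, pooled in pool_state.items()
        if pooled and host != candidate
    )
    return remaining >= min_pooled


if __name__ == "__main__":
    # State as reconstructed in the log: ms-fe1008 was not pooled.
    observed = {
        "ms-fe1005": True,
        "ms-fe1006": True,
        "ms-fe1007": True,
        "ms-fe1008": False,
    }
    # A "one host out at a time" restart of 4 hosts expects 3 to remain.
    print(safe_to_depool(observed, "ms-fe1006", min_pooled=3))
```

Under these assumptions the check refuses to depool ms-fe1006, because only two hosts would remain pooled rather than the three the operator expected.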
[17:23:39] 1008 wasn't pooled, if it was then we'd have been fine I think [17:23:40] <_joe_> *dying [17:24:14] (03CR) 10Urbanecm: [C: 04-1] "Commons should not be included manually. Every throttle rule is applied to Conmons, along with Wikidata and other defined projects." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21) [17:26:47] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time [17:27:17] <_joe_> did someone fix pdfrender or did it recover by itself? [17:28:07] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890596 (10Tobi_WMDE_SW) [17:31:17] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.077 second response time [17:31:56] <_joe_> so on 1004 it recovered by itself [17:33:37] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. [17:34:34] hi twentyafterfour! [17:34:42] Hi! [17:34:58] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. [17:35:14] So, there are some backports for the things spotted yesterday, I have basically only just finished with them though. Daniel is looking through them now :) [17:35:30] ok [17:35:34] I just saw some patches [17:35:44] 3 for core and possibly 1 for FlaggedRevisions, although the 1 in FlaggedRevisions is also covered by one of the core patches :) [17:36:49] Well there is no rush on my part. It's a couple of hours away from train time but as soon as you're ready I'll deploy group0 so that we can get back on track for group1 this afternoon. [17:36:58] 10Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#3890626 (10ArielGlenn) Ummm.. still wanted? Can we close as impossible or no longer needed? 
[17:37:05] twentyafterfour: yup! okay :) [17:37:07] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:37:26] will see if we can get them ready for the next swat [17:37:55] addshore: if not I can deploy them with the train [17:38:29] twentyafterfour: ack! :) [17:41:23] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890637 (10Andrew) The terrible way to fix grub on Trusty VMs is: sudo cumin --force --timeout 120 -o json "a:All" "lsb_release -si | grep U... [17:41:53] andrewbogott: it's A:all, not a:All ;) [17:42:00] thx [17:44:33] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890647 (10Tobi_WMDE_SW) >>! In T184620#3890317, @Platonides wrote: > I guess your manager at WMDE should confirm here that you are indeed a WMDE developer? >>! In... 
[17:44:39] !log upgrade and restart db2086 [17:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:20] !log installing linux-image-generic-lts-xenial on labtestvirt2003 [17:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:02] !log upgrade and restart db2087 [17:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:42] (03CR) 10ArielGlenn: make role::beta::mediawiki into a profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/402803 (owner: 10ArielGlenn) [18:02:07] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:05:03] (03PS1) 10Jcrespo: mariadb: Promote db2040 to be the codfw-s7 master instead of db2029 [puppet] - 10https://gerrit.wikimedia.org/r/403451 (https://phabricator.wikimedia.org/T176243) [18:09:52] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#3890686 (10RStallman-legalteam) @Tonina_Zhelyazkova_WMDE I'll create the NDA for your electronic signature and route it to your WMDE email address. I'll send an up... [18:12:08] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [18:13:07] (03PS1) 10Jcrespo: mariadb: Promote db2040 as the new codfw-s7 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403453 (https://phabricator.wikimedia.org/T176243) [18:13:10] 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3890704 (10Techyan) @Krenair @herron Thanks! I guess this information is enough for them. 
[18:13:14] 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#3890706 (10Techyan) 05Open>03Resolved [18:16:57] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:18:32] twentyafterfour: all of the patches are up in the .16 branch now, I wont bother adding them to swat [18:18:49] adding you as a reviewer now, I'll be around again when the train runs :) gimmie a ping :D [18:18:58] (03PS1) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469) [18:19:02] addshore: thanks [18:19:25] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3890726 (10jcrespo) a:03jcrespo [18:19:28] * addshore goes to make food [18:20:16] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3890733 (10jcrespo) [18:20:20] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3890734 (10jcrespo) [18:20:22] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3890730 (10jcrespo) 05Open>03stalled a:05jcrespo>03None [18:22:25] twentyafterfour: you also might be able to answer this question for me! One of the patches adds a new log channel called "RevisionStore", will that automatically show up in logstash, or do I need to do something with wmgMonologChannels ? 
[18:22:55] wmgMonologChannels says // Defaults: [ 'udp2log'=>'debug', 'logstash'=>'info', 'kafka'=>false, 'sample'=>false ], and the logging in RevisionStore, so at a guess it will land in logstash, but just wanted to confirm [18:23:17] !log upgrade and restart db2040 [18:23:18] addshore: I'm not sure [18:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:38] I would guess you're right [18:28:43] 10Operations, 10monitoring: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3890758 (10Volans) [18:32:57] RECOVERY - HP RAID on db2060 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [18:34:37] PROBLEM - Host labtestvirt2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:37] RECOVERY - Host labtestvirt2001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:36:24] (03PS1) 10Andrew Bogott: labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) [18:37:26] (03PS2) 10Rush: labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott) [18:37:30] (03CR) 10Rush: [C: 031] labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott) [18:38:04] (03CR) 10Andrew Bogott: [C: 032] labvirts: whitelist the post-meltdown kernel version [puppet] - 10https://gerrit.wikimedia.org/r/403455 (https://phabricator.wikimedia.org/T184189) (owner: 10Andrew Bogott) [18:38:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#3890799 (10jcrespo) 05Open>03Resolved a:05Marostegui>03Papaul [18:38:28] 10Operations, 10Cloud-VPS, 10Toolforge, 
10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890802 (10chasemp) [18:40:22] !log upgrading labvirt1018 kernel and rebooting [18:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:25] !log reboot labtestvirt2002.codfw.wmnet w/ new kernel [18:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:15] (03PS3) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) [18:46:38] PROBLEM - Host labtestvirt2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:16] (03PS4) 10Zoranzoki21: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) [18:49:26] zeljkof: I am here [18:50:00] Zoranzoki21: he most likely is not, what are you pinging him regarding? [18:50:30] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890854 (10EBjune) @RobH I approve of @Bawolff's expansion of access rights for the analytics cluster, thank you! 
[18:50:52] greg-g: Because I am finally here, per the rule that a user needs to be on the IRC channel during SWAT time when they have a patch for it [18:51:25] greg-g: Zeljko never deploys a patch if its owner is not on IRC during the SWAT window the patch is scheduled for [18:51:50] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890859 (10chasemp) [18:52:04] zeljko does not do this specific swat window, it's passed his work hours [18:52:09] past* [18:52:14] cc quiddity :P [18:52:36] Zoranzoki21: just stick around, whoever does do the swat will ping people with patches [18:52:40] <3 ;) [18:52:45] ok [18:59:35] (03PS1) 10Andrew Bogott: Revert "labvirts: whitelist the post-meltdown kernel version" [puppet] - 10https://gerrit.wikimedia.org/r/403456 (https://phabricator.wikimedia.org/T184639) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T1900). [19:00:04] Zoranzoki21: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:22] (03CR) 10Andrew Bogott: [C: 032] Revert "labvirts: whitelist the post-meltdown kernel version" [puppet] - 10https://gerrit.wikimedia.org/r/403456 (https://phabricator.wikimedia.org/T184639) (owner: 10Andrew Bogott) [19:00:37] !log upgrade and restart db1059 [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:45] the proxies is going to be me, see above [19:03:38] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [19:04:07] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [19:06:31] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3890903 (10chasemp) [19:09:16] I can SWAT [19:09:57] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.036 second response time on 10.64.0.117 port 9042 [19:10:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21) [19:11:37] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3890936 (10RobH) [19:12:00] (03Merged) 10jenkins-bot: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21) [19:12:16] (03CR) 10jenkins-bot: Add throttle rule for Paris University and sort other by date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403432 (https://phabricator.wikimedia.org/T184618) (owner: 10Zoranzoki21) [19:14:02] 10Operations, 10ops-eqiad, 10DBA: db1059 BBU issues - https://phabricator.wikimedia.org/T184160#3890941 (10jcrespo) [19:15:23] Zoranzoki21: thanks for the patch, I will go ahead and 
deploy it everywhere since it is a simple throttle change [19:15:39] ок [19:15:43] ok [19:16:39] proxies should come back now [19:16:47] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [19:17:07] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [19:18:57] RECOVERY - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-b valid until 2018-08-17 16:11:09 +0000 (expires in 218 days) [19:19:52] (03CR) 10Lucas Werkmeister (WMDE): "This change has now been assigned its own deployment window (2018-01-11T13:00:00Z/PT1H), so I’ll have one hour to test it on one of the mw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403195 (https://phabricator.wikimedia.org/T181060) (owner: 10Lucas Werkmeister (WMDE)) [19:22:18] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:403432|Add throttle rule for Paris University and sort other by date]] T184618 (duration: 01m 03s) [19:22:27] ^ Zoranzoki21 live everywhere now [19:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:31] T184618: Request to lift account creation throttling on 2018-01-11 - https://phabricator.wikimedia.org/T184618 [19:22:39] thcipriani: Thank you [19:22:51] you're welcome :) [19:32:46] !log bootstrapping restbase1011-b -- T184100 [19:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:59] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [19:34:18] (03PS1) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 [19:34:46] (03CR) 10jerkins-bot: [V: 04-1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:35:34] (03PS2) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 
10https://gerrit.wikimedia.org/r/403464 [19:35:48] (03CR) 10jerkins-bot: [V: 04-1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:36:27] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:45:58] !log upgrade and restart db2047 [19:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:12] (03CR) 10Krinkle: [C: 031] Remove firejail config for now-unused ffmpeg2theora [puppet] - 10https://gerrit.wikimedia.org/r/403212 (https://phabricator.wikimedia.org/T181591) (owner: 10Brion VIBBER) [20:00:04] no_justification: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:39] !log upgrade and restart dbstore2001 [20:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:57] (03CR) 10Subramanya Sastry: Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [20:02:30] (03PS1) 10Brion VIBBER: Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 [20:05:38] !log upgrade and restart dbstore2002 [20:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:39] (03CR) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [20:09:11] !log otto@tin Started deploy [eventstreams/deploy@ee854df]: Update eventstreams deploy test to scb2002: T171011 [20:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:35] !log otto@tin Finished deploy [eventstreams/deploy@ee854df]: Update 
eventstreams deploy test to scb2002: T171011 (duration: 00m 24s) [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:59] !log otto@tin Started deploy [eventstreams/deploy@ee854df]: Update eventstreams with newer service-template-node: T171011 [20:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:09] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3453087 (10Imarlier) Is the goal here just to quantify the impact? Or is there a target connect time/query time that we're tr... [20:12:55] (03CR) 10Krinkle: [C: 031] Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [20:14:10] !log otto@tin Finished deploy [eventstreams/deploy@ee854df]: Update eventstreams with newer service-template-node: T171011 (duration: 04m 11s) [20:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:48] 10Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#3891132 (10Dzahn) 05Open>03declined Yes, I think so, Ariel. and thanks Tim for the details above. As the task creator i'll call it 'declined' but fine with me. 
[20:16:32] twentyafterfour: there is also one on FlaggedRevs (just in case you didn't spot it) [20:18:13] addshore: yeah I think I +2'd that one too [20:18:21] https://gerrit.wikimedia.org/r/#/c/403443/ [20:18:51] (03CR) 10Subramanya Sastry: [C: 031] Switch to YAML configuration for Parsoid on ruthenium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [20:20:43] yup [20:24:53] (03PS3) 10Dzahn: Replace yubikey nano key with yubikey 4 key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/403095 (owner: 10Aaron Schulz) [20:25:25] (03CR) 10Dzahn: [C: 032] "verified via file in tin home dir" [puppet] - 10https://gerrit.wikimedia.org/r/403095 (owner: 10Aaron Schulz) [20:26:22] AaronSchulz: ^ now i understand what you meant :) i found the file on tin like last time, verified, merged [20:27:46] heh, thanks [20:29:16] AaronSchulz: yw! puppet ran on tin and bast1001. that combo should work already [20:40:16] (03CR) 1020after4: [C: 032] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [20:49:37] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.16: Sync wmf.16 to deploy multiple patches from addshore refs T180749 (duration: 10m 23s) [20:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:47] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [20:53:26] (03CR) 10Krinkle: Switch to YAML configuration for Parsoid on ruthenium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [20:58:36] (03PS1) 1020after4: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 [20:58:38] (03CR) 1020after4: [C: 032] group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 (owner: 1020after4) [20:59:27] twentyafterfour: just realised
testwikidatawiki appears in group1 on logstash, when it is actually in group0 i believe.... [20:59:49] looks like the sync of the patches above made the exceptions disappear :) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180110T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:19] addshore: cool [21:00:30] not sure why it's listed in group1.. hmm [21:01:30] twentyafterfour: addshore: this european morning there was an uncommitted wikiversions.json on tin [21:01:40] and I have committed it to a change in gerrit [21:01:47] twentyafterfour: i guess that is a logstash dashboard issue [21:02:01] (03CR) 10Zoranzoki21: [C: 031] Update comment avconv -> ffmpeg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403470 (owner: 10Brion VIBBER) [21:02:12] https://gerrit.wikimedia.org/r/#/c/403360/1/wikiversions.json [21:02:18] hasharAway: that was because of the train getting held up before going to group 1 [21:02:30] (03Merged) 10jenkins-bot: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 (owner: 1020after4) [21:02:31] for logstash, maybe the list of wikis is hardcoded manually [21:02:38] I mean group0 [21:02:39] hasharAway: indeed [21:02:46] (03CR) 10jenkins-bot: group0 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403475 (owner: 1020after4) [21:02:59] Nothing for ORES [21:04:36] nothing for parsoid [21:05:36] (03PS3) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 [21:05:47] (03PS2) 1020after4: load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [21:05:59] (03CR) 10jerkins-bot: [V:
04-1] Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [21:06:04] (03CR) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [21:06:18] (03CR) 1020after4: [V: 032 C: 032] load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [21:09:26] !log twentyafterfour@tin Started scap: group0 to 1.31.0-wmf.16 refs T180749 [21:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:39] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [21:11:26] (03CR) 10jenkins-bot: load ActiveAbtract extension explicitly so class autoloading works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403114 (https://phabricator.wikimedia.org/T184177) (owner: 10ArielGlenn) [21:13:49] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3891351 (10Krinkle) [21:30:52] (03CR) 10Dzahn: "@Ladsgroup i have heard you have done work on standardizing error page style before (for dumps?) as part of a general update" [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [21:30:53] 10Operations, 10Puppet: Upgrade puppetDB to version 3.2 or newer - https://phabricator.wikimedia.org/T177253#3891421 (10herron) So we’ll need to select a puppetdb version and package to proceed. Puppetdb 4.4 looks like the version we should target as according to puppetlabs docs it’s the newest release still... 
[21:31:45] (03PS4) 10Dzahn: rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 [21:32:00] (03CR) 10jerkins-bot: [V: 04-1] rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 (owner: 10Dzahn) [21:35:47] 10Operations, 10Goal: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) - https://phabricator.wikimedia.org/T65899#3891449 (10ArielGlenn) [21:35:50] 10Operations, 10Goal, 10HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#3891448 (10ArielGlenn) [21:35:54] 10Operations, 10Dumps-Generation, 10HHVM, 10Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3891443 (10ArielGlenn) 05stalled>03declined Officially declining, move to php7 has been approved, see T176370 I've been working on a dump instance in... [21:37:58] PROBLEM - Host mw1271 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:27] PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 50.28, 34.67, 22.00 [21:38:35] uhm [21:38:50] *looks* [21:39:05] It's dead, jim [21:39:08] current deploy is at 88% [21:39:15] with 1 node failure [21:39:25] * twentyafterfour wonders if I need to roll back real quick [21:40:11] there isn't anything of note in fatalmonitor that I can see [21:40:13] Reedy! What did you do! [21:40:15] I wouldn't say so yet [21:40:15] :P [21:40:24] Seddon: Fixed it till it was broken [21:40:31] Reedy: Of course :P [21:42:27] PROBLEM - High CPU load on API appserver on mw1201 is CRITICAL: CRITICAL - load average: 49.54, 37.77, 26.00 [21:42:43] hmm at least it's not getting much worse? [21:42:47] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 51.64, 29.20, 24.15 [21:42:54] uh oh [21:42:57] that's a different one [21:42:59] hmmm, is it these app servers? 
[21:43:01] thats 2 [21:43:07] 927 www-data 20 0 27.601g 2.901g 114480 S 1513 4.6 6621:55 hhvm [21:43:07] 908 nutcrac+ 20 0 69228 46768 2156 S 24.2 0.1 139:28.64 nutcracker [21:43:21] wth [21:43:34] the scap errors are from mw1271.eqiad.wmnet [21:43:36] A big request parsing stuff? [21:43:56] load average: 27.65, 34.25, 25.89 [21:43:58] It's coming down [21:44:06] host isn't unresponsive [21:44:27] load average: 21.55, 32.18, 25.47 [21:45:06] hmm I see a lot of stuff in logstash that just looks like a big long list of usernames, split up over multiple log entries [21:45:39] load average: 17.58, 28.58, 24.73 [21:45:47] There's reasons we shouldn't let the users have nice things [21:45:55] I saw a bunch of stuff "Pool error on {key}: {error}" [21:46:37] load average under 15 [21:46:40] * Reedy kicks icinga-wm [21:47:27] RECOVERY - High CPU load on API appserver on mw1201 is OK: OK - load average: 12.43, 23.73, 23.41 [21:47:56] !log twentyafterfour@tin Finished scap: group0 to 1.31.0-wmf.16 refs T180749 (duration: 38m 29s) [21:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:10] T180749: 1.31.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T180749 [21:48:31] note: this wasn't even group1 yet :-/ [21:48:47] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 18.79, 23.84, 23.63 [21:48:48] still I guess logstash looks ok, I don't know what caused the api servers to get hit [21:49:03] coincidence I suppose [21:49:35] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403536 [21:49:37] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403536 (owner: 1020after4) [21:51:45] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403536 (owner: 1020after4) [21:51:57] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.16 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/403536 (owner: 1020after4) [21:53:27] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.16 [21:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:19] (03PS14) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 [21:54:30] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.16 (duration: 01m 02s) [21:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:00] twentyafterfour: is that .16 on group1 then? :) [21:57:07] !log group1 looks stable. This concludes the MediaWiki train for today. [21:57:09] addshore: yep [21:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:33] twentyafterfour: awesome, yes, I see nothing that alarms me :) [21:57:52] ok cool [21:59:45] hmm, twentyafterfour I do see a couple of things now actually [22:00:01] https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval%3A('%24%24hashKey'%3A'object%3A1287'%2Cdisplay%3A'10%20seconds'%2Cpause%3A!f%2Csection%3A1%2Cvalue%3A10000)%2Ctime%3A(from%3Anow-15m%2Cmode%3Aquick%2Cto%3Anow)) [22:00:13] bah, thats the wrong link [22:00:21] https://logstash.wikimedia.org/goto/74dfb80b01ae92a809b22eb9b430272a [22:01:12] however it doesn't immediately look critical [22:02:42] I'll look at logstash again later or tomorrow and see if anything needs to happen. Off for now [22:08:33] thanks addshore [22:52:07] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3891625 (10Andrew) OK -- in System Setup:Device Settings I see one nic with four ports: Integrated NIC 1 Port 1: Intel(R) Ethernet 10G 4P X520/I350 rNDC - 24:6E:96:8D... 
[22:58:54] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3891632 (10Krinkle) [22:59:38] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10NewPHP, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3891635 (10Krinkle) [22:59:47] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RFC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3891637 (10Krinkle) [23:00:10] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RFC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3695340 (10Krinkle) [23:10:03] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#3891691 (10Krenair) Created a new system, ran into the problem that https://gerrit.wikimedia.org/r/#/c/403326/ fixes [23:16:13] 10Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3891713 (10kaldari) Pinging @MoritzMuehlenhoff. Please see my most recent comment above. Thanks! [23:26:17] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:27:22] (03PS5) 10Dzahn: rename planet_server to just planet [puppet] - 10https://gerrit.wikimedia.org/r/393710 [23:28:19] 10Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947#3891728 (10JoKalliauer) Pinging @kaldari . [[ https://commons.wikimedia.org/wiki/File:O_Canada_Lilypond.svg | File:O_Canada_Lilypond.svg ]] h... 
[23:43:39] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9691/" [puppet] - 10https://gerrit.wikimedia.org/r/393710 (owner: 10Dzahn) [23:46:58] (03PS1) 10Krinkle: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) [23:49:18] (03CR) 10Krinkle: "Need to decide where to split the thread." [puppet] - 10https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [23:50:29] (03PS2) 10Krinkle: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) [23:54:14] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#3891777 (10kaldari) [23:56:17] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:56:38] (03PS4) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114)