[00:44:01] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:10:25] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:43:04] 10Operations, 10SRE-Access-Requests: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Andrew) I agree that adding that password to mwmaint is pretty easy. Is that really the only step that's necessary to make ch... [02:59:05] 10Operations, 10SRE-Access-Requests: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Krenair) Well from hieradata/role/common/mediawiki/maintenance.yaml: [*] restricted - doesn't have as many different ways to b... [03:43:21] (03PS7) 10Andrew Bogott: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:46:34] (03PS8) 10Andrew Bogott: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:54:55] (03PS9) 10Andrew Bogott: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:55:41] (03CR) 10jerkins-bot: [V: 04-1] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:57:01] (03PS10) 10Andrew Bogott: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:57:43] (03CR) 10jerkins-bot: [V: 04-1] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 
(https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [03:59:19] PROBLEM - puppet last run on db1104 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:06:13] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:10:27] 10Operations, 10Wikimedia-Mailing-lists: Unable to send messages to Indian Wikimedia Community mailing list - https://phabricator.wikimedia.org/T220902 (10KCVelaga) 05Stalled→03Resolved a:03KCVelaga The issue has been resolved, the user is now able to send messages to the list. [04:11:27] (03PS11) 10Andrew Bogott: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [04:14:14] (03CR) 10Andrew Bogott: [C: 03+2] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [04:22:42] (03PS1) 10Andrew Bogott: shinken: move some wmcs-specific files to the wmcs profile [puppet] - 10https://gerrit.wikimedia.org/r/503919 [04:23:35] (03CR) 10Andrew Bogott: [C: 03+2] shinken: move some wmcs-specific files to the wmcs profile [puppet] - 10https://gerrit.wikimedia.org/r/503919 (owner: 10Andrew Bogott) [04:25:49] RECOVERY - puppet last run on db1104 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:32:00] (03PS1) 10Andrew Bogott: Rename role::labs::shinken to role::wmcs::shinken [puppet] - 10https://gerrit.wikimedia.org/r/503920 [04:32:41] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:32:54] (03CR) 10Andrew Bogott: [C: 03+2] Rename role::labs::shinken to role::wmcs::shinken [puppet] - 10https://gerrit.wikimedia.org/r/503920 (owner: 10Andrew Bogott) [04:57:24] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1007 [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/503921 (https://phabricator.wikimedia.org/T210725) [04:58:43] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503921 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [04:59:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503921 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:07:06] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Marostegui) This server crashed again: ` ------------------------------------------------------------------------------- Record: 2 Date/Time: 04/13/2019 12:33:55 Source: system Severity: Crit... [05:07:56] !log powercycle mw1280 (crashed) [05:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:57] RECOVERY - EDAC syslog messages on wtp2013 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [05:09:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503921 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:10:11] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [05:24:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503923 [05:25:20] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503923 (owner: 10Marostegui) [05:26:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503923 (owner: 10Marostegui) [05:27:27] RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [05:31:31] !log Upgrade db1100 [05:31:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503923 (owner: 10Marostegui) [05:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:43] 10Operations, 10Scap, 10cloud-services-team: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap - https://phabricator.wikimedia.org/T220931 (10Marostegui) [05:38:58] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503924 [05:46:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503924 (owner: 10Marostegui) [05:47:31] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503924 (owner: 10Marostegui) [05:53:34] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503924 (owner: 10Marostegui) [05:55:49] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503925 [06:08:34] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503925 (owner: 10Marostegui) [06:09:31] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503925 (owner: 10Marostegui) [06:11:25] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:11:53] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [06:11:57] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:16:07] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503925 (owner: 10Marostegui) [06:18:49] <_joe_> except I reach wikitech-static with no issues [06:20:29] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Joe) I'm setting it to inactive until we know how the request to Dell goes. @Cmjohnson let us know when you know more. [06:23:07] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Joe) On second thoughts, this is an API server, of which we have just a few right now. I'll avoid depooling it if not strictly necessary. [06:26:11] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:29:21] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 27.40 ms [06:29:47] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33523 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:30:17] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33523 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:31:33] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/Lets_Encrypt_Authority_X3.crt] [06:33:17] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:33:29] morning [06:45:44] (03PS2) 10Muehlenhoff: Pull in buster udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/503027 (https://phabricator.wikimedia.org/T213527) [06:48:34] (03CR) 10Muehlenhoff: [C: 03+2] Pull in buster udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/503027 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [06:50:20] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [06:51:01] (03CR) 10jerkins-bot: [V: 04-1] Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [06:52:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM overall." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [06:57:57] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:58:01] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) I have left ifstat running on all... [06:59:41] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:13:26] (03PS11) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [07:23:30] (03PS12) 10Fsero: puppet exec { } doesn't like a bash builtin. 
[puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [07:25:20] (03CR) 10Fsero: [C: 03+2] "fixed nitpicks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [07:29:26] (03CR) 10Marostegui: [C: 03+1] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [07:34:11] (03CR) 10Muehlenhoff: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [07:38:38] (03PS1) 10Muehlenhoff: Remove access for pbj [puppet] - 10https://gerrit.wikimedia.org/r/503931 [07:40:10] (03CR) 10Marostegui: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [07:40:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for pbj [puppet] - 10https://gerrit.wikimedia.org/r/503931 (owner: 10Muehlenhoff) [07:44:59] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission, 10Patch-For-Review: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Gehel) Removing maps from this ticket, since there isn't any work left on our side. @RobH: I'll let you close it when done o... [07:45:53] (03PS1) 10Gehel: maps: check OSM replication lag on all nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/503932 (https://phabricator.wikimedia.org/T198622) [07:46:02] onimisionipe: ^ [07:46:18] (03CR) 10Jcrespo: [C: 04-1] "See comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [07:46:22] fsero: can I puppet-merge your puppet exec patch along? 
[07:50:50] !log ladsgroup@mwmaint1002:~$ mwscript maintenance/initSiteStats.php --wiki=hywwiki --active (T220936) [07:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:57] T220936: Article count does not work on hyw.wiki - https://phabricator.wikimedia.org/T220936 [07:51:15] gehel: looking [07:51:19] RECOVERY - Check systemd state on elastic1027 is OK: OK - running: The system is fully operational [07:52:10] (03CR) 10Mathew.onipe: [C: 03+1] maps: check OSM replication lag on all nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/503932 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [07:52:24] (03CR) 10Gehel: [C: 03+2] maps: check OSM replication lag on all nodes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/503932 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [07:53:28] moritzm, fsero: looks like we have unmerged patches piling up! [07:54:09] gehel: yeah, I pinged fsero above [07:54:17] moritzm: if you are merging, feel free to merge mine, it's a one-liner [07:54:24] moritzm: yep, I saw that [07:54:56] ack, let's maybe wait a bit, the patch of Fabián is applied fleet-wide, so better if he's around [07:55:05] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[07:55:22] yep, no emergency on mine [08:00:44] (03CR) 10Muehlenhoff: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [08:02:25] (03PS1) 10ArielGlenn: add services proxy setup for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/503934 (https://phabricator.wikimedia.org/T220006) [08:03:38] (03CR) 10Marostegui: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [08:19:18] !log updating mediawiki servers in codfw to version 1.8.1 of the PHP extension for wikidiff [08:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:02] (03PS9) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [08:28:26] gehel: moritzm ok to merge sorry! [08:28:38] 10Operations, 10Maps, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (10Mathew.onipe) [08:28:47] fsero: no problem [08:28:48] 10Operations, 10Maps, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (10Mathew.onipe) p:05Triage→03Normal [08:29:13] (03CR) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [08:29:22] !log increase wal_keep_segments on codfw maps master [08:29:23] moritzm: your change seems innocent enough, I'm merging all those [08:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:09] RECOVERY - Unmerged changes on repository puppet 
on puppetmaster1001 is OK: No changes to merge. [08:30:24] gehel: I'll wait till your change is merged and puppet has run before tweaking postgres [08:31:38] onimisionipe: you're good to go [08:32:28] Ok [08:36:13] (03PS1) 10Vgutierrez: authdns: Avoid caching dns-01 challenges [puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) [08:37:30] gehel: ack! [08:42:25] (03CR) 10ArielGlenn: [C: 03+2] add services proxy setup for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/503934 (https://phabricator.wikimedia.org/T220006) (owner: 10ArielGlenn) [08:46:47] (03CR) 10Muehlenhoff: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [08:48:31] (03PS2) 10Vgutierrez: authdns: Avoid caching dns-01 challenges [puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) [08:52:31] (03CR) 10Ema: [C: 03+1] "Fantastic!" 
[puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez) [08:58:55] !log updating mediawiki servers in eqiad to version 1.8.1 of the PHP extension for wikidiff [08:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:50] RECOVERY - puppet last run on registry1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:05:54] (03PS1) 10Fsero: registryha: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/503938 (https://phabricator.wikimedia.org/T214289) [09:06:41] (03CR) 10Vgutierrez: [C: 04-1] authdns: Avoid caching dns-01 challenges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez) [09:07:03] (03CR) 10Fsero: [C: 03+2] registryha: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/503938 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [09:07:23] (03PS2) 10Fsero: registryha: minor fixes [puppet] - 10https://gerrit.wikimedia.org/r/503938 (https://phabricator.wikimedia.org/T214289) [09:07:30] (03CR) 10Ema: [C: 04-1] "Changed my mind! Valentin found out that the minimum valid value for acme_challenge_ttl is 60." 
[puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez) [09:18:00] !log unbanning elastic1029 from cluster [09:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:45] (03CR) 10Jbond: raid: refactor structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [09:23:27] (03PS1) 10ArielGlenn: don't rsync over misc dump lock files to webserver/fallback dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/503940 (https://phabricator.wikimedia.org/T220809) [09:23:38] (03PS19) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [09:23:40] (03PS8) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [09:27:04] (03CR) 10ArielGlenn: [C: 03+2] don't rsync over misc dump lock files to webserver/fallback dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/503940 (https://phabricator.wikimedia.org/T220809) (owner: 10ArielGlenn) [09:28:44] (03CR) 10Jbond: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [09:32:21] does ssacli only work with newer controllers? [09:33:12] paravoid: no, works with gen9 too [09:33:18] and gen8? [09:33:25] I haven't tested those [09:33:41] let me see [09:34:21] paravoid: looks like it does too [09:34:36] I have tested it with a P420i [09:35:38] (03PS6) 10Arturo Borrero Gonzalez: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [09:35:45] so why not just s/hpssacli/ssacli/ everywhere? 
[09:36:11] paravoid: I proposed that on the patch (sort of) [09:36:28] jbond42 moritzm ^ [09:37:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [09:40:55] (03PS1) 10Volans: check_icinga: split configuration in two files [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/503943 [09:41:48] if gen8 also works with ssacli, that would be the best option [09:42:05] 10Operations, 10Scap, 10cloud-services-team: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap - https://phabricator.wikimedia.org/T220931 (10aborrero) By the time of this phab comment, cloudweb2001-dev.wikimedia.org is a spare system waiting to be put into service soon. We can safely... [09:42:19] and also easier to adapt the get_raid_... script [09:42:53] I didn't check, does the dsa one already support it? [09:43:03] no, only hpssacli [09:43:29] ok, then that one is also easier to adapt if it's only s/hpssacli/ssacli/ [09:43:31] when I ran the DSA script it didn't work with a simple sed, but I didn't explore much more than that [09:43:36] but if it's backwards compatible we can still send a patch, if only to be applied in the future [09:45:18] So I have tested a gen8 (db2070) and a gen9 (db1074) [09:45:25] and works fine [09:47:54] marostegui: you tested a ssacli simple show command right? 
not the DSA script [09:48:12] volans: correct, just a couple of commands indeed [09:49:19] (03PS1) 10Volans: icinga: generate config for Icinga meta-monitoring [puppet] - 10https://gerrit.wikimedia.org/r/503945 [09:50:17] (03PS20) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [09:50:19] (03PS9) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [09:51:31] (03CR) 10Jbond: [C: 03+1] "LGTM one comment" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [09:54:47] 10Operations, 10Scap, 10cloud-services-team: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap - https://phabricator.wikimedia.org/T220931 (10MoritzMuehlenhoff) JFTR; this needs to be dropped from hieradata/common/scap/dsh.yaml [09:56:26] (03PS1) 10Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 [09:56:28] (03PS1) 10Giuseppe Lavagetto: [WiP] Add the loadbalanced_cluster module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [09:58:25] !log installing openssl1.0 security updates [09:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:38] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Add the loadbalanced_cluster module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:02:44] (03PS10) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [10:02:50] (03PS12) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [10:07:31] (03CR) 10Volans: [C: 04-1] "Small nitpicks and a question inline, looks good otherwise." (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [10:07:46] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10aborrero) >>! In T209707#5107316, @Andrew wrote: > Installing cloudvirt1024 with Buster isn't really an option -- we'd have to port all OpenStack packages for versions M and... [10:10:25] (03CR) 10Muehlenhoff: [C: 03+1] "I'm not familiar with zsh autocompletion syntax, but if Chris has reviewed and it works, please go ahead! 
When we push for a more external" [puppet] - 10https://gerrit.wikimedia.org/r/503058 (owner: 10Jbond) [10:11:22] (03PS3) 10Jbond: debdeploy: add zsh autocompletion script [puppet] - 10https://gerrit.wikimedia.org/r/503058 [10:13:23] (03CR) 10Jbond: [C: 03+2] debdeploy: add zsh autocompletion script [puppet] - 10https://gerrit.wikimedia.org/r/503058 (owner: 10Jbond) [10:17:35] (03PS21) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [10:17:37] (03PS11) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [10:17:57] (03CR) 10jerkins-bot: [V: 04-1] cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [10:18:50] akosiaris: hey, around for a deployment of ores now? [10:21:29] (03PS1) 10Fsero: registryha: swift authurl != sotrageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) [10:21:56] (03CR) 10jerkins-bot: [V: 04-1] registryha: swift authurl != sotrageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [10:22:26] (03PS2) 10Fsero: registryha: swift authurl != sotrageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) [10:22:41] <_joe_> fsero: storage, not sotrage [10:22:52] (03CR) 10jerkins-bot: [V: 04-1] registryha: swift authurl != sotrageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [10:23:04] sotrage is more important! 
[10:23:37] (03PS3) 10Fsero: registryha: swift authurl != storageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) [10:24:26] (03CR) 10Fsero: [C: 03+2] registryha: swift authurl != storageURL [puppet] - 10https://gerrit.wikimedia.org/r/503949 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [10:30:05] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T1030). [10:31:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Small question, but otherwise LGTM. I didn't run it through the compiler, this patch probably merits a full catalog run before merging." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:32:10] (03CR) 10Jbond: [C: 03+1] Add search capability (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [10:32:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503955 (https://phabricator.wikimedia.org/T128546) [10:32:51] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503955 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:52] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503955 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:19] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503957 (https://phabricator.wikimedia.org/T128546) [10:35:11] (03CR) 10Giuseppe Lavagetto: "I would prefer we just remove any mention of the packages and I remove them in a more controlled manner via cumin rather than via puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [10:35:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Adding a -1 to reflect that." [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [10:37:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, let me (or Dzahn, who's in CC here) when it's ok to merge this change." [puppet] - 10https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [10:40:16] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503957 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:41:12] (03PS18) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [10:41:22] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503957 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:41:45] (03CR) 10Mathew.onipe: Add wdqs data transfer cookbook (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [10:42:40] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503957 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:48:19] hmmm... I'm doing the portals deployment and it looks like scap is hung up on `sync-apaches: 99% (ok: 262; fail: 0; left: 1)` [10:49:50] `ssh: connect to host cloudweb2001-dev.wikimedia.org port 22: Connection timed out` not sure what the issue is there... [10:50:04] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) >>! 
In T211488#5107374, @Krinkle wrote: > @Joe Regarding opcache, I'm not sure why this change for manual reloadi... [10:51:15] (03PS1) 10Muehlenhoff: Remove obsolete rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/503961 [10:51:57] (03PS1) 10Fsero: registryha: replication configuration needs to omit 'cluster_' from cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/503962 (https://phabricator.wikimedia.org/T214289) [10:52:59] (03CR) 10Fsero: [C: 03+2] registryha: replication configuration needs to omit 'cluster_' from cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/503962 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [10:54:43] (03PS6) 10Giuseppe Lavagetto: Avoid redirects from HTTPS to HTTP and back to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [10:58:33] (03CR) 10Volans: [C: 04-1] "I don't think drain is correct here, see details inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [10:59:28] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 208 and 0 seconds [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T1100). [11:00:04] Ammarpad, Ammarpad, revi, Daimona, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:14] o/ [11:00:18] (03CR) 10Volans: Add search capability (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:00:33] o/ [11:00:46] Amir1: want to run SWAT today? ;) [11:00:48] hi, ping me when it's time for my patch [11:01:08] zeljkof: I can do only half of it. [11:01:17] I need to leave in half an hour [11:02:03] Amir1: go ahead with your patch then, and feel free to deploy what you can from the rest of the patches, ping me when you have to go, I'll deploy what's left [11:02:18] noted [11:02:26] and thanks, I'm in the middle of something else, this will give me time to finish it up [11:02:47] (03CR) 10Volans: [C: 03+2] Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:03:12] revi: I'm deploying your patch, is it testable on mwdebug1002? [11:03:43] Should be [11:03:45] (03PS6) 10Ladsgroup: Add enwiki to azwiki import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104) (owner: 10Revi) [11:03:59] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104) (owner: 10Revi) [11:04:42] (03Merged) 10jenkins-bot: Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:05:08] (03Merged) 10jenkins-bot: Add enwiki to azwiki import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104) (owner: 10Revi) [11:05:21] gimme a sec while I turn on my damned chrome :P [11:05:21] (03CR) 10jenkins-bot: Add enwiki to azwiki import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104) (owner: 10Revi) [11:05:32] and re-login as -revi instead of Revi 
(WMF).... [11:05:46] revi: it's live there, take your time [11:05:48] Amir1: I am now [11:06:04] revi: (btw. congrats on being hired by WMF \o/) [11:06:15] I've been here for a year and 8 moths [11:06:16] months* [11:06:17] lol [11:06:19] too late congrats [11:06:33] akosiaris: now I'm doing swat, we can coordinate for later [11:06:47] revi: that's long :D why I never noticed [11:06:48] debug1002, right? [11:06:51] yup [11:06:55] confirmed [11:07:40] Amir1: LGTM [11:07:50] okie dokie [11:09:12] (Please ping me when it's time for my patch, I'm having lunch. Thanks!) [11:09:45] and as an offtopic I think I didn't advertise my WMF stuff that loudly except for my SE :P [11:10:16] (03CR) 10Volans: [C: 03+2] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:11:18] (03PS19) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [11:11:57] (03PS2) 10Ladsgroup: Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) (owner: 10Daimona Eaytoy) [11:12:04] Daimona: you're next [11:12:12] (03Merged) 10jenkins-bot: DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:12:15] (03PS1) 10Fsero: registryha: registry needs the specific swift authURL [puppet] - 10https://gerrit.wikimedia.org/r/503966 (https://phabricator.wikimedia.org/T214289) [11:12:35] hmm one of apaches is pretty unhappy [11:12:44] it hung [11:12:56] (03CR) 10Fsero: [C: 03+2] registryha: registry needs the specific swift authURL [puppet] - 10https://gerrit.wikimedia.org/r/503966 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [11:12:56] :OOOOOOOO [11:13:20] jan_drewniak had the same issue [11:13:30] (03CR) 
10Volans: [C: 03+2] Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:13:48] dear ops, it seems cloudweb2001-dev.wikimedia.org is unresposive [11:14:03] Here I am [11:14:23] Amir1: T220931 [11:14:24] T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap - https://phabricator.wikimedia.org/T220931 [11:14:52] oh thanks [11:15:06] Daimona: is it testable in mwdebug1002 [11:15:07] right? [11:15:12] I guess I am done, right? [11:15:17] revi: yup [11:15:19] then dinner time! seeya [11:15:22] (03PS1) 10Muehlenhoff: Add qemu processes/Nova instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/503967 (https://phabricator.wikimedia.org/T135991) [11:15:23] have a nice day! [11:15:24] Well, sort of [11:15:29] (03Merged) 10jenkins-bot: Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [11:15:47] I can look at whether the new user group is listed in a few special pages [11:15:47] (03CR) 10Ladsgroup: [C: 03+2] Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) (owner: 10Daimona Eaytoy) [11:16:09] Yeah, and if you're a 'crat, you can check if you can add it or not [11:16:10] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational [11:16:17] I usually just look at Special:ListUserGroups [11:16:35] or whatever the official name is [11:16:37] I'm not crat :/ [11:16:39] Yep [11:16:47] but you can see the assignments [11:16:49] (03Merged) 10jenkins-bot: Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) (owner: 10Daimona Eaytoy) [11:16:52] let's check the special page for user groups [11:16:59] And also 
special:userrights [11:17:02] https://it.wikipedia.org/wiki/Special:UserRights/-revi [11:17:03] (03CR) 10jenkins-bot: Add botadmin group on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503753 (https://phabricator.wikimedia.org/T220915) (owner: 10Daimona Eaytoy) [11:17:05] should have something like that [11:17:08] ^ [11:17:16] if you have... correct permissions [11:17:20] <_< [11:17:40] you need a permission to edit one of the user rights to see the "what I can grant" and "What I can't", it seems [11:17:52] Daimona: it's live in mwdebug [11:17:56] mwdebug1002 [11:17:59] Testing [11:18:08] (03PS13) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [11:18:35] (03CR) 10Jbond: puppet: Refactor of the base::puppet class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:19:15] Amir1 Yay! Checked all special pages that came to mind and everything seems fine [11:19:22] nice [11:21:02] going live [11:22:38] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:24:23] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503270 (https://phabricator.wikimedia.org/T219871) (owner: 10Ladsgroup) [11:25:01] (03CR) 10Lucas Werkmeister (WMDE): "Hm, even if that wasn’t the problem on idwiki, this still looks like a good idea to me? 
Or is Wikibase already supposed to check if mapfra" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:25:25] (03Merged) 10jenkins-bot: Add Western Armenian Wikipedia to wmf-config/InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503270 (https://phabricator.wikimedia.org/T219871) (owner: 10Ladsgroup) [11:26:14] (03CR) 10Gehel: Add wdqs data transfer cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:26:53] !log rolling restart of HHVM/Apache on labweb* to pick up OpenSSL update [11:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:58] (03CR) 10jenkins-bot: Add Western Armenian Wikipedia to wmf-config/InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503270 (https://phabricator.wikimedia.org/T219871) (owner: 10Ladsgroup) [11:27:38] looking good on mwdebug1002, moving forward [11:28:23] (03CR) 10Volans: Add wdqs data transfer cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:29:24] (03PS20) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [11:29:50] (03PS1) 10Fsero: registryha: missing redis addr on healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/503968 (https://phabricator.wikimedia.org/T214289) [11:30:08] (03CR) 10Mathew.onipe: "for now, I'll just sleep for 3 min before stopping the services." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [11:30:44] (03CR) 10Fsero: [C: 03+2] registryha: missing redis addr on healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/503968 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [11:32:25] zeljkof: I'm done, everything is done except the patches for ammarpad, they seem not to be around [11:33:26] Amir1: in that case, swat is done! [11:33:29] thanks :) [11:35:13] !log EU swat is done [11:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) [11:35:58] (03CR) 10Volans: "Looks good, few minor nits inline." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto) [11:36:06] (03CR) 10jerkins-bot: [V: 04-1] openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:37:33] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) [11:38:54] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) [11:40:35] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:43:03] (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] 
- 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) [11:43:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: db: use mysql-server directly instead of mariadb [puppet] - 10https://gerrit.wikimedia.org/r/503969 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [11:46:19] (03CR) 10Jbond: "here is another compiler run (still running)" [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:46:48] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:16] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:58:34] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17009752 and 0 seconds [11:59:03] hmmm [11:59:29] !log pointing boron docker builds to the new registry temporarily (docker builds on boron might fail) [11:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: keystone: delete invalid relationship [puppet] - 10https://gerrit.wikimedia.org/r/503975 (https://phabricator.wikimedia.org/T220096) [12:09:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: keystone: delete invalid relationship [puppet] - 10https://gerrit.wikimedia.org/r/503975 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo Borrero Gonzalez) [12:13:14] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:13:49] (03CR) 10Jbond: Add prometheus interface to spicerack (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [12:14:12] PROBLEM - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The 
system is operational but one or more units failed. [12:14:54] !log rolling restart of HHVM/Apache on deployment servers to pick up OpenSSL update [12:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:18] RECOVERY - puppet last run on cloudcontrol2001-dev is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:41] (03CR) 10Jbond: Initial Kerberos KDC/kadmin server profiles/roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [12:28:45] (03CR) 10Reedy: "That's fine by me. Just historically (as evident by your prior CR+1 ;)) this is how we tried to get rid of the packages :)" [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [12:29:33] (03PS6) 10Reedy: Stop installing pear packages on MW Application Servers [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) [12:31:15] (03PS1) 10Arturo Borrero Gonzalez: openstack: deisgnatemakedomain: factorize [puppet] - 10https://gerrit.wikimedia.org/r/503989 [12:32:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: deisgnatemakedomain: factorize [puppet] - 10https://gerrit.wikimedia.org/r/503989 [12:32:48] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/503989 (owner: 10Arturo Borrero Gonzalez) [12:34:23] (03PS1) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/503992 [12:36:48] (03PS2) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/503992 [12:38:32] (03PS1) 10Fsero: registryha: typo on template for redis password [puppet] - 10https://gerrit.wikimedia.org/r/503994 (https://phabricator.wikimedia.org/T214289) [12:40:17] (03CR) 10Muehlenhoff: [C: 03+2] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/503992 (owner: 10Muehlenhoff) [12:40:18] (03CR) 10Fsero: [C: 03+2] registryha: typo on template 
for redis password [puppet] - 10https://gerrit.wikimedia.org/r/503994 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [12:40:37] (03PS2) 10Fsero: registryha: typo on template for redis password [puppet] - 10https://gerrit.wikimedia.org/r/503994 (https://phabricator.wikimedia.org/T214289) [12:40:53] (03CR) 10Arturo Borrero Gonzalez: "PCC for some VMs as expected: https://puppet-compiler.wmflabs.org/compiler1002/15770/" [puppet] - 10https://gerrit.wikimedia.org/r/503989 (owner: 10Arturo Borrero Gonzalez) [12:42:51] (03PS3) 10Arturo Borrero Gonzalez: openstack: deisgnatemakedomain: factorize [puppet] - 10https://gerrit.wikimedia.org/r/503989 [12:43:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC for physical hosts as expected: https://puppet-compiler.wmflabs.org/compiler1002/15772/" [puppet] - 10https://gerrit.wikimedia.org/r/503989 (owner: 10Arturo Borrero Gonzalez) [12:44:21] (03PS10) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:46:48] (03CR) 10Alaa Sarhan: "> Patch Set 9:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [12:47:05] (03PS6) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) [12:47:44] (03PS11) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:48:18] (03PS1) 10Hashar: cassandra: fix spec service provider [puppet] - 10https://gerrit.wikimedia.org/r/503996 [12:50:15] !log restarting Apache on matomo1001 to pick up OpenSSL update [12:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] (03PS2) 10Gehel: Repair 
elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [12:57:45] (03CR) 10Gehel: [C: 03+2] Repair elasticsearch per-node latency reporting [puppet] - 10https://gerrit.wikimedia.org/r/503718 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [12:59:19] !log restarting archiva on archiva1001 for OpenJDK security update [12:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:22] (03PS1) 10Ema: cache: explicitly pass cache_route to varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/504003 (https://phabricator.wikimedia.org/T219967) [13:02:29] (03CR) 10Hoo man: "> Hm, even if that wasn’t the problem on idwiki, this still looks like a good idea to me? Or is Wikibase already supposed to check if mapf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [13:04:31] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220909 (10CDanis) 05Open→03Invalid [13:09:26] !log installing wget security updates on trusty hosts [13:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add qemu processes/Nova instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/503967 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:13:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503967 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:13:56] (03CR) 10Ema: "pcc output looks good (in practice, a noop) https://puppet-compiler.wmflabs.org/compiler1002/15774/" [puppet] - 10https://gerrit.wikimedia.org/r/504003 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:15:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/504003 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:17:03] (03CR) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:17:15] (03PS5) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 [13:17:17] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) regarding `enable_dl`, I did some investigating: - The documentation on php.net says it's removed since php 7.0... [13:18:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [13:19:30] (03CR) 10Ema: [C: 03+2] cache: explicitly pass cache_route to varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/504003 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:19:47] gehel: ^ seems ready to merge [13:20:06] volans: thanks! [13:20:20] let's see if it works before calling victory ;) [13:26:26] (03PS1) 10Marostegui: dsh.yaml: Remove cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/504005 (https://phabricator.wikimedia.org/T220931) [13:27:04] (03PS1) 10Jbond: facter3/puppet5: upgrade puppet and facter on canary hosts. [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) [13:27:30] (03CR) 10jerkins-bot: [V: 04-1] facter3/puppet5: upgrade puppet and facter on canary hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:27:32] (03CR) 10Muehlenhoff: [C: 03+1] dsh.yaml: Remove cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/504005 (https://phabricator.wikimedia.org/T220931) (owner: 10Marostegui) [13:30:52] (03CR) 10Ottomata: [C: 03+1] oozie: override the oozie-setup script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/503266 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [13:31:03] (03PS2) 10Marostegui: dsh.yaml: Remove cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/504005 (https://phabricator.wikimedia.org/T220931) [13:31:53] (03CR) 10Marostegui: [C: 03+2] dsh.yaml: Remove cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/504005 (https://phabricator.wikimedia.org/T220931) (owner: 10Marostegui) [13:32:05] (03CR) 10Elukey: [C: 03+2] oozie: override the oozie-setup script [puppet/cdh] - 10https://gerrit.wikimedia.org/r/503266 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [13:32:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:34:34] (03PS2) 10Jbond: facter3/puppet5: upgrade puppet and facter on canary hosts. [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) [13:35:14] (03PS1) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/504008 [13:35:38] (03CR) 10Elukey: [C: 03+1] "gogogogo! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/502511 (owner: 10Muehlenhoff) [13:35:51] (03CR) 10Elukey: [C: 03+2] Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/504008 (owner: 10Elukey) [13:35:56] (03PS22) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:35:58] (03PS12) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [13:36:02] 10Operations, 10Scap, 10cloud-services-team, 10Patch-For-Review: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap - https://phabricator.wikimedia.org/T220931 (10Marostegui) 05Open→03Resolved a:03Marostegui After merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/50... [13:42:10] (03PS3) 10Jbond: facter3/puppet5: upgrade puppet and facter on canary hosts. [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) [13:42:16] !log reboot ms-be1013 [13:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:00] (03PS4) 10Jbond: facter3/puppet5: upgrade puppet and facter on canary hosts. [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) [13:44:56] jbond42: s/.$// :) [13:45:15] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504011 (https://phabricator.wikimedia.org/T188327) [13:45:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:45:37] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504011 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:46:42] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504011 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:47:35] (03PS1) 10Volans: Updated src to v0.1.9 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/504013 [13:51:16] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504011 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:51:22] paravoid: "s/.$//" bit to cryptic for me, dont see any trailing white space of anything elses, car to hit me with the clue stick :) [13:51:50] sorry, s/\.$// I meant ;), in the first line of the commit message of https://gerrit.wikimedia.org/r/504006 [13:51:54] nitpicking, apologies :) [13:52:35] (03PS5) 10Jbond: facter3/puppet5: upgrade puppet and facter on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) [13:52:47] aah got it now, cheers :) [13:53:00] (03CR) 10CDanis: icinga: generate config for Icinga meta-monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503945 (owner: 10Volans) [13:55:34] (03CR) 10Volans: "replies inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503945 (owner: 10Volans) [13:55:38] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: upgrade puppet and facter on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/504006 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:55:40] !log start ms-be1013 decom - T220590 [13:55:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:47] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [13:56:13] !log restart tilerator / kartotherian on all maps servers for openssl update [13:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:16] !log upgrading puppet 4 -> 5 and facter 2 -> 3 on mediawiki::canary_appserver, mediawiki::appserver::canary_api and cache::cache roles [13:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:11] (03PS23) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [13:59:13] (03PS13) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [14:00:46] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10fgiunchedi) Tried to reboot the host in the hope the controller freaked out and a reboot would "fix" it or at least reset. However the host isn't coming back, and console says `No more sessions are available fo... 
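On the s/.$// versus s/\.$// exchange between 13:44 and 13:52: in sed-style regular expressions an unescaped `.` matches any single character, so `s/.$//` strips whatever the last character happens to be, while `s/\.$//` strips only a literal trailing period. A minimal Python sketch of that difference follows, using the commit subject of https://gerrit.wikimedia.org/r/504006 as sample input. The two helper names are hypothetical and exist only for this illustration; Python's `re.sub` stands in for sed here.

```python
import re


def strip_last_char(subject: str) -> str:
    # Mirrors sed 's/.$//': the unescaped '.' matches ANY final character.
    return re.sub(r".$", "", subject)


def strip_trailing_period(subject: str) -> str:
    # Mirrors sed 's/\.$//': only a literal period at the end is removed.
    return re.sub(r"\.$", "", subject)


with_period = "facter3/puppet5: upgrade puppet and facter on canary hosts."
without_period = "facter3/puppet5: upgrade puppet and facter on canary hosts"

# Both forms agree when the subject really ends in a period...
print(strip_trailing_period(with_period))
print(strip_last_char(with_period))

# ...but only the escaped form leaves an already clean subject alone.
print(strip_trailing_period(without_period))
print(strip_last_char(without_period))
```

The unescaped form is harmless on the first line of that commit message only because it did end in a period; applied a second time it would start eating the subject itself.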
[14:00:51] (03PS2) 10Volans: icinga: generate config for Icinga meta-monitoring [puppet] - 10https://gerrit.wikimedia.org/r/503945 [14:04:29] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:04:41] PROBLEM - Maps HTTPS on maps1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:04:50] !log rebooting ms-fe1005 for combined kernel/glibc/OpenSSL update [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:15] PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:08:09] PROBLEM - Maps HTTPS on maps1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:09:09] gehel: ^ [14:09:28] looking [14:09:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, I also checked on mw1261 that no other deps in the installed package set declare a dependency." 
[puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [14:09:53] (03PS14) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [14:10:58] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.13 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:11:05] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:11:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:11:37] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:11:37] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:11:51] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:11:53] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:12:05] karthoterian reboots gone sideways? 
[14:12:14] ge hel is looking [14:12:17] yep, no idea why, but eqiad is toasted [14:12:26] * gehel hates maps [14:12:37] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:12:37] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:12:51] * onimisionipe understands [14:12:58] <_joe_> [2019-04-15T14:07:23.934Z] ERROR: kartotherian/146 on maps1002: Bad geojson - unknown type object (err.levelPath=error) [14:13:01] kk, gehel shout if help is needed [14:13:06] Apr 15 14:12:57 maps1002 tilerator[29410]: #033]0;firejail /usr/bin/nodejs src/server.js -c /etc/tilerator/config.yaml #007Error while reading config file: Error: EACCES: permission denied, open '/etc/tilerator/config.yaml' [14:13:09] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.13:6533, 10.2.2.13:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:13:23] if anyone can switch maps to codfw while I dig in the logs, would be great! 
[14:13:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:13:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:13:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:13:43] <_joe_> gehel: ack [14:13:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:13:59] <_joe_> I'm moving the discovery record [14:14:05] gehel: looks like a recent deployment means tilerator config is readable only by deploy-service user [14:14:07] looking at maps1002 [14:14:09] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.13:6533, 10.2.2.13:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:14:11] I'm prepping the varnish change [14:14:17] gehel: Only two nodes can receive traffic now at codfw. 
[14:14:23] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [14:14:26] the other two are initializing [14:14:42] cdanis: right [14:14:43] I could pool one of them...to reduce the load [14:15:38] (03PS1) 10Ema: cache_upload: switch maps to codfw only [puppet] - 10https://gerrit.wikimedia.org/r/504015 [14:15:54] onimisionipe: we should be good for the read load, but keep an eye on it, it's the write load that is problematic [14:16:10] alright. will do [14:16:17] gehel: what do you think about chmod -R a+r /srv/deployment/kartotherian /srv/deployment/tilerator [14:16:34] cdanis: doing it right now [14:16:54] gehel: ok to merge? https://gerrit.wikimedia.org/r/504015 [14:17:08] <_joe_> cdanis: please go on [14:17:10] gehel: I did it on both maps1002 and maps1003 [14:17:12] (03CR) 10Gehel: [C: 03+1] cache_upload: switch maps to codfw only [puppet] - 10https://gerrit.wikimedia.org/r/504015 (owner: 10Ema) [14:17:23] <_joe_> on all servers with cumin :) [14:17:26] <_joe_> and disable puppet [14:17:30] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:17:33] (03CR) 10Ema: [C: 03+2] cache_upload: switch maps to codfw only [puppet] - 10https://gerrit.wikimedia.org/r/504015 (owner: 10Ema) [14:17:37] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:17:40] <_joe_> ok so, this is the problem [14:17:47] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.026 second response time 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:17:59] <_joe_> remember to disable puppet! [14:18:00] tilerator and kartotherian services look good on those hosts now [14:18:07] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:18:09] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:18:11] !log resetting permissions on maps servers for /srv/deployment/kartotherian and /srv/deployment/tilerator [14:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:14] did it on all maps* hosts with cumin [14:18:16] (03PS1) 10Ottomata: eventgate-analytics - Tweak kafka producer poll and batch settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/504016 (https://phabricator.wikimedia.org/T220661) [14:18:23] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:18:23] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'maps*' 'sudo chmod -R a+r /srv/deployment/tilerator /srv/deployment/kartotherian' [14:18:25] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:35] RECOVERY - Maps HTTPS on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:18:44] _joe_: you think it is puppet and not scap resetting the permissions? 
[14:18:47] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:18:50] ema: wait a sec before switching traffic, we might be good [14:18:54] <_joe_> cdanis: it's both [14:18:59] RECOVERY - Maps HTTPS on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:18:59] RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:19:09] PROBLEM - kartotherian endpoints health on maps1004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:19:17] <_joe_> or better, it's puppet that has probably modified config in a way that fucks up permissions [14:19:21] gehel: too late, merged already! [14:19:27] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Tweak kafka producer poll and batch settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/504016 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [14:19:27] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:28] ema: np [14:19:33] <_joe_> ema: we can revert I guess :D [14:19:45] ok I will disable puppet on maps hosts for now [14:20:07] cdanis: thanks! 
[14:20:19] cdanis: maps eqiad please [14:20:23] RECOVERY - kartotherian endpoints health on maps1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:20:30] onimisionipe: rgr [14:21:04] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10CDanis) [14:21:49] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'maps1*' "disable-puppet 'bad permissions - T220982 - cdanis'" [14:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] T220982: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 [14:22:14] !log T220982 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'maps1*' 'sudo chmod -R a+r /srv/deployment/tilerator /srv/deployment/kartotherian' [14:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:55] cdanis: cumin runs as root, no need for sudo ;) [14:23:09] volans: habit, and also sometimes I wish it didn't :) [14:23:19] eh [14:23:33] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (10Joe) as I commented in the parent ticket, enable_dl should be off in production, and given HHVM didn't support it this should not create any issue. 
[14:23:45] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:23:49] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10Gehel) permissions reset via: cumin 'A:maps' 'chmod -R a+r /srv/deployment/kartotherian' cumin 'A:maps' 'chmod -R a+r /srv/deployment/tilerator' [14:23:53] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Cmjohnson) @joe Good news is I have already ordered the DIMM from the previous failure and it's on-site. I can do this tomorrow afternoon (my time) if you can depool it then. [14:24:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:39] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:25:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:25:41] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:27:22] cdanis (and all): thanks for the help! [14:28:27] np, just a lucky quick find in the logs :) [14:29:01] gehel: should we revert https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504015/ ? 
:) [14:29:17] ema: yep, we're back to normal [14:29:19] k [14:29:24] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10CDanis) @Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready [14:29:27] (03PS1) 10Ema: Revert "cache_upload: switch maps to codfw only" [puppet] - 10https://gerrit.wikimedia.org/r/504017 [14:29:29] <_joe_> so so /etc/tilerator/config-vars.yaml is owned by deploy-service:deploy-service [14:29:47] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:30:14] (03CR) 10Ema: [C: 03+2] Revert "cache_upload: switch maps to codfw only" [puppet] - 10https://gerrit.wikimedia.org/r/504017 (owner: 10Ema) [14:30:28] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki) @cdanis tx! [14:30:34] <_joe_> and ditto for the file generated from it [14:30:35] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:30:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:31:00] and why isn't that a problem during a scap deploy? 
[14:31:01] <_joe_> which is now 440 instead of 444 as it used to be [14:31:15] <_joe_> because scap runs as deploy-service [14:31:19] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [14:31:45] so maybe puppet permission is not aligning? [14:31:48] Oh, puppet reset ownership, but not perms [14:32:22] Have there been any recent changes to anything that could affect meta? [14:32:23] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:32:41] 10Operations, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10jbond) p:05Triage→03Normal [14:32:46] <_joe_> Zppix: are you here reporting a bug? [14:32:50] (03PS2) 10Muehlenhoff: Add qemu processes/Nova instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/503967 (https://phabricator.wikimedia.org/T135991) [14:32:51] Zppix: Can you ask a more vague question? [14:33:01] 10Operations, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10jbond) [14:33:10] Reedy: well the logo for meta on the bottom of the pages in the footer has seemingly disappeared [14:33:27] https://prnt.sc/nc7vh1 https://prnt.sc/nc7xp3 [14:33:53] WFM. Check your browser console and refresh the page? [14:34:09] <_joe_> can this conversation move elsewhere? [14:34:27] <_joe_> we're in the middle of an incident [14:34:31] Sorry! [14:34:41] <_joe_> cdanis: are there servers you didn't fix? 
[14:35:02] _joe_: I fixed the perms on all maps servers, including codfw [14:35:03] <_joe_> I want to verify wtf is happening [14:35:05] I fixed 'maps*', and then re-fixed 'maps1*' after disabling puppet on those [14:36:46] (03PS1) 10Urbanecm: Enable subpages by default for a few more extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504018 (https://phabricator.wikimedia.org/T220950) [14:37:34] <_joe_> I am going to reenable puppet on maps1001 [14:37:44] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) >>! In T220387#5093826, @elukey wrote: > One thing that we didn't discuss for this goal is Zookeeper. For the purposes o... [14:37:50] <_joe_> I don't think it will change anything [14:37:54] (03PS1) 10Jcrespo: mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115) [14:38:11] <_joe_> but I'd like to run a scap deploy-local there [14:38:15] <_joe_> to verify nothing changed [14:38:23] (03CR) 10Muehlenhoff: [C: 03+2] Add qemu processes/Nova instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/503967 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:38:33] <_joe_> or better, that the problem won't show up next time you change something [14:38:58] <_joe_> yeah I can confirm, puppet changed nothing [14:38:59] so, we're regenerating config as root when config-vars changes (https://github.com/wikimedia/puppet/blob/production/modules/service/manifests/node/config/scap3.pp#L53) [14:39:33] <_joe_> gehel: no we're not [14:39:40] <_joe_> user => $deployment_user [14:39:58] damn, I'm blind [14:40:53] <_joe_> !log running apply-config-kartotherian on maps1001 [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] <_joe_> 
the file is now correctly 0644 [14:41:18] <_joe_> so uhm [14:41:41] <_joe_> that script DTRT [14:41:43] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2546 MB (5% inode=63%) [14:41:51] <_joe_> I have no idea what happened there [14:42:10] what about apply-config-tilerator? [14:42:54] <_joe_> it should do the same AFAICS [14:43:25] <_joe_> !log running apply-config-tilerator on maps1001 [14:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:50] RECOVERY - Disk space on contint1001 is OK: DISK OK [14:48:09] 10Operations, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10herron) > The intention is that ulogd logs from all servers will be sent to kafaka as such it would seem to make senses to move ::profile::rsyslog::kafka_shipper to the... [14:48:15] (03PS1) 10Jcrespo: mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115) [14:48:44] (03CR) 10Marostegui: [C: 03+1] mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [14:49:13] <_joe_> gehel: you had just restarted the services, right? 
[14:49:19] yep [14:49:21] <_joe_> so I would suggest to try to do a scap deploy [14:49:27] <_joe_> after having puppet run everywhere [14:49:40] sounds like a plan [14:49:41] <_joe_> ofc first do it on just the canary host and go check it's ok [14:49:42] (03CR) 10Muehlenhoff: [C: 03+1] cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [14:50:03] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [14:50:34] any idea why we add a step to reset permissions in apply-config ? It looks to me that scap deploy is also running the same deploy-local [14:52:06] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) [14:54:10] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi) [14:55:19] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T219871) [14:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:23] T219871: Connect Western Armenian Wikipedia to Wikidata - https://phabricator.wikimedia.org/T219871 [14:55:48] !log deploying tilerator to maps1001 to validate deployment is working - T220982 [14:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:52] T220982: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 [14:55:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING 
- free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10fsero) it seems is almost full again. Did you consider to set up a periodic d... [14:56:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Marostegui) Note that this task is about `/srv` and the current issue is with... [14:57:19] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10Gehel) Deployment seems to be a noop: ` gehel@deploy1001:/srv/deployment/tilerator/deploy$ scap deploy --environment stretch --limit-hosts maps1001.eqiad.wmnet "check deployment is working - T22098... [14:57:30] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Marostegui) @fsero maybe comment at {T207702}? [14:58:13] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10fsero) oh sorry @Marostegui i misread :) [14:58:49] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) >>! In T217359#5032835, @elukey wrote: > In the SRE spreadsheet I can see that the s... 
[14:59:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "post merge +1 just for the record :-)" [puppet] - 10https://gerrit.wikimedia.org/r/504005 (https://phabricator.wikimedia.org/T220931) (owner: 10Marostegui) [15:00:00] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Papaul) 05Open→03Resolved This is done. [15:01:32] (03CR) 10Krinkle: "Did you see my question regarding removal and stopping of the service? I'm not sure what's best practice for that and how much we should i" [puppet] - 10https://gerrit.wikimedia.org/r/503675 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [15:02:57] Reedy: wanna do https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/480055/ today? [15:03:36] Could try :) [15:04:07] * Krinkle testing unmerged wmf-config/profiler.php changes on mwlog1001 [15:04:46] Hm... did not expect to find php-1.32.0.• checkouts on mwdebug1001 [15:04:59] Mostly empty? [15:05:11] maybe? [15:05:23] Borked permissions/cleanup scripts I guess [15:06:44] Also, what submodule was there? [15:07:59] something FirefoxOS I think? [15:08:07] should be on record somewhere [15:08:47] <_joe_> Krinkle: I didn't notice your comment, but I prefer in general to undeclare in puppet and deal with decom via cumin [15:08:52] Ah, that crap [15:08:59] <_joe_> re https://gerrit.wikimedia.org/r/c/operations/puppet/+/503675 [15:09:52] Krinkle: So we want to do an existence check for /srv/mediawiki(\-staging)?/docroot/wikimedia.org/WikipediaMobileFirefoxOS [15:11:59] _joe_: ok, so I need to clear opcache on mwdebug1001, I can't find the message you sent me last week. What was the command again? [15:12:17] <_joe_> php7adm /opcache-free [15:12:25] <_joe_> php7adm works like hhvmadm did :) [15:13:46] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:13:57] Can non sre/roots run cumin? [15:14:12] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:17:13] _joe_: never used either of those [15:17:14] _joe_: https://wikitech.wikimedia.org/wiki/Debugging_in_production#Pushing_code_to_a_debug_server [15:17:55] <_joe_> Krinkle: thanks, although changing code in production shouldn't be advised :P [15:19:06] _joe_: I could merge in gerrit, pull down on deploy, and then pull to debug, but that's a lot of work and needlessly risks other people syncing it out (and hard to revert, further amend). But anyway, mwdebug's are cool for this. In theory beta, but not today. [15:19:37] <_joe_> Krinkle: no I mean it's ok if someone with the needed experience does it [15:19:47] <_joe_> I wouldn't give it as general advice though [15:20:36] _joe_: Fancy helping me work out if /srv/mediawiki(\-staging)?/docroot/wikimedia.org/WikipediaMobileFirefoxOS still exists on any of the MW hosts? ;P [15:21:10] <_joe_> Reedy: gimme 2 minutes please [15:21:16] ty. No rush :) [15:22:48] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:24:30] <_joe_> ls: cannot access '/srv/mediawiki/docroot/wikimedia.org/WikipediaMobileFirefoxOS': No such file or directory [15:24:43] !log end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T219871) [15:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:48] T219871: Connect Western Armenian Wikipedia to Wikidata - https://phabricator.wikimedia.org/T219871 [15:26:46] <_joe_> Reedy: so it seems it's not anywhere [15:27:42] sweet [15:27:58] PROBLEM - Check systemd state on ms-be1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:28:47] <_joe_> !log systemctl reset-failed on ms-be1027, debmonitor session [15:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:06] RECOVERY - Check systemd state on ms-be1027 is OK: OK - running: The system is fully operational [15:30:16] !log restarting prometheus-wmf-elasticsearch-exporter-9200 on all elastic nodes [15:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:17] !log restarting prometheus-wmf-elasticsearch-exporter-9* on all elastic nodes [15:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:44] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:33:23] * Krinkle done with mwdebug1001 [15:36:23] Krinkle: We can give it a go... [15:36:32] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:38:58] _joe_: thanks but there is a crontab that fixes it for you IIRC :) [15:39:22] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:40:05] jouncebot: now [15:40:05] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [15:40:07] jouncebot: next [15:40:08] In 1 hour(s) and 19 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T1700) [15:40:44] Reedy: +1 [15:42:09] * Reedy kicks wikibugs [15:43:22] Hi guys, sorry for the interruption. I think I'm in the right place to asks whether the info at https://wikitech.wikimedia.org/wiki/Remove_a_message_from_mailing_list_archive is still current? I'd need to point to it but I want to be sure there's no new process or alternative or something. [15:44:00] Elitre: I think so. 
TLDR mailman still sucks [15:44:22] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 208 and 0 seconds [15:47:13] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds [15:47:53] Krinkle: Sync and hope for the best? ;P [15:53:03] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:53:20] checking --^ [15:53:55] !log add cloud-in4 firewall filter to codfw - T211921 [15:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:03] mcrouter on apis reporting tkos, this is probably the ongoing issue [15:54:46] is scap not logging now? [15:55:10] !log changed /srv/mediawiki/docroot/wikimedia.org to a symlink to standard-docroot [15:55:11] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:17] (memcached issue related to mc1029, indeed it seems the "usual" tx bandwidth saturation) [15:56:16] Nikerabbit: :( [15:57:45] RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational [16:08:02] !log pooling maps2001 - postgres reinit is complete [16:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:04] Reedy: yep, jdi :) [16:09:34] right, wikibugs [16:09:36] it's done [16:09:38] yay [16:21:04] @Reedy do you know if it's possible to unsubscribe someone from all the lists they're into, perhaps? 
[16:22:00] Elitre: Yeah, https://wikitech.wikimedia.org/wiki/Mailman#Remove_an_individual_from_all_mailing_lists [16:26:27] Thanks for the pointer. [16:27:43] You can file a security task/similar and tag operations to get it done [16:30:33] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:46:27] will do if necessary <3 [16:52:39] Sorry for raising the alarm but it seems users are getting db timeout error for checking user contribs in wikidata. This will exhaust the database and might bring down everything [16:54:43] https://logstash.wikimedia.org/goto/276c047b26a0470ec0905e4851a372bc [16:56:15] marostegui: jynus ^ [16:56:21] <_joe_> Amir1: looking at the last 30 minutes I just see one spike of errors at 16:34 [16:56:27] <_joe_> *35 [16:56:32] lots of “contributions page unfiltered” in slow queries on Tendril [16:56:49] <_joe_> so 20 minutes ago [16:56:50] but not limited to wikidatawiki, it seems [16:57:07] could it be anomie's change? [16:57:15] the MCR one? [16:57:23] but that was group1 only, I think [16:57:25] This fatals https://www.wikidata.org/wiki/Special:Contributions/BotMultichill [16:57:32] wikidata is in group1 [16:57:37] Lucas_WMDE, jynus: T220991? [16:57:37] T220991: Slow query "IndexPager::buildQueryInfo (contributions page unfiltered)" after actor rollout - https://phabricator.wikimedia.org/T220991 [16:57:39] ah [16:57:43] then probably we have to revert? [16:57:44] the read new stuff [16:57:44] sounds so [16:57:57] I didn't know it was group1 [16:58:04] me neither [16:58:19] is commonswiki on 1 too? [16:58:35] yup [16:58:42] then +1 to revert [16:59:03] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/504021 should fix it. 
[16:59:12] note I say anomie's change as I think he was the committer, not that he is responsible or anything [16:59:48] honestly, I would prefer if we revert, deploy that, then try again [16:59:53] but up to you [16:59:56] yeah, I agree with that [17:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T1700). [17:00:09] unless you see something to lose other than time [17:00:43] jouncebot: yea.. I know! [17:00:55] Mostly time. But if someone is willing to review that patch now, I'm not seeing the point in reverting just to backport and re-deploy right away. [17:01:18] anomie: Lucas already did a review [17:01:33] anomie: you are in charge :-D, I am done for the day [17:02:08] no I didn’t, I left one remark :p [17:02:35] Fixed Lucas_WMDE's remark [17:05:18] anomie: +2'd [17:05:48] @seen wikibugs [17:05:48] mutante: wikibugs is in here, right now [17:06:09] Amir1: Thanks! [17:08:11] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational [17:09:03] anomie: I'm around if you need anything for test/deploy [17:09:13] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [17:10:03] Amir1: Thanks. I'm backporting the fix now. Then we'll see if that really fixes T220991. If not I'll revert the change, but I think it will. 
[17:10:04] T220991: Slow query "IndexPager::buildQueryInfo (contributions page unfiltered)" after actor rollout - https://phabricator.wikimedia.org/T220991 [17:10:23] it looks correct to me [17:11:23] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:11:33] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [17:12:07] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [17:12:31] Hmm. Krinkle, you merged https://gerrit.wikimedia.org/r/c/mediawiki/core/+/503540 to 1.33.0-wmf.25 but didn't deploy it? [17:12:33] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1007.eqiad.wmnet, logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:13:07] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:13:09] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1007.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:13:22] herron, godog --^ [17:13:22] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:13:25] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 
port 11514 [17:13:47] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [17:13:50] oof looking [17:14:27] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:14:30] on logstash1008 I see a puppet run after which logstash service starts fail-looping [17:14:52] Apr 15 17:11:21 logstash1008 logstash[983]: SLF4J: Class path contains multiple SLF4J bindings. [17:14:54] Apr 15 17:11:21 logstash1008 logstash[983]: SLF4J: Found binding in [jar:file:/usr/share/logstash/logstash-core/lib/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] [17:14:56] Apr 15 17:11:21 logstash1008 logstash[983]: SLF4J: Found binding in [jar:file:/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/vendor/jar-dependencies/runtime-jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] [17:14:58] Apr 15 17:11:21 logstash1008 logstash[983]: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
[17:15:15] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:15:19] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:15:21] !log restarted wikibugs because it stopped talking [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:27] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.001 second response time on 127.0.0.1 port 11514 [17:15:38] shdubsh: ^ [17:15:56] <_joe_> please revert whatever change might have caused it [17:16:13] I'm here too, looking [17:16:33] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:16:37] only recent thing I see in puppet is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500099/ [17:16:38] <_joe_> lemme check if this is affecting mediawiki in any way [17:16:43] which looks innocuous ? [17:16:53] <_joe_> indeed [17:16:57] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 [17:17:00] !log restarted logstash on logstash1007 [17:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:03] it's possible that reverting it will not actually help, that it was broken on disk but had not been restarted [17:17:04] !log wdqs deployment is complete! 
for some reason I don't know, scap did not log here [17:17:05] _joe_: it shouldn't btw [17:17:06] reverting [17:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:15] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:17:51] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:18:11] <_joe_> cdanis: on all servers? or it's just 1008? [17:18:14] onimisionipe: i think that normally logs but maybe the bot was broken [17:18:28] _joe_: same error on 1009 [17:18:39] mutante: Oh.. Ok [17:18:42] thanks! [17:18:47] <_joe_> so yeah, not a broken disk [17:18:56] no, I mean -- [17:19:00] something bad gets pushed in the past [17:19:04] but logstash isn't restarted [17:19:13] this puppet config file edit comes along, puppet restarts logstash, logstash now broken [17:19:14] <_joe_> why did we get a recovery? [17:19:21] herron?
[17:19:33] puppet is running the revert now [17:19:55] <_joe_> so logstash seems to be running on 1009 [17:20:08] <_joe_> although it spit out some of those logs [17:20:15] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 [17:20:28] ohhh [17:20:30] that is not the actual error [17:20:33] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:20:38] <_joe_> nope [17:20:39] sorry [17:20:42] Apr 15 17:08:47 logstash1009 logstash[26466]: NameError: undefined local variable or method `mjolnir' for # [17:20:47] so it was that change, ty for revert shdubsh [17:20:50] <_joe_> yes [17:21:25] <_joe_> we would really benefit from using module versions and having a canary puppet environment [17:21:27] it is weird that it only logged that on the first restart [17:21:31] <_joe_> where we test the HEAD of all modules [17:21:39] the later restarts do not log this at all [17:21:39] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:21:52] <_joe_> cdanis: indeed, some race condition? [17:21:56] maybe? [17:22:10] does logstash really take 35 seconds to finish starting up? that is what it looks like [17:22:14] Apr 15 17:08:22 logstash1009 systemd[1]: Started logstash. [17:22:16] Apr 15 17:08:47 logstash1009 logstash[26466]: NameError: undefined local variable or method `mjolnir' for # [17:22:19] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:22:30] cdanis: give or take, it’s not instant [17:22:32] <_joe_> cdanis: welcome to java!
[17:23:12] 10Operations, 10SRE-Access-Requests, 10monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10Dzahn) summons wikibugs [17:23:28] !log wikibugs - qdel'ed jobs and restarted another time, make it rejoin [17:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:19] (03PS3) 10Dzahn: admins: extend access for pbj until May 6th [puppet] - 10https://gerrit.wikimedia.org/r/504064 [17:25:33] (03CR) 10Dzahn: [C: 03+2] admins: extend access for pbj until May 6th [puppet] - 10https://gerrit.wikimedia.org/r/504064 (owner: 10Dzahn) [17:25:39] maybe puppet's restart of logstash could first run logstash --config.test_and_exit before restarting? [17:25:48] similar to what we do for icinga [17:26:18] +1, good idea. we had the issue on icinga itself before adding that [17:26:23] although "Note that grok patterns are not checked for correctness with this flag." [17:27:13] (03CR) 10Giuseppe Lavagetto: profile::docker::builder: add periodic job to prune old images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504029 (owner: 10Giuseppe Lavagetto) [17:28:18] canary deploy approach helps with logstash too, though more time consuming [17:29:09] <_joe_> cdanis: that should even go as a validate_cmd for when we edit files, if applicable [17:29:35] <_joe_> that is, if it's just one config file [17:29:42] <_joe_> we do it for e.g. mcrouter [17:29:54] <_joe_> it's harder to do with apache as we have includes [17:30:37] certainly room for improvement, let’s start a task [17:30:51] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [17:31:09] Krinkle, AaronSchulz: Ping regarding the status of , which seems to be merged to wmf.25 but not deployed. Do you want me to deploy it for you (like a mini-SWAT) or should I revert it per ? [17:31:13] <_joe_> uh? 
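Note on the 17:25:39 suggestion (run logstash --config.test_and_exit before puppet restarts the service): the idea can be sketched roughly as below. This is a minimal illustration, not what puppet or the icinga setup actually does; the function name restart_if_config_valid and both command arguments are hypothetical, and the caller supplies the real check and restart commands.

```python
import subprocess
import sys
from typing import Sequence


def restart_if_config_valid(check_cmd: Sequence[str],
                            restart_cmd: Sequence[str]) -> bool:
    """Run check_cmd and run restart_cmd only if the check exits with 0.

    Hypothetical helper: both commands are supplied by the caller.
    Returns True if the restart was attempted and succeeded, False if the
    check failed and the running service was left untouched.
    """
    check = subprocess.run(check_cmd, capture_output=True, text=True)
    if check.returncode != 0:
        # Leave the currently running (working) service alone.
        print(f"config check failed (exit {check.returncode}); not restarting",
              file=sys.stderr)
        return False
    # check=True raises CalledProcessError if the restart itself fails.
    subprocess.run(restart_cmd, check=True)
    return True
```

For logstash the check command might be something like ["/usr/share/logstash/bin/logstash", "--config.test_and_exit", "--path.settings", "/etc/logstash"] and the restart command ["systemctl", "restart", "logstash"]; those paths are assumptions, not taken from the hosts above. As quoted at 17:26:23, the flag does not check grok patterns, and nothing in the log shows whether it would have caught the mjolnir NameError.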
[17:31:24] Apr 15 17:28:47 logstash1009 systemd[1]: logstash.service: Main process exited, code=exited, status=143/n/a [17:31:28] <_joe_> again logstash1009? [17:31:36] <_joe_> cdanis: what happened before that? [17:31:59] anomie: uh, I could've sworn I rolled that out! [17:32:09] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [17:32:13] anomie: if you have a few minutes, I can roll it out now [17:32:40] Krinkle: Go ahead. My patch is already merged right after it, FYI. [17:33:06] _joe_: i'm not sure [17:33:28] !log LDAP - re-adding 'pbj' to 'nda' group, extended access until May 6th, transparency report contractor [17:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:33] something strange is going on [17:33:41] anomie: OK. I see there's no overlapping code paths. [17:33:43] staging on mwdebug1002 [17:33:51] we have 5 people logged in currently too, did someone simply restart it? [17:34:04] it's been happening more than once [17:34:12] okay dumb question [17:34:14] cdanis, _joe_: logstash1009 is me [17:34:14] Apr 15 17:30:44 logstash1009 systemd[1]: Stopping logstash... [17:34:19] does that mean someone *asked* systemd to restart it? [17:34:36] <_joe_> that would be 2 minutes later than the error you reported [17:34:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) Since that is recurring. Can we check whether we can add a couple disks to the machine? I guess 25... [17:34:50] the same thing has happened multiple times _joe_ [17:35:06] Apr 15 17:30:44 logstash1009 systemd[1]: Stopping logstash... 
[17:35:08] Apr 15 17:30:45 logstash1009 systemd[1]: logstash.service: Main process exited, code=exited, status=143/n/a [17:35:41] <_joe_> yes [17:35:42] cdanis: that is shdubsh's restart [17:35:55] <_joe_> it looks like someone is shutting it down repeatedly [17:36:04] so logstash exits with error when you ask it to shut down, got it [17:36:13] <_joe_> we can try again :) [17:36:16] <_joe_> just to be sure [17:36:19] lol [17:36:25] <_joe_> cdanis: "quality software" [17:36:53] anomie: https://phabricator.wikimedia.org/T220854#5108443 [17:37:00] anomie: what makes you believe it wasn't deployed? [17:37:16] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset readonly indices on elasticsearch clusters - https://phabricator.wikimedia.org/T219799 (10debt) 05Open→03Resolved [17:37:47] Krinkle: When I did `git log HEAD..@{upstream} --stat` on deploy1001, it listed e7cefa6f7. [17:38:25] anomie: oh, right. that makes sense. [17:38:26] Did you scap without actually pulling the change? I've done that more times than I can remember. (: [17:38:39] anomie: I cherry-picked it directly. So your pull would have it new, but rebased away [17:38:50] all good [17:38:53] continue :) [17:38:57] Ah, ok. [17:39:03] I should've done a pull to rebase after my deploy [17:39:58] * anomie considers searching SAL for his log entries that say "for real this time" or the like, but decides it's not worth it. [17:41:31] !log force puppet agent run on maps* after moving config-vars.yaml file for kartotherian, tilerator, tileratorui T220982 [17:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] T220982: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 [17:45:40] Hmm, no scap log message? 
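Note on "status=143" (17:31:24 and 17:35:08): assuming the common 128 + signal-number convention, which the JVM follows when it shuts down on SIGTERM, 143 is 128 + 15, i.e. the process stopped because it was asked to, which matches the "Stopping logstash..." line one second earlier. A small sketch of that decoding; describe_exit_status is a hypothetical helper, not part of any tool mentioned here:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention.

    Assumption: statuses above 128 encode a terminating signal, as shells
    and the JVM do by convention. This is not a systemd API.
    """
    if status == 0:
        return "clean exit"
    if status > 128:
        signum = status - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"terminated by unknown signal {signum}"
    return f"exited with error code {status}"


print(describe_exit_status(143))
```

If that reading is right, the unit only looks failed because systemd treats any non-zero status as an error by default; systemd's SuccessExitStatus= setting exists for this case, though whether the logstash unit here should use it is not something the log settles.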
[17:45:44] !log Backporting fix for T220991 [17:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:48] T220991: Slow query "IndexPager::buildQueryInfo (contributions page unfiltered)" after actor rollout - https://phabricator.wikimedia.org/T220991 [17:45:50] scap logs were missing over weekend too [17:46:19] the last one was 00:32 on 4-13 [17:47:05] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10elukey) 128GB sounds good, the more page cache Kafka has the better :) I'd also think about... [17:47:06] not sure when, but i did a deploy around 05:00 UTC on 4-14, but not logged by scap [17:47:49] Amir1, Lucas_WMDE: The patch is backported. Please help verify the fix if you can. [17:49:23] (03CR) 10Jbond: [C: 03+2] facter3: update check_eth script [puppet] - 10https://gerrit.wikimedia.org/r/504061 (owner: 10Jbond) [17:49:40] sure [17:49:48] anomie: is it on mwdebug1002? [17:50:07] (03PS2) 10Jbond: facter3: update check_eth script [puppet] - 10https://gerrit.wikimedia.org/r/504061 [17:50:16] Amir1: It's everywhere. So far I'm not seeing any 'IndexPager::buildQueryInfo (contributions page unfiltered)' log entries since 17:41:33. 
[17:50:50] anomie: looks good to me, thanks [17:50:57] anomie: this doesn't fatal anymoe https://www.wikidata.org/wiki/Special:Contributions/BotMultichill [17:52:15] \o/ [17:55:04] (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [17:56:10] (03Merged) 10jenkins-bot: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [17:56:37] rebased on deploy1001 [18:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T1800) [18:00:04] odder: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:44] * odder raises his hand with a 'oh, yay!' 
[18:00:56] (03CR) 10jenkins-bot: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [18:02:32] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work), and 2 others: Create checks that alerts on cirrussearch update lags - https://phabricator.wikimedia.org/T219601 (10debt) 05Open→03Resolved a:03debt [18:02:34] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10debt) [18:03:42] 10Operations, 10Cloud-VPS, 10Discovery-Search (Current work), 10Patch-For-Review, 10cloud-services-team (Kanban): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10debt) 05Open→03Resolved [18:04:14] 10Operations, 10Cloud-VPS, 10Discovery-Search (Current work), 10cloud-services-team (Kanban): rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10debt) 05Open→03Resolved [18:09:04] !log LDAP - adding legoktm and qchris to gerritadmin group (T219086) [18:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:08] T219086: Add legoktm to gerritadmin LDAP group (restoring previously held access) - https://phabricator.wikimedia.org/T219086 [18:10:51] (03CR) 10CRusnov: [C: 03+2] Add basic Ganeti RAPI module and tests (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [18:10:58] (03PS9) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 [18:16:00] godog: Before https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502785/ gets merged I'm trying to eliminate v3 puppetmasters. 
You have two: filippo-test-jessie3.monitoring.eqiad.wmflabs and filippo-test-stretch1.monitoring.eqiad.wmflabs [18:16:08] Can those VMs just be deleted? Or should I try to upgrade them or something? [18:17:15] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:18:14] Anyone for SWAT? [18:18:17] I can do it [18:18:27] odder: I'm going to deploy your patches [18:19:47] (03CR) 10jenkins-bot: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [18:19:49] odder: around? [18:21:30] Amir1: Yup, still here :-) [18:21:56] (03CR) 10Alex Monk: puppet_major_version4: remove old puppet_major_version variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [18:22:27] (03PS1) 10Paladox: Gerrit: Enable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) [18:22:38] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503638 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:22:41] (03PS12) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [18:22:54] (03PS2) 10Paladox: Gerrit: Enable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) [18:23:07] (03CR) 10Andrew Bogott: "There are currently two puppetmasters running 3.8.5-2~bpo8+2 packages:" [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [18:23:17] (03CR) 10CRusnov: "Minor change to fix ganeti module import (since it is in its Final Form in spicerack now)." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [18:23:21] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Enable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) (owner: 10Paladox) [18:23:39] (03PS3) 10Paladox: Gerrit: Enable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) [18:23:42] (03Merged) 10jenkins-bot: Upload a new Apple Touch icon for German Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503638 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:23:59] (03CR) 10jenkins-bot: Upload a new Apple Touch icon for German Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503638 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:25:14] (03PS6) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) [18:26:31] odder: so the first patch is in mwdebug1002 [18:26:42] I guess it's easily testable [18:27:09] Amir1: Yup, one moment please. [18:27:21] https://en.wikipedia.org/static/apple-touch/wiktionary/de.png [18:28:48] Looks fine :-) [18:28:59] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational [18:29:25] Let's go! 
[18:29:52] (03PS2) 10Ladsgroup: Add a new Apple Touch icon to configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503639 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:30:01] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503639 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:31:03] (03CR) 10Thcipriani: [C: 03+1] "these added logs will be helpful" [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) (owner: 10Paladox) [18:31:05] (03Merged) 10jenkins-bot: Add a new Apple Touch icon to configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503639 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:33:16] odder: your second patch is on mwdebug1002 [18:34:55] (03CR) 10jenkins-bot: Add a new Apple Touch icon to configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503639 (https://phabricator.wikimedia.org/T202902) (owner: 10Odder) [18:35:27] Amir1: Can confirm I can see the new logo being linked [18:35:37] !log logstash1009: disabling puppet and testing logstash config [18:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:42] okay, going forward [18:38:58] (03CR) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [18:39:03] !log Morning SWAT is done [18:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] Yeah, not working for me, but whatever, Apple need to get their shit together [18:40:55] !log pooling map2002 - postgres init complete [18:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:01] (Or it's a cache issue.) 
[18:41:37] !log depooling maps2003 for psotgres init [18:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:46] (03PS1) 10Cwhite: profile: do not mutate level for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/504074 (https://phabricator.wikimedia.org/T213899) [18:43:55] Amir1: I'm getting a 404 when not using Wikimedia-X-Debug for that file [18:44:41] 10Operations, 10Wikimedia-Logstash, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10jbond) Its also worth pointing out that ulogd supports native json output[1] but only to a separate log file not syslog. Prometheus is was simpl... [18:46:09] hmm, it's probably varnish, let me fix that [18:46:25] odder: ^ [18:46:39] Aha, not a problem, I'll wait ;) [18:48:09] odder: can you try now? [18:49:26] Amir1: Yes, thank you, the file's there now [18:49:51] sorry, I always keep forgetting the purge part [18:50:47] (03CR) 10Herron: [C: 03+1] "+1 but let's set downtime in icinga and perform a canary roll out to err on the side of caution" [puppet] - 10https://gerrit.wikimedia.org/r/504074 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite) [18:51:11] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:51:53] Amir1: It's fine. The issue still appears to be there, but it's either a cache issue locally for me or the iPhone simply ignores the link and goes for the file in the docroot, in which case I don't know how to fix this. 
[18:52:41] yeah [18:52:49] it's fine I guess [18:53:26] (03CR) 10Cwhite: [C: 03+2] profile: do not mutate level for mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/504074 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite) [18:53:29] I'll ask someone to test it [18:58:22] !log update mr1-* security policies - T219384 [18:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:26] T219384: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 [18:59:25] Amir1: Yay, it works! :-) [18:59:33] Not on my phone but I had someone else test it [18:59:36] So thank you [19:00:35] Coooli [19:01:12] Gotta go, bye :) [19:07:25] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:08:03] (03CR) 10Herron: [C: 04-1] "let's split off the records with non sre maintained mx servers into their own patches, and cc the stakeholders of those domains for review" [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [19:08:06] (03CR) 10Jbond: "Ready for review again, thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [19:08:41] (03CR) 10Herron: [C: 03+1] Add SPF record for wikibooks.org [dns] - 10https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [19:09:05] (03CR) 10Herron: [C: 03+1] Add SPF record for wikisource.org [dns] - 10https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [19:10:17] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10ayounsi) [19:10:24] 10Operations, 10netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10ayounsi) 05Open→03Resolved I think @robh's comment was about the fact 
that bast hosts are not allowed to reach mgmt's http/https, but only ssh. On the other hand, cumin hosts are allowed to. An... [19:11:03] akosiaris: ^ ? [19:15:11] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:24:16] (03PS2) 10Dzahn: Remove obsolete rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/503961 (owner: 10Muehlenhoff) [19:28:21] (03CR) 10Dzahn: [C: 03+2] Remove obsolete rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/503961 (owner: 10Muehlenhoff) [19:28:41] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:30:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:32:09] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:34:52] (03PS1) 10Andrew Bogott: puppet-merge: write the latest puppet repo sha-1 to config-master [puppet] - 10https://gerrit.wikimedia.org/r/504082 (https://phabricator.wikimedia.org/T219390) [19:45:42] !log update (and add) AS3491 BGP communities in eqsin [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:14] !log bromine/vega: rm /etc/rsyncd.conf ; systemctl stop rsync (clean up old rsync config gerrit:503961) [19:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:26] (03CR) 10Dzahn: "15:46 < mutante> !log bromine/vega: rm /etc/rsyncd.conf ; systemctl stop rsync (clean up old rsync config gerrit:503961)" [puppet] - 10https://gerrit.wikimedia.org/r/503961 (owner: 10Muehlenhoff) [19:46:27] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:48:53] PROBLEM - puppet last run on 
mw1330 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [19:49:03] !log export BGP communities (prepend x3 outside asia) to AS3491 in eqsin [19:49:05] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:23] 10Operations: scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10CDanis) [19:55:52] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: write the latest puppet repo sha-1 to config-master [puppet] - 10https://gerrit.wikimedia.org/r/504082 (https://phabricator.wikimedia.org/T219390) (owner: 10Andrew Bogott) [19:56:44] 10Operations, 10Scap, 10Release-Engineering-Team (Backlog): scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10Marostegui) [19:57:46] (03PS1) 10Andrew Bogott: puppet-merge: remove some tabs [puppet] - 10https://gerrit.wikimedia.org/r/504084 [19:58:30] (03CR) 10Andrew Bogott: [C: 03+2] puppet-merge: remove some tabs [puppet] - 10https://gerrit.wikimedia.org/r/504084 (owner: 10Andrew Bogott) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T2000). 
[20:00:13] I have an ORES deployment [20:03:01] commit id to rollback in case I'm not around 5d937b1d2ee613268ab608b1b52a893f4e6f0509 [20:03:05] https://wikitech.wikimedia.org/wiki/ORES/Deployment [20:03:27] (03PS4) 10Andrew Bogott: shinken: add basic spec [puppet] - 10https://gerrit.wikimedia.org/r/497253 (owner: 10Hashar) [20:03:59] (03CR) 10jerkins-bot: [V: 04-1] shinken: add basic spec [puppet] - 10https://gerrit.wikimedia.org/r/497253 (owner: 10Hashar) [20:09:55] 10Operations, 10SRE-Access-Requests: Membership in "researchers" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (10Bmueller) Approved. [20:12:11] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) 05Stalled→03Open [20:14:21] (03PS1) 10Andrew Bogott: nfs/maps: clean up grants for soon-to-be-deleted old region VMs [puppet] - 10https://gerrit.wikimedia.org/r/504091 [20:15:07] (03PS2) 10Andrew Bogott: nfs/maps: clean up exports for soon-to-be-deleted old region VMs [puppet] - 10https://gerrit.wikimedia.org/r/504091 [20:15:21] RECOVERY - puppet last run on mw1330 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:15:50] (03CR) 10Andrew Bogott: [C: 03+2] nfs/maps: clean up exports for soon-to-be-deleted old region VMs [puppet] - 10https://gerrit.wikimedia.org/r/504091 (owner: 10Andrew Bogott) [20:16:36] (03PS3) 10Dzahn: turn bast2001 into a spare, replaced by bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499224 (https://phabricator.wikimedia.org/T219492) [20:17:16] 10Operations, 10Scap, 10Release-Engineering-Team (Backlog): scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10Peachey88) [20:18:36] (03CR) 10Dzahn: [C: 03+1] "unblocked, can talk to mgmt interfaces now" [puppet] - 10https://gerrit.wikimedia.org/r/499449 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [20:19:26] (03PS1) 10Andrew Bogott: maps/nfs: 
clean up more references to old VMs [puppet] - 10https://gerrit.wikimedia.org/r/504093 [20:20:03] (03CR) 10Andrew Bogott: [C: 03+2] maps/nfs: clean up more references to old VMs [puppet] - 10https://gerrit.wikimedia.org/r/504093 (owner: 10Andrew Bogott) [20:21:48] 10Puppet, 10Patch-For-Review, 10cloud-services-team (Kanban): Have puppet-merge on puppetmaster1001 publish the official sha1 after merging - https://phabricator.wikimedia.org/T219390 (10Andrew) ` andrew@tools-puppetmaster-01:~$ curl https://config-master.wikimedia.org/puppet-sha1.txt 440fab7e0ee0bd63fdd1249... [20:21:56] (03PS1) 10BryanDavis: admin: Add Bryan Davis (bd808) to 'researchers' group [puppet] - 10https://gerrit.wikimedia.org/r/504094 (https://phabricator.wikimedia.org/T220892) [20:23:28] !log the ores deployment is over [20:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:10] The home page is behind varnish :( [20:24:31] https://ores.wikimedia.org/ is old but https://ores.wikimedia.org/?any_random_thing is not [20:26:04] (03Abandoned) 10Dzahn: network::constants: remove bast2001 [puppet] - 10https://gerrit.wikimedia.org/r/499449 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [20:27:14] Amir1: if you are in a hurry to expire that -- https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_(bans) [20:27:56] bd808: thanks. I think it's fine [20:28:29] I note this for weird cases that might happen later [20:28:50] (03PS1) 10Dzahn: remove bast2001 from bastion hosts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/504095 (https://phabricator.wikimedia.org/T219492) [20:28:57] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10mobrovac) +1 for SSDs. We will be directly exposing time-based messages to consumers, so I i... 
[20:30:52] (03CR) 10Dzahn: [C: 03+2] remove bast2001 from bastion hosts in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/504095 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [20:31:08] I think I'm done for the day [20:37:42] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15786/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) (owner: 10Paladox) [20:37:43] !log Updated Parsoid to 83c17fc9 [20:37:55] (03PS4) 10CDanis: Gerrit: Enable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/504073 (https://phabricator.wikimedia.org/T221026) (owner: 10Paladox) [20:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:19] thanks cdanis! [20:47:29] (03PS4) 10Dzahn: turn bast2001 into a spare, replaced by bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499224 (https://phabricator.wikimedia.org/T219492) [20:50:38] (03CR) 10Dzahn: [C: 03+2] turn bast2001 into a spare, replaced by bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499224 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [20:51:41] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1010 is CRITICAL: cluster=wdqs-test instance=wdqs1010:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [20:52:43] PROBLEM - High lag on wdqs1010 is CRITICAL: 3620 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:55:40] !log gerrit restart to pick up gc log changes incoming [20:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:43] ^ [20:57:10] !log shutting down blazegraph and updater on wdqs1010, waiting for data reimport [20:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:41] !log gerrit back [20:59:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] bawolff and Reedy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T2100). [21:02:33] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [21:02:53] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:35] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [21:04:49] ^ due to gerrit restart.. will fix itself [21:05:33] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [21:06:01] (03PS1) 10Dzahn: bast2001: use correct class name for spare host [puppet] - 10https://gerrit.wikimedia.org/r/504206 [21:06:34] (03CR) 10Dzahn: [V: 03+2] bast2001: use correct class name for spare host [puppet] - 10https://gerrit.wikimedia.org/r/504206 (owner: 10Dzahn) [21:09:10] (03CR) 10Dzahn: [V: 03+2 C: 03+2] bast2001: use correct class name for spare host [puppet] - 10https://gerrit.wikimedia.org/r/504206 (owner: 10Dzahn) [21:13:27] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:19:58] (03PS2) 10Dzahn: bastionhost: remove rsync and motd warning for bast2001 [puppet] - 10https://gerrit.wikimedia.org/r/499740 (https://phabricator.wikimedia.org/T219492) [21:22:25] (03CR) 10Dzahn: [C: 03+2] bastionhost: remove rsync and motd warning for bast2001 [puppet] - 10https://gerrit.wikimedia.org/r/499740 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [21:28:17] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:30:17] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:30:43] (03PS1) 10Dzahn: install_server: remove bast2001 [puppet] - 10https://gerrit.wikimedia.org/r/504209 (https://phabricator.wikimedia.org/T219492) [21:31:20] (03CR) 10Dzahn: [C: 03+2] install_server: remove bast2001 [puppet] - 10https://gerrit.wikimedia.org/r/504209 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [21:31:21] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:30] (03CR) 10BryanDavis: wikitech: Use cn:caseExactMatch: as account search filter (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [21:34:14] (03PS1) 10Dzahn: remove bast2001 production IPs [dns] - 10https://gerrit.wikimedia.org/r/504210 (https://phabricator.wikimedia.org/T219492) [21:37:31] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) [21:47:58] jouncebot: refresh [21:47:59] I refreshed my knowledge about deployments. [21:51:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Andrew) [21:52:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Andrew) [21:52:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) [21:54:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [22:00:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Andrew) [22:00:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Andrew) [22:00:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Andrew) [22:05:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Membership in "researchers" group for Bryan Davis - 
https://phabricator.wikimedia.org/T220892 (10Nuria) Approved on my end. [22:19:40] (03PS6) 10Effie Mouzeli: haproxy: improve metrics (via mtail) and logging [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [22:20:12] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) Looked into a few angles but nothing conclusive: * Looked at overall query rates to our primary query splits (full_text, comp_suggest, mo... [22:21:05] 10Operations, 10Jade, 10Scoring-platform-team, 10TechCom, and 4 others: Deploy Jade extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) [22:21:16] 10Operations, 10Jade, 10Scoring-platform-team, 10TechCom, and 4 others: Deploy Jade extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) a:05awight→03None [22:24:49] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380 (10JbuattiWMF) Hi @RobH so sorry to reopen this old thread, but would it be possible to grant the same LDAP access referenced above to the following two Wi... 
[22:26:12] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380 (10JbuattiWMF) 05Resolved→03Open [22:28:09] (03CR) 10Effie Mouzeli: "Expected outcome: https://puppet-compiler.wmflabs.org/compiler1002/15791/dbproxy1001.eqiad.wmnet/ https://puppet-compiler.wmflabs.org/com" [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [22:54:58] (03PS1) 10Andrew Bogott: Rename some labvirts to cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/504215 (https://phabricator.wikimedia.org/T221047) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190415T2300). [23:00:04] bd808: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:50] I can do the SWAT since there are no other patches in the queue. 
[23:01:02] (03CR) 10Krinkle: [C: 03+1] Invariant config cleanup: I - Initial DB and performance items (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 (owner: 10Jforrester) [23:01:09] (03PS1) 10Andrew Bogott: Rename some labvirts to cloudvirts [dns] - 10https://gerrit.wikimedia.org/r/504217 (https://phabricator.wikimedia.org/T221049) [23:01:24] Luckily the stickers promised by jouncebot are already on my laptop [23:05:12] (03CR) 10Andrew Bogott: [C: 03+2] Rename some labvirts to cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/504215 (https://phabricator.wikimedia.org/T221047) (owner: 10Andrew Bogott) [23:05:24] (03CR) 10Andrew Bogott: [C: 03+2] Rename some labvirts to cloudvirts [dns] - 10https://gerrit.wikimedia.org/r/504217 (https://phabricator.wikimedia.org/T221049) (owner: 10Andrew Bogott) [23:06:11] (03CR) 10BryanDavis: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [23:10:29] (03PS3) 10BryanDavis: wikitech: Use cn:caseExactMatch: as account search filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) [23:10:51] (03CR) 10BryanDavis: [V: 03+2 C: 03+2] wikitech: Use cn:caseExactMatch: as account search filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [23:11:14] mutante: have a minute to approve a webperf1001 apache patch? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/499537/ [23:11:27] * bd808 keeps smashing buttons in gerrit in the hope that zuul will do the needful [23:11:50] (03Merged) 10jenkins-bot: wikitech: Use cn:caseExactMatch: as account search filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [23:11:59] victory! 
[23:15:46] (03CR) 10jenkins-bot: wikitech: Use cn:caseExactMatch: as account search filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497423 (https://phabricator.wikimedia.org/T165795) (owner: 10BryanDavis) [23:16:16] why didn't my scap run log here? [23:16:58] 10Operations, 10Wikimedia-Logstash: config file change canarying for logstash - https://phabricator.wikimedia.org/T221052 (10CDanis) [23:17:19] bd808: https://phabricator.wikimedia.org/T221035 [23:17:26] I have not had time to dig into it [23:17:40] !log scap: SWAT: [[gerrit:497423|wikitech: Use cn:caseExactMatch: as account search filter]] (T165795) [23:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:44] T165795: Ldap auth extension vs. ldap vs. username Case - https://phabricator.wikimedia.org/T165795 [23:18:07] logmsgbot isn't here? [23:19:16] it should be running from icinga.wikimedia.org I think based on a quick grep of ops/puppet [23:19:39] okay [23:19:44] it thinks it is running there [23:19:56] and it even has your recent deploy message in its logging output [23:20:02] I'm going to restart it and see what happens I guess [23:20:17] !log cdanis@icinga1001.wikimedia.org ~ % sudo systemctl restart tcpircbot-logmsgbot.service [23:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:04] cdanis: that join looks promising :) [23:23:00] .... [23:23:06] Apr 13 17:06:30 icinga1001 python[100554]: 2019-04-13 17:06:30,474 ChanServ!ChanServ@services. [u'[#wikimedia-overflow] PM ops in here if you want to be invited to the channel you were trying to join.'] [23:23:07] 10Operations, 10Scap, 10Release-Engineering-Team (Backlog): scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10bd808) ` [23:18] < bd808> logmsgbot isn't here? [23:19] < bd808> it should be running from icinga.wikimedia.org I think based on a quick grep of ops/... 
[23:23:16] * cdanis facepalm [23:23:57] ChanServ/NickServ hiccup? [23:24:17] apparently. it received the "you are now identified" from nickserv a full 12 seconds before that [23:25:06] * bd808 tries not to fall into the "rewrite logmsgbot" rabbithole today [23:25:21] haha [23:25:27] oh, it looks like it is whatever vuln scan that freenode does on connection that took forever? [23:25:33] https://phabricator.wikimedia.org/P8402 [23:27:14] are you not fully identified until the scan happens? [23:27:30] apparently [23:27:33] I don't remember what spit and twigs that bot is built from today. My ib3 library would make it trivial to switch to SASL auth which helps with NickServ stuff. [23:28:34] https://github.com/bd808/python-ib3 [23:29:00] ooh neat [23:29:14] like 2 years ago i was looking for something like that. [23:29:19] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/tcpircbot/files/tcpircbot.py bd808 [23:30:31] cdanis: like 2 years ago I wrote that because I couldn't find one :) [23:32:37] cdanis testing hopefully-restored logmsgbot [23:32:43] \o/ [23:33:58] 10Operations, 10Scap, 10Release-Engineering-Team (Backlog): scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10CDanis) 05Open→03Resolved a:03CDanis looks like `logmsgbot` was happily chattering away in #wikimedia-overload because of some race condition (within I... 
[23:34:52] bd808: that library is a good thing to know about, there's some interest in doing more 'chatops' kind of stuff [23:39:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [23:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:49] hooray actual usage [23:44:54] 10Operations, 10Scap, 10Stashbot, 10Release-Engineering-Team (Watching / External): scap no longer !log'ging to server admin log - https://phabricator.wikimedia.org/T221035 (10greg)