[00:27:54] (03PS1) 10Alex Monk: Tidy up more comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236701 (https://phabricator.wikimedia.org/T31902) [00:46:51] !log mwscript deleteEqualMessages.php --wiki eswiki [00:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:06:33] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:07:05] deletionist! [01:10:25] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 0.010 second response time [01:38:35] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 connecting: (unnamed) not-conn: cp2015_v6, cp3018_v6 [01:40:34] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [01:55:23] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [01:57:45] (03CR) 10Krinkle: Tidy up more comments (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236701 (https://phabricator.wikimedia.org/T31902) (owner: 10Alex Monk) [01:59:55] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3015_v6 [02:02:03] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [02:08:24] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6 [02:10:24] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:20:34] !log l10nupdate@tin Synchronized php-1.26wmf21/cache/l10n: l10nupdate for 1.26wmf21 (duration: 06m 30s) [02:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:21:13] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:23:51] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf21) at 2015-09-08 02:23:51+00:00 [02:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:15:54] PROBLEM - Host mw1085 is DOWN: PING 
CRITICAL - Packet loss = 100% [03:16:14] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [03:55:32] (03PS1) 10BryanDavis: Kibana: Fix apache::site title [puppet] - 10https://gerrit.wikimedia.org/r/236727 [03:56:19] (03CR) 10jenkins-bot: [V: 04-1] Kibana: Fix apache::site title [puppet] - 10https://gerrit.wikimedia.org/r/236727 (owner: 10BryanDavis) [03:57:28] (03PS2) 10BryanDavis: Kibana: Fix apache::site title [puppet] - 10https://gerrit.wikimedia.org/r/236727 [04:17:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:29:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [04:30:53] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (5821 100000s) [04:33:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:37:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 8 04:37:06 UTC 2015 (duration 37m 5s) [04:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:37:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:43:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:45:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [04:46:24] (03CR) 10Krinkle: "fixme: domComplete stopped being collected after this was deployed." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [04:47:56] (03PS1) 10Krinkle: coal: Restore domComplete property (missing comma) [puppet] - 10https://gerrit.wikimedia.org/r/236728 [04:47:57] ori: ^ [04:48:09] (03CR) 10Krinkle: "Fixed in https://gerrit.wikimedia.org/r/236728." [puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [04:48:21] (03CR) 10Ori.livneh: [C: 032 V: 032] coal: Restore domComplete property (missing comma) [puppet] - 10https://gerrit.wikimedia.org/r/236728 (owner: 10Krinkle) [04:48:23] damn it [05:19:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [05:29:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [05:32:04] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [05:33:15] (03CR) 10Krinkle: "Varnish and hhvm-static (which we use for serving /static files, since hhvm owns the entire docroot, contrary to zendphp cgi)... 
are both " [puppet] - 10https://gerrit.wikimedia.org/r/222673 (owner: 10Ori.livneh) [05:33:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [05:41:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [05:49:15] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [05:57:54] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:11:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [06:19:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [06:20:34] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.71% of data above the critical threshold [100000000.0] [06:25:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 5 below the confidence bounds [06:29:15] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [06:30:24] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [06:31:05] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail [06:31:34] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:45] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:24] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:44] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on cp4016 is CRITICAL: 
CRITICAL: Puppet has 1 failures [06:32:53] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:32:55] (03PS1) 10Muehlenhoff: Add ferm rules for pybal SSH health checks [puppet] - 10https://gerrit.wikimedia.org/r/236734 [06:33:04] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:25] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:25] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:44] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 5 below the confidence bounds [06:44:22] (03PS1) 10Faidon Liambotis: Readd cr1-eqord to smokeping/rancid [puppet] - 10https://gerrit.wikimedia.org/r/236736 [06:44:45] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 34, down: 0, dormant: 0, excluded: 1, unused: 0 [06:44:55] (03CR) 10Faidon Liambotis: [C: 032] Readd cr1-eqord to smokeping/rancid [puppet] - 10https://gerrit.wikimedia.org/r/236736 (owner: 10Faidon Liambotis) [06:46:53] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [06:47:25] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:13] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [06:48:14] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [06:49:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [06:50:13] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:14] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures 
[06:50:44] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:55:34] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:55:45] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:56:23] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:45] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:04] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:14] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:57:53] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:13] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:53] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:33] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:00:05] 6operations, 10netops: Fix esams management network - https://phabricator.wikimedia.org/T80253#1615577 (10faidon) p:5Low>3Normal a:5mark>3None [07:00:35] 6operations, 10ops-esams: Setup management switch in OE12 - https://phabricator.wikimedia.org/T84700#1615579 (10faidon) [07:00:54] 6operations, 10netops: Fix esams management 
network - https://phabricator.wikimedia.org/T80253#1615584 (10faidon) 5Open>3Resolved a:3faidon Resolving in favor of T84700 — the rest here are done. [07:01:14] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 [07:06:01] 6operations, 10Analytics-Cluster: Fix llama user id - https://phabricator.wikimedia.org/T100678#1615603 (10faidon) Any news here @ottomata? puppet has been failing on analytics1026 for more than 68 days now. [07:07:45] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 29, down: 7, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - DISABLEDBRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 Telia (IC-314534) ??ms {#11375} [10Gbps DWDM]BRxe-1/1/0: down - Peering: ! Equinix Chicago (SR 17915277) {#11371} [10Gbps DF]BRxe-1/0/0: down - Core: cr2-eqiad:xe-4/2/0 Telia (IC-314533) ??ms {#11374} [10Gbps DWD [07:08:03] RECOVERY - NTP on lvs2004 is OK: NTP OK: Offset 0.001416802406 secs [07:11:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [07:13:44] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:33:43] 6operations, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, 5ContentTranslation-Release6, and 4 others: Review and create table for Content Translation - https://phabricator.wikimedia.org/T111317#1615653 (10Arrbee) [07:39:24] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:24] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:24] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:24] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:24] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet 
loss = 100% [07:39:25] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:25] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:26] PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:26] PROBLEM - Host ps1-c4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:27] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:27] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:28] PROBLEM - Host ps1-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:28] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:29] PROBLEM - Host ps1-b1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:43] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:44] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:55] PROBLEM - Host ps1-b6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:39:55] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:40:03] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:40:14] PROBLEM - Host mr1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:40:34] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [07:42:04] RECOVERY - Host ps1-c4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.33 ms [07:42:04] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [07:42:04] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.42 ms [07:42:04] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.55 ms [07:42:04] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.67 ms [07:42:05] RECOVERY - Host ps1-b8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.80 ms [07:42:05] RECOVERY - Host ps1-b1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.09 ms [07:42:06] RECOVERY - Host ps1-c2-eqiad 
is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [07:42:06] RECOVERY - Host ps1-d5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [07:42:07] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.18 ms [07:42:07] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.60 ms [07:42:08] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.71 ms [07:42:08] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [07:42:09] RECOVERY - Host ps1-c1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.83 ms [07:42:23] RECOVERY - Host ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.15 ms [07:42:34] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.54 ms [07:42:43] RECOVERY - Host ps1-d7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [07:42:43] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [07:42:44] RECOVERY - Host ps1-d3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [07:45:33] RECOVERY - Host mr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [07:45:54] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [07:47:03] ignore that ;) [07:47:15] (wow nagios is lagging a lot) [07:52:09] 10Ops-Access-Requests, 6operations: Requesting access to hadoop / hive (analytics-privatedata-users) for Addshore - https://phabricator.wikimedia.org/T111204#1615682 (10Addshore) And as WDQS is now up and announced I will also be using access to analyse that! [08:03:52] (03PS1) 10Yuvipanda: aptly: Make client default to not having source packages [puppet] - 10https://gerrit.wikimedia.org/r/236743 [08:04:02] evil! 
[08:04:14] heh [08:04:25] indeed, but all current users of labsdebrepo don't have source packages [08:04:35] and that causes apt-get to fail [08:04:38] causing puppet to not run [08:04:43] (03PS2) 10Yuvipanda: aptly: Make client default to not having source packages [puppet] - 10https://gerrit.wikimedia.org/r/236743 [08:05:18] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Make client default to not having source packages [puppet] - 10https://gerrit.wikimedia.org/r/236743 (owner: 10Yuvipanda) [08:09:26] (03PS1) 10Faidon Liambotis: Add /32 loopback for mr1-eqiad [dns] - 10https://gerrit.wikimedia.org/r/236744 [08:10:06] apergos: any news at all if salt is any better on labs? me and valhallasw`cloud have to force something on all tools hosts now, and working salt would be useful [08:10:30] 10Ops-Access-Requests, 6operations: Requesting access to fluorine / mw-log-readers group for Addshore - https://phabricator.wikimedia.org/T111756#1615691 (10Addshore) 3NEW [08:10:44] yuvipanda: I'm doing a large cleanup. everything I do is via a giant ssh loop right now. [08:10:49] ok [08:12:44] we have everything from still some hosts with ec2ids to hosts with too current a salt version. a nice mess [08:13:00] I should write it down so people can track [08:13:33] yes please [08:14:45] paravoid: moritzm do you know why an amd64 machine will look for i386 debs? 
[08:14:48] apt-get is now failing with [08:14:53] multiarch [08:14:55] W: Failed to fetch http://tools-services-01/repo/dists/precise-tools/main/binary-i386/Packages 404 Not Found [08:14:56] oh [08:15:05] we disable it in d-i for production [08:15:28] the labs bootstrapping probably doesn't do that [08:15:44] yeah, and we probably should [08:16:12] paravoid: I think trusty and jessie do and precise doesn't [08:16:47] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1615702 (10ArielGlenn) The way we run the dumps now, we have a (scheduled) break in between runs, which should come up in a week and last for 5 days. This will be enough... [08:17:39] YuviPanda: we can fix it in the sources.list [08:17:48] deb [arch=amd64] http://uk.archive.ubuntu.com/ubuntu/ quantal main universe should work [08:17:55] that sounds like a hack [08:18:03] just disable multiarch? :) [08:18:32] why is that a hack? it explicitly specifies the repo only has amd64 [08:18:50] valhallasw`cloud: so I think it's fine on trusty and jessie (bastion-01 doesn't try to get i386) [08:19:13] * YuviPanda verifies for jessie as well [08:19:30] yeah, /etc/dpkg/dpkg.cfg.d/multiarch has 'foreign-architecture i386' on precise [08:19:44] that's the old (ubuntu-specific) way of configuring multiarch [08:19:57] trusty switched to the "new" way, the one that got merged into Debian [08:20:13] confirmed that we're all good on jessie as well [08:20:21] I suppose it was turned off with the new way and wasn't in the 'old' way [08:20:59] new way = dpkg --add-architecture and dpkg --remove-architecture [08:22:07] so I guess we rm that file via puppet and make a bug to remember when building next image [08:22:42] some people might depend on i386 packages? [08:23:06] do we have any installed? 
[08:23:08] I don't think we do [08:23:09] https://wiki.debian.org/Multiarch/HOWTO suggests there's an issue with building android stuff at least [08:23:17] no, /we/ don't, but other people on labs might? [08:23:24] valhallasw`cloud: yes and we moved our android stuff to jessie [08:23:34] ok, then that should be OK, at least [08:23:42] yeah and afaik that's the biggest PITA [08:23:53] I moved it after the Great NFS outage [08:24:04] as part of 'get rid of NFS' [08:24:48] so let's just remove the file for tools, then, plus the note for when the image is rebuilt? [08:25:22] you mean manually or with something in puppet? [08:25:27] with puppet [08:25:33] let's just put it in the labs wide role [08:25:42] hmmm [08:25:50] actually yeah, let's not. [08:26:25] valhallasw`cloud: I think we'll want to disable precise images for projects outside of CI, beta and tools soon [08:27:41] valhallasw`cloud: actually, I think we should maybe just disable it in the aptly::client role for now. [08:28:01] valhallasw`cloud: so that'll make it work for other projects that use aptly as well [08:39:06] (03CR) 10Faidon Liambotis: [C: 032] Add /32 loopback for mr1-eqiad [dns] - 10https://gerrit.wikimedia.org/r/236744 (owner: 10Faidon Liambotis) [08:39:19] (03PS1) 10Yuvipanda: aptly: Force remove multiarch support in precise [puppet] - 10https://gerrit.wikimedia.org/r/236745 (https://phabricator.wikimedia.org/T111760) [08:39:22] valhallasw`cloud: ^ [08:39:32] (03PS2) 10Yuvipanda: aptly: Force remove multiarch support in precise [puppet] - 10https://gerrit.wikimedia.org/r/236745 (https://phabricator.wikimedia.org/T111760) [08:39:46] (03PS1) 10Faidon Liambotis: Split cr1/cr2/mr1-eqiad shared subnet into two [dns] - 10https://gerrit.wikimedia.org/r/236746 [08:39:58] (03CR) 10Faidon Liambotis: [C: 032] Split cr1/cr2/mr1-eqiad shared subnet into two [dns] - 10https://gerrit.wikimedia.org/r/236746 (owner: 10Faidon Liambotis) [08:40:16] 10Ops-Access-Requests, 6operations: Requesting access to 
fluorine / mw-log-readers group for Addshore - https://phabricator.wikimedia.org/T111756#1615757 (10jcrespo) p:5Triage>3Normal [08:40:54] (03CR) 10Yuvipanda: [C: 032] aptly: Force remove multiarch support in precise [puppet] - 10https://gerrit.wikimedia.org/r/236745 (https://phabricator.wikimedia.org/T111760) (owner: 10Yuvipanda) [08:43:40] (03PS1) 10Yuvipanda: aptly: Fix terrible typo [puppet] - 10https://gerrit.wikimedia.org/r/236747 [08:43:45] (03CR) 10jenkins-bot: [V: 04-1] aptly: Fix terrible typo [puppet] - 10https://gerrit.wikimedia.org/r/236747 (owner: 10Yuvipanda) [08:43:54] (03PS2) 10Yuvipanda: aptly: Fix terrible typo [puppet] - 10https://gerrit.wikimedia.org/r/236747 [08:44:16] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Fix terrible typo [puppet] - 10https://gerrit.wikimedia.org/r/236747 (owner: 10Yuvipanda) [08:44:20] (the -1 was for rebasing) [08:49:29] valhallasw`cloud: done. now we just need to force a puppet run on all hosts [08:49:30] sigh [08:49:43] apergos: do document your salt work when you can :) [08:49:50] I'll use my ssh looper script for now [08:50:34] 6operations, 7Database: dbtree shows 0 lag for db1047 - https://phabricator.wikimedia.org/T109401#1615783 (10jcrespo) p:5Triage>3Low Low because there is another tool that shows the correct parameter, plus this only happens on multisource replication, which is not used for production hosts (db1047 is an an... 
[08:50:58] actually I'm going to head to the WMDE office now, will run looper from there [08:53:19] (03PS1) 10Faidon Liambotis: Add IPv6 to mr1-eqiad (and neighboring subnets) [dns] - 10https://gerrit.wikimedia.org/r/236748 [08:53:36] (03CR) 10Faidon Liambotis: [C: 032] Add IPv6 to mr1-eqiad (and neighboring subnets) [dns] - 10https://gerrit.wikimedia.org/r/236748 (owner: 10Faidon Liambotis) [08:54:28] (03CR) 10Filippo Giunchedi: 0.1.1-wmf3: statsd and systemd support (031 comment) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [08:54:53] 6operations, 6Discovery, 10MediaWiki-Search, 7Monitoring: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914#1615796 (10jcrespo) p:5Triage>3Normal [08:58:52] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1615817 (10jcrespo) p:5Triage>3Low Set to low because it would be a "nice thing to change", but there is nothing broken right now because of this (correct me if I am wrong), no dependen... [08:59:25] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for pybal SSH health checks [puppet] - 10https://gerrit.wikimedia.org/r/236734 (owner: 10Muehlenhoff) [08:59:39] I hope it is ok to make mistakes on my triage, than making no triage at all [09:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150908T0900). [09:00:29] (03CR) 10Merlijn van Deen: "The puppet manifests are easier to read (as they are plain text) and also include version numbers. 
The HTML report is really meant to be v" [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: 10Merlijn van Deen) [09:01:57] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1615822 (10jcrespo) p:5Triage>3Normal Normal as per a conversation with them "not an emergency". [09:03:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The Bug number in the commit message is incorrect. I get the "If blocked, MobileFrontend blocks you from editing all pages including your " [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111739) (owner: 10Merlijn van Deen) [09:03:46] (03CR) 10Filippo Giunchedi: [C: 031] "will require old conf cleanup after merge, can be done manually" [puppet] - 10https://gerrit.wikimedia.org/r/236727 (owner: 10BryanDavis) [09:04:10] 6operations, 5Continuous-Integration-Scaling, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1615827 (10jcrespo) p:5Triage>3Normal Normal as it doesn't seem #Blocked-on-operations. [09:07:09] (03PS1) 10Faidon Liambotis: Sort router loopbacks in a logical order [dns] - 10https://gerrit.wikimedia.org/r/236750 [09:09:02] (03PS3) 10Merlijn van Deen: package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111730) [09:09:48] 6operations, 6Performance-Team, 7Mobile: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1615837 (10jcrespo) p:5Triage>3Low Low because it is "a nice thing to do" (aka #TODO), but it is not breaking something right now (feel free to cor... [09:10:06] (03CR) 10Merlijn van Deen: "Whoops, thanks for spotting that. 
We'd like to use the role on tool labs, which is currently still Ubuntu-based, and most of our manifests" [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111730) (owner: 10Merlijn van Deen) [09:10:40] YuviPanda, valhallasw`cloud: are you guys planning on killing manifests/misc/labsdebrepo.pp soon? [09:10:54] paravoid: for tool labs, at least, yes [09:10:59] I'm not sure if any others use it? [09:11:22] we want to kill it mostly because it bites with require_package() [09:11:32] also no NFS [09:11:59] (03CR) 10Alexandros Kosiaris: [C: 031] "cross-checked, LGTM" [dns] - 10https://gerrit.wikimedia.org/r/236750 (owner: 10Faidon Liambotis) [09:12:32] I'm asking because I've been hunting down the last manifests under manifests/ [09:13:09] (03CR) 10Faidon Liambotis: [C: 032] Sort router loopbacks in a logical order [dns] - 10https://gerrit.wikimedia.org/r/236750 (owner: 10Faidon Liambotis) [09:14:00] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111730) (owner: 10Merlijn van Deen) [09:14:17] 6operations: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933#1615840 (10jcrespo) p:5Triage>3Normal [09:14:19] (03PS4) 10Alexandros Kosiaris: package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111730) (owner: 10Merlijn van Deen) [09:14:37] (03CR) 10Alexandros Kosiaris: [V: 032] package_builder: require_packages and make ubuntu-friendly [puppet] - 10https://gerrit.wikimedia.org/r/236680 (https://phabricator.wikimedia.org/T111730) (owner: 10Merlijn van Deen) [09:15:45] paravoid: dynamicproxy uses it as well (using /data/project/repo) -- but I think toollabs is the only user of that? 
[09:15:50] YuviPanda should know [09:16:08] No the labs novaproxy is also that [09:16:10] 7Puppet, 6operations: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031#1615847 (10jcrespo) p:5Triage>3Low Setting it to Low because it is a "nice thing to do", but it is a new feature, not a bug fix. Feel free to correct me. [09:16:25] Some of mutante's projects also use it [09:16:29] (03PS1) 10Filippo Giunchedi: swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) [09:16:37] godog: \o/ [09:16:38] I need to get rid of it from the proxy as well [09:17:22] wdq-mm also uses that.. [09:17:31] And quarry [09:17:44] And ircyall [09:18:02] Surprisingly / unsurprisingly all of these are projects I'm responsible for [09:19:04] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1615861 (10adrianheine) I'm back, and I'm happy to walk through the process with someone on IRC if that's necessary :) [09:19:30] paravoid: I can spend some time killing it if it feels like a priority to you. [09:19:45] it's not realistically a priority [09:19:54] but it'd be nice to have this done at some point [09:20:15] I started the migration to modules more than 3 years ago at this point [09:20:38] 3 isn't a nice number. You know what is? 4! [09:20:40] :p [09:20:45] paravoid: is it the last one? [09:20:47] I usually just kill manifests when e.g. I'm on a plane :) [09:21:08] We can fixup and merge mutante's patch [09:21:27] Which moves it to a module [09:23:26] paravoid: \o/ should be easy enough if you could take a quick look [09:28:14] YuviPanda: https://phabricator.wikimedia.org/T80387 [09:28:32] YuviPanda: lacks a lot of information -- is this still relevant to you, and if so, could you adjust the description? [09:30:42] paravoid: ha-ha first time I'm seeing that ticket. 
I know releng (and chasemp) wanted lvs in labs but it wasn't possible due to some nova network limitations. So I think releng will be more interested than I am [09:31:26] I think I'll decline it and they can file a new task if that's needed [09:31:47] this "as discussed on IRC 4 years ago" task isn't very useful [09:31:52] YuviPanda: yeah we talked about it for the lvs / mw load balancing [09:32:22] paravoid: yeah [09:32:25] On beta cluster the LB is currently done by passing the backends as a list [09:32:31] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1615885 (10fgiunchedi) iiif.io seems nice! re: the original url scheme I'm not sure about mime type in url, what was the rationale? [09:33:04] I think we can decline the LVS on labs task [09:33:10] already done [09:33:17] \O/ [09:34:47] 6operations, 10netops: JSNMP flood of errors across multiple switches - https://phabricator.wikimedia.org/T83898#1615893 (10faidon) p:5Normal>3Lowest [09:37:20] 6operations, 10ops-esams: Replace cr2-knams MX80 MIC slot with a 2x10G MIC - https://phabricator.wikimedia.org/T111765#1615909 (10faidon) 3NEW a:3mark [09:37:37] 6operations, 10ops-esams: Replace cr2-knams MX80 MIC slot with a 2x10G MIC - https://phabricator.wikimedia.org/T111765#1615917 (10faidon) [09:40:39] 6operations, 7Database: Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value - https://phabricator.wikimedia.org/T111766#1615926 (10jcrespo) 3NEW a:3jcrespo [09:41:32] 6operations, 7Database: Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value - https://phabricator.wikimedia.org/T111766#1615940 (10jcrespo) p:5Triage>3Normal [09:46:49] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. 
UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1615949 (10jcrespo) p:5Triage>3Normal Analytics stores (dbstoreX) are all with read_only=0 (aka s3-analytics-slave). I do not know yet, I suppose that because they have their... [09:48:09] 6operations: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433#1615953 (10jcrespo) p:5Triage>3Normal [09:50:12] 6operations: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433#1615955 (10jcrespo) a:3MoritzMuehlenhoff As the interest part, @MoritzMuehlenhoff I have made you temporarily the owner of this. But hopefully some service owner can help you seeing how to... [09:53:03] 6operations, 7discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1615974 (10jcrespo) p:5Triage>3Normal [09:53:20] 6operations: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433#1615977 (10MoritzMuehlenhoff) > As the interest part, @MoritzMuehlenhoff I have made you temporarily the owner of this. But hopefully some service owner can help you seeing how to proceed.... [09:54:02] 6operations, 10Beta-Cluster, 7Database: Possible to run writes (e.g. UPDATE) on slave - https://phabricator.wikimedia.org/T110115#1615979 (10hashar) From `wmf-config/db-labs.php` the #beta-cluster has two SQL servers, a master and a slave: ``` 'hostsByName' => array( 'deployment-db1' =>... [09:54:05] 6operations, 7discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#1615980 (10jcrespo) a:3yuvipanda Assigning it to yourself until a service owner can help you, so you do not forget. 
[09:55:34] !log uploaded debdeploy 0.0.5 to carbon [09:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:43] (03CR) 10Faidon Liambotis: swift: refactor graphite/icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [10:00:44] godog: ^ [10:02:33] 6operations, 7HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1615990 (10faidon) [10:03:21] 7Puppet, 6operations: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377#1615993 (10faidon) [10:04:58] aah, supposed to be deploying :) [10:05:15] * aude needs to run some scripts first [10:09:03] (03PS1) 10Faidon Liambotis: network: use mr1-eqiad/ulsfo's loopback IPs [puppet] - 10https://gerrit.wikimedia.org/r/236755 [10:09:05] (03PS1) 10Faidon Liambotis: rancid: kill long-gone csw1-esams/br1-knams [puppet] - 10https://gerrit.wikimedia.org/r/236756 [10:09:49] (03PS2) 10Filippo Giunchedi: swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) [10:09:53] (03CR) 10Faidon Liambotis: [C: 032] network: use mr1-eqiad/ulsfo's loopback IPs [puppet] - 10https://gerrit.wikimedia.org/r/236755 (owner: 10Faidon Liambotis) [10:10:11] (03CR) 10Faidon Liambotis: [C: 032] rancid: kill long-gone csw1-esams/br1-knams [puppet] - 10https://gerrit.wikimedia.org/r/236756 (owner: 10Faidon Liambotis) [10:10:55] (03CR) 10Faidon Liambotis: [C: 031] swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [10:10:57] (03CR) 10Filippo Giunchedi: swift: refactor graphite/icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [10:11:34] 
(03PS13) 10Yuvipanda: toollabs: add script to generate python package listings [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: 10Merlijn van Deen) [10:12:03] (03CR) 10Yuvipanda: [C: 032 V: 032] "So I'm slightly reserved about this, but this will also probably allow us to make building docker images and such easier in the future, an" [puppet] - 10https://gerrit.wikimedia.org/r/228635 (https://phabricator.wikimedia.org/T101646) (owner: 10Merlijn van Deen) [10:12:18] (03PS2) 10Yuvipanda: toollabs: add python-pyicu [puppet] - 10https://gerrit.wikimedia.org/r/236419 (https://phabricator.wikimedia.org/T102165) (owner: 10Merlijn van Deen) [10:12:27] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: add python-pyicu [puppet] - 10https://gerrit.wikimedia.org/r/236419 (https://phabricator.wikimedia.org/T102165) (owner: 10Merlijn van Deen) [10:12:40] (03PS2) 10Yuvipanda: toollabs: add python-enum34 [puppet] - 10https://gerrit.wikimedia.org/r/236420 (https://phabricator.wikimedia.org/T111602) (owner: 10Merlijn van Deen) [10:12:48] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: add python-enum34 [puppet] - 10https://gerrit.wikimedia.org/r/236420 (https://phabricator.wikimedia.org/T111602) (owner: 10Merlijn van Deen) [10:13:06] (03PS2) 10Yuvipanda: toollabs: add python-pil [puppet] - 10https://gerrit.wikimedia.org/r/236421 (https://bugzilla.wikimedia.org/108210) (owner: 10Merlijn van Deen) [10:13:26] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: add python-pil [puppet] - 10https://gerrit.wikimedia.org/r/236421 (https://bugzilla.wikimedia.org/108210) (owner: 10Merlijn van Deen) [10:17:42] (03PS4) 10Faidon Liambotis: Remove/ensure=> absent *_pass scripts [puppet] - 10https://gerrit.wikimedia.org/r/232903 [10:17:59] (03PS5) 10Faidon Liambotis: Remove/ensure=> absent *_pass scripts [puppet] - 10https://gerrit.wikimedia.org/r/232903 [10:18:18] (03PS6) 10Faidon Liambotis: Remove/ensure => absent *_pass scripts [puppet] - 
10https://gerrit.wikimedia.org/r/232903 [10:18:30] (03CR) 10Faidon Liambotis: [C: 032] Remove/ensure => absent *_pass scripts [puppet] - 10https://gerrit.wikimedia.org/r/232903 (owner: 10Faidon Liambotis) [10:19:01] (03PS3) 10Filippo Giunchedi: swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) [10:19:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [10:19:24] paravoid: I merged yours too [10:19:30] (03PS4) 10Filippo Giunchedi: swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) [10:19:34] Yuvipanda: toollabs: add script to generate python package listings (003541b) [10:19:37] WARNING: Revision range includes commits from multiple committers! [10:19:40] Merge these changes? (yes/no)? 
[10:19:40] wow, us three are racing each other aren't we [10:19:45] (03CR) 10Filippo Giunchedi: [V: 032] swift: refactor graphite/icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/236751 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [10:19:48] haha [10:19:51] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1616004 (10jcrespo) [10:20:01] yay more races [10:20:45] looking at https://github.com/wikimedia/operations-puppet/graphs/contributors I'm less than a 100 commits away from paravoid :P [10:20:53] * YuviPanda self diagnoses with commitcountitis [10:22:54] hehehe [10:23:09] 6operations, 10Traffic: Fix Varnish TTLs across the board - https://phabricator.wikimedia.org/T108612#1616015 (10jcrespo) [10:27:26] (03PS1) 10Aude: Enable usage tracking on wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236757 (https://phabricator.wikimedia.org/T111142) [10:29:00] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1616043 (10hashar) 5Resolved>3Open The Parsoid cache `deployment-parsoidcache02` is still on Trusty :( {T103660} [10:29:15] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1616048 (10hashar) [10:30:08] * aude deploying [10:31:06] (03CR) 10Aude: [C: 032] Enable usage tracking on wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236757 (https://phabricator.wikimedia.org/T111142) (owner: 10Aude) [10:31:14] (03Merged) 10jenkins-bot: Enable usage tracking on wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236757 (https://phabricator.wikimedia.org/T111142) (owner: 10Aude) [10:31:17] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - 
https://phabricator.wikimedia.org/T109497#1616063 (10jcrespo) I do not know which is the consensus here. Should we apply the patch or should we abandon pr... [10:32:30] !log aude@tin Synchronized usagetracking.dblist: Enable usage tracking on Wikibooks (duration: 00m 11s) [10:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:02] 6operations, 10Wikimedia-Apache-configuration: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1616074 (10jcrespo) [10:37:04] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: puppet fail [10:41:32] 6operations, 7Database: defragment db1015 and db1035 - https://phabricator.wikimedia.org/T110504#1616076 (10jcrespo) [10:41:59] with the wikidata deployment done, I'll restart the salt master in about 5 minutes unless anyone objects [10:42:33] (03PS2) 10Muehlenhoff: Add ferm rules for pybal SSH health checks [puppet] - 10https://gerrit.wikimedia.org/r/236734 [10:42:41] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for pybal SSH health checks [puppet] - 10https://gerrit.wikimedia.org/r/236734 (owner: 10Muehlenhoff) [10:45:57] labvirt1002 and 8 disk space, YuviPanda, are those actionables? [10:48:49] !log restarted salt master on palladium [10:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:30] (03PS1) 10Muehlenhoff: Enable ferm on two initial appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236759 [10:52:43] jynus: are they having issues again with disk space? 
[10:53:20] YuviPanda, I just noticed the alert, I can ack it if it is not an issue [10:53:30] no it could be [10:53:35] i'm taking a look now [10:53:39] but I want you in the loop [10:53:42] :-) [10:54:12] jynus: :D checking to see how much space they have [10:54:51] ok, so need to shuffle them around [10:54:57] (03PS1) 10Filippo Giunchedi: swift: graphite_alerts should be a define [puppet] - 10https://gerrit.wikimedia.org/r/236760 [10:55:11] jynus: I think they'll survive till andrewbogott comes online - he's been attempting to deal with these... [10:55:22] jynus: longer term we've more machines on the way [10:55:50] (03PS6) 10Jcrespo: Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 [10:58:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on two initial appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236759 (owner: 10Muehlenhoff) [10:59:05] (03PS7) 10Jcrespo: Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 [11:01:19] (03PS3) 10Faidon Liambotis: Remove class misc::deployment::passwordscripts [puppet] - 10https://gerrit.wikimedia.org/r/232904 [11:01:42] (03CR) 10Faidon Liambotis: [C: 032] Remove class misc::deployment::passwordscripts [puppet] - 10https://gerrit.wikimedia.org/r/232904 (owner: 10Faidon Liambotis) [11:02:06] (03CR) 10Faidon Liambotis: [V: 032] Remove class misc::deployment::passwordscripts [puppet] - 10https://gerrit.wikimedia.org/r/232904 (owner: 10Faidon Liambotis) [11:03:23] (03CR) 10Jcrespo: [C: 032] Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 (owner: 10Jcrespo) [11:03:35] (03PS8) 10Jcrespo: Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 [11:04:32] (03PS1) 10Merlijn van Deen: toollabs: remove duplicate python packages [puppet] - 
10https://gerrit.wikimedia.org/r/236762 [11:04:40] (03PS2) 10Faidon Liambotis: swift: graphite_alerts should be a define [puppet] - 10https://gerrit.wikimedia.org/r/236760 (owner: 10Filippo Giunchedi) [11:04:47] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: graphite_alerts should be a define [puppet] - 10https://gerrit.wikimedia.org/r/236760 (owner: 10Filippo Giunchedi) [11:05:21] (03PS9) 10Jcrespo: Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 [11:06:08] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:07:07] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:10:09] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 [11:10:58] PROBLEM - puppet last run on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:10:58] PROBLEM - Check size of conntrack table on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:11:58] PROBLEM - configured eth on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:12:08] PROBLEM - RAID on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:12:17] PROBLEM - SSH on bast2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:38] PROBLEM - dhclient process on bast2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:14:17] RECOVERY - SSH on bast2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [11:14:28] RECOVERY - dhclient process on bast2001 is OK: PROCS OK: 0 processes with command name dhclient [11:14:59] RECOVERY - Check size of conntrack table on bast2001 is OK: OK: nf_conntrack is 0 % full [11:15:00] akosiaris: what are you doing? 
[11:15:07] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [11:15:10] you're OOMing bast2001 [11:15:18] akosiar+ 25275 33.7 45.5 10408688 7441548 pts/4 S+ 11:01 4:28 python stresstest.py [11:15:21] akosiar+ 25475 2.7 28.1 10417220 4609228 pts/4 Sl+ 11:05 0:15 python stresstest.py [11:15:24] akosiar+ 25478 2.9 29.0 10424652 4753036 pts/4 Sl+ 11:05 0:16 python stresstest.py [11:15:57] RECOVERY - configured eth on bast2001 is OK: OK - interfaces up [11:16:02] killed them [11:16:07] RECOVERY - RAID on bast2001 is OK: OK: no disks configured for RAID [11:26:45] paravoid: hmmm OOM wasn't supposed to show up. I must have a bug [11:50:53] (03PS1) 10Muehlenhoff: Add definitions for LVSes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/236765 [12:00:10] (03PS2) 10Yuvipanda: toollabs: remove duplicate python packages [puppet] - 10https://gerrit.wikimedia.org/r/236762 (owner: 10Merlijn van Deen) [12:01:57] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: remove duplicate python packages [puppet] - 10https://gerrit.wikimedia.org/r/236762 (owner: 10Merlijn van Deen) [12:08:29] (03PS4) 10Jcrespo: Redirect be-x-old.wikipedia.org to be-tarask.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/235943 (https://phabricator.wikimedia.org/T11823) (owner: 10Alex Monk) [12:09:17] (03CR) 10Jcrespo: [C: 032] Redirect be-x-old.wikipedia.org to be-tarask.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/235943 (https://phabricator.wikimedia.org/T11823) (owner: 10Alex Monk) [12:13:52] (03CR) 10Jcrespo: "Not even +1'd by the original committer, skipping puppet SWAT because we do not know if it is finished." [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [12:14:14] (03PS1) 10Muehlenhoff: Mark as notrack [puppet] - 10https://gerrit.wikimedia.org/r/236767 [12:14:23] jynus: the original commiter put it up for puppetswat [12:14:39] jynus: so I suppose it is considered complete? 
[12:15:08] ah, ok [12:15:26] I didn't check the wiki diff [12:15:40] (03CR) 10Yuvipanda: "Original commiter put it up for puppetswat, so I think it is considered complete." [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [12:15:52] (03PS1) 10Muehlenhoff: Enable initial videoscaler in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236768 [12:15:59] (03PS1) 10Hashar: nodepool: setup scripts are in integration/config [puppet] - 10https://gerrit.wikimedia.org/r/236769 (https://phabricator.wikimedia.org/T111377) [12:16:14] jynus: it's mentioned just before the list of patches too [12:17:25] (03PS2) 10Jcrespo: Remove obsolete comment about apache-config [puppet] - 10https://gerrit.wikimedia.org/r/236485 (owner: 10Alex Monk) [12:18:02] (03CR) 10Jcrespo: [C: 032] Remove obsolete comment about apache-config [puppet] - 10https://gerrit.wikimedia.org/r/236485 (owner: 10Alex Monk) [12:18:15] (03PS2) 10Hashar: nodepool: setup scripts are in integration/config [puppet] - 10https://gerrit.wikimedia.org/r/236769 (https://phabricator.wikimedia.org/T111377) [12:18:45] it should probably be acked in any case by someone from labs (you) or releaseEng [12:20:22] (03PS2) 10Muehlenhoff: Enable initial videoscaler in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236768 [12:20:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable initial videoscaler in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236768 (owner: 10Muehlenhoff) [12:23:39] (03PS1) 10Muehlenhoff: Enable initial api appserver in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236771 [12:25:46] PROBLEM - Host mw2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:30:10] I am not going to +2 those, the changes are big, I am not familiar with beta, and nobody else has +1'd them [12:31:28] jynus: I'm looking at them now [12:31:50] is that puppet swat ? 
:D [12:31:57] hashar, yes [12:32:15] it is very convenient to have it at the beginning of the European afternoon [12:32:37] I am ok with owning them, learning about beta and reviewing them at another time [12:32:54] the apache config in beta is scary :( [12:33:26] it is not a big deal, after all, it is beta, it is there to break [12:33:33] :-) [12:33:47] but I want to set the standards for all changes [12:34:55] jynus: hashar I think for beta-only changes, the standard could be that it should be cherry-picked on the beta puppetmaster [12:34:57] Krenair: ^ [12:35:57] RECOVERY - Host mw2007 is UP: PING OK - Packet loss = 0%, RTA = 34.36 ms [12:36:00] (03CR) 10BBlack: [C: 04-1] "We should cover the public subnets similarly (the LVSes attach to them as well and can route traffic to machines in them, although current" [puppet] - 10https://gerrit.wikimedia.org/r/236765 (owner: 10Muehlenhoff) [12:36:06] to be fair, I am very unfamiliar with the beta deployment, because in my normal role I do not use it much (as an admin, I mean) [12:36:42] (03CR) 10BBlack: [C: 04-1] "See comments in I75267adabb3408fecb5bdd3e7f56b5a4b4e80b6f" [puppet] - 10https://gerrit.wikimedia.org/r/236519 (owner: 10Muehlenhoff) [12:37:13] if you guys have some spare time left, I have a couple patches pending for nodepool :D [12:37:43] hashar: aren't you already working with andrewbogott on those? [12:37:52] trueish [12:38:17] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:38:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[12:38:22] will poke him [12:39:04] (03CR) 10BBlack: "Also, https://phabricator.wikimedia.org/T104458#1451642 is inter-related here (new LVS in eqiad will be deployed eventually, and apparentl" [puppet] - 10https://gerrit.wikimedia.org/r/236765 (owner: 10Muehlenhoff) [12:39:41] ^there is a pending firewall change on puppet-master [12:40:14] I suppose it is you, moritzm ? [12:40:53] jynus: yes, I've just merged it [12:40:55] (only asking because of the SWAT) [12:40:57] jynus: wait, puppetswat is in 2 hours, I suppose that's why Krenair isn't here atm [12:41:23] !log change whisper aggregation for 'sum.wsp' files T111170 [12:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:28] oh, we didn't notify him? [12:42:27] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:42:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:42:35] oh, I read the hours wrongly [12:42:50] my fault [12:43:09] jynus: yeah, think it's ok. Krenair should be around in a few hours to respond [12:44:06] less work for later, unless some of the patches crashes everything [12:44:13] heh [12:44:14] yeah [12:44:22] jynus: usually I don't SWAT them unless the proposer is around [12:44:37] it's ok, the others were trivial [12:45:00] it was just the beta apache config that was non-trivial [12:45:08] which I didn't touch [12:45:16] sorry for that [12:45:22] it's ok! 
[12:45:27] (the time misunderstanding) [12:45:38] jynus: np :D [12:45:48] jynus: we can do the remaining ones on time I guess [12:45:54] yes [12:51:31] 6operations, 6Performance-Team, 7Graphite, 5Patch-For-Review: "sum" aggregation broken in Graphite - https://phabricator.wikimedia.org/T111170#1616194 (10fgiunchedi) 5Open>3Resolved existing files have been changed to the sum aggregation, (tentatively) resolving [12:53:29] (03PS1) 10Yuvipanda: k8s; Switch to using debian packages for binary deployment [puppet] - 10https://gerrit.wikimedia.org/r/236772 [12:53:42] godog: btw, I responded on the ticket about graphite gc for labs [12:54:58] (03CR) 10Yuvipanda: [C: 032] k8s; Switch to using debian packages for binary deployment [puppet] - 10https://gerrit.wikimedia.org/r/236772 (owner: 10Yuvipanda) [12:57:51] (03PS2) 10BBlack: Remove seemingly-obsolete apache redirect for ruwikinews [puppet] - 10https://gerrit.wikimedia.org/r/236591 (https://phabricator.wikimedia.org/T111715) (owner: 10Alex Monk) [12:58:53] YuviPanda: sweet! I was about to ask about it :D [12:59:24] (03CR) 10BBlack: [C: 032] Remove seemingly-obsolete apache redirect for ruwikinews [puppet] - 10https://gerrit.wikimedia.org/r/236591 (https://phabricator.wikimedia.org/T111715) (owner: 10Alex Monk) [12:59:57] godog: yeah you can probably just revert that revert and verify [13:00:32] (03CR) 10BBlack: [C: 031] Switch Mexico to codfw [dns] - 10https://gerrit.wikimedia.org/r/236235 (owner: 10Faidon Liambotis) [13:00:39] (03CR) 10BBlack: [C: 031] Switch US states AR,LA,NM,OK to codfw [dns] - 10https://gerrit.wikimedia.org/r/236236 (owner: 10Faidon Liambotis) [13:01:23] (03CR) 10Jcrespo: [C: 032] "Deployed a long time ago, no objections shown." [software/redactatron] - 10https://gerrit.wikimedia.org/r/232176 (owner: 10Jcrespo) [13:01:32] (03CR) 10Jcrespo: [V: 032] "Deployed a long time ago, no objections shown." 
[software/redactatron] - 10https://gerrit.wikimedia.org/r/232176 (owner: 10Jcrespo) [13:04:06] bblack: so, shall I merge then? :) [13:04:09] YuviPanda: yup, fairly low priority but I forgot about the archiver, thanks! [13:04:43] godog: yup. so not too much work to turn it back on [13:05:09] apergos: did you document your ongoing salt work somewhere? [13:07:46] paravoid: if the network's ok with it, go for it :) [13:08:43] :) [13:08:47] (03PS2) 10Faidon Liambotis: Switch Mexico to codfw [dns] - 10https://gerrit.wikimedia.org/r/236235 [13:09:01] (03CR) 10Faidon Liambotis: [C: 032] Switch Mexico to codfw [dns] - 10https://gerrit.wikimedia.org/r/236235 (owner: 10Faidon Liambotis) [13:09:16] just next week will be tricky during the telia split [13:09:53] we'll still have the other wave (hopefully) [13:10:32] at some point we need to map our approximately what our final target is [13:10:41] in terms of user split to ulsfo/codfw/eqiad/esams [13:10:48] yeah [13:10:52] (and fallback options for 1xDC outages) [13:11:10] so if everything goes according to schedule [13:11:28] we'll have both codfw + ulsfo next week [13:11:30] Sep 16th [13:12:08] we'll have what there next week? [13:12:26] yeah, I'll elaborate [13:12:31] so eqord is online since yesterday [13:12:41] just the ulsfo-eqord link so far [13:12:44] 10G wave [13:13:15] we've been given a date of Sept 16th for the migration to the split wave, i.e. 
split one of the eqiad-codfw ones to eqiad-eqord + eqord-codfw [13:13:42] which means that by then we'll have the following links active: eqiad-codfw, eqiad-eqord, eqord-codfw, eqord-ulsfo [13:13:57] that will just temporarily hurt our redundancy while the work is ongoing I guess [13:13:57] all 10G waves, no silly congested MPLS circuits [13:14:01] (03PS1) 10Yuvipanda: tools: Add k8s tools master role [puppet] - 10https://gerrit.wikimedia.org/r/236774 [13:14:02] yes [13:14:09] but at that point we'll be able to push ulsfo back at full capacity [13:14:14] ok [13:14:22] and codfw will be at full redundancy after that point as well [13:14:51] (03CR) 10jenkins-bot: [V: 04-1] tools: Add k8s tools master role [puppet] - 10https://gerrit.wikimedia.org/r/236774 (owner: 10Yuvipanda) [13:15:09] there's also already a 10G wave for ulsfo<->codfw as well, or not? [13:15:23] or what is the other link for ulsfo right now? [13:15:42] there's also a 10G wave for ulsfo-codfw that is still in procurement/with legal :( [13:15:49] for the past two months or so [13:15:58] (03PS2) 10Yuvipanda: tools: Add k8s tools master role [puppet] - 10https://gerrit.wikimedia.org/r/236774 [13:16:03] so until that's done, the other path for ulsfo is still the mpls? 
[13:16:05] but we still have the two crappy MPLS circuits to ulsfo [13:16:08] ok [13:16:09] yes [13:16:12] 6operations, 10hardware-requests: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616229 (10mark) [13:16:40] 6operations, 6Performance-Team, 10hardware-requests: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616234 (10Reedy) [13:17:06] after next week, we're gonna cancel one of those [13:17:16] so once we get past the ~Sept 16 stuff and that ulsfo-codfw thing with legal, we should basically be done with implementing the nice network setup within the US [13:17:16] and then the other mpls when the other 10G wave comes up [13:17:24] almost yeah [13:17:25] (the coffee-bean shape from before, etc) [13:17:26] bblack: yes [13:17:28] just not the esams part [13:17:37] and a 3rd wave codfw-eqiad [13:17:39] but that's it [13:17:43] ok [13:17:43] and a transit replacement :) [13:17:50] that's not in that diagram ;) [13:17:51] (incl. a third transit @ codfw) [13:17:51] but yeah [13:18:01] it'll be in good coffee bean shape then yes ;) [13:18:12] alternatively we can order our existing third transit @ codfw/eqdfw [13:18:14] (03PS3) 10Yuvipanda: tools: Add k8s tools master role [puppet] - 10https://gerrit.wikimedia.org/r/236774 [13:18:15] but ugh [13:18:28] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add k8s tools master role [puppet] - 10https://gerrit.wikimedia.org/r/236774 (owner: 10Yuvipanda) [13:18:30] they suck [13:18:40] so yeah above every statement uses "codfw" not "eqdfw" - is any of this actually going through eqdfw, or is eqdfw just off the side of codfw for peering, etc? 
[13:18:58] ulsfo - eqdfw soon [13:19:19] once it passes legal [13:19:41] 6operations, 6Performance-Team, 10hardware-requests: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616238 (10jcrespo) No replication, these should be local-only, they are N-level cache (we can warm them one time, though). Set them for the first time on codfw, yes. I... [13:20:10] I guess a better question would be: in the near-term final US setup, which of our DCs has direct connections to eqdfw? Sounds like just ulsfo and codfw? [13:20:33] ? [13:20:40] mark: I thought the procurement was for ulsfo-codfw, not eqdfw [13:21:06] is that so? perhaps I remember it wrong [13:22:06] maybe it's time to make a new pretty map [13:22:14] for? [13:22:45] one that shows exactly what all of this looks like (or will very soon, anyways) [13:23:16] yes, it's to codfw, not eqdfw [13:23:18] (03PS2) 10Muehlenhoff: Add definitions for LVSes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/236765 [13:23:18] just checked [13:23:28] they were on-net on both, we picked codfw [13:23:37] so eqdfw only connects to codfw? [13:23:40] yes [13:23:42] ok [13:23:45] and to peering/transits [13:23:48] right [13:24:10] so it's basically just a remote extension of codfw in some sense, like esams/knams, but a little different [13:24:29] does ferm conflict with iptables rules from other software? [13:24:36] probably [13:24:51] hmm, so I can't use base::firewall with kube-proxy, I guess [13:25:07] I don't know if ferm touches -t nat but I guess it'll still try to 'fully manage' them [13:26:15] bblack: i thought the old pretty map does just that, soon? 
:) [13:26:24] now I can't even find the link [13:26:30] 6operations, 10netops: Add missing rack "locations" in Librenms - https://phabricator.wikimedia.org/T84205#1616241 (10faidon) 5Open>3Resolved a:3faidon I did the following: - Set all of the eqiad/codfw PDUs SNMP sysLocation to "eqiad" and "codfw", respectively, using snmpset and their read/write communit... [13:26:43] https://drive.google.com/drive/u/0/folders/0By9f9UqxCyCQWkNVRF9SVzR4ZlU [13:28:28] https://librenms.wikimedia.org/locations/view=traffic/ :) [13:28:36] takes a while to load [13:28:38] but is nice! [13:28:45] https://librenms.wikimedia.org/locations/ is the easy one [13:28:49] I cleaned up librenms a bit [13:28:56] we were getting quite lost with those new sites [13:29:00] yeah I guess what I'd like to make, is something like the 2015 map there, but with codfw/eqdfw split up, and maybe with some kind of visual on where we do peering/transit hookups at too [13:31:00] maybe combine that all up with labeling out the DC roles too, so it's useful for explaining things to anyone [13:31:08] (T1 vs T2 vs network-only) [13:31:17] (03PS3) 10Muehlenhoff: Add definitions for LVSes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/236765 [13:32:05] (03PS1) 10Hashar: nodepool: run ready.sh when finishing instances [puppet] - 10https://gerrit.wikimedia.org/r/236776 (https://phabricator.wikimedia.org/T111377) [13:34:17] (03CR) 10Gilles: [C: 031] Introduce apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/207338 (owner: 10Ori.livneh) [13:35:10] (03PS2) 10Faidon Liambotis: Switch US states AR,LA,NM,OK to codfw [dns] - 10https://gerrit.wikimedia.org/r/236236 [13:35:25] (03CR) 10Faidon Liambotis: [C: 032] Switch US states AR,LA,NM,OK to codfw [dns] - 10https://gerrit.wikimedia.org/r/236236 (owner: 10Faidon Liambotis) [13:35:38] paravoid: knams missing in the librenms sites [13:35:49] (03PS1) 10Yuvipanda: tools: Open up firewall holes for kube apiserver [puppet] - 
10https://gerrit.wikimedia.org/r/236778 [13:35:55] (03CR) 10jenkins-bot: [V: 04-1] tools: Open up firewall holes for kube apiserver [puppet] - 10https://gerrit.wikimedia.org/r/236778 (owner: 10Yuvipanda) [13:35:55] right, I don't have it as a separate site in my tool... [13:36:09] eqdfw/codfw and knams/esams are similar in that respect [13:37:14] not much, knams is still a gateway for esams [13:37:26] it won't be soon, sure [13:38:00] but so far the router config between cr2-knams/cr1-esams is identical, which isn't the case for eqdfw/eqord [13:38:57] ...not sure how that's relevant :) [13:39:45] (03PS2) 10Yuvipanda: tools: Open up firewall holes for kube apiserver [puppet] - 10https://gerrit.wikimedia.org/r/236778 [13:41:30] (03CR) 10Yuvipanda: [C: 032] tools: Open up firewall holes for kube apiserver [puppet] - 10https://gerrit.wikimedia.org/r/236778 (owner: 10Yuvipanda) [13:42:34] mostly I just care about this from the POV of understanding where traffic's going on what links in cases of outages or capacity, etc [13:42:57] in that sense, it matters to know that codfw<->eqdfw have a connection between them, and which connections from elsehwere are going to which of those two, etc [13:43:00] mark: I switched the location for knams, ok [13:43:16] mark: the difference is that esams-knams are a single L2 domain, which makes some difference for some stuff [13:43:16] 6operations, 6Performance-Team, 10hardware-requests: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616271 (10jcrespo) a:3jcrespo [13:43:36] the other difference is that I was making some assumption about core/border-satellite routers based on the site, which was wrong :) [13:43:40] so when someone says "due to local DC power issues, we have a 4H outage window on Dec 17th in eqdfw", it's easy to know what that really means we have to plan for. 
[13:44:48] (or to understand exactly what's going on if cr1-eqdfw dies from a horrible bug or hardware failure) [13:45:22] like from another of its PSUs breaking? :P [13:45:37] ( https://phabricator.wikimedia.org/T110435 ) [13:46:35] 6operations, 10netops: Add missing rack "locations" in Librenms - https://phabricator.wikimedia.org/T84205#1616279 (10faidon) Per @mark, I moved cr2-knams to the location "knams". It should be feeling more lonely now :) [13:50:12] (03PS1) 10Yuvipanda: tools: Add a kubernetes worker role [puppet] - 10https://gerrit.wikimedia.org/r/236779 [13:50:20] (03CR) 10jenkins-bot: [V: 04-1] tools: Add a kubernetes worker role [puppet] - 10https://gerrit.wikimedia.org/r/236779 (owner: 10Yuvipanda) [13:50:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:50:29] (03PS2) 10Yuvipanda: tools: Add a kubernetes worker role [puppet] - 10https://gerrit.wikimedia.org/r/236779 [13:50:57] 6operations: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#1616281 (10BBlack) So, to kind of recap implicit things: in the general case, we definitely don't want to put **recursive** resolvers on all the machines. Most of the machines don't have public routing for tha... [13:54:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [13:56:36] (03PS3) 10Muehlenhoff: Add definitions for LVSes in codfw [puppet] - 10https://gerrit.wikimedia.org/r/236519 [14:00:04] YuviPanda jynus: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150908T1400). [14:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:23] Krenair: around? 
[14:06:13] :-) [14:06:53] * YuviPanda pokes Krenair [14:06:59] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1616317 (10Milimetric) @akosiaris / @fgiunchedi: I believe you can run any tests you like on those machines, they're not doing anything at the moment. But I agree with Alex's optimism, 10GB / d... [14:07:00] we promise we are not going to hurt you... much [14:07:39] only your commits [14:08:14] heh [14:12:01] (03PS1) 10Jcrespo: Pool es1015 and es1019 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236782 [14:12:58] (03PS2) 10Dzahn: mediawiki jobrunner: mark as notrack [puppet] - 10https://gerrit.wikimedia.org/r/236767 (owner: 10Muehlenhoff) [14:13:19] (03CR) 10Dzahn: [C: 031] mediawiki jobrunner: mark as notrack [puppet] - 10https://gerrit.wikimedia.org/r/236767 (owner: 10Muehlenhoff) [14:13:26] (03PS3) 10Dzahn: mediawiki jobrunner: mark as notrack [puppet] - 10https://gerrit.wikimedia.org/r/236767 (owner: 10Muehlenhoff) [14:14:50] jynus: if Krenair doesn't show up in the next 10mins I say we call it done and decline the remaining 6 patches [14:15:30] (03PS3) 10Yuvipanda: tools: Add a kubernetes worker role [puppet] - 10https://gerrit.wikimedia.org/r/236779 [14:15:36] 5 [14:15:43] 5 patches [14:15:54] ah yes [14:16:16] (03CR) 10BBlack: "The IPv6 part is kind of odd too. That's not really a network, it's just they happen to all have the same ethernet vendor and thus mac pr" [puppet] - 10https://gerrit.wikimedia.org/r/236765 (owner: 10Muehlenhoff) [14:16:24] (it is more like 1, separated) [14:16:36] yeah [14:16:39] all in sequence [14:16:52] (03CR) 10Yuvipanda: [C: 032] tools: Add a kubernetes worker role [puppet] - 10https://gerrit.wikimedia.org/r/236779 (owner: 10Yuvipanda) [14:17:28] moritzm: sorry for all the LVS addressing problems! 
it's not really your problem, it's just none of this was laid out with this kind of thing in mind initially :/ [14:18:19] I should at least add the interface::tagged stuff though [14:18:31] we, as in all ops, should also agree on certain standards- it cannot depend on each reviewer [14:18:51] 6operations, 6Performance-Team, 7Mobile: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1616334 (10Krinkle) [14:19:21] jynus: for CR? [14:19:31] or acceptable puppetswat patches? [14:19:38] both, actually [14:19:44] heh [14:19:56] but I suppose we are already doing that [14:19:57] jynus: I've been documenting guidelines for the latter in https://wikitech.wikimedia.org/wiki/PuppetSWAT [14:20:14] let me add a bit about beta [14:20:26] that is really nice [14:20:52] (03PS1) 10Muehlenhoff: Enable ferm on half of remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236783 [14:20:58] 6operations, 6Performance-Team, 7Mobile: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1616335 (10Krinkle) The main impact from this is organisational overhead (they are outdated copies, possibly sending the wrong versions of production s... 
[14:21:18] (03CR) 10Dzahn: "This might be a case where it can't be avoided, but T87519 used to be called "kill network.pp" (now: "Migrate as much as possible from net" [puppet] - 10https://gerrit.wikimedia.org/r/236765 (owner: 10Muehlenhoff) [14:21:42] (03CR) 10Ottomata: [C: 031] Enable ferm on half of remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236783 (owner: 10Muehlenhoff) [14:21:43] jynus: just added another item [14:22:05] jynus: i want to work on the former as well, but one step at a time :) [14:23:00] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - https://phabricator.wikimedia.org/T109497#1616339 (10hashar) Can we identify the jobs that actually need ElasticSearch? We could move the jobs to Trusty.... [14:24:32] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch: Please backport ElasticSearch 1.7.x from wikimedia-trusty to wikimedia-precise for CI needs - https://phabricator.wikimedia.org/T111781#1616342 (10hashar) 3NEW [14:25:11] bblack: sure, fully understood. if you prefer, we could also proceed with https://gerrit.wikimedia.org/r/#/c/236734/ and move to stricter rules once the LVS servers have been extended and the addressing has been sorted out, that would also be fine with me [14:25:21] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - https://phabricator.wikimedia.org/T109497#1550327 (10hashar) Filled T111781 to request the backport to Precise. Then we will just upgrade the package on... 
[14:25:33] (03PS1) 10Mjbmr: Add new user groups for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236785 (https://phabricator.wikimedia.org/T111024) [14:25:46] (03PS2) 10Muehlenhoff: Enable ferm on half of remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236783 [14:25:57] (03CR) 10Hashar: "This is still on the integration puppet master to workaround an issue with ElasticSearch 1.6" [puppet] - 10https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: 10Hashar) [14:27:11] (03PS1) 10Yuvipanda: tools: Allow flannel to access etcd [puppet] - 10https://gerrit.wikimedia.org/r/236787 [14:27:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce mendelevium to the cluster [dns] - 10https://gerrit.wikimedia.org/r/236033 (https://phabricator.wikimedia.org/T111532) (owner: 10Alexandros Kosiaris) [14:27:45] (03PS2) 10Yuvipanda: tools: Allow flannel to access etcd [puppet] - 10https://gerrit.wikimedia.org/r/236787 [14:27:52] (03PS7) 10Dzahn: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [14:28:07] jynus: Krenair ok, I am going to count today's SWAT as 'done' since Krenair didn't show up [14:28:13] (03PS8) 10Dzahn: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [14:29:05] I hate working so hard, YuviPanda [14:29:14] jynus: :D [14:29:21] jynus: I think thursday's going to be just you and akosiaris [14:29:25] np [14:29:33] I'll be recovering from a 24h long flight [14:29:34] hmm [14:29:35] not 24 [14:29:36] 16? 
[14:29:37] dunno [14:30:12] (03CR) 10Yuvipanda: [C: 032] tools: Allow flannel to access etcd [puppet] - 10https://gerrit.wikimedia.org/r/236787 (owner: 10Yuvipanda) [14:30:24] !log enabled ferm on hadoop workers up to analytics1039 [14:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:53] 6operations, 6Performance-Team, 10hardware-requests: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616384 (10jcrespo) p:5Triage>3Normal [14:32:55] (03PS1) 10Dzahn: RT: Apache config compat for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/236788 [14:33:13] (03PS2) 10Dzahn: RT: Apache config compat for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/236788 [14:34:00] (03PS3) 10Dzahn: RT: Apache config compat for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/236788 [14:34:56] (03CR) 10Dzahn: [C: 032] RT: Apache config compat for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/236788 (owner: 10Dzahn) [14:35:38] (03PS9) 10Dzahn: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [14:39:58] (03CR) 10Jcrespo: [C: 032] Pool es1015 and es1019 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236782 (owner: 10Jcrespo) [14:41:10] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - https://phabricator.wikimedia.org/T109497#1616426 (10JanZerebecki) Yes doing T111781 is the easier short term fix. In the long run we want to abandon pre... 
[14:41:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [14:41:43] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool es1015 and es1019 (duration: 00m 11s) [14:41:45] 10Ops-Access-Requests, 6operations: Requesting access to fluorine / mw-log-readers group for Addshore - https://phabricator.wikimedia.org/T111756#1616431 (10Dzahn) approval and NDA stuff is on T111204, it specifically says "analysis of api usage of wikidata." so that should cover this [14:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:42] does pybal RunCommand actually buy us anything in terms of pybal depooling a server more reliably in case of failure, etc? [14:44:39] we only use it for the mw hosts (apaches, rendering, api), and it basically ssh's the "uptime" command - but those use IdleConnection and fetches as well [14:46:33] just seems like a lot of complexity for a tertiary check that isn't directly checking the served service anyways - all the infrastructure for it with the authorized_keys setup and the user, and the config, and now firewall stuff because it's the exception for where ssh traffic could come from... [14:46:45] (03PS1) 10Yuvipanda: k8s: Don't make docker explicitly require flannel [puppet] - 10https://gerrit.wikimedia.org/r/236789 [14:47:04] (03PS2) 10Yuvipanda: k8s: Don't make docker explicitly require flannel [puppet] - 10https://gerrit.wikimedia.org/r/236789 [14:47:41] 6operations, 6Labs, 10Salt: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1616465 (10ArielGlenn) I made a full pass on these and labs looked ok. But here we are in September and labs is generally in bad shape. This includes toollabs. Let me list here the issues: 1) mor... [14:47:58] 6operations, 6Labs, 10Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1616466 (10ArielGlenn) [14:48:18] paravoid: ^ ?
thoughts [14:48:38] (03CR) 10Yuvipanda: [C: 032] k8s: Don't make docker explicitly require flannel [puppet] - 10https://gerrit.wikimedia.org/r/236789 (owner: 10Yuvipanda) [14:48:51] I guess not :) [14:48:54] but ask mark [14:49:04] as there's probably some bit of history there [14:49:17] mark: ? [14:49:25] (03PS5) 10Andrew Bogott: Added openstack config files for version Kilo [puppet] - 10https://gerrit.wikimedia.org/r/235399 (https://phabricator.wikimedia.org/T110045) [14:49:30] i'm not sure what the question is [14:49:38] yes, it was nice to have a generic check for a script [14:49:41] as we used for mediawiki apache checking [14:49:44] and that helped a lot back then [14:49:49] but i don't think we use it anywhere else [14:50:04] I guess the question is more about actively configuring it for the mw* hosts today, not about the pybal feature [14:50:26] why wouldn't we? [14:50:47] it was added mostly for checking failure to deploy to new hosts, particularly due to disk failures [14:50:51] s/new// [14:51:03] if we can safely do that in other ways we can do without, but is it in the way at all? [14:51:05] (03CR) 10Merlijn van Deen: [C: 031] labs_lvm: Only run extend-instance-vol when needed [puppet] - 10https://gerrit.wikimedia.org/r/235642 (https://phabricator.wikimedia.org/T109933) (owner: 10Tim Landscheidt) [14:52:09] mark: mostly I only started looking and thinking because it's an annoying issue for SSH firewall rules. We want to enforce that ssh only comes from a handful of places for most servers. Adding rules for "all of the IPs that LVS could source a RunCommand ssh connection from" turns out to be annoying and complicated... 
[14:52:23] ah i see [14:52:43] well it did really help avoid pooling servers that were broken and not up to date [14:53:08] i wouldn't call it the end all solution to that either, but until we have a firmer alternative in place, I think we should keep it [14:53:16] it could be in scope for next quarter goal though [14:53:34] right now all it's really validating is that ssh and /bin/sh and /usr/bin/uptime function correctly [14:54:01] and I guess implicitly that puppetization was at least partly successful, enough to create the pybal-check user and its .ssh/authorized_keys [14:54:53] yes, but if you have a drive failure, those don't work correctly [14:54:58] which means that deployments with scap fail(ed) [14:55:05] and apache would happily keep serving traffic [14:55:26] yeah. it would be nice if we could check that the service is operating correctly itself, though. [14:55:31] yes [14:55:45] I mean in theory if there's no way to make the HTTP checks catch that issue, then technically the HTTP service doesn't need disks heh [14:55:47] so we've talked in the past about explicitly checking for the latest deployment on each host through pybal [14:55:54] and it would be great if that could be in scope for the next quarter goal :) [14:55:58] yeah [14:56:02] and then we could get rid of this I guess [14:57:00] maybe the subtask on that is to implement a self-check URL in mediawiki that looks at scap-level issues somehow [14:57:26] I'm not sure what we'd compare against to catch it in a very realtime-y way though [14:58:46] or push this from the other end: if an scap is mostly successful but fails on one or a few hosts, we could directly icinga-alert on that so that we know to depool it. Or take it a scary step further and auto-depool it in confd. 
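The idea floated above, checking each host for the latest deployment instead of relying on the ssh "uptime" RunCommand check, could look roughly like the sketch below. Everything in it is an assumption for illustration: HostStatus, hosts_to_depool, the version strings and the notion that each host reports its deployed version are hypothetical, not pybal's or scap's actual interface.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """What a monitor might know about one backend (hypothetical shape)."""
    name: str
    http_ok: bool           # did the ordinary HTTP fetch check pass?
    deployed_version: str   # version the host itself reports (assumed self-check)


def hosts_to_depool(statuses, expected_version):
    """Return names of hosts that fail HTTP or serve a stale deployment.

    A host with a dead disk may keep serving HTTP happily while missing
    the latest sync, so the version comparison catches what the HTTP
    check alone would not.
    """
    return [
        s.name
        for s in statuses
        if not s.http_ok or s.deployed_version != expected_version
    ]


if __name__ == "__main__":
    fleet = [
        HostStatus("mw-a", http_ok=True, deployed_version="1.26wmf21"),
        HostStatus("mw-b", http_ok=True, deployed_version="1.26wmf20"),  # stale
        HostStatus("mw-c", http_ok=False, deployed_version="1.26wmf21"),
    ]
    print(hosts_to_depool(fleet, "1.26wmf21"))
```

The open question from the discussion remains: where the expected version comes from in a realtime-ish way is not settled here, so the sketch simply takes it as an argument.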
[14:59:06] or more-probably, scap will be tied into a rapid depool->update->repool via confd anyways, and it just leaves it dead if scap didn't work [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150908T1500). Please do the needful. [15:00:04] Mjbmr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:16] Hi [15:01:43] (03PS1) 10Muehlenhoff: Enable remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236792 [15:04:57] (03PS2) 10Alexandros Kosiaris: Introduce mendelevium to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/236035 (https://phabricator.wikimedia.org/T111532) [15:06:31] (03PS1) 10Dzahn: admin: add user for addshore [puppet] - 10https://gerrit.wikimedia.org/r/236793 [15:06:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[15:06:41] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1616537 (10Krenair) 5Open>3Resolved [15:06:44] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1616538 (10Krenair) [15:08:06] bblack: if we're planning to drop the pybal/SSH feature next quarter, I'd actually prefer to stick with https://gerrit.wikimedia.org/r/#/c/236734/ for now, it's still an improvement over the status quo and finalising the "networks" for the various LVS interfaces will require quite some additional work which will be void soon (also avoid potential regressions) [15:08:09] (03PS2) 10Dzahn: admin: add user for addshore [puppet] - 10https://gerrit.wikimedia.org/r/236793 (https://phabricator.wikimedia.org/T111756) [15:08:14] what do you think? [15:08:44] (03PS4) 10Alexandros Kosiaris: maps: Improve water_polygons population [puppet] - 10https://gerrit.wikimedia.org/r/235509 (https://phabricator.wikimedia.org/T109710) [15:08:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Improve water_polygons population [puppet] - 10https://gerrit.wikimedia.org/r/235509 (https://phabricator.wikimedia.org/T109710) (owner: 10Alexandros Kosiaris) [15:09:04] who? [15:09:14] (03PS3) 10Alexandros Kosiaris: Introduce mendelevium to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/236035 (https://phabricator.wikimedia.org/T111532) [15:09:19] wrong window? [15:09:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce mendelevium to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/236035 (https://phabricator.wikimedia.org/T111532) (owner: 10Alexandros Kosiaris) [15:10:11] (03CR) 10Jcrespo: [C: 04-1] "-1 until the 3 day-wait rule (expires on Thursday)." 
[puppet] - 10https://gerrit.wikimedia.org/r/236793 (https://phabricator.wikimedia.org/T111756) (owner: 10Dzahn) [15:10:40] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:13:45] (03CR) 10Dzahn: "@Jcrespo i did not add the user to any groups so that it's not giving access to anything to avoid that" [puppet] - 10https://gerrit.wikimedia.org/r/236793 (https://phabricator.wikimedia.org/T111756) (owner: 10Dzahn) [15:15:58] (03PS3) 10Alex Monk: Add new user groups for azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234910 (https://phabricator.wikimedia.org/T109755) (owner: 10Mjbmr) [15:16:06] (03CR) 10Alex Monk: [C: 032] Add new user groups for azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234910 (https://phabricator.wikimedia.org/T109755) (owner: 10Mjbmr) [15:16:14] (03Merged) 10jenkins-bot: Add new user groups for azbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234910 (https://phabricator.wikimedia.org/T109755) (owner: 10Mjbmr) [15:17:24] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234910/ (duration: 00m 12s) [15:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:56] 6operations, 10RESTBase-Cassandra, 10hardware-requests: codfw 3x spares for cassandra encryption testing - https://phabricator.wikimedia.org/T111382#1616616 (10Papaul) Below are the names of the 3 servers restbase-test2001 restbase-test2002 restbase-test2003 [15:18:58] moritzm: ok, works for me :) [15:20:12] (03PS2) 10Alex Monk: Add new user groups for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236785 (https://phabricator.wikimedia.org/T111024) (owner: 10Mjbmr) [15:20:21] (03CR) 10Alex Monk: [C: 032] Add new user groups for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236785 (https://phabricator.wikimedia.org/T111024) (owner: 10Mjbmr) [15:20:27] (03Merged) 10jenkins-bot: Add 
new user groups for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236785 (https://phabricator.wikimedia.org/T111024) (owner: 10Mjbmr) [15:20:30] (03CR) 10Jcrespo: "True, Dzahn. Also -1 before commenting with him ssh key reuse policy (I have not done that yet)." [puppet] - 10https://gerrit.wikimedia.org/r/236793 (https://phabricator.wikimedia.org/T111756) (owner: 10Dzahn) [15:21:23] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/236785/ (duration: 00m 12s) [15:21:26] Mjbmr, ^ [15:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:50] Krenair: Thanks! all works! [15:23:44] (03PS1) 10Alex Monk: Reverse InitialiseSettings mode change from I98dea8ee [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236797 [15:23:46] (03CR) 10jenkins-bot: [V: 04-1] Reverse InitialiseSettings mode change from I98dea8ee [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236797 (owner: 10Alex Monk) [15:24:11] (03PS2) 10Alex Monk: Reverse InitialiseSettings mode change from I98dea8ee [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236797 [15:24:13] (03CR) 10Dzahn: "reuse policy? it looked to me as if he created a new key because the comment has "20150902" in it" [puppet] - 10https://gerrit.wikimedia.org/r/236793 (https://phabricator.wikimedia.org/T111756) (owner: 10Dzahn) [15:24:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to hadoop / hive (analytics-privatedata-users) for Addshore - https://phabricator.wikimedia.org/T111204#1616645 (10Deskana) >>! In T111204#1615682, @Addshore wrote: > And as WDQS is now up and announced I will also be using access to an... 
[15:24:49] (03Abandoned) 10Dzahn: (WIP): script for list spam report [puppet] - 10https://gerrit.wikimedia.org/r/235170 (owner: 10Dzahn) [15:24:54] (03CR) 10Alex Monk: [C: 032] Reverse InitialiseSettings mode change from I98dea8ee [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236797 (owner: 10Alex Monk) [15:25:02] (03Merged) 10jenkins-bot: Reverse InitialiseSettings mode change from I98dea8ee [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236797 (owner: 10Alex Monk) [15:28:13] (03PS1) 10ArielGlenn: crap salt cleanup scripts primarily for labs use [software] - 10https://gerrit.wikimedia.org/r/236798 [15:33:23] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1616698 (10Krenair) >>! In T11823#1613434, @Elitre wrote: > If this is fixed, I think there should be user notice of the change.... [15:35:22] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1616705 (10Amire80) The users there already noticed it and they are very happy! - https://be-tarask.wikipedia.org/wiki/%D0%92%D1... 
[15:39:24] (03PS1) 10Ori.livneh: wikimedia/cdb 1.0.1 → 1.2.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236803 [15:39:27] legoktm: ^ [15:40:10] (03PS1) 10Yuvipanda: k8s: Explicitly specify CA file for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236804 [15:40:37] (03CR) 10Legoktm: [C: 031] wikimedia/cdb 1.0.1 → 1.2.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236803 (owner: 10Ori.livneh) [15:40:40] (03PS2) 10Yuvipanda: k8s: Explicitly specify CA file for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236804 [15:40:50] (03CR) 10Ori.livneh: [C: 032] wikimedia/cdb 1.0.1 → 1.2.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236803 (owner: 10Ori.livneh) [15:41:02] (03Merged) 10jenkins-bot: wikimedia/cdb 1.0.1 → 1.2.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236803 (owner: 10Ori.livneh) [15:43:06] (03CR) 10Yuvipanda: [C: 032] k8s: Explicitly specify CA file for flannel [puppet] - 10https://gerrit.wikimedia.org/r/236804 (owner: 10Yuvipanda) [15:43:52] !log ori@tin Synchronized multiversion: wikimedia/cdb 1.0.1 → 1.2.0 (duration: 00m 12s) [15:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:21] (03PS1) 10Filippo Giunchedi: install_server: elastic codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/236810 (https://phabricator.wikimedia.org/T111080) [15:52:01] (03PS1) 10Papaul: ADd DNS entries for restbase-test200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/236812 [15:56:20] (03CR) 10Ottomata: [C: 031] Enable remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236792 (owner: 10Muehlenhoff) [16:00:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [16:05:42] andrewbogott: btw, alerts for labvirt 7/8 for disk space [16:05:57] YuviPanda: thanks, I’ll look after the meeting [16:06:06] andrewbogott: ok [16:06:13] YuviPanda: unless they’re at 99/100% ? 
[16:06:36] andrewbogott: no, can wait till after meeting [16:06:44] andrewbogott: probably not for more than a day tho [16:07:05] 1007 is at 84%, I wonder why the alert is firing... [16:07:06] anyway [16:16:03] !log ori@tin Synchronized php-1.26wmf21/vendor: I5af46eb3: wikimedia/cdb 1.0.1 → 1.2.0 (duration: 00m 14s) [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:22] OK, so this is a curiosity from -commons [16:23:24] https://commons.wikimedia.org/wiki/File:Thousand_Island,_St._Lawrence_River.svg [16:23:32] I'm getting 429s on the thumbnail requests. [16:23:47] I suspect it's just a timeout on the thumbnailer, wanted to make sure nothing else was weird about it [16:24:37] Error generating thumbnail [16:24:37] Error creating thumbnail: Error reading SVG:Error domain 1 code 1 on line 3959 column 1 of file:///tmp/svg_44319e873bee412fac460b96/localcopy_7d1fc252895a-1.svg: internal error: Huge input lookup [16:25:24] it's a 48.64M svg [16:26:18] Yeah [16:26:24] I figured it was just about the size [16:26:26] on my macbook, running svgo: [16:26:28] $ svgo Thousand_Island,_St._Lawrence_River.svg [16:26:28] FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory [16:26:28] Abort trap: 6 [16:26:41] Hm? [16:26:44] i'm not sure if it's actually broken or just really, really complicated [16:26:51] I think a little of both perhaps [16:26:56] but either way prod is not the only thing struggling with it [16:26:58] but chrome renders it [16:27:01] The full SVG displays fine in Firefox [16:27:11] it'd be nice to figure it out, it has great encyclopedic value by the looks of it [16:27:48] marktraceur: haha: https://validator.w3.org/check?uri=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fe%2Fe7%2FThousand_Island%252C_St._Lawrence_River.svg&charset=%28detect+automatically%29&doctype=Inline&group=0 [16:35:20] Taking down w3.org one 50MB SVG at a time.
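One way to tell "broken" from "just really, really complicated" for a file like this is to stream through it and count elements without building the whole tree in memory. A minimal sketch using Python's standard-library parser follows; count_svg_elements is a hypothetical helper, and this is not how the production thumbnailer reads SVGs.

```python
import io
import xml.etree.ElementTree as ET


def count_svg_elements(source):
    """Count elements in an XML/SVG stream without keeping the full tree.

    `source` is a filename or a binary file object. A parse error
    (ET.ParseError) would suggest the file is malformed rather than
    merely large.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        count += 1
        # Drop this element's text, attributes and children once it is
        # closed, so memory use stays far below a full in-memory tree.
        elem.clear()
    return count


if __name__ == "__main__":
    tiny = io.BytesIO(
        b'<svg xmlns="http://www.w3.org/2000/svg">'
        b'<g><path d="M0 0"/><path d="M1 1"/></g></svg>'
    )
    print(count_svg_elements(tiny))
```

If the count comes back in the millions and no parse error is raised, the file is more likely "complicated" than "broken", which would match it rendering in browsers while size-limited tools give up.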
[16:35:47] 6operations, 10Traffic, 7HTTPS: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#1617144 (10BBlack) Updates on some exploration of option (2) above (actually looking at the varnish code and the APIs we'd be using in detail): * s2n - nice API, would be the simplest/clea... [16:36:10] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1617146 (10Amire80) So, the fallout begins ;) * T111818 * T111822 [16:37:27] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1617149 (10BBlack) [16:37:56] (03PS1) 10Yuvipanda: k8s: Setup basic authentication for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236830 [16:38:13] (03PS2) 10Yuvipanda: k8s: Setup basic authentication for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236830 [16:39:02] aude, around? [16:40:54] Krenair: I see her! [16:47:14] Krenair: ? [16:48:05] aude, hey, so we recently renamed a wikipedia domain and now sitelinks aren't working [16:48:24] I was wondering if that's something https://wikitech.wikimedia.org/wiki/Add_a_wiki#sites_table would take care of [16:49:48] (03PS3) 10Yuvipanda: k8s: Setup basic authentication for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236830 [16:49:56] (03PS4) 10Yuvipanda: k8s: Setup basic authentication for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236830 [16:51:18] (03CR) 10Yuvipanda: [C: 032] k8s: Setup basic authentication for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236830 (owner: 10Yuvipanda) [16:52:57] Krenair: it might [16:53:20] site links to just that wikipedia? [16:53:25] or all site links? 
[16:53:39] only at that one [16:53:40] https://phabricator.wikimedia.org/T111822 [16:53:44] we might try on testwikidatawiki [16:54:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [16:55:02] (03PS1) 10Yuvipanda: k8s: Make sure that /var/run/kubernetes is present [puppet] - 10https://gerrit.wikimedia.org/r/236835 [16:55:07] (03CR) 10jenkins-bot: [V: 04-1] k8s: Make sure that /var/run/kubernetes is present [puppet] - 10https://gerrit.wikimedia.org/r/236835 (owner: 10Yuvipanda) [16:55:21] (03PS2) 10Yuvipanda: k8s: Make sure that /var/run/kubernetes is present [puppet] - 10https://gerrit.wikimedia.org/r/236835 [16:55:44] (03CR) 10Alex Monk: "It's very unusual to +1 your own commit." [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [16:56:05] i think a second entry might get added for "be_tarask" [16:56:21] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Make sure that /var/run/kubernetes is present [puppet] - 10https://gerrit.wikimedia.org/r/236835 (owner: 10Yuvipanda) [16:56:30] !log krinkle@tin Synchronized php-1.26wmf21/extensions/CentralAuth: T108253 sul2 token store (duration: 00m 12s) [16:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:21] legoktm: ForeignApi and login seems to work fine still [16:57:29] think we need to keep the old entry (and think that would happen) for existing site links [16:57:48] (03CR) 10Dzahn: "+1 by the committer says "this is ready to go" / can be merged anytime. i agree that technically no vote should mean the same but that onl
i agree that technically no vote should mean the same but that onl" [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [16:58:12] Krinkle: awesome, thanks :D [16:58:19] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [16:58:42] Krenair: if you want me to poke at this later, when i get home, i can do [16:58:48] but want to go home and eat [16:59:09] okay [16:59:41] we have a lot of existing entries in the wb_items_per_site table with the old 'site id' [16:59:59] and want to make sure those work or figure out how to handle them [17:02:39] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:09:48] !log krinkle@tin Synchronized php-1.26wmf21/extensions/NavigationTiming: T109756 (duration: 00m 11s) [17:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:14] (03PS2) 10Alex Monk: Tidy up more comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236701 (https://phabricator.wikimedia.org/T31902) [17:11:14] (03CR) 10Krinkle: [C: 031] Tidy up more comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236701 (https://phabricator.wikimedia.org/T31902) (owner: 10Alex Monk) [17:11:46] ori: stand by for impact - we'll find out whether it the new metrics work or not :) [17:11:54] (and if we forgot to update anything else) [17:14:15] (03CR) 10Krinkle: Collect missing Navigation Timing metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236024 (https://phabricator.wikimedia.org/T109756) (owner: 10Phedenskog) [17:16:03] (03PS1) 10Yuvipanda: k8s: Add ABAC for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236838 [17:16:09] (03CR) 10jenkins-bot: [V: 04-1] k8s: Add ABAC for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236838 (owner: 10Yuvipanda) [17:16:22] (03PS2) 10Yuvipanda: k8s: Add ABAC for kubernetes [puppet] - 
10https://gerrit.wikimedia.org/r/236838 [17:17:25] (03CR) 10Yuvipanda: [C: 032] k8s: Add ABAC for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/236838 (owner: 10Yuvipanda) [17:23:57] 6operations, 7Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1617379 (10ArielGlenn) [17:24:00] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1617377 (10ArielGlenn) 5Open>3Resolved this is working now; closing. [17:24:49] 6operations: salt '*' test.ping after upgrade fails on many hosts - https://phabricator.wikimedia.org/T83095#1617382 (10ArielGlenn) 5Open>3Resolved a:3ArielGlenn yes indeed, these particular bugs have fixes in the version we run now. closing. [17:40:17] (03PS2) 10Filippo Giunchedi: Add DNS entries for restbase-test200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/236812 (https://phabricator.wikimedia.org/T111382) (owner: 10Papaul) [17:40:35] (03PS2) 10Muehlenhoff: Enable remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236792 [17:41:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable remaining Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/236792 (owner: 10Muehlenhoff) [17:43:13] !log enabled ferm on remaining hadoop workers (analytics1040-analytics1057) [17:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:04] (03CR) 10Dzahn: [C: 032] Add DNS entries for restbase-test200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/236812 (https://phabricator.wikimedia.org/T111382) (owner: 10Papaul) [17:53:07] (03CR) 10Dzahn: [C: 031] install_server: elastic codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/236810 (https://phabricator.wikimedia.org/T111080) (owner: 10Filippo Giunchedi) [17:53:12] (03PS2) 10Dzahn: install_server: elastic codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/236810 (https://phabricator.wikimedia.org/T111080) (owner: 10Filippo Giunchedi) [17:54:51] (03CR) 10Dzahn: [C: 032] 
"checked that these MAC addresses are in the HP vendor range" [puppet] - 10https://gerrit.wikimedia.org/r/236810 (https://phabricator.wikimedia.org/T111080) (owner: 10Filippo Giunchedi) [18:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150908T1800). [18:03:47] (03PS1) 10Ori.livneh: Add perf-admins group and add to relevant roles [puppet] - 10https://gerrit.wikimedia.org/r/236847 (https://phabricator.wikimedia.org/T110926) [18:18:58] (03PS1) 1020after4: symlinks for 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236849 [18:19:13] (03CR) 1020after4: [C: 032] symlinks for 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236849 (owner: 1020after4) [18:19:20] (03Merged) 10jenkins-bot: symlinks for 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236849 (owner: 1020after4) [18:19:36] That was a weird hiccup. [18:20:37] !log twentyafterfour@tin Started scap: testwiki to 1.26wmf22 [18:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:05] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:14] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:26:15] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:27:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [18:28:14] "👑 Keep calm and do not break the wiki" [18:28:35] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [18:29:59] 500 are down now [18:30:48] :) [18:32:57] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1617651 (10jcrespo) This will be done by @mark, with proper @Papaul interaction, as agreed on the Operations meeting. 
[18:35:26] PROBLEM - Apache HTTP on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.087 second response time [18:35:46] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.104 second response time [18:38:16] (03PS1) 10Ottomata: Make EventLogging MySQL consumer consume from kafka instead of 0mq [puppet] - 10https://gerrit.wikimedia.org/r/236853 (https://phabricator.wikimedia.org/T106260) [18:38:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:43:56] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: puppet fail [18:44:11] * aude grumbles [18:44:12] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617702 (10jcrespo) 5Open>3declined a:3jcrespo The result of the ops meeting was that this request was unclear, and that a more clear and detailed request should be done in order to be... [18:45:07] * aude is going to want to run scap again :/ [18:45:16] but not urgent or anything [18:48:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [18:48:46] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:49:16] 6operations, 10ops-eqiad: Prepare shipping label for mx80 to eqord - https://phabricator.wikimedia.org/T109338#1617717 (10Cmjohnson) 5Open>3Resolved Received the mx80 in eqord. 
Updated racktables with it's location [18:49:46] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:50:07] !log twentyafterfour@tin Finished scap: testwiki to 1.26wmf22 (duration: 29m 29s) [18:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:26] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:50:34] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617726 (10mark) 5declined>3Open @jcrespo: indeed we said we wanted a bit more information/justification as it was a rather vague request, but I think that's fine to do on this ticket, i... [18:50:47] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1617728 (10jcrespo) There was no time to go over these access request during the access request meeting. As this requires... [18:52:54] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617732 (10jcrespo) @mark, of course, that is why I said "Please reopen the ticket". [18:54:16] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:56:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:58:12] sjoerddebruin, jynus: did something break earlier? [18:59:01] Started before my message here. [19:02:34] Krenair: is https://gerrit.wikimedia.org/r/#/c/235843/ all we mean when we say we renamed the wiki? 
[19:02:51] aude, domain name change, yes [19:02:56] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617770 (10Tfinc) Happy to summarize group: "elasticsearch-roots" username: "tomasz" hosts: elastic{01..31}.eqiad.wmnet As part of the discovery department owning search I want to make sur... [19:02:59] ok, that's more simple [19:02:59] no change in database name [19:04:14] People let wiki rename requests get put on hold for years because database renaming had not been figured out [19:04:18] but users do not care about database names [19:05:19] 6operations, 10ops-eqiad: ms-be1010.eqiad.wmnet: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T111553#1617777 (10Cmjohnson) Disk ordered Congratulations: Work Order SR916882616 was successfully submitted. [19:06:01] Krenair: makes sense and is relatively simple [19:07:43] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617801 (10jcrespo) [19:09:13] (03PS1) 1020after4: group0 wikis to 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236859 [19:09:45] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236859 (owner: 1020after4) [19:09:51] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236859 (owner: 1020after4) [19:10:10] Krenair: I differ, they care about wiki code names [19:10:32] But I am not going to complain about something I agree with [19:10:46] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:10:54] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf22 [19:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:07] you don't agree with wiki hostnames being so heavily 
tied to database names, jynus? [19:11:32] I said I agreed [19:11:51] ok, let me rephrase [19:12:00] oh, you mean the current system [19:12:05] you agree with not requiring wiki hostnames to be so heavily tied to database names, jynus? [19:12:13] of course not, databases should be called db12345 [19:12:28] well, not that exactly, but arbitrary names [19:12:57] but it's what we have now [19:13:03] hm, not sure I'd go that far. I find the eventlogging table names annoying enough :p [19:13:15] :-P [19:13:34] (03PS2) 10Ottomata: Puppetize EventLogging on Kafka server-side-forwarder on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/236853 (https://phabricator.wikimedia.org/T106260) [19:14:00] !log twentyafterfour@tin Synchronized php-1.26wmf21/: sync php-1.26wmf21 as well (duration: 02m 31s) [19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:07] (03PS3) 10Ottomata: Puppetize EventLogging on Kafka server-side-forwarder on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/236853 (https://phabricator.wikimedia.org/T106260) [19:15:10] (03PS1) 10BBlack: remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 [19:15:19] (03CR) 10jenkins-bot: [V: 04-1] remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 (owner: 10BBlack) [19:16:31] (03CR) 10Ottomata: [C: 032] Puppetize EventLogging on Kafka server-side-forwarder on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/236853 (https://phabricator.wikimedia.org/T106260) (owner: 10Ottomata) [19:16:54] (03PS2) 10BBlack: remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 [19:19:00] (03PS3) 10BBlack: remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 [19:28:05] (03CR) 10BryanDavis: "I removed the cherry-pick of this from deployment-puppetmaster as it was conflicting with the current upstream HEAD" [puppet] - 
10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [19:28:48] bd808: when we scap, does hhvm get restarted on the machines? [19:28:58] aude: nope [19:29:04] oh [19:29:16] we have some patches that could do that but they are not very stable [19:29:32] we cache our Sites stuff now with APC (CACHE_ACCEL) [19:29:48] :/ [19:29:49] Hallo. [19:29:50] to get it to invalidate locally, i had to restart hhvm [19:30:03] but wonder if there is another way [19:30:12] do you cache with no TTL? [19:30:14] (03PS1) 10Ottomata: Use proper variable in eventlogging forwarder [puppet] - 10https://gerrit.wikimedia.org/r/236863 [19:30:22] there is a cache key [19:30:25] Remind me please, for a deployment of a change in an extension, do I need to do any preparation, like backporting, or do I just have to list it at https://wikitech.wikimedia.org/wiki/Deployments ? [19:30:31] I think we force a TTL now in the APC cache [19:30:36] (I mean SWAT.) [19:30:55] aharoni, what extension? [19:31:00] ContentTranslation [19:31:02] only VE needs submodule updates [19:31:06] bd808: we might [19:31:06] https://phabricator.wikimedia.org/T111850 [19:31:25] everything else is just the simple backport to the right branch, and putting the new commit on the calendar [19:31:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1617941 (10jcrespo) I am very sorry, but due to other pending access requests, this access, that includes sudo privileges, was scheduled but could not be processed... [19:31:49] Krenair: the "simple backport" is what I'm asking about. 
[19:31:53] (03CR) 10Ottomata: [C: 032] Use proper variable in eventlogging forwarder [puppet] - 10https://gerrit.wikimedia.org/r/236863 (owner: 10Ottomata) [19:31:58] 6operations, 6Labs, 5Patch-For-Review: labs salt master on jessie fails to install salt-master - https://phabricator.wikimedia.org/T110032#1617944 (10Andrew) I rebuilt new images last week, so this should be fixed. I have not directly verified this though. [19:32:03] yeah just use the cherry-pick button in gerrit to propose backporting it to the right branch [19:32:12] should also make the cache key a combination of config + something tied to Site (when the structure of them change) [19:32:29] Aha. And what's the right one for getting deployed in the evening SWAT today? [19:33:01] 1.26wmf21 and 1.26wmf22 at the moment [19:33:22] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1617945 (10jcrespo) p:5Normal>3High [19:34:14] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1617947 (10jcrespo) a:5jcrespo>3None [19:34:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [500.0] [19:34:35] PROBLEM - spamassassin on mendelevium is CRITICAL: Timeout while attempting connection [19:34:42] 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1617950 (10jcrespo) p:5Normal>3High [19:34:45] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [19:35:24] PROBLEM - Check size of conntrack table on mendelevium is CRITICAL: Timeout while attempting connection [19:35:44] PROBLEM - DPKG on mendelevium is CRITICAL: Timeout while attempting connection [19:36:04] PROBLEM - Disk space on mendelevium is CRITICAL: Timeout while attempting connection 
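The suggestion at 19:32:12 (make the cache key a combination of config plus something tied to the Site structure) could look roughly like the sketch below. It is written in Python for illustration only, not in the PHP that MediaWiki and Wikibase use, and it is not the code that was deployed. The names `sites_cache_key` and `SITE_STRUCTURE_VERSION` are hypothetical.

```python
import hashlib
import json

# Hypothetical constant: bump it whenever the structure of the cached
# Site objects changes, so stale entries are never read back.
SITE_STRUCTURE_VERSION = 1


def sites_cache_key(config: dict, structure_version: int = SITE_STRUCTURE_VERSION) -> str:
    """Build a cache key that changes when either the config or the
    structure version changes."""
    # sort_keys makes the serialisation independent of dict ordering,
    # so equal configs always hash to the same digest.
    payload = json.dumps(config, sort_keys=True)
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]
    return f"sites:v{structure_version}:{digest}"
```

With a key like this, a config change or a version bump simply misses the old entry, so invalidation would not depend on restarting the process that holds the cache or on waiting for a TTL to expire.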
[19:36:15] PROBLEM - HTTPS on mendelevium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [19:36:34] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: Connection timed out [19:36:36] Krenair: about 1.26wmf22 Gerrit says "Could not create a merge commit during the cherry pick" [19:36:46] PROBLEM - RAID on mendelevium is CRITICAL: Timeout while attempting connection [19:36:56] then you'll have to do it manually and resolve the conflict [19:37:14] PROBLEM - configured eth on mendelevium is CRITICAL: Timeout while attempting connection [19:37:24] aharoni, oh, wait [19:37:24] PROBLEM - dhclient process on mendelevium is CRITICAL: Timeout while attempting connection [19:37:39] aharoni, according to the "Included in" section on https://gerrit.wikimedia.org/r/#/c/236795/ it's already on 1.26wmf22 [19:37:44] PROBLEM - puppet last run on mendelevium is CRITICAL: Timeout while attempting connection [19:37:45] * aude wants to scap sometime, if there is a chance [19:37:52] good. 
[19:37:55] PROBLEM - salt-minion processes on mendelevium is CRITICAL: Timeout while attempting connection [19:37:57] gerrit failed me [19:37:58] aude: I'm done with the train [19:38:05] ok [19:38:45] (03CR) 10Gilles: "http://www.reactiongifs.com/wp-content/uploads/2013/06/paul-good-power.gif" [puppet] - 10https://gerrit.wikimedia.org/r/236847 (https://phabricator.wikimedia.org/T110926) (owner: 10Ori.livneh) [19:39:22] (03PS1) 10Ottomata: Don't use async producer for eventlogging server side forwarder [puppet] - 10https://gerrit.wikimedia.org/r/236869 [19:40:17] (03CR) 10Ottomata: [C: 032] Don't use async producer for eventlogging server side forwarder [puppet] - 10https://gerrit.wikimedia.org/r/236869 (owner: 10Ottomata) [19:42:44] !log aude@tin Started scap: Update group0 to new Wikidata branch [19:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:02] hopefully quick, since the only i18n differences is our stuff [19:43:15] on wmf22 [19:44:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:48:04] RECOVERY - salt-minion processes on mendelevium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:48:05] RECOVERY - Disk space on mendelevium is OK: DISK OK [19:48:35] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.007 sec. 
response time [19:48:45] RECOVERY - spamassassin on mendelevium is OK: PROCS OK: 3 processes with args spamd [19:48:55] RECOVERY - RAID on mendelevium is OK: OK: no RAID installed [19:49:15] RECOVERY - configured eth on mendelevium is OK: OK - interfaces up [19:49:26] RECOVERY - dhclient process on mendelevium is OK: PROCS OK: 0 processes with command name dhclient [19:49:26] RECOVERY - Check size of conntrack table on mendelevium is OK: OK: nf_conntrack is 0 % full [19:49:46] RECOVERY - DPKG on mendelevium is OK: All packages OK [19:53:01] aude, is https://phabricator.wikimedia.org/T111822 a dupe of https://phabricator.wikimedia.org/T43723 ? and is https://phabricator.wikimedia.org/T111852 a blocker of them? [19:56:55] when we resolve T111822 then i think T43723 is fixed [19:57:14] * aude wonders how to find what the ttl is? [19:57:55] (03PS7) 10Eevans: certificate/keystore generation script [puppet] - 10https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) [19:59:04] (03CR) 10Eevans: certificate/keystore generation script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/236389 (https://phabricator.wikimedia.org/T108953) (owner: 10Eevans) [20:01:19] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1618044 (10Amire80) Does anybody object to reopening this task and setting the post-rename cleanup tasks from the previous comme... 
[20:01:59] * aude will try populate sites [20:03:25] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:04:23] (03PS1) 10Ottomata: Make eventlogging server side processor consume from and produce to kafka [puppet] - 10https://gerrit.wikimedia.org/r/236921 (https://phabricator.wikimedia.org/T106260) [20:05:14] (03CR) 10jenkins-bot: [V: 04-1] Make eventlogging server side processor consume from and produce to kafka [puppet] - 10https://gerrit.wikimedia.org/r/236921 (https://phabricator.wikimedia.org/T106260) (owner: 10Ottomata) [20:06:04] Krenair: site_language doesn't currently mean what you think it means :( [20:06:10] (03PS2) 10Ottomata: Make eventlogging server side processor consume from and produce to kafka [puppet] - 10https://gerrit.wikimedia.org/r/236921 (https://phabricator.wikimedia.org/T106260) [20:06:24] ah [20:06:27] * aude makes another bug [20:06:39] it's basically the site id minus the 'wiki' :/ [20:07:04] and the site_id = database name, per sitematrix [20:07:09] which might not be what we want [20:07:12] !log aude@tin Finished scap: Update group0 to new Wikidata branch (duration: 24m 27s) [20:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:17] e.g. 
'simple' is a 'language' [20:08:50] ugh [20:09:51] (03CR) 10Ottomata: [C: 032] Make eventlogging server side processor consume from and produce to kafka [puppet] - 10https://gerrit.wikimedia.org/r/236921 (https://phabricator.wikimedia.org/T106260) (owner: 10Ottomata) [20:10:52] (03PS1) 10Papaul: Add MAC address for restbase-test200[1-3] Bug:T111697 [puppet] - 10https://gerrit.wikimedia.org/r/236925 [20:11:05] (03PS1) 10Ori.livneh: wikimedia/cdb 1.2.0 → 1.3.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236926 [20:11:17] yeah :( [20:11:29] (03CR) 10MaxSem: [C: 031] remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 (owner: 10BBlack) [20:11:38] (03CR) 10Ori.livneh: [C: 032] wikimedia/cdb 1.2.0 → 1.3.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236926 (owner: 10Ori.livneh) [20:11:44] (03Merged) 10jenkins-bot: wikimedia/cdb 1.2.0 → 1.3.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236926 (owner: 10Ori.livneh) [20:23:18] ok, cache timeout is an hour [20:31:05] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:37:15] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
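The exchange at 20:06 to 20:08 says that site_language is basically the site id minus 'wiki', and that the site id equals the database name. A minimal Python sketch of that derivation shows why 'simple' ends up as a "language". The function name `language_from_site_id` is hypothetical, and only the 'wiki' suffix is handled here, which is an assumption made to keep the example short.

```python
def language_from_site_id(site_id: str) -> str:
    """Strip a trailing 'wiki' from a site id (assumed to equal the
    database name) to get what the log calls site_language."""
    suffix = "wiki"
    if site_id.endswith(suffix):
        return site_id[: -len(suffix)]
    # Site ids without the suffix are returned unchanged.
    return site_id


for site_id in ("simplewiki", "be_x_oldwiki"):
    print(site_id, "->", language_from_site_id(site_id))
```

The result is tied to the database name rather than to a real language code, which is the mismatch being complained about: renaming the domain to be-tarask while keeping the database name would still yield `be_x_old` here.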
[20:37:47] !log ori@tin Synchronized php-1.26wmf21/vendor: wikimedia/cdb 1.2.0 → 1.3.0 (duration: 00m 14s) [20:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:03] !log ori@tin Synchronized php-1.26wmf22/vendor: wikimedia/cdb 1.2.0 → 1.3.0 (duration: 00m 15s) [20:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:16] !log ori@tin Synchronized multiversion: wikimedia/cdb 1.2.0 → 1.3.0 (duration: 00m 12s) [20:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:16] (03CR) 10BBlack: [C: 032] remove maps-lb.codfw.wm.o temporary addr [dns] - 10https://gerrit.wikimedia.org/r/236862 (owner: 10BBlack) [20:45:51] !log upgrading nginx to 1.9.4 on cp* [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:35] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail [20:50:02] bblack, hey. does test.wikipedia.org still bypass varnish? [20:50:43] I think not because restbase etc. works there [20:50:43] it doesn't get cached [20:50:46] ah [20:50:47] but it doesn't bypass varnish [20:50:59] i guess it depends on what you mean by "bypass" [20:51:03] it's just rigged to always be a cache miss. [20:51:12] the context I saw it in was "test.wikipedia.org is a special wiki that bypasses Varnish caching" [20:51:27] so I guess it's correct [20:51:37] just not the same 'bypass' as e.g. 
wikitech, which does not sit behind varnish [20:51:42] right [20:58:35] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [20:59:20] (03PS3) 10Andrew Bogott: nodepool: setup scripts are in integration/config [puppet] - 10https://gerrit.wikimedia.org/r/236769 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [20:59:50] (03PS2) 10Andrew Bogott: nodepool: run ready.sh when finishing instances [puppet] - 10https://gerrit.wikimedia.org/r/236776 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [21:01:15] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail [21:01:53] (03CR) 10Andrew Bogott: [C: 032] nodepool: setup scripts are in integration/config [puppet] - 10https://gerrit.wikimedia.org/r/236769 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [21:01:54] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail [21:02:00] (03CR) 10Andrew Bogott: [C: 032] nodepool: run ready.sh when finishing instances [puppet] - 10https://gerrit.wikimedia.org/r/236776 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [21:02:31] !log deployed patches for T108616 T91850 T91205 to wmf21 & 22 [21:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:45] (03PS1) 10Dzahn: phab: use mysql slave not master for scripts [puppet] - 10https://gerrit.wikimedia.org/r/236944 (https://phabricator.wikimedia.org/T111547) [21:06:48] (03PS1) 10Ottomata: Modify eventlogging graphite alerts so that they are based on kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/236947 (https://phabricator.wikimedia.org/T106254) [21:07:21] (03CR) 10Gergő Tisza: "Note that this patch was done by acking for ^\s*(Allow|Deny) . 
I did not look for ^\s*(allow|deny) (lowercase) so there are some more rule" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [21:07:37] (03CR) 10jenkins-bot: [V: 04-1] Modify eventlogging graphite alerts so that they are based on kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/236947 (https://phabricator.wikimedia.org/T106254) (owner: 10Ottomata) [21:08:41] (03PS2) 10Ottomata: Modify eventlogging graphite alerts so that they are based on kafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/236947 (https://phabricator.wikimedia.org/T106254) [21:08:45] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:09:13] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1618341 (10Krenair) Think it'd be better to create a wikitech page about it. [21:13:19] (03PS6) 10Andrew Bogott: Added openstack config files for version Kilo [puppet] - 10https://gerrit.wikimedia.org/r/235399 (https://phabricator.wikimedia.org/T110045) [21:13:21] (03PS1) 10Andrew Bogott: Switch to Openstack Kilo [puppet] - 10https://gerrit.wikimedia.org/r/236950 [21:13:23] (03PS1) 10Andrew Bogott: Switched labnet hosts to Openstack Kilo [puppet] - 10https://gerrit.wikimedia.org/r/236951 [21:13:25] (03PS1) 10Andrew Bogott: Switch labvirt1005 to openstack kilo [puppet] - 10https://gerrit.wikimedia.org/r/236952 [21:17:54] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:01] (03PS4) 10Andrew Bogott: Add some crappy but handy scripts for managing the grid during reboots. [puppet] - 10https://gerrit.wikimedia.org/r/232285 [21:19:22] (03CR) 10Andrew Bogott: [C: 032] Add some crappy but handy scripts for managing the grid during reboots. 
[puppet] - 10https://gerrit.wikimedia.org/r/232285 (owner: 10Andrew Bogott) [21:25:54] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:26:55] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:24] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:35:22] * Krinkle is staging a revert on tin [21:35:24] (03PS1) 10Ottomata: Separate zmq and kafka server side forwarder into different processes [puppet] - 10https://gerrit.wikimedia.org/r/236964 [21:35:28] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1618388 (10Krenair) >>! In T11823#1618341, @Krenair wrote: > Think it'd be better to create a wikitech page about it. https://w... [21:36:26] !log krinkle@tin Synchronized php-1.26wmf21/extensions/NavigationTiming: temporarily revert T109756 (duration: 00m 11s) [21:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:36:39] (03CR) 10Ottomata: [C: 032] Separate zmq and kafka server side forwarder into different processes [puppet] - 10https://gerrit.wikimedia.org/r/236964 (owner: 10Ottomata) [21:36:44] Krenair: adding be_x_oldwiki / be-tarask links works now [21:36:54] thanks [21:36:58] I just realised I missed something [21:37:05] oh? 
[21:37:08] wgSiteMatrixFile - the langlist file [21:37:13] maybe an hour cache is okish [21:37:21] or we could make it shorter [21:38:19] i'm not sure how langlist works in this context [21:38:22] (03PS1) 10Ottomata: Run server side zmq forwarder off of kafka [puppet] - 10https://gerrit.wikimedia.org/r/236965 [21:39:17] (03CR) 10jenkins-bot: [V: 04-1] Run server side zmq forwarder off of kafka [puppet] - 10https://gerrit.wikimedia.org/r/236965 (owner: 10Ottomata) [21:40:15] (03PS2) 10Ottomata: Run server side zmq forwarder off of kafka [puppet] - 10https://gerrit.wikimedia.org/r/236965 [21:40:24] (03PS1) 10Alex Monk: Also update langlist for be-x-old -> be-tarask rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 [21:40:35] Krenair: 'simple' is a language here [21:40:43] yeah [21:41:50] (03CR) 10Ottomata: [C: 032] Run server side zmq forwarder off of kafka [puppet] - 10https://gerrit.wikimedia.org/r/236965 (owner: 10Ottomata) [21:43:41] (03CR) 10Aude: "wonder if this makes be-tarask appear as a special wiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (owner: 10Alex Monk) [21:46:30] (03CR) 10Aude: "yep, this is what happens :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (owner: 10Alex Monk) [21:51:35] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/server-side-raw-zmq [21:55:55] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [21:55:55] aude, I applied it on mw1017 [21:56:04] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [21:56:24] it doesn't quite work [21:56:41] as expected :/ [21:56:55] language != language as we think [21:57:16] and that's where populate sites gets language :( [21:58:54] foreach ( $matrix->getLangList() as $lang ) { [21:58:54] if ( in_array( $lang, array( 'cz', 'dk', 'epo', 'jp', 'minnan', 'nan', 'nb', 'zh-cfr' ) ) ) { [21:58:54] continue; [21:58:55] urgh [21:59:00] I bet that isn't documented where it needs to be [21:59:39] ugh [22:04:36] (03PS1) 10GWicke: Remove old-style security stanza [puppet] - 10https://gerrit.wikimedia.org/r/236973 [22:06:19] https://upload.wikimedia.org/wikipedia/commons/b/b1/%D0%90.%D0%A1.%D0%A9%D0%B5%D1%80%D0%B1%D0%B0_2_%D1%81%D0%B5%D0%BD%D1%82%D1%8F%D0%B1%D1%80%D1%8F_2011_%D0%B3%D0%BE%D0%B4%D0%B0.jpg [22:06:24] From https://commons.wikimedia.org/wiki/File:%D0%90.%D0%A1.%D0%A9%D0%B5%D1%80%D0%B1%D0%B0_2_%D1%81%D0%B5%D0%BD%D1%82%D1%8F%D0%B1%D1%80%D1%8F_2011_%D0%B3%D0%BE%D0%B4%D0%B0.jpg [22:06:32] Very strange business [22:10:15] if any opsens are around, https://gerrit.wikimedia.org/r/236973 would let me continue RESTBase testing in staging [22:13:13] !log krinkle@tin Synchronized php-1.26wmf21/extensions/NavigationTiming: re-apply patch 1/2 (jscs) 
(duration: 00m 12s) [22:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:09] mutante: you around? [22:16:12] gwicke: are you saying that the code you removed was previously ineffective, because RESTBase did not actually consult the values in that stanza? [22:16:24] ori: correct [22:17:00] if no one from ops is around and if you can get someone from your team to +1 it then i'll be happy to merge it [22:17:05] but let's wait a minute for mutante to respond [22:17:43] bd808 ran into the same issue in vagrant [22:19:45] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/236973 (owner: 10GWicke) [22:19:54] aude, bah, I thought I had an idea [22:20:00] but then the 'wiki' suffix ruined everything [22:20:39] :( [22:21:20] > Error: 2013 Lost connection to MySQL server during query (10.64.16.8) [22:21:31] An error on Commons, may or may not be related... [22:21:48] that's really common [22:22:23] (03CR) 10Ori.livneh: [C: 032] "Merging because: No one from ops is around; the change does not actually relax security, since the stanza was previously ignored altogethe" [puppet] - 10https://gerrit.wikimedia.org/r/236973 (owner: 10GWicke) [22:24:02] Krenair: Well, something's fishy on Commons for sure [22:24:33] gwicke: please ack that the change applied correctly [22:25:02] (03PS2) 10Papaul: Add MAC address for restbase-test200[1-3] Add empty line before the enty Bug:T111697 [puppet] - 10https://gerrit.wikimedia.org/r/236925 (https://phabricator.wikimedia.org/T111697) [22:25:05] There's also some memcached errors...and lots of "fetching thumbnail failed" which seems to me like a symptom, not a cause [22:26:57] ori: looks fine on xenon [22:27:14] ori: thanks for the review! 
[22:27:18] np [22:29:52] (03PS3) 10Dzahn: Add MAC address for restbase-test200[1-3] Add empty line before the enty Bug:T111697 [puppet] - 10https://gerrit.wikimedia.org/r/236925 (https://phabricator.wikimedia.org/T111697) (owner: 10Papaul) [22:30:16] (03PS4) 10Dzahn: Add MAC address for restbase-test200[1-3] Add empty line before the entry, link to bug. [puppet] - 10https://gerrit.wikimedia.org/r/236925 (https://phabricator.wikimedia.org/T111697) (owner: 10Papaul) [22:30:56] (03CR) 10Dzahn: [C: 032] Add MAC address for restbase-test200[1-3] Add empty line before the entry, link to bug. [puppet] - 10https://gerrit.wikimedia.org/r/236925 (https://phabricator.wikimedia.org/T111697) (owner: 10Papaul) [22:31:03] (03PS1) 10Robmoen: Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) [22:31:12] (03CR) 10jenkins-bot: [V: 04-1] Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [22:33:18] (03PS2) 10Robmoen: Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) [22:34:34] (03PS1) 10Ottomata: Don't prepend seq on zmq forwarder, as the kafka one already has done so. [puppet] - 10https://gerrit.wikimedia.org/r/236980 [22:35:24] (03CR) 10Ottomata: [C: 032 V: 032] Don't prepend seq on zmq forwarder, as the kafka one already has done so. 
[puppet] - 10https://gerrit.wikimedia.org/r/236980 (owner: 10Ottomata) [22:36:05] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [22:36:15] (03CR) 10Jdlrobson: [C: 031] Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [22:36:24] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [22:36:52] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1618554 (10Dzahn) To be able to install operating systems, such as the restbase-test hosts papaul just provisioned, he would also need access to the mgmt interfaces to see the cons... [22:42:57] aude, okay, I have another idea [22:46:57] uh, nope. [22:47:30] damn. [22:55:47] 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1618595 (10Papaul) Base on what akosiaris said "That is only expected to get worse as more of codfw is being brought online" is it possible for us to think about putting another puppetma... [22:57:28] (03PS1) 10Ottomata: Use hiera to lookup eventlogging_host in varnishncsa eventlistener [puppet] - 10https://gerrit.wikimedia.org/r/236983 [22:57:31] aude, so I think the problem is that we've overridden $lang for this wiki just to make it have an abnormal database name [22:59:00] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1618599 (10JMinor) 3NEW [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150908T2300). [23:00:04] ebernhardson mooeypoo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:08] \o [23:00:13] I'll take it today [23:00:16] wait, it's already 4pm? [23:00:26] greg-g: its always 4pm somewhere :P [23:01:01] (03CR) 10Ottomata: [C: 032] Use hiera to lookup eventlogging_host in varnishncsa eventlistener [puppet] - 10https://gerrit.wikimedia.org/r/236983 (owner: 10Ottomata) [23:01:45] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [23:02:31] Krenair: Thoughts about https://phabricator.wikimedia.org/T100313 and https://gerrit.wikimedia.org/r/#/c/214893/ ? I don't know Amir1's IRC nick so I'm not quite sure whether I'm comfortable deploying that one [23:03:04] RECOVERY - Host mw2031 is UP: PING WARNING - Packet loss = 64%, RTA = 34.90 ms [23:03:05] PROBLEM - HHVM rendering on mw2031 is CRITICAL: Connection refused [23:03:14] RoanKattouw, do not want. [23:03:20] OK, will decline [23:03:28] Having said that I'm the only person to object so far. [23:04:02] he's aharoni on iRC [23:04:10] No, different Amir [23:04:10] doesn't seem to have shown up anyway [23:04:24] Amir Aharoni != Amir Ladsgroup [23:04:30] Oh I see [23:05:14] RECOVERY - HHVM rendering on mw2031 is OK: HTTP OK: HTTP/1.1 200 OK - 65739 bytes in 4.498 second response time [23:05:17] "My nickname in IRC is Amir1." according to https://meta.wikimedia.org/wiki/User:Ladsgroup [23:05:27] online, but not here [23:05:33] Oh and he is on IRC [23:05:36] But not in this channel [23:05:38] OK well [23:05:47] You say you don't want it, so I'll decline it [23:06:00] lol [23:06:03] Krenair: But could you -1 the Gerrit patch with the reason for why you don't want it? 
[23:06:08] I already did [23:06:16] OK thanks [23:06:24] But it was on an earlier patch set [23:06:39] They rebased it a few times and poof, no more -1 [23:06:45] Oh, I see [23:07:27] (03PS1) 10Ottomata: New Trusty eventlogging host in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236985 [23:10:18] RoanKattouw: you deploying stuff? Krenair is helping me understand how to get that ^ change deployed in beta [23:10:29] I'm doing the SWAT deploy yes [23:10:48] (i am out of touch with all things mediawiki) [23:10:48] Oh it's a labs-only change [23:10:49] yes [23:11:01] just a new host, to make it more like production [23:11:03] That can just be +2ed and then no-op-synced to prod [23:11:10] (on merge it's automatically deployed in labs) [23:11:17] oh, ja? [23:11:21] on merge? [23:11:23] I'm already doing a bunch of deploys anyway so I'll just do this one right now [23:11:26] ok, thanks! [23:11:31] (03CR) 10Catrope: [C: 032] New Trusty eventlogging host in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236985 (owner: 10Ottomata) [23:11:36] I think we should write down what to do with labs-only changes at some point [23:11:37] (03Merged) 10jenkins-bot: New Trusty eventlogging host in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236985 (owner: 10Ottomata) [23:11:45] You shouldn't really need to sync them in prod, should you? [23:11:46] The only real reason it has to be synced to prod is so the bots don't yell at us for not having synced a merged change [23:11:58] the bots can detect that? [23:11:58] ah, aye [23:12:13] I've seen them yelling about unmerged changes [23:12:23] Ahm, not sure [23:12:27] I think that's why it was [23:12:34] But yeah we should come up with a better way of doing this [23:13:06] Ahm, who was renaming be-x-old to be-tarask today? 
[23:13:13] me [23:13:28] I left something on tin I think [23:13:32] You left that as an uncommitted local change on tin :( [23:13:34] RoanKattouw, gone [23:13:38] Thanks [23:14:08] !log catrope@tin Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 00m 11s) [23:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:56] gwicke: "Node.js v4.0.0 contains V8 v4.5, the same version of V8 shipping with the Chrome web browser today." -- nice [23:20:23] !log catrope@tin Synchronized php-1.26wmf22/extensions/Echo/: SWAT (duration: 00m 14s) [23:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:49] !log catrope@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/: SWAT (duration: 00m 11s) [23:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:05] RoanKattouw: got room for one more ? [23:24:37] ori: we just have to wait until the sqlite module is updated [23:24:39] rmoen: Sure [23:24:43] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/236979/ [23:24:56] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [23:25:58] rmoen: Can you add it to the wiki page? [23:26:05] RoanKattouw: aye ty [23:26:48] (03CR) 10Catrope: [C: 04-1] Enable QuickSurveys by default on labs with example survey (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [23:26:53] rmoen: Could you reformat it a bit per inline comments? 
[23:27:03] yes [23:27:40] Hrmph one of the Echo changes snuck in a message, I'll need to scap later [23:30:44] (03CR) 10Catrope: [C: 032] Enable experiment with experimental completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235391 (https://phabricator.wikimedia.org/T111078) (owner: 10EBernhardson) [23:30:47] (03CR) 10Catrope: [C: 032] Enable CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (https://phabricator.wikimedia.org/T76497) (owner: 10EBernhardson) [23:30:52] (03PS3) 10Robmoen: Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) [23:31:14] (03Merged) 10jenkins-bot: Enable experiment with experimental completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235391 (https://phabricator.wikimedia.org/T111078) (owner: 10EBernhardson) [23:31:35] (03Merged) 10jenkins-bot: Enable CirrusSearch per-user rate limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235274 (https://phabricator.wikimedia.org/T76497) (owner: 10EBernhardson) [23:31:58] RoanKattouw: updated [23:33:48] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 12s) [23:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:07] !log catrope@tin Synchronized wmf-config/CommonSettings.php: SWAT (duration: 00m 12s) [23:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:26] ebernhardson: Yours is done, please verify [23:34:46] (03CR) 10Catrope: [C: 032] Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [23:34:47] RoanKattouw: thanks, checking [23:35:00] RoanKattouw: (all 3?) 
[23:35:04] Yup [23:35:12] (03Merged) 10jenkins-bot: Enable QuickSurveys by default on labs with example survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen) [23:36:02] !log catrope@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT (duration: 00m 13s) [23:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:22] RoanKattouw: everything looking good, thanks [23:36:23] !log catrope@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT (duration: 00m 10s) [23:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:56] !log catrope@tin Started scap: Need to update i18n for a new Echo message [23:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:04] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15150 bytes in 0.020 second response time [23:37:24] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [23:37:36] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [23:37:44] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15150 bytes in 0.009 second response time [23:39:51] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review, 7user-notice: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1618734 (10Krenair) Another thing: T111876 [23:40:10] (03CR) 10Alex Monk: "Made T111876" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (owner: 10Alex Monk) [23:43:37] (03PS1) 10BryanDavis: logging: Configure monolog to output stack traces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236994 (https://phabricator.wikimedia.org/T89169) [23:46:20] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, 3Reading-Web: On mobile, the Flow notification's link takes you 
to the desktop version of the Flow page, even though the main (background) link takes you to the mobile one (main) - https://phabricator.wikimedia.org/T107108#1618772 (1...