[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T0000).
[00:01:25] (PS2) Dzahn: admins: add new ssh keys for dzahn [puppet] - https://gerrit.wikimedia.org/r/372210
[00:04:37] (PS3) Dzahn: admins: Make daniel a deployer [puppet] - https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: Reedy)
[00:04:47] (CR) Dzahn: [C: 1] admins: Make daniel a deployer [puppet] - https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: Reedy)
[00:05:14] (CR) Dzahn: [C: 1] admins: add new ssh keys for dzahn [puppet] - https://gerrit.wikimedia.org/r/372210 (owner: Dzahn)
[00:06:04] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3529644 (Dzahn) p:Triage>Normal
[00:07:50] Operations, Ops-Access-Requests, Research, Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3529646 (Dzahn) stalled>Open
[00:08:35] Operations, Ops-Access-Requests, Research, Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 (Dzahn) @DarTar Robh was out today, i think we can get this done tomorrow though! thanks for your patience.
[00:08:52] on Commons, in many files uploaded with UW, the language tag in the description is broken
[00:09:19] https://commons.wikimedia.org/w/index.php?title=File:Defining_Conflict_and_Harassment_on_Wikipedia.pdf&oldid=255563866
[00:09:41] also report on the French VP on Commons
[00:16:35] (PS1) Dzahn: admins: add additional admin addshore to contint-admins [puppet] - https://gerrit.wikimedia.org/r/372211 (https://phabricator.wikimedia.org/T173233)
[00:25:14] ACKNOWLEDGEMENT - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T171028
[00:25:14] ACKNOWLEDGEMENT - Confd template for /etc/varnish/directors.frontend.vcl on cp1008 is CRITICAL: File not found: /etc/varnish/directors.frontend.vcl daniel_zahn https://phabricator.wikimedia.org/T171028
[00:25:14] ACKNOWLEDGEMENT - salt-minion processes on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn https://phabricator.wikimedia.org/T171028
[00:25:14] ACKNOWLEDGEMENT - traffic-pool service on cp1008 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive daniel_zahn https://phabricator.wikimedia.org/T171028
[00:28:24] Operations, Performance-Team, Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3529660 (Dzahn) Resolved>Open It's back but with "`Difference in size anonymous`" vs. "Difference in size authenticated" this time. `webpagetest-al...
[00:30:04] Operations, Electron-PDFs, OfflineContentGenerator, Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3529665 (mobrovac) Coming directly from the Chrome devs, this is definitely promissing, but the cavea...
[00:30:47] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (Dzahn) The host is down again (since 1d 8h 34m 8s) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=labvirt1015
[00:31:15] ACKNOWLEDGEMENT - Host labvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T171473
[00:35:01] (PS1) Urbanecm: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471)
[00:35:56] (PS1) Legoktm: [WIP] Add libraryupgrader puppet module [puppet] - https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478)
[00:39:00] (PS2) Legoktm: [WIP] Add libraryupgrader puppet module [puppet] - https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478)
[00:39:17] (PS3) Legoktm: Add libraryupgrader puppet module [puppet] - https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478)
[00:45:24] !log T169939: Decommissioning Cassandra/restbase2001-b.codfw.wmnet
[00:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:38] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[00:47:54] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:04:49] ACKNOWLEDGEMENT - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans Decommissioned (T169939)
[02:27:51] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 08m 52s)
[02:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:56] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.13) (duration: 08m 33s)
[02:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:54] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 54 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[03:03:56] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 03m 58s)
[03:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 17 03:11:13 UTC 2017 (duration 7m 18s)
[03:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:54] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[03:30:17] (PS17) Mobrovac: Increase max kafka message size [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[03:35:41] (CR) Mobrovac: [C: 1] "This should be what we want. Currently it's hack-ish because we copy/paste the value between the scb and kafka main roles, but until this " [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[03:40:02] (CR) Mobrovac: "Nice :) Let's sync up and get this in ASAP" [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[03:41:14] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 901.83 seconds
[03:52:27] (PS4) Mobrovac: Cassandra: Do not include the main DNS in the list of seeds [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610)
[03:53:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 56 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[03:58:55] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[04:04:39] (PS5) Mobrovac: Cassandra: Do not include the main DNS in the list of seeds [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610)
[04:07:08] (CR) Mobrovac: Cassandra: Do not include the main DNS in the list of seeds (3 comments) [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: Mobrovac)
[04:10:43] (PS6) Mobrovac: Cassandra: Do not include the main DNS in the list of seeds [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610)
[04:23:46] (CR) Mobrovac: "PCC - http://puppet-compiler.wmflabs.org/7489/" [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: Mobrovac)
[04:30:35] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 199.44 seconds
[04:34:03] (CR) Dzahn: "the general idea is good but can you do it without a new parameter? we already have "active_server" and usually we use that to do things l" [puppet] - https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: Paladox)
[04:37:08] (CR) Dzahn: [C: -1] "please amend commit message. what is this about?" [software/gerrit] - https://gerrit.wikimedia.org/r/363738 (owner: Paladox)
[04:39:02] (CR) Dzahn: [C: 1] "let's get this done together. needs scheduling" [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[04:39:45] (CR) Dzahn: [C: 1] "guess so..." [puppet] - https://gerrit.wikimedia.org/r/368547 (owner: Paladox)
[04:41:57] (CR) Dzahn: "well.. i can confirm that we have similar issues with the LDAP login on Icinga where upper and lower case is allowed but the app behind ld" [puppet] - https://gerrit.wikimedia.org/r/368196 (owner: Paladox)
[04:51:24] (CR) Dzahn: "@Platonides, do you like it?" [puppet] - https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: Paladox)
[04:52:01] (CR) Dzahn: "let's test it together in a fresh labs VPS some time" [puppet] - https://gerrit.wikimedia.org/r/362455 (owner: Paladox)
[04:54:54] (CR) Krinkle: "@Gilles Should task T171468 be re-opened? Or is this revert for a different reason?" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[04:56:29] (CR) Krinkle: [C: 1] Enable Thumbor webp original support [puppet] - https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939) (owner: Gilles)
[04:58:01] (CR) Dzahn: [C: -2] "please update commit message. what is this for? still current?" [labs/private] - https://gerrit.wikimedia.org/r/363847 (owner: Paladox)
[04:58:19] (CR) Krinkle: [C: 1] "Can this be tested in Beta by simply applying the patch to the puppet master there and then trying it out on a file?" [puppet] - https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939) (owner: Gilles)
[05:06:13] (PS1) KartikMistry: apertium-bel: Initial Debian packaging [debs/contenttranslation/apertium-bel] - https://gerrit.wikimedia.org/r/372227 (https://phabricator.wikimedia.org/T172381)
[05:37:00] (CR) Gilles: "No, we'll have to see if the timeouts reoccur. I think the change only accidentally fixed the symptom by essentially tossing out a bunch o" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[05:41:06] (CR) Gilles: "The effect of the connection cap is so bad it can be experienced first-hand on beta's new files page, where only a handful of thumbnails g" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[05:43:33] (CR) Gilles: "The same config is currently manually applied to Beta. Thumbnail renders correctly: https://commons.wikimedia.beta.wmflabs.org/wiki/File:R" [puppet] - https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939) (owner: Gilles)
[05:48:03] !log Stop replication in sync on db1078 and db1015 - T164488
[05:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:17] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488
[05:55:17] !log Stop replication on db2077 to fix duplicate entries - T151029
[05:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:29] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029
[05:59:20] (PS1) KartikMistry: apertium-rus: Initial Debian packaging [debs/contenttranslation/apertium-rus] - https://gerrit.wikimedia.org/r/372230 (https://phabricator.wikimedia.org/T172381)
[06:14:05] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89809.78 seconds
[06:24:43] !log Stop slave on db2047 to fix duplicate keys - T151029
[06:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:56] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029
[06:38:07] (PS1) Gilles: Expose Thumbor Nginx metrics in Prometheus format [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554)
[06:39:57] (PS2) Gilles: Expose Thumbor Nginx metrics in Prometheus format [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554)
[07:19:50] (PS1) Elukey: geowiki::job::monitoring: point cron output to analytics [puppet] - https://gerrit.wikimedia.org/r/372324
[07:21:43] (CR) Elukey: [C: 2] geowiki::job::monitoring: point cron output to analytics [puppet] - https://gerrit.wikimedia.org/r/372324 (owner: Elukey)
[07:51:44] (PS1) Filippo Giunchedi: hieradata: use ms-fe.svc as auth_url for pagecompilation [puppet] - https://gerrit.wikimedia.org/r/372335
[07:53:32] (CR) Filippo Giunchedi: [C: 2] hieradata: use ms-fe.svc as auth_url for pagecompilation [puppet] - https://gerrit.wikimedia.org/r/372335 (owner: Filippo Giunchedi)
[07:56:52] Operations, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#958305 (Nemo_bis) Re https://lists.wikimedia.org/pipermail/wikimedia-l/2017-August/088350.html , I think adding Mailman 3 to translatewiki.net would...
[08:03:30] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3530058 (MoritzMuehlenhoff) @Gilles That's my understanding, yes. I have no idea about the origin of that reference SVG, though (or...
[08:05:45] !log reboot ms-be2021, unreachable on network
[08:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:24] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 36.36 ms
[08:17:07] Operations, Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#3530074 (elukey) rdb1005 is a JobQueue Redis master (1006 is its local dc slave) so it would be painful to put it out of production to reimage it (mw-config change to move its traffic away, Redis back...
[08:28:39] (PS2) Filippo Giunchedi: Enable Thumbor webp original support [puppet] - https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939) (owner: Gilles)
[08:29:20] (CR) Filippo Giunchedi: [C: 2] Enable Thumbor webp original support [puppet] - https://gerrit.wikimedia.org/r/372158 (https://phabricator.wikimedia.org/T172939) (owner: Gilles)
[08:30:58] !log restart varnish on cp1062 to fix mailbox lag
[08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:04] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0
[08:36:49] !log roll-restart thumbor to apply webp support change
[08:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:08] (PS1) KartikMistry: apertium-bel-rus: Initial Debian packaging [debs/contenttranslation/apertium-bel-rus] - https://gerrit.wikimedia.org/r/372341 (https://phabricator.wikimedia.org/T172381)
[08:46:26] (PS1) Muehlenhoff: Add wrapper for firejailed lilypond [puppet] - https://gerrit.wikimedia.org/r/372342
[08:52:53] Operations, monitoring, netops: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3530143 (elukey)
[08:55:38] Operations, monitoring, netops: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3530159 (elukey)
[09:02:12] (PS3) Elukey: role::cache::kafka::webrequest: tune graphite alarms [puppet] - https://gerrit.wikimedia.org/r/372155 (https://phabricator.wikimedia.org/T172681)
[09:04:35] Operations, Puppet, LDAP: Should puppet auto-restart slapd? - https://phabricator.wikimedia.org/T171191#3530168 (akosiaris) I am not against puppet restarting slapd. In fact I am for it, albeit we should do it carefully in order to not cause automated outages. That is T161145 as mentioned above by @f...
[09:11:03] Operations, Puppet, Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#3530180 (akosiaris) p:High>Low I am lowering priority on this one. The one interesting thing left to do is to create a parser for the /metrics endpoint @elukey pointed out abov...
[09:12:04] Operations, Wikimedia-Site-requests, User-MarcoAurelio, Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3530183 (MarcoAurelio) `translationNotificationJob: 0 queued; 949 claimed (0 active, 949 abandoned); 0 delayed` is probably due to T1...
[09:13:42] (CR) Elukey: [C: 2] role::cache::kafka::webrequest: tune graphite alarms [puppet] - https://gerrit.wikimedia.org/r/372155 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[09:17:44] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:14] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:14] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:24] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:44] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:25] (CR) Filippo Giunchedi: "The original problem that motivated the change is that when a given thumbor instance is processing a request it would still accept new con" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[09:20:44] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:54] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:59] argh this is probably me
[09:21:00] sigh
[09:21:41] "single quotes will be stripped from graphite metric"
[09:21:43] yep
[09:21:45] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:22:13] stopped ircecho
[09:23:59] (CR) Gehel: [C: 1] Gerrit: Enable logstash by default for prod gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[09:24:36] (PS1) Elukey: role::cache::kafka:webrequest: fix graphite metric to alarm on [puppet] - https://gerrit.wikimedia.org/r/372348 (https://phabricator.wikimedia.org/T172681)
[09:25:30] (CR) Gehel: [C: -1] "I meant to -1, not +1... the logstash endpoint should use LVS." [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[09:26:54] (CR) Elukey: [C: 2] role::cache::kafka:webrequest: fix graphite metric to alarm on [puppet] - https://gerrit.wikimedia.org/r/372348 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[09:29:50] running puppet --failed-only across cp*
[09:33:38] (CR) Filippo Giunchedi: [C: 1] "LGTM, to be merged when the time comes" [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[09:36:04] (CR) Muehlenhoff: [C: 2] Add wrapper for firejailed lilypond [puppet] - https://gerrit.wikimedia.org/r/372342 (owner: Muehlenhoff)
[09:36:11] (PS2) Muehlenhoff: Add wrapper for firejailed lilypond [puppet] - https://gerrit.wikimedia.org/r/372342
[09:45:10] (PS4) Filippo Giunchedi: Allow mwdeploy user to restart jobchron [puppet] - https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) (owner: Thcipriani)
[09:45:25] (CR) Filippo Giunchedi: [C: 1] "PCC https://puppet-compiler.wmflabs.org/compiler02/7492/mw1161.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) (owner: Thcipriani)
[09:47:04] (CR) Filippo Giunchedi: [C: 2] Allow mwdeploy user to restart jobchron [puppet] - https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) (owner: Thcipriani)
[09:48:52] (CR) Gilles: "I understand, but the effect of this change is far worse than what it set out to fix. Nginx is issuing a ton of 502s that were handled fin" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[09:55:06] what is the deal with db2049? ferm check flapped for a second
[09:57:19] Only working with db2047 from my side
[09:58:18] (CR) Faidon Liambotis: "non-free is activated by default in prod -- probably not in Labs, that's a divergence (bug?) of Labs." [puppet] - https://gerrit.wikimedia.org/r/364753 (owner: Faidon Liambotis)
[09:59:42] (PS1) Elukey: role::cache::kafka::webrequest: fix alert varnishkafka metrics [puppet] - https://gerrit.wikimedia.org/r/372354 (https://phabricator.wikimedia.org/T172681)
[10:00:19] (CR) Elukey: [C: 2] role::cache::kafka::webrequest: fix alert varnishkafka metrics [puppet] - https://gerrit.wikimedia.org/r/372354 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[10:01:36] sorry for the mess in icinga, should be fixed soon
[10:01:47] my attempt to change the varnishkafka alert was a failure
[10:01:58] (CR) Muehlenhoff: Run Lilypond from Firejail (2 comments) [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[10:02:56] jynus: probably a generic NRPE failure of some sort, no current changes to ferm or the underlying config
[10:03:04] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3521683 (ArielGlenn) Who is the manager-equivalent for this so that we can get the typical sign-off?
[10:05:51] Operations, Analytics: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3530259 (elukey)
[10:06:18] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3505470 (elukey) Created https://phabricator.wikimedia.org/T173492 for the varnishkafka alarms
[10:09:04] (CR) Filippo Giunchedi: [C: 1] "Worth giving this a try, we'll need also to check status codes that varnish gets from ms-fe and how timeouts are tracked/reported by varni" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[10:09:19] (PS2) Filippo Giunchedi: Revert "thumbor: fix connections-per-backend in nginx" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[10:09:40] Operations, Analytics: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493#3530279 (elukey)
[10:09:55] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3505470 (elukey) And finally https://phabricator.wikimedia.org/T173493 to tune alarms.
[10:10:20] (CR) Filippo Giunchedi: [C: 2] Revert "thumbor: fix connections-per-backend in nginx" [puppet] - https://gerrit.wikimedia.org/r/372199 (owner: Gilles)
[10:13:09] !log roll-restart nginx on thumbor to apply https://gerrit.wikimedia.org/r/372199
[10:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:15] !log Load testing Thumbor
[10:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:37] Operations, Performance-Team, Thumbor, User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3530331 (Gilles) Running the stress test again, requesting about 2000 uncached thumbnails of the same image with a concurrency of 200 re...
[10:29:31] !log Thumbor stress test finished
[10:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:07] (CR) Elukey: [C: -1] "Two hiera calls in modules afaics, it would be better to have them only in roles/profiles as https://wikitech.wikimedia.org/wiki/Puppet_co" [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:44:57] (CR) Elukey: "I am going to upload a new version in a bit and then we'll discuss what is best" [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:46:34] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89964.33 seconds
[10:51:19] Operations, Performance-Team, Thumbor, User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3530350 (fgiunchedi) >>! In T172930#3530331, @Gilles wrote: > > Which is a lot better than before where 502s were the most common respo...
[11:01:14] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 24 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:06:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:12:03] Krinkle: hi, you around?
[11:13:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 43 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:13:31] (PS1) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[11:13:54] (CR) jerkins-bot: [V: -1] WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[11:18:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:19:11] (PS2) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[11:36:16] (CR) Filippo Giunchedi: "The change itself works, still missing is extracting a list of addresses from each instance listen_address, likely via another class prome" [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[11:38:03] akosiaris: https://gerrit.wikimedia.org/r/#/c/372229/ :)
[12:02:14] (PS18) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[12:04:12] (PS19) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[12:07:54] !log T169939: Decommissioning Cassandra/restbase2001-b.codfw.wmnet
[12:07:57] grrr
[12:07:59] !log T169939: Decommissioning Cassandra/restbase2001-c.codfw.wmnet
[12:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:08] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[12:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:45] (PS20) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[12:34:31] (CR) Elukey: "I am currently reading http://kafka.apache.org/090/documentation.html and some things popped up:" [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[12:38:10] !log installing libgd/libsoup security updates
[12:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:35] (PS13) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[12:49:17] (PS14) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[12:49:52] (PS1) Marostegui: dbstore3.my.cnf: Reduce pool size [puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409)
[12:54:02] (CR) Alexandros Kosiaris: [C: -1] "Aside from the technical notes inline, there's something that 'd like to point. This patch makes it possible for a host to not be in our i" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T1300).
[13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:32] I'm here
[13:00:37] o/
[13:00:45] I can SWAT today!
[13:00:52] Great
[13:01:04] (CR) Elukey: "Probably we can't set per topic settings that are bigger than the broker max config, so what I wrote is wrong." [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[13:01:34] Urbanecm: the first patch is not testable, right?
[13:01:39] and the second one?
[13:01:50] it's testable only after the script runs?
[13:01:57] (so no need to deploy to mwdebug?)
[13:02:14] zeljkof, yes, you're right
[13:02:39] Urbanecm: ok, so I ping you after I have deployed everything and the script runs?
[13:02:46] Ack
[13:05:09] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372189 (https://phabricator.wikimedia.org/T173444) (owner: Urbanecm)
[13:06:33] (Merged) jenkins-bot: Add one throttling exception [mediawiki-config] - https://gerrit.wikimedia.org/r/372189 (https://phabricator.wikimedia.org/T173444) (owner: Urbanecm)
[13:06:44] (CR) jenkins-bot: Add one throttling exception [mediawiki-config] - https://gerrit.wikimedia.org/r/372189 (https://phabricator.wikimedia.org/T173444) (owner: Urbanecm)
[13:07:42] (PS15) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[13:08:46] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:372189|Add one throttling exception (T173444)]] (duration: 00m 51s)
[13:08:50] Urbanecm: 372189 deployed
[13:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:58] T173444: Lift IP rate limit - Editathon (WMCL) - 2017-09-29 - https://phabricator.wikimedia.org/T173444
[13:09:11] (PS4) Ebe123: Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582)
[13:09:19] Ack
[13:11:25] (CR) Ebe123: Run Lilypond from Firejail (2 comments) [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[13:12:02] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974) (owner: Urbanecm)
[13:13:28] (Merged) jenkins-bot: Change $wgArticleCountMethod to any for srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974) (owner: Urbanecm)
[13:13:40] (CR) jenkins-bot: Change $wgArticleCountMethod to any for srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974) (owner: Urbanecm)
[13:15:18] (PS5) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582)
[13:15:27] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372195|Change $wgArticleCountMethod to any for srwikiquote (T172974)]] (duration: 00m 51s)
[13:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:38] T172974: Set $wgArticleCountMethod to 'any' for srwikiquote and srwikisource - https://phabricator.wikimedia.org/T172974
[13:15:44] Urbanecm: 372195 deployed, running script...
[13:15:51] zeljkof, just noticed one other patch is needed, is it ok?
[13:16:05] Urbanecm: sure, there is plenty of time
[13:16:42] Added.
[13:17:23] Urbanecm: script finished
[13:17:28] Great
[13:17:39] Seems it is working
[13:17:55] (CR) Zfilipin: "zfilipin@terbium:~$ mwscript updateArticleCount.php --wiki=srwikiquote --update" [mediawiki-config] - https://gerrit.wikimedia.org/r/372195 (https://phabricator.wikimedia.org/T172974) (owner: Urbanecm)
[13:20:34] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/371029 (https://phabricator.wikimedia.org/T160491) (owner: Urbanecm)
[13:21:58] (Merged) jenkins-bot: Update wikiversity favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/371029 (https://phabricator.wikimedia.org/T160491) (owner: Urbanecm)
[13:22:09] (CR) jenkins-bot: Update wikiversity favicon [mediawiki-config] - https://gerrit.wikimedia.org/r/371029 (https://phabricator.wikimedia.org/T160491) (owner: Urbanecm)
[13:24:58] !log zfilipin@tin Synchronized static/favicon/wikiversity.ico: SWAT: [[gerrit:371029|Update wikiversity favicon (T160491)]] (duration: 00m 50s)
[13:25:02] Urbanecm: 371029 is deployed, please check
[13:25:08] ok
[13:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:10] T160491: Update Wikiversity logos - https://phabricator.wikimedia.org/T160491
[13:25:24] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 30 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:25:28] Urbanecm: do I need to purge cache?
[13:26:15] Try to please
[13:26:31] sure
[13:27:47] Urbanecm: done
[13:27:58] the image has changed slightly, so I guess it worked
[13:28:11] (CR) Zfilipin: "zfilipin@terbium:~$ echo "https://en.wikipedia.org/static/favicon/wikiversity.ico" | mwscript purgeList.php --wiki=enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/371029 (https://phabricator.wikimedia.org/T160491) (owner: Urbanecm)
[13:28:13] yep
[13:28:25] (PS16) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[13:28:36] Urbanecm: all done?
[13:29:05] If you aren't a steward/meta admin...
[13:29:30] Urbanecm: I don't think so
[13:29:36] that probably means no :)
[13:29:47] I mean, I don't know, and I guess that means I am not
[13:29:48] need meta admin for anything?
[13:30:08] Danny_B, yep, logo update for T160491 .
[13:30:22] Urbanecm: can I close the swat window? or is there something else to do?
[13:30:24] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:30:31] zeljkof, yes, you can. Thank you!
[13:30:36] !log EU SWAT finished
[13:30:47] Urbanecm: thanks for deploying with #releng! ;D
[13:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:47] Urbanecm: what exactly you need me to do?
[13:32:13] the task you linked is about wikiversity actually
[13:32:24] Danny_B, yes. But logos are on meta
[13:32:54] I need to override https://meta.wikimedia.org/wiki/File:Wikiversity-logo_1.5x.png and https://meta.wikimedia.org/wiki/File:Wikiversity-logo_2x.png from https://beta.wikiversity.org/wiki/File:Wikiversity_logo_2017.svg (same sizes)
[13:41:02] zeljkof: if you could run a script for me I'd appreciate it
[13:55:11] TabbyCat: sorry, afk for a few minutes, which script?
[13:55:38] zeljkof: restart a stuck rename
[13:55:50] see if it unblocks it
[13:55:58] (PS17) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[13:56:29] (CR) jerkins-bot: [V: -1] [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[13:56:32] TabbyCat: uh, not sure I have ever done it, is it documented?
[13:56:59] zeljkof: left instructions at T173419 but if you don't feel confortable do not worry
[13:57:00] T173419: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419
[13:58:38] TabbyCat: I would rather not, is it urgent? moar people that know what to do should be online in a few hours
[13:59:22] zeljkof: not urgent, will wait then
[14:01:21] ok, thanks, I don't want to break stuff
[14:08:00] (CR) Muehlenhoff: [C: 2] Run Lilypond from Firejail [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[14:13:25] (PS18) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[14:13:38] (PS3) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[14:14:10] (CR) jerkins-bot: [V: -1] WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[14:16:27] (PS4) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[14:16:34] (PS19) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[14:20:51] (PS20) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[14:23:18] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/370361 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[14:23:57] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662)
[14:24:47] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3530629 (Addshore) I guess as with other WMDE things this would either be @Tobi_WMDE_SW (on vacation for a few more days) o...
[14:24:51] (PS2) Marostegui: db-eqiad,db-codfw.php: Add db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662)
[14:27:51] (CR) Marostegui: [C: -2] "Wait till it has caught up with the master" [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[14:30:48] Operations, ops-esams, Traffic: cp3036 crashed - https://phabricator.wikimedia.org/T173506#3530657 (BBlack)
[14:32:23] (PS1) MarcoAurelio: Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398)
[14:32:52] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp3036.*
[14:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:56] Operations, ops-esams, Traffic: cp3036 crashed - https://phabricator.wikimedia.org/T173506#3530692 (BBlack) Open>Resolved a:BBlack ``` 14:32 bblack@neodymium: conftool action : set/pooled=yes; selector: name=cp3036.* ```
[14:37:08] (CR) Muehlenhoff: "The underlying patches
providing the wrappers have now been merged in the production cluster. Do we have the Score extension setup in depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[14:45:29] (PS1) Cmjohnson: Adding dns entries for labvirt1019-20 T172538 [dns] - https://gerrit.wikimedia.org/r/372390
[14:53:14] (PS5) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[15:03:44] (PS21) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:04:14] (CR) jerkins-bot: [V: -1] [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[15:08:08] (CR) Jcrespo: "Note memory usage is 1.15 times the buffer pool. I think we are wasting memory on MyISAM cache."
[puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[15:10:16] (PS22) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:21:37] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: MarcoAurelio)
[15:22:44] (PS8) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[15:22:46] (PS2) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516)
[15:22:49] (PS1) Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514)
[15:23:06] (CR) Marostegui: "> Note memory usage is 1.15 times the buffer pool. I think we are" [puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[15:23:35] (CR) Marostegui: "> > Note memory usage is 1.15 times the buffer pool.
I think we are" [puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[15:23:38] (CR) jerkins-bot: [V: -1] mariadb: Allow individual instances to configure its innodb BPS [puppet] - https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: Jcrespo)
[15:23:50] (CR) Jcrespo: "> where the sX.cnf are being generated from" [puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[15:24:11] (Abandoned) Marostegui: dbstore3.my.cnf: Reduce pool size [puppet] - https://gerrit.wikimedia.org/r/372366 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[15:26:35] (PS23) Gehel: [wip] cassandra - puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[15:26:52] (PS2) Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514)
[15:29:33] Operations, Puppet, Patch-For-Review, User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3530937 (Gehel)
[15:31:51] (CR) Marostegui: "Not really sure I am understanding the output of: https://puppet-compiler.wmflabs.org/compiler02/7512/" [puppet] - https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: Jcrespo)
[15:35:05] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.575 second response time
[15:39:16] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3531015 (Marostegui) And this happened again just now:
https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784 And spi...
[15:39:35] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 423.32 seconds
[15:40:00] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 396.19 seconds
[15:40:01] That is most likely because of what I commented on ^
[15:40:06] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 382.99 seconds
[15:40:10] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3531024 (Eevans)
[15:40:25] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 452.63 seconds
[15:40:52] marostegui: ack, thanks for the quick reaction
[15:42:30] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2001.codfw.wmnet
[15:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:23] marostegui: ok ack. thanks for the info
[15:43:26] it is a wikiadmin user
[15:43:27] Actually it is not that, it was our friend: https://phabricator.wikimedia.org/T164173
[15:43:31] so it is someone running a script
[15:43:57] from terbium
[15:44:31] jynus: you sure? I am seeing lots of updates like the usual ones
[15:44:50] who is creating those updates?
[15:45:18] I am checking binlogs
[15:46:06] php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=3" has some long running queries
[15:46:21] They started at 15:31 UTC, which matches the spikes https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&var-port=9104&from=now-1h&to=now
[15:46:35] !log reimage restbase2001 - T169939
[15:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:47] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[15:47:44] (PS16) Filippo Giunchedi: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[15:50:08] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3531065 (jcrespo) p:Normal>Unbreak! The execution of: ``` www-data 20284 0.0 0.1 328216 60744 ? S 15:30 0:00 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/W...
[15:50:11] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Replication lag: 17.02 seconds
[15:50:20] RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Replication lag: 0.22 seconds
[15:53:22] isn't it possible to part dewiki and wikidatawiki from one shard to avoid the replication lag on both this wikis the same time?
[15:53:33] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3531075 (jcrespo) p:High>Unbreak!
[15:53:41] doctaxon: not only it is possible, it is planned
[15:53:54] cool
[15:54:00] when?
[15:54:12] it cannot start before september
[15:54:52] okay, is there a phabricator task for that?
[15:55:02] probably october-december, but cannot commit to any date right now
[15:55:06] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 11.09 seconds
[15:55:15] however, current issues are not normal
[15:55:26] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.107 second response time
[15:55:39] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (chasemp) @Cmjohnson seems like a definite hardware failure to me man, we haven't even put this back in service. Next steps?
[15:55:45] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 11.66 seconds
[15:56:54] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3531087 (chasemp)
[15:58:11] jynus: is there a phabricator task for this plans?
[15:58:35] (PS17) Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939)
[15:59:28] doctaxon: this is the closest we have: https://phabricator.wikimedia.org/T172679
[15:59:39] thx
[16:00:05] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T1600).
[16:01:37] no patches afaics
[16:01:59] \o/
[16:03:26] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3531116 (Cmjohnson) @chasemp can you share some logs, I need to take this back to Dell
[16:04:39] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3531118 (jcrespo) I think this is just `UPDATE /* Title::invalidateCache */` based on the binlogs, not the above script.
[16:04:50] (PS3) RobH: New shell user diego [puppet] - https://gerrit.wikimedia.org/r/370841 (https://phabricator.wikimedia.org/T172891)
[16:05:29] (CR) RobH: [C: 2] New shell user diego [puppet] - https://gerrit.wikimedia.org/r/370841 (https://phabricator.wikimedia.org/T172891) (owner: RobH)
[16:06:19] jynus: glad that we are seeing the same thing in the end ;-)
[16:06:31] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3531120 (jcrespo) Setting as unbreak now because this is preventing collaborators from editing ar...
[16:06:36] Operations, Ops-Access-Requests, Research, Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531122 (RobH) Open>Resolved Sorry about this, been out sick this week. No objections, so this is now merged.
[16:07:32] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Patch-For-Review, User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3521778 (RobH) Please note this requires review in the Ops meeting for approval.
[16:08:29] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3531127 (jcrespo)
[16:08:41] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3521683 (RobH) Additionally, deployment requires ops meeting review/approval. [] - manager approval from @tobi_wmde_sw or...
[16:09:45] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:10:20] jynus: regarding https://phabricator.wikimedia.org/T164173 I'm around to work on it (I'm Amir1), but it's outside of Wikidata team because the fix is merged and not deployed because of https://phabricator.wikimedia.org/T173462
[16:10:28] (PS7) Chad: Gerrit: Enable logstash by default for prod gerrit [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[16:10:45] so I don't know how I can be useful but if there is anything, tell me
[16:10:46] (CR) jerkins-bot: [V: -1] Gerrit: Enable logstash by default for prod gerrit [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[16:10:49] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3531150 (chasemp) p:Triage>High
[16:10:51] I am not sure who ar you, Goatification
[16:11:00] but I just pinged collaboration and editing
[16:11:04] precisely because of that
[16:11:07] Amir1, Ladsgroup
[16:11:12] ah, hi
[16:11:15] (CR) Chad: Gerrit: Enable logstash by default for prod gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[16:11:32] or should I ping releng? who is blocking that?
[16:11:43] who as in, "what"?
[16:11:53] !log T169939: Lower eqiad compaction throughput from 20MB/s to 15MB/s
[16:11:54] yeah, deploying the train causes another sort of problem
[16:11:54] robh: o/ - stat1004's puppet is broken for has_key(): expects the first argument to be a hash, got "" which is of type String at /etc/puppet/modules/admin/manifests/hashuser.pp:14
[16:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:06] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939
[16:12:11] ?
[16:12:21] hmmm?
[16:12:23] damn it
[16:12:27] robh: not sure, could it be https://gerrit.wikimedia.org/r/370841 ?
[16:12:28] did my patch break puppet everywhere?
[16:12:37] at the end, it comes to AaronSchulz because he is working on the fix
[16:12:42] Operations, Ops-Access-Requests, Release-Engineering-Team (Watching / External), User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3531156 (Abraham) @RobH approved. Thanks.
[16:12:47] so performance?
[16:12:54] elukey: im going to revert and see!
[16:12:55] robh: for the moment it seems only stat1004, but I just noticed it
[16:13:03] oh... lemme try a puppet run elsewhere
[16:13:08] jynus: yeah, aaron has a WIP for the train blocker
[16:13:09] Operations, DBA, Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3531160 (jcrespo) which means most probably a direct cause of T164173, getting worse?
[16:13:14] yeah
[16:13:20] and aaron -> performance team
[16:13:29] ah robh, you added 'dsaez' not diego to researchers
[16:13:35] ohhhh
[16:13:42] yeah we changed it halfway through the patch, shit
[16:13:44] greg-g: I think my unbreak now is fair
[16:13:46] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784
[16:13:51] elukey: pulling and fixing in place
[16:13:55] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:13:56] jynus: I totally agree with the unbreak now
[16:13:57] losing 90% of edits I think it worth it
[16:14:02] jynus: worth a backport&deploy around the train?
[16:14:11] puppet issues is known its me
[16:14:14] and they do not come back after read-only recovers
[16:14:18] this task brings UBN to a whole new level
[16:14:28] it's definitely worth it
[16:14:28] jynus: kk, backport it is (cc thcipriani )
[16:14:45] greg-g: sorry, but I do not know the issues around it
[16:14:45] greg-g: This was happening not too often some weeks ago, but now it pretty much happens every single day :_(
[16:14:51] I cannot recommend any point of action
[16:14:56] marostegui: what's "it"?
[16:14:58] I just want to shake people
[16:15:01] greg-g: the problem these two bugs are related
[16:15:06] to see if some mitigation can be done
[16:15:19] Goatification: the one jynus is referring to and https://phabricator.wikimedia.org/T173462 ?
[16:15:40] the second one (the blocker of the train) is being caused by fixing the first (the UBN) task
[16:15:49] oh, I see
[16:15:50] ...
[16:15:58] greg-g: The massive amount of non-batches updates creating lag and setting affected shards in read only: https://phabricator.wikimedia.org/T164173#3531120
[16:16:04] marostegui: ahh
[16:16:16] !log T169939: RESTBase: converting (33) new keyspaces to time-windowed compaction
[16:16:19] so, my point is I do not want to recommend something because I assume it is complex
[16:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:31] soooo, Goatification, what is the correct way forward?
[16:16:43] but I want to be sure people are aware of the impact of the problem
[16:16:49] jynus: noted, thanks
[16:16:53] get the https://phabricator.wikimedia.org/T173462 fixed or at least bypass it
[16:16:59] and get the train going
[16:17:00] 90% edits loss, and high user complains an frustration
[16:17:07] (PS1) RobH: fixing diego's login [puppet] - https://gerrit.wikimedia.org/r/372409 (https://phabricator.wikimedia.org/T172891)
[16:17:14] elukey: wanna +1 ^?
[16:17:18] since you spotted my mixtake =]
[16:17:21] mistake even
[16:17:28] i should have gotten another set of eyes on it, was silly of me'
[16:17:32] looks like aaron made that patch at 2am, not sure when he'll be awake
[16:17:32] and manger may be able to provide more resources, if that helps?
[16:17:37] *managers
[16:17:39] even a hotfix (or a horrible hack around) might work until we can get it passed
[16:18:08] robh: checking
[16:18:13] thx, sorry for the break
[16:18:14] Krinkle: can you help with reviewing https://gerrit.wikimedia.org/r/#/c/372345/ ?
[16:18:17] come back from being sick, break shit.
[16:18:19] \o/
[16:18:40] jynus: I might be wrong but the whole thing is so complex that most people I know can't fix it without spending lots of time to read the code and gets familiarized
[16:18:49] I understand
[16:19:06] Goatification: I talked to daniel and hoo, and the patches that is supposed to fix it, is merged, it just needs to get deployed
[16:19:07] I gave it a try and failed miserably
[16:19:12] Krinkle: tl;dr: that needs to be reviewed/merged quickly to ge the train going as it's blocking a fix to a wikidata issue (it's related) which is causing the loss of 90% of wikidata edits
[16:19:15] sigh. I'm not sure what the user impact is for https://phabricator.wikimedia.org/T173462 but it sounds like the user impact from halting the train may outsize it?
[16:19:26] thcipriani: /me nods
[16:19:29] marostegui: yeah but deploying it causes another UBN bug
[16:19:56] Goatification: yeah yeah, I know :(
[16:20:02] im 99.9999% sure i fixed my patch of course
[16:20:12] but the fact i introduced it in the first place with out getting a +1 is what causd the issue.
[16:20:38] so yeah, anyone can +1 that it looks sane so i can merge =]
[16:21:09] robh: I am checking the user on ldap but it seems already taken by another person
[16:21:20] so thats the confusion
[16:21:25] diego in ldap is his personal account
[16:21:36] but we are using his diego@wikimeida.org lddap account not his personal one to made the uid
[16:21:43] the uid ties to dsaez, he wants to login with diego
[16:21:55] hence the change of login name, but th UID ties to his WMF wikitech account/ldap
[16:22:01] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[16:22:07] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[16:22:18] shell username and wikitech name dont have to match
[16:22:22] they often do, but not always
[16:22:25] robh: ahhh okok
[16:22:30] thanks :)
[16:22:30] Operations, Traffic, Community-Liaisons (Jul-Sep 2017), User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531169 (BBlack) Update: Today is the start date for going to 5%. Before we p...
[16:22:37] no worries, its why i had the mistake in the first place!
[16:22:41] (CR) Elukey: [C: 1] fixing diego's login [puppet] - https://gerrit.wikimedia.org/r/372409 (https://phabricator.wikimedia.org/T172891) (owner: RobH)
[16:22:48] (CR) RobH: [C: 2] fixing diego's login [puppet] - https://gerrit.wikimedia.org/r/372409 (https://phabricator.wikimedia.org/T172891) (owner: RobH)
[16:23:07] greg-g: basically my point is to understand what is the current situation, and what can be done about it, and if possible facilitate actions
[16:23:22] (CR) Filippo Giunchedi: [C: 2] Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[16:23:26] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[16:23:36] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372386 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[16:23:40] ok, tesitng puppet on stat1004
[16:23:41] (PS18) Filippo Giunchedi: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[16:23:53] (fix is
live)
[16:24:13] elukey: thank you for spotting that so quickly!
[16:24:24] thanks for fixing :)
[16:24:33] jynus: essentially I think it's: this issue effecting 90% of edits on wikidata is more severe than https://phabricator.wikimedia.org/T173462 but I have no idea as @aaron isn't here
[16:24:42] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db2077 - T170662 (duration: 00m 51s)
[16:24:52] greg-g: small correction
[16:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:55] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[16:24:56] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[16:24:59] it is affecting 90% of all edits
[16:25:09] well $%*#$)*
[16:25:14] while it is true that a large number of them are from wikidata
[16:25:21] (CR) Chad: [C: -1] "No, the PS7 -> PS8 was just a manual rebase...actually...it looks like the new author just rewrote it as a brand new commit. That stripped" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[16:25:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db2077 - T170662 (duration: 00m 49s)
[16:25:42] greg-g: this is a graph of all WMF edits: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784
[16:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:52] if the update is in wikidata, and the wikidata extension on wmf.13 is still at wmf.12, is it possible to backport wikidata extension wmf.14 to core wmf.14?
[16:26:12] what happened at 8:28?
[16:26:19] er, backport the wikidata extension v wmf.14 core to wmf.13
[16:26:40] s5 started to lag
[16:27:11] anyone else agree we need aaron's input?
[16:27:18] yes
[16:27:20] T173269
[16:27:21] T173269: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269
[16:27:21] I'm calling him now
[16:28:06] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3531178 (Papaul) @elukey the system is not booting up with the live CD provide by Dell get error message below. I am working with DELL on this issue at the moment. {F9099509}
[16:28:26] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:29:00] he's coming
[16:29:02] edits went back, but got reduced by 20-30%, which I would guess to attribute to editors getting frustrated
[16:29:18] so that is the reason I consider this an unbreak now
[16:29:21] papaul: let's burn mw2256! :P
[16:29:47] greg-g: thanks!
[16:30:04] editors, and not edits, are the most valued thing for me
[16:30:25] oiy backscroll
[16:30:26] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[16:30:30] Operations, Ops-Access-Requests, Research, Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531192 (RobH) So I had to do a followup patch, my initial one put a mix of diego and dsaez in use, rather than all diego. Fixed!
[16:30:36] elukey: i agree lol
[16:30:42] can someone tl;dr for aaron? :)
[16:30:49] AaronSchulz: summary, just T164173
[16:30:49] T164173: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173
[16:31:01] how can that get unblocked?
[16:31:15] is there a patch pending? something merged but undeployed?
[16:31:23] https://gerrit.wikimedia.org/r/#/c/372345/ [16:31:25] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531195 (10DarTar) Thanks, @RobH. [16:32:03] last example is T173269, which operations thinks is to that bug [16:32:10] *due [16:32:51] so, the patch I have is to just avoid an exception that would, if anything, cause rollbacks and fewer writes to happen [16:33:07] I was just trying to fix those snapshot/lock errors [16:33:30] (03CR) 10EBernhardson: Set elasticsearch servers to use 128kB readahead (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [16:33:44] (03PS1) 10Dzahn: phabricator: silence public_task_dump cron mails [puppet] - 10https://gerrit.wikimedia.org/r/372413 (https://phabricator.wikimedia.org/T127524) [16:33:50] AaronSchulz: I understand things are complex - would there be any way to deploy the fix? [16:33:53] (03CR) 10Chad: [C: 031] "Let's do this. Should be pretty low impact, and if it *doesn't* work, it should be easy to just revert to status quo." [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [16:33:58] so my understanding is that https://phabricator.wikimedia.org/T164173 has a fix that is in wmf.14 (just from reading that ticket) the rollout of which is blocked on https://phabricator.wikimedia.org/T173462 which AaronSchulz has a patch for [16:34:00] even if it causes exceptions sometimes? 
[16:34:07] AaronSchulz: the thing is, the fix for https://phabricator.wikimedia.org/T164173 is probably the cause of the snapshot errors [16:34:17] anything is better than losing 90% of our edit rate [16:34:43] (ok, talk between you, since you may know more; I'll step aside) [16:34:46] because it moves from one title invalidation to batch invalidation [16:34:57] *cache invalidation [16:36:21] (03CR) 10Jcrespo: "As you can see, if the variable is false, it is not used on the config file: See /etc/mysql/mysqld.conf.d/x1.cnf vs. the others. I can doc" [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [16:36:31] 10Operations, 10hardware-requests: refresh hardware for logstash100[123] - https://phabricator.wikimedia.org/T173298#3531206 (10RobH) This does indeed look like it could go into a Ganeti VM, since they have very low requirements. "@bd808 suggested that we could move those machines to VMs on Ganeti, but I'm no... [16:37:28] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3531208 (10Eevans) [16:37:42] thcipriani: I did some quick local testing and un-WIP'ed it [16:38:06] thcipriani: ah, I see, blocker-chain :) [16:38:22] AaronSchulz: sorry to bother you, BTW, but I think the issue merited alerting some people around [16:38:47] because in most cases, things get delayed because of coordination rather than actual issues [16:38:49] I was like 5min from being back on IRC anyway, so np, heh [16:39:00] (03CR) 10Marostegui: [C: 031] "> As you can see, if the variable is false, it is not used on the" [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [16:39:20] I mostly made that WIP because if was 2AM and I wanted to see it again in the morning and play around a bit [16:39:20] I am especially sensitive to issues affecting contributors [16:39:25] *it 
was [16:39:39] thank you! [16:39:40] AaronSchulz: ok, I can cherry-pick and deploy that, and then I'll roll forward with wmf.14, thanks for the quick action [16:39:58] also thanks thcipriani and greg-g for helping [16:40:58] thanks for pushing jynus. I wouldn't have seen that until later. [16:41:12] ^ [16:41:13] (03PS2) 10EBernhardson: Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) [16:41:31] greg-g: I saw it because it got directed to my phone in the form of an sms :-) [16:41:45] (03CR) 10Brian Wolff: "otoh, maybe itd be better to just change the script to be 90 days after the vote was recorded." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372180 (https://phabricator.wikimedia.org/T173393) (owner: 10Brian Wolff) [16:41:57] jynus: :) [16:42:33] (03CR) 10Chad: [C: 04-1] "11 is a silly number. 10 is just fine as a limit without being intrusive to the *vast* majority of people's work. Over the many many years" [puppet] - 10https://gerrit.wikimedia.org/r/371739 (owner: 10Greg Grossmeier) [16:42:35] ok, I need to go afk for about 20 minutes. Tyler's doing the cherrypick and deploy. Thanks all. I'll be back shortly. [16:43:12] (03CR) 10Paladox: "@Chad or @Dzahn hi, could you rebased this please? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:43:16] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:43:29] thank you guys for getting on top of this [16:43:43] yeah, before https://gerrit.wikimedia.org/r/#/c/364198/9/client/includes/Changes/WikiPageUpdater.php nothing hit that for-loop case [16:44:13] yes, I was right :D [16:44:22] (03PS3) 10EBernhardson: Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) [16:44:51] (03CR) 10jerkins-bot: [V: 04-1] Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [16:45:00] (03PS4) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [16:45:00] thcipriani: I'm going back to hotel and monitor from there, if anything is affecting the train I might be able to help [16:45:15] specially related to wikidata [16:45:31] Goatification: ok, thank you, I'll let you know if there are other issues there [16:46:30] AaronSchulz: I cherry-picked/merged the change to wmf.14. I left master unmerged. I try not to code review in master since there are folks better qualified to do the review. 
[16:46:34] (03PS5) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [16:46:36] * thcipriani just fiddles tasks/pushes buttons [16:47:05] (03PS4) 10EBernhardson: Set elasticsearch servers to use 128kB readahead [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) [16:47:29] (03Abandoned) 10Greg Grossmeier: Gerrit: Max batch limit = 11 [puppet] - 10https://gerrit.wikimedia.org/r/371739 (owner: 10Greg Grossmeier) [16:47:59] AaronSchulz: +2'd [16:48:09] on master [16:48:17] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3531227 (10fgiunchedi) Both issues have been fixed upstream! Pending deployment of latest version of librenms to production. [16:48:32] (03CR) 10Paladox: "@Dzahn but then that’s won’t fix it since I use active server in labs so my instance will keep sending the emails as it would keep be adde" [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [16:48:36] 10Operations, 10media-storage: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3531228 (10Nick) >>! In T173374#3527640, @fgiunchedi wrote: > Is there an exception id or anything like that attached to th... [16:50:21] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:51:01] (03CR) 10Dzahn: [C: 031] "yep, i just can't personally merge it yet, blocked by https://gerrit.wikimedia.org/r/#/c/372210/" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [16:51:22] (03CR) 10Gehel: [C: 04-1] "We also need values for `storage_device` for relforge and beta clusters." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/371963 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [16:51:30] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/includes/jobqueue/jobs/RefreshLinksJob.php: [[gerrit:372414|Avoid lock acquisition errors for multi-title refreshlinks jobs]] T173462 (duration: 00m 51s) [16:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:42] T173462: LinksUpdate::acquirePageLock: Cannot flush pre-lock snapshot because writes are pending - https://phabricator.wikimedia.org/T173462 [16:53:00] PROBLEM - MD RAID on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:53:00] PROBLEM - cassandra-c service on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:53:06] (03PS1) 10Thcipriani: Revert "Revert "Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372415 [16:53:50] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [16:53:51] PROBLEM - configured eth on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:54:28] (03CR) 10Chad: "Should be ok with the minor nit inline fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [16:54:40] PROBLEM - dhclient process on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:54:53] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.14 now for T164173 [16:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:08] T164173: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173 [16:55:31] PROBLEM - Check size of conntrack table on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:55:31] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to 
address 10.192.16.162 and port 9042: Connection refused [16:55:31] PROBLEM - puppet last run on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:56:00] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:56:21] PROBLEM - Check systemd state on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:56:21] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: Return code of 255 is out of bounds [16:56:21] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:57:12] 10Operations, 10vm-requests, 10Discovery-Search (Current work): refresh hardware for logstash100[123] - https://phabricator.wikimedia.org/T173298#3531240 (10Gehel) a:05Gehel>03akosiaris @akosiaris: could you have a look into this request and let me know if it make sense to move the logstash ingestion nod... [16:57:30] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:30] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:30] PROBLEM - salt-minion processes on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:32] AaronSchulz: Goatification hrm now after rolling forward I'm seeing a lot of error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/WANObjectCache.php on line 251 and error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/MemcachedBagOStuff.php on line 182 [16:58:21] PROBLEM - Check whether ferm is active by checking the default input chain on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:58:21] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:50] (03PS2) 10Dzahn: phabricator: silence stdout of public_task_dump cron [puppet] - 10https://gerrit.wikimedia.org/r/372413 (https://phabricator.wikimedia.org/T127524) [16:59:20] PROBLEM - DPKG on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:25] as well as /srv/mediawiki/php-1.30.0-wmf.14/vendor/monolog/monolog/src/Monolog/Logger.php on line 292 [16:59:37] odd [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T1700). Please do the needful. [17:00:10] PROBLEM - cassandra-b service on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:11] PROBLEM - Disk space on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:19] Nothing for ORES today. [17:00:24] nothing for parsoid today [17:00:55] jan_drewniak, debt: Can you help me how can I update the CSS sprite for T160491? Are there docs somewhere? [17:00:55] T160491: Update Wikiversity logos - https://phabricator.wikimedia.org/T160491 [17:01:01] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:16] that's me ^ silenced [17:01:41] godog: I was just about to ping u.random :) [17:02:05] (I'm talking about sprite-project-logos.png BTW) [17:02:19] greg-g: hehe thanks! yeah 2001 is being reimaged [17:06:31] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[tzdata],Exec[wikidev_ensure_members],Exec[ops_ensure_members],Exec[perf-roots_ensure_members] [17:09:53] thcipriani: okay, I'm back. 
Let me check [17:10:07] !log gerrit: cpu spikes, lots of large gc logs. Looking into it. (nb: things might be a little slow, but it is /up/) [17:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:28] blerg. I would totally roll back due to stack overflows had I not rolled forward because of a massive loss of edits. This is not a good state for the train to be in. [17:12:34] (03PS3) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:12:45] thcipriani: are all wikis on wmf.14? [17:12:52] or just group1 [17:12:57] group0+1 [17:13:00] ^ [17:13:06] okay [17:13:07] yeah, it's all coming from wmf.14 code [17:13:08] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:13:23] I'm checking logstash fatalmonitor, amazingly there is no traceback [17:16:17] Hello [17:16:20] There is a problem at kowikisource. I think I need some help. [17:16:43] what sort of problem ? [17:16:59] what kind? [17:17:16] I cannot access the site. [17:17:54] ah yes, it's not loading. [17:17:57] It gives error 503 or blank page [17:18:41] I'm rolling back wmf.14. I think this is due to the stack overflows I'm seeing. 
[17:19:15] (03PS4) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:19:43] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:20:06] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.13 now T173520 [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:21] T173520: Fatal error: Stack overflow in [files] for wmf.14 - https://phabricator.wikimedia.org/T173520 [17:20:21] ^ Namoroka can you try it now? [17:20:44] works for me, btw [17:21:13] NotASpy: after rollback it works? [17:21:24] yes [17:21:31] It works well [17:21:39] * AaronSchulz will take a look after mtg finishes [17:21:41] muther [17:21:43] Namoroka: thank you for the report [17:21:53] NotASpy: thanks for the testing :) [17:22:06] RECOVERY - Check size of conntrack table on restbase2001 is OK: OK: nf_conntrack is 0 % full [17:22:06] RECOVERY - dhclient process on restbase2001 is OK: PROCS OK: 0 processes with command name dhclient [17:22:17] RECOVERY - Check whether ferm is active by checking the default input chain on restbase2001 is OK: OK ferm input default policy is set [17:22:26] np [17:22:26] RECOVERY - configured eth on restbase2001 is OK: OK - interfaces up [17:22:27] RECOVERY - Disk space on restbase2001 is OK: DISK OK [17:22:27] RECOVERY - MD RAID on restbase2001 is OK: OK: Active: 10, Working: 10, Failed: 0, Spare: 0 [17:22:30] thank you for helping me [17:22:37] RECOVERY - DPKG on restbase2001 is OK: All packages OK [17:22:47] RECOVERY - salt-minion processes on restbase2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:22:57] RECOVERY - Check systemd state on restbase2001 is OK: 
OK - running: The system is fully operational [17:25:55] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3531336 (10Papaul) Was about this time to boot off the live CD and after the OS load the system got in frozen state so i couldn't run the stress command. The last option now is to run the extended HW test and see... [17:26:13] (03PS3) 10Dzahn: admins: add new ssh keys for dzahn [puppet] - 10https://gerrit.wikimedia.org/r/372210 [17:27:08] (03PS9) 10Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - 10https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) [17:27:10] (03PS3) 10Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - 10https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516) [17:27:12] (03PS5) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:27:17] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase2001 is OK: OK: synced at Thu 2017-08-17 17:27:14 UTC. [17:33:16] (03PS6) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:34:57] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:36:07] (03PS4) 10ArielGlenn: admins: add new ssh keys for dzahn [puppet] - 10https://gerrit.wikimedia.org/r/372210 (owner: 10Dzahn) [17:37:03] (03CR) 10ArielGlenn: [C: 032] admins: add new ssh keys for dzahn [puppet] - 10https://gerrit.wikimedia.org/r/372210 (owner: 10Dzahn) [17:37:49] thcipriani: any luck on the overflow thing? 
[17:37:59] thank you apergos [17:38:02] AaronSchulz: no :( [17:38:09] I'm trying to dig up stack trace on mwlog1001 [17:38:50] yw [17:38:52] reminds me of https://phabricator.wikimedia.org/T123829 [17:39:06] maybe something structurally similar [17:40:25] (03PS10) 10Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - 10https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) [17:40:27] (03PS4) 10Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - 10https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516) [17:40:29] (03PS7) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:43:59] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7522/dbstore2002.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:44:10] (03PS8) 10Jcrespo: mariadb: Allow individual instances to configure its innodb BPS [puppet] - 10https://gerrit.wikimedia.org/r/372400 (https://phabricator.wikimedia.org/T169514) [17:44:22] AaronSchulz: here's a partial trace for /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/WANObjectCache.php https://phabricator.wikimedia.org/P5891 [17:45:16] since it's hhvm this might be related https://github.com/facebook/hhvm/issues/7432 [17:45:32] nvm [17:46:17] (03CR) 10Rush: "seems good" [puppet] - 10https://gerrit.wikimedia.org/r/372413 (https://phabricator.wikimedia.org/T127524) (owner: 10Dzahn) [17:46:27] (03CR) 10Rush: [C: 031] phabricator: silence stdout of public_task_dump cron [puppet] - 10https://gerrit.wikimedia.org/r/372413 (https://phabricator.wikimedia.org/T127524) (owner: 10Dzahn) [17:46:34] (03PS3) 10Rush: phabricator: silence stdout of public_task_dump cron [puppet] - 10https://gerrit.wikimedia.org/r/372413 
(https://phabricator.wikimedia.org/T127524) (owner: 10Dzahn) [17:48:42] thcipriani: what's the current status of wmf.14? i will need this backport merged: https://gerrit.wikimedia.org/r/#/c/372419/ (the task is marked as release blocker already). do i need to have it SWAT-ted, or can you merge it for it to be included when we try deploying wmf.14 it again? [17:50:21] MatmaRex: is this testable on group0 (testwiki/mediawiki.org)? If so, I can just deploy it now. [17:50:59] thcipriani: it should be, we have UploadWizard on one of them i think [17:51:20] yeah, testwiki has it [17:52:02] the easiest way to test is to inspect `mw.UploadWizard.config.uwLanguages` in browser console - incorrect is an array (numeric keys), correct is an object (string keys with language codes) [17:52:32] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531472 (10Whatamidoing-WMF) I've left another note about this at enwiki's VPT:... [17:53:17] thcipriani: I'm betting on b48f361d7d606eff5ab48cc2a64c1cae4e794c84 [17:53:21] MatmaRex: okie doke, +2'd [17:53:28] thank you [17:53:28] !log restarting mariadb (dbstore2001:x1) to test new buffer pool configuration [17:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:07] https://gerrit.wikimedia.org/r/#/c/370810/ [17:55:33] AaronSchulz: I can try a revert. 
I'm not sure I can put that patch's email to a nick and James_ doesn't seem to be around [17:57:32] easiest to just revert to db7507246665e69384c1d92af2aedc62263a5116 [17:58:27] I see nothing about changing data schemas or something obvious that would blow up on revert [17:58:33] MatmaRex: your change is on mwdebug1002 [17:59:15] lots of cleanup and unit test changes [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T1800). [18:00:04] RoanKattouw, MaxSem, Niharika, and Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:11] I'm here [18:00:18] I can SWAT. [18:00:38] thcipriani: thanks, looks good [18:00:42] Niharika: wait just a second please, [18:00:45] Niharika: *waves* [18:00:57] Niharika: just finishing an impromptu swat, sorry :( [18:01:29] thcipriani: No worries. :) [18:01:41] Hey addgoat! How's it goating? [18:02:21] it's goating good! Actually, my flight was pretty shitty, got back today, kind of feel like I might be ill tomorrow.... [18:03:06] (03PS6) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [18:03:12] That's no good. We've a lot of goatification to do! 
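The check described at 17:52:02 (an array with numeric keys is wrong, an object keyed by language code is right) can be illustrated with a small sketch. It is written in Python, not the extension's actual PHP/JavaScript, and the sample data and function names are hypothetical; it only shows how a sort that re-indexes its result loses the language codes, while a key-preserving sort keeps them.

```python
# Hypothetical sample data: language code -> language name.
LANGUAGES = {"fr": "français", "en": "English", "de": "Deutsch"}


def sort_dropping_keys(languages):
    """Sort by name but re-index numerically, losing the language codes."""
    return dict(enumerate(sorted(languages.values(), key=str.casefold)))


def sort_preserving_keys(languages):
    """Sort by name while keeping each language code attached to its name."""
    return dict(sorted(languages.items(), key=lambda item: item[1].casefold()))


def has_language_keys(config):
    """True when every key is a string, as a language-code mapping should be."""
    return all(isinstance(key, str) for key in config)


broken = sort_dropping_keys(LANGUAGES)    # keys become 0, 1, 2
fixed = sort_preserving_keys(LANGUAGES)   # keys stay "de", "en", "fr"
```

Under these assumptions, `has_language_keys(broken)` is false and `has_language_keys(fixed)` is true, which mirrors the numeric-keys versus language-code-keys distinction used to verify the patch.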
[18:03:14] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/extensions/UploadWizard/UploadWizard.config.php: SWAT: [[gerrit:372419|Preserve array keys (language keys) when sorting the language dropdown]] T173522 (duration: 00m 51s) [18:03:21] MatmaRex: ^ live now [18:03:27] Niharika: go for it [18:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:28] T173522: UploadWizard generates wikitext like '{{83|...}}' (with different numbers) instead of '{{en|...}}' in file descriptions - https://phabricator.wikimedia.org/T173522 [18:03:30] also I dont get pinged for swats with that nice D: [18:03:32] *nick [18:04:50] RoanKattouw: You wanna take off your -2 from https://gerrit.wikimedia.org/r/#/c/368330/? [18:04:51] AaronSchulz: so between db7507246665e69384c1d92af2aedc62263a5116..wmf/1.30.0-wmf.14 it's 8 changes, we want all of them gone? [18:06:39] yeah, for ProofreadPage [18:07:04] the other stuff probably depends on the first refactoring and such [18:08:20] AaronSchulz: gotcha, ok, I'll make some reverts and have you make sure I did it right, squash 'em, and give it a shot [18:09:05] (03PS7) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [18:09:07] (03PS4) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [18:09:09] (03CR) 10Dzahn: [C: 032] phabricator: silence stdout of public_task_dump cron [puppet] - 10https://gerrit.wikimedia.org/r/372413 (https://phabricator.wikimedia.org/T127524) (owner: 10Dzahn) [18:09:12] (03PS1) 10Jcrespo: dbstore_multiinstance: reduce key_buffer, set read only [puppet] - 10https://gerrit.wikimedia.org/r/372423 (https://phabricator.wikimedia.org/T169514) [18:09:51] (03PS2) 10Jcrespo: dbstore_multiinstance: reduce key_buffer, set read only [puppet] - 10https://gerrit.wikimedia.org/r/372423 
(https://phabricator.wikimedia.org/T169514) [18:10:14] (03PS3) 10Jcrespo: dbstore_multiinstance: reduce key_buffer, set read only [puppet] - 10https://gerrit.wikimedia.org/r/372423 (https://phabricator.wikimedia.org/T169514) [18:11:03] (03PS8) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [18:12:08] PROBLEM - puppet last run on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:12:23] !log niharika29@tin Synchronized php-1.30.0-wmf.14/extensions/LoginNotify/: Log usage statistics https://gerrit.wikimedia.org/r/#/c/372214/ (duration: 00m 51s) [18:12:33] (03CR) 10Niharika29: [C: 032] "SWAT. wmf.12 is everywhere now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 (owner: 10Catrope) [18:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:40] (03CR) 10jerkins-bot: [V: 04-1] Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 (owner: 10Catrope) [18:12:58] PROBLEM - puppet last run on cp1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:13:22] Uh oh. RoanKattouw ^^ [18:13:29] (03CR) 10Jcrespo: [C: 032] dbstore_multiinstance: reduce key_buffer, set read only [puppet] - 10https://gerrit.wikimedia.org/r/372423 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [18:13:34] Niharika: Looking [18:13:57] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [18:14:37] PROBLEM - confd service on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:14:58] (03PS6) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [18:15:12] Niharika: WTF I don't understand the conflict, will get back to it in ~15 mins [18:15:22] (03PS7) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [18:15:27] RECOVERY - confd service on cp1063 is OK: OK - confd is active [18:15:41] RoanKattouw: Okay, no worries. [18:16:08] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [18:16:22] (03CR) 10Paladox: Gerrit: Set auth.userNameToLowerCase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [18:16:33] !log upgrading and restarting all mariadb instances on dbstore2001 [18:16:38] PROBLEM - traffic-pool service on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:07] PROBLEM - puppet last run on cp1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:24] those socket timeouts are a little concerning [18:18:21] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3531576 (10demon) 05Resolved>03Open [[ http://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMTcvMDgvMTcvLS1sb2cuZ2MudGFyLmd6LS0xNy0x... [18:18:47] addshore: Your change is on mwdebug1002. [18:18:48] PROBLEM - confd service on cp1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:58] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures [18:19:08] PROBLEM - traffic-pool service on cp1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:19:30] We should have more fun names for our test servers. 
[18:19:38] RECOVERY - traffic-pool service on cp1063 is OK: OK - traffic-pool is active [18:19:40] (03PS1) 10Chad: Gerrit: Also set minimum heap size [puppet] - 10https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) [18:19:44] (03PS4) 10Paladox: Phabricator: Only send logmail on prod not labs [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) [18:19:57] RECOVERY - confd service on cp1048 is OK: OK - confd is active [18:20:08] RECOVERY - traffic-pool service on cp1048 is OK: OK - traffic-pool is active [18:22:27] Niharika: ack [18:22:30] AaronSchulz: I think this is everything, but it's definitely a ton: https://gerrit.wikimedia.org/r/#/c/372427/ [18:22:40] Niharika: all looks good [18:22:43] (sorry for the delay) [18:24:24] (03CR) 10Paladox: [C: 031] Gerrit: Also set minimum heap size [puppet] - 10https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) (owner: 10Chad) [18:24:27] addshore: No worries. Syncing... [18:24:55] !log niharika29@tin Synchronized php-1.30.0-wmf.14/extensions/RevisionSlider/: Revert Reintroduce hover and bar clicking - https://gerrit.wikimedia.org/r/#/c/372384/ (duration: 00m 48s) [18:25:01] And done. [18:25:05] ack [18:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:18] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531606 (10BBlack) Ok thanks! I've done (1) above here: https://wikitech.wikime... 
[18:26:19] (03CR) 10Rush: Add libraryupgrader puppet module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [18:26:25] (03PS4) 10Rush: Add libraryupgrader puppet module [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [18:26:42] thcipriani: lgtm [18:28:15] !log restart logstash on logstash1003 to see why it's not reading EventError messages from kafka [18:28:21] AaronSchulz: ok, post-swat I will merge, re-scap since there are l10n changes here, and attempt another roll-forward of wmf.14 [18:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:37] Niharika: could you ping me when SWAT is complete, please? [18:28:59] thcipriani: Yeah, will do. Waiting for one last patch. [18:29:08] we need continuous deploy ;) [18:30:29] ^ [18:30:33] :) [18:33:13] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3191313 (10Izno) I have a dumb question! 😃 Currently, https://en.wikipedia.org... [18:34:49] if systemctl complains soon, it is me getting confused trying to start things that I shouldn't [18:35:37] heh [18:38:57] happily I could do a systemctl reset-failed before icinga caught it [18:41:24] (03PS2) 10Catrope: Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 [18:41:45] Niharika: ---^^ Sorry for the delay, nothing weird happened, the way git marked the conflict was just very confusing [18:42:14] And led me to believe that someone else had removed it already, so I went on a wild goose chase to find the commit that did, only to find that nobody had removed it and it was still there [18:43:14] RoanKattouw: That's weird. 
If you diff that file between the 1st and 2nd patches you get https://gerrit.wikimedia.org/r/#/c/368330/1..2/wmf-config/InitialiseSettings.php [18:44:16] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531719 (10BBlack) It's a very valid question :) Around the time of the final d... [18:45:06] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 (owner: 10Catrope) [18:45:22] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531721 (10Whatamidoing-WMF) The information could also be duplicated on the blo... [18:46:32] (03Merged) 10jenkins-bot: Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 (owner: 10Catrope) [18:46:43] (03CR) 10jenkins-bot: Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368330 (owner: 10Catrope) [18:47:03] RoanKattouw: It's om medebug1002. [18:47:05] on* [18:47:12] mwdebug1002* [18:47:32] Yeah, the conflict was because other things (e..g Timeless) had been added near it [18:47:41] And git seemed to think that I wanted to remove those too [18:47:50] (03PS24) 10Gehel: cassandra - puppet 4 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) [18:48:19] Urrrghhhhh, Wikidata is still on wmf11 :( [18:48:34] (03CR) 10Smalyshev: "It feels a bit weird that one log config lives in source repo and another in puppet. Can we colsolidate somehow? 
Also, situation between u" [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [18:48:37] thcipriani: Any ETA for getting Wikidata out of wmf11 hell? [18:48:46] Niharika: We might need to revert this because of ---^^ [18:49:10] RoanKattouw: after SWAT that's what I'll be working on [18:49:24] RoanKattouw: Sorry. I should have checked that better. [18:49:27] so if all goes well: evening swat [18:49:30] Except I didn't know how. [18:49:32] Niharika: So should I [18:50:05] thcipriani: Do you mean, if all goes well, I can redeploy this patch in the evening SWAT and wikidata will be on wmf12+ by then? [18:50:14] (03CR) 10Gehel: "The multiple logs locations existed prior to this patch, but yes, we should move all that back to the same location (I would tend to move " [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [18:50:30] (03PS1) 10Niharika29: Revert "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372430 [18:50:32] RoanKattouw: right, if all goes well wikidatawiki will be on wmf.14 with all the other wikis [18:50:37] Awesome [18:50:43] Niharika: In that case let's revert now and try again at 4pm [18:50:50] Aha you're ahead of me [18:51:01] (03CR) 10Catrope: [C: 032] Revert "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372430 (owner: 10Niharika29) [18:52:28] (03Merged) 10jenkins-bot: Revert "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372430 (owner: 10Niharika29) [18:52:40] (03CR) 10jenkins-bot: Revert "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372430 (owner: 10Niharika29) [18:53:19] Since I just did a pull of the 
previous patch and now a pull of the revert patch, I don't need to sync the file right? Or do I for the sake of consistency or something? [18:53:45] You're right, you don't [18:53:46] The revert is on mwdebug1002 as well. [18:53:59] Cause you never synced the change to begin with, right? Only to 1002 [18:54:06] Yeah. [18:54:29] Alrighty then. thcipriani - the floor is all yours. [18:54:34] Niharika: thank you! [18:58:26] (03CR) 10Bearloga: "> I'm renaming the "r" module to "r_lang" which is probably a bit less confusing (https://gerrit.wikimedia.org/r/#/c/371075/). This will c" [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [18:58:47] AaronSchulz: is there anything you want to check before I start scap of the proofreadpage revert? I can stage it on mwdebug1002 first... [18:59:24] not that I can think of [18:59:49] ok [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T1900). 
[19:00:15] * thcipriani does [19:01:27] !log thcipriani@tin Started scap: [[gerrit:372427|ProofReadPage Revert to db7507246665e69384c1d92af2aedc62263a5116]] T173520 [19:01:28] * greg-g sings "one more time" [19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:40] T173520: Fatal error: Stack overflow in [files] for wmf.14 - https://phabricator.wikimedia.org/T173520 [19:07:41] !log thcipriani@tin Finished scap: [[gerrit:372427|ProofReadPage Revert to db7507246665e69384c1d92af2aedc62263a5116]] T173520 (duration: 06m 13s) [19:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:51] T173520: Fatal error: Stack overflow in [files] for wmf.14 - https://phabricator.wikimedia.org/T173520 [19:08:00] alright, step 1, now to get wikiversions updated to move to wmf.14 [19:09:38] (03CR) 10Dzahn: [C: 032] librenms: no https/cert monitoring on inactive server [puppet] - 10https://gerrit.wikimedia.org/r/372205 (https://phabricator.wikimedia.org/T172712) (owner: 10Dzahn) [19:09:46] (03PS2) 10Dzahn: librenms: no https/cert monitoring on inactive server [puppet] - 10https://gerrit.wikimedia.org/r/372205 (https://phabricator.wikimedia.org/T172712) [19:11:13] (03PS1) 10Thcipriani: Revert "Revert "Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372433 [19:11:18] (03PS21) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:12:13] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to wmf.14 [19:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:50] logs seem ok (pre-lock error and overflow) atm [19:14:44] !log thcipriani@tin Synchronized php: group1 wikis to wmf.14 (duration: 00m 46s) [19:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] (03CR) 10Thcipriani: [C: 032] Revert "Revert 
"Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372433 (owner: 10Thcipriani) [19:15:43] yeah, logs look clean so far [19:16:29] https://logstash.wikimedia.org/goto/12ae93340e8a2852908acbe4a2725c8c [19:16:42] (03Merged) 10jenkins-bot: Revert "Revert "Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372433 (owner: 10Thcipriani) [19:16:54] (03CR) 10jenkins-bot: Revert "Revert "Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372433 (owner: 10Thcipriani) [19:17:19] * AaronSchulz also found the reason DBReplication is more spammy that it should be, but that's another matter [19:17:56] AaronSchulz: I commentet some things to krinke on the performance-ops meeting [19:18:00] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3531892 (10Eevans) restbase2001.codfw.wmnet has been re-imaged, but there are a couple of issues yet to resolve: First, the Puppet manifest expect... [19:18:16] have a look other day at the etherpad [19:19:13] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531895 (10Johan) Maybe it would be possible to have a "Translate" link after th... [19:19:32] AaronSchulz: if all holds I'll go ahead and get wmf.14 to all wikis around 20:00 UTC thank you for all your help! 
[19:21:27] (03PS22) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:24:09] jynus: yeah the lag calculation was pessimistic to err on the side of read-only, though realistically it could be either way [19:24:29] not only that [19:24:51] on high load, it can also be missleading [19:25:29] in theory the lag could increase during the trip time (by the same amount), but I suppose far more often it's just overestimating [19:25:49] it's not like any of these stuff is super accurate to that level anyway [19:25:56] but that is anotehr concern [19:26:04] I would be more than ok to measure latency [19:26:27] and depool servers if connection latency > X [19:26:34] or query latency [19:26:47] but that doesn't have anything to do with server lag [19:27:48] I probably will increase the granularity of heartbeat to 0.5 seconds, but that is another thing for long term [19:27:49] 10Operations, 10monitoring, 10Patch-For-Review: fix librenms LE check for netmon2001 - https://phabricator.wikimedia.org/T172712#3531922 (10Dzahn) This removed the HTTPS and LE cert check for netmon2001 (on einsteinium) based on netmon1002 being set as the netmon_server in ./hieradata/common.yaml. ``` def... [19:28:38] I mean if an estimate is from time X and it takes Y ms for the client to get the response, the lag could have also increased by that much in the worst case. But as I said, the common case makes this liability. 
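The worst-case reasoning in the lag discussion above (an estimate taken at time X, plus Y ms before the client sees the response) can be sketched as follows. This is a minimal illustration, not MediaWiki's actual lag logic: the function names and the `max_lag_s` threshold are hypothetical, and it assumes the only correction applied is adding the round-trip time to the reported lag.

```python
def pessimistic_lag(reported_lag_s: float, round_trip_s: float) -> float:
    """Worst-case lag estimate.

    The replica may have fallen behind by up to the round-trip time
    while the reply was in flight, so add it to the reported value.
    """
    return reported_lag_s + round_trip_s


def should_be_read_only(reported_lag_s: float, round_trip_s: float,
                        max_lag_s: float) -> bool:
    """Err on the side of read-only: compare the pessimistic estimate
    against a configured threshold (max_lag_s is a hypothetical name)."""
    return pessimistic_lag(reported_lag_s, round_trip_s) > max_lag_s
```

As noted in the conversation, this usually overestimates the real lag. The alternative mentioned there is to skip the correction and simply configure a higher threshold.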
[19:28:45] !log disable puppet for cloud things for some careful refactor merging [19:28:56] 10Operations, 10monitoring, 10Patch-For-Review: fix librenms LE check for netmon2001 - https://phabricator.wikimedia.org/T172712#3531926 (10Dzahn) 05Open>03Resolved [19:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:09] any, that part is any easy change [19:29:26] (03CR) 10Rush: [C: 032] openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:29:37] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531927 (10diego) my ssh config Host * UseRoaming no ### Short names #Host ## Use bastion-... [19:29:55] (03PS9) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [19:30:54] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3531929 (10Dzahn) 05stalled>03Open @demon Last call before we are actually killing iridium and wiping the disk? [19:33:44] jynus: account for error is basically done via configuring a higher value. I'm not sure how better it is too have something programmatic or how that would look though. [19:34:23] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531938 (10RobH) In reviewing access, it seems that @diego's ssh key for labs and production are the same, and that isn't ok. I'm merging changes t... 
[19:34:29] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3531939 (10RobH) 05Resolved>03Open [19:34:44] if mediawiki was running on a proper application server, because heartbeat is 0-aligned, all servers could be checked once per second at 0.1 seconds [19:34:55] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3531945 (10Eevans) Meanwhile, in codfw rack B: CPU iowait is quite high for 2002 and 2007 (Samsung drives)... {F9100622} As a result, GC (and re... [19:35:00] :00:1 seconds [19:36:46] (03PS1) 10RobH: diego's ssh key change [puppet] - 10https://gerrit.wikimedia.org/r/372439 (https://phabricator.wikimedia.org/T172891) [19:37:43] (03CR) 10RobH: [C: 032] diego's ssh key change [puppet] - 10https://gerrit.wikimedia.org/r/372439 (https://phabricator.wikimedia.org/T172891) (owner: 10RobH) [19:38:30] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3191313 (10Platonides) Maybe worth adding a link to https://www.ssllabs.com/sslt... [19:42:09] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3531976 (10Johan) I imagine people who use IE8 on XP mainly fall into two catego... [19:44:16] (03PS23) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:44:33] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[researchers_ensure_members] [19:50:02] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[researchers_ensure_members] [19:52:01] ^ robh fallout from the ssh key issue? [19:52:11] ? [19:52:13] whyyyy [19:52:13] (03CR) 10Dzahn: [C: 031] "ok, i guess you are right and it should be a separate parameter in that case" [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [19:52:16] i just absented the key... [19:52:21] but yeah has to be that... [19:52:35] so the user used the same key in labs and production [19:53:02] checking. [19:53:03] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[researchers_ensure_members],Exec[analytics-privatedata-users_ensure_members] [19:53:09] well, shit [19:53:14] ohh, i know what i did [19:53:24] imma blame sudafed, i know fixing now [19:53:40] absented but didnt remove from the user groups [19:53:42] stupid of me. [19:54:06] the linter should have caught that [19:54:32] (03PS1) 10RobH: removing diego from other groups [puppet] - 10https://gerrit.wikimedia.org/r/372442 [19:54:47] this poor user is cursed to have a dozen patches for their access [19:55:09] chasemp: should i file at task for the linter to check that? 
[19:55:20] (03CR) 10RobH: [C: 032] removing diego from other groups [puppet] - 10https://gerrit.wikimedia.org/r/372442 (owner: 10RobH) [19:56:08] Ok, its merged and running manually on stat1003 to test [19:56:45] hrmm [19:56:47] still has issue [19:56:48] Notice: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]: Dependency Exec[researchers_ensure_members] has failures: true [19:56:58] but i rmeoved him from the user, but it showed him in the puppet run [19:56:59] robh: sure, makes sense to me [19:57:13] i mean, the output of the error shows user 'diego' in it [19:57:26] its applying different config versions on second run [19:57:35] robh: multiple masters delay? [19:57:37] Info: Applying configuration version '1502999759' [19:57:40] Info: Applying configuration version '1502999826' [19:57:43] I can look if you want [19:57:44] eyah i think so [19:57:45] done [19:57:47] its fixed [19:57:49] kk [19:57:53] the different versions was the clue [19:57:57] just master delays [19:58:09] ok, it should clear up on all the affected hosts for those two groups [19:58:12] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:58:13] (analytics and stats) [19:59:07] (03Abandoned) 10Aaron Schulz: Enable $wgEnableWANCacheReaper for testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339245 (owner: 10Aaron Schulz) [20:00:45] * thcipriani rolls wmf.14 to wikipedia wikis [20:01:19] 10Operations, 10DBA, 10Wikidata, 10Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3532017 (10Bugreporter) [20:04:43] (03PS1) 10Thcipriani: all wikis to 1.30.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372444 [20:04:45] (03CR) 10Thcipriani: [C: 032] all wikis to 1.30.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372444 (owner: 10Thcipriani) [20:06:11] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.14 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/372444 (owner: 10Thcipriani) [20:06:21] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372444 (owner: 10Thcipriani) [20:06:48] (03PS5) 10Dzahn: Phabricator: Only send logmail on prod not labs [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [20:07:31] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.14 [20:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:34] * AaronSchulz sees no issues [20:13:02] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:13:25] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532052 (10BBlack) Hopefully in the former case, they'll complain to their IT de... [20:14:02] AaronSchulz: yeah, so far so good :) [20:21:33] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:23:02] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2138106 [20:24:26] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532086 (10BBlack) Testing updated HTML with some translations and a translate l... 
[20:24:33] * AaronSchulz wanders [20:25:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [20:35:54] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532113 (10BBlack) Update: noticed I had en-US firefox links in all of the trans... [20:36:29] (03PS5) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [20:37:33] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532116 (10MoritzMuehlenhoff) Looks good to me, go for it :-) [20:39:29] (03PS1) 10Catrope: Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 [20:39:37] (03PS2) 10Catrope: Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 [20:40:06] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3191313 (10Krinkle) >>! In T163251#3532086, @BBlack wrote: > Testing updated HTM... 
[20:40:41] (03PS4) 10Ayounsi: Icinga: add check_bfd check (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/370103 [20:41:15] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 [20:41:22] We have a HTTP 500 spike [20:41:53] https://grafana.wikimedia.org/dashboard/db/production-logging?refresh=5m&orgId=1 seems to confirm increase in CRITICAL log messages from MediaWIki [20:42:11] odd, total request rate is down a bit [20:43:13] (03CR) 10Ayounsi: "> The changes in PS2 shouldn't be needed. IIRC, there are symlinks" [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [20:44:37] * AaronSchulz sees nothing interesting in logstash [20:44:54] I've started to see this https://phabricator.wikimedia.org/T173541 [20:44:56] 4/includes/specials/SpecialNewpages.php: Call to a member function serialize() on a non-object (null) [20:45:01] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532129 (10Platonides) >>! In T163251#3532113, @BBlack wrote: > Update: noticed... [20:45:04] Indeed [20:45:46] alright I'm going to rollback for T173541 [20:45:46] T173541: Call to a member function serialize() on a non-object (null) - https://phabricator.wikimedia.org/T173541 [20:45:57] Could just disable feeds temporarily [20:46:04] Rolling back and forth multiple times a day is scarier to me [20:46:40] But you're choo-choo man today, your call :) [20:47:00] RainbowSprinkles: disable feeds sounds fine to me, I don't know how to do that though :) [20:47:18] $wgFeedSomethingOrAnother [20:47:19] https://phabricator.wikimedia.org/source/mediawiki/history/master/includes/specials/SpecialNewpages.php [20:47:21] $wgFeeds? 
[20:47:38] Or, you know, revert the last change to that file [20:48:39] any chance we have a sampled log of frontend javascript errors anywhere? Trying to track down what happened to some eventlogging that collected <1% of what it should have [20:49:14] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532141 (10Krinkle) >>! In T163251#3532129, @Platonides wrote: >> * Prefix each... [20:49:53] I guess $revision->getContent() should be checked for null (which can happen) [20:49:57] Wait, why are we disabling features in production? [20:50:03] We aren't :p [20:50:10] We could've disabled Wikidata last week, but we didn't. [20:50:20] I just suggested it, we didn't :p [20:50:25] Okay :) [20:51:00] I'm just worried we're trying to side-step our new and proud zero-regression policy by still moving forward partially (e.g. some wikis, some features), which seems like a slippery slope. [20:51:21] I prefer less features all around but hey I'm weird anyway :p [20:51:37] at least the two week difference (wmf.11 and wmf.13) is probably something we should never do again, given our code is not written with that in mind. E.g. cross-wiki interactions are coded for 1 branch of compat. [20:51:58] Oh trust me I know. [20:52:09] also APC, memcached, efficiency are not provisioned for three versions etc. 
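The null check suggested at 20:49:53 for T173541 follows a common guard pattern, sketched here in Python rather than in the PHP of the real code. The `Content` and `Revision` classes and `feed_item_text` are hypothetical stand-ins for illustration, not MediaWiki's API. The sketch assumes only what the log says: the content lookup can return null, and calling `serialize()` on that null is what fails.

```python
from typing import Optional


class Content:
    """Hypothetical stand-in for a revision's content object."""

    def __init__(self, text: str) -> None:
        self.text = text

    def serialize(self) -> str:
        return self.text


class Revision:
    """Hypothetical stand-in; get_content() may return None."""

    def __init__(self, content: Optional[Content]) -> None:
        self._content = content

    def get_content(self) -> Optional[Content]:
        return self._content


def feed_item_text(revision: Revision) -> str:
    """Return the serialized content, or an empty string when the
    revision has no retrievable content, instead of failing."""
    content = revision.get_content()
    if content is None:
        return ""
    return content.serialize()
```

Whether an empty string is the right fallback for a feed entry is a design choice. The actual fix discussed in the channel was a partial revert of the earlier cleanup, not necessarily this guard.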
[20:52:17] (and soon HHVM TC) [20:52:18] > at least the two week difference (wmf.11 and wmf.13) is probably something we should never do again [20:52:21] +100 [20:52:21] Anyway: we should either revert that most recent change to NewPages or throw up a fix [20:52:24] (on subject) [20:52:26] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532160 (10BBlack) Thanks! Updated for all the above as best I can (I'm not 100... [20:52:35] Krinkle: To be fair, we didn't run three [20:52:43] Ah, fair point. [20:52:49] wmf.11 and wmf.13 ran, wmf.12 was disabled as soon as migration to .13 finished [20:52:50] yeah, we didn't run wmf.12 at that point. [20:52:51] So...briefly [20:53:15] * AaronSchulz makes a patch [20:53:22] AaronSchulz: Tyvm [20:53:24] cc thcipriani ^ [20:54:33] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3532166 (10Halfak) [20:54:34] Tbh, I'm a little more upset we allowed Wikidata to go broken (pretty heavily I might add) for several weeks when patches mostly already existed in master and just needed to be backported :( [20:54:34] Backporting shouldn't be that hard :\ [20:55:01] 10Operations, 10Wikimedia-Site-requests, 10User-MarcoAurelio, 10Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3532167 (10MarcoAurelio) Please also note that one of the accounts whose rename is in progress has wrote to me on my talk page at eswik... [20:58:11] RainbowSprinkles: Wikidata build.. 
[20:58:17] glad I brought my laptop to the café [20:58:30] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3532176 (10RobH) a:03Shilad Assigned to @shilad for them to sign L3, and provide preferred shell username, wikitech u... [20:59:44] (03PS1) 10BBlack: 3DES Deprecation: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) [21:00:04] MaxSem: Dear anthropoid, the time has come. Please deploy Community Tech (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T2100). [21:00:10] (03CR) 10jerkins-bot: [V: 04-1] 3DES Deprecation: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:00:35] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532186 (10BBlack) patch above is the same changes as a re... [21:00:50] (03CR) 10Mobrovac: Increase max kafka message size for changeprop and kafka main (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [21:01:51] RainbowSprinkles: btw, is the new policy that blockers to wmf branches are removed as subtask when they're resolved? [21:02:06] Or was that just done in a few cases where it was resolved enough to unblock, but not closed? [21:02:14] Krinkle: I thought we had a tested javascript code ready to rip off for it :) [21:02:31] Krinkle: mostly the latter, but this is a weird undocumented place with some corner cases [21:02:45] Okay [21:02:59] Just making sure I won't follow suit with removing sub tasks whenever I close one.
[21:02:59] so, speak up if something seems weird/wrong, basically :) [21:03:05] (03PS1) 10MaxSem: Enable LoginNotify everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372450 [21:03:16] nah, if closed doesn't need to be removed [21:04:01] FWIW I tried to touch all the subtasks with my assumptions this week in case I was misunderstanding the status of any of these open tasks. They all *seem* resolved as far as the train is concerned, but most have followup tasks. [21:04:55] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3532195 (10Eevans) To summarize a brief discussion with @mobrovac and @Pchelolo: * The two slower nodes in rack b (2002 & 2007) are somewhat trou... [21:05:25] thanks thcipriani, that should be part of our SOP [21:07:07] (03CR) 10Platonides: 3DES Deprecation: internationalize and update warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:09:44] Niharika, https://grafana-admin.wikimedia.org/dashboard/db/loginnotify [21:10:33] (03CR) 10MarcoAurelio: [C: 04-1] "I need to manually rebase and add the wiki to the s3.dblist which I forgot." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio) [21:11:11] (03PS2) 10BBlack: 3DES Deprecation: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) [21:11:25] (03CR) 10BBlack: 3DES Deprecation: internationalize and update warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:11:40] (03CR) 10jerkins-bot: [V: 04-1] 3DES Deprecation: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:14:52] (03CR) 10Niharika29: [C: 032] "LGTM. YAY!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372450 (owner: 10MaxSem) [21:16:19] (03Merged) 10jenkins-bot: Enable LoginNotify everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372450 (owner: 10MaxSem) [21:16:28] (03CR) 10jenkins-bot: Enable LoginNotify everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372450 (owner: 10MaxSem) [21:16:32] PROBLEM - Check Varnish expiry mailbox lag on cp1048 is CRITICAL: CRITICAL: expiry mailbox lag is 2004511 [21:16:35] (03PS3) 10BBlack: Deprecation of 3DES: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) [21:17:45] !log cp1063: varnish backend restart (mailbox lag) [21:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:58] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable LoginNotify finally! 
Yippeeeeee https://gerrit.wikimedia.org/r/#/c/372450/ (duration: 00m 45s) [21:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:36] (03CR) 10Dzahn: [C: 032] Phabricator: Only send logmail on prod not labs [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [21:21:42] Krinkle: so should we revert https://gerrit.wikimedia.org/r/#/c/150210/ ? Or do I rollback and give folks time for patch fixes? [21:22:24] thcipriani: That patch does two things, fix a bug, and (wrongly) try to clean up unrelated code to use the same new pattern. [21:22:44] AaronSchulz and I have now determined that cleanup was wrong and also generally not needed, as it needed to be different for a reason. [21:22:52] So we can revert that portion of it, using Aaron's commit as a starting point. [21:23:02] RECOVERY - Check Varnish expiry mailbox lag on cp1063 is OK: OK: expiry mailbox lag is 0 [21:23:49] (03PS1) 10MaxSem: LoginNotify requires Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372456 [21:24:52] (03CR) 10Niharika29: [C: 032] LoginNotify requires Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372456 (owner: 10MaxSem) [21:25:03] (03CR) 10Dzahn: "no wait, i will follow-up on this.." 
[puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [21:26:07] ftr: https://wikitech.wikimedia.org/w/index.php?title=Deployments%2FHolding_the_train&action=historysubmit&type=revision&diff=1768185&oldid=1767105 and https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team%2FRoles&type=revision&diff=2539163&oldid=2480097 [21:26:21] (03Merged) 10jenkins-bot: LoginNotify requires Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372456 (owner: 10MaxSem) [21:26:31] (03CR) 10jenkins-bot: LoginNotify requires Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372456 (owner: 10MaxSem) [21:28:49] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable LoginNotify on sites without Echo too. :|https://gerrit.wikimedia.org/r/#/c/372456/ (duration: 00m 44s) [21:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:07] (03CR) 10Dzahn: "there is already this:" [puppet] - 10https://gerrit.wikimedia.org/r/371927 (https://phabricator.wikimedia.org/T173297) (owner: 10Paladox) [21:30:59] (03PS4) 10BBlack: Deprecation of 3DES: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) [21:31:32] (03CR) 10BBlack: "Fixed ZH labeling (and confirmed the others) based on https://www.wikipedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:32:49] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532242 (10BBlack) After a couple of other minor nits, goi... 
[21:34:00] (03PS5) 10BBlack: Deprecation of 3DES: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) [21:35:29] (03CR) 10BBlack: [C: 032] Deprecation of 3DES: internationalize and update warning [puppet] - 10https://gerrit.wikimedia.org/r/372448 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [21:36:16] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3532244 (10Shilad) @RobH, I've signed the L3. My wikitech username is "Shilad Sen" and my preferred shell username is s... [21:37:05] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3532247 (10Shilad) a:05Shilad>03RobH [21:37:12] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 67 failures. Last run 2 minutes ago with 67 failures. Failed resources (up to 3 shown): Cron[carbon-cache@a-cleanup],Cron[carbon-cache@h-cleanup],Cron[carbon-cache@f-cleanup],Cron[graphite-eventstreams] [21:39:27] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3532268 (10Eevans) >>! In T169939#3531892, @Eevans wrote: > restbase2001.codfw.wmnet has been re-imaged, but there are a couple of issues yet to re... 
[21:48:09] (03PS1) 10Dzahn: phabricator: Only send logmail on prod not labs, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/372463 (https://phabricator.wikimedia.org/T173297) [21:48:35] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Only send logmail on prod not labs, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/372463 (https://phabricator.wikimedia.org/T173297) (owner: 10Dzahn) [21:50:31] (03PS2) 10Dzahn: phabricator: Only send logmail on prod not labs, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/372463 (https://phabricator.wikimedia.org/T173297) [21:58:48] (03PS1) 10BBlack: Deprecation of 3DES: Bump pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/372467 [21:59:17] (03PS2) 10BBlack: Deprecation of 3DES: Bump pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/372467 (https://phabricator.wikimedia.org/T163251) [22:03:08] (03CR) 10Dzahn: [C: 032] phabricator: Only send logmail on prod not labs, pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/372463 (https://phabricator.wikimedia.org/T173297) (owner: 10Dzahn) [22:04:32] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:07:10] !log restarting pdfrender on scb100* nodes [22:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:37] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3532358 (10Krinkle) >>! In T163251#3532186, @BBlack wrote:... 
[22:09:04] !log ppchelko@tin Started deploy [changeprop/deploy@2c553a6]: Lower the concurrency for transcludes to decrease cassandra load during cluster reshaping [22:09:05] (03CR) 10BBlack: [C: 032] Deprecation of 3DES: Bump pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/372467 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [22:09:13] (03PS3) 10BBlack: Deprecation of 3DES: Bump pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/372467 (https://phabricator.wikimedia.org/T163251) [22:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:17] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/includes/specials/SpecialNewpages.php: [[gerrit:372465|Restore the newFromId() approach in SpecialNewpages::feedItemDesc]] T173541 (duration: 00m 46s) [22:09:25] (03CR) 10BBlack: [V: 032 C: 032] Deprecation of 3DES: Bump pageview replacement to 5% [puppet] - 10https://gerrit.wikimedia.org/r/372467 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [22:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:30] T173541: Call to a member function serialize() on a non-object (null) - https://phabricator.wikimedia.org/T173541 [22:09:52] * Platonides notes that varnish frontend is running on a teapot [22:10:19] !log ppchelko@tin Finished deploy [changeprop/deploy@2c553a6]: Lower the concurrency for transcludes to decrease cassandra load during cluster reshaping (duration: 01m 14s) [22:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:25] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3532373 (10GWicke) At least one instance on scb100* broke again about 20 hours ago:...
[22:12:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [22:13:13] only for certain users :) [22:13:39] (03PS1) 10Eevans: Use absolute paths for `data_file_directories` [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) [22:15:06] was it chosen for the rfc, or is a 41x the most suited code? [22:16:00] well 4xx in general is suitable, and none of the other 4xx codes seem completely correct for this case (or would confuse stats with our other usages of that code) [22:16:14] so it was either make up a new "unused" one, or use Teapot I guess :) [22:18:13] (03CR) 10Paladox: "> please amend commit message. what is this about?" [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 (owner: 10Paladox) [22:22:59] (03PS2) 10Eevans: Use absolute paths for `data_file_directories` [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) [22:27:48] (03CR) 10Eevans: "[PC](http://puppet-compiler.wmflabs.org/7527)" [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [22:32:17] (03CR) 10Ppchelko: Increase max kafka message size for changeprop and kafka main (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [22:32:23] (03Abandoned) 10Krinkle: webperf: Fix broken example value for eventlogging_path [puppet] - 10https://gerrit.wikimedia.org/r/370138 (owner: 10Krinkle) [22:33:05] (03PS24) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [22:35:49] (03CR) 10Ppchelko: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/7528/" [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: 
Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170817T2300). Please do the needful. [23:00:04] RoanKattouw and mooeypoo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:13] I can SWAT [23:06:55] o7 here and ready [23:07:00] :) [23:07:28] i'll sneak one in here in a bit as well...jenkins is chuggin along [23:07:31] How bad is it that I get annoyed o7 is a salute with the wrong hand, unless you imagine it from the back... [23:08:08] I'm here too [23:08:22] ┌o ... hrm that doesn't really work [23:08:41] (03PS3) 10Thcipriani: Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 (owner: 10Catrope) [23:08:52] mooeypoo: does it also bother you the salute is with the hand over the head? :P [23:08:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 (owner: 10Catrope) [23:09:00] Yeah, there's not a good way to do this properly unless you imagine you're looking at the person's back [23:09:21] over the head gets you in less trouble than using the wrong hand... 
[23:10:25] (03Merged) 10jenkins-bot: Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 (owner: 10Catrope) [23:10:34] (03CR) 10jenkins-bot: Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372446 (owner: 10Catrope) [23:11:26] RoanKattouw: your change is live on mwdebug1002, check please (and we are now fully on wmf.14 :)) [23:12:52] thcipriani: Looks good [23:13:00] ok, going live [23:14:59] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372446|Reapply "Remove temporary wgStructuredChangeFiltersEnableExperimentalViews setting"]] (duration: 00m 45s) [23:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:18] ^ RoanKattouw live everywhere [23:15:43] Live dangerously experimentally, RoanKattouw [23:15:50] Thanks thcipriani [23:16:02] yw :) [23:17:56] mooeypoo: your change is live on mwdebug1002, check please [23:18:02] * mooeypoo checks [23:19:08] thcipriani, looks good! [23:19:18] going live [23:19:31] thcipriani: could you 'touch' the file before syncing out as well for mine? Not entirely sure what went wrong last time but random guess at the moment is that the local storage caching of javascript files decided to not actually download the new version [23:19:36] no clue if touch affects that anyways ... [23:20:09] ebernhardson: yeah, I can give that a shot [23:20:12] we ended up collecting 100 sessions per day instead of 15k ... just a little off :) [23:21:57] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/resources/src/mediawiki.rcfilters/dm/mw.rcfilters.dm.FilterGroup.js: SWAT: [[gerrit:372449|RCFilters: Fix validation for single_option groups]] T173303 (duration: 00m 44s) [23:22:05] ebernhardson, it's a normalized spherical estimation, in a vacuum. We do it all the time in physics.
[23:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] T173303: Can't change number of results selector after you click trashcan - https://phabricator.wikimedia.org/T173303 [23:22:15] ^ mooeypoo live now [23:22:21] \o/ thanks thcipriani [23:22:26] yw :) [23:24:08] ebernhardson: touched and pulled over to mwdebug1002, if there're things to check there [23:26:00] thcipriani: it reports a different hash, so it's a start i suppose :) [23:28:31] ebernhardson: cool, good to go live? [23:29:16] thcipriani: yup [23:29:23] * thcipriani does [23:31:16] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:372476|Revert "Disable cirrus MLR ab test"]] (duration: 00m 44s) [23:31:23] ^ ebernhardson live now [23:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:40] thanks! it'll take a few to make it through the varnish cache, but i'll keep an eye out [23:31:52] okie doke [23:32:49] * greg-g hands thcipriani a $drinkOfChoice [23:33:33] * thcipriani quietly weeps in $drinkOfChoice [23:51:35] (03PS1) 10Krinkle: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) [23:52:01] (03CR) 10jerkins-bot: [V: 04-1] webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [23:55:54] (03PS2) 10Krinkle: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903)