[00:12:04] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3543488 (Krinkle)
[00:17:31] (PS2) BBlack: wikimedia.org CAA: split issue-vs-issuewild, document clearer [dns] - https://gerrit.wikimedia.org/r/373163
[00:17:33] (PS4) BBlack: Setting namecheap/comodo CAA records [dns] - https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: RobH)
[00:18:36] (CR) BBlack: [C: 2] wikimedia.org CAA: split issue-vs-issuewild, document clearer [dns] - https://gerrit.wikimedia.org/r/373163 (owner: BBlack)
[00:18:39] (CR) BBlack: [C: 2] Setting namecheap/comodo CAA records [dns] - https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: RobH)
[00:21:12] (PS1) Smalyshev: Add list for wikis that would have categories dumped into RDF [mediawiki-config] - https://gerrit.wikimedia.org/r/373167 (https://phabricator.wikimedia.org/T173892)
[00:23:01] (CR) jerkins-bot: [V: -1] Add list for wikis that would have categories dumped into RDF [mediawiki-config] - https://gerrit.wikimedia.org/r/373167 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[00:26:38] (PS2) Smalyshev: Add list for wikis that would have categories dumped into RDF [mediawiki-config] - https://gerrit.wikimedia.org/r/373167 (https://phabricator.wikimedia.org/T173892)
[00:28:09] (CR) jerkins-bot: [V: -1] Add list for wikis that would have categories dumped into RDF [mediawiki-config] - https://gerrit.wikimedia.org/r/373167 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[00:30:50] (PS3) Smalyshev: Add list for wikis that would have categories dumped into RDF [mediawiki-config] - https://gerrit.wikimedia.org/r/373167 (https://phabricator.wikimedia.org/T173892)
[01:04:20] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1503450256 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5143431 keys, up 4 minutes 14 seconds - replication_delay is 1503450256
[01:04:40] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[01:05:21] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5141591 keys, up 5 minutes 15 seconds - replication_delay is 0
[01:05:41] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 5141722 keys, up 5 minutes 34 seconds - replication_delay is 0
[01:58:26] (PS1) Niedzielski: WIP (DO NOT MERGE): pagePreviews: remove invalidated popup sampling rate variables [mediawiki-config] - https://gerrit.wikimedia.org/r/373171 (https://phabricator.wikimedia.org/T171853)
[01:59:02] (PS1) MaxSem: Add a log channel for LoginNotify [mediawiki-config] - https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888)
[01:59:51] (CR) Niedzielski: [C: -1] WIP (DO NOT MERGE): pagePreviews: remove invalidated popup sampling rate variables [mediawiki-config] - https://gerrit.wikimedia.org/r/373171 (https://phabricator.wikimedia.org/T171853) (owner: Niedzielski)
[02:03:53] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231207 (Shizhao) >>! In T133410#3436066, @Tgr wrote: > I believe templates in Flow comments/summaries would be broken in both edit and...
[02:12:47] (PS6) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582)
[02:12:59] (CR) Ebe123: Run Lilypond from Firejail (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582) (owner: Ebe123)
[02:13:10] (PS7) Ebe123: Run Lilypond from Firejail [mediawiki-config] - https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582)
[02:32:33] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 08m 03s)
[02:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:49] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3543641 (Tgr) @Shizhao I have been referring to T164791 which has been resolved since then.
[02:45:25] Shilad: I think I see you in the log but can you try once more?
[02:45:58] Sure! I'm starting to wonder if my ssh-keygen -t dsa really generated a dss key, though, for some reason...
[02:47:01] Just tried again.
[02:47:10] this log is noisy, hard to tell which of these messages matter
[02:49:19] the public key that was merged in gerrit for you is
[02:49:20] ssh-dss
[02:49:21] AAAAB3NzaC1kc3MAAACBAKBspbywXptKB4djp8jYjfk0fAQUAhsEM03zvRhuCpIwB5BYQl2mIeIwADHqM5DA0plGtFZLLwZvFR/LpHIiK3zcDuvz5N6LBkTulKQ5TrjnMkAeTk1SA900u6jCoKitF7j6ZO3Q4diLgFSY5F4EJI80GiWkOx+JAnzhS3kHbkibAAAAFQDzFcnzFRA7bawBb0ZVhCYDU2v+2wAAAIBzWSGg2rEvV0UT+cDzGZMl6LGWT+3oC1pJviW8vilOhIKvdbXYeQeGpqpJjxZToN/5Ok+P0kAMNTacdPWyYiDDepb+zgB9tbW+DPB3HgH2y6u7SMNWnOXK+C9VAT62LEX4zQsD41NC3kMijDjLuAzAkyKPAVmgtFWCXpYDDU/+zgAAAIAW66EVt/6tp7o6Glf
[02:49:21] U3TS3JnYLA3cFzWqmbuHuV2dFhW3h7OAbmCRivhOVuhJuu56C/AJeKdGzIA10p/eo39YXUX3iOjUTO8/YFFAAnh9m4Fb1YDTMG3JzwBi8jT6r8iOm9414ITX48y9zzD3smXku3o3At/w5Up6rl/lDeywI4g== a558989@600308a4c4c6
[02:49:33] is that actually the public key to go with the private key you're using?
[02:49:49] If they are different types then that would be an issue :)
[02:50:10] Shilad:
[02:50:11] ^
[02:50:12] That's it! But in the private key file it says it's DSA.
[02:51:19] AHA! I changed the ssh-dss to ssh-dsa in the public key and all is good.
[02:51:31] huh
[02:51:37] I definitely did NOT manually create or edit that file, though. So weird!
[02:51:43] wait, what do you mean it's all good?
[02:52:17] because, the public key lives on the bastion — I don't see how you changing the local file could do anything
[02:52:23] The login was fine. It must have been my client thinking the key type wasn't valid for the server when it actually was.
[02:52:44] So I think my client skipped trying the key at all?
[02:53:51] huh...
[02:54:00] I don't know why changing your local copy of the public key would do anything
[02:54:04] but if it works, it works :/
[02:55:11] Thanks for your help and sorry for the confusion.
[02:56:31] no problem, I hope it keeps working :)
[03:08:30] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 15m 11s)
[03:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:25] andrewbogott:
[03:14:13] While I could successfully login to bastion, my attempts to get to stat1005.eqiad.wmnet have so far failed.
[03:14:58] I can see the key being offered to stat1005 in the ssh log but it doesn't look like it's accepted.
[03:15:36] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 23 03:15:36 UTC 2017 (duration 7m 6s)
[03:15:38] I am pretty suspicious about the keytype for the ssh key, so I may trash that key, create a new rsa key and request it be added instead.
[03:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:43] Operations, Ops-Access-Requests, Analytics, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3543672 (Shilad) @herron I am having some trouble logging in. I can get to bastion but not beyond. I'm suspicious that the key I gav...
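Editor's note: the key-type suspicion above can be checked directly, since an OpenSSH public key line carries its algorithm name twice: once as the leading text field and once inside the base64 blob, as a length-prefixed string. The sketch below compares the two. The function name `key_types` is hypothetical, and the sample lines use only the first 16 base64 characters of the key pasted earlier plus a placeholder comment; it is an illustration, not what anyone in the channel ran.

```python
import base64
import struct
from typing import Tuple


def key_types(pubkey_line: str) -> Tuple[str, str]:
    """Return (declared, embedded) algorithm names for one public key line.

    `declared` is the first whitespace-separated field of the line;
    `embedded` is the length-prefixed string at the start of the decoded blob.
    """
    declared, blob_b64 = pubkey_line.split()[:2]
    blob = base64.b64decode(blob_b64)
    # The blob starts with a 4-byte big-endian length, then the name itself.
    (name_len,) = struct.unpack(">I", blob[:4])
    embedded = blob[4:4 + name_len].decode("ascii")
    return declared, embedded


# Only the key prefix is used here; it is enough to hold the algorithm name.
declared, embedded = key_types("ssh-dss AAAAB3NzaC1kc3MA user@example")
print(declared, embedded, declared == embedded)

# A hand-edited first field no longer matches what the blob says.
declared, embedded = key_types("ssh-dsa AAAAB3NzaC1kc3MA user@example")
print(declared, embedded, declared == embedded)
```

A mismatch between the two names would suggest the text field was edited by hand; a match on `ssh-dss` means the key really is a DSA key, whatever the local file says.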
[03:26:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 711.97 seconds
[04:06:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.53 seconds
[04:38:36] (PS1) Jeremyb: shiladsen shell: try RSA key instead, add expiry [puppet] - https://gerrit.wikimedia.org/r/373177 (https://phabricator.wikimedia.org/T171988)
[04:39:15] (CR) Dzahn: [C: -1] "@Paladox it seems meanwhile it has been decided that we'll stop using the deb altogether for Gerrit, so this probably can be abandoned" [debs/gerrit] - https://gerrit.wikimedia.org/r/353766 (owner: Paladox)
[04:39:27] Operations, Ops-Access-Requests, Analytics, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (jeremyb) that's maybe exactly the problem. your debug log says > debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH...
[04:46:14] Operations, Ops-Access-Requests, Analytics, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (Dzahn) I can confirm this. The reason is the key type DSS. From auth.log on stat1005: 85335 Aug 23 03:10:05 stat1005 sshd[...
[04:48:15] (CR) Dzahn: "yes, the user can currently not login on stat1005 because of the DSS key. "stat1005 sshd[28673]: userauth_pubkey: key type ssh-dss not in " [puppet] - https://gerrit.wikimedia.org/r/373177 (https://phabricator.wikimedia.org/T171988) (owner: Jeremyb)
[04:57:56] (CR) Jeremyb: "followup: https://gerrit.wikimedia.org/r/373177" [puppet] - https://gerrit.wikimedia.org/r/373115 (https://phabricator.wikimedia.org/T171988) (owner: Herron)
[05:24:05] Operations, DBA, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3543712 (Marostegui) Can you give more details about this rename? Number of edits? Wikis with the biggest number of edits? Thanks
[05:25:42] Operations, DBA, Wikimedia-Site-requests, User-MarcoAurelio: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834#3541365 (Marostegui) Most of the edits are on arwiki, which belongs to s7. When would you like to do this rename?
[06:22:30] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:45:20] Operations, Ops-Access-Requests, Analytics, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (MoritzMuehlenhoff) Please don't add new DSA keys, we're down to two keys of that kind and I'm planning to remove server-sid...
[06:55:13] Operations, DBA, Wikimedia-Site-requests, User-MarcoAurelio: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834#3543759 (MarcoAurelio) @Marostegui I'll see you in IRC. I'm avalaible today.
[06:55:25] !log upgrading remaining app servers in to luasandbox 2.0.13
[06:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:45] marostegui: renaming cat here :)
[07:03:46] TabbyCat: o/
[07:04:26] TabbyCat: You want to do https://phabricator.wikimedia.org/T173834 now?
[07:04:41] marostegui: if it is fine with you, then yes
[07:04:46] sure, give me a sec
[07:04:55] I have to usurp an account first so give me a sec as well
[07:06:10] I am all set from my side
[07:09:41] sorry was busy doing other renames
[07:09:49] No worries :)
[07:09:53] I'll usurp the account and when I'm done we'll start
[07:09:59] sure, just ping me
[07:11:50] marostegui: I'm ready to rename
[07:12:10] !log Global rename: Opdire657 → Sakiv - T173834
[07:12:12] Go for it!
[07:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:22] T173834: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834
[07:13:11] Jobs to rename Opdire657 to Sakiv have been queued on .
[07:13:24] You have the meta progress url for me? :)
[07:13:39] That space between on and the dot... /me needs to fix that...
[07:14:04] marostegui: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Sakiv
[07:14:09] Thanks!
[07:14:39] it's doing arwiki now afaics
[07:17:42] arwiki is done, seeing some lag now but nothing too worrying
[07:17:53] I will re-enable safe transactions back in the slower slaves in a few seconds
[07:26:48] I wish there was a console output view for when we rename people marostegui
[07:27:33] I need to reload GlobalRenameProgress to know how the rename is going
[07:31:06] yeah
[07:31:08] It is not ideal
[07:32:02] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:41:03] (PS1) Muehlenhoff: Add expiry date of MOU and point of contact for shiladsen [puppet] - https://gerrit.wikimedia.org/r/373235
[07:41:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
[07:41:39] (CR) Muehlenhoff: "JFTR, this was not complete; since this is a time-limited researcher account it should have an expiry date and point of contact, I've push" [puppet] - https://gerrit.wikimedia.org/r/373115 (https://phabricator.wikimedia.org/T171988) (owner: Herron)
[07:41:47] (CR) Muehlenhoff: [C: 2] Add expiry date of MOU and point of contact for shiladsen [puppet] - https://gerrit.wikimedia.org/r/373235 (owner: Muehlenhoff)
[07:42:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 28 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:42:27] (CR) Volans: "@Krinkle: given your comment I took the liberty to give it a pass and add some comments inline. Disclaimer: I've almost zero context on th" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[07:43:48] we're nearly done according to global rename progress; not sure if there will be more background jobs still running after we finish
[07:44:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[07:44:30] So far everything looks fine from my side
[07:47:10] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 5 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:48:24] marostegui: GRP says it's finished, what the oracle (a.k.a. db logs) says?
[07:48:31] XDDD
[07:48:44] From my side it is fine, I think we can close it
[07:49:04] perfe :)
[07:49:46] Operations, DBA, Wikimedia-Site-requests, User-MarcoAurelio: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834#3543838 (MarcoAurelio) Open>Resolved Done.
[07:50:40] TabbyCat: https://phabricator.wikimedia.org/T173859 ?
[07:50:50] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1954 bytes in 0.149 second response time
[07:51:18] Steinsplitter: meow-ping
[07:51:39] TabbyCat: is that something you can do too? so we can get rid of it?
[07:51:52] marostegui: let me check
[07:54:11] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:40] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:55:50] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.149 second response time
[07:55:59] Operations, DBA, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3542365 (MarcoAurelio) * `Papa1234` has a total global edit count of 123,428, of which 120,674 are on dewiki (s5) * dewiki has flaggedrevs, according t...
[07:56:41] marostegui: dewiki people is a bit "posesivos" of their own stuff; I'd let him do that to avoid drama later.
[07:56:59] left some comments there
[07:57:07] Operations, DBA, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3543846 (Marostegui) >>! In T173859#3543842, @MarcoAurelio wrote: > * `Papa1234` has a total global edit count of 123,428, of which 120,674 are on dewi...
[07:58:45] Operations, DBA, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3543864 (MarcoAurelio) Open>stalled p:Triage>Low Okay, therefore I am marking this as stalled/blocked pending resolution of the indexin...
[07:59:42] Operations, DBA, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3543870 (MarcoAurelio)
[08:06:38] (PS2) Filippo Giunchedi: role: collect jmx_exporter metrics from restbase test_cluster [puppet] - https://gerrit.wikimedia.org/r/372845 (https://phabricator.wikimedia.org/T173490)
[08:06:55] (CR) jerkins-bot: [V: -1] role: collect jmx_exporter metrics from restbase test_cluster [puppet] - https://gerrit.wikimedia.org/r/372845 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[08:08:33] (PS3) Filippo Giunchedi: role: collect jmx_exporter metrics from restbase test_cluster [puppet] - https://gerrit.wikimedia.org/r/372845 (https://phabricator.wikimedia.org/T173490)
[08:08:49] (PS1) Volans: Add CHANGELOG file including previous releases [software/cumin] - https://gerrit.wikimedia.org/r/373250
[08:08:51] (PS1) Volans: Updated documentation [software/cumin] - https://gerrit.wikimedia.org/r/373251
[08:09:25] (CR) Filippo Giunchedi: [C: 2] role: collect jmx_exporter metrics from restbase test_cluster [puppet] - https://gerrit.wikimedia.org/r/372845 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[08:14:08] (PS1) Marostegui: db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661)
[08:15:38] (CR) jerkins-bot: [V: -1] db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:16:18] (PS2) Marostegui: db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661)
[08:18:31] !log upgrading remaining job runners to luasandbox 2.0.13
[08:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:13] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:19:27] (PS1) Filippo Giunchedi: statsite: don't track statsd client traffic [puppet] - https://gerrit.wikimedia.org/r/373253 (https://phabricator.wikimedia.org/T173731)
[08:20:10] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:20:39] (Merged) jenkins-bot: db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:20:49] (CR) jenkins-bot: db-codfw.php: Depool db2073 [mediawiki-config] - https://gerrit.wikimedia.org/r/373252 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:21:10] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time
[08:22:14] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2073 - T168661 (duration: 00m 59s)
[08:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:28] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:23:06] (PS3) Filippo Giunchedi: prometheus: add blackbox configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860)
[08:23:51] (CR) Filippo Giunchedi: [C: 2] prometheus: add blackbox configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:27:59] (PS3) Zfilipin: Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: KartikMistry)
[08:31:21] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-blackbox-exporter]
[08:35:44] !log Stop Upgrade MySQL on db2073 to 10.0.32 - T168661
[08:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:55] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:36:22] prometheus2003 is me, taking a look
[08:43:51] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-blackbox-exporter]
[08:48:55] should be recovering soon ^
[08:55:03] (CR) Alexandros Kosiaris: "Technically this looks correct. Personally I am not in love much with outbound filtering, mostly because most apps use some ephemeral sour" [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[08:58:17] (CR) Alexandros Kosiaris: [C: -1] swift: don't track client connections in frontend (1 comment) [puppet] - https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[08:58:50] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:59:47] Hi ops chan - since a few minutes I can't get patches from gerrit anymore - anything changed lately?
[09:08:03] !log upload prometheus-blackbox-exporter 0.7.0+ds1-1~wmf1 to jessie-wikimedia, backported - T169860
[09:08:12] joal: not afaik, works for me
[09:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:16] T169860: Investigate/setup prometheus blackbox_exporter - https://phabricator.wikimedia.org/T169860
[09:10:10] !log upgrading remaining API servers in to luasandbox 2.0.13
[09:10:18] akosiaris: thanks for the review! I'm not sure I understand what you mean, the accept on output is on the destination port
[09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:21] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:16:37] joal: I am also having an issue with gerrit
[09:16:45] looks like it might only be over ssh though, http works
[09:17:38] (CR) Filippo Giunchedi: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[09:19:14] http works for me as well
[09:19:50] ah yeah git review is busted for me now as well
[09:20:00] I'll take a look
[09:20:23] (Abandoned) Paladox: Fix debian-rules-missing-recommended-target [debs/gerrit] - https://gerrit.wikimedia.org/r/353766 (owner: Paladox)
[09:20:32] I tried to clone / fetch - no luck either
[09:20:45] Thx godog
[09:21:04] godog: bastion issue?
[09:21:25] ok, so it's not just me (git review problems)
[09:22:42] volans: ssh to gerrit shouldn't go through the bastion iirc
[09:23:17] godog: right, scratch that
[09:25:49] wikidata will start to send icinga here, One of ops please ack. it for six hours please :)
[09:26:32] so I can ssh ok and run gerrit commands, but git review hangs there
[09:26:48] e.g. gerrit show-queue
[09:27:27] [2017-08-23 09:27:12,133 +0000] 214840b3 filippo a/1629 git-upload-pack./operations/puppet.git 0ms 132439ms killed
[09:27:34] once I ctrl-c git review
[09:29:14] there's also a bunch of git-upload-pack stuck there for different repos
[09:29:28] kill them all
[09:29:37] it happens sometimes
[09:30:16] jynus: we've got a question (Manuel and myself) wrt. a big global rename that is going to hit dewiki due to flaggedrevs_fr.user table
[09:30:19] https://wikitech.wikimedia.org/wiki/Gerrit#Tasks_management
[09:30:55] godog: the JVM logs has a lot of:
[09:30:57] [GC (Allocation Failure)
[09:30:57] Desired survivor size 200802304 bytes, new threshold 15 (max 15)
[09:30:57] [PSYoungGen: 1235744K->83802K(1245696K)] 12224690K->11104235K(15226880K), 0.0728984 secs] [Times: user=0.74 sys=0.01, real=0.08 secs]
[09:31:18] Amir1: can you / we not ACK things in icinga?
[09:31:55] volans: interesting, IIRC there's been some jvm tuning recently for gerrit
[09:32:00] addshore: I thought about different ways of not doing it but none of them seems feasible
[09:32:02] :(
[09:32:03] I can see it has -Xmx20g
[09:32:24] jynus: looks like that might unstuck it indeed
[09:32:33] objections to try it or volans you want to keep looking?
[09:32:46] godog: go ahead!
[09:32:51] Amir1: no, I mean, can you / I not ack the alarms in icinga ourselves?
[09:33:05] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1950 bytes in 0.143 second response time
[09:33:25] I tried but gives me "not authorized" error
[09:33:36] if ops can give me that right, it would be fantastic
[09:34:04] Amir1: do you have a task I can link in the ACK message?
[09:34:21] volans: T171460 thanks
[09:34:22] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460
[09:34:38] Amir1: ahh yes, so it shows in the UI but when you do it it says "Not Authorized" :)
[09:34:40] ACKNOWLEDGEMENT - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1950 bytes in 0.143 second response time Volans Populate term_full_entity_id on www.wikidata.org T171460
[09:34:48] addshore: yeah
[09:34:59] done ;)
[09:35:08] volans: Thanks
[09:35:22] I wonder if there is some way that allows people of certains ldap groups to have access over certain alarms / groups of alarms
[09:35:53] (PS2) Filippo Giunchedi: ferm: introduce ferm::client [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731)
[09:35:55] (PS2) Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731)
[09:35:57] (PS2) Filippo Giunchedi: statsite: don't track statsd client traffic [puppet] - https://gerrit.wikimedia.org/r/373253 (https://phabricator.wikimedia.org/T173731)
[09:35:57] looks like the gerrit pipe has been unstuck
[09:36:09] thanks godog!
[09:36:12] leszek_wmde: ^^
[09:36:15] !log kill older and running tasks from gerrit queue, it was stuck
[09:36:25] thanks godog!
[09:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:47] no problem!
[09:45:14] Operations, MediaWiki-JobQueue, Performance-Team, Wikidata, Wikidata-Sprint: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3544007 (Lydia_Pintscher)
[09:47:57] (PS1) Jcrespo: mariadb: reduce shards replicated to dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/373256 (https://phabricator.wikimedia.org/T168409)
[09:50:22] (CR) Marostegui: [C: 1] mariadb: reduce shards replicated to dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/373256 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[09:53:51] (CR) Gehel: [C: 1] "LGTM" [software/cumin] - https://gerrit.wikimedia.org/r/373251 (owner: Volans)
[09:56:54] Yay, gerrit unstuck for me as well - Thanks godog :)
[09:57:42] (CR) Alexandros Kosiaris: [C: 1] icinga: add plugin to check for long running screens [puppet] - https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) (owner: Dzahn)
[09:59:17] (PS3) ArielGlenn: start of setup of dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/373117 (https://phabricator.wikimedia.org/T169849)
[10:02:45] !log joal@tin Started deploy [analytics/refinery@84d6ee4]: Regular deploy (1 month since last)
[10:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:24] !log upgrading remaining image scalers to luasandbox 2.0.13
[10:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:15] !log joal@tin Finished deploy [analytics/refinery@84d6ee4]: Regular deploy (1 month since last) (duration: 03m 30s)
[10:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:26] (CR) Jcrespo: [C: 2] mariadb: reduce shards replicated to dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/373256 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:11:01] (CR) Alexandros Kosiaris: [C: 1] "Just saw the dport argument again. Yeah my comment makes no sense then given that." [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[10:19:46] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3544055 (MoritzMuehlenhoff) Open>Resolved 2.0.13 has been built for jessie and deployed in our environment. I don't think we need a trusty build at this point; the...
[10:22:23] chasemp: is 0001-openstack-keystone-as-module-profile-role-for-deploy.patch in puppet's root a mistake or meant to be there?
[10:29:26] (PS1) Filippo Giunchedi: role: add ssh blackbox probes for bastions [puppet] - https://gerrit.wikimedia.org/r/373261 (https://phabricator.wikimedia.org/T169860)
[10:29:51] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[10:30:20] (PS3) Filippo Giunchedi: ferm: introduce ferm::client [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731)
[10:31:16] (CR) Filippo Giunchedi: [C: 2] "Thanks Alex and Moritz!" [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[10:31:49] jynus: ok to merge your change?
[10:32:18] yes, sorry
[10:32:41] no worries, happens all the time to me heh
[10:33:52] (PS2) Filippo Giunchedi: role: add ssh blackbox probes for bastions [puppet] - https://gerrit.wikimedia.org/r/373261 (https://phabricator.wikimedia.org/T169860)
[10:34:32] (CR) Filippo Giunchedi: [C: 2] role: add ssh blackbox probes for bastions [puppet] - https://gerrit.wikimedia.org/r/373261 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[10:39:44] !log stopping and deleting s1 and s4 from dbstore2001
[10:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:06] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3544198 (MoritzMuehlenhoff) @Papaul: Luca's out this week. I've tried to connect to the host, but can't connect via SSH. It works fine over the mgmt, can you check the cabling please?
[10:46:12] (PS1) Phuedx: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853)
[10:50:58] (PS2) Phuedx: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853)
[10:53:13] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1933 bytes in 0.145 second response time
[10:58:17] (PS1) Filippo Giunchedi: role: use port 22 for ssh probing [puppet] - https://gerrit.wikimedia.org/r/373266 (https://phabricator.wikimedia.org/T169860)
[10:59:22] (PS3) Phuedx: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853)
[11:02:17] (PS1) Jcrespo: mariadb: Renable buffer pool dump on dbstores and prometheus fix [puppet] - https://gerrit.wikimedia.org/r/373267 (https://phabricator.wikimedia.org/T168409)
[11:08:11] (CR) Filippo Giunchedi: [C: 2] role: use port 22 for ssh probing [puppet] - https://gerrit.wikimedia.org/r/373266 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[11:08:18] (PS2) Filippo Giunchedi: role: use port 22 for ssh probing [puppet] - https://gerrit.wikimedia.org/r/373266 (https://phabricator.wikimedia.org/T169860)
[11:09:45] (CR) Jhernandez: [C: 1] "Looking good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: Phuedx)
[11:11:44] (PS1) Jcrespo: wmf-mariadb: update packages to 10.1.26 and 10.0.32 [software] - https://gerrit.wikimedia.org/r/373269
[11:11:46] (PS1) Jcrespo: dblists: Remove s1 and s4 from dbstore2001 [software] - https://gerrit.wikimedia.org/r/373270
[11:12:16] (CR) Jcrespo: [V: 2 C: 2] wmf-mariadb: update packages to 10.1.26 and 10.0.32 [software] - https://gerrit.wikimedia.org/r/373269 (owner: Jcrespo)
[11:12:40] (PS1) ArielGlenn: minimal manifest for dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/373271
[11:13:51] (CR) Jcrespo: [C: 2] dblists: Remove s1 and s4 from dbstore2001 [software] - https://gerrit.wikimedia.org/r/373270 (owner: Jcrespo)
[11:13:53] (PS2) Volans: Updated documentation [software/cumin] - https://gerrit.wikimedia.org/r/373251
[11:13:59] (PS2) Volans: Add CHANGELOG file including previous releases [software/cumin] - https://gerrit.wikimedia.org/r/373250
[11:13:59] (PS1) Volans: Transports: fix target management improvement [software/cumin] - https://gerrit.wikimedia.org/r/373272 (https://phabricator.wikimedia.org/T171684)
[11:14:15] (CR) Jcrespo: [C: 2] mariadb: Renable buffer pool dump on dbstores and prometheus fix [puppet] - https://gerrit.wikimedia.org/r/373267 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[11:14:23] (PS2) Jcrespo: mariadb: Renable buffer pool dump on dbstores and prometheus fix [puppet] - https://gerrit.wikimedia.org/r/373267 (https://phabricator.wikimedia.org/T168409)
[11:16:11] (PS3) Volans: Add CHANGELOG file including previous releases [software/cumin] - https://gerrit.wikimedia.org/r/373250
[11:25:37] would anyone object to me merging a beta cluster only change?
[11:26:09] and fetching it on deployment?
[11:29:02] zeljkof et al ^
[11:29:44] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1950 bytes in 0.090 second response time
[11:29:51] phuedx: I don't mind, but I am often wrong :D
[11:29:59] if you know what you are doing...
[11:32:50] volans: Is it normal ^?
[11:36:34] Amir1: I ack'ed it, if it went to normal and then back to critical yes, it's normal
[11:36:50] if it will flap, better to downtime instead of ack'ing
[11:37:03] Yeah it will flap
[11:37:05] sorry
[11:38:07] no prob, let me downtime it
[11:38:59] done
[11:43:13] Thanks
[11:48:54] zeljkof: i can never remember what the procedure is, but the beta cluster config is deployed regularly
[11:49:36] so what i've done before is merge the config change and then put it on the deployment box so that deployers don't see random unrelated changes when they're deploying
[11:49:55] no one has complained when i've asked before, but that doesn't mean that i'm doing something wrong ;)
[11:54:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 0.129 second response time
[11:56:55] phuedx: I don't think I have deployed beta cluster config ever
[11:57:11] so I'm the wrong person to ask
[11:57:20] it has to be documented somewhere
[11:57:22] zeljkof: it's done by jenkins :)
[11:57:40] oh, so you just merge the commit and jenkins deploys it?
[11:57:44] cool [11:58:01] https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/ [12:03:04] 10Operations, 10ops-eqiad, 10DBA: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3544266 (10Marostegui) [12:04:55] (03CR) 10Matthias Mullie: Upgrade to 1.2 (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/370907 (https://phabricator.wikimedia.org/T161719) (owner: 10Gilles) [12:05:23] 10Operations, 10ops-eqiad, 10DBA: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3544279 (10Marostegui) [12:07:18] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373276 (https://phabricator.wikimedia.org/T173915) [12:12:19] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373276 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:13:13] chasemp: bd808: replication lags again on dewiki and maybe wikidatawiki for 20 minutes now (maybe due to T172679 in future). No API r/w possible. 
Please help [12:13:13] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [12:13:14] (03CR) 10Jcrespo: [C: 031] db-eqiad,db-codfw.php: Remove db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373276 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:13:28] Amir1: ^ [12:14:20] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373276 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:15:41] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1041 to decommission it - T173915 (duration: 00m 48s) [12:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:56] T173915: Decommission db1041 - https://phabricator.wikimedia.org/T173915 [12:16:14] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1041 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373276 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:16:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1041 to decommission it - T173915 (duration: 00m 48s) [12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3544324 (10Marostegui) [12:18:30] chasemp: bd808: replication lags again on dewiki and maybe wikidatawiki for about 30 minutes now (maybe due to T172679 in future). No API r/w possible. Please help [12:18:30] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [12:19:56] andrewbogott: ^ [12:20:19] ok -- i'm out; i won't deploy the thing to the beta cluster until i get back [12:21:10] anybody here to fix longtime replication lags? 
[12:21:59] jynus: ^ [12:22:36] (03PS1) 10Filippo Giunchedi: role: collect blackbox_exporter metrics in Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/373280 (https://phabricator.wikimedia.org/T169860) [12:22:54] (03PS2) 10Filippo Giunchedi: role: collect blackbox_exporter metrics in Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/373280 (https://phabricator.wikimedia.org/T169860) [12:22:56] marostegui: It's related to other things too [12:22:57] (03CR) 10jerkins-bot: [V: 04-1] role: collect blackbox_exporter metrics in Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/373280 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:23:12] Amir1: how do you know? [12:23:26] Amir1: anybody here to fix longtime replication lags? [12:23:36] https://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Maxlag_parameter_not_respected [12:23:54] jynus: high edit rate in wikidata [12:24:49] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1&from=now-6h&to=now [12:25:29] (03CR) 10Filippo Giunchedi: [C: 032] role: collect blackbox_exporter metrics in Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/373280 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:29:46] marostegui: jynus doctaxon the replication lag doesn't seem too bad: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1 [12:30:01] (03PS1) 10Faidon Liambotis: Delete stray patch file [puppet] - 10https://gerrit.wikimedia.org/r/373281 [12:30:02] (I think we ignore dbstore lag) [12:30:02] not this time any more [12:30:05] (03PS1) 10Marostegui: mariadb: Remove db1041 [puppet] - 10https://gerrit.wikimedia.org/r/373282 (https://phabricator.wikimedia.org/T173915) [12:30:13] Amir1: we do [12:30:26] it's better now [12:30:38] (03CR) 10Faidon Liambotis: [C: 032] Delete stray patch file [puppet] - 10https://gerrit.wikimedia.org/r/373281 (owner: 10Faidon 
Liambotis) [12:31:08] the highest lag I was able to find was 8 seconds not 20 minutes [12:31:15] the problem is 26 and 45 [12:31:26] which are recentchanges/watchlist hosts [12:31:42] if both lag, things will most likely go into read only [12:31:56] I think the limit is 5 seconds [12:32:05] i think too [12:32:08] 5 sec [12:32:26] oops, looking at the wrong table. mysql replication lag graph is defaulted to codfw [12:32:28] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops [12:32:32] this is bad [12:33:14] all tools and bots should use maxlag parameters of at least 5 seconds !!! [12:34:28] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/7569/" [puppet] - 10https://gerrit.wikimedia.org/r/373282 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:34:37] sincerely, I do not know why admins are not stricter with temporary bans [12:34:40] (03PS1) 10Muehlenhoff: Fine-tune display of check_restart and deploy commands [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373283 [12:34:43] Amir1: I assume your script is now stopped, no? [12:34:54] they are not users, they are bots [12:35:00] *human [12:35:11] yes but I'm planning to re-run it [12:35:26] but first I want to make sure the user is not flooding so bad [12:38:24] blocked the user, writing the warning [12:39:15] both maintenance and bot work can always be done slower and take more time [12:39:17] And lag looks gone for now [12:39:30] failed human edits, on the other hand, drive users away [12:39:47] plus dewiki is still on s5 [12:41:18] but still 80 edits per minute by a human puts too much pressure on lots of services besides flooding the rc (that's why it's prohibited to do so in all wikis) [12:41:42] "80 edits per minute by a human" who does that? 
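Editorial sketch of the maxlag behaviour discussed above: a client sends `maxlag=N` with each API request and, when the API answers with an error whose code is `maxlag`, waits and retries instead of hammering lagged replicas. This is only a minimal illustration, not any particular bot framework's implementation. The `send` callable, `call_with_maxlag`, and the retry count are hypothetical names and values chosen for the example; the default of 5 seconds follows the limit mentioned in the conversation.

```python
import time


def call_with_maxlag(send, params, maxlag=5, max_retries=5, sleep=time.sleep):
    """Issue an API request that backs off while replicas are lagged.

    `send` is a hypothetical transport: it takes the request parameters
    (a dict) and returns a (json_body, response_headers) tuple.
    """
    # Ask the server to refuse the request if replication lag exceeds `maxlag`.
    request = dict(params, maxlag=maxlag, format="json")
    for attempt in range(max_retries + 1):
        body, headers = send(request)
        error = body.get("error", {})
        if error.get("code") != "maxlag":
            # Either success or an unrelated error: hand it back to the caller.
            return body
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint; fall back to the maxlag value.
        sleep(float(headers.get("Retry-After", maxlag)))
    raise RuntimeError("replicas still lagged after %d retries" % max_retries)
```

A bot that routes every write through a wrapper like this slows itself down automatically whenever the replicas fall behind, which is the behaviour being asked for in the channel.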
[12:42:04] https://www.wikidata.org/wiki/User_talk:Muhammad_Abul-Futooh#Block [12:42:08] here I am talking about bots, having the bot flag or not [12:42:19] yeah "human flag" [12:42:43] yes, that would be one of the issues I include as "should be blocked quickly" [12:42:54] no matter if they are later unblocked [12:43:07] I am not saying this is your job [12:43:24] I want to start the script, is it okay? [12:43:28] I am saying all admins should be stricter with those things [12:43:30] This is a silly question, pardon my ignorance… why is there no automatic blocking for bots doing X amount of edits per minute? With X being something that can be changed over time from a config flag? [12:43:55] marostegui: there are 2 things [12:44:14] bots should be run with bot flags, which this one shouldn't [12:44:31] and bots should follow the bot netiquette [12:44:40] *didnt [12:44:57] got it [12:45:06] but the netiquette is not enforced now? [12:45:10] is it a best-effort? [12:45:46] marostegui: the wiki philosophy is to let people act unless problems happen [12:47:11] Right, but when they happen, it is slow to get them fixed? I mean, there is not a quick way of blocking a user/bot if we didn't have Amir1 online now [12:47:14] no? [12:47:19] "There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down. Most system administrators reserve the right to unceremoniously block you if you do endanger the stability of their site." 
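Editorial sketch of the idea raised above (flagging an account that makes more than X edits per minute, with X configurable): a sliding-window counter per account. This is purely illustrative and is not a description of how MediaWiki or Wikidata handle rate limiting. `EditRateMonitor`, its parameters, and the account names are hypothetical.

```python
from collections import deque


class EditRateMonitor:
    """Track edit timestamps per account over a sliding time window."""

    def __init__(self, max_edits_per_window, window_seconds=60.0):
        # `max_edits_per_window` plays the role of the configurable "X".
        self.max_edits = max_edits_per_window
        self.window = window_seconds
        self._edits = {}  # account name -> deque of edit timestamps

    def record_edit(self, account, timestamp):
        """Record one edit; return True if the account is over the limit."""
        stamps = self._edits.setdefault(account, deque())
        stamps.append(timestamp)
        # Drop edits that have fallen out of the window.
        while stamps and timestamp - stamps[0] >= self.window:
            stamps.popleft()
        return len(stamps) > self.max_edits
```

With a limit of 80 per 60 seconds, the 81st edit inside one minute would return True, and a caller could then warn, throttle, or escalate to an on-wiki admin rather than block outright.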
[12:47:36] https://www.mediawiki.org/wiki/API:Etiquette [12:47:42] you can do it with your WMF account I guess [12:51:21] (03PS1) 10Filippo Giunchedi: role: fix prometheus global collection regexp [puppet] - 10https://gerrit.wikimedia.org/r/373284 [12:52:37] (03CR) 10Filippo Giunchedi: [C: 032] role: fix prometheus global collection regexp [puppet] - 10https://gerrit.wikimedia.org/r/373284 (owner: 10Filippo Giunchedi) [12:52:44] (03PS2) 10Filippo Giunchedi: role: fix prometheus global collection regexp [puppet] - 10https://gerrit.wikimedia.org/r/373284 [12:53:24] marostegui I would search for a wikidata admin, on several other channels [12:53:43] Yeah, that is what I mean, that there is no other way [12:53:53] Making the process relatively slow [12:54:40] there are other ways, the ways of the root, but I would only use that if there was a clear outage; for transparency I think it would be better to do things "on wiki" [12:56:42] right so, this would be done better onwiki [12:57:15] doctaxon: I think you will be happier once wikidata is on its own shard :-) [12:57:24] so bots can fight each other alone [12:57:49] I think nobody predicted the popularity of wikidata [12:57:49] yes, I watch the ticket [12:58:11] everybody I talked to said "edits will slow down in a year" [12:58:22] but I already run bots on wikidatawiki too [12:58:34] -already [12:58:44] yeah, but not all other people respect the netiquette [12:58:56] :( [12:59:25] that is something I would definitely suggest bringing up on the community wiki page [12:59:55] being stricter with bad netizens (especially bots) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T1300). 
[13:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:13] I can SWAT today! [13:00:51] * kart_ here [13:00:56] zeljkof: cool. [13:01:09] kart_: merging the commit, will ping you when it's at mwdebug1002 [13:01:20] or, do you want to deploy yourself? [13:01:34] zeljkof: go ahead with merge and deploy. [13:01:44] ok [13:01:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: 10KartikMistry) [13:03:27] (03Merged) 10jenkins-bot: Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: 10KartikMistry) [13:03:41] (03CR) 10jenkins-bot: Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: 10KartikMistry) [13:06:33] (03CR) 10Gehel: [C: 031] Transports: fix target management improvement (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/373272 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [13:07:37] zeljkof: I had network issue for a while; back now. [13:09:06] kart_: the commit is at mwdebug1002, please test and let me know if I can continue with deployment [13:09:59] zeljkof: tested. looks good. [13:10:07] kart_: deploying [13:10:58] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:373069|Enable Flow as a Beta feature on wawiki and wawikionary (T172947)]] (duration: 00m 48s) [13:11:07] kart_: deployed, please check [13:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:10] T172947: Install Flow as a Beta feature on wa.wikipedia.org and wa.wikionary.org. - https://phabricator.wikimedia.org/T172947 [13:13:15] zeljkof: Checked. all cool. Thanks! 
[13:13:38] kart_: great, thanks for deploying with #releng ;) [13:13:44] !log EU SWAT finished [13:13:49] :) [13:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:06] (03PS10) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 (https://phabricator.wikimedia.org/T171704) [13:23:09] (03CR) 10Ottomata: Adding mailto to camus job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362237 (https://phabricator.wikimedia.org/T169248) (owner: 10Nuria) [13:25:00] (03CR) 10Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/7570/" [puppet] - 10https://gerrit.wikimedia.org/r/369682 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [13:25:46] (03PS2) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230 [13:26:28] (03PS8) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [13:26:53] (03CR) 10jerkins-bot: [V: 04-1] wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [13:28:46] (03PS2) 10Marostegui: mariadb: Remove db1041 [puppet] - 10https://gerrit.wikimedia.org/r/373282 (https://phabricator.wikimedia.org/T173915) [13:29:45] (03CR) 10Marostegui: [C: 032] mariadb: Remove db1041 [puppet] - 10https://gerrit.wikimedia.org/r/373282 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [13:29:49] (03PS9) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [13:31:17] !log Stop MySQL on db1041 to get it ready for decommission - T173915 [13:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:36] T173915: Decommission db1041 - https://phabricator.wikimedia.org/T173915 [13:32:05] 10Operations, 10ops-eqiad, 10DBA, 
10Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3544468 (10Marostegui) [13:32:21] (03PS3) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230 [13:37:43] (03PS1) 10Muehlenhoff: New debdeploy module to query installed reverse dependencies [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373290 [13:37:48] (03CR) 10Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/7571/" [puppet] - 10https://gerrit.wikimedia.org/r/342230 (owner: 10Gehel) [13:40:16] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3544522 (10Mholloway) @cooltey Sometime this week, please complete the checklist above. This is for shell access to upload APKs to one of our production ser... [13:46:13] (03PS1) 10Alexandros Kosiaris: WIP: Allow silencing notifications for hosts [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) [13:53:20] (03PS2) 10Alexandros Kosiaris: WIP: Allow silencing notifications for hosts [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) [14:00:39] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3544613 (10fgiunchedi) Looks like librenms polls every 5 minutes, so the gaps are there because no data has actually been sent. @Volans yeah that... 
[14:00:45] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3544614 (10Marostegui) @MarcoAurelio See: T172207#3544611 Looks like we can proceed [14:05:59] (03PS3) 10Alexandros Kosiaris: WIP: Allow silencing notifications for hosts [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) [14:18:55] (03PS2) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 [14:23:10] (03CR) 10jerkins-bot: [V: 04-1] logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 (owner: 10Gehel) [14:23:44] (03PS4) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 [14:26:57] (03PS1) 10Muehlenhoff: Remove access for ellery [puppet] - 10https://gerrit.wikimedia.org/r/373298 [14:27:17] any regular deployers about? [14:27:24] (i consider myself an irregular deployer) [14:28:23] ? [14:28:28] o/ [14:28:35] wassup? [14:28:41] i asked earlier and had a little poke around wikitech [14:28:53] what's the sop around beta cluster only config changes? [14:29:31] can we merge 'em outside of a window and then update deployment ourselves? [14:30:45] I'm not 100% sure, but I believe so yes [14:30:59] Ops have confirmed sync-file -labs.php is a noop for production [14:31:25] hrrm [14:31:39] should i be bold and update wikitech i wonder! [14:31:50] I'd generally advise not to sync in the middle of the night (unless you are more comfortable deploying) [14:32:51] feel free :) [14:33:00] (03CR) 10Muehlenhoff: [C: 032] Remove access for ellery [puppet] - 10https://gerrit.wikimedia.org/r/373298 (owner: 10Muehlenhoff) [14:33:10] Obviously, if you need to touch CommonSettings.php.... 
Then it's not just a beta change [14:33:27] But if it's explicitly only -labs.php files, it should be good [14:34:10] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4027.ulsfo.wmnet [14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:31] (03CR) 10Reedy: [C: 031] Run Lilypond from Firejail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582) (owner: 10Ebe123) [14:36:42] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp4027.ulsfo.wmnet [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:25] (03PS4) 10Alexandros Kosiaris: WIP: Allow silencing notifications for hosts [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) [14:44:25] (03PS1) 10Jdlrobson: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) [14:49:11] !log installing texlive security updates on trusty (Debian already fixed) [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:05] (03PS2) 10Ottomata: Update hadoop fair scheduler queues [puppet] - 10https://gerrit.wikimedia.org/r/362151 (https://phabricator.wikimedia.org/T156841) (owner: 10Joal) [14:59:22] (03CR) 10Ottomata: [V: 032 C: 032] Update hadoop fair scheduler queues [puppet] - 10https://gerrit.wikimedia.org/r/362151 (https://phabricator.wikimedia.org/T156841) (owner: 10Joal) [14:59:32] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3545014 (10Cmjohnson) @Marostegui The ssd has been replaced. 
Please resolve after rebuild [15:02:48] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3545034 (10Cmjohnson) @Gehel The disk arrived, the disks are internal and the server will need to be taken down to replace. Let me know when you're ready for me to swap disk. [15:06:21] (03CR) 10Pmiazga: [C: 031] Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [15:08:09] cmjohnson1: about logstash1006 failed disk, I'm available right now if you are... [15:08:17] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler02/7578/. Looks rather OK" [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) (owner: 10Alexandros Kosiaris) [15:08:23] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545075 (10Mholloway) @Sharvaniharan Same as I asked @cooltey above, please complete the above checklist sometime this week in order to obtain shell access.... [15:09:00] 10Operations, 10ops-eqiad, 10DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3545079 (10jcrespo) Thank you, Chris! @RobH @Cmjohnson if you are ok with that, //with less priority//, we would like some disk degradation testing at some point in the future. [15:09:36] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3545081 (10Gehel) I'll take down the server right now, we should be able to live with 2 elasticsearch backends only without any issue. Let me know when the server is up agai... [15:09:42] Reedy: middle of the night whos time? 
[15:09:47] ;) [15:09:54] phuedx: "when there's no opsen around" [15:10:00] :P [15:10:03] !log shutdown logstash1006 for disk replacement - T173679 [15:10:06] agreed [15:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] T173679: Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679 [15:10:23] hrrm [15:10:49] there's a 6 pm utc swat deploy, which is the morning swat [15:11:04] maybe i could get this change up at the end of that window [15:12:10] or after those changes have been deployed [15:14:41] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3545106 (10Cmjohnson) 05Open>03Resolved Received and replaced the raid controller! A million times better and it's working fine no... [15:14:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3545108 (10Cmjohnson) [15:15:28] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10Cmjohnson) a:05Cmjohnson>03RobH the issue with 1004 has been resolved assigning to @robh to do installs. [15:16:22] (03PS5) 10Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) [15:16:24] (03PS1) 10Jcrespo: mariadb: Pool db1078 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373306 (https://phabricator.wikimedia.org/T173365) [15:16:57] Reedy: you still remember my patch/request? Any ETA? Once merged i can announce it on the mailing list and we can start adding sites. 
(Hope i follow due process, didn't uploaded any patches for a while) [15:17:05] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.... [15:17:07] ? [15:17:44] 10Operations, 10MediaWiki-JobQueue, 10Performance-Team, 10Wikidata, 10Wikidata-Sprint: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545124 (10Ladsgroup) This is list of the most inserted jobs in the past three hours (the number is the job added per second): ``` wikibase_a... [15:18:04] https://gerrit.wikimedia.org/r/#/c/368770/ (wmf product maneger +1'ed it and a steward) [15:20:16] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545133 (10jcrespo) [15:22:54] (03CR) 10Gehel: [C: 032] logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 (owner: 10Gehel) [15:23:05] (03PS5) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 [15:26:25] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3545172 (10Cmjohnson) @gehel the disk has been swapped, I will re-install later this afternoon. [15:28:13] (03PS1) 10Jcrespo: install_server: Remove db1069 & dbstore2001 from the list of reimaging [puppet] - 10https://gerrit.wikimedia.org/r/373309 [15:29:03] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545212 (10herron) [15:29:15] (03CR) 10Jcrespo: [C: 04-1] "Not before disk rebuilding finishes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/373306 (https://phabricator.wikimedia.org/T173365) (owner: 10Jcrespo) [15:31:09] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3545222 (10Papaul) @MoritzMuehlenhoff the cable is connected. Just keep in mind new main board = new MAC address. [15:31:41] (03CR) 10Filippo Giunchedi: Upgrade to 1.2 (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/370907 (https://phabricator.wikimedia.org/T161719) (owner: 10Gilles) [15:31:47] (03PS2) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342232 [15:33:19] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3545236 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.... [15:35:43] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3545240 (10fgiunchedi) I think we can resolve this task, for swift I got T173721 going. The incr... [15:35:47] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3545241 (10Legoktm) Yay! I also added him to the "integration" group on gerrit so he can merge ch... 
[15:35:55] !log added addshore to "integration" gerrit group (T173233) [15:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:07] T173233: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233 [15:36:49] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:38:12] (03PS1) 10Jcrespo: mariadb: reimage db1069 as a core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) [15:38:13] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3545255 (10phuedx) 05Open>03Resolved a:03phuedx @fgiunchedi: +1 🎉🎉🎉 [15:39:38] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:39:41] (03CR) 10Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/7580/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/342232 (owner: 10Gehel) [15:39:48] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:39:58] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:39:59] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:08] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:08] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:08] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:08] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:09] PROBLEM - IPsec on cp1048 is CRITICAL: 
Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:18] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:19] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:19] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:19] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:19] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:38] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4021_v4, cp4021_v6 [15:40:38] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:38] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:39] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:40:39] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 80 not-conn: cp4021_v4, cp4021_v6 [15:41:16] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3545260 (10Ottomata) Amazing! Thank you. [15:42:10] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545264 (10RobH) [15:45:04] (03CR) 10Jcrespo: [C: 04-1] "Not until db1069 is actually reimaged." 
[puppet] - 10https://gerrit.wikimedia.org/r/373309 (owner: 10Jcrespo) [15:49:29] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537269 (10Esc3300) wikibase_addUsagesForPage are these new requests from client wikis? e.g. a template in some wikipedia reading labels for statements?... [15:50:21] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/342230 (owner: 10Gehel) [15:53:36] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [15:53:45] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [15:54:05] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [15:54:05] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [15:54:06] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [15:54:36] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.072 second response time [15:54:46] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.722 second response time [15:54:54] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545319 (10Ladsgroup) This is for updating entity_usage table in clients. 
[15:55:05] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.115 second response time [15:55:06] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 73945 bytes in 0.729 second response time [15:55:06] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 73945 bytes in 0.356 second response time [16:00:05] bd808: Dear anthropoid, the time has come. Please deploy Striker deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T1600). [16:00:30] o/ [16:03:04] (03PS4) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230 [16:03:43] (03CR) 10Gehel: [C: 032] logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230 (owner: 10Gehel) [16:06:31] (03PS1) 10Brian Wolff: Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 [16:08:42] (03CR) 10Platonides: [C: 031] Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [16:10:29] (03CR) 10D3r1ck01: [C: 031] Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [16:11:55] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545392 (10Esc3300) Are these originating also in clients are initially coming from Wikidata? What triggers them? 
[16:14:14] PROBLEM - DPKG on cp4021 is CRITICAL: Return code of 255 is out of bounds [16:14:15] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 3128: Connection refused [16:14:15] PROBLEM - Varnish traffic logger - varnishreqstats on cp4021 is CRITICAL: Return code of 255 is out of bounds [16:14:15] PROBLEM - traffic-pool service on cp4021 is CRITICAL: Return code of 255 is out of bounds [16:15:14] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp4021 is CRITICAL: connect to address 10.128.0.121 and port 3120: Connection refused [16:15:14] PROBLEM - Varnish traffic logger - varnishstatsd on cp4021 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:15:14] PROBLEM - Disk space on cp4021 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:15:45] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [16:15:45] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [16:15:54] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [16:15:55] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK [16:15:55] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [16:16:04] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [16:16:04] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [16:16:04] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK [16:16:04] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [16:16:04] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [16:16:14] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [16:16:14] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp4021 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.157 second response time [16:16:14] RECOVERY - Disk space on cp4021 is OK: DISK OK [16:16:14] RECOVERY - Varnish traffic logger - varnishstatsd on cp4021 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) 
[16:16:15] (03CR) 10Volans: [C: 04-1] "Looks good in general, a couple of leftover / missing things and few styling comments inline." (037 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373290 (owner: 10Muehlenhoff) [16:16:15] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4021 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.157 second response time [16:16:15] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [16:16:15] RECOVERY - DPKG on cp4021 is OK: All packages OK [16:16:15] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK [16:16:24] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [16:16:34] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [16:16:34] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [16:16:35] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [16:16:54] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [16:18:15] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [16:18:34] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [16:19:34] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:20:15] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [16:22:24] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX],Service[varnish],Service[varnish-frontend] [16:22:25] RECOVERY - Varnish traffic logger - varnishreqstats on cp4021 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [16:22:27] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3545403 (10fgiunchedi) a:03fgiunchedi [16:22:41] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.9 (duration: 03m 44s) [16:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:07] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#971192 (10fgiunchedi) I'll take this on as part of {T86556} since this task is essentially a superset [16:23:25] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:23:46] !log bd808@tin Started deploy [striker/deploy@2f2dd7c]: Deploying 2f2dd7c "Tool account creation and more" (T128400, T149458, T159044, T164847, T167931, T168480, T173845) [16:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:04] T168480: Perform initial Cloud Services rebranding - https://phabricator.wikimedia.org/T168480 [16:24:04] T149458: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458 [16:24:04] T173845: Make potential for others to see IP Address for ssh sessions explicit in Toolforge membership request process - https://phabricator.wikimedia.org/T173845 [16:24:04] T164847: Striker gives fatal error when a SUL account already in use tries to attach to a second LDAP account - https://phabricator.wikimedia.org/T164847 [16:24:05] T128400: Unable to add service group to service groups - https://phabricator.wikimedia.org/T128400 [16:24:05] T167931: 
Fatal error when adding a duplicate SSH key - https://phabricator.wikimedia.org/T167931 [16:24:05] T159044: Replace deprecated phabricator conduit api calls in phabricator.py file - https://phabricator.wikimedia.org/T159044 [16:24:26] !log bd808@tin Finished deploy [striker/deploy@2f2dd7c]: Deploying 2f2dd7c "Tool account creation and more" (T128400, T149458, T159044, T164847, T167931, T168480, T173845) (duration: 00m 39s) [16:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:31] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.11 [keeping static files] (duration: 01m 32s) [16:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:13] (03CR) 10Volans: [C: 031] "Looks good, see inline for related comment to the other CR." (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373283 (owner: 10Muehlenhoff) [16:29:45] (03PS2) 10Andrew Bogott: Fix firstboot salt minion id on labs [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [16:33:34] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537269 (10EBernhardson) Unfortunately we've had this problem several times before, and it can be quite hard to distinguish between how some jobs behave... 
[16:43:23] (03PS2) 10Dzahn: icinga: add plugin to check for long running screens [puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) [16:43:34] (03PS3) 10Dzahn: icinga: add plugin to check for long running screens [puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) [16:44:44] (03CR) 10Dzahn: [C: 032] icinga: add plugin to check for long running screens [puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [16:46:50] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545592 (10Esc3300) The quickest user seems to do 12000 per hour (not per minute): https://quarry.wmflabs.org/query/20823 [16:48:00] (03CR) 10Volans: "Base logic sounds reasonable and not too much hacky. See inline for additional comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/373291 (https://phabricator.wikimedia.org/T151632) (owner: 10Alexandros Kosiaris) [16:48:34] akosiaris: there is also a link to a Phab paste for you in the comments ;) ^^^ [16:51:55] 10Operations, 10monitoring, 10Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3545625 (10Dzahn) What is the advantage having these servers in monitoring if we also go through great lengths to make sure we don't see them (no notifications, ACKed). Is anyo... [16:56:28] (03CR) 10Volans: "Do we plan to add 2 different checks for screen and tmux sessions? They looks pretty much the same to me and I think they can be part of a" [puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [17:00:05] (03CR) 10Dzahn: "Yes, i can add tmux to this check, wasn't going to upload a second one." 
[puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [17:00:13] 10Operations, 10JobRunner-Service, 10Performance-Team: Job queue growing constantly since around 7th August - https://phabricator.wikimedia.org/T173957#3545666 (10Reedy) [17:00:25] Oh, it's a dupe [17:01:22] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 4 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3545669 (10Reedy) [17:01:26] 10Operations, 10JobRunner-Service, 10Performance-Team: Job queue growing constantly since around 7th August - https://phabricator.wikimedia.org/T173957#3545651 (10Reedy) [17:02:41] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545691 (10cooltey) @Mholloway I have read and signed the L3 document. And here is my public ssh key for WMF production, and I pref... 
[17:04:52] (03CR) 10Andrew Bogott: [C: 032] Fix firstboot salt minion id on labs [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [17:05:00] (03PS3) 10Andrew Bogott: Fix firstboot salt minion id on labs [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [17:06:55] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545713 (10RobH) [17:07:19] (03Abandoned) 10RobH: two new ldap users sharvaniharan and Cooltey [puppet] - 10https://gerrit.wikimedia.org/r/373148 (https://phabricator.wikimedia.org/T173874) (owner: 10RobH) [17:09:39] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545725 (10Mholloway) [17:10:11] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3543236 (10Mholloway) Hey @cooltey, I think you're all set. I pinged @Fjalapeno about approval. Thanks! [17:11:47] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545735 (10RobH) @cooltey: I was planning to make the patchset once I have both keys and info for both of you. You should be all set, thanks! [17:14:28] Anyone interested in a puppet compiler/hiera question? The compiler is failing for lack of a hiera value that I can see is clearly defined in /labs/private. https://puppet-compiler.wmflabs.org/compiler02/7581/labcontrol1001.wikimedia.org/prod.labcontrol1001.wikimedia.org.err [17:14:36] chasemp if you are around? [17:15:10] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545760 (10cooltey) @Mholloway @RobH Got it, thank you! 
[17:19:49] (03CR) 10Andrew Bogott: "Looks good! Thanks hashar!" [puppet] - 10https://gerrit.wikimedia.org/r/369873 (owner: 10Hashar) [17:24:01] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545807 (10Sharvaniharan) @Mholloway completed signing my L3 document. Here is my public key {F9159590} [17:24:29] (03PS2) 10Andrew Bogott: openstack: phase out deployment-stream [puppet] - 10https://gerrit.wikimedia.org/r/369860 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [17:24:37] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3545811 (10Sharvaniharan) I will keep the shell name sharvani [17:25:44] (03CR) 10Herron: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/373177 (https://phabricator.wikimedia.org/T171988) (owner: 10Jeremyb) [17:26:02] (03CR) 10jerkins-bot: [V: 04-1] shiladsen shell: try RSA key instead, add expiry [puppet] - 10https://gerrit.wikimedia.org/r/373177 (https://phabricator.wikimedia.org/T171988) (owner: 10Jeremyb) [17:26:08] 10Operations, 10Patch-For-Review, 10User-Urbanecm, 10Wiki-Setup (Create): Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3545819 (10Krinkle) [17:26:13] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3545820 (10Krinkle) [17:26:15] 10Operations, 10Wikimedia-Language-setup, 10Patch-For-Review, 10User-Urbanecm, 10Wiki-Setup (Create): Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3545821 (10Krinkle) [17:26:17] 10Operations, 10Wikimedia-Language-setup, 10MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), 10Patch-For-Review, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3545823 (10Krinkle) [17:26:19] 
10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Patch-For-Review, 10Wiki-Setup (Create): Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3545822 (10Krinkle) [17:26:48] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Create): Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#3545840 (10Krinkle) [17:26:52] 10Operations, 10Wikimedia-Language-setup, 10MW-1.28-release (WMF-deploy-2016-07-26_(1.28.0-wmf.12)), 10MW-1.28-release-notes, and 2 others: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#3545835 (10Krinkle) [17:26:57] 10Blocked-on-Operations, 10Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#3545837 (10Krinkle) [17:26:59] 10Operations, 10Shell, 10Wiki-Setup (Create): Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#3545846 (10Krinkle) [17:27:05] 10Operations, 10DNS, 10Traffic, 10Wiki-Setup (Create): Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#3545844 (10Krinkle) [17:27:08] 10Operations, 10Wikimedia-Language-setup, 10Shell, 10Wiki-Setup (Create): Create Wikipedia Maithili - https://phabricator.wikimedia.org/T74346#3545847 (10Krinkle) [17:27:10] 10Operations, 10Shell, 10Wiki-Setup (Create): Create Wikivoyage Persian - https://phabricator.wikimedia.org/T73382#3545850 (10Krinkle) [17:27:34] 10Operations, 10Wikimedia-Language-setup, 10Wiki-Setup (Create): Create Wikipedia Minangkabau - https://phabricator.wikimedia.org/T46462#3545865 (10Krinkle) [17:27:59] 10Operations, 10Wiki-Setup (Create): Create Slovenian Wikiversity - https://phabricator.wikimedia.org/T37290#3545874 (10Krinkle) [17:28:06] 10Operations, 10Shell, 10Wiki-Setup (Create): Create Gujarati Wikisource - https://phabricator.wikimedia.org/T37138#3545875 (10Krinkle) [17:28:07] 10Operations, 10Wikimedia-Language-setup, 10Wiki-Setup (Create): Create Wikipedia Lezgi - 
https://phabricator.wikimedia.org/T36223#3545877 (10Krinkle) [17:28:11] 10Operations, 10Wiki-Setup (Create): Create Wikiversity Arabic - https://phabricator.wikimedia.org/T31796#3545884 (10Krinkle) [17:28:13] 10Operations, 10Wiki-Setup (Create): Create Western Panjabi Wiktionary - https://phabricator.wikimedia.org/T34511#3545879 (10Krinkle) [17:28:15] 10Operations, 10Wiki-Setup (Create): Please create a wiki for Wikimedia Argentina - https://phabricator.wikimedia.org/T31715#3545886 (10Krinkle) [17:28:17] 10Operations, 10Wiki-Setup (Create): Create a wiki for the future chapter Wikimedia México - https://phabricator.wikimedia.org/T31758#3545885 (10Krinkle) [17:28:19] 10Operations, 10Wikimedia-Language-setup, 10Wiki-Setup (Create): Create Veps Wikipedia - https://phabricator.wikimedia.org/T34510#3545880 (10Krinkle) [17:28:21] 10Operations, 10Wiki-Setup (Create): Create Wikimedia Belgium wiki (be.wikimedia.org) - https://phabricator.wikimedia.org/T32793#3545883 (10Krinkle) [17:28:23] 10Operations, 10Wikimedia-Language-setup, 10Wiki-Setup (Create): Create Wikipedia Mingrelian - https://phabricator.wikimedia.org/T31456#3545887 (10Krinkle) [17:28:25] 10Operations, 10Bengali-Sites, 10Wiki-Setup (Create): Create a new wiki for Wikimedia Bangladesh - https://phabricator.wikimedia.org/T33096#3545881 (10Krinkle) [17:28:27] 10Operations, 10Wikimedia-Language-setup, 10I18n, 10Shell, 10Wiki-Setup (Create): Create Wikipedia in Northern Sotho - https://phabricator.wikimedia.org/T32882#3545882 (10Krinkle) [17:28:40] (03CR) 10Andrew Bogott: [C: 032] openstack: phase out deployment-stream [puppet] - 10https://gerrit.wikimedia.org/r/369860 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [17:28:43] 10Operations, 10Wiki-Setup (Create): create a test wiki for RTL development - https://phabricator.wikimedia.org/T31339#3545888 (10Krinkle) [17:28:47] 10Operations, 10Wikimedia-Language-setup, 10Wiki-Setup (Create): Create Sakha Wikisource - 
https://phabricator.wikimedia.org/T29557#3545895 (10Krinkle) [17:29:27] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3545916 (10Steinsplitter) >>! In T173859#3544614, @Marostegui wrote: > @MarcoAurelio See: T172207#3544611 > Looks like we can proceed Yepp, Like last ti... [17:32:18] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3546048 (10Cmjohnson) a:05Cmjohnson>03RobH @robh can you do the installs....let's get them accessible and then I will deal with the disk shelf is... [17:38:18] 10Operations, 10Shell, 10Wiki-Setup (Close): Closure of ten.wikipedia.org - https://phabricator.wikimedia.org/T35185#3546082 (10Krinkle) [17:38:55] (03CR) 10Addshore: [C: 031] Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [17:40:00] !log reboot cp4021 [17:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:34] 10Operations, 10DBA, 10Wikimedia-Site-requests: script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609#3546125 (10Krinkle) [17:41:44] PROBLEM - Host cp4021 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:52] 10Operations, 10DBA, 10Wikimedia-Site-requests: script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609#916322 (10Krinkle) [17:42:17] RECOVERY - Host cp4021 is UP: PING OK - Packet loss = 0%, RTA = 78.57 ms [17:42:25] RECOVERY - traffic-pool service on cp4021 is OK: OK - traffic-pool is active [17:42:25] 10Operations, 10Wiki-Setup (Rename): Changing address of Võro Vikipeediä - https://phabricator.wikimedia.org/T84537#3546136 (10Krinkle) [17:42:32] 10Operations, 10Traffic, 10HTTPS, 10Wiki-Setup (Rename): Our *.wikimedia.org cert doesn't properly cover 
https://pa.us.wikimedia.org/ - https://phabricator.wikimedia.org/T40763#3546141 (10Krinkle) [17:42:34] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Rename): Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#3546140 (10Krinkle) [17:42:35] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational [17:42:45] 10Operations, 10Shell, 10Wiki-Setup (Close): Closure of ten.wikipedia.org - https://phabricator.wikimedia.org/T35185#3546156 (10Krinkle) [17:43:32] 10Operations, 10Traffic, 10HTTPS, 10Wiki-Setup (Rename): Rename wikis with multiple subdomains - https://phabricator.wikimedia.org/T33335#3546177 (10Krinkle) [17:43:35] 10Operations, 10Patch-For-Review, 10Wiki-Setup (Rename): Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#3546179 (10Krinkle) [17:43:50] 10Operations, 10Wiki-Setup (Rename): Migration of pt.wikimedia.org - https://phabricator.wikimedia.org/T25537#3546190 (10Krinkle) [17:43:52] 10Operations, 10Wiki-Setup (Rename): Move the Moldovan Wikipedia - https://phabricator.wikimedia.org/T25217#3546191 (10Krinkle) [17:44:08] 10Operations, 10Wikimedia-Language-setup, 10Patch-For-Review, 10User-notice, 10Wiki-Setup (Rename): Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#3546200 (10Krinkle) [17:44:25] 10Operations, 10Wiki-Setup (Rename): Migration of pt.wikimedia.org - https://phabricator.wikimedia.org/T25537#3546248 (10Krinkle) [17:44:27] 10Operations, 10Wiki-Setup (Rename): Move the Moldovan Wikipedia - https://phabricator.wikimedia.org/T25217#3546249 (10Krinkle) [17:44:41] 10Operations, 10Wikimedia-Language-setup, 10Patch-For-Review, 10User-notice, 10Wiki-Setup (Rename): Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#3546256 (10Krinkle) [17:47:23] (03PS1) 10Herron: Change shiladsen ssh key [puppet] - 10https://gerrit.wikimedia.org/r/373326 (https://phabricator.wikimedia.org/T171988) [17:47:41] thcipriani: 
https://gerrit.wikimedia.org/r/#/c/373325/ [17:48:11] (03CR) 10Herron: [C: 032] Change shiladsen ssh key [puppet] - 10https://gerrit.wikimedia.org/r/373326 (https://phabricator.wikimedia.org/T171988) (owner: 10Herron) [17:48:48] (03PS2) 10Herron: Change shiladsen ssh key [puppet] - 10https://gerrit.wikimedia.org/r/373326 (https://phabricator.wikimedia.org/T171988) [17:49:13] (03PS1) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [17:49:45] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [17:51:55] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [17:51:55] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [17:52:00] (03CR) 10Herron: [C: 04-2] "Updated the key in https://gerrit.wikimedia.org/r/#/c/373326/ since this was failing. Possibly because the expiry attributed were merged " [puppet] - 10https://gerrit.wikimedia.org/r/373177 (https://phabricator.wikimedia.org/T171988) (owner: 10Jeremyb) [17:52:54] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.051 second response time [17:53:05] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 73952 bytes in 0.350 second response time [17:54:00] (03PS2) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [17:54:29] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [17:54:44] well, it wasnt my new recipe then... 
[17:55:14] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.05 seconds [17:56:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3546393 (10MarcoAurelio) [17:56:32] AaronSchulz: awesome. I can backport to wmf.14/15 if you can get it reviewed/merged for master. [17:56:41] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3546395 (10Verdy_p) p:05Triage>03Normal [17:57:40] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3546343 (10Verdy_p) [18:00:01] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3546437 (10madhuvishy) The interface flapping issue was because of a mis-connected cable, which @Cmjohnson's fixed now. Both management interfaces ar... [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T1800). Please do the needful. [18:00:05] brion, Jdlrobson, phuedx, bawolff, MaxSem, and Niharika: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:10] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3546452 (10herron) Hi @Shilad, your ssh key has been updated. Are you able to log in? Aug 23 17:54:50 stat1005 puppet-agent[4323]:... 
[18:00:11] (03CR) 10Niharika29: [C: 031] "LGTM. Let's SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:00:22] Woo! [18:00:30] (03PS1) 10RobH: kafka-jumbo addition to dhcp [puppet] - 10https://gerrit.wikimedia.org/r/373329 [18:00:45] I can SWAT. [18:02:00] brion: You around? [18:02:51] bawolff: When you have a moment, there's a question for you on https://gerrit.wikimedia.org/r/#/c/327762/ Pointing it out here because gerrit mentions are so easy to miss. :| [18:03:11] (03PS2) 10RobH: kafka-jumbo install parameters [puppet] - 10https://gerrit.wikimedia.org/r/373329 [18:04:23] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [18:05:03] (here) [18:05:07] looking [18:05:19] (03PS3) 10RobH: kafka-jumbo install parameters [puppet] - 10https://gerrit.wikimedia.org/r/373329 [18:05:41] Niharika: here :) [18:05:55] (03Merged) 10jenkins-bot: Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [18:05:55] Niharika: I do actually, although I think it's ridiculous that the other patch is so stalled [18:06:12] (03CR) 10jenkins-bot: Allow crats to add people to accountcreator group on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [18:07:32] (here) [18:07:32] bawolff: That patch is stalled on changing the length of the field. Seems like an infinite stall. [18:07:38] bawolff: Your patch is on mwdebug1002.
[18:07:46] Cool [18:08:18] mine's beta cluster only, which i could've deployed earlier but was in a bunch o' meetin's [18:08:57] Oh d'oh I only put it in the add to list not the allowed to remove list [18:09:03] (03Abandoned) 10RobH: kafka-jumbo install parameters [puppet] - 10https://gerrit.wikimedia.org/r/373329 (owner: 10RobH) [18:09:05] (03PS3) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [18:09:22] (03PS4) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [18:09:27] (03PS5) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [18:09:34] bleh. [18:09:45] (03PS2) 10Niharika29: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:09:54] Umm, there is a minor problem with that patch, its probably fine for me to follow up later with a fix (the main part works), or if you want you can revert it and i can do a better version [18:10:10] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [18:10:19] goddamn it [18:10:26] mutante: ^ more than just the typo i think [18:10:55] bawolff: Whichever you prefer? [18:11:14] I'd prefer to have it deployed, and I'll submit a follow up after [18:11:29] bawolff: Okay, I'm syncing it then. 
[18:11:33] thanks [18:11:59] wtf [18:12:02] im in odd rebase hell [18:12:16] also the ci output seems suddenly slow [18:12:29] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3546530 (10Verdy_p) Note that "nan" is already defined in https://gerrit.wikimedia.org/r/#/c/285085/1/templates/helpers/langs.tmpl However it... [18:12:30] there it goes... [18:13:01] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Allow crats to add people to accountcreator group on mw.or https://gerrit.wikimedia.org/r/#/c/373317/ (duration: 00m 49s) [18:13:07] And it's out. [18:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:33] Krinkle, legoktm: https://gerrit.wikimedia.org/r/#/c/373325/ [18:13:41] (03CR) 10Niharika29: [C: 032] Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:14:06] (03PS4) 10Niharika29: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [18:14:28] Niharika, https://gerrit.wikimedia.org/r/#/c/373330/ [18:15:04] robh: yea, integration does seem slow..the webserver [18:15:53] (03PS2) 10Niharika29: Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:16:16] (03CR) 10Niharika29: [C: 032] pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [18:16:38] (03PS3) 10Niharika29: Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [18:17:35] (03PS6) 10RobH: kafka-jumbo install params [puppet] - 
10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) [18:17:50] im in hell trying to find a typo that the output says is there [18:17:51] but wont say what line.... [18:17:56] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [18:18:02] Zuul is really sad today. [18:18:06] fucking hell [18:18:09] LoginNotify didn't work? that's sad :( [18:18:11] i have no idea what is wrong with my patchset. [18:18:40] bawolff: It works but for some accounts and not for others. Max's done some logging for it. [18:19:04] (03Abandoned) 10RobH: kafka-jumbo install params [puppet] - 10https://gerrit.wikimedia.org/r/373328 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [18:19:19] We've ruled out 2fa and user permissions so far. [18:19:24] bawolff: your patch about accountcreators is not fully okay; you should have specified who can remove people from that group too [18:19:25] huh. I wonder if maybe it depends on what data center you're connecting to, or something like that [18:19:35] (03CR) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [18:19:37] TabbyCat: I know, i'm fixing it now [18:19:41] maybe you can do a quick hotfix [18:19:42] ah [18:19:44] :) [18:19:54] (03CR) 10Niharika29: [C: 032] Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [18:20:24] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:20:45] bawolff: Hmm, it's been consistent across accounts so far. Either you get it or you don't. [18:21:42] (03PS1) 10Brian Wolff: Follow-up 6d62e9ea8a. 
Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 [18:22:18] (03PS4) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) [18:22:50] TabbyCat: In my defense 3 people +1'd the other patch :P [18:23:03] (03PS1) 10RobH: kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 [18:23:09] bawolff: in your defense, I usually forget about that either :P [18:23:12] Niharika: Is there time in swat for a follow-up? (I know that's not really how its supposed to work) [18:23:43] by which i mean https://gerrit.wikimedia.org/r/373333 [18:24:36] bawolff: Given the state of https://integration.wikimedia.org/zuul/, maybe not. I'll try and swat it if things speed up. [18:25:03] wow, zuul is sad [18:25:14] 373333 :D [18:25:15] i just spent an hour trying to track down an odd CI error on my patch [18:25:21] it it turns out to be the CI server... id be relieved. [18:25:54] I think its CI server [18:26:10] Unless literally everyone is making syntax errors, since most of the lint jobs are failing [18:26:11] jenkins appears to be having some issues, trying to dump threads now and then restart to see if it helps [18:26:17] !log jenkins appears to be having some issues, trying to dump threads now and then restart to see if it helps [18:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:34] bawolff: yesssss, ok [18:26:43] im relieved cuz my sanity was suffering [18:26:47] but also no clue how to help fix =[ [18:27:10] so... [18:27:20] on contint1001.. jenkins is a bit busy.. but it's not dead [18:27:22] thcipriani: let us know if you want ops to do anything =] [18:27:37] there are some warnings but i suspect they are normal background noise [18:28:11] oh, wait: this kind of thing: [18:28:24] Niharika: For the category stuff. 
I think for the expanding the field thing, we should either just chose one of the options, or roll a dice, or write an rfc and make t-com chose one of the options [18:28:33] SSH Launch of ci-trusty-wikimedia-791969 on 10.68.16.215 failed in 7,943 ms [18:28:41] it's trying to launch nodepool instances but fails [18:28:42] ? [18:28:53] -> cloud ? [18:29:59] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 (owner: 10RobH) [18:30:04] bawolff: I'd go with the former. Faster. Who's driving that? [18:30:22] yeah, my patch is perfection now [18:30:24] so ci is borked. [18:30:26] I think James_F was interested in getting the collation stuff out. [18:30:41] I care in a purely volunteer capacity [18:30:59] addshore: hi, perfect time to test your new powers as contint-admin , hehe :) [18:31:07] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513#3546567 (10Verdy_p) [18:31:12] (03CR) 10jerkins-bot: [V: 04-1] Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:31:14] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [18:31:24] (03CR) 10jerkins-bot: [V: 04-1] Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:32:09] !log jenkins restarted, zuul should pick-up queue [18:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:22] Historically the collation stuff has been mostly me and MatmaRex, so there hasn't ever really been a team assoicated with it (Other than community tech) [18:32:44] 
although the jobs that already wrongfully in the queue will probably continue to mess up some jenkins patch voting for a few [18:33:08] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [18:33:29] resubmitting job. [18:33:32] (03PS2) 10RobH: kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 [18:34:01] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 (owner: 10RobH) [18:34:07] well shit [18:34:28] thcipriani: ^ so i resubmitted that just now and it tested fast this time [18:34:41] but same odd error output [18:34:52] (i can stop pushing shit if you rather push your own test patch!) [18:35:25] so its fast but still failing... unless my rather simple change does have an actual typo. [18:35:54] (03Merged) 10jenkins-bot: Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [18:36:14] (03CR) 10jenkins-bot: Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [18:36:36] ugh [18:36:41] MaxSem: It's on mwdebug1002. [18:36:47] robh: hrm, the test does look like it's running correctly [18:37:51] i dont see any possible typos in my change, but i can just do a simpler one [18:38:31] =/ [18:38:47] thcipriani: so we think that CI is fine and its my patch? 
[18:39:10] (03PS4) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [18:39:38] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Reinforce LoginNotify settings https://gerrit.wikimedia.org/r/#/c/372555/ (duration: 00m 47s) [18:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:28] !log resetting permissions on stat1005:/srv/published-datasets/discovery - T173333 [18:40:28] robh: i wonder if it can be somehow the "-" in the hostname [18:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:38] T173333: Reportupdater outputs files with restricted permissions - https://phabricator.wikimedia.org/T173333 [18:40:38] we have hosts with that in the name [18:40:39] ms-be [18:40:42] ms-fe [18:40:45] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Reinforce LoginNotify settings https://gerrit.wikimedia.org/r/#/c/372555/ (duration: 00m 47s) [18:40:49] Done. [18:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:09] also the ms-be happen to be in a regez stanza in site.pp [18:41:12] i know cuz i stole i tfor this ;] [18:41:16] (03PS3) 10Volans: Updated documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/373251 [18:41:18] (03PS4) 10Volans: Add CHANGELOG file including previous releases [software/cumin] - 10https://gerrit.wikimedia.org/r/373250 [18:41:21] oh wait [18:41:21] (03PS1) 10Volans: ClusterShell transport: fix progress bar on timeout [software/cumin] - 10https://gerrit.wikimedia.org/r/373337 [18:41:24] robh: something your patch touches maybe. I just rebased one of my old operations/puppet patches and it worked ok: https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/3708/ [18:41:37] no ^ [18:41:40] in my site.pp regex [18:41:42] goddamn it [18:41:44] :) [18:41:57] cuz i 'coped' it by retyping... not actual copy paste. 
[18:42:00] (03CR) 10Volans: [C: 032] Transports: fix target management improvement [software/cumin] - 10https://gerrit.wikimedia.org/r/373272 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [18:42:00] copied even [18:42:06] well, it's not in the other kafka stanzas [18:42:11] true [18:42:12] but yea, somewhere along these things [18:42:14] but it is for ms-be.... [18:42:16] Wellll, Jenkins is out of memory - https://gerrit.wikimedia.org/r/#/c/373264 [18:42:22] ill put in the ^ and see if ti fixes... [18:42:26] phuedx: ^^ [18:42:32] Can't merge it for now. [18:43:00] !log manually running report updater for discovery golden data on stat1005 - T173333 [18:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] (03PS3) 10RobH: kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 [18:43:29] Basically every patch https://gerrit.wikimedia.org/r/#/c/373300/ [18:43:45] (03PS3) 10Niharika29: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:44:09] (03Merged) 10jenkins-bot: Transports: fix target management improvement [software/cumin] - 10https://gerrit.wikimedia.org/r/373272 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [18:44:19] (03CR) 10jerkins-bot: [V: 04-1] kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 (owner: 10RobH) [18:44:31] meh... its not me its everyone so meh [18:44:43] mutante: its not the ^ thing its ongoing issues i think... [18:44:55] cuz yeah, other kakfa lines are like the one i added. [18:45:22] jerkins is equal-opportunity jerk [18:45:38] i dont wanna blame gerrit... [18:45:52] its merely the messenger for this particular issue, hehe [18:46:02] 18:44:17 Typo found! 
[18:46:03] 18:44:17 /tmp/cache/puppet/Rakefile:179:in `block in setup_typos' [18:46:12] not sure if related though [18:46:22] volans: thats what im getting in ALL my patches [18:46:25] the typo found [18:46:54] 18:44:17 rake aborted! [18:46:54] 18:44:17 Typo found! [18:46:54] 18:44:17 /tmp/cache/puppet/Rakefile:179:in `block in setup_typos' [18:46:55] 18:44:17 Tasks: TOP => test => typos [18:47:08] (03PS3) 10MaxSem: Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) [18:47:11] My understanding is thcipriani is currently working the issue. [18:47:12] (03CR) 10MaxSem: [C: 032] Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:47:22] I think if there are issues, it'd probably be issues with zuul merger not gerrit. I'm trying to figure that out. [18:48:55] (03Merged) 10jenkins-bot: Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:49:05] (03CR) 10jenkins-bot: Add a log channel for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373172 (https://phabricator.wikimedia.org/T173888) (owner: 10MaxSem) [18:49:42] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3546612 (10aaron) [18:49:44] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:01] jdlrobson: https://gerrit.wikimedia.org/r/#/c/373292/ is on mwdebug1002. 
[18:50:09] (03CR) 10jerkins-bot: [V: 04-1] Updated documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/373251 (owner: 10Volans) [18:50:10] (03CR) 10jerkins-bot: [V: 04-1] ClusterShell transport: fix progress bar on timeout [software/cumin] - 10https://gerrit.wikimedia.org/r/373337 (owner: 10Volans) [18:50:13] (03CR) 10jerkins-bot: [V: 04-1] Add CHANGELOG file including previous releases [software/cumin] - 10https://gerrit.wikimedia.org/r/373250 (owner: 10Volans) [18:50:15] brion: https://gerrit.wikimedia.org/r/#/c/373126/ is on mwdebug1002. [18:50:25] woot! [18:50:31] lemme test [18:50:34] robh: lol, i found it! [18:50:40] ? [18:50:42] it's a typo in the comment line [18:50:45] "kakfa" [18:50:54] and "kakfa" is in the "typos" file [18:50:56] i dont see how a typo in a comment line would matter? [18:50:58] oh [18:51:01] because it was a common typo [18:51:07] but its a comment? [18:51:15] comments shouldnt fail anything, ever [18:51:21] not for typos. [18:51:25] it doesn't care, it just sees the string "kakfa" and notices that is listed in the typos file [18:51:26] (03PS1) 10Jforrester: MetaContactPages: Temporarily de-require the trademark request's ProposedUse field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373339 (https://phabricator.wikimedia.org/T173839) [18:51:39] # kakfa-jumbo nodes set to role spare until analytics pushes into service T167992 [18:51:40] T167992: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992 [18:51:44] yeah... 
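
The check described just above can be sketched in a few lines. This is a minimal Python illustration, not the actual Rakefile task from the puppet repo (that task is Ruby, and its implementation is not shown in this log). The function name `find_typos`, the in-memory lists, and the second sample line are assumptions made for the example. It does plain substring matching with no awareness of comments, which is why a misspelling in a `#` comment line is enough to fail the run.

```python
from typing import Iterable, List, Tuple


def find_typos(lines: Iterable[str], typos: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, typo) for each listed string found in a line.

    Hypothetical helper: plain substring search, comment lines are not special.
    """
    listed = [t.strip() for t in typos if t.strip()]
    hits = []
    for number, line in enumerate(lines, start=1):
        for typo in listed:
            if typo in line:
                hits.append((number, typo))
    return hits


# Stand-in for the repo's "typos" file; only "kakfa" is known from the log.
typos_file = ["kakfa\n"]

# The first line is the comment quoted in the log; the second is a made-up
# corrected line, included to show that the right spelling does not match.
manifest = [
    "# kakfa-jumbo nodes set to role spare until analytics pushes into service T167992\n",
    "# kafka-jumbo nodes set to role spare\n",
]

hits = find_typos(manifest, typos_file)
for number, typo in hits:
    print(f"Typo found! line {number}: {typo}")
```

Unlike the CI output quoted in the log, this sketch reports the line number, which is the detail that was missing while hunting for the offending string.
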
[18:52:06] well, fixed, trying [18:52:10] !log niharika29@tin Synchronized php-1.30.0-wmf.15/extensions/LoginNotify/: Log everything T173888 (duration: 00m 48s) [18:52:17] (03PS4) 10RobH: kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:23] T173888: LoginNotify not working for everyone apparently - https://phabricator.wikimedia.org/T173888 [18:52:49] mutante: if this works id say i owe you a beer but we dont really keep track of that between us heh [18:52:53] Niharika: Can I squeeze a last-minute one in? [18:53:02] since there is simply beer around when we hang. [18:53:02] so yea, when it says "typo detected" it means "found a string that is in the file called 'typos' in the repo root dir" [18:53:09] mutante: goddamn it that was it [18:53:13] Niharika: Quick fix for a broken on-wiki form for Legal. :-( [18:53:17] you fucking genius [18:53:19] =] [18:53:20] robh: :) [18:53:22] ^ [18:53:29] will second "you fucking genius" :P [18:53:32] James_F: I've got about 4 patches still to go thanks to Zuul being too slow. If I get time, I'll ping you. [18:53:38] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3546633 (10Krinkle) [18:53:41] Niharika: OK, thanks! [18:53:45] i mean, holy shit that just totally stopped my day [18:53:48] =P [18:53:52] * thcipriani was elbow deep in docker [18:53:55] hmmmm, not seeing updated on commons using mwdebug1002... 
but that could be RL caching [18:54:02] !log ppchelko@tin Started deploy [changeprop/deploy@1998c10]: [Config] Disable mobile rerenders for non-wikipedia domains [18:54:06] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3546634 (10aaron) p:05Triage>03Low [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:19] MaxSem: https://gerrit.wikimedia.org/r/#/c/373172/ is on mwdebug1002. [18:54:23] Can you test? [18:54:54] jdlrobson: There? [18:55:35] !log ppchelko@tin Finished deploy [changeprop/deploy@1998c10]: [Config] Disable mobile rerenders for non-wikipedia domains (duration: 01m 33s) [18:55:38] yup [18:55:42] Niharika, no rashie but it takes time for logs to actually appear so go ahed [18:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:56] combine a ci issue with a typo and it just causes confusion. [18:56:03] No idea what "rashie" is. :P [18:56:18] *crashie [18:56:34] rashies are the result of crashies on the roadway. [18:56:44] ;] [18:56:51] Niharika: am i up on mwdebug? [18:56:59] jdlrobson: You are. [18:57:00] * robh just gave himself bad flashback to having road rash a long time ago from crashing a bike [18:57:04] Niharika: on it [18:57:11] Niharika: both? [18:57:13] > Niharika> jdlrobson: https://gerrit.wikimedia.org/r/#/c/373292/ is on mwdebug1002. [18:57:35] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Log channel for LoginNotify T173888 (duration: 00m 47s) [18:57:40] (03CR) 10Niharika29: "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:43] T173888: LoginNotify not working for everyone apparently - https://phabricator.wikimedia.org/T173888 [18:57:43] (03CR) 10Niharika29: [C: 032] Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [18:58:07] (03CR) 10Niharika29: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [18:59:03] Niharika: you can sync that change [18:59:15] jdlrobson: Ack! [18:59:58] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3546648 (10Bugreporter) Added some more users who should be aware of the issue. Feel free to remove. [19:00:05] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T1900). [19:00:45] (03PS5) 10RobH: kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 [19:00:50] !log niharika29@tin Synchronized php-1.30.0-wmf.14/extensions/MobileFrontend/: Verify the existence of key when parsing lang objects https://gerrit.wikimedia.org/r/#/c/373292/ (duration: 00m 49s) [19:00:51] thcipriani: Gimme a few more minutes. [19:00:54] Sorry. :( [19:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:13] Niharika: no problem, please ping me when SWAT is complete [19:01:29] brion: Any luck? [19:01:57] Niharika: not so far; may be stuck in RL cache but it should have expired out by now... 
[19:02:19] the main thing i want to try is in safari/edge/ie so i have to kind of fake around testing it in chrome with the debug ext :D [19:02:41] at least it's not breaking, so that's good :D [19:02:43] (03PS4) 10Niharika29: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [19:02:45] Aha. [19:02:47] :) [19:03:36] (03CR) 10Niharika29: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [19:03:41] (03CR) 10Niharika29: [C: 032] Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [19:03:54] (03CR) 10RobH: [C: 032] kafka-jumbo100[1-6].eqiad install params [puppet] - 10https://gerrit.wikimedia.org/r/373334 (owner: 10RobH) [19:03:56] (03CR) 10Niharika29: [C: 032] pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [19:03:58] Well that's really funny [19:04:12] LoginNotify is sending failure notices to the job runners appearently [19:04:31] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.08.23/mediawiki?id=AV4Qd-t5jtdAhJ4FLvyt&_g=() [19:05:12] (03Merged) 10jenkins-bot: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [19:05:16] bawolff: Also see https://grafana-admin.wikimedia.org/dashboard/db/loginnotify?refresh=10s&orgId=1&from=now-2d&to=now It is sending notices but apparently not for everyone. 
[19:05:18] (03Merged) 10jenkins-bot: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [19:05:26] Niharika: is https://gerrit.wikimedia.org/r/#/c/373292/ synced? [19:06:13] (03CR) 10jenkins-bot: Remove CiteThisPage from blacklist for page previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373300 (https://phabricator.wikimedia.org/T173865) (owner: 10Jdlrobson) [19:07:33] jdlrobson: Yup. [19:07:40] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3546719 (10Fjalapeno) Access for both is approved, thanks! [19:07:45] jdlrobson: phuedx: Your config changes are on mwdebug1002. [19:08:01] Niharika: mhmm something has gone wrong then [19:08:06] (03PS2) 10Volans: ClusterShell transport: fix progress bar on timeout [software/cumin] - 10https://gerrit.wikimedia.org/r/373337 [19:08:08] (03PS4) 10Volans: Updated documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/373251 [19:08:10] (03PS5) 10Volans: Add CHANGELOG file including previous releases [software/cumin] - 10https://gerrit.wikimedia.org/r/373250 [19:08:15] jdlrobson: Something broke? It was only for wmf14 right? [19:08:21] nothing broke [19:08:29] it was just supposed to fix something but doesn't seem to have fixed it which is odd [19:08:32] config change is working fine though [19:08:44] ugh i can't even find the damn scripts in the script debugger. wtf? [19:09:32] jdlrobson: I'll sync that one. [19:09:46] Or are they connected? [19:09:47] Niharika: ok good news! it's working now, may have expired at last [19:10:07] brion: Yay! I'll put the wmf15 one on mwdebug1002 too. Gimme a sec. [19:10:39] brion: It's there. [19:11:45] Niharika: you can sync them both [19:11:57] please sync them both [19:11:58] phuedx: You there? 
[19:12:10] Niharika: still unchanged on mediawiki.org but i'll chalk that up to the RL caching expiry. code's the same otherwise between .14 and .15 so should be good to go [19:12:15] !log niharika29@tin Synchronized php-1.30.0-wmf.14/extensions/TimedMediaHandler/: Enable WebM playback via ogv.js https://gerrit.wikimedia.org/r/#/c/373126/ (duration: 00m 50s) [19:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:36] Niharika: can you confirm if https://gerrit.wikimedia.org/r/#/c/373292/ is synced i'm now very confused [19:12:53] !log niharika29@tin Synchronized php-1.30.0-wmf.14/extensions/MobileFrontend/: Verify the existence of key when parsing lang objects https://gerrit.wikimedia.org/r/#/c/373292/ (duration: 00m 49s) [19:12:59] ok cool :) [19:13:09] i was worried you were reverting it or something [19:13:15] :) [19:13:22] !log niharika29@tin Synchronized php-1.30.0-wmf.15/extensions/TimedMediaHandler/: Enable WebM playback via ogv.js https://gerrit.wikimedia.org/r/#/c/373126/ (duration: 00m 49s) [19:13:29] 10Operations, 10MediaWiki-Platform-Team, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3546744 (10aaron) a:05aaron>03None [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:32] brion: Both your changes are now live. [19:13:52] thx! [19:14:31] phuedx: !! [19:15:08] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove CitThisPage from blacklist for page previews T173865 (duration: 00m 47s) [19:15:17] jdlrobson: ^ [19:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:19] T173865: Remove Special:CiteThisPage from previews blacklist - https://phabricator.wikimedia.org/T173865 [19:16:24] Okay, phuedx, I'm reverting your patch. 
[19:16:27] (03CR) 10Volans: [C: 032] "self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/373337 (owner: 10Volans) [19:16:51] (03PS1) 10Niharika29: Revert "pagePreviews: Enable A/B test (BC-only)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373346 [19:16:56] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3546755 (10madhuvishy) Current status: We are not really sure why the disk shelves don't show up. As the next step, @Cmjohnson will try and call HP s... [19:17:58] (03CR) 10Niharika29: [C: 032] Revert "pagePreviews: Enable A/B test (BC-only)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373346 (owner: 10Niharika29) [19:18:20] (03Merged) 10jenkins-bot: ClusterShell transport: fix progress bar on timeout [software/cumin] - 10https://gerrit.wikimedia.org/r/373337 (owner: 10Volans) [19:18:45] (03CR) 10Volans: [C: 032] Updated documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/373251 (owner: 10Volans) [19:19:21] thanks Niharika [19:19:33] You're welcome! [19:19:38] Niharika: what's the issue with phuedx patch? [19:19:42] !log mwscript sql.php --wiki=testwiki /srv/mediawiki/php/maintenance/archives/patch-ip_changes.sql [19:19:49] jdlrobson: That he's not here. :) [19:19:50] i can test that. It's super late for him [19:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:52] I reverted it. [19:19:57] Too late though. [19:20:04] thcipriani: I'm done. [19:20:06] it's 20 minutes past the end [19:20:15] Niharika: thank you! [19:20:16] MaxSem: Niharika it's a bc only change though right? 
[19:20:20] (03Merged) 10jenkins-bot: Updated documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/373251 (owner: 10Volans) [19:20:29] it doesn't even need to be in the window [19:21:05] i assume we are talking about https://gerrit.wikimedia.org/r/#/c/373264/ [19:21:07] (03PS1) 10Andrew Bogott: horizon: stop assuming that the nova controller is the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373348 [19:21:08] jdlrobson, patch authors must be present during deployment, if not they should communicate clearly who stands for them [19:21:09] (03CR) 10Herron: [C: 031] "Ah indeed. The docs are a bit confusing but in testing it appears to work as expected. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) (owner: 10Alexandros Kosiaris) [19:21:42] 10Operations, 10MW-1.30-release-notes, 10Performance-Team, 10monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3546771 (10Krinkle) This is rolling out in the wmf.15 branch this week. Keeping this task open until we've verified that the warnings are gone (whi... [19:22:04] !log foreachwiki sql.php /srv/mediawiki/php/maintenance/archives/patch-ip_changes.sql [19:22:11] musikanimal: ^ [19:22:14] 10Operations, 10MW-1.30-release-notes, 10Performance-Team, 10monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3546774 (10Krinkle) [19:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:27] yay! 
[19:22:31] thank you :) [19:25:49] (03PS1) 10Jdlrobson: Revert "Revert "pagePreviews: Enable A/B test (BC-only)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373353 [19:26:02] hey, i'm really sorry for disappearing during the window [19:26:25] my 5 year old son was reporting that he was very nearly almost blind in one eye after going out for a bike ride [19:26:46] (03PS2) 10Jdlrobson: Revert "Revert "pagePreviews: Enable A/B test (BC-only)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373353 [19:27:04] phuedx: he okay? [19:27:33] Niharika: confirmed working in safari in prod, thanks! [19:27:36] phuedx: don't worry i'll try sort this out before end of day. Maybe thcipriani wouldn't mind as part of train roll out? [19:28:29] we washed it out and he could see how many fingers i was holding up from a couple of yards away [19:28:34] he's now fast asleep [19:28:58] * thcipriani reads scrollback [19:29:44] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [19:29:45] ^ Niharika: totally mibad and i understand why you reverted [19:30:04] i closed my laptop and sprinted up the stairs tbh [19:30:46] (03PS1) 10Smalyshev: [WIP] Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 [19:31:12] brion: Niharika hrm, I'm looking at tin and there may be a change to TimedMediaHandler that may not be checked out for wmf.15 [19:31:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (owner: 10Smalyshev) [19:31:40] thcipriani: did it go in too late? [19:31:46] > Enable WebM playback via ogv.js [19:31:57] brion: I don't know what you mean? 
[19:32:05] PROBLEM - puppet last run on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:11] well it's live on production .14 [19:32:13] looks like it was fetched, but the submodule wasn't updated [19:32:14] and got merged to .15 [19:32:15] PROBLEM - DPKG on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:16] ah [19:32:20] ah, that's what I'm seeing :) [19:32:27] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:27] PROBLEM - dhclient process on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:34] PROBLEM - Check systemd state on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:38] brion: want to test it anywhere? Or is it fine to deploy? [19:32:44] PROBLEM - configured eth on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:45] PROBLEM - Disk space on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:32:54] thcipriani: already tested on .14 in debug & prod, it's good to go [19:33:08] (03PS2) 10Smalyshev: [WIP] Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T157676) [19:33:11] okie doke, /me deploys [19:33:17] no other changes to tmh this week :) [19:33:18] thx [19:33:27] alright -- out! 
[19:33:47] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T157676) (owner: 10Smalyshev) [19:35:08] (03PS3) 10Smalyshev: [WIP] Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T157676) [19:35:51] * jdlrobson waits till things quieten down :) [19:36:20] !log thcipriani@tin Synchronized php-1.30.0-wmf.15/extensions/TimedMediaHandler: [[gerrit:373127|Enable WebM playback via ogv.js]] T172444 (duration: 00m 50s) [19:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:33] T172444: Enable WebM playback for ogv.js video player shim - https://phabricator.wikimedia.org/T172444 [19:37:26] jdlrobson: what did you want me to deploy? :) [19:37:42] thcipriani: https://gerrit.wikimedia.org/r/#/c/373353/ to beta cluster [19:37:55] i need to unblock an A/b test for fundraising [19:38:03] ah, yeah, np [19:38:04] by verifying something works on the beta cluster [19:38:09] thanks thcipriani appreciate it [19:38:25] (03CR) 10Thcipriani: [C: 032] Revert "Revert "pagePreviews: Enable A/B test (BC-only)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373353 (owner: 10Jdlrobson) [19:39:17] (03PS1) 10Andrew Bogott: just trying to get the puppet compiler to find profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/373355 [19:39:53] jdlrobson: I'll just sync it when it's merged to make sure mw-config on tin is clean, beta-scap-eqiad'll get it out on beta cluster when it runs #ThingsYouProbablyAlreadyKnew :) [19:39:55] (03Merged) 10jenkins-bot: Revert "Revert "pagePreviews: Enable A/B test (BC-only)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373353 (owner: 10Jdlrobson) [19:40:31] (03CR) 10Andrew Bogott: [V: 032 C: 032] just trying to get the puppet compiler to find profile::openstack::main::observer_password [labs/private] - 
10https://gerrit.wikimedia.org/r/373355 (owner: 10Andrew Bogott) [19:40:42] jdlrobson: hey...did you have mobile frontend change on wmf.14? [19:40:50] thcipriani: i did but it should be synced [19:40:55] (as part of last swat) [19:41:08] it looks like it was fetched but the submodule wasn't updated for wmf.14 [19:41:21] ahhh that explains all the confusion ive been having with another dev :) [19:41:33] do you have a few to check it on mwdebug? [19:41:38] lemme set it up [19:41:41] thcipriani: of course [19:42:05] jdlrobson: should be on mwdebug1002 now [19:42:06] thcipriani: ready when you are [19:42:17] HURRAH [19:42:20] (03CR) 10Andrew Bogott: [C: 032] horizon: stop assuming that the nova controller is the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373348 (owner: 10Andrew Bogott) [19:42:20] !log joal@tin Started deploy [analytics/refinery@f467ce1]: Bug fixing deploy [19:42:20] that time it worked :) [19:42:26] thcipriani: you can sync that :) [19:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:32] * thcipriani does [19:42:54] ^ cc Niharika [19:43:20] (it looks like it was fetched but the submodule wasn't updated for wmf.14) [19:43:25] PROBLEM - MegaRAID on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:43:32] Oh. That explains it. I'm sorry! [19:44:17] madhuvishy: RAID on labstore2001 fyi^ [19:44:20] Niharika: no worries. 
i can't swat so i have none of your superpowers :) [19:44:46] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/extensions/MobileFrontend: [[gerrit:373292|Verify the existence of `url` key when parsing lang objects]] T172316 (duration: 00m 56s) [19:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:58] T172316: Notice: Undefined index: in SpecialMobileLanguages.php on line 68 - https://phabricator.wikimedia.org/T172316 [19:44:59] ^ jdlrobson live now [19:45:32] Niharika: one thing that's helpful is setting status.submoduleSummary = true in your ~/.gitconfig it'll show you all the submodule updates you merged that you still need to run git submodule update for [19:45:35] !log joal@tin Finished deploy [analytics/refinery@f467ce1]: Bug fixing deploy (duration: 03m 16s) [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:58] thcipriani: w00t and fixed for real this time [19:46:13] thcipriani: Noted. Thank you. [19:46:59] thcipriani: and the beta cluster change is also merged? 
[19:47:31] chasemp: working on it [19:47:34] jdlrobson: ah, right, yes, need to sync, but it is merged and next run of beta-scap-eqiad'll get it out on eta [19:47:36] *beta [19:48:26] (03CR) 10Smalyshev: [C: 031] wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [19:48:58] (03PS1) 10RobH: setting kafka-jumbo100[1-6].eqiad.wmnet dns [dns] - 10https://gerrit.wikimedia.org/r/373357 (https://phabricator.wikimedia.org/T167992) [19:49:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta-only change: [[gerrit:373264|pagePreviews: Enable A/B test (BC-only]] (duration: 00m 47s) [19:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:58] okie doke: train time [19:50:43] (03CR) 10RobH: [C: 032] setting kafka-jumbo100[1-6].eqiad.wmnet dns [dns] - 10https://gerrit.wikimedia.org/r/373357 (https://phabricator.wikimedia.org/T167992) (owner: 10RobH) [19:51:42] thanks thcipriani :) [19:52:09] madhuvishy: cool :) [19:54:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3546880 (10Cmjohnson) h/w log shows Record: 16 Date/Time: 08/15/2017 15:55:29 Source: system Severity: Critical Description: CPU 1 machine check error detecte... [19:55:34] PROBLEM - Check the NTP synchronisation status of timesyncd on labstore2001 is CRITICAL: Return code of 255 is out of bounds [19:57:23] AaronSchulz: did you need that jobrunner patch to go out before train? 
[19:58:25] PROBLEM - Host labvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [19:59:54] PROBLEM - IPMI Temperature on labstore2001 is CRITICAL: Return code of 255 is out of bounds [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T2000). [20:00:14] no parsoid deploy today [20:03:25] PROBLEM - Host labvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:05:24] RECOVERY - Host labvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:06:00] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1078 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373306 (https://phabricator.wikimedia.org/T173365) (owner: 10Jcrespo) [20:06:06] (03PS2) 10Jcrespo: mariadb: Pool db1078 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373306 (https://phabricator.wikimedia.org/T173365) [20:06:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:08:22] !log thcipriani@tin Started scap: [[gerrit:373351|Revert ProofreadPage to db7507246665e69384c1d92af2aedc62263a5116 for wmf.15]] T173520 [20:08:31] hmm, seems we still have an expensive query somewhere to be tracked down and limited [20:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:33] T173520: Fatal error: Stack overflow in [files] for wmf.14 - https://phabricator.wikimedia.org/T173520 [20:08:34] RECOVERY - Host labvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [20:10:47] thcipriani: ping me when finished [20:10:54] jynus: will do [20:12:33] !log thcipriani@tin Finished 
scap: [[gerrit:373351|Revert ProofreadPage to db7507246665e69384c1d92af2aedc62263a5116 for wmf.15]] T173520 (duration: 04m 11s) [20:12:41] jynus: all yours [20:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:47] thanks [20:13:53] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1078 with low weight (duration: 00m 47s) [20:14:02] (03PS6) 10Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) [20:14:04] (03PS1) 10Jcrespo: mariadb: Repool db1078 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373363 (https://phabricator.wikimedia.org/T173365) [20:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:36] there seems to be a spike of errors on mwdebug-mediawiki [20:15:56] seems now gone [20:16:23] jynus: where are you seeing this? [20:17:22] DBQuery log [20:17:39] (the one you took out of fatalmonitor by accident) [20:17:41] :-) [20:18:30] oh good [20:21:20] (03PS1) 10Thcipriani: group1 wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373366 [20:21:22] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373366 (owner: 10Thcipriani) [20:24:38] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373366 (owner: 10Thcipriani) [20:25:19] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.15 [20:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:06] !log thcipriani@tin Synchronized php: group1 wikis to 1.30.0-wmf.15 (duration: 00m 46s) [20:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:47] (03CR) 10Jdlrobson: pagePreviews: Enable A/B test 
(BC-only) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [20:30:23] (03CR) 10Volans: "Replies inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [20:32:44] (03CR) 10Volans: [C: 032] Add CHANGELOG file including previous releases [software/cumin] - 10https://gerrit.wikimedia.org/r/373250 (owner: 10Volans) [20:34:04] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3547014 (10Shilad) Yes! I updated [[ https://wikitech.wikimedia.org/wiki/Help:SSH | Help:SSH ]] to indicate that DSA is being phased o... [20:35:21] (03Merged) 10jenkins-bot: Add CHANGELOG file including previous releases [software/cumin] - 10https://gerrit.wikimedia.org/r/373250 (owner: 10Volans) [20:36:37] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3547016 (10herron) 05Open>03Resolved Great! Glad to hear it! [20:37:44] (03PS5) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) [20:42:26] (03PS1) 10Jforrester: MetaContactPages: Re-require the trademark request's ProposedUse field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373368 [20:42:28] (03PS1) 10Jforrester: MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 [20:44:58] (03PS2) 10Ottomata: Initial commit of certpy [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [20:45:31] (03CR) 10Ottomata: "New patch up, but not yet ready for review. 
I haven't considered Volans comments yet, and need to update README and ca.py" [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [20:55:48] thcipriani, looks like Jenkins still feels bad [20:57:24] MaxSem: the post-merge stuff? [20:57:30] or something else? [20:57:51] * thcipriani fixes postmerge stuff [20:59:15] !log T169939: Truncating MCS tables [20:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:29] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [20:59:50] did I just jinx it? :) [21:00:19] (03CR) 10jenkins-bot: pagePreviews: Enable A/B test (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373264 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [21:00:21] greg-g: I think it's just the deadlock on deployment-tin thing, gotta do the offline/disconnect dance for a while [21:01:11] (03CR) 10jenkins-bot: Revert "pagePreviews: Enable A/B test (BC-only)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373346 (owner: 10Niharika29) [21:01:34] blugh [21:01:36] thcipriani, I see tests/gate failing [21:02:52] (03CR) 10jenkins-bot: Revert "Revert "pagePreviews: Enable A/B test (BC-only)"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373353 (owner: 10Jdlrobson) [21:03:28] MaxSem: link? 
[21:04:03] for example, https://integration.wikimedia.org/ci/job/mwext-php70-phan-jessie/3844/console just doesn't make sense [21:04:03] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:55] (03CR) 10jenkins-bot: mariadb: Pool db1078 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373306 (https://phabricator.wikimedia.org/T173365) (owner: 10Jcrespo) [21:06:23] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [21:07:10] ^ i got it [21:07:31] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373366 (owner: 10Thcipriani) [21:10:16] (03PS1) 10Volans: Upstream release 1.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/373376 [21:11:05] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3547119 (10RobH) Ok, kafka-jumbo1001 has odd issues. It is confirmed to have the correct MAC address in dhcp, as well as dns is righ... [21:11:39] > Call to deprecated function \ObjectCache::getMainStashInstance() [21:11:52] which is the function that's called from https://gerrit.wikimedia.org/r/#/c/373342/1/includes/LoginNotify.php [21:15:11] duh, I'm stupid, didn't notice patchset dependencies [21:15:56] (because they're barely visible on myscreen) [21:16:01] broken code? blame CI! ;) [21:16:10] IT'S STILL GERRIT'S FAULT! [21:16:14] :) :) [21:16:14] :) [21:16:29] hmm: https://incubator.wikimedia.org/wiki/Special:RandomRootpage [21:16:58] a query that takes forever to complete? 
[21:17:15] and then a borked message [21:17:19] * Krinkle thinks Krenair is trying to DDOS the site by posting the link here [21:17:34] * Krinkle sits back while everyone clicks the link before reading MaxSem's message [21:18:26] the borked message was what I was looking at [21:19:00] ooops [21:19:01] on my internet connection I can't always distinguish between a slow-loading page and normal pages [21:19:14] o.O [21:20:34] (03PS1) 10Andrew Bogott: puppetmaster: support extra_auth_rules to frontend and backend profiles [puppet] - 10https://gerrit.wikimedia.org/r/373378 [21:20:36] (03PS1) 10Andrew Bogott: labs puppetmaster: add auth.conf rule allowing horizon access to API [puppet] - 10https://gerrit.wikimedia.org/r/373379 (https://phabricator.wikimedia.org/T173982) [21:21:51] (03PS2) 10Volans: Upstream release 1.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/373376 [21:21:54] (03PS1) 10Chad: Releases: Proxy jenkins to main apache instance [puppet] - 10https://gerrit.wikimedia.org/r/373380 [21:22:58] (03PS1) 10Volans: Upstream release 1.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373381 [21:23:24] Krenair: nice find, I'm checking logstash now for an entry from that page [21:23:45] (03Abandoned) 10Volans: Upstream release 1.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/373376 (owner: 10Volans) [21:24:02] Seems to be coming from this one: Expectation (readQueryTime <= 5) by MediaWiki::main not met (actual: 5.514): [21:24:06] query: SELECT page_title,page_namespace FROM `page` LEFT JOIN `page_props` ON ((page_id = pp_page) AND pp_propname = 'X' ) AND pp_page IS NULL ORDER BY page_random LIMIT N [21:24:29] Seeing it regularly with actual being anywhere between 5 and 17s [21:25:01] that's a rather large range isn't it? [21:25:12] Seems to have started Aug 19, with a rapid increase today after 18:00 UTC [21:25:25] Presumably wmf.15 related? [21:25:28] are you looking just at incubatorwiki? 
[21:25:49] I'm looking at all mediawiki channel messages, but results only come from Incubatorwiki [21:25:55] this is from the Incubator extension, right? [21:26:51] hrm, wmf.15 rolled forward at 20UTC, FWIW [21:26:54] Hm.. Special:RandomRootpage exists in core, but I guess it's mostly used on incubator [21:27:08] thcipriani: Yeah, it's in a 4 hour grouping. It increased between 18:00-21:00 UTC [21:27:13] * Krinkle zooms in [21:27:20] ah [21:27:30] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: support extra_auth_rules to frontend and backend profiles [puppet] - 10https://gerrit.wikimedia.org/r/373378 (owner: 10Andrew Bogott) [21:27:37] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: add auth.conf rule allowing horizon access to API [puppet] - 10https://gerrit.wikimedia.org/r/373379 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [21:27:38] Um, gerrit wya? [21:28:05] thcipriani: Hm.. no, 5 entries in 18:00-19:00, then nothing, and more entries after 21:00-22:00 only [21:28:37] It seems Kibana doesn't interpret null properly in its graphing, so it looks like a spike from 5 to 21 hits per hour, but it's really 5 hits in 1 hour, then nothing for 3 hours, and then an hour with 21 hits. [21:28:45] but the line graph doesn't go down in between, it just connects the dots directly [21:32:01] (03CR) 10Chad: "Jenkins isn't completely set up yet, but it's in the "initial install / locked" mode so don't have to be on hand to set things up the seco" [puppet] - 10https://gerrit.wikimedia.org/r/373380 (owner: 10Chad) [21:32:58] Krinkle, RandomRootpage? [21:33:04] it's core [21:33:27] the incubator extension thing is RandomByTest [21:36:08] Krenair: maybe incubator links to it more often, or maybe the problem is that the pages on incubator... yes, it's just that that wiki has lots more /-containing page titles, so the query is bound to be slower there compared to other wikis. 
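Note on the graphing point raised at 21:28:37: the misleading "spike" comes from plotting only the hours that have hits, so the line jumps straight from one non-empty bucket to the next. A minimal sketch of the usual fix, filling missing hourly buckets with zero before plotting, is below. It is not taken from Kibana or logstash; the function name `fill_hourly_gaps` and the sample buckets are hypothetical, with the counts (5 and 21) borrowed from the discussion and the hours from the 21:28:05 message.

```python
from datetime import datetime, timedelta


def fill_hourly_gaps(counts, start, end):
    """Return (hour, hits) pairs for every hour from start to end inclusive.

    Hours with no entry in `counts` are reported as 0 instead of being
    skipped, so a line chart drops to zero rather than joining the dots.
    """
    filled = []
    hour = start
    while hour <= end:
        filled.append((hour, counts.get(hour, 0)))
        hour += timedelta(hours=1)
    return filled


# Illustrative data only (assumption): hits per hour as a sparse mapping,
# with nothing recorded for the hours in between.
sparse = {
    datetime(2017, 8, 23, 18): 5,
    datetime(2017, 8, 23, 21): 21,
}

series = fill_hourly_gaps(
    sparse, datetime(2017, 8, 23, 18), datetime(2017, 8, 23, 21)
)

for hour, hits in series:
    print(hour.strftime("%H:%M"), hits)
```

With the gaps filled, the empty hours show up as explicit zero points, which makes "two separate bursts" distinguishable from "a steady climb".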
[21:36:20] Whether or not the special page is more used there doesn't matter as much I guess. [21:36:25] AaronSchulz: Do you think it would be okay if I backport your patch regarding HTMLCacheUpdate? [21:38:00] yes [21:38:17] Thanks [21:46:04] (03PS1) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 [21:46:26] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (owner: 10Andrew Bogott) [21:50:43] (03PS2) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [21:51:03] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [21:57:12] (03PS3) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [21:57:37] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [21:59:15] (03PS4) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:00:30] (03PS1) 10RobH: adding user mnoor to ldap users in admin module [puppet] - 10https://gerrit.wikimedia.org/r/373389 (https://phabricator.wikimedia.org/T164285) [22:01:00] (03PS5) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:01:09] (03CR) 10RobH: 
[C: 032] adding user mnoor to ldap users in admin module [puppet] - 10https://gerrit.wikimedia.org/r/373389 (https://phabricator.wikimedia.org/T164285) (owner: 10RobH) [22:01:21] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:02:36] (03PS6) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:02:58] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:04:14] (03PS7) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:04:37] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:06:06] (03PS8) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:06:30] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:11:03] (03PS9) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:11:30] (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 
10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:11:43] (03PS7) 10Ladsgroup: mediawiki: Add puppetized cronjob for rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) [22:12:47] (03CR) 10Ladsgroup: "This is ready for merge" [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) (owner: 10Ladsgroup) [22:13:09] (03PS10) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:13:46] (03PS11) 10Andrew Bogott: labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) [22:15:16] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: allow puppetmaster api access to each worker [puppet] - 10https://gerrit.wikimedia.org/r/373386 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [22:19:06] (03PS2) 10Jforrester: MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 [22:19:52] (03Abandoned) 10Jforrester: MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 (owner: 10Jforrester) [22:19:57] (03Abandoned) 10Jforrester: MetaContactPages: Temporarily de-require the trademark request's ProposedUse field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373339 (https://phabricator.wikimedia.org/T173839) (owner: 10Jforrester) [22:20:10] (03Abandoned) 10Jforrester: MetaContactPages: Re-require the trademark request's ProposedUse field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373368 (owner: 10Jforrester) [22:20:20] (03Restored) 10Jforrester: MetaContactPages: Require the trademark request's Username and 
Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 (owner: 10Jforrester) [22:37:13] (03PS1) 10RobH: further tweaking of kafka-jumbo recipe [puppet] - 10https://gerrit.wikimedia.org/r/373392 [22:39:19] (03CR) 10RobH: [C: 032] further tweaking of kafka-jumbo recipe [puppet] - 10https://gerrit.wikimedia.org/r/373392 (owner: 10RobH) [22:53:08] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3471092 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts: ```... [22:59:00] 10Operations, 10Ops-Access-Requests, 10Gerrit: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3547403 (10RobH) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170823T2300). Please do the needful. [23:00:04] James_F and Niharika: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:05] Present! [23:01:44] Hey. [23:04:27] MaxSem: You wanna SWAT? :P [23:05:26] thcipriani, you merged a patch - are you deploying? [23:07:25] I can SWAT [23:07:28] Niharika: for https://gerrit.wikimedia.org/r/#/c/373324/1 is it for both .14 and .15? [23:08:18] MaxSem: I started to, but noticed you merged a few, go for it :) [23:08:27] hehe [23:08:29] alright [23:09:14] thcipriani: Ah, yeah, both. Sorry, I forgot to cherry pick that one. I can do. [23:10:28] https://gerrit.wikimedia.org/r/#/c/373397/ and https://gerrit.wikimedia.org/r/#/c/373398/ (adding to calendar now) [23:10:46] * James_F waits impatiently. 
:-) [23:10:56] (03CR) 10MaxSem: [C: 032] MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 (owner: 10Jforrester) [23:12:32] (03Merged) 10jenkins-bot: MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 (owner: 10Jforrester) [23:12:45] (03CR) 10jenkins-bot: MetaContactPages: Require the trademark request's Username and Group fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373369 (owner: 10Jforrester) [23:13:48] James_F, pulled the contact page patch on mwdebug1002 [23:14:02] MaxSem: The config one only? [23:14:45] MaxSem: Yup, works as expected. [23:15:40] MaxSem: Hmm. On second thoughts, will do a follow-up patch to drop one of those. [23:16:00] ok, I'm waiting [23:17:05] (03PS1) 10Jforrester: MetaContactPages: Don't require the trademark request's Group field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373400 [23:17:09] MaxSem: ^^ [23:17:26] (03CR) 10MaxSem: [C: 032] MetaContactPages: Don't require the trademark request's Group field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373400 (owner: 10Jforrester) [23:17:46] Despite it having a "*" in the i18n, the context makes it clearly not mandatory. [23:18:25] Uhh, why do we even mark required in messages??! [23:18:44] MaxSem: That's fixed in my patch to WikimediaMessages. [23:18:55] (03Merged) 10jenkins-bot: MetaContactPages: Don't require the trademark request's Group field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373400 (owner: 10Jforrester) [23:18:57] MaxSem: But not urgent so just going out in the train. [23:19:04] (03CR) 10jenkins-bot: MetaContactPages: Don't require the trademark request's Group field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373400 (owner: 10Jforrester) [23:20:23] MaxSem: Yeah, LGTM. [23:20:36] MaxSem: The MW patch is the one that's urgent. 
:-) [23:20:42] James_F, pulled both [23:27:23] !log maxsem@tin Synchronized wmf-config/MetaContactPages.php: https://gerrit.wikimedia.org/r/#/c/373369/ (duration: 00m 48s) [23:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:00] Thanks MaxSem! [23:28:51] !log maxsem@tin Synchronized php-1.30.0-wmf.15/resources/: https://gerrit.wikimedia.org/r/#/c/373391/ (duration: 00m 48s) [23:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:57] Niharika, pulled CodeMirror wmf.15 on mwdebug1002 [23:33:26] Hmm, it doesn't work for me. [23:34:12] Wait. [23:34:52] Na, definitely doesn't work. Caching or something? [23:35:45] Appeared now. [23:37:33] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 48.49 seconds [23:38:09] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3547472 (10Jdforrester-WMF) [23:38:55] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639607 (10Jdforrester-WMF) [23:39:03] MaxSem: Works. Sync both. [23:41:18] MaxSem: I still see a lot of "Assuming ... is from known IP since no info available" [23:41:20] !log maxsem@tin Synchronized php-1.30.0-wmf.15/extensions/CodeMirror/: https://gerrit.wikimedia.org/r/#/c/373324/ (duration: 00m 49s) [23:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:02] from your testing on mwdebug, Niharika ? 
[23:42:13] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:33] !log maxsem@tin Synchronized php-1.30.0-wmf.14/extensions/CodeMirror/: https://gerrit.wikimedia.org/r/#/c/373324/ (duration: 00m 47s) [23:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:43] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [23:42:47] MaxSem: Hmm, wait, let me try hacking into your account. [23:43:28] prolly still won't work as job runners are still running the old code [23:45:31] Doesn't seem to break anything, so sync? [23:47:57] !log maxsem@tin Synchronized php-1.30.0-wmf.15/extensions/LoginNotify/: SWAT fixes (duration: 00m 47s) [23:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:29] !log maxsem@tin Synchronized php-1.30.0-wmf.14/extensions/LoginNotify/: SWAT fixes (duration: 00m 47s) [23:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:33] Hmm, there's still plenty of no info available in the logs.