[00:02:10] !log maxsem@tin Synchronized wmf-config: Labs only (duration: 00m 30s)
[00:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:50] Seems labs files needed some love.
[00:03:22] (PS1) Gergő Tisza: Reenable $wgMWOAuthSecureTokenTransfer=true; on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/302630 (https://phabricator.wikimedia.org/T67421)
[00:04:55] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.002 second response time on port 9042
[00:06:21] MaxSem: are you done? I have something I want to deploy too
[00:06:28] done
[00:08:20] (PS2) Legoktm: De-deploy the CustomData extension [mediawiki-config] - https://gerrit.wikimedia.org/r/301892 (https://phabricator.wikimedia.org/T140847) (owner: Jforrester)
[00:08:56] (CR) Legoktm: [C: 2] De-deploy the CustomData extension [mediawiki-config] - https://gerrit.wikimedia.org/r/301892 (https://phabricator.wikimedia.org/T140847) (owner: Jforrester)
[00:09:18] (Merged) jenkins-bot: De-deploy the CustomData extension [mediawiki-config] - https://gerrit.wikimedia.org/r/301892 (https://phabricator.wikimedia.org/T140847) (owner: Jforrester)
[00:10:57] * MaxSem headbangs
[00:11:05] legoktm: let me know when you're done :)
[00:11:19] heh, I'm testing on mw1017 right now
[00:11:21] * James_F coughs about the SWAT list.
[00:11:28] legoktm: I thought we tested on mw1099?
[00:11:40] (Also, yay.)
[00:12:00] does it matter which one I use?
[00:12:07] Dereckson: Right, sorry, I had a meeting. If it's too late now I'll postpone it again (it's just a no-op change anyway)
[00:13:22] James_F: legoktm: Doesn't matter which one. But since other people may be using mw1017 for testing outside SWAT, I asked SWAT to use mw1099. While we tend to schedule deployments, just testing something or hacking something on mw1017 with no intent to deploy right away happens from time to time and it'd be annoying to be overridden by SWAT at those times.
[00:13:45] !log legoktm@tin Synchronized wmf-config: De-deploy CustomData extension - T140847 (duration: 00m 28s)
[00:13:46] T140847: De-deploy CustomData extension from WMF production - https://phabricator.wikimedia.org/T140847
[00:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:14:19] I suppose we could conventionalise "testing for deployment" entirely (beyond just SWAT), since that would reduce conflicts, since we schedule deployments.
[00:14:29] Meh. :)
[00:14:38] yeah I'm not really SWAT-ing, just deploying :P
[00:14:46] Operations, Deployment-Systems: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619#2505074 (Dzahn) @Anomie see my changes above, which i merged now. how about now?
[00:14:53] Krinkle: done!
[00:14:55] legoktm: could you ping RoanKattouw and me when you're done?
[00:15:03] legoktm: Do we remove the Use variables also?
[00:15:03] Dereckson, RoanKattouw: done
[00:15:07] Thanks
[00:15:27] * James_F nods.
[00:15:55] (PS2) Dereckson: Revert "Revert "Add $wmgEchoMentionStatusNotifications and enable it in beta labs"" [mediawiki-config] - https://gerrit.wikimedia.org/r/302377 (https://phabricator.wikimedia.org/T135717)
[00:16:02] Krinkle: those ones? no they're used by separate extensions
[00:16:07] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/302377 (https://phabricator.wikimedia.org/T135717) (owner: Dereckson)
[00:16:22] legoktm: But if they're always false, I suppose we can remove those three, no?
[00:16:29] anomie: you should try dologmsgbot from terbium again some time, i think it should work now
[00:16:31] (Merged) jenkins-bot: Revert "Revert "Add $wmgEchoMentionStatusNotifications and enable it in beta labs"" [mediawiki-config] - https://gerrit.wikimedia.org/r/302377 (https://phabricator.wikimedia.org/T135717) (owner: Dereckson)
[00:16:34] If they're not always false, then your commit message is confusing.
[00:16:35] Krinkle: uh, they're true on wikivoyages?
[00:16:44] those extensions no longer depend upon CustomData
[00:16:48] Oh, I see.
[00:16:58] The dependency changed, not the presence of the ext.
[00:17:00] got it
[00:18:05] Yup.
[00:19:22] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add $wmgEchoMentionStatusNotifications and enable it in beta labs (no-op in prod, T135717, T139623) (duration: 00m 25s)
[00:19:25] (PS9) Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[00:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:19:27] (PS1) Aaron Schulz: Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027)
[00:19:37] (CR) jenkins-bot: [V: -1] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: Aaron Schulz)
[00:19:40] Dereckson: OK, I'm here this time
[00:19:49] (CR) jenkins-bot: [V: -1] Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027) (owner: Aaron Schulz)
[00:19:51] Sorry for being absent the last couple times :/
[00:19:59] RoanKattouw: IS done
[00:20:01] now CS
[00:20:07] T139623: Create notification for successful mentions - https://phabricator.wikimedia.org/T139623
[00:20:07] T135717: Add mention failure notifications - https://phabricator.wikimedia.org/T135717
[00:20:14] Dereckson: Cluster-wide or 1099?
[00:20:23] cluster, mw1099 has the full set
[00:20:29] cluster for IS, mw1099 has the full set
[00:20:30] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Add $wmgEchoMentionStatusNotifications and enable it in beta labs (no-op in prod, T135717, T139623) (duration: 00m 26s)
[00:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:20:38] T139623: Create notification for successful mentions - https://phabricator.wikimedia.org/T139623
[00:20:38] T135717: Add mention failure notifications - https://phabricator.wikimedia.org/T135717
[00:21:37] OK, checking that it's a no-op
[00:21:40] var_dump($wmgEchoMentionStatusNotifications)
[00:21:40] bool(false)
[00:21:51] variable exists without any issue today
[00:22:29] legoktm, https://gerrit.wikimedia.org/r/#/c/302637/
[00:22:49] Dereckson: Looks good, thanks
[00:23:03] MaxSem: but I already uploaded that same change before you!!
[00:23:25] arr
[00:25:21] !log dereckson@tin Synchronized wmf-config: Add $wmgEchoMentionStatusNotifications and enable it in beta labs (no-op, sync labs files) (duration: 00m 28s)
[00:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:56] Okay we're done.
[00:26:11] 2 Notice: Undefined index: width in /srv/mediawiki/php-1.28.0-wmf.12/includes/Linker.php on line 760
[00:26:14] (and 762)
[00:28:19] Dereckson: that's known
[00:29:04] yes https://phabricator.wikimedia.org/T138987
[00:35:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.015 second response time
[00:40:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[00:45:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[00:46:13] (PS10) Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[00:46:41] (CR) jenkins-bot: [V: -1] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: Aaron Schulz)
[00:46:54] Operations, Deployment-Systems: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619#2517673 (Dzahn) i see there is still an issue .. ferm rules not on neon yet.. hmm..
[00:47:59] (PS8) Dzahn: labs: restart slapd if it uses > 50% of memory [puppet] - https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593)
[00:50:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.017 second response time
[00:50:40] (PS12) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)
[00:51:42] (CR) jenkins-bot: [V: -1] beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: Alex Monk)
[00:51:43] oh..ferm rules on icinga host are broken
[00:51:47] from a previous change
[00:51:47] (CR) Alex Monk: "PS12 is completely untested, I'll be surprised if Jenkins passes it." (8 comments) [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: Alex Monk)
[00:51:54] and that's why i can't make new changes
[00:52:01] ugh
[00:52:45] and puppet doesn't tell you with a fail
[00:53:47] (PS13) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)
[00:54:52] (CR) jenkins-bot: [V: -1] beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: Alex Monk)
[00:55:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[00:56:18] (PS11) Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[00:56:21] (PS14) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)
[00:56:47] (CR) jenkins-bot: [V: -1] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: Aaron Schulz)
[01:00:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.013 second response time
[01:02:16] (PS12) Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[01:05:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[01:10:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.017 second response time
[01:10:20] Operations, Icinga: ferm rules on icinga are broken, - https://phabricator.wikimedia.org/T141957#2517758 (Dzahn)
[01:11:10] Operations, Icinga: ferm rules on icinga are broken, - https://phabricator.wikimedia.org/T141957#2517770 (Dzahn) why "**no such variable: $EQIAD_PRIVATE_LABS_HOSTS1_A_EQIAD**" ?
[01:12:05] (PS15) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)
[01:15:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.017 second response time
[01:17:26] Operations, Icinga: ferm rules on icinga are broken, - https://phabricator.wikimedia.org/T141957#2517774 (Dzahn) related to https://gerrit.wikimedia.org/r/#/c/302463/ ?
[01:18:56] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1130.99 seconds
[01:20:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.015 second response time
[01:20:18] (CR) Dzahn: "this broke ferm rules on neon (icinga) https://phabricator.wikimedia.org/T141957" [puppet] - https://gerrit.wikimedia.org/r/302450 (https://phabricator.wikimedia.org/T141085) (owner: Gehel)
[01:25:14] Operations, Fundraising-Backlog, fundraising-tech-ops, Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2517776 (dpatrick) >>! In T135410#2493042, @CCogdill_WMF wrote: > Confirmed with IBM that the updated key works, and we'v...
[01:25:15] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[01:25:21] (PS1) Dzahn: icinga: fix nsca/firewall.pp, missing labs-hosts variable [puppet] - https://gerrit.wikimedia.org/r/302642
[01:25:41] Operations, Discovery, netops, Discovery-Search-Sprint: deploy elasticsearch/plugins to relforge1001-1002 servers - https://phabricator.wikimedia.org/T141085#2486473 (Dzahn) renaming the network broke ferm on neon (icinga) -> T141957
[01:27:31] (CR) Dzahn: [C: 2] icinga: fix nsca/firewall.pp, missing labs-hosts variable [puppet] - https://gerrit.wikimedia.org/r/302642 (owner: Dzahn)
[01:30:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.011 second response time
[01:31:39] Operations, Deployment-Systems: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619#2517783 (Dzahn) the reason why it doesn't work yet even after merge is currently T141957 an unrelated issue that prevents the new ferm rules from being applied
[01:35:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.017 second response time
[01:40:14] PROBLEM - check_payments_wiki on payments1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string OK not found on https://payments1004.frack.eqiad.wmnet:443https://payments.wikimedia.org/index.php/Special:SystemStatus - 314 bytes in 0.016 second response time
[01:44:54] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[01:45:14] RECOVERY - check_payments_wiki on payments1004 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.031 second response time
[01:59:25] (PS2) Aaron Schulz: Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027)
[02:02:15] PROBLEM - Varnishkafka log producer on cp3007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[02:15:20] testing from tin
[02:16:20] (CR) Legoktm: [C: 2] Beta: move from ores.wikimedia.org to ores-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/302406 (https://phabricator.wikimedia.org/T141825) (owner: Ladsgroup)
[02:17:02] mutante: is it okay if I sync a labs only change? will it get in the way of your testing?
[02:17:18] (PS2) Legoktm: Beta: move from ores.wikimedia.org to ores-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/302406 (https://phabricator.wikimedia.org/T141825) (owner: Ladsgroup)
[02:17:51] (CR) Legoktm: [C: 2] Beta: move from ores.wikimedia.org to ores-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/302406 (https://phabricator.wikimedia.org/T141825) (owner: Ladsgroup)
[02:17:52] legoktm: it will not get in the way. go ahead
[02:18:17] (Merged) jenkins-bot: Beta: move from ores.wikimedia.org to ores-beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/302406 (https://phabricator.wikimedia.org/T141825) (owner: Ladsgroup)
[02:19:08] eh
[02:19:09] error: insufficient permission for adding an object to repository database .git/objects
[02:19:09] fatal: failed to write object
[02:19:09] fatal: unpack-objects failed
[02:19:32] !log restarted log bot
[02:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:20:11] hm, worked second time
[02:20:57] ignore me
[02:21:21] !log legoktm@tin Synchronized wmf-config/CommonSettings-labs.php: labs-only, move from ores.wikimedia.org to ores-beta.wmflabs.org (duration: 00m 33s)
[02:22:58] Operations, Deployment-Systems: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619#2517803 (Dzahn) That issue has been fixed.. now we have these iptables rules ACCEPT tcp -- terbium.eqiad.wmnet anywhere tcp dpt:9200 ACCEPT tcp -- wasat.codfw.wmnet an...
[02:23:12] (PS16) Alex Monk: beta: Use Let's Encrypt cert [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501)
[02:23:50] Operations, Icinga: ferm rules on icinga are broken, - https://phabricator.wikimedia.org/T141957#2517808 (Dzahn) Open>Resolved a:Dzahn fixed with https://gerrit.wikimedia.org/r/#/c/302642/
[02:24:09] !log neon - restarted ferm service
[02:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:05] RECOVERY - Varnishkafka log producer on cp3007 is OK: PROCS OK: 1 process with command name varnishkafka
[02:30:19] (PS1) Dzahn: tcpircbot: allow IPv6 addresses for terbium, wasat [puppet] - https://gerrit.wikimedia.org/r/302647 (https://phabricator.wikimedia.org/T141619)
[02:30:45] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.12) (duration: 08m 34s)
[02:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:32] (CR) Dzahn: [C: 2] "i can see with tcpdump how the packets come from the v6 address and the bot doesn't accept it" [puppet] - https://gerrit.wikimedia.org/r/302647 (https://phabricator.wikimedia.org/T141619) (owner: Dzahn)
[02:34:31] (PS1) Dzahn: add mapped v6 IPs for terbium and wasat [puppet] - https://gerrit.wikimedia.org/r/302649
[02:36:42] (PS1) Gergő Tisza: Increase retries for rename jobs [puppet] - https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731)
[02:41:35] RECOVERY - salt-minion processes on ganeti1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:47:25] PROBLEM - salt-minion processes on ganeti1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:02:32] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.13) (duration: 14m 56s)
[03:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:46] (PS1) Dzahn: tcpircbot: add v6 addresses of terbium/wasat to ferm rule [puppet] - https://gerrit.wikimedia.org/r/302652 (https://phabricator.wikimedia.org/T141619)
[03:09:30] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 3 03:09:30 UTC 2016 (duration 6m 58s)
[03:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:46] (CR) Dzahn: [C: 2] tcpircbot: add v6 addresses of terbium/wasat to ferm rule [puppet] - https://gerrit.wikimedia.org/r/302652 (https://phabricator.wikimedia.org/T141619) (owner: Dzahn)
[03:11:15] (PS2) Dzahn: tcpircbot: add v6 addresses of terbium/wasat to ferm rule [puppet] - https://gerrit.wikimedia.org/r/302652 (https://phabricator.wikimedia.org/T141619)
[03:23:50] testing from terbium T141619
[03:23:50] T141619: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619
[03:24:16] now it does
[03:24:50] logging from wasat
[03:25:47] Operations, Deployment-Systems: dologmsg doesn't work on terbium - https://phabricator.wikimedia.org/T141619#2517862 (Dzahn) Open>Resolved a:Dzahn @anomie finally works now :) 20:29 < logmsgbot> testing from terbium T141619 20:29 < stashbot> T141619: dologmsg doesn't work on terbium - https:...
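Editor's note on the tcpircbot fix above: the log shows connections from terbium/wasat arriving from their IPv6 addresses and being rejected until both the ferm rule and the bot's allow-list knew about the v6 (and v4-mapped v6) forms. A minimal Python sketch of that kind of allow-list check follows. It is not tcpircbot's actual code; the addresses are documentation-range placeholders, and `ALLOWED`, `normalize` and `is_allowed` are hypothetical names chosen for illustration.

```python
import ipaddress

# Hypothetical allow-list. These are documentation-range addresses standing in
# for the real terbium/wasat IPs, which the log does not show.
ALLOWED = {
    ipaddress.ip_address("192.0.2.10"),    # placeholder IPv4 of an allowed host
    ipaddress.ip_address("2001:db8::10"),  # placeholder IPv6 of the same host
}


def normalize(peer: str):
    """Parse a peer address and unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d)."""
    addr = ipaddress.ip_address(peer)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        return addr.ipv4_mapped
    return addr


def is_allowed(peer: str, allowed=ALLOWED) -> bool:
    """Return True if the peer address, in any of its forms, is on the list."""
    return normalize(peer) in allowed
```

On a dual-stack listener the same client can show up as `192.0.2.10`, `::ffff:192.0.2.10` or its native v6 address, so a list holding only the v4 form will silently drop some connections, which matches the symptom seen with tcpdump above.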
[03:29:07] Operations, Labs: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959#2517865 (chasemp)
[03:29:15] Operations, Labs: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959#2517877 (chasemp) p:Triage>Normal
[04:18:56] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail
[04:42:56] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:46:26] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[05:08:26] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:17:49] Operations, DBA, Availability: Setup automatic failover for misc database servers - https://phabricator.wikimedia.org/T141547#2517919 (Krinkle)
[05:35:23] !log krinkle@tin Synchronized php-1.28.0-wmf.13/includes/resourceloader/: I195f67d061d (duration: 00m 38s)
[05:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:36:05] !log krinkle@tin Synchronized php-1.28.0-wmf.13/autoload.php: I195f67d061d (duration: 00m 42s)
[05:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:36:36] !log krinkle@tin Synchronized php-1.28.0-wmf.13/resources/Resources.php: I195f67d061d (duration: 00m 30s)
[05:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:37:07] !log krinkle@tin Synchronized php-1.28.0-wmf.13/includes/OutputPage.php: I195f67d061d (duration: 00m 30s)
[05:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:38:02] !log krinkle@tin Synchronized php-1.28.0-wmf.13/extensions/MobileFrontend/: I195f67d061d (duration: 00m 29s)
[05:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:56:27] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/294252 (https://phabricator.wikimedia.org/T137768) (owner: KartikMistry)
[06:03:40] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/294252 (https://phabricator.wikimedia.org/T137768) (owner: KartikMistry)
[06:30:14] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:35] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:36] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:31:56] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:54] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:04] Operations, MediaWiki-extensions-UrlShortener, Traffic: Strip query parameters from w.wiki domain - https://phabricator.wikimedia.org/T141170#2518039 (Legoktm) Resolved>Open @BBlack this doesn't seem to be working yet? https://w.wiki/?search=X still shows search results...and
PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[06:41:06] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[06:44:53] Dereckson It seems to be directly caused by general lag, but I cannot detect a specific query causing it (there were no long-running writes at that time)
[06:55:36] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:45] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:45] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:55] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:57:15] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:34] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:45] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:37] (PS2) Jcrespo: Add datacenter to lag checks [puppet] - https://gerrit.wikimedia.org/r/302469 (https://phabricator.wikimedia.org/T114752)
[07:05:12] (CR) Jcrespo: [C: 2] Add a field to pt-heartbeat to monitor different datacenters [puppet/mariadb] - https://gerrit.wikimedia.org/r/302426 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[07:06:23] (03PS3) 10Jcrespo: Add datacenter to lag checks [puppet] - 10https://gerrit.wikimedia.org/r/302469 (https://phabricator.wikimedia.org/T114752) [07:06:40] (03PS4) 10Jcrespo: Add datacenter to lag checks [puppet] - 10https://gerrit.wikimedia.org/r/302469 (https://phabricator.wikimedia.org/T114752) [07:08:28] (03CR) 10Jcrespo: [C: 032] Add datacenter to lag checks [puppet] - 10https://gerrit.wikimedia.org/r/302469 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [07:12:18] So replication now is falling back to Seconds_Behind_Master [07:13:36] I will now manually restart heartbeat on the masters, one by one [07:29:04] !log restarting pt-heartbeat-wikimedia on all database masters [07:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:50:09] (03Abandoned) 10DCausse: Enable continuous sanity check on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297459 (owner: 10DCausse) [07:58:01] I am now going to lag a s1 on codfw to check the alerts are working (it won't show here, although it could show on the logs) [08:07:08] !log stopping replication to s2-master-codfw (db2017) to test replication alerts [08:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:37] it works as intended, codfw slaves have Seconds_Behind_Master = 0, but now are detected as lagged because the primary datacenter is eqiad, not codfw [08:12:35] !log restarting slave on db2017 [08:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:13:14] I will now test an eqiad master and an eqiad slave [08:14:48] !log stopping replication to db1024 (depooled) to test replication alerts [08:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:48] Also works [08:21:27] !log restarting replication on db1024 [08:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:22:48] 
the last test it stopping s2-master replication from codfw- with the current setup, a master on the primary datacenter will not complain of "lag" [08:23:27] we maybe can change it in the future to check the lag from the secondary datacenter? [08:26:48] (03PS1) 10Gehel: Relforge servers are running with a single master Bug: T137256 [puppet] - 10https://gerrit.wikimedia.org/r/302661 (https://phabricator.wikimedia.org/T137256) [08:27:18] !log stopping replication to db1018 (s2-master-eqiad) [08:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:57] WARNING slave_sql_state Slave_SQL_Running: No / OK slave_sql_lag Replication lag: 0.95 seconds [08:31:02] !log upgrading httpd on mw126[34] to 2.4.10-10+deb8u4+wmf3 (T73487) [08:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:08] everything looks good [08:31:54] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [08:32:00] !log restarting replication on db1018 [08:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:32:28] I think that finishes the deployment- I will check slowly misc/dbstore [08:38:33] (03CR) 10Gehel: "Thanks for the fix! And sorry for breaking this in the first place..." 
[puppet] - 10https://gerrit.wikimedia.org/r/302642 (owner: 10Dzahn) [08:39:13] (03CR) 10Gehel: [C: 032] Relforge servers are running with a single master Bug: T137256 [puppet] - 10https://gerrit.wikimedia.org/r/302661 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [08:40:29] (03CR) 10Paladox: "This defiantly improves things especially on the iPhone which I found before slow to load gerrit 2.12.2 and with this patch it now loads a" [puppet] - 10https://gerrit.wikimedia.org/r/301898 (https://phabricator.wikimedia.org/T141065) (owner: 10Chad) [08:41:16] (03PS7) 10Paladox: Gerrit: Support having phab commits as links [puppet] - 10https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) [08:41:47] (03PS8) 10Paladox: Gerrit: Support having phab commits as links [puppet] - 10https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) [08:42:45] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 0, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 0, initializing_shard [08:43:13] !log upload scap 3.2.2-1 to carbon T127762 [08:43:14] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [08:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:43:27] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 0, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 
100.0, active_shards: 0, initializing_shard [08:43:42] gehel: ^ it's you? [08:43:57] dcausse: yep [08:44:02] \o/ :) [08:44:20] dcausse: I'm still checking, but cluster is green and looks OK... [08:44:28] nice thanks! [08:45:34] dcausse: my pleasure! It still need an LVS, but it *should* already be useable. Let me know if you find things that don't work... [08:45:46] sure! [08:46:35] (03CR) 10Paladox: [C: 031] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [08:47:46] (03CR) 10Paladox: [C: 031] Phab: Set origin's URL to phab not gerrit [puppet] - 10https://gerrit.wikimedia.org/r/301863 (owner: 10Chad) [08:50:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: Clones from git.wikimedia.org are not redirected - https://phabricator.wikimedia.org/T139206#2518214 (10Paladox) This will require us to generate another list to go on due to the difference in the names. [08:53:12] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2518222 (10ema) Hi! The document [[ https://phabricator.wikimedia.org/L3 | Acknowledgement of Wikimedia Server Access Re... [08:53:51] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2518224 (10fgiunchedi) [08:54:59] gehel elukey would you have time today/tomorrow to talk about moving jmxtrans from statsd to graphite vis-a-vis the jmxupgrade upgrade? T136405 [08:55:00] T136405: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405 [08:55:22] godog: sure, whenever you want... [08:55:43] +1, I have only to run errand a couple of hours for lunch, the rest is o [08:55:46] *ok [08:56:34] elukey: you're running right now? [08:56:37] ok! 
so ATM jmxtrans is using statsd, though it doesn't seem needed as all stats are sent as gauges to statsd and the metric names are distinct per-host, correct? [08:56:52] godog: yep [08:57:02] gehel: nope in a couple of hours [08:57:44] ok, so let's talk :) IRC or hangout? [08:58:21] easier here I think! [08:58:40] ok! [08:58:56] so yeah basically I wanted to propose to move to graphite line-oriented protocol instead of statsd for jmxtrans [08:59:29] so afaiu jmxtrans in this case just forwards data to statsd without leveraging any aggregation or other functionality that we might need [08:59:47] so it would be worth to just send data to graphite directly right? [09:00:08] (03CR) 10Ema: [C: 031] varnish: Remove outdated comment in setup_filesystem about bits [puppet] - 10https://gerrit.wikimedia.org/r/302611 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [09:00:11] that's my understanding. I don't think statsd adds any value here [09:00:33] yeah that was my impression too, all gauges and all distinct metric names [09:01:34] so the main-codfw kafka cluster (eventbus codfw) could be a good place to test the change [09:01:53] it is not serving traffic but it is complex enough to have a good testbed [09:02:01] (at least, imho) [09:02:42] this is also assuming that through hiera/puppet we can configure jmxtrans to push data to graphite directly [09:02:58] just removing statsd from the picture should be easy enough. We might need to modify naming to reuse the same graphite timeseries and not create new ones, but that's trivial [09:03:31] * gehel is opening puppet-jmxtrans... [09:04:22] we have jmxtrans deployed on each servers, not a centralized jmxtrans, right? [09:04:29] sounds good to me too to start from main-codfw kafka [09:04:33] yeah I think so gehel [09:04:52] any idea why? 
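For reference, the two wire formats being compared above can be sketched as follows. This is a minimal illustration, not jmxtrans code: a statsd gauge is sent as `name:value|g`, while Graphite's plaintext (line-oriented) protocol expects `path value timestamp` followed by a newline. The metric name used in the usage note is hypothetical and was not taken from the log.

```python
# Minimal sketch of the two wire formats discussed above.
# Assumption: these helper names are illustrative only; they are not
# part of jmxtrans, statsd, or Graphite.

def statsd_gauge(name: str, value: float) -> str:
    # statsd gauge datagram: <name>:<value>|g
    return f"{name}:{value}|g"

def graphite_line(path: str, value: float, timestamp: int) -> str:
    # Graphite plaintext protocol: <path> <value> <unix timestamp>, newline-terminated
    return f"{path} {value} {timestamp}\n"
```

With a hypothetical per-host metric such as `kafka.kafka1001.BytesInPerSec`, both helpers carry the same name and value; the Graphite line additionally carries an explicit timestamp. Since the stats are all gauges with distinct per-host names, no statsd-side aggregation would be lost by sending the line format directly.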
[09:05:26] 06Operations, 10DBA: Display that lag on tendril and dbtree from pt-heartbeat instead of Seconds_Behind_Master - https://phabricator.wikimedia.org/T141968#2518246 (10jcrespo) [09:06:12] I'm guessing not to have to think about jmx over the network but only over loopback [09:06:35] I don't know the full context though [09:06:48] I'm going to reuse https://phabricator.wikimedia.org/T73322 for this [09:07:35] 06Operations, 10DBA, 13Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends - https://phabricator.wikimedia.org/T114752#2518262 (10jcrespo) 05Open>03Resolved Resolving this, T141968 will handle tendril/dbtree and T126757, other moni... [09:08:03] 06Operations, 10DBA: Display that lag on tendril and dbtree from pt-heartbeat instead of Seconds_Behind_Master - https://phabricator.wikimedia.org/T141968#2518246 (10jcrespo) p:05Normal>03Low [09:08:17] it might make sense to upgrade to latest jmxtrans version at the same time (or at least do it at some point) [09:08:43] but the current .deb published by jmxtrans are crap (my fault), so are the startup scripts... [09:08:58] definitely, how are we currently deploying jmxtrans? [09:09:12] I think we have an in-house .deb [09:10:05] https://github.com/wikimedia/operations-debs-jmxtrans [09:11:42] gehel: yes we have jmxtrans deployed on all the hosts [09:11:50] (sorry I lagged) [09:12:08] elukey: np! [09:12:27] godog: how can I help move that forward? [09:13:07] so on kafka1001 we have jmxtrans (242-1) unstable; urgency=low [09:13:21] and from the changelog/version it doesn't seem in house [09:15:30] elukey: changelog indicate that the package was created by Nik [09:15:32] 242-1 (/var/lib/apt/lists/apt.wikimedia.org_wikimedia_dists_jessie-wikimedia_thirdparty_binary-amd64_Packages) (/var/lib/dpkg/status) [09:15:40] Nik Everett -> that looks in house... 
[09:16:04] ah snap sorry I didn't see the @wikimedia [09:16:32] should we first get rid of statsd and then upgrade? [09:17:09] is the jmxtrans version that we are running going to support the new working scenario correctly? [09:17:10] I'm spending some of my free time trying to package jmxtrans into something that might be accepted by Debian, but it is not done yet :( [09:17:28] I guess so but if you tell me that the new version is INCREDIBLY better we could upgrade first [09:17:40] elukey: that old version should be mostly ok. Or at least not worse than it is with statsd. [09:18:39] if the new version is better then we could do the work of packaging, upgrading and then see the differences.. if all is stable, we could switch statsd to graphite [09:18:48] godog: suggestions? [09:19:40] we can do either way, I am just thinking if switching with a better codebase would avoid any issues [09:19:50] s/any/some/ [09:20:12] there are tons of fixes in the latest jmxtrans version, but it is also quite a bit different in terms of structure, and some of it is not as well tested as the old ugly codebase [09:20:54] I really think we should upgrade, but we need to be prepared to discover some bugs in the process. [09:22:32] sorry home adsl outage :( [09:23:24] gehel: sure, big upgrades always bring some noise :) [09:23:27] 06Operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor throughput towards some destinations - https://phabricator.wikimedia.org/T120425#2518282 (10Nemo_bis) Tried another: {P3628} [09:23:42] yeah I don't mind either way about the order, the new version also brings statsd fixes [09:24:12] since both ways could be followed I was trying to suggest to upgrade to measure the difference with the current settings, and then upgrade slowly with the new code if nothing comes u [09:24:16] *up [09:24:50] elukey: upgrade, then upgrade slowly? I'm missing something... 
[09:25:05] 06Operations, 10ops-codfw, 10DBA: BIOS upgrade on certain codfw machines - https://phabricator.wikimedia.org/T139714#2518289 (10jcrespo) 05Open>03Resolved a:03jcrespo Everything seems ok on the random servers I checked, it says bios upgrade, then no strange logs. [09:25:10] sounds good to me! [09:25:11] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2518294 (10fgiunchedi) see also {T73322} about switching statsd -> graphite, once the upgrade is done [09:26:41] gehel: ETOOMANYUPGRADE [09:26:52] * elukey restarts his brain [09:28:10] what I wanted to say was: upgrade the deb package on all the hosts that need it, and then see how it behaves with the current settings. This would be good for example if later on we need to roll back to a known good state.. plus it will give us a fresh codebase [09:28:32] then, second step would be to remove statsd from the equation (the second upgrade, sorry) [09:29:24] (03PS3) 10Jcrespo: Remove labsdb::manager [puppet] - 10https://gerrit.wikimedia.org/r/302427 [09:29:45] gehel: but we can also do the other way around, let me know your thoughts [09:29:55] !log applying prometheus required grants to all databases T128185 [09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:00] elukey: ok, makes sense. [09:30:07] T128185: Prepare mysql account and options for prometheus - https://phabricator.wikimedia.org/T128185 [09:30:57] elukey: The hard part (for me at least) is the packaging. It is crap and I'm not good at packaging. [09:31:53] I think that the removal of statsd is easier, but if we do run into trouble, there is no way I'm going to do a fix on that 3-year-old version. So I think your proposition makes a lot of sense. [09:32:06] yeah :D [09:32:40] gehel: if you want we can work together on the packaging, I am not good either but it might be a good place to start learning. 
Plus godog will supervise the work :P [09:32:54] sounds good! [09:33:05] heheh I'll be out next week but feel free to add me to tasks/code reviews [09:34:02] moritzm_ suggested trying to get jmxtrans into debian, but that is quite a bit of work (but would be really nice). The easy way would be to check the current .deb published by jmxtrans and see how much needs fixing... [09:34:05] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail [09:35:31] elukey: could you have a look at the debs in http://central.maven.org/maven2/org/jmxtrans/jmxtrans/259/ and tell me how broken they are? [09:35:59] ^ puppet fail on maps-test2004 seems to be back to working... [09:36:02] will try to :) [09:36:05] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:36:16] (03PS3) 10Filippo Giunchedi: puppetmaster: generate prometheus targets from ganglia [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) [09:36:18] I'll also update the task this afternoon [09:36:24] with what we have discussed [09:36:53] ok, time to get a coffee... 
let's sync tomorrow see if we managed to move forward [09:37:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "Jaime, I'm merging this so prometheus has the configuration already but polling the machines for host-level metrics will fail until we gra" [puppet] - 10https://gerrit.wikimedia.org/r/302611 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [09:37:43] (03PS2) 10Filippo Giunchedi: varnish: Remove outdated comment in setup_filesystem about bits [puppet] - 10https://gerrit.wikimedia.org/r/302611 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [09:37:46] (03CR) 10Filippo Giunchedi: [V: 032] varnish: Remove outdated comment in setup_filesystem about bits [puppet] - 10https://gerrit.wikimedia.org/r/302611 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [09:38:36] sigh wrong review, anyways I'm merging it since it is harmless [09:39:28] (03CR) 10Filippo Giunchedi: "merge comment was meant for https://gerrit.wikimedia.org/r/#/c/299539/ ! anyways change is trivial" [puppet] - 10https://gerrit.wikimedia.org/r/302611 (https://phabricator.wikimedia.org/T107430) (owner: 10Krinkle) [09:39:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "Jaime, I'm merging this so prometheus has the configuration already but polling the machines for host-level metrics will fail until we gra" [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:40:04] (03PS4) 10Filippo Giunchedi: puppetmaster: generate prometheus targets from ganglia [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) [09:40:08] (03CR) 10Filippo Giunchedi: [V: 032] puppetmaster: generate prometheus targets from ganglia [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:48:52] (03PS3) 10Filippo Giunchedi: prometheus: monitor hosts in the current site [puppet] - 
10https://gerrit.wikimedia.org/r/299540 (https://phabricator.wikimedia.org/T126785) [09:50:45] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: monitor hosts in the current site [puppet] - 10https://gerrit.wikimedia.org/r/299540 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:55:56] jynus: ack'ed for general lag issue [10:01:42] (03PS1) 10Filippo Giunchedi: prometheus: also scan for *.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/302669 [10:02:41] godog, should I install the exported on all hosts or only on jessie ones? [10:02:48] *exporter [10:03:40] jynus: jessie only, it should run on trusty too but the jessie init script doesn't work out of the box on trusty iirc [10:03:46] ok [10:04:13] will add that to the role with a jessie check [10:05:08] jynus: ok, I'm following up to add machine-level exporter to db2069 too, review coming up [10:05:31] is the machine exporter jessie-only, too? [10:05:57] it is yeah, same disclaimer as before, it should work on trusty but the init script doesn't work out of the box [10:06:03] it is ok [10:06:25] (03PS1) 10Mark Bergsma: Convert back to tabs, rename .prefs file [puppet] - 10https://gerrit.wikimedia.org/r/302670 [10:06:45] depending on the state, we will either patch the package (or add puppet on top of it) or just install it on newer hosts [10:08:21] yup, I didn't want to change the stock debian packages if possible but we should be able to work around it if we really want to have it running on trusty too [10:09:05] (03PS1) 10Filippo Giunchedi: add prometheus::node_exporter to db2069 [puppet] - 10https://gerrit.wikimedia.org/r/302671 (https://phabricator.wikimedia.org/T140646) [10:09:11] ...where is jenkins? 
[10:09:39] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2518408 (10ema) On the swift side of things: - Range requests are correctly handled, swift responds with 206 Partial Content - Swift always sends CL, which removes the need for hacks... [10:09:58] are you asking for the machine, or is CI actions not being executed? [10:10:06] (03CR) 10Mark Bergsma: [C: 032] Convert back to tabs, rename .prefs file [puppet] - 10https://gerrit.wikimedia.org/r/302670 (owner: 10Mark Bergsma) [10:10:09] mark: coffee break perhaps? it replied to https://gerrit.wikimedia.org/r/#/c/302669/ [10:10:18] ah there you go [10:10:23] yes, it did for me some hours ago too [10:10:40] just slow then [10:11:49] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2518412 (10elukey) After a long chat with upstream we decided not to go ahead w... 
[10:11:51] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: also scan for *.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/302669 (owner: 10Filippo Giunchedi) [10:11:56] (03PS1) 10Alexandros Kosiaris: hiera role_backend: Don't qualify the _roles variable [puppet] - 10https://gerrit.wikimedia.org/r/302674 [10:11:57] godog, when you have something visible, I would like to check the metrics on prometheus side, to check everything is ok [10:11:58] (03PS1) 10Alexandros Kosiaris: realm: Do not qualify realm lookups in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302675 [10:12:00] (03PS1) 10Alexandros Kosiaris: realm: Don't qualify the lookups to ::site in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302676 [10:12:02] (03PS1) 10Alexandros Kosiaris: realm: Qualify fact lookups used in assignments [puppet] - 10https://gerrit.wikimedia.org/r/302677 [10:12:04] (03PS2) 10Filippo Giunchedi: prometheus: also scan for *.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/302669 [10:12:06] (03CR) 10Filippo Giunchedi: [V: 032] prometheus: also scan for *.yaml files [puppet] - 10https://gerrit.wikimedia.org/r/302669 (owner: 10Filippo Giunchedi) [10:14:15] jynus: yup I'll be working today on lvs so we can hook it up to grafana, ATM you can ssh-tunnel port 80 to a local port and then localhost:port/ops [10:14:36] I am not that worried about grafana [10:14:49] although it would be cool [10:15:02] but a simple grep on command line would be useful [10:15:19] it seems s/grep/curl/ [10:17:50] jynus: ack, I'll work on generating the config for mysql next, ATM only host-level metrics are polled for [10:18:31] IOW once https://gerrit.wikimedia.org/r/#/c/302671/ is merged you'll start seeing host metrics for db2069 [10:18:34] let me wait [10:18:39] let's wait [10:18:45] for a general deployment [10:18:53] I am preparing the patch now [10:19:40] what do you mean with general deployment? 
[10:20:00] wait for the patch [10:20:20] (03PS9) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [10:21:08] (03PS1) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's core databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [10:21:19] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [10:21:57] (03CR) 10Mark Bergsma: [C: 032] Update README.md for BACKPORTS=yes option [puppet] - 10https://gerrit.wikimedia.org/r/302679 (owner: 10Mark Bergsma) [10:23:17] so probably you think https://gerrit.wikimedia.org/r/302680 is too ambitious? [10:23:26] (03PS10) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [10:23:31] ^godog [10:24:05] ah, no I don't think it is, though I'd limit it by site too first [10:25:18] in the off chance that there's something wrong with prometheus-mysqld-exporter [10:25:32] with site, you mean ::site? 
[10:25:35] (03PS11) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [10:25:42] yeah [10:25:45] ok [10:28:47] (03PS12) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [10:32:37] godog, as our edits will conflict, allow me to overrride your patch with mine [10:32:43] (03PS13) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 (via Chad) [10:35:28] (03PS1) 10Merlijn van Deen: puppet_compiler: add packages for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/302683 (https://phabricator.wikimedia.org/T97081) (via Merlijn van Deen) [10:35:30] sure jynus [10:37:09] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2518467 (10ema) Interestingly the second Range request mentioned in [[https://phabricator.wikimedia.org/T131502#2515835 | my previous comment ]] does *not* stall on varnish 3.0.6plus-... [10:38:10] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-urd: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-urd] - 10https://gerrit.wikimedia.org/r/296229 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:40:03] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-arg: Initial Debian packaging [debs/contenttranslation/apertium-arg] - 10https://gerrit.wikimedia.org/r/294657 (https://phabricator.wikimedia.org/T124369) (owner: 10KartikMistry) [10:40:14] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:41:26] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[10:42:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:42:07] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-ca-it: Rebuild for Jessie [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/294080 (owner: 10KartikMistry) [10:44:13] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's core databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:45:01] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eu-en: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/295696 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:46:54] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-pt: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-pt] - 10https://gerrit.wikimedia.org/r/294431 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:47:52] (03PS3) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's core databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [10:48:01] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-gl: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-es-gl] - 10https://gerrit.wikimedia.org/r/295625 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:48:48] (03CR) 10Jcrespo: "node_exporter should not be there, but i prefer it there for now rather than touching site.pp every time." 
[puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:49:05] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-ca: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/294671 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:50:08] (03PS15) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [10:50:25] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-ast: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-es-ast] - 10https://gerrit.wikimedia.org/r/295624 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:51:05] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eo-fr: New upstream release and Jessie rebuild [debs/contenttranslation/apertium-eo-fr] - 10https://gerrit.wikimedia.org/r/294917 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:52:37] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eo-en: New upstream version and Jessie rebuild [debs/contenttranslation/apertium-eo-en] - 10https://gerrit.wikimedia.org/r/294472 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:53:06] jynus: I'd be fine with adding node_exporter too alongside mysqld_exporter btw [10:53:14] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eo-ca: Rebuild for Jessie and fixed dependencies [debs/contenttranslation/apertium-eo-ca] - 10https://gerrit.wikimedia.org/r/294432 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:53:53] 06Operations: determine future of dickson - wmf hosted irc server - https://phabricator.wikimedia.org/T120752#2518550 (10mark) Given the complete inactivity around this, let's decom. 
[10:53:54] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-en-gl: Rebuilt for Jessie and other fixes [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/294322 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [10:54:03] godog, the idea is use the mariadb role for testing, then put it on the right place [10:54:04] 06Operations: determine future of dickson - wmf hosted irc server - https://phabricator.wikimedia.org/T120752#2518552 (10mark) a:05mark>03None [10:54:29] the mysql_exporter should be there, the node of course not [10:54:44] why not though? [10:55:05] should it be on all nodes eventually? [10:55:14] like in standard? [10:55:17] 06Operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#2518560 (10mark) 05Open>03Resolved It's all puppetized. That's unrelated to torrus's db occasionally getting corrupted, for which I'm not aware of a fix. I think the plan is to migrate everything to different graphing systems, after w... [10:55:20] eventually yeah [10:55:26] that is what I mean [10:55:32] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [10:55:39] eventually not there, but doing it like this for now [10:55:49] 06Operations, 07Documentation: Incident response protocol needs a refresh - https://phabricator.wikimedia.org/T89800#2518562 (10mark) 05Open>03Resolved [10:56:02] yeah like that is fine I think, just removing the db2069 exception [10:56:18] oh, then I didn't understood [10:56:47] so it is ok to deploy it to all jessie databases on codfw? 
[10:58:10] I thought you wanted to test it on one node first in case there was some race condition [10:58:45] ah I see what you mean now, heh I'm not aware of any with node_exporter, it is already running in a few places [10:59:09] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2518566 (10mark) a:05mark>03None The allocation of contint1001 is fine. What's the current status on the discu... [10:59:11] how do you plan to apply the prometheus class? append it to mariadb::core jynus ? [10:59:13] ah, I didn't know that, godog [10:59:32] not in production though, in labs [10:59:43] it is on all mysql classes now [11:00:17] all mysql production classes [11:00:30] I am not sure about adding it directly to the module [11:00:32] isn't it in a separate class in https://gerrit.wikimedia.org/r/#/c/302680 ? [11:00:50] labs yes [11:00:54] it is separate [11:01:15] I was hoping to deploy this first, then other classes [11:01:34] what I mean is that as it stands that code review isn't going to do anything when merged, correct [11:01:37] ? [11:01:49] the main issue is that the labs module is WIP [11:01:59] it is, isn't it? [11:02:25] it should install the package and start the service [11:03:26] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat: Initial Debian packaging [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/294250 (https://phabricator.wikimedia.org/T137768) (owner: 10KartikMistry) [11:03:28] (03PS5) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's core databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [11:03:34] once the class is applied somewhere yeah, as it stands I don't think it is applied anywhere? 
[11:04:12] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-es-it: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-it] - 10https://gerrit.wikimedia.org/r/295206 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:04:26] it is, on all production mysqls [11:04:53] see the extra "include role::mariadb::prometheus" [11:05:14] it should not do anything on things that do not match the condition [11:05:31] but we can test it on the compiler [11:05:32] ah yeah nevermind [11:06:14] that is the "part that is not 100% ok, but I think it is ok for now" [11:06:46] it is the right place for the mariadb specific part, but not for the node, but it is ok for testing [11:06:46] (03CR) 10Filippo Giunchedi: [C: 031] Add prometheus's mysql-exporter to all jessie's core databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [11:06:52] yup [11:07:00] let me do a compiler pass [11:07:02] just in case [11:07:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [11:07:41] the title is wrong actually [11:08:02] it will be added to all mariadbs (except labs, that has a different general role management) [11:08:06] and beta [11:08:39] (03PS6) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production databases [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [11:10:55] 06Operations, 10Phabricator, 07Documentation: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#2518585 (10mark) 05Open>03Resolved [11:11:05] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-mk-bg: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-mk-bg] - 10https://gerrit.wikimedia.org/r/296212 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:11:35] RECOVERY - MediaWiki exceptions and 
fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:11:55] (03PS2) 10Gehel: Maps - remove expire files [puppet] - 10https://gerrit.wikimedia.org/r/302392 [11:13:17] (03CR) 10Gehel: [C: 032] Maps - remove expire files [puppet] - 10https://gerrit.wikimedia.org/r/302392 (owner: 10Gehel) [11:13:57] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-tat: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-tat] - 10https://gerrit.wikimedia.org/r/296367 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:14:45] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-swe: Initial Debian packaging [debs/contenttranslation/apertium-swe] - 10https://gerrit.wikimedia.org/r/294244 (https://phabricator.wikimedia.org/T137767) (owner: 10KartikMistry) [11:15:01] mark: my puppet-merge is pulling one of your change (palladium). Did I screw my rebase? Or do you have something in progress? [11:15:18] mark: it seems trivial enough (only a change to a readme) [11:15:22] oh the README.md one? forgot to merge it, yeah go ahead [11:15:31] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa: Initial Debian packaging [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/294658 (https://phabricator.wikimedia.org/T124370) (owner: 10KartikMistry) [11:15:31] mark: thanks! [11:15:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:15:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[11:16:10] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-gl: Rebuild for Jessie, cleanup [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/296162 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:16:38] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-es: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/296209 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:17:24] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [11:18:20] jynus: fyi, I have a long running job on the catalog compiler blocking yours... should be done at some point [11:18:37] I can wait, np, I saw it [11:19:06] I don't know why we can't have 2 running at the same time tbh [11:19:23] if I run out of things to do because I am waiting on one ticket, I would have 7000 open tickets :-) [11:19:35] :-) [11:19:42] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-ca: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/296207 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:20:25] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-nno: New upstream release [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [11:21:14] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-mlt-ara: Rebuild for Jessie and new upstream [debs/contenttranslation/apertium-mlt-ara] - 10https://gerrit.wikimedia.org/r/296214 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:23:06] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-mk-en: Initial Debian packaging [debs/contenttranslation/apertium-mk-en] - 10https://gerrit.wikimedia.org/r/298250 (https://phabricator.wikimedia.org/T139918) (owner: 10KartikMistry) [11:23:20] jynus: I'm going to lunch and back in an hour or so, we can 
merge after that [11:23:31] +1 [11:23:39] nice, ttyl [11:23:46] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-id-ms: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-id-ms] - 10https://gerrit.wikimedia.org/r/296159 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:23:49] (unless alex is still building :-) [11:24:17] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-is-sv: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-is-sv] - 10https://gerrit.wikimedia.org/r/296213 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:25:09] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-hbs-mkd: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-hbs-mkd] - 10https://gerrit.wikimedia.org/r/296051 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:25:47] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eu-en: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/295697 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:26:39] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eo-es: Rebuild for Jessie, cleanup [debs/contenttranslation/apertium-eo-es] - 10https://gerrit.wikimedia.org/r/295611 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [11:28:42] (03PS7) 10Addshore: Add simple-json-datasource plugin to labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/302119 (https://phabricator.wikimedia.org/T141636) [11:42:14] (03PS1) 10Addshore: DNM: Enable RevisionSlider on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302690 (https://phabricator.wikimedia.org/T141974) [12:17:11] (03CR) 10Alexandros Kosiaris: [C: 032] puppet_compiler: add packages for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/302683 (https://phabricator.wikimedia.org/T97081) (owner: 10Merlijn van Deen) [12:17:15] (03PS2) 10Alexandros Kosiaris: puppet_compiler: add packages for labs realm 
[puppet] - 10https://gerrit.wikimedia.org/r/302683 (https://phabricator.wikimedia.org/T97081) (owner: 10Merlijn van Deen) [12:17:18] (03CR) 10Alexandros Kosiaris: [V: 032] puppet_compiler: add packages for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/302683 (https://phabricator.wikimedia.org/T97081) (owner: 10Merlijn van Deen) [12:29:00] (03PS20) 10: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [12:30:30] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [12:30:32] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [12:30:34] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1005.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [12:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:36] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1006.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [12:30:38] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1007.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [12:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:00] !log T135176 depool wtp100[34567] [12:31:01] T135176: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176 [12:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:29] 06Operations, 
10DBA, 07Upstream: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#2518719 (10jcrespo) 05Open>03Resolved a:03jcrespo The alternative is clear: we are going to convert labs and dbstore (when... [12:38:53] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#239075 (10HenryLi) This ticket opened in 2009 and now it is 2016. Why does it take 7 years without anything done? While it is not urgent and not trivial, there is a... [12:41:40] akosiaris: \o/ [12:41:54] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2518725 (10revi) This is a tracking task, which was inherited from previous bug tracking software - bugzilla. Tracking tasks are a list of bugs about specific stuff... [12:42:26] jynus: back, ready when you are [12:42:57] akosiaris: wtp1001 looking good - https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=Parsoid+eqiad&h=wtp1001.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [12:43:53] godog, jenkins is still compiling, if he is compiling all hosts it will take 8 hours [12:44:01] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2518729 (10Amire80) >>! In T21986#2518722, @HenryLi wrote: > This ticket opened in 2009 and now it is 2016. Why does it take 7 years without anything done? While it... [12:44:29] ah, no parallel jobs heh [12:45:43] maybe we can ask to enable parallism [12:45:45] 06Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic: Strip query parameters from w.wiki domain - https://phabricator.wikimedia.org/T141170#2518733 (10BBlack) Yeah I checked as well, and it doesn't work. Most likely there simple patch is subtly broken... 
[12:46:41] jynus: while we wait, I was thinking about which puppet resource to export that has the info we need (mysql cluster name and shard) [12:46:44] godog, there is an "execute concurrent builds if necessary" [12:46:55] should I mark it or will it explode? [12:47:13] no idea how/if that would work [12:47:57] does anyone here know more about the CI queue than us? [12:49:12] yes, the guy that's on vacation :/ [12:49:27] I supposed so :-) [12:49:43] however, the timeout for build is set to 180 minutes [12:49:48] so there's that [12:49:49] waat [12:50:00] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2518737 (10Amire80) >>! In T21986#2518725, @revi wrote: > This is a tracking task, which was inherited from previous bug tracking software - bugzilla. Tracking task... [12:50:57] so, godog, right now we create salt grains, but we do not export anything yet (as it was not used before) [12:51:07] but I can add that right now [12:51:27] (I had to review the salt grains anyway, as they were incomplete) [12:53:55] jynus: nice, that'd be mysql_role and mysql_shard correct? where would core/labs/etc come from? [12:55:01] that is a good question, should I make it "well" at export time and let the script do whatever it wants, or should I think about prometheus groups already? [12:55:50] I think at export time would work [12:56:47] there are 2 options, mysql_role in {'core', 'labs', 'pc', ...} shard in {'s1', 'es1', ...} [12:57:45] or just one mysql_role {'core-s1', 'core-s2', 'labs', 'pc-pc1'} [12:58:06] I suppose the first is more flexible? [12:58:21] if we later want to reorganize the groups? 
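A minimal sketch of the trade-off between the two options above, assuming hypothetical host names (`db-a`, `db-b`, `db-c`) and a hypothetical helper `combined_group`; nothing here is taken from the actual puppet patch. It only illustrates why keeping `mysql_role` and `mysql_shard` separate is the more flexible choice: the single-label form can always be derived from the pair, while going the other way needs parsing and special cases such as `labs`, which has no shard.

```python
from typing import Optional


def combined_group(role: str, shard: Optional[str]) -> str:
    """Derive the single-label form (e.g. 'core-s1') from separate labels."""
    return f"{role}-{shard}" if shard else role


# Hypothetical hosts, labelled with option 1 (role and shard kept separate).
hosts = {
    "db-a": {"mysql_role": "core", "mysql_shard": "s1"},
    "db-b": {"mysql_role": "pc", "mysql_shard": "pc1"},
    "db-c": {"mysql_role": "labs", "mysql_shard": None},
}

# Option 2 (one combined label) falls out of option 1 with a simple join.
groups = {
    name: combined_group(labels["mysql_role"], labels["mysql_shard"])
    for name, labels in hosts.items()
}
print(groups)
```

If the groups are reorganized later, only `combined_group` would need to change; the exported labels themselves stay as they are.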
[12:58:38] let me talk in patch terms and you can comment there if necessary [12:58:56] yeah I'd keep things separate in puppet, easy to join them afterwards [12:59:40] maybe change salt:grain for an abstract class that does export && salt at the same time [12:59:47] 06Operations, 06Developer-Relations (Jul-Sep-2016): Operations Team Offsite - https://phabricator.wikimedia.org/T141940#2517241 (10Qgil) [13:00:05] (sorry, I am talking while I think, I shouldn't do that [13:00:08] ) [13:00:47] (03PS1) 10: Maps - Variable used to give password to osm2pgsql has changed [puppet] - 10https://gerrit.wikimedia.org/r/302701 (owner: 10Gehel) [13:01:39] jynus: hehe no worries, since the variables are already all there also passing those onto role::mariadb::prometheus and exporting that would work [13:01:49] yes [13:02:33] ok I've convinced myself that seems the most straightforward [13:02:49] I just do not want to do that one time per class, have a class with parameters so if tomorrow we go to, imagine, ganglia, we have the class for the groups already [13:03:05] so I was going to create role::mariadb::groups, and put the common code there [13:03:19] oh [13:03:33] that would work, too [13:03:49] very similar solution [13:04:28] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2518759 (10mark) >>! In T131502#2518467, @ema wrote: > Interestingly the second Range request mentioned in [[https://phabricator.wikimedia.org/T131502#2515835 | my previous comment ]]... [13:05:53] jynus: yup, happy to bikeshed on a code review too :D [13:06:00] yes [13:06:53] especially the naming, the naming is the most important part [13:08:05] let me copy icinga's syntax [13:09:55] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/3568/ is happy enough. The 5 hosts that fail have nothing to do with this change. 
Merging this " [puppet] - 10https://gerrit.wikimedia.org/r/302677 (owner: 10Alexandros Kosiaris) [13:10:10] (03PS2) 10: hiera role_backend: Don't qualify the _roles variable [puppet] - 10https://gerrit.wikimedia.org/r/302674 (owner: 10Alexandros Kosiaris) [13:10:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] hiera role_backend: Don't qualify the _roles variable [puppet] - 10https://gerrit.wikimedia.org/r/302674 (owner: 10Alexandros Kosiaris) [13:10:35] (03PS2) 10: realm: Do not qualify realm lookups in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302675 (owner: 10Alexandros Kosiaris) [13:10:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] realm: Do not qualify realm lookups in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302675 (owner: 10Alexandros Kosiaris) [13:11:10] (03PS2) 10: realm: Don't qualify the lookups to ::site in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302676 (owner: 10Alexandros Kosiaris) [13:11:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] realm: Don't qualify the lookups to ::site in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/302676 (owner: 10Alexandros Kosiaris) [13:11:29] (03PS2) 10: realm: Qualify fact lookups used in assignments [puppet] - 10https://gerrit.wikimedia.org/r/302677 (owner: 10Alexandros Kosiaris) [13:11:35] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] realm: Qualify fact lookups used in assignments [puppet] - 10https://gerrit.wikimedia.org/r/302677 (owner: 10Alexandros Kosiaris) [13:15:54] now this is interesting, because there are hosts that really have 2 mysql roles [13:16:19] !log reboot ms-be1022 - T140597 [13:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:38] lol, did the bot get really confused above? [13:17:16] ah? which bot? [13:17:21] T140597: diagnose failed(?) 
sda on ms-be1022 - https://phabricator.wikimedia.org/T140597 [13:17:23] grrrit-wm: [13:17:31] (PS2) : realm [13:17:36] Oh sorry [13:17:42] Ive been testing a change to fix gerrit [13:23:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "well, we still have older boxes using osm2pgsql, namely labsdb1006. Let's pass both variables for a while (where while will probably be qu" [puppet] - 10https://gerrit.wikimedia.org/r/302701 (owner: 10Gehel) [13:24:01] (03PS21) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [13:24:28] (03PS1) 10Alexandros Kosiaris: graphite: Prepend @ to hostname in ERB template [puppet] - 10https://gerrit.wikimedia.org/r/302704 [13:30:06] (03PS2) 10Gehel: Maps - Variable used to give password to osm2pgsql has changed [puppet] - 10https://gerrit.wikimedia.org/r/302701 [13:31:17] (03CR) 10Alexandros Kosiaris: [C: 031] Maps - Variable used to give password to osm2pgsql has changed [puppet] - 10https://gerrit.wikimedia.org/r/302701 (owner: 10Gehel) [13:31:31] (03CR) 10Gehel: "set both PGPASS and PGPASSWORD. 
We might want to ensure connection to postgres is done through unix sockets and switch to peer authenticat" [puppet] - 10https://gerrit.wikimedia.org/r/302701 (owner: 10Gehel) [13:31:36] (03PS2) 10Alexandros Kosiaris: add mapped IPv6 on rhodium and strontium [puppet] - 10https://gerrit.wikimedia.org/r/302626 (owner: 10Dzahn) [13:31:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] add mapped IPv6 on rhodium and strontium [puppet] - 10https://gerrit.wikimedia.org/r/302626 (owner: 10Dzahn) [13:32:01] (03CR) 10Filippo Giunchedi: [C: 031] graphite: Prepend @ to hostname in ERB template [puppet] - 10https://gerrit.wikimedia.org/r/302704 (owner: 10Alexandros Kosiaris) [13:37:17] (03PS1) 10Halfak: Lowers ores-redis maxmemory setting to 2.5GB [puppet] - 10https://gerrit.wikimedia.org/r/302705 [13:37:22] (03PS1) 10ArielGlenn: If a prereq job is missing, run it instead of giving up [dumps] - 10https://gerrit.wikimedia.org/r/302706 (https://phabricator.wikimedia.org/T141981) [13:38:56] ACKNOWLEDGEMENT - LVS HTTP IPv4 on thumbor.svc.codfw.wmnet is CRITICAL: Connection refused Filippo Giunchedi thumbor being setup T139606 [13:39:02] ACKNOWLEDGEMENT - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: Connection refused Filippo Giunchedi thumbor being setup T139606 [13:39:08] (03CR) 10Chad: "Why did you make like 20 edits to this patch? If you were testing, create a separate patch that says something like "DO NOT MERGE" for tes" [puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [13:39:20] why the hell did that page [13:39:25] sorry about that [13:39:57] I got paged for the acks it seems... 
not sure what happened there [13:40:05] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 15 failures [13:40:08] yeah didn't expect it at all [13:40:14] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [13:40:37] (03CR) 10Paladox: "Sorry" [puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [13:40:56] if only the page had 'ACKNOWLEDGEMENT' right in the tle :-D but sadly it did not have it anywhere. [13:40:58] *title [13:45:15] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 15 failures [13:45:15] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [13:45:34] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [13:45:44] RECOVERY - MD RAID on ms-be1022 is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0 [13:48:54] (03PS22) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 [13:48:58] (03PS1) 10DCausse: Upgrade elastic plugins to 2.3.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/302707 [13:49:10] (03CR) 10Chad: "@BryanDavis: Fixed in PS21..." 
[puppet] - 10https://gerrit.wikimedia.org/r/302601 (owner: 10Chad) [13:50:06] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 15 failures [13:50:14] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [13:50:50] (03CR) 10DCausse: [C: 04-1] Upgrade elastic plugins to 2.3.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/302707 (owner: 10DCausse) [13:54:00] I know about bismuth & silicon ^^^^ -- I stopped puppetmaster for a minute right at the wrong time [13:55:05] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 171 seconds ago with 0 failures [13:55:05] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [13:55:14] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [13:55:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [13:55:43] sigh. same for indium/silicon/thulium ^^^ [13:57:49] Database error [13:57:50] From Meta, a Wikimedia project coordination wiki [13:57:52] To avoid creating high replication lag, this transaction was aborted because the write duration (10.131317615509) exceeded the 5 second limit. [13:57:53] If you are changing many items at once, try doing multiple smaller operations instead. 
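The error text above suggests "multiple smaller operations" when one write transaction runs too long. The following is only a generic sketch of that idea, not how MediaWiki or the Translate extension does it: the table name `notifications`, the helper names, and the batch size are all assumptions, and an in-memory SQLite database stands in for the real one so the example is runnable.

```python
import sqlite3
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(conn, rows, batch_size=100):
    """Insert rows in several short transactions; return the commit count."""
    commits = 0
    for chunk in batched(rows, batch_size):
        # `with conn` commits on success and rolls back on an exception,
        # so each batch is its own short write transaction.
        with conn:
            conn.executemany(
                "INSERT INTO notifications (recipient) VALUES (?)", chunk
            )
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")  # stand-in database for the sketch
conn.execute("CREATE TABLE notifications (recipient TEXT)")
rows = [(f"user{i}",) for i in range(987)]  # hypothetical recipients
print(insert_in_batches(conn, rows))
```

Each batch holds its write lock only briefly, which is the property the 5 second limit is trying to enforce; the cost is that a failure midway leaves earlier batches committed.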
[13:58:06] @ Special:NotifyTranslators [13:59:25] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures [14:00:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [14:00:14] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [14:00:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [14:00:14] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [14:00:15] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:00:15] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 299 seconds ago with 0 failures [14:02:39] (03PS7) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [14:03:43] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:04:22] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2476991 (10fgiunchedi) re: requirements I think it was mentioned at the previous me... [14:04:42] mafk, was it one time? 
[14:05:15] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [14:05:15] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 172 seconds ago with 0 failures [14:05:15] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [14:05:16] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 195 seconds ago with 0 failures [14:05:16] s7 seems to have lag issues since one of the latest deployments [14:05:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [14:05:51] (03PS5) 10Gehel: Maps - initial data import [puppet] - 10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) [14:06:00] (03CR) 10Eevans: [C: 031] site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [14:06:04] or maybe it is some of the ongoing user renames? [14:06:43] !log restbase deploy start of ff1ee1e7 [14:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:54] for example, right now, traffic to s7 has multiplied by 3 according to some metrics [14:06:57] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [14:07:35] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:10:15] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 281 seconds ago with 0 failures [14:10:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [14:10:15] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 85 seconds ago with 0 failures [14:11:32] logs only show ResourceLoaderModule::saveFileDependencies errors, is it just a coincidence or could it be related? [14:14:54] godog, do you need the datacenter to be exported for prometheus? 
[14:15:13] jynus: sorry, was on the phone [14:15:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [14:16:10] 06Operations, 10Analytics, 10Monitoring: Switch jmxtrans from statsd to graphite line protocol - https://phabricator.wikimedia.org/T73322#2518931 (10elukey) p:05Triage>03Normal [14:16:33] godog, gehel --^ [14:16:36] so jynus, I received: [15:56] rc-pmtpa [#meta.wikimedia] [[Special:Log/notifytranslators]] sent * MarcoAurelio * MarcoAurelio sent a notification about translating page [[Right to vanish]]; languages: ar, ca, de, en-gb, es, fr, ja, pt, ru; deadline: none; priority: (unset); sent to 987 recipients, failed for 0 recipients, skipped for 0 recipients [14:16:46] but there's no onwiki log xD [14:17:12] https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators&user=MarcoAurelio [14:17:17] that one don't appear [14:17:24] jynus: yeah makes sense to have ::site too [14:17:37] elukey: looks good to me! [14:17:37] !log restbase deploy end of ff1ee1e7 [14:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:50] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2518950 (10elukey) 05Open>03Resolved a:03elukey All the next steps outlined in https://phabricator.wikimedia.org/T73322, we can close this task. [14:17:52] godog, for example, we have 2 masters, one on each datacenter [14:18:09] but only the one on the active db is read-write [14:19:26] elukey: +1 ! 
[14:19:46] jynus: yeah, in this case each will be monitored by the respective prometheus in the datacenter [14:19:55] (03PS8) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [14:20:15] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 237 seconds ago with 0 failures [14:20:16] ^start giving it a look to see if that is close to what would be useful [14:21:00] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:22:25] I am sorry, mafk, I am unsure about what the line you copied me means (I have no context) [14:24:03] (03PS9) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [14:26:37] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:26] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [14:28:44] jynus: creé un ticket [14:29:38] mafs says "I created a ticket", and I thank you for it [14:30:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:31:35] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:32:23] mafk, is this content-translation related? 
Sorry, I am not very familiar with the inner workings of that extension [14:32:53] jynus: It's a feature of the translate extension, used to massmessage subscribed translators [14:33:04] I've CCd Nikerabbit as well [14:33:15] I will add both tags too [14:33:36] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:33:56] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2519021 (10Liuxinyu970226) So amire80, what's the reason we can't do T30441 as the second kiseki (or "miracle" if you really don't know romaji) here? The only probl... [14:35:39] (03PS4) 10Andrew Bogott: Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 (https://phabricator.wikimedia.org/T130806) (owner: 10Yuvipanda) [14:36:11] (03CR) 10Filippo Giunchedi: "LGTM generally, some comments" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:36:32] (03CR) 10Andrew Bogott: [C: 032] Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 (https://phabricator.wikimedia.org/T130806) (owner: 10Yuvipanda) [14:36:48] !log reboot ms-be1022 following firmware upgrade T141756 [14:36:50] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [14:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:25] oh, so Translation-notifications, I got confused with cx-translation [14:39:34] yep [14:39:46] no probs, too many translation related features :) [14:39:49] I asked first! [14:40:04] I said I was not very familiar with that feature [14:41:01] mafk: wha? 
[14:41:09] (03CR) 10BBlack: [C: 032] openssl (1.0.2h-1~wmf3) jessie-wikimedia; urgency=medium [debs/openssl] - 10https://gerrit.wikimedia.org/r/301920 (https://phabricator.wikimedia.org/T131908) (owner: 10BBlack) [14:41:25] Nikerabbit: salve, the notifytranslators log is broken since June [14:41:40] Nikerabbit: https://phabricator.wikimedia.org/T141988 [14:41:47] kiitos [14:42:49] (03PS2) 10Addshore: Enable RevisionSlider on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302690 (https://phabricator.wikimedia.org/T141974) [14:42:59] mafk: bene [14:44:16] mafk: is there backtrace? [14:44:33] I'm checking logstash [14:44:44] Nikerabbit: I just c/p the error message I got [14:44:52] no error ID however [14:45:13] mafk: when did that error happen UTC? [14:45:48] Nikerabbit: 15:57 UTC+2 [14:45:55] I'm in UTC+2 I think [14:46:03] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2519084 (10fgiunchedi) ok I've reenabled the ld with `controller slot=3 ld 1 modify reenable`, also had to juggle with boot order since this was sda. anyways I'm still seeing the iucrc... [14:46:22] UTC-2 [14:47:36] 06Operations, 10hardware-requests: determine future of dickson - wmf hosted irc server - https://phabricator.wikimedia.org/T120752#2519085 (10RobH) a:03RobH [14:51:08] mafk: hmm I am also seeing emails failing at that time... 
not sure if related [14:51:52] Nikerabbit: I'm just an user, so I have no idea :) [14:52:39] (03PS10) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [14:53:52] godog, I think I answered/fixed all comments except the new define [14:53:55] (03PS4) 10Andrew Bogott: Set up spice-based remote consoles for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) [14:53:57] (03PS1) 10Andrew Bogott: Set up labtestspice -> labtestcontrol2001 on misc-web [puppet] - 10https://gerrit.wikimedia.org/r/302714 [14:54:32] (03PS3) 10Filippo Giunchedi: site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) [14:55:38] (03PS6) 10Gehel: Maps - initial data import [puppet] - 10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) [14:56:00] do you have one already (or a "format", or do you want me to create one?) [14:56:41] jynus: for the define? yeah look at modules/ganglia/manifests/monitor.pp [14:58:06] (03PS3) 10Gehel: Maps - Variable used to give password to osm2pgsql has changed [puppet] - 10https://gerrit.wikimedia.org/r/302701 [14:58:09] (03PS11) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [14:59:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [14:59:23] (03CR) 10Gehel: [C: 032] Maps - Variable used to give password to osm2pgsql has changed [puppet] - 10https://gerrit.wikimedia.org/r/302701 (owner: 10Gehel) [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160803T1500). [15:00:05] Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:32] Looks like the only patch is mine so I can go ahead and do it! [15:00:46] :) [15:01:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:01:21] addshore: if you need help or have question :) [15:01:25] let me know [15:02:14] mhh the exceptions/fatals show up in logstash too, though from a single host mw1304 https://logstash.wikimedia.org/goto/b2adef5c2f60f310be53238c0c570a22 [15:05:43] mw1304 and mw1167 seem not well. [15:05:59] Like, a factor of 20-30x worse than others in terms of log problems [15:06:19] 1163 and 1162 round out the top 4, by a huge factor [15:06:56] https://usercontent.irccloud-cdn.com/file/mXJtUBtN/mw-errors%20on%20bad%20hosts%2C%202016-08-03 [15:07:58] is it a DB issue or am I misreading "DB connection was already closed or the connection dropped" [15:08:21] "DB connection was already closed or the connection dropped" and "Duplicate get(): "{key}" fetched {count} times" are the top 2 errors. [15:08:40] yeah but I am a bit ignorant and don't know if they are garbage or true ones [15:08:44] so I am asking first :) [15:09:20] Well if they're garbage we're spamming for a non-issue. [15:09:22] 07Blocked-on-Operations, 06Operations, 10Cassandra, 06Services: Remove obsolete metrics - https://phabricator.wikimedia.org/T139792#2519173 (10Eevans) [15:09:24] So either way it's wrong :) [15:09:43] ah yes I know! 
:) [15:09:52] looking in logstash it seems that they are genuine [15:09:54] (03CR) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [15:10:02] Nikerabbit: I see TNBot is now sending some messages [15:10:11] frwiki seems to be the biggest offender. [15:10:41] mafk: yeah apparently the jobs got committed [15:10:50] 16:57 Utc-2 [15:11:03] so they're in the jobQueue still? [15:11:05] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [15:11:13] indeed, also nowiki / dewiki / commonswiki show up when looking a few days back, all seem from jobrunners tho [15:11:24] (03CR) 10Eevans: [C: 031] site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [15:13:07] mafk: some failed, some already done and some in queue most likely [15:13:18] !log addshore@tin Synchronized php-1.28.0-wmf.13/includes/Linker.php: SWAT: [[gerrit:302681|Debug Logging for Undefined index: width in Linker.php]] (duration: 00m 30s) [15:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:35] SWAT all done! << aude thcipriani [15:13:43] godog, elukey: They're all changeNotification jobs. [15:14:24] Nikerabbit: alright, thanks [15:14:44] Yep, if you filter out 'type=ChangeNotification' from the url field, most of the errors disappear. [15:14:56] So yeah, we've got like 4 boxes that are bailing, and hard, on those jobs. [15:16:16] \o/ [15:16:47] (03CR) 10Alexandros Kosiaris: [C: 031] site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [15:16:48] ostriches: ah! 
thanks for taking a look [15:17:18] (PS5) Andrew Bogott: Set up spice-based remote consoles for Labs instances [puppet] - https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) [15:17:19] is there a revision history for fatalmonitor? I have a vague memory of ChangeNotification jobs being filtered in fatalmonitor. [15:17:29] maybe had this discussion previously. [15:17:36] addshore: nice :) [15:17:48] Of course the log entry doesn't include which db is barfing, but I doubt that actually matters since it's one set of jobs failing. [15:17:51] thcipriani: Not that I remember [15:18:18] (PS4) Filippo Giunchedi: site: add prometheus::node_exporter to more machines [puppet] - https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) [15:18:26] (CR) Filippo Giunchedi: [C: 2 V: 2] site: add prometheus::node_exporter to more machines [puppet] - https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) (owner: Filippo Giunchedi) [15:18:39] (PS1) Alexandros Kosiaris: ores: switch to ores-redis-02 [puppet] - https://gerrit.wikimedia.org/r/302718 [15:19:00] (CR) Alexandros Kosiaris: [C: 2 V: 2] ores: switch to ores-redis-02 [puppet] - https://gerrit.wikimedia.org/r/302718 (owner: Alexandros Kosiaris) [15:19:04] (PS2) Alexandros Kosiaris: ores: switch to ores-redis-02 [puppet] - https://gerrit.wikimedia.org/r/302718 [15:19:08] (CR) Alexandros Kosiaris: [V: 2] ores: switch to ores-redis-02 [puppet] - https://gerrit.wikimedia.org/r/302718 (owner: Alexandros Kosiaris) [15:20:04] I guess next step is figure out if it's MW's fault or if a DB really is sick.
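[editor's note] The triage described in the messages above (rank hosts by error volume, then check whether excluding ChangeNotification jobs makes the errors disappear) could be sketched roughly as follows. This is only an illustration: the record layout, the field names `host`, `job_type` and `message`, and the sample counts are all assumptions made up for the example, not the actual logstash schema or the real numbers from that day.

```python
from collections import Counter

# Made-up sample records. The field names ("host", "job_type", "message")
# are assumptions for illustration, not the real logstash schema.
DB_ERROR = "DB connection was already closed or the connection dropped"
SAMPLE_RECORDS = [
    {"host": "mw1304", "job_type": "ChangeNotification", "message": DB_ERROR},
    {"host": "mw1304", "job_type": "ChangeNotification", "message": DB_ERROR},
    {"host": "mw1304", "job_type": "ChangeNotification", "message": DB_ERROR},
    {"host": "mw1167", "job_type": "ChangeNotification", "message": DB_ERROR},
    {"host": "mw1167", "job_type": "ChangeNotification", "message": DB_ERROR},
    {"host": "mw1099", "job_type": "refreshLinks", "message": DB_ERROR},
]


def errors_per_host(records, exclude_job_type=None):
    """Count error records per host, optionally skipping one job type."""
    return Counter(
        record["host"]
        for record in records
        if exclude_job_type is None or record["job_type"] != exclude_job_type
    )


# Hosts ranked by total error count.
all_errors = errors_per_host(SAMPLE_RECORDS)

# The same ranking with ChangeNotification jobs filtered out.
without_change_notification = errors_per_host(
    SAMPLE_RECORDS, exclude_job_type="ChangeNotification"
)
```

If the second counter is nearly empty while the first is dominated by a few hosts, that points at one job type failing on those hosts rather than at a generally sick host, which matches the reasoning in the channel.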
[15:20:19] Of course the MW error doesn't include which DB we were talking to [15:21:15] jouncebot next [15:21:15] In 3 hour(s) and 38 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160803T1900) [15:22:11] ostriches: indeed, I'm jumping into a meeting now but looking for similar error message in phab got me to https://phabricator.wikimedia.org/T67263 [15:22:15] (CR) Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: Gilles) [15:22:40] (PS3) Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) [15:23:01] !log openssl-1.0.2h-1~wmf3 uploaded to carbon jessie-wikimedia ( https://gerrit.wikimedia.org/r/#/c/301920/ ) [15:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:24] Hmmmm [15:24:29] (CR) Elukey: [C: 1] "I am in favor, but I'd have two questions:" [puppet] - https://gerrit.wikimedia.org/r/301878 (https://phabricator.wikimedia.org/T140869) (owner: Eevans) [15:25:38] (Abandoned) Filippo Giunchedi: add prometheus::node_exporter to db2069 [puppet] - https://gerrit.wikimedia.org/r/302671 (https://phabricator.wikimedia.org/T140646) (owner: Filippo Giunchedi) [15:26:16] Operations, Reading-Infrastructure-Team, Services, Services-next, Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2519262 (GWicke) [15:26:47] Operations, Reading-Infrastructure-Team, Services, Services-next, Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2476991 (GWicke) @fgiunchedi: Added the latency & instrumentation requirement in...
[15:28:59] jynus: change LGTM, though I still see some unrelated changes in the diff? [15:29:22] oh, I just changed completely the structure [15:29:32] I will tell you about that later after the meetings [15:31:27] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail [15:32:06] !log upgrading openssl on cache_maps + cache_misc [15:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:28] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-af-nl_0.2.0~r58256-1+wmf1 [15:33:28] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-arg_0.1.2~r65494-1+wmf1 [15:33:28] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-ca-it_0.1.1~r57554-1+wmf1 [15:33:28] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-cat_1.0.0~r65787-1+wmf1 [15:33:29] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-cy-en_0.1.1~r57554-3+wmf1 [15:33:29] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:29] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:30] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-en-gl_0.5.2~r57551-1+wmf1 [15:33:30] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:31] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eo-ca_0.9.1~r60655-1+wmf1 [15:33:31] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:32] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eo-en_1.0.0~r63833-1+wmf1 [15:33:32] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:33] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 
[15:33:33] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eo-es_0.9.1~r60655-1+wmf1 [15:33:34] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:34] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eo-fr_0.9.0~r57551-1 [15:33:34] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:35] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-ast_1.1.0~r60158-1+wmf1 [15:33:35] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:36] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-ca_1.2.1+svn~57448-1+wmf1 [15:33:36] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:37] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:37] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-gl_1.0.8~r57542-1+wmf1 [15:33:38] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:38] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-it_0.1.0~r51165-1+wmf1 [15:33:38] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:39] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-es-pt_1.1.5+svn~57507-1+wmf1 [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:40] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:40] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eu-en_0.3.1~r60155-1+wmf1 [15:33:40] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:41] T107306: Package and test 
apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:41] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-eu-es_0.3.3~r56159-1+wmf1 [15:33:42] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:42] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-hbs-mkd_0.1.0~r57554-1+wmf1 [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:43] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:44] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-id-ms_0.1.1+svn~57870-1+wmf1 [15:33:44] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:44] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-is-sv_0.1.0~r56030-1+wmf1 [15:33:45] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:45] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-mk-bg_0.2.0~r49489-1+wmf1 [15:33:46] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:46] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-mk-en_0.1.1~r57554-1+wmf1 [15:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:47] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:48] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-mlt-ara_0.2.0~r62623-1+wmf1 [15:33:48] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:49] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-nno_0.9.0~r69513-2+wmf1 [15:33:49] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:50] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: 
apertium-oc-ca_1.0.6~r60158-1 [15:33:51] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:51] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-oc-es_1.0.6~r60161-1+wmf1 [15:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:52] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:52] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-pt-gl_0.9.2~r60358-1 [15:33:53] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:53] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-spa_0.1.0~r65494-1+wmf1 [15:33:54] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:54] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-swe_0.7.0~r69513-1+wmf1 [15:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:55] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:55] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-tat_0.1.0~r60887-1+wmf1 [15:33:56] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:56] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-urd_0.1.0~r61311-1+wmf1 [15:33:57] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:20] stress-testing the bots? :) [15:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:54] /ignore morebots [15:38:47] (CR) Alexandros Kosiaris: "recheck"
[debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/294252 (https://phabricator.wikimedia.org/T137768) (owner: KartikMistry) [15:38:50] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-spa-arg] - https://gerrit.wikimedia.org/r/295122 (https://phabricator.wikimedia.org/T124370) (owner: KartikMistry) [15:38:52] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-dan] - https://gerrit.wikimedia.org/r/269912 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry) [15:38:54] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry) [15:38:57] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-en-es] - https://gerrit.wikimedia.org/r/294314 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:38:59] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-arg-cat] - https://gerrit.wikimedia.org/r/295121 (https://phabricator.wikimedia.org/T124369) (owner: KartikMistry) [15:39:01] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/giella-sme] - https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: KartikMistry) [15:39:04] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/294264 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:06] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-urd-hin] - https://gerrit.wikimedia.org/r/296368 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:09] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/294245 (https://phabricator.wikimedia.org/T137767) (owner: KartikMistry) [15:39:11] (CR)
Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/294248 (https://phabricator.wikimedia.org/T137767) (owner: KartikMistry) [15:39:13] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-sme-nob] - https://gerrit.wikimedia.org/r/295185 (https://phabricator.wikimedia.org/T120087) (owner: KartikMistry) [15:39:16] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-nob] - https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) (owner: KartikMistry) [15:39:18] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/296050 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:21] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/296157 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:23] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-kaz] - https://gerrit.wikimedia.org/r/296366 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:25] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-kaz-tat] - https://gerrit.wikimedia.org/r/296369 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:28] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-hin] - https://gerrit.wikimedia.org/r/296228 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:30] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-hbs-slv] - https://gerrit.wikimedia.org/r/296203 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:33] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-hbs] - https://gerrit.wikimedia.org/r/294675 (https://phabricator.wikimedia.org/T107306) (owner:
KartikMistry) [15:39:35] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:35] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-hbs-eng] - https://gerrit.wikimedia.org/r/296049 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:38] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-eus] - https://gerrit.wikimedia.org/r/294673 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:40] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - https://gerrit.wikimedia.org/r/294425 (https://phabricator.wikimedia.org/T137768) (owner: KartikMistry) [15:39:42] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fr-es] - https://gerrit.wikimedia.org/r/295220 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:39:45] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-pt-ca] - https://gerrit.wikimedia.org/r/296164 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry) [15:41:51] \0/ Now lets see how happy Mr. Jenkins is. [15:42:10] poor jenkins <3 [15:43:07] (PS1) BBlack: ssl_ciphersuite: allow client choice of chapoly [puppet] - https://gerrit.wikimedia.org/r/302724 (https://phabricator.wikimedia.org/T131908) [15:44:34] akosiaris: source/debian/control - why apertium-fra seeks this? It is only added when autopackagetest is present. [15:44:40] * kart_ wondering. [15:44:51] kart_: ? [15:45:01] https://integration.wikimedia.org/ci/job/debian-glue/488/console [15:45:04] akosiaris: ^ [15:45:39] the debian/control file ? [15:46:04] shouldn't there be debian/control file anyway ? [15:46:37] dpkg-source: error: cannot read source/debian/control: No such file or directory [15:46:45] debian/control is there.
[15:47:00] Without it, package won't even build at all :) [15:47:50] gimme 5 mins and I 'll have a look. trying to fix something [15:48:28] OK! [15:48:57] Also let me know if I have missed to push any tags. [15:49:28] (CR) BBlack: [C: 2] ssl_ciphersuite: allow client choice of chapoly [puppet] - https://gerrit.wikimedia.org/r/302724 (https://phabricator.wikimedia.org/T131908) (owner: BBlack) [15:51:38] (PS1) Alexandros Kosiaris: ores: Allow specifying deployment method [puppet] - https://gerrit.wikimedia.org/r/302727 [15:53:59] (CR) Alexandros Kosiaris: [C: 2 V: 2] ores: Allow specifying deployment method [puppet] - https://gerrit.wikimedia.org/r/302727 (owner: Alexandros Kosiaris) [15:54:04] (PS2) Alexandros Kosiaris: ores: Allow specifying deployment method [puppet] - https://gerrit.wikimedia.org/r/302727 [15:54:06] (CR) Alexandros Kosiaris: [V: 2] ores: Allow specifying deployment method [puppet] - https://gerrit.wikimedia.org/r/302727 (owner: Alexandros Kosiaris) [15:57:12] Hm.. looks like mwgrep broke. [15:57:16] It's getting HTTP 500 from Elastic [15:57:27] e.g. run `mwgrep tipsy` from terbium.
[15:57:37] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:41] $ mwgrep '.tipsy(' [15:58:45] https://phabricator.wikimedia.org/T141996 [15:59:39] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:40] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:42] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1005.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:44] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1007.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:50] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:52] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:53] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1005.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:54] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1007.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [15:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:14] (PS1) ArielGlenn: make dump run locks
stale and therefore removable after 15 minutes [puppet] - https://gerrit.wikimedia.org/r/302729 [16:00:46] Krinkle: you have to escape regex chars: mwgrep '.tipsy\(' [16:01:02] !log T135176 pool wtp100[3457] with weight=15. wtp1006 does not look so good [16:01:03] T135176: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176 [16:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:14] dcausse: Hm.. seems odd to return HTTP 500? I'd expect 4xx if it's a user error [16:01:23] Right, it supports regex now [16:02:04] Krinkle: I agree... exceptions are a bit messy, we should probably display the response message it should have some error info [16:06:59] (CR) Eevans: [C: 1] "> Are we going to restart all the cassandra instances to pick up the change or should we wait for the next rolling restart?" [puppet] - https://gerrit.wikimedia.org/r/301878 (https://phabricator.wikimedia.org/T140869) (owner: Eevans) [16:08:36] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [16:09:36] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:10:45] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 2 failures [16:10:57] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Puppet has 1 failures [16:11:19] akosiaris: there's a pb with wtp1006? [16:12:21] mobrovac: still investigating...
can't login for one [16:12:25] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [16:12:56] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:13:16] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 2 failures [16:14:45] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:56] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:15:07] transients... [16:15:47] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:15:56] groupadd: failure while writing changes to /etc/group [16:15:59] wat ? [16:16:03] that's wtp1006 btw [16:17:31] mounted in ro? [16:17:52] !log upgrading openssl on cache_text, cache_upload [16:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:19] nope [16:19:22] that's the funny thing [16:20:41] mv: cannot move '/etc/group+' to '/etc/group': Device or resource busy [16:20:43] hmmm [16:20:48] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614946 [16:20:56] ok that's user error [16:21:05] but there was mostly no user up to now involved ... 
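[editor's note] On the mwgrep exchange at 16:00:46: once a search tool interprets its argument as a regular expression, a bare `(` opens a group that never closes, so the pattern itself is invalid. The sketch below reproduces that with Python's `re` module, which is only a stand-in chosen for illustration; mwgrep's own regex engine and its error handling are not shown in the log and may differ in detail.

```python
import re


def compiles(pattern):
    """Return True if `pattern` is a syntactically valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The search string from the log, used as a regex without escaping:
# the "(" starts a group that is never closed.
unescaped_ok = compiles(".tipsy(")

# The form suggested in the channel: escape the parenthesis.
escaped_ok = compiles(r".tipsy\(")

# re.escape() builds a fully literal pattern, escaping the "." as well.
literal_pattern = re.escape(".tipsy(")

# With only the "(" escaped, "." still matches any single character.
loose_match = re.search(r".tipsy\(", "xtipsy(") is not None
strict_match = re.search(literal_pattern, "xtipsy(") is not None
```

The `loose_match` versus `strict_match` pair shows why escaping only the parenthesis gives a slightly broader search than a literal one: the leading `.` remains a wildcard.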
[16:22:53] !log reboot wtp1006 [16:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:40] Operations, Puppet, ORES, Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2519474 (Halfak) [16:28:47] Operations, Puppet, ORES, Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2519487 (Halfak) p:Triage>Normal [16:29:05] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:29:23] (PS13) Andrew Bogott: Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: Alex Monk) [16:30:47] (PS12) Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [16:31:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:33:03] godog, if by chance you are still there, I have modified my patch to do role::mariadb::groups, and make it "your problem" on role::prometheus::mysql_exporter [16:33:39] that way I also combine salt integration (which is unrelated for prometheus, but needed for other tasks) [16:35:23] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:35:35] (CR) Jcrespo: "@Riccado, @Ariel I have added here the salt automation that you either initially implemented or asked for it; ignore the rest of the patch" [puppet] - https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo) [16:36:03] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures [16:36:09] (CR) BryanDavis: [C: 1] "Once we get some logs flowing in we can come back and add filters to classify the gerrit logs so they are easier to find." [puppet] - https://gerrit.wikimedia.org/r/302601 (owner: Chad) [16:36:22] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:36:23] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:37:02] (PS13) Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [16:38:30] jynus: yup I'll be taking a look [16:39:07] if it is late (it starts being late for me), we can postpone if for tomorrow- but only if you can [16:39:32] It now requires implementing the exports on role::prometheus::mysql_exporter [16:40:03] but I leave the rest prepared for you to work on your own in case you want it [16:40:16] ack, yeah I have a meeting in 20', let's do it tomorrow morning [16:40:32] and sorry I started to mix other changes at the same time [16:40:43] but if that class is bad, is bad [16:40:44] no worries! [16:40:51] :-) [16:41:16] i am unable to log in on https://commons.m.wikimedia.beta.wmflabs.org/wiki/Special:UserLogin (mobile beta commons). known issue? [16:41:40] i am redirected to Special:CentralLogin/complete?token=blahblah with the error message "No active login attempt is in progress for your session." [16:41:59] (PS14) Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [16:42:12] MatmaRex, you get "No active login attempt is in progress for your session." ?
[16:42:12] right [16:42:17] same [16:42:50] anomie, ^ [16:43:06] (CR) Jcrespo: [C: -1] "This now is missing the export on role::prometheus, so it would fail to apply in the current state, but the idea is more or less there." [puppet] - https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo) [16:43:26] i can log in on the desktop version, but it looks like the cookies are not shared with the mobile one. [16:45:30] Krenair: anomie is allegedly on vacation [16:45:40] (i mean, he says so, yet he replies to emails and stuff) [16:47:31] Operations, Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2519626 (BBlack) Quoting myself from IRC: ``` 16:32 < bblack> on initialy deployment of the chapoly stuff to text+upload (all clusters now), the initial point-in-time stats changes look like: 16:33... [16:47:43] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [16:47:59] (PS15) Alex Monk: Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [16:50:20] (CR) Andrew Bogott: [C: 2] Puppetise script to manage labs floating IP PTR records [puppet] - https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: Alex Monk) [16:50:42] jynus: yup LGTM, though there's still an unrelated change at the bottom for parsercache [16:50:52] yes [16:51:01] I comment those on the summary [16:51:14] they are garbage from old servers [16:51:18] that I cleaned up [16:51:31] so small that doesn't even merit its own commit [16:51:43] it is an old /a exception for old servers [16:52:00] but no, we cannot LGTM now because I broke it on purpose [16:52:02] I think it makes sense to split things no matter how small, commits are cheap [16:52:15] godog, ok, I can do that beforehand [16:52:18] no problem [16:52:36] thanks!
[16:52:44] the question is, instead of exporting from role::groups, I just call the role with parameters [16:52:54] and I will let you name the resource and all [16:53:05] if you do not care, just say it [16:54:11] ah, I don't feel strongly either way where the exported resource lives [17:00:43] (PS9) Dzahn: labs: restart slapd if it uses > 50% of memory [puppet] - https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [17:01:39] (CR) Dzahn: [C: 2] labs: restart slapd if it uses > 50% of memory [puppet] - https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: Dzahn) [17:01:42] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [17:01:43] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [17:01:45] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1005.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [17:01:47] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1007.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [17:03:46] (CR) Chad: "I think we can do that from the gerrit side too. Something like:" [puppet] - https://gerrit.wikimedia.org/r/302601 (owner: Chad) [17:04:29] no morebots here? [17:05:50] !log seaborgium - restart slapd [17:07:09] (PS1) Alexandros Kosiaris: realm: Move the $::site setting code first [puppet] - https://gerrit.wikimedia.org/r/302741 [17:07:47] mutante: restart morebots too?
[17:09:14] sorry, that's completely unrelated on tool labs and i am in the middle of something [17:11:04] (03PS1) 10Alexandros Kosiaris: wtp100[34567] to jessie [puppet] - 10https://gerrit.wikimedia.org/r/302742 [17:11:52] (03PS2) 10Alexandros Kosiaris: wtp100[34567] to jessie [puppet] - 10https://gerrit.wikimedia.org/r/302742 [17:12:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] wtp100[34567] to jessie [puppet] - 10https://gerrit.wikimedia.org/r/302742 (owner: 10Alexandros Kosiaris) [17:14:18] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 391 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:14:46] (03CR) 10Chad: "Inline question for regexes, otherwise lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/302741 (owner: 10Alexandros Kosiaris) [17:15:39] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:18:05] (03PS1) 10Thcipriani: Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 [17:20:18] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 391 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:22:19] (03PS4) 10Filippo Giunchedi: Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles) [17:24:01] (03CR) 10Filippo Giunchedi: [C: 032] Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles) [17:24:15] (03CR) 10Andrew Bogott: [C: 032] Delegate 208.80.155.128/25 (labs instances) PTR records to labs-ns* so they can be managed automatically [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [17:24:19] (03PS6) 10Andrew Bogott: Delegate 208.80.155.128/25 
(labs instances) PTR records to labs-ns* so they can be managed automatically [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [17:25:25] 06Operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#2519790 (10aaron) [17:29:49] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [17:35:00] !log citoid deploying 0b9f59fe0 [17:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:45] 06Operations, 10hardware-requests: eqiad: (4) spare pool servers for kubernetes - https://phabricator.wikimedia.org/T141624#2519829 (10RobH) [17:40:06] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2519830 (10Jgreen) DNS records have not been removed yet, I haven't heard whether they're done with the A/B tests. [17:40:40] 06Operations, 10hardware-requests: eqiad: (4) spare pool servers for kubernetes - https://phabricator.wikimedia.org/T141624#2505235 (10RobH) a:05RobH>03mark Escalating to @mark for his review/approval for allocation. The systems meet all requirements. Please note when these are allocated, we will only ha... 
[17:44:22] (03PS1) 10Alexandros Kosiaris: hiera: Remove 2 items from the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/302750 [17:49:07] (03Draft3) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [17:49:12] (03Draft2) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [17:49:16] (03Draft1) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [17:52:20] (03PS1) 10BBlack: Text VCL: validate w.wiki URLs (for real this time) [puppet] - 10https://gerrit.wikimedia.org/r/302752 (https://phabricator.wikimedia.org/T141170) [17:52:52] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: validate w.wiki URLs (for real this time) [puppet] - 10https://gerrit.wikimedia.org/r/302752 (https://phabricator.wikimedia.org/T141170) (owner: 10BBlack) [17:54:32] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2519895 (10CCogdill_WMF) We can remove the DNS record, the test is done and the tail of email opens should be about done. T... 
[17:55:51] (03PS3) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [17:56:58] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [17:58:24] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:59:25] (03PS2) 10Alexandros Kosiaris: graphite: Prepend @ to hostname in ERB template [puppet] - 10https://gerrit.wikimedia.org/r/302704 [17:59:30] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] graphite: Prepend @ to hostname in ERB template [puppet] - 10https://gerrit.wikimedia.org/r/302704 (owner: 10Alexandros Kosiaris) [18:00:13] bblack: merged yours as well [18:00:50] (03CR) 10Alexandros Kosiaris: "After the move to ores-redis-02 is this still needed ?" [puppet] - 10https://gerrit.wikimedia.org/r/302705 (owner: 10Halfak) [18:05:57] Krenair: i'll file a task about that beta login issue, unless you did? 
[18:05:59] PROBLEM - parsoid on wtp1006 is CRITICAL: Connection refused [18:06:08] MatmaRex, I didn't - go for it [18:06:16] (03PS4) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [18:06:38] PROBLEM - salt-minion processes on wtp1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:06:39] PROBLEM - parsoid on wtp1005 is CRITICAL: Connection refused [18:07:18] PROBLEM - salt-minion processes on wtp1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:07:29] PROBLEM - parsoid on wtp1004 is CRITICAL: Connection refused [18:07:59] PROBLEM - parsoid on wtp1003 is CRITICAL: Connection refused [18:07:59] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:08:12] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [18:08:12] !log restarted morebots-production [18:08:17] Luke081515: ^ done now [18:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:29] PROBLEM - salt-minion processes on wtp1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:08:34] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05MW-1.28-release-notes, and 3 others: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10greg) >>! In T140898#2511461, @Dereckson wrote: > Okay, so let's wait wmf13 is deployed and we create the wiki? > > The i... 
[18:09:07] Krenair: https://phabricator.wikimedia.org/T142015 [18:10:30] mutante: thx [18:13:52] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2519965 (10Dzahn) crons have been created on serpens and seaborgium. they will check once an hour (at a random minute so they are never restarted at the same time) if more than 50% of memory... [18:14:59] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2519966 (10Dzahn) ``` [seaborgium:~] $ /bin/ps -C slapd -o pmem= 3.5 [serpens:~] $ /bin/ps -C slapd -o pmem= 27.1 ``` [18:21:40] (03PS5) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [18:24:17] RECOVERY - salt-minion processes on wtp1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:27:58] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Puppet has 1 failures [18:28:35] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [18:28:48] RECOVERY - salt-minion processes on wtp1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:29:04] (03PS1) 10Dzahn: strontium: add IPv6 AAAA and reverse record [dns] - 10https://gerrit.wikimedia.org/r/302757 [18:29:07] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#2519985 (10RobH) [18:32:18] RECOVERY - salt-minion processes on wtp1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:32:47] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 1 failures [18:33:03] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#2519994 (10RobH) [18:35:49] (03PS1) 10RobH: decom dickson, remove prod dns 
entries [dns] - 10https://gerrit.wikimedia.org/r/302760 [18:36:56] (03CR) 10RobH: [C: 032] decom dickson, remove prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/302760 (owner: 10RobH) [18:37:58] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:39:19] (03PS1) 10RobH: decom dickson - remove from install_server [puppet] - 10https://gerrit.wikimedia.org/r/302761 [18:39:52] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#2520014 (10RobH) [18:40:20] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#1860514 (10RobH) a:05RobH>03Cmjohnson Assigned to @Cmjohnson for hdd wipe and/or ssd removal, plus other decom steps following. [18:41:38] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:45:15] thcipriani: i would like to deploy https://gerrit.wikimedia.org/r/#/c/302739/ for wikidata before the train [18:45:17] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:45:31] maybe if no one is deploying stuff now, i could quickly do it [18:45:43] (or quick as jenkins allows) [18:46:28] aude: yup, shouldn't be anyone deploying now AFAIK. 
[18:46:53] ok [18:46:57] * aude proceeds [18:50:48] RECOVERY - parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.012 second response time [18:51:08] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:51:17] RECOVERY - parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.034 second response time [18:51:18] RECOVERY - parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.022 second response time [18:51:37] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:51:38] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:51:57] RECOVERY - parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.026 second response time [18:52:18] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:53:01] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [18:53:02] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [18:53:03] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1005.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [18:53:06] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1006.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [18:53:07] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1007.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [18:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:33] !log T135176 pool wtp100[34567] with weight=15 [18:53:34] T135176: Migrate Parsoid 
cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176 [18:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:01] alex running mega compile job too, heh [18:55:30] (03CR) 10Dzahn: [C: 032] Gerrit: Do a reindex on a fresh install, less surprises [puppet] - 10https://gerrit.wikimedia.org/r/302497 (owner: 10Chad) [18:55:47] (03PS2) 10Dzahn: Gerrit: Do a reindex on a fresh install, less surprises [puppet] - 10https://gerrit.wikimedia.org/r/302497 (owner: 10Chad) [18:59:26] !log aude@tin Synchronized php-1.28.0-wmf.13/extensions/Wikidata: Update PropertySuggester (duration: 02m 02s) [18:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:05] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160803T1900). [19:00:14] * aude checks [19:00:35] aude: ping me when complete, please [19:01:09] looks good [19:01:33] thcipriani: ^ [19:01:41] awesome, thanks! [19:02:13] (03CR) 10Southparkfan: tcpircbot: add rhodium to allowed hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302621 (owner: 10Dzahn) [19:07:10] (03PS6) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [19:08:37] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [19:09:42] blerg. [19:10:17] can I get a root to chown -R /srv/mediawiki-staging/.git/objects for me? [19:10:26] got some things owned by root:root [19:10:37] ...which I thought we had an alert for? [19:12:01] sec, this is on tin I guess? [19:12:12] thcipriani: [19:12:22] 06Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic: Strip query parameters from w.wiki domain - https://phabricator.wikimedia.org/T141170#2520208 (10Legoktm) 05Open>03Resolved Thanks! 
It works now :) [19:12:25] apergos: yup, on tin, thanks! [19:13:30] all owned by mwdeploy now [19:13:34] did you need a group? [19:13:48] thcipriani: [19:13:58] apergos: yeah, group needs to be wikidev so I can write [19:14:08] sorry wasn't specific :( [19:14:18] done [19:14:26] thank you! [19:14:28] yw [19:14:34] weird we have that again.... [19:15:20] yeah, it's strange :\ [19:15:42] yes, we have a check. it said things are ok :p [19:15:56] and we were _just_ even adding that [19:16:40] i mean, it had another issue before so it wouldnt work over NRPE.. then we fixed. now it said things are fine [19:17:11] hmm, weird. [19:17:47] (03PS1) 10Thcipriani: group1 wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302770 [19:18:46] oh, wait [19:18:54] it did detect it on tin.. 3 minutes ago [19:18:55] now :p [19:19:03] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302770 (owner: 10Thcipriani) [19:19:28] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302770 (owner: 10Thcipriani) [19:19:49] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.13 [19:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:10] removed "permanent ack" it had on icinga, which kept the bot from talking about it probably [19:20:35] ahhh, makes sense. [19:22:22] (03PS1) 10BBlack: nginx (1.11.3-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/302771 [19:26:18] (03PS1) 10Dzahn: rhodium: add IPv6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302772 [19:27:45] thanks, mutante. and whyyyyy didn't it say something? 
[19:27:51] ah [19:28:13] good, that means I can not hang out on tin for the rest of the evening [19:29:29] apergos: search string "improperly" on icinga web ui :) [19:30:00] I would but this lazy bum is eating now :-P [19:30:03] it thinks mira is ok and tin is not, currently [19:30:20] but it just checks every 10 min or so [19:30:26] that would explain it [19:30:34] 06Operations, 06Labs, 10Labs-project-Librarybase: librarybase project cannot create a proxy for librarybase.wmflabs.org - https://phabricator.wikimedia.org/T131448#2520317 (10Harej) [19:33:57] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [19:35:59] thcipriani: apergos: there it is .. we'll see what happens [19:38:30] !log ganeti1004, start salt-minion [19:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:37] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:40:25] (03PS7) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [19:41:28] (03CR) 10jenkins-bot: [V: 04-1] [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [19:45:23] (03PS3) 10Dzahn: tcpircbot: add rhodium to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/302621 [19:52:18] (03PS8) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [19:52:36] (03PS9) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [19:54:54] Hey guys, I have a question for you [19:55:34] 0-0 [19:55:48] Does sitenotice have the same site performance impacts as central notice?
[19:55:51] Seddon: ask :) [19:55:59] SPF|Cloud: too late :-P [19:56:07] IRCCloud lagged :p [19:56:10] hahaha [19:56:13] that'll teach ya [19:56:44] I hate using real clients (especially those in the linux terminal) ;) [19:58:16] (03PS1) 10Dzahn: add deployment, maintenance servers to hieradata common [puppet] - 10https://gerrit.wikimedia.org/r/302774 [19:59:30] (03CR) 10Southparkfan: [WIP] labstore: Configure drbd for a HA labstore setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [19:59:46] (03PS1) 10Alex Monk: cronspam: Send floating-ip-ptr-record-updater stdout to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/302776 [19:59:50] thcipriani: can we temporarily put wikidata back on wmf.12? [19:59:59] RoanKattouw: heh, please add +2 again: https://gerrit.wikimedia.org/r/#/c/302759/2 ;) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160803T2000). Please do the needful. [20:00:10] seems like jenkins is a bit lazy actually [20:00:11] we found an issue and have a fix, but it might take some time to prepare the backport [20:00:26] (03CR) 10Paladox: "@Muehlenhoff hi, please could you review and merge and upload it to apt please?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/302498 (owner: 10Chad) [20:01:34] aude: eh, I suppose that would mean a full scap in both directions (to wmf.12 and back to wmf.13)? 
[20:01:48] (03CR) 10Andrew Bogott: [C: 032] cronspam: Send floating-ip-ptr-record-updater stdout to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/302776 (owner: 10Alex Monk) [20:01:53] (03PS2) 10Andrew Bogott: cronspam: Send floating-ip-ptr-record-updater stdout to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/302776 (owner: 10Alex Monk) [20:02:00] (03PS1) 10Aude: Put wikidata back on wmf.12 (until T142032 is fixed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302777 [20:02:07] thcipriani: no [20:02:15] just ^ [20:02:16] Any answer for the question? [20:02:46] aude: oh, sorry, misunderstood, yeah, doing now. [20:02:49] thanks [20:02:57] (03CR) 10Thcipriani: [C: 032] Put wikidata back on wmf.12 (until T142032 is fixed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302777 (owner: 10Aude) [20:02:58] maybe we can get the fix in for swat [20:03:05] ack [20:03:24] (03Merged) 10jenkins-bot: Put wikidata back on wmf.12 (until T142032 is fixed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302777 (owner: 10Aude) [20:03:31] it's trivial but still takes time for someone to review and to do the backport [20:04:33] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2520433 (10HenryLi) I am glad to hear some good progress. I read through all the related issues and come across with this document. [[ https://wikitech.wikimedia.o... [20:05:33] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidata to 1.28.0-wmf.12 [20:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:01] ^ aude sync'd! 
[20:06:32] thcipriani: thanks [20:08:45] (03PS2) 10Andrew Bogott: Set up labtestspice -> labtestcontrol2001 on misc-web [puppet] - 10https://gerrit.wikimedia.org/r/302714 [20:09:11] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2520471 (10Dzahn) a:05Dzahn>03HJiang-WMF [20:09:50] did i just break de.wiki? [20:10:11] matanya: how? [20:10:18] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520473 (10Dzahn) cc: @ema the pwstore part should also be unblocked now. [20:10:18] bigdelete [20:10:23] o_O [20:10:38] need a dba to verify [20:10:43] (03CR) 10Andrew Bogott: [C: 032] Set up labtestspice -> labtestcontrol2001 on misc-web [puppet] - 10https://gerrit.wikimedia.org/r/302714 (owner: 10Andrew Bogott) [20:10:54] any jynus around ? [20:10:59] (03PS7) 10Gehel: Maps - initial data import [puppet] - 10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) [20:11:48] no, didn't break, maybe, made it inconsistent [20:11:50] (03CR) 10Andrew Bogott: [C: 032] Set up spice-based remote consoles for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) (owner: 10Andrew Bogott) [20:11:57] (03PS6) 10Andrew Bogott: Set up spice-based remote consoles for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) [20:14:00] (03CR) 10Gehel: [C: 032] Maps - initial data import [puppet] - 10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [20:14:36] (03CR) 10Dzahn: [C: 032] scap::l10nupdate: Fix ~l10nupdate provisioning in Labs [puppet] - 10https://gerrit.wikimedia.org/r/301405 (owner: 10BryanDavis) [20:14:59] (03PS8) 10Gehel: Maps - initial data import [puppet] - 
10https://gerrit.wikimedia.org/r/300572 (https://phabricator.wikimedia.org/T138501) [20:15:16] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.041 second response time [20:15:53] Jeff_Green: ^ [20:16:24] what did you do, matanya, and what did you delete from my db? [20:16:46] jynus: https://de.wikipedia.org/w/index.php?title=Benutzer:JTCEPB/Ted_Bundy&action=delete [20:17:04] I do not have permissions to read that [20:17:19] (03PS5) 10Dzahn: scap::l10nupdate: Fix ~l10nupdate provisioning in Labs [puppet] - 10https://gerrit.wikimedia.org/r/301405 (owner: 10BryanDavis) [20:17:32] mutante: looking... [20:17:47] jynus: https://de.wikipedia.org/wiki/Benutzer:JTCEPB/Ted_Bundy [20:17:47] matanya: I think actually everything is allright? [20:17:59] can't tell, got a DB error [20:18:01] what happend, as you executed "delete"? [20:18:04] (03PS1) 10RobH: new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 [20:18:12] Luke081515: No, that lack of merge after +2 is correct, it Depends-On an unmerged patch in MW core [20:18:20] ah [20:18:22] my fault^^ [20:18:24] matanya, a db error is not a big issue [20:18:32] jynus: good news [20:18:47] can i delete it again then jynus ? [20:19:10] RoanKattouw: where can I find the depencies at the new gerrit? [20:19:18] (03PS6) 10Dzahn: scap::l10nupdate: Fix ~l10nupdate provisioning in Labs [puppet] - 10https://gerrit.wikimedia.org/r/301405 (owner: 10BryanDavis) [20:19:42] Luke081515: This is a cross-repo dependency so it's not listed in the Gerrit interface explicitly. 
See the Depends-On: I123456 thing in the commit message, and follow that link [20:20:09] (03CR) 10RobH: [C: 032] new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 (owner: 10RobH) [20:20:16] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.040 second response time [20:20:17] (03PS2) 10RobH: new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 [20:20:17] 06Operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2520524 (10Krinkle) [20:20:23] (03CR) 10RobH: [V: 032] new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 (owner: 10RobH) [20:20:32] RoanKattouw: ah, ok. Thx :) [20:20:41] matanya, you tried to delete that page and it failed? [20:20:48] yes [20:21:00] RoanKattouw: I wonder why I can set V+1 or V-1 for echo-patches? [20:21:14] Can't you do that in core too? [20:21:26] Or did they change their ACLs again [20:21:38] RoanKattouw: only at echo [20:21:46] this is why I noticed it: It's new :O [20:21:58] matanya, let me check if it is not urgent errors/logs [20:22:09] I mean V+1 is not a problem, but I think every user can block one of the echo changes by V-1 now [20:22:18] jynus: no rush [20:22:26] (03PS3) 10RobH: new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 [20:22:32] Oh it's because in MW core only people in the "mediawiki" group can V+1 [20:23:11] matanya, when you tried deleting it, you mean with mediawiki ui or a script? [20:23:21] (03CR) 10RobH: [V: 032] new policy.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/302779 (owner: 10RobH) [20:23:27] jynus: i see why it failed, it reached timeout. 
i used the UI [20:23:45] then no issue [20:23:55] I cannot see any problem [20:24:00] and how to continue in that case? [20:24:03] write error, operation took more than 5 sec (5.7485411167145) [20:24:06] just try again? [20:24:13] there can be times, in extreme cases, where things are uncached [20:24:18] and take a lot of time [20:24:34] i'll give it a shot now [20:24:56] worked [20:24:56] and performance and ops are trying to lower the max transaction time for several reasons [20:25:09] we still have to tune edge cases [20:25:16] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.040 second response time [20:25:24] (03CR) 10Dzahn: "no-op in prod.http://puppet-compiler.wmflabs.org/3575/" [puppet] - 10https://gerrit.wikimedia.org/r/301405 (owner: 10BryanDavis) [20:25:29] but in the end it will result in faster servers [20:25:38] sorry for the inconvenience [20:25:45] i hit that when i do bigdeletes or hit a page with many transclusions [20:25:49] jynus: so, in general if people get a timeout, they should just try again? 
[20:25:54] yes [20:25:59] ok [20:26:01] (03CR) 10Dzahn: [V: 04-1] "syntax error http://puppet-compiler.wmflabs.org/3576/lead.wikimedia.org/change.lead.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [20:26:02] if it is frequent, then report a bug [20:26:02] sounds easy :) [20:26:11] thanks for your help, and sorry for the noise jynus [20:26:18] well, you made my day [20:26:31] I thought you were running scripts towards the db [20:26:39] and had deleted things from the database [20:26:51] :O [20:26:51] unless there is a huge bug [20:26:57] would not be that stupid, i hope :) [20:27:10] you wouldn't be the first :-) [20:27:24] but even in that case we have backups [20:28:02] but with the ui it is almost impossible to break things [20:28:02] I rather not be there, as a sysadmin, i hate to be on the breaking side. prefer being on the fixing one [20:28:03] != things working 100% of the time [20:28:16] (03CR) 10Dzahn: "that doesnt mean i understand why we get that error here.. wth" [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [20:28:19] (03PS1) 10Aude: Revert "Put wikidata back on wmf.12 (until T142032 is fixed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302782 [20:28:46] things edge cases such as users with 10K pages on watchlist [20:28:47] or checking the history of a user with millions of contributions [20:28:48] the first time you call mediawiki, it will time out [20:28:49] the second time it will be in cache and it will work [20:28:58] AaronSchulz: hiii! ok I am back at it on the hook changes for events [20:29:02] got two reviews for you [20:29:07] ok, back to steward tasks queue [20:29:13] Dereckson: want rights ? 
[20:29:14] we try to avoid those cases, but there are millions of possible queries and we have to fix them once at a time [20:29:15] https://gerrit.wikimedia.org/r/#/c/298548/2/ [20:29:15] https://gerrit.wikimedia.org/r/#/c/299008/ [20:29:38] so 1 in 1 million requests fail [20:29:45] not a big deal [20:29:47] not bad rate [20:30:16] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.040 second response time [20:30:49] you scared me, you mention "break and delete" and you scared the hell out of me :-) [20:31:19] I will know for next time :) [20:31:30] no, actually you did good [20:31:40] better to communicate fast [20:32:13] Hi. matanya > for the import stuff? [20:32:15] as the longer it takes to detect issues, the more difficult it is to fix them [20:32:20] yes Dereckson [20:32:51] have a nice day, deleting things, bye! [20:32:58] see you! [20:33:00] matanya: what we need is to import these pages from en. to fr. 
: https://phabricator.wikimedia.org/P3631 [20:33:13] right is perhaps overkill for one import operations [20:33:21] i'd rather just grant you the right and remove after you are done [20:33:32] okay [20:33:43] doing [20:35:10] (03PS2) 10Dzahn: Gerrit: Make heapLimit configurable per host as well [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [20:35:16] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.040 second response time [20:35:33] Dereckson: please check [20:36:13] matanya: https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Importer works, thanks, doing the import [20:36:36] cool, let me know when done [20:39:24] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520557 (10madhuvishy) Still needed to be add as labs root. This is not done yet [20:40:16] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 291 bytes in 0.011 second response time [20:40:31] 06Operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#2520562 (10RobH) [20:52:43] (03PS1) 10Kaldari: Test numeric collation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302819 (https://phabricator.wikimedia.org/T141433) [20:57:56] Dereckson: are you done ? 
[20:58:18] not yet, I'm trying to see what's missing, we've some red links [21:00:03] (03CR) 10Dzahn: [C: 032] "eh, compiles fine after rebase http://puppet-compiler.wmflabs.org/3578/lead.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [21:00:34] !log starting mobileapps deploy [21:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:39] !log deployed mobileapps e48b6a8 [21:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:11] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2215676 (10BBlack) >>! In T132969#2429719, @Fjalapeno wrote: > @Krinkle the version of the iOS app that made those requests is a legacy version - the iOS app no... [21:06:53] matanya: I'm done, thanks [21:18:24] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2520673 (10Mholloway) @BBlack, there's no policy on supporting un-upgraded versions of which I'm aware (but I'll add @Dbrant as product owner here for comment).... [21:18:48] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:19:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[21:19:51] 06Operations, 10Traffic, 07Mobile, 13Patch-For-Review: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#2520676 (10Krinkle) [21:19:53] 06Operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2520675 (10Krinkle) [21:19:58] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:21:28] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail [21:23:02] (03CR) 10BBlack: [C: 031] rhodium: add IPv6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302772 (owner: 10Dzahn) [21:27:40] (03PS1) 10ArielGlenn: capture dumps cron job output in log and add log rotation [puppet] - 10https://gerrit.wikimedia.org/r/302827 [21:34:08] (03CR) 10Danny B.: [C: 04-1] Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [21:34:16] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2520721 (10madhuvishy) The labs root thing is all good now. Thanks @yuvipanda [21:36:11] (03PS1) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [21:36:57] (03CR) 10jenkins-bot: [V: 04-1] Make scheduler hupable. 
[dumps] - 10https://gerrit.wikimedia.org/r/302831 (owner: 10ArielGlenn) [21:39:14] yep we knew that [21:39:25] but rather have it there than only on my laptop [21:39:35] midnight-30 again [21:39:50] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.3-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/302771 (owner: 10BBlack) [21:41:19] !log nginx-1.11.3-1+wmf1 uploaded to carbon jessie-wikimedia [21:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:35] (03CR) 10Paladox: "@Chad what we want to do is to do init first and if that fails then reindex please. Since reindex expects bin/gerrit.war to be there but i" [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [21:42:57] (03PS1) 10Andrew Bogott: nova.conf: s/true/True and s/false/False [puppet] - 10https://gerrit.wikimedia.org/r/302833 [21:42:59] (03PS1) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:46:03] !log upgrading nginx on cache_misc + cache_maps [21:46:08] (03CR) 10Chad: "Why would reindex expect bin/gerrit.war to exist? The command specifies the other gerrit.war from the package." 
[puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [21:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:48:04] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:48:30] (03CR) 10Andrew Bogott: [C: 032] nova.conf: s/true/True and s/false/False [puppet] - 10https://gerrit.wikimedia.org/r/302833 (owner: 10Andrew Bogott) [21:49:24] (03CR) 10Paladox: "Oh but it fails when installing" [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [21:49:28] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:49:59] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:50:48] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [21:50:50] (03CR) 10Paladox: "Oh wait why did I write that in here. Sorry wrong commit." [puppet] - 10https://gerrit.wikimedia.org/r/302504 (owner: 10Chad) [21:51:12] (03CR) 10Paladox: "@chad I get" [puppet] - 10https://gerrit.wikimedia.org/r/302497 (owner: 10Chad) [21:54:03] !log upgrading nginx on cache_text + cache_upload [21:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:11] (03PS1) 10Alex Monk: labnet: Merge site_address and network_public_ip in novaconfig [puppet] - 10https://gerrit.wikimedia.org/r/302835 [22:03:55] 06Operations, 10Traffic, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494832 (10Krinkle) The Commons app for Android (previously by Wikimedia, now community-maintained) also uses `bits.wikimedia.org/event.gif` still. Fix pending at Krenair: bits.beta decom? 
[22:05:29] Just deleted about 10 references in different repos [22:05:46] (for prod that is) [22:05:59] 10? what were they all? [22:06:04] I thought there was like 4 when we checked [22:07:07] (03Draft2) 10Paladox: Fix 'reindex_gerrit_jetty': Since we need run init before reindex [puppet] - 10https://gerrit.wikimedia.org/r/302840 [22:08:32] (03PS3) 10Paladox: Fix 'reindex_gerrit_jetty': Since we need run init before reindex [puppet] - 10https://gerrit.wikimedia.org/r/302840 [22:08:56] 06Operations, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2520829 (10Revent) Error undeleting file: The file "mwstore://local-multiwrite/local-public/8/84/張善政參訪2016臺灣國際蘭展_02.jpg" is in an inconsistent state within the internal sto... [22:18:24] (03CR) 10Chad: [C: 04-1] "This is not correct. You need to init before you can reindex. Also: moving it does nothing." [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:18:46] (03CR) 10Paladox: "oh, but it dosent init first, it does reindex." [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:19:33] (03CR) 10Paladox: "@Chad So init needs fixing so that it will run first, since I doint have any of the projects in review_site/git/ where I should have them." [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:20:51] Krenair: 45 mentions in Wikimedia Git of 'bits.wikimedia' [22:21:19] 44 [22:21:20] https://github.com/search?q=org%3Awikimedia+%22bits.wikimedia%22&type=Code [22:21:30] oh [22:21:34] (03CR) 10Chad: "It should run first anyway, what do you think this part is for:" [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:21:38] so shall we get rid of bits.beta? [22:22:14] (03CR) 10Paladox: "I thought reindex ran first then it would be init if it passed. but seems I am wrong." 
[puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:22:17] (03Abandoned) 10Paladox: Fix 'reindex_gerrit_jetty': Since we need run init before reindex [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:22:33] (03CR) 10Paladox: "@Chad ^^" [puppet] - 10https://gerrit.wikimedia.org/r/302840 (owner: 10Paladox) [22:25:03] Krenair: kill it. [22:28:01] Yeah, let's do it. [22:28:11] Krenair: I can try and rid the references in puppet, but not sure what I'm doing. [22:28:43] I think we can just kill this line: hieradata/labs.yaml:role::cache::text::bits_domain: 'bits.beta.wmflabs.org' [22:29:10] and the whole of modules/mediawiki/files/apache/beta/sites/wmflabs.conf (given bits is the only thing in it) as well as the reference to it in modules/mediawiki/manifests/web/beta_sites.pp [22:30:28] Yeah [22:30:36] (03PS9) 10Yuvipanda: uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 (owner: 10Madhuvishy) [22:30:43] I guess the virtual host will then disappear next puppet run? [22:30:46] there's no DNS entry to get rid of either, it's just a match for *.beta.wmflabs.org [22:30:50] yep [22:30:52] Assuming it automatically does a graceful [22:35:51] (03CR) 10Dzahn: "This does not tell me why we are doing it, nor does it tell me why it was voted -1. No context." [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [22:44:50] !log starting mobileapps deployment [22:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:38] Hi. 
[22:46:00] jdlrobson: [wmf.13] 302654 Adjust notification badges for Monobook [22:46:05] [wmf.13] 302780 Update PropertySuggester (fix js error when adding statements) [22:46:20] jdlrobson: that's the suggested format for the deployment log [22:46:24] Dereckson: FYI I just added https://gerrit.wikimedia.org/r/#/c/302845/ [22:46:26] Dereckson: are you doing swat today? [22:46:38] (Pinging you because you already +2ed my other wmf13 patch) [22:46:52] !log mobileapps deployed e7488f6 [22:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:00] jdlrobson: you prepend the branch you wish to deploy to, so we directly see which version it's for (and so in which directory to apply it), or [config] when it's for mediawiki-config [22:47:11] RoanKattouw: ack'ed [22:47:25] aude: yes I can [22:47:28] ok [22:47:34] can my patches go first? [22:47:40] * aude needs to eat soon [22:48:26] aude: if you're in a hurry, what about beginning the SWAT now (there isn't any deployment window still running) and you deploy them yourself (as you know better how to deal with Wikidata stuff)? [22:48:35] ok :) [22:48:36] then I'll handle all the other patches [22:48:43] sounds good [22:48:47] (03PS10) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [22:49:04] (03PS11) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 [22:49:13] jenkins usually takes some time [22:51:46] hehe [22:52:17] Dereckson: I'm here (for my patch) ping me when you get to it please! :) [22:52:47] k [22:57:07] aude: zuul done it seems [22:59:46] ok [22:59:49] * aude proceeds [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160803T2300). Please do the needful.
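(Editor's note: the entry format suggested in this exchange, a bracketed target followed by the Gerrit change number and the summary, can be sketched as below. This is only an illustration; the function name `deployment_entry` is hypothetical and is not part of any Wikimedia tooling.)

```python
def deployment_entry(target: str, change_id: int, summary: str) -> str:
    """Format one deployment-calendar entry.

    `target` is the branch the patch goes to (for example "wmf.13"),
    or "config" for a mediawiki-config change, as described in the chat.
    The helper itself is hypothetical and exists only for illustration.
    """
    return f"[{target}] {change_id} {summary}"


# The two entries quoted in the conversation, rebuilt with the helper.
print(deployment_entry("wmf.13", 302654, "Adjust notification badges for Monobook"))
print(deployment_entry(
    "wmf.13", 302780,
    "Update PropertySuggester (fix js error when adding statements)",
))
```

Putting the target first lets the deployer see at a glance which directory on the deployment server a patch applies to.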
[23:00:04] RoanKattouw, Addshore, aude, kaldari, and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] \o [23:00:20] o/ [23:01:03] So aude is deploying Wikidata patches, then I go on the SWAT. [23:01:08] i'll be pulling in echo and mobile frontend patches but wont sync or update those submodules [23:01:17] here [23:01:40] \m/ [23:02:03] jdlrobson: could you try something like [wmf.13] Do not output the 'switch language' action on Main Page in beta (the Gerrit link) (a link to T142016) in the deployment log? [23:02:03] T142016: Beta: Cannot access languages on main page - https://phabricator.wikimedia.org/T142016 [23:02:17] {{phabT|142016}} for the task [23:02:45] Huh? [23:03:32] kaldari: Can I come watch over your shoulder if it's anything interesting/useful? [23:04:14] !log aude@tin Synchronized php-1.28.0-wmf.13/extensions/Wikidata: Update PropertySuggester (duration: 02m 04s) [23:04:20] * aude checks on test.wikidata [23:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:06] looks good [23:05:18] (03CR) 10Aude: [C: 032] Revert "Put wikidata back on wmf.12 (until T142032 is fixed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302782 (owner: 10Aude) [23:05:32] Niharika: Dereckson's doing the SWAT deployment so won't be much to see on my end :P [23:05:47] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[23:05:55] (03Merged) 10jenkins-bot: Revert "Put wikidata back on wmf.12 (until T142032 is fixed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302782 (owner: 10Aude) [23:06:05] (03PS2) 10Dzahn: Adding my new SSH key to production [puppet] - 10https://gerrit.wikimedia.org/r/302277 (owner: 10Chad) [23:06:35] :( [23:06:39] :D [23:06:52] !log aude@tin rebuilt wikiversions.php and synchronized wikiversions files: Wikidata back to wmf.13 [23:06:58] * aude checks [23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:02] jdlrobson: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=816249&oldid=816247 [23:07:30] seems ok [23:07:45] * aude notes Special:Nearby appears broken on wikidata again (no labels) [23:07:47] but not urgent [23:08:14] (03PS3) 10Dereckson: Enable RevisionSlider on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302690 (https://phabricator.wikimedia.org/T141974) (owner: 10Addshore) [23:08:41] kaldari and Niharika > you'll still be able to test on mw1099 and you can watch https://logstash.wikimedia.org/app/kibana#/dashboard/mw1099?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!f,value:mw1099),query:(match:(host:(query:mw1099,type:phrase [23:08:48] ))))),options:(darkTheme:!f),panels:!((col:7,id:Event-Level,panelIndex:3,row:1,size_x:3,size_y:3,type:visualization),(col:1,id:Trending-Messages,panelIndex:6,row:4,size_x:12,size_y:4,type:visualization),(col:10,id:MediaWiki-Versions,panelIndex:8,row:1,size_x:3,size_y:3,type:visualization),(col:1,id:Top-Channels-table,panelIndex:11,row:1,size_x:6,size_y:3,type:visualization),(col:1,id:Events [23:08:54] 
-Over-Time,panelIndex:14,row:1,size_x:12,size_y:2,type:visualization),(col:1,columns:!(level,channel,host,wiki,message),id:MediaWiki-Events-List,panelIndex:15,row:10,size_x:12,size_y:11,sort:!('@timestamp',desc),type:search)),query:(query_string:(analyze_wildcard:!t,query:'*')),title:mw1099,uiState:(P-3:(spy:(mode:(fill:!f,name:!n)),vis:(legendOpen:!f)),P-6:(spy:(mode:(fill:!f,name:!n))))) [23:09:00] to check for errors [23:09:01] that is a great URL ;) [23:09:02] (outch the extra long URL) [23:09:06] That's a url?! [23:09:09] yes :D [23:09:12] That was split over 3 messages on my end [23:09:18] Dereckson: oh I see. Sorry. It's a shame the gerrit template can't automatically do that [23:09:20] Ditto. [23:09:21] with a JSON message at the end [23:09:22] Logstash really needs to improve its URLs [23:09:33] RoanKattouw: yup [23:09:40] Quick, someone register logst.sh for Logstash URL shortening [23:10:00] there's shortener built in if you hit the right thing on the UI. I found it once... [23:10:03] jdlrobson: I can offer you a tool to write them auto, if you've a Phabricator column to put tasks to deploy [23:10:17] The URL is at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#General_Advice (the mw1099 one) [23:10:36] hey not even JSON the kibana format [23:10:38] a query [23:11:45] jdlrobson: for wikimedia-site-requests, we put them in a To deploy column, so I wrote https://github.com/dereckson/wikimedia-config-todeploy [23:12:08] if you click the [^] share icon and then the >< "Generate short url" icon it will give you something like https://logstash.wikimedia.org/goto/adeeb3ee06e936a57dcf749f7e1564bd [23:12:24] aude: you're done? 
[23:13:24] (03CR) 10Dzahn: [C: 032] Adding my new SSH key to production [puppet] - 10https://gerrit.wikimedia.org/r/302277 (owner: 10Chad) [23:16:01] (I guess yes, as logged out from tin) [23:16:20] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302690 (https://phabricator.wikimedia.org/T141974) (owner: 10Addshore) [23:16:47] (03Merged) 10jenkins-bot: Enable RevisionSlider on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302690 (https://phabricator.wikimedia.org/T141974) (owner: 10Addshore) [23:17:09] addshore: live on mw1099 [23:17:12] checking [23:17:45] all looks good! [23:18:13] Message blob cache-miss for ext.RevisionSlider.HelpDialog [23:19:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable RevisionSlider on plwiki (T141974) (duration: 00m 26s) [23:19:37] T141974: Enable Revision-Slider on pl.wikipedia as a BetaFeature - https://phabricator.wikimedia.org/T141974 [23:19:39] addshore: in prod ^ [23:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:58] ack, still looks all good! :) [23:20:05] Thanks for testing. [23:22:46] jdlrobson: live on mw1099 [23:23:50] Dereckson: works nicely. thanks [23:24:02] you're welcome [23:24:05] sending in prod [23:24:23] !log dereckson@tin Synchronized php-1.28.0-wmf.13/extensions/MobileFrontend/includes/skins/SkinMinervaBeta.php: Do not output the 'switch language' action on Main Page in beta (T142016) (duration: 00m 28s) [23:24:24] T142016: Beta: Cannot access languages on main page - https://phabricator.wikimedia.org/T142016 [23:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:49] Here you are. 
[23:25:44] Niharika: so, as our config repo is fast-forward only, the first step I do is rebase the change [23:25:52] (03PS2) 10Dereckson: Test numeric collation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302819 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:26:19] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302819 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:26:38] Okay. [23:26:44] Then I CR+2 to have zuul pick it up into the gate-and-submit queue. [23:26:47] (03Merged) 10jenkins-bot: Test numeric collation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302819 (https://phabricator.wikimedia.org/T141433) (owner: 10Kaldari) [23:26:50] (03PS2) 10Dzahn: Gerrit: Set default owners for mediawiki/* and operations/* projects [puppet] - 10https://gerrit.wikimedia.org/r/301822 (owner: 10Chad) [23:26:50] it's a lot quicker than for core and extensions [23:27:00] (03CR) 10Dzahn: [C: 032] Gerrit: Set default owners for mediawiki/* and operations/* projects [puppet] - 10https://gerrit.wikimedia.org/r/301822 (owner: 10Chad) [23:27:01] Okay. [23:27:20] jouncebot: next [23:27:20] In 0 hour(s) and 32 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T0000) [23:27:37] jouncebot: current [23:27:43] On Tin, our deployment server, I update the repo through Git [23:27:55] mutante: any issue? [23:28:08] are you in the middle of deploying? [23:28:09] yes [23:28:21] Niharika: then I'm pulling code on mw1099 [23:28:25] kaldari: live on mw1099 [23:28:42] Dereckson: no issues, just not restarting gerrit then [23:28:44] Niharika: this is a canary server, in the production cluster, but not visible to regular users [23:29:15] so we can test stuff on it, with some special headers described at https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [23:31:26] Dereckson: Okay.
[23:31:53] 06Operations, 10Traffic, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2521183 (10Tbayer) >>! In T107430#2184485, @BBlack wrote: > We need to start making progress on this again and kill cruft at some point... > > bits.wikimedia.org st... [23:32:07] Dereckson: I forgot to mention that this is another UCA collation test, so there's no way to test it via the canary server. We'll need to actually sync it and run a maintenance script to regenerate the sort keys for test.wikipedia. [23:32:54] You can at least test that the collation exists by visiting a category page. [23:34:10] https://test.wikipedia.org/wiki/Category:!Arquivos_da_Esplanada/2008/09 doesn't throw an error so it seems good [23:34:27] true [23:35:26] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Test numeric collation on testwiki (T141433) (duration: 00m 26s) [23:35:27] T141433: Enable numeric sorting on test wikipedia - https://phabricator.wikimedia.org/T141433 [23:35:28] Niharika: now I ran `scap sync-file wmf-config/InitialiseSettings.php 'Test numeric collation on testwiki (T141433)'` [23:35:28] T141433: Enable numeric sorting on test wikipedia - https://phabricator.wikimedia.org/T141433 [23:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:36] so the file is synced to the prod cluster [23:35:59] Niharika: last step, you can do it with kaldari: run a maintenance script on terbium to actually apply the collation [23:37:30] Dereckson: Maintenance script is done [23:37:36] Dereckson: Thank you. [23:37:44] looks good: https://test.wikipedia.org/wiki/Category:Sort_test [23:37:44] You're welcome.
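(Editor's note: the sync step quoted above passes a file path and a log message to `scap sync-file`. A minimal sketch of how such a command line could be assembled with safe shell quoting follows. The helper name `sync_file_command` is hypothetical; the sketch only builds the string and never invokes scap.)

```python
import shlex


def sync_file_command(path: str, message: str) -> str:
    """Build the `scap sync-file` command line as a single shell string.

    Only the two arguments seen in the chat are used (file path and log
    message). shlex.join quotes the message so that its spaces and
    parentheses survive the shell. This helper is hypothetical.
    """
    return shlex.join(["scap", "sync-file", path, message])


# Rebuilds the command quoted in the conversation; nothing is executed.
print(sync_file_command(
    "wmf-config/InitialiseSettings.php",
    "Test numeric collation on testwiki (T141433)",
))
```

Quoting matters here because the message text ends up in the server admin log, and an unquoted parenthesis would be interpreted by the shell.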
[23:38:14] !log ran "mwscript maintenance/updateCollation.php --wiki=testwiki --force" [23:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:21] RoanKattouw: live on mw1099 [23:38:42] (both) [23:39:45] Dereckson: Looking good [23:41:02] !log dereckson@tin Synchronized php-1.28.0-wmf.13/extensions/Echo/modules/nojs/: Adjust notification badges for monobook (T141923). Prevent IE from rendering the badge SVGs ridiculously big (T142042). (duration: 00m 29s) [23:41:03] T142042: Notification badges are REALLY big in IE10 - https://phabricator.wikimedia.org/T142042 [23:41:03] T141923: Alerts and Notices icons are too large on 1.28-wmf.13 - https://phabricator.wikimedia.org/T141923 [23:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:22] 3 Notice: Undefined index: 5 in /srv/mediawiki/php-1.28.0-wmf.12/languages/Language.php on line 3386 [23:42:29] (03PS1) 10Legoktm: UrlShortener: Whitelist *.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302851 (https://phabricator.wikimedia.org/T142055) [23:42:53] Dereckson: My change was CSS-only so I shouldn't have been able to cause that [23:44:39] Not yet in https://phabricator.wikimedia.org/project/view/1055/ [23:46:09] https://github.com/wikimedia/mediawiki/blob/wmf/1.28.0-wmf.13/languages/Language.php#L3360 [23:48:53] (03CR) 10Smalyshev: [C: 031] UrlShortener: Whitelist *.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302851 (https://phabricator.wikimedia.org/T142055) (owner: 10Legoktm) [23:49:38] filed as https://phabricator.wikimedia.org/T142061 [23:50:30] SWAT is done.
[23:50:45] mutante: I don't need Gerrit anymore [23:53:26] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [23:54:06] the icinga probe works :) [23:56:34] Dereckson: ok, thanks [23:57:03] isn't that interesting how it's broken on mira now [23:57:08] and earlier it was on tin and not on mira [23:57:19] i don't believe anymore it's from a root manually doing stuff [23:57:35] it seems it's happening on each deploy now [23:57:44] It's moments like this that I like the FreeBSD accounting system. [23:58:45] In the 70s and 80s, the accounting system made it possible to count the commands run on servers in order to invoice usage time; nowadays it survives as a big log of everything executed on the system, which you can enable when heavy debugging is needed. [23:59:25] hmm, i fixed this yesterday, but my fix is not applied on the server [23:59:37] by "this" i mean: