[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T0000). [00:00:33] 318027 works (dblist), but not 318027 (other files) [00:01:00] 309743 works but not 318027 [00:04:45] #wikimedia-labs [00:04:45] Whoops [00:04:45] I'll open an issue about that, seems to be an issue with docroot/noc/conf/index.php [00:04:45] tgr: you have the deployment floor [00:04:45] thx [00:06:30] where does one find the current ssh server fingerprints? [00:10:28] we publish that? [00:10:46] tin has been reimaged [00:10:58] so, the key change is expected [00:11:12] but I'm not sure we publish the information [00:11:43] we publish them for bastions [00:11:47] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints [00:11:51] yeah, I figured I'll be conscientious and verify the new one but it does not seem like it's available anywhere [00:12:19] found that one, but tin is not included [00:13:49] On the tin I've used this evening to deploy through bast3001, I personally have this: https://phabricator.wikimedia.org/P4313 [00:14:06] er if I don't ctrl + D... [00:14:59] https://phabricator.wikimedia.org/P4313 updated [00:30:01] !log palladium - re-shutdown [00:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:32:16] !log palladium - removed from DNS [00:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:19] Dereckson: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast3001.wikimedia.org [00:35:16] Dereckson: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/tin.eqiad.wmnet [00:35:31] tgr: ^ [00:36:08] thanks!
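The fingerprint check discussed above (comparing what a reimaged host presents against a published value) can be sketched in Python. This is only an illustration of how OpenSSH-style fingerprints are derived from a public key line; the function name `ssh_fingerprints` and the sample key are hypothetical, and the sample blob is a placeholder, not a real host key for tin or any bastion.

```python
import base64
import hashlib


def ssh_fingerprints(pubkey_line: str) -> dict:
    """Return SHA256 and MD5 fingerprints for one OpenSSH public key line.

    The line is expected in the usual "<type> <base64-blob> [comment]" form.
    """
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1], validate=True)

    # OpenSSH prints SHA256 fingerprints as unpadded base64 of the digest.
    digest = hashlib.sha256(blob).digest()
    sha256 = base64.b64encode(digest).decode("ascii").rstrip("=")

    # The legacy format is the MD5 digest as colon-separated hex bytes.
    md5_hex = hashlib.md5(blob).hexdigest()
    md5 = ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))

    return {"type": fields[0], "sha256": "SHA256:" + sha256, "md5": "MD5:" + md5}


if __name__ == "__main__":
    # Placeholder blob, NOT a real host key; use a line from a trusted source.
    blob_b64 = base64.b64encode(b"placeholder key blob").decode("ascii")
    sample = "ssh-ed25519 " + blob_b64 + " example"
    for name, value in ssh_fingerprints(sample).items():
        print(name, value)
```

The resulting strings are meant to be compared by eye with a fingerprint obtained over a trusted channel, such as the wiki pages linked above; the sketch does not fetch or trust anything by itself.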
[00:40:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1472 [00:41:23] yw, i added mira just now as well [00:43:55] 06Operations, 10netops, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2747305 (10Dzahn) Everything is gone, including DNS, just that i can't get it out of Icinga, even though i ran puppet node deactivate more than once, and even on both masters. [00:44:44] 06Operations, 10netops, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2747307 (10Dzahn) [00:45:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 2635031 Threads: 3 Questions: 291903307 Slow queries: 20021 Opens: 71714651 Flush tables: 2 Open tables: 64 Queries per second avg: 110.777 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:45:51] (03PS4) 10Dzahn: contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: 10Hashar) [00:47:00] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:47:09] (03CR) 10Dzahn: [C: 032] contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: 10Hashar) [00:48:36] (03CR) 10Dzahn: [C: 032] contint: drop useless require_package [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:49:10] (03CR) 10Dzahn: "hard to merge because it's a long dependency chain" [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:51:01] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:52:21] (03PS2) 10Dzahn: contint: drop useless require_package [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:53:13] (03PS3) 10Dzahn: contint: drop useless require_package [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:53:50] (03CR) 10Dzahn: [C: 032] contint: drop useless require_package [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:56:21] (03PS1) 10Faidon Liambotis: Remove psw1-eqiad, decom'ed [dns] - 10https://gerrit.wikimedia.org/r/318242 (https://phabricator.wikimedia.org/T149224) [00:56:25] (03PS1) 10Faidon Liambotis: Remove references to psw1-eqiad, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/318243 (https://phabricator.wikimedia.org/T149224) [00:56:50] (03PS2) 10Faidon Liambotis: Remove references to psw1-eqiad, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/318243 (https://phabricator.wikimedia.org/T149224) [00:57:00] (03CR) 10Faidon Liambotis: [C: 032] Remove psw1-eqiad, decom'ed [dns] - 10https://gerrit.wikimedia.org/r/318242 (https://phabricator.wikimedia.org/T149224) (owner: 10Faidon Liambotis) [00:57:35] does that stop the mails to noc? thanks! [00:57:45] is there a way to send web requests to mw1099 directly? [00:57:55] (03CR) 10Dzahn: "no-op on gallium and contint1001" [puppet] - 10https://gerrit.wikimedia.org/r/317986 (owner: 10Hashar) [00:58:02] (03CR) 10Faidon Liambotis: [C: 032] Remove references to psw1-eqiad, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/318243 (https://phabricator.wikimedia.org/T149224) (owner: 10Faidon Liambotis) [00:58:15] varnish has some annoying bug where fatals are incorrectly gzipped, and that applies even if I use X-Wikimedia-Debug [00:58:19] mutante: yup [00:58:24] :) [01:00:29] tgr, I think we have an open task for that varnish bug [01:00:39] https://phabricator.wikimedia.org/T125938 [01:00:55] right, I made it. 
ok [01:02:43] (03PS1) 10Dzahn: contint: remove gallium from firewall::labs [puppet] - 10https://gerrit.wikimedia.org/r/318245 (https://phabricator.wikimedia.org/T95757) [01:03:31] tgr, you can do it from a server inside the network [01:03:49] Chrome will generate a curl command for you if you want [01:04:32] otherwise... I guess you could use SSH port forwarding [01:05:36] Krenair: that'll work, thanks [01:05:54] I would avoid that sort of thing in production where possible [01:07:40] curl or port forwarding? [01:08:02] what I'd really need is something that can sign OAuth requests [01:08:07] (03PS1) 10Dzahn: cache::misc: switch gallium to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318246 (https://phabricator.wikimedia.org/T95757) [01:08:32] tgr, port forwarding [01:08:32] I normally use Python with requests_oauthlib but that's apparently not installed on terbium [01:08:53] so not sure if I have any option other than a port forward [01:09:39] although I can probably just export the signature on my own machine and then send it from terbium [01:10:48] (03PS1) 10Dzahn: contint: rm gallium from ferm rules in zuul::merger [puppet] - 10https://gerrit.wikimedia.org/r/318247 (https://phabricator.wikimedia.org/T95757) [01:12:37] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07WorkType-Maintenance: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#1314411 (10Dzahn) @hashar do you think this will change on contint1001 with openjdk-7-jdk...
[01:14:40] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:16:47] (03PS1) 10Dzahn: deployment-prep/integration: stop downgrading sshd MAC and KEX [puppet] - 10https://gerrit.wikimedia.org/r/318248 (https://phabricator.wikimedia.org/T95757) [01:17:47] tgr: if this isn't a one-time thing but needed regularly we can maybe just get that package installed via puppet [01:19:00] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [01:19:01] mutante: it's a one-time thing and it seems generating a curl command locally with python is not terribly complicated [01:19:09] alright [01:19:55] (03PS2) 10Dzahn: deployment-prep/integration: stop downgrading sshd MAC and KEX [puppet] - 10https://gerrit.wikimedia.org/r/318248 (https://phabricator.wikimedia.org/T95757) [01:20:30] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [01:21:27] !log mw1208 - service hhvm restart [01:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:23:30] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.035 second response time [01:27:06] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: upgrade to prometheus >= 1.1 - https://phabricator.wikimedia.org/T147207#2747392 (10fgiunchedi) [01:27:15] (03PS1) 10Dzahn: switch zuul CNAME from gallium to contint1001 [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) [01:28:57] (03PS1) 10Dzahn: remove gallium.wikimedia.org, keep gallium.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318250 (https://phabricator.wikimedia.org/T95757) [01:30:08] (03PS1) 10Filippo Giunchedi: prometheus::tools: fix k8s discovery after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318251 
(https://phabricator.wikimedia.org/T147207) [01:31:08] (03CR) 10jenkins-bot: [V: 04-1] prometheus::tools: fix k8s discovery after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318251 (https://phabricator.wikimedia.org/T147207) (owner: 10Filippo Giunchedi) [01:32:00] (03CR) 10Dzahn: "what's the status here now, since the referenced change to wait for is now abandoned." [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [01:32:44] (03CR) 10Dzahn: "see https://gerrit.wikimedia.org/r/#/c/318249/" [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar) [01:32:58] (03CR) 10Dzahn: "this or https://gerrit.wikimedia.org/r/#/c/293288/ ?" [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [01:35:16] Krenair: so I ran >> curl -v -X GET -H 'Host: en.wikipedia.org' 'https://mw1099.eqiad.wmnet/w/index.php' << from terbium and it hangs with "Trying 10.64.16.79" [01:35:17] (03PS1) 10Dzahn: zuul::merger: switch gearman server to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318252 (https://phabricator.wikimedia.org/T95757) [01:35:22] any idea what I am doing wrong? [01:35:51] tgr: http not https [01:36:02] oh, right [01:36:24] that did the trick, thanks [01:36:29] np!
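The exchange above (send the request straight to one app server over plain http, with the public site name in the Host header) can be sketched as a small Python helper that builds the curl command locally, which is roughly what tgr describes doing. The function name `direct_curl_command` and its `extra_headers` parameter are hypothetical; a locally signed OAuth `Authorization` header could be passed through `extra_headers`, but the signing itself is not shown here.

```python
import shlex
from urllib.parse import urlsplit


def direct_curl_command(public_url, backend_host, extra_headers=None):
    """Build a curl command that asks backend_host for public_url's path.

    The request uses plain http to reach the backend, as in the chat above,
    and carries the public host name in the Host header.
    """
    parts = urlsplit(public_url)
    if not parts.hostname:
        raise ValueError("public_url must be an absolute URL")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    headers = {"Host": parts.hostname}
    headers.update(extra_headers or {})

    argv = ["curl", "-v"]
    for name, value in headers.items():
        argv += ["-H", f"{name}: {value}"]
    argv.append(f"http://{backend_host}{path}")

    # shlex.join quotes each argument so the result can be pasted into a shell.
    return shlex.join(argv)


if __name__ == "__main__":
    print(direct_curl_command("https://en.wikipedia.org/w/index.php",
                              "mw1099.eqiad.wmnet"))
```

Building the command as an argument list and quoting it once at the end avoids hand-escaping header values that contain spaces or quotes, which signed headers typically do.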
[01:36:35] (03CR) 10Dzahn: "the other comments are right, but about precise we dont have to worry, there is a "if os_version('debian >= jessie')" around it anyways" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [01:41:21] (03PS2) 10Filippo Giunchedi: prometheus::tools: fix k8s discovery after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318251 (https://phabricator.wikimedia.org/T147207) [01:45:52] (03CR) 10Filippo Giunchedi: "Notable change is that kubernetes_role tag is dropped and moves to job tag" [puppet] - 10https://gerrit.wikimedia.org/r/318251 (https://phabricator.wikimedia.org/T147207) (owner: 10Filippo Giunchedi) [01:49:20] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [01:49:31] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [01:50:20] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.041 second response time [01:50:40] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 72568 bytes in 0.100 second response time [01:54:40] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:10:52] (03CR) 10Dereckson: [C: 04-2] "There are still fresh comments on the task, and a lot of commenters, with different plans of action to divide the work. 
We should so deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [02:23:20] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:28:49] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 10m 28s) [02:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:17] tgr: ^ [02:31:17] !log tgr@tin Synchronized php-1.28.0-wmf.22/extensions/OAuth/: deploy fix for T149194 (duration: 00m 51s) [02:31:18] T149194: Owner-only consumer fails when Identifying over OAuth - https://phabricator.wikimedia.org/T149194 [02:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:32] !log tgr@tin Synchronized php-1.28.0-wmf.23/extensions/OAuth/: deploy fix for T149194 (duration: 00m 47s) [02:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:41] (03CR) 10Andy M. Wang: "> It now only creates the group. 
Reviewer is already taken by pending" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [02:55:15] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 10m 36s) [02:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 27 03:00:35 UTC 2016 (duration 5m 20s) [03:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:16] (03PS6) 10Yuvipanda: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 (owner: 10Madhuvishy) [03:15:20] PROBLEM - Apache HTTP on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [03:16:20] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.020 second response time [03:21:31] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:44:10] PROBLEM - Apache HTTP on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.019 second response time [03:45:20] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [03:50:50] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [03:54:40] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils] [04:04:40] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 2 minutes ago with 57 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [04:13:50] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.074 second response time [04:14:22] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 72587 bytes in 0.228 second response time [04:22:31] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:27:07] 06Operations, 10Ops-Access-Requests: Requesting access researchers, analytics-users, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2747507 (10Ottomata) researchers, statistics-privatedata-users, and analytics-privatedata-users is good enough. [04:54:40] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 2 minutes ago with 57 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [05:34:58] (03CR) 10Giuseppe Lavagetto: docker::registry::web: listen on ipv6 as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318061 (owner: 10Giuseppe Lavagetto) [06:01:13] (03Abandoned) 10Brian Wolff: Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 (owner: 10Brian Wolff) [06:08:20] PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.014 second response time [06:09:20] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 72587 bytes in 0.153 second response time [06:10:27] (03Abandoned) 10Hashar: deployment-prep/integration: stop downgrading sshd MAC and KEX [puppet] - 10https://gerrit.wikimedia.org/r/318248 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [06:11:43] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07WorkType-Maintenance: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#2747576 (10hashar) @Dzahn the issue is in Jenkins itself which uses an old / apparently un... 
[06:21:40] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [06:21:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [06:23:50] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:24:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [06:37:20] PROBLEM - Apache HTTP on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [06:38:22] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.035 second response time [06:39:32] 06Operations: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2747585 (10elukey) p:05Triage>03Low The Analytics team is going to refactor roles and functionalities of these machines, you are right that their usage is a bit confusing (https://wikitech.wikimedia.... 
[06:39:53] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2747588 (10elukey) [06:49:03] (03PS1) 10Giuseppe Lavagetto: tendril: require auth in apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/318256 [06:49:15] <_joe_> marostegui: ^^ [06:50:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] tendril: require auth in apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/318256 (owner: 10Giuseppe Lavagetto) [06:57:19] !log Deploy schema change s5 dewiki.revision only codfw - T148967 [06:57:19] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [06:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:58] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, 15User-Joe: Ganeti VM for docker registry - https://phabricator.wikimedia.org/T148961#2737861 (10Joe) a:03Joe [07:09:34] (03PS1) 10Giuseppe Lavagetto: Add darmstadtium to DNS (ganeti VM for docker registry) [dns] - 10https://gerrit.wikimedia.org/r/318257 (https://phabricator.wikimedia.org/T148961) [07:15:05] (03CR) 10Giuseppe Lavagetto: [C: 032] Add darmstadtium to DNS (ganeti VM for docker registry) [dns] - 10https://gerrit.wikimedia.org/r/318257 (https://phabricator.wikimedia.org/T148961) (owner: 10Giuseppe Lavagetto) [07:15:16] !log Removed /srv/s5.sql.gz (54G - may 2015) from db1045 to free up some space [07:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:24:48] !log Deploying schema change db2034- enwiki.change_tag/tag_summary - T147166 [07:24:49] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [07:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:26:57] <_joe_> !log creating darmstadtium on ganeti, T148961 [07:26:58] T148961: Ganeti VM for docker registry - 
https://phabricator.wikimedia.org/T148961 [07:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:35:26] PROBLEM - Apache HTTP on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [07:36:26] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time [07:40:31] (03PS1) 10Alexandros Kosiaris: tendril: Remove the apache 2.4 specific directives [puppet] - 10https://gerrit.wikimedia.org/r/318258 [07:45:23] (03CR) 10Alexandros Kosiaris: [C: 032] tendril: Remove the apache 2.4 specific directives [puppet] - 10https://gerrit.wikimedia.org/r/318258 (owner: 10Alexandros Kosiaris) [07:47:12] !log Deploying ALTER table s4 commonswiki.templatelinks - https://phabricator.wikimedia.org/T149079 (db2065 only) [07:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:47:49] marostegui: all the alter tables in the morning, I see a pattern :P [07:47:52] *pattern [07:48:13] elukey: Yeah, some of them are long ones, so I rather get them done in the morning so they can go thru during the day [07:48:21] Although this last one takes around 13 hours to finish :) [07:48:41] :D [07:49:05] I was more thinking about you waking up in the morning and thinking about alter tables :D [07:49:33] Well, I dreamed about some hosts some weeks ago, so that is also worrying :) [07:50:08] ahhahaha [07:50:26] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:02:06] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:05:11] (03PS1) 10Giuseppe Lavagetto: install_server: add darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/318259 (https://phabricator.wikimedia.org/T148961) [08:08:41] (03CR) 10Alexandros Kosiaris: [C: 031] install_server: add darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/318259 (https://phabricator.wikimedia.org/T148961) (owner: 10Giuseppe Lavagetto) [08:11:27] (03CR) 10Giuseppe Lavagetto: [C: 032] install_server: add darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/318259 (https://phabricator.wikimedia.org/T148961) (owner: 10Giuseppe Lavagetto) [08:16:43] 06Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#2747695 (10MoritzMuehlenhoff) [08:18:26] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:20:16] (03CR) 10Volans: "@Dzahn: sure, I don't know why I put that comment :)" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [08:24:48] !log rolling reboot of logstash cluster for kernel update [08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:26] 06Operations, 10Monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2747714 (10Peachey88) [08:27:28] 06Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#2747713 (10Peachey88) [08:29:29] (03CR) 10Thiemo Mättig (WMDE): "I really don't like it when disputed changes are made to a patch, and then self-merged. The setting is not "superfluous". 
It's required to" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [08:32:16] PROBLEM - HHVM rendering on mw1233 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.006 second response time [08:33:16] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72607 bytes in 0.260 second response time [08:37:19] 06Operations, 10Ops-Access-Requests, 05Security: Security Issue Access Request for (Elukey) - https://phabricator.wikimedia.org/T149289#2747761 (10Peachey88) [08:38:07] 06Operations, 10Ops-Access-Requests, 05Security: Security Issue Access Request for (Elukey,Ema,Gehel) - https://phabricator.wikimedia.org/T149289#2747767 (10elukey) [08:38:38] (03PS4) 10Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) [08:38:49] (03CR) 10Gilles: Log when HTTP status codes from Mediawiki and Thumbor are different (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/315648 (https://phabricator.wikimedia.org/T147918) (owner: 10Gilles) [08:40:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] Introduce mtail module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [08:42:23] 06Operations, 10Ops-Access-Requests, 05Security: Security Issue Access Request for (Elukey,Ema) - https://phabricator.wikimedia.org/T149289#2747770 (10elukey) [08:49:29] (03PS4) 10Thiemo Mättig (WMDE): Enable Wikibase #statements parser function on all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) [08:51:50] (03CR) 10Hoo man: "> I really don't like it when disputed changes are made to a patch, and then self-merged. The setting is not "superfluous". 
It's required " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [08:52:27] (03CR) 10Hoo man: "(Re-post because gerrit mangled my formatting)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318071 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [09:06:05] (03Abandoned) 10Hashar: openstack: skip LDAP update for contintcloud [puppet] - 10https://gerrit.wikimedia.org/r/314188 (owner: 10Hashar) [09:09:32] (03CR) 10Elukey: [C: 031] hue_server: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/316363 (owner: 10Muehlenhoff) [09:09:33] 06Operations, 10ops-eqiad, 10Traffic: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2747798 (10ema) [09:11:31] (03CR) 10Volans: "Some general comment here and few inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318117 (owner: 10Gehel) [09:12:22] (03CR) 10Elukey: [C: 031] Tighten access to oozie server [puppet] - 10https://gerrit.wikimedia.org/r/316359 (owner: 10Muehlenhoff) [09:13:11] (03CR) 10Elukey: [C: 031] hhvm: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/316550 (owner: 10Muehlenhoff) [09:15:09] !log applying schema change (imagelinks) to s2 wikis T139090 [09:15:10] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:40] (03PS2) 10Alexandros Kosiaris: service::node: only rotate log files [puppet] - 10https://gerrit.wikimedia.org/r/285615 [09:15:47] (03CR) 10DCausse: elasticsearch - enable garbage collection logs on relforge servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) (owner: 10Gehel) [09:17:35] (03PS3) 10Alexandros Kosiaris: service::node: only rotate log files [puppet] - 
10https://gerrit.wikimedia.org/r/285615 (https://phabricator.wikimedia.org/T148436) [09:17:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] service::node: only rotate log files [puppet] - 10https://gerrit.wikimedia.org/r/285615 (https://phabricator.wikimedia.org/T148436) (owner: 10Alexandros Kosiaris) [09:19:50] (03CR) 10Gehel: elasticsearch - enable garbage collection logs on relforge servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) (owner: 10Gehel) [09:20:24] (03PS2) 10Gehel: elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) [09:22:58] (03CR) 10Elukey: [C: 04-1] "PEBCAK prevention -1 (for me)" [puppet] - 10https://gerrit.wikimedia.org/r/316359 (owner: 10Muehlenhoff) [09:28:23] (03CR) 10Gehel: "This optimization might well be overkill (as pointed out by Volans). I'm mainly trying to make sure that documentation and automation agre" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318117 (owner: 10Gehel) [09:30:39] PROBLEM - DPKG on neodymium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:31:38] RECOVERY - DPKG on neodymium is OK: All packages OK [09:33:10] (03CR) 10DCausse: [C: 031] elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) (owner: 10Gehel) [09:40:19] (03PS1) 10Gehel: cirrus - disable the rebuild of completion indices [puppet] - 10https://gerrit.wikimedia.org/r/318267 [09:41:21] (03CR) 10jenkins-bot: [V: 04-1] cirrus - disable the rebuild of completion indices [puppet] - 10https://gerrit.wikimedia.org/r/318267 (owner: 10Gehel) [09:41:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [09:41:54] (03PS2) 10Gehel: cirrus - disable the rebuild 
of completion indices [puppet] - 10https://gerrit.wikimedia.org/r/318267 [09:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:15] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [09:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:23] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [09:42:27] (03CR) 10Gehel: [C: 04-1] "Don't merge this before Sunday Oct 30" [puppet] - 10https://gerrit.wikimedia.org/r/318267 (owner: 10Gehel) [09:42:33] ignore all that, just me testing something [09:45:03] !log reboot maps codfw cluster for kernel update [09:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:39] (03PS2) 10Volans: Icinga: raid_handler improve failure detection [puppet] - 10https://gerrit.wikimedia.org/r/318128 (https://phabricator.wikimedia.org/T142085) [09:49:48] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:50:29] ^ maps2003 is me [09:50:38] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [09:52:34] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#2747839 (10Joe) [09:52:36] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, and 2 others: Ganeti VM for docker registry - https://phabricator.wikimedia.org/T148961#2747838 (10Joe) 05Open>03Resolved [09:53:29] (03CR) 10Elukey: [C: 031] "LGTM! I discussed with Riccardo the explicit use of !=/== 0, but I don't really have a strong opinion against it." 
[puppet] - 10https://gerrit.wikimedia.org/r/318128 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [09:54:13] !log rolling reboot of zookeeper cluster in codfw for kernel update [09:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:15] (03CR) 10Giuseppe Lavagetto: "Actually, that's not needed, I will be very careful when merging this change" [puppet] - 10https://gerrit.wikimedia.org/r/318060 (owner: 10Giuseppe Lavagetto) [09:59:38] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - kartotherian_6533 - Could not depool server maps2004.codfw.wmnet because of too many down! [10:00:46] (03PS3) 10Volans: Icinga: raid_handler improve failure detection [puppet] - 10https://gerrit.wikimedia.org/r/318128 (https://phabricator.wikimedia.org/T142085) [10:00:48] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [10:02:03] (03CR) 10Volans: [C: 032] Icinga: raid_handler improve failure detection [puppet] - 10https://gerrit.wikimedia.org/r/318128 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [10:02:19] !log reboot maps eqiad cluster for kernel update [10:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:02] (03PS1) 10Volans: Icinga: fix raid_handler ACK persistence [puppet] - 10https://gerrit.wikimedia.org/r/318269 (https://phabricator.wikimedia.org/T149229) [10:11:59] ACKNOWLEDGEMENT - MD RAID on conf2003 is CRITICAL: Return code of 255 is out of bounds nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149293 [10:12:02] 06Operations, 10ops-codfw: Degraded RAID on conf2003 - https://phabricator.wikimedia.org/T149293#2747876 (10ops-monitoring-bot) [10:12:52] elukey: really? :( [10:13:32] puppet didn't run yet on einsteinium...
forcing it [10:16:06] 06Operations, 10ops-codfw: Degraded RAID on conf2003 - https://phabricator.wikimedia.org/T149293#2747881 (10Volans) 05Open>03Invalid Bug already fixed in https://gerrit.wikimedia.org/r/#/c/318128/ but Puppet had not yet had time to run on the Icinga host. Forced now; it should not happen again. [10:16:12] nooooo [10:16:26] I'm lucky :D [10:16:49] interesting that those started happening after the change on the icinga-downtime script... [10:18:11] (03CR) 10Volans: [C: 032] Icinga: fix raid_handler ACK persistence [puppet] - 10https://gerrit.wikimedia.org/r/318269 (https://phabricator.wikimedia.org/T149229) (owner: 10Volans) [10:20:31] !log rebooting conf1001 for kernel update [10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:04] moritzm: thanks, so I can check if the fix works :) [10:21:26] (03PS1) 10Hashar: contint: kvm groupadd is only for android testing [puppet] - 10https://gerrit.wikimedia.org/r/318272 (https://phabricator.wikimedia.org/T149294) [10:21:46] moritzm: did you use the icinga-downtime script to put it in downtime?
[10:22:37] no, I'm using "Schedule downtime for checked host(s)" in the Icinga UI [10:23:26] ok [10:24:59] I'm not sure how it behaves differently from the script [10:26:08] !log rebooting conf1002 for kernel update [10:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:28] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.021 second response time [10:27:42] yes it set the downtime on the host only, not on their services [10:28:15] but only since the change on icinga-downtime made a couple of days ago ;) [10:28:28] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 72704 bytes in 0.167 second response time [10:30:49] !log rebooting conf1003 for kernel update [10:30:53] (03CR) 10Volans: [C: 04-1] "Conftool becomes interactive to disable all services on a single host. I need to get the list of them and disable them one by one." [puppet] - 10https://gerrit.wikimedia.org/r/318131 (https://phabricator.wikimedia.org/T149216) (owner: 10Volans) [10:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:34:11] (03CR) 10Hashar: [V: 031] "Cherry picked on the CI puppet master. I have cleaned up the instances via salt:" [puppet] - 10https://gerrit.wikimedia.org/r/318272 (https://phabricator.wikimedia.org/T149294) (owner: 10Hashar) [10:47:38] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2747941 (10Joe) These systems are not being used given we received the new machines already, who will own moving them back to the... 
[10:48:05] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (10MoritzMuehlenhoff) wmf4747 to mw4750 are up and running and in puppet/salt/icinga (but not in site.pp), but they're no... [10:49:28] 06Operations, 10ops-eqiad: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2747944 (10MoritzMuehlenhoff) [11:00:36] (03PS1) 10Ema: cache_text varnishtest: frontend response headers [puppet] - 10https://gerrit.wikimedia.org/r/318276 (https://phabricator.wikimedia.org/T131503) [11:05:49] 06Operations, 10ops-esams, 10Traffic: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2747977 (10ema) p:05Triage>03Low [11:10:38] PROBLEM - Disk space on cp3013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [11:10:38] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tmux] [11:10:58] PROBLEM - Disk space on cp3014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [11:10:58] PROBLEM - Disk space on cp3016 is CRITICAL: DISK CRITICAL - free space: / 169 MB (1% inode=86%) [11:10:59] PROBLEM - Disk space on cp3012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%) [11:11:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tmux] [11:12:43] (03PS2) 10Giuseppe Lavagetto: docker::registry: separate nginx config from the main one [puppet] - 10https://gerrit.wikimedia.org/r/318060 [11:12:45] <_joe_> wat? 
[11:12:48] <_joe_> ema: ^^ [11:13:26] _joe_: looking [11:14:03] elukey: /var/cache/varnishkafka/ apparently [11:14:19] <_joe_> yes [11:15:09] RECOVERY - Disk space on cp3012 is OK: DISK OK [11:15:13] <_joe_> messages saying kafka1012 is up I think [11:15:34] <_joe_> ema: are you stopping vk, removing files, starting vk? [11:15:36] 06Operations: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#2747982 (10MoritzMuehlenhoff) [11:15:46] _joe_: I've just run apt-get clean on cp3012 so far [11:16:20] <_joe_> ema: lemme try on cp3014 [11:16:24] sorry just saw the pings [11:16:29] _joe_: please go ahead [11:17:49] _joe_, elukey: note that those are all spares [11:18:04] spares [11:18:06] ? [11:18:08] <_joe_> ema: oh [11:18:13] cp3013 is a Unused spare system (role::spare::system) [11:18:19] ahhhh [11:18:35] <_joe_> should we just turn all those services off [11:18:38] RECOVERY - Disk space on cp3014 is OK: DISK OK [11:18:57] <_joe_> ema: I'm doing this ^^ [11:19:20] ok. I've got to go for lunch, this doesn't seem like a big issue anyways :) [11:19:23] bbl [11:19:32] <_joe_> k [11:19:52] I'll drop the unused 3.16/3.19 kernels on those hosts [11:20:23] <_joe_> ok I know what happened [11:20:39] <_joe_> moritzm: did you reboot those servers circa 9:34 UTC? [11:20:46] <_joe_> cp3014 specifically [11:21:44] yeah, I rebooted these earlier in the day [11:22:28] we should just shut them down, they're only waiting for decommission anyway [11:23:01] maybe systemctl mask varnishkafka ?
[11:23:05] just to be sure [11:23:10] <_joe_> elukey: nope [11:23:15] <_joe_> stop it and disable it [11:23:23] <_joe_> I am doing it [11:23:28] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [11:24:54] ahh didn't know about disable [11:24:57] yes better [11:25:18] PROBLEM - Apache HTTP on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.022 second response time [11:25:18] RECOVERY - Disk space on cp3013 is OK: DISK OK [11:25:38] PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time [11:26:18] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time [11:26:38] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 72710 bytes in 0.434 second response time [11:28:55] I can see only SlowTimer(s) in hhvm log for 00^ [11:29:01] sorry --^ [11:29:28] RECOVERY - Disk space on cp3016 is OK: DISK OK [11:33:40] <_joe_> !log stopping all cache-related services on esams spares cp3012-22 [11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:58] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:41:15] (03CR) 10BBlack: [C: 031] cache::misc: switch gallium to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318246 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [11:43:07] yeah they should really be reimaged at this point (in the spare role) [11:43:20] just to get cruft like this out of the way for good [11:43:23] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2748026 (10Marostegui) a:05jcrespo>03Marostegui [11:44:25] https://phabricator.wikimedia.org/T130883 - 7 months and counting [11:44:38] it's not going to be uncommon, 
probably even more so with future DCs [11:45:04] so it should probably just be standard operating procedure to reimage to some spare role when waiting for actual hw decom/reclaim [11:48:03] <_joe_> bblack: agreed, I don't have time to reimage those though [11:48:20] <_joe_> it was faster to determine which services to shut down [11:49:49] (03PS3) 10Giuseppe Lavagetto: docker::registry: separate nginx config from the main one [puppet] - 10https://gerrit.wikimedia.org/r/318060 [11:51:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "works fine in toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/318060 (owner: 10Giuseppe Lavagetto) [11:52:03] (03PS2) 10Giuseppe Lavagetto: docker::registry::web: listen on ipv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/318061 [11:53:30] (03PS3) 10Giuseppe Lavagetto: docker::registry::web: listen on ipv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/318061 [11:55:48] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [11:56:48] (03PS3) 10Gehel: elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) [11:57:41] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry::web: listen on ipv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/318061 (owner: 10Giuseppe Lavagetto) [11:58:05] (03CR) 10Gehel: [C: 032] elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) (owner: 10Gehel) [11:58:10] (03PS4) 10Gehel: elasticsearch - enable garbage collection logs on relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/318055 (https://phabricator.wikimedia.org/T134853) [11:58:58] (03PS2) 10Giuseppe Lavagetto: docker::web: allow defining multiple build servers [puppet] - 10https://gerrit.wikimedia.org/r/318062 [12:03:05] (03CR) 10Alexandros 
Kosiaris: [C: 032] icinga: let icinga own /var/log/icinga [puppet] - 10https://gerrit.wikimedia.org/r/318030 (owner: 10Dzahn) [12:03:10] (03PS3) 10Alexandros Kosiaris: icinga: let icinga own /var/log/icinga [puppet] - 10https://gerrit.wikimedia.org/r/318030 (owner: 10Dzahn) [12:03:12] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: let icinga own /var/log/icinga [puppet] - 10https://gerrit.wikimedia.org/r/318030 (owner: 10Dzahn) [12:04:25] !log restart elasticsearch on relforge to activate GC logs - T134853 [12:04:26] T134853: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853 [12:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:34] 06Operations, 06Discovery-Search (Current work), 13Patch-For-Review, 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2748058 (10Gehel) Puppet change is deployed. GC logs are available on relforge. I will wait a few days to check eve... [12:09:54] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::web: allow defining multiple build servers [puppet] - 10https://gerrit.wikimedia.org/r/318062 (owner: 10Giuseppe Lavagetto) [12:10:00] (03PS3) 10Giuseppe Lavagetto: docker::web: allow defining multiple build servers [puppet] - 10https://gerrit.wikimedia.org/r/318062 [12:10:16] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::web: allow defining multiple build servers [puppet] - 10https://gerrit.wikimedia.org/r/318062 (owner: 10Giuseppe Lavagetto) [12:11:11] (03CR) 10Giuseppe Lavagetto: "> Note that 'ip' can also be a CIDR. 
That isn't useful in labs, but" [puppet] - 10https://gerrit.wikimedia.org/r/318062 (owner: 10Giuseppe Lavagetto) [12:11:59] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#2149483 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3012.esams.wmnet'] ``` The log can be found in `/var/log/wmf-auto... [12:13:56] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2748063 (10Zareenf) [12:14:16] (03PS2) 10Hashar: contint: drop contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/317988 [12:14:17] (03PS2) 10Hashar: contint: move php5 install on jessie to nearest user [puppet] - 10https://gerrit.wikimedia.org/r/317987 [12:14:19] (03PS2) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985 [12:16:59] (03CR) 10Gehel: [C: 031] "LGTM. It looks like role::logstash::elasticsearch was missed in https://gerrit.wikimedia.org/r/#/c/315888/. 
This change restores the previ" [puppet] - 10https://gerrit.wikimedia.org/r/318205 (owner: 10Filippo Giunchedi) [12:18:28] PROBLEM - NTP on conf1001 is CRITICAL: NTP CRITICAL: Offset unknown [12:20:24] !log restarted ntp on conf1001 (stuck in XFAC state) [12:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:10] (03PS13) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [12:35:57] !log disabling puppet on maps servers for deployment of https://gerrit.wikimedia.org/r/#/c/315271/ [12:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:36:08] !log migrating nodes from ganeti2006 for kernel reboot [12:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:36:53] (03PS14) 10Gehel: Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) [12:39:28] RECOVERY - NTP on conf1001 is OK: NTP OK: Offset -0.001717597246 secs [12:40:47] (03CR) 10Gehel: [C: 032] Maps - cleanup postgres user creation [puppet] - 10https://gerrit.wikimedia.org/r/315271 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [12:50:17] (03PS1) 10Gehel: maps / postgresql: new configuration format for slaves [puppet] - 10https://gerrit.wikimedia.org/r/318282 (https://phabricator.wikimedia.org/T147194) [12:51:26] (03CR) 10Gehel: [C: 032] maps / postgresql: new configuration format for slaves [puppet] - 10https://gerrit.wikimedia.org/r/318282 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [12:53:33] !log uploaded openjdk-8 8u111 for jessie-wikimedia to apt.wikimedia.org [12:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:00] (03PS1) 10Gehel: maps / postgresql: corrected hiera key for replication password [puppet] - 
10https://gerrit.wikimedia.org/r/318283 (https://phabricator.wikimedia.org/T147194) [12:56:33] (03PS1) 10Muehlenhoff: Fix changelog (incorrect date spec makes the build fail) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/318285 [12:57:04] (03CR) 10Gehel: [C: 032] maps / postgresql: corrected hiera key for replication password [puppet] - 10https://gerrit.wikimedia.org/r/318283 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [12:59:52] (03PS1) 10Gehel: maps / postgresql: corrected hiera key for replication password [puppet] - 10https://gerrit.wikimedia.org/r/318286 (https://phabricator.wikimedia.org/T147194) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1300). Please do the needful. [13:00:04] Urbanecm and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:05] (03CR) 10Gehel: [C: 032 V: 032] maps / postgresql: corrected hiera key for replication password [puppet] - 10https://gerrit.wikimedia.org/r/318286 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel) [13:01:03] o/ [13:01:14] (03CR) 10Muehlenhoff: [C: 032] Fix changelog (incorrect date spec makes the build fail) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/318285 (owner: 10Muehlenhoff) [13:01:39] present [13:04:57] o/ [13:05:20] (03PS2) 10Ema: cache_text varnishtest: frontend response headers [puppet] - 10https://gerrit.wikimedia.org/r/318276 (https://phabricator.wikimedia.org/T131503) [13:05:22] (03PS1) 10Ema: cache_text varnishtest: insecure POST forbidden [puppet] - 10https://gerrit.wikimedia.org/r/318287 (https://phabricator.wikimedia.org/T131503) [13:05:35] going to land the throttle patches [13:05:57] hashar: Okay [13:06:27] (03CR) 10Volans: "Minor comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [13:06:41] (03PS3) 10Hashar: Fix a typo (ramge -> range) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318142 (https://phabricator.wikimedia.org/T146600) (owner: 10Urbanecm) [13:06:43] (03PS4) 10Hashar: [throttle] Rule for Wikipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) (owner: 10Urbanecm) [13:06:46] (03PS3) 10Hashar: [throttle] Remove old rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318157 (owner: 10Urbanecm) [13:06:58] (03CR) 10Hashar: [C: 032] [throttle] Remove old rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318157 (owner: 10Urbanecm) [13:07:01] (03CR) 10Hashar: [C: 032] [throttle] Rule for Wikipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) (owner: 10Urbanecm) [13:07:05] (03CR) 
10Hashar: [C: 032] Fix a typo (ramge -> range) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318142 (https://phabricator.wikimedia.org/T146600) (owner: 10Urbanecm) [13:07:30] (03Merged) 10jenkins-bot: [throttle] Rule for Wikipedia Editathon at Ohio State University on 2016-11-02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318141 (https://phabricator.wikimedia.org/T149200) (owner: 10Urbanecm) [13:07:36] (03Merged) 10jenkins-bot: Fix a typo (ramge -> range) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318142 (https://phabricator.wikimedia.org/T146600) (owner: 10Urbanecm) [13:07:45] (03Merged) 10jenkins-bot: [throttle] Remove old rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318157 (owner: 10Urbanecm) [13:10:07] Urbanecm: deployed :] [13:10:07] !log hashar@tin Synchronized wmf-config/throttle.php: T146600 T149200 (duration: 00m 53s) [13:10:09] T146600: Lift the Wikipedia Account Creation Limit on 2016-10-04 and other dates for Winona State University - https://phabricator.wikimedia.org/T146600 [13:10:09] T149200: Account creation throttle exception for WIkipedia Editathon at Ohio State University on 2016-11-02 - https://phabricator.wikimedia.org/T149200 [13:10:10] dcausse: around ? ] [13:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:15] hashar: yes [13:10:20] hashar: Thanks a lot! [13:10:30] dcausse: I have CR+2 the CirrusSearch patch [13:10:37] it is in the CI pipeline [13:10:40] ok [13:10:58] and I guess the BM25 activation is independent? 
[13:11:03] hashar: yes [13:11:20] (03PS4) 10Hashar: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [13:11:46] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [13:12:02] hashar: it's possible to see a spike of poolcounter errors, it happened last time we switched back to eqiad [13:12:15] (03Merged) 10jenkins-bot: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [13:12:17] eeek [13:12:23] should we try to deploy it in batches? [13:12:57] hashar: we could but if it's a pain don't worry [13:13:15] well I dont think scap supports batching yet :D [13:13:38] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL: Master reports slave not active [13:13:46] hashar: pulling it to mw1099 for me to have a quick look might be sufficient [13:14:04] sure [13:14:14] done [13:14:21] ok testing [13:14:22] Hello. [13:14:30] hello :) [13:14:46] today is lightweight swat! [13:14:53] hashar: about throttle patches, could you check https://gerrit.wikimedia.org/r/#/c/318149/ ? It implements a test failing with "ramge" and passing with "range" [13:15:49] https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/9938/console without the Urbanecm fix, https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/9939/console with the Urbanecm fix [13:16:43] Dereckson: how does https://gerrit.wikimedia.org/r/#/c/318149/2/tests/ThrottleTest.php not define 'range' as a valid param ? 
:D [13:17:34] oh mannn [13:17:38] I got confused [13:17:49] !log migrating nodes from ganeti2005 for kernel reboot [13:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:59] hashar: I can see a definition which defines range as a valid param. [13:18:06] I don't understand you [13:23:42] hashar: sorry about that but I think I have an issue... eqiad is missing ~270K docs for enwiki [13:24:02] I think something went wrong during the eqiad reindex and I just realized that now :( [13:24:27] no problem [13:24:36] dcausse: should we revert ? [13:24:46] hashar: yes I think it's safer [13:24:49] I don't mind deploying again once you have recovered the few docs [13:25:05] Dereckson: reviewing your change :D [13:25:30] hashar: thanks, it might take some time so most probably not before sf morning swat [13:26:14] (03PS1) 10Hashar: Revert "[cirrus] Activate BM25 on top 10 wikis: Step 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318289 [13:26:22] easy :] [13:26:29] (03CR) 10Hashar: [C: 032] Revert "[cirrus] Activate BM25 on top 10 wikis: Step 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318289 (owner: 10Hashar) [13:26:30] :) [13:26:56] then there is [13:26:56] [1.28.0-wmf.23] 318278 Fix comp suggest pref page [13:27:01] yes [13:27:06] (03Merged) 10jenkins-bot: Revert "[cirrus] Activate BM25 on top 10 wikis: Step 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318289 (owner: 10Hashar) [13:27:08] PROBLEM - Host prometheus2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:20] I can test this one on mw1099 as well [13:28:32] PROBLEM - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.25 and port 80: Connection refused [13:28:40] dcausse: it is on mw1099 now [13:28:45] testing [13:28:59] PROBLEM - NTP on nihal is CRITICAL: NTP CRITICAL: Offset -4.775590569 secs [13:29:30] ^the prometheus alert is caused by ganeti reboot (it uses plain storage as I forgot to silence
prometheus2002 in icinga) [13:30:11] and the ntp one by the migration [13:30:18] hashar: sounds good [13:30:38] RECOVERY - Host prometheus2002 is UP: PING OK - Packet loss = 0%, RTA = 38.03 ms [13:32:19] !log migrating nodes from ganeti2004 for kernel reboot [13:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:39] !log postgres replication checks in error after deployment of https://gerrit.wikimedia.org/r/#/c/315271/ (T147194) - replication is working, only check is failing - icinga is silenced [13:33:40] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [13:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:59] (03CR) 10Hashar: [C: 04-1] "Excellent start. I would go a step ahead and add some more tests to ensure entries in wmgThrottlingExceptions are valid. See inline :]" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [13:34:05] !log maps / postgres replication checks in error after deployment of https://gerrit.wikimedia.org/r/#/c/315271/ (T147194) - replication is working, only check is failing - icinga is silenced [13:34:07] dcausse: pushing [13:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:42] RECOVERY - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 156 bytes in 0.074 second response time [13:35:30] !log hashar@tin Synchronized php-1.28.0-wmf.23/extensions/CirrusSearch/includes/HTMLCompletionProfileSettings.php: Fix comp suggest pref page (duration: 00m 48s) [13:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:49] dcausse: completed ;] [13:35:59] hashar: thanks! 
[13:39:20] !log migrating nodes from ganeti2003 for kernel reboot [13:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:17] (03CR) 10Dereckson: Tests for throttle rules (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [13:40:19] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:05] !log European SWAT completed [13:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:30] (03PS3) 10Dereckson: Test for throttle rules: detect keys typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 [13:48:04] (03PS1) 10Ema: cache_text varnishtest: set X-Carrier based on XCIP [puppet] - 10https://gerrit.wikimedia.org/r/318291 (https://phabricator.wikimedia.org/T131503) [13:48:07] (03CR) 10Dereckson: Test for throttle rules: detect keys typo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [13:49:31] (03PS2) 10Ema: cache_text varnishtest: set X-Carrier based on XCIP [puppet] - 10https://gerrit.wikimedia.org/r/318291 (https://phabricator.wikimedia.org/T131503) [13:51:18] RECOVERY - NTP on nihal is OK: NTP OK: Offset -0.005028188229 secs [13:53:05] (03PS1) 10Dereckson: Get rid of ip/IP tolerance for throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) [13:53:39] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#2748290 (10BBlack) To save future humans trouble: these can't be rebooted to PXE in any easy way, and the disks are behind raid controllers preventing secure erase, too :P [13:53:47] (03CR) 10jenkins-bot: [V: 04-1] Get rid of ip/IP tolerance for throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) 
(owner: 10Dereckson) [13:54:30] (03PS2) 10Dereckson: Get rid of ip/IP tolerance for throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) [13:54:43] (03CR) 10Dereckson: "PS2: "IP" is the one to keep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [13:55:16] (03PS6) 10Andrew Bogott: Add 'remember me' checkbox to Horizon auth. [puppet] - 10https://gerrit.wikimedia.org/r/317992 (https://phabricator.wikimedia.org/T149036) [13:59:12] !log migrating nodes from ganeti2002 for kernel reboot [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:05] (03CR) 10Andrew Bogott: [C: 032] Add 'remember me' checkbox to Horizon auth. [puppet] - 10https://gerrit.wikimedia.org/r/317992 (https://phabricator.wikimedia.org/T149036) (owner: 10Andrew Bogott) [14:01:17] (03PS1) 10BBlack: rcstream: log xcip as well for analysis [puppet] - 10https://gerrit.wikimedia.org/r/318296 [14:01:34] (03CR) 10BBlack: [C: 032 V: 032] rcstream: log xcip as well for analysis [puppet] - 10https://gerrit.wikimedia.org/r/318296 (owner: 10BBlack) [14:03:02] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2748319 (10Ottomata) + 1/2 to this idea. I'm all for renaming these boxes, not sure if misc element names is the way to go, but it might be! [14:04:58] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py] [14:08:48] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:08:59] ignore the page/alert for apertium.svc.codfw.wmnet please [14:09:23] PROBLEM - Host apertium.svc.codfw.wmnet is DOWN: This is a test, ignore me [14:10:03] RECOVERY - Host apertium.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 37.81 ms [14:10:11] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2745777 (10faidon) Removed from LibreNMS, rancid, smokeping, torrus, Icinga & DNS. [14:14:00] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2748340 (10thcipriani) [14:14:07] (03Abandoned) 10Ema: cache_text varnishtest: set X-Carrier based on XCIP [puppet] - 10https://gerrit.wikimedia.org/r/318291 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:17:40] !log failover of ganeti2002 to new master node in codfw [14:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:48] PROBLEM - NTP on ganeti2006 is CRITICAL: NTP CRITICAL: Offset unknown [14:22:25] (03PS2) 10Giuseppe Lavagetto: docker::registry::web: allow using puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/318063 [14:24:30] !log restarted ntp on ganeti2006 (stuck in XFAC state) [14:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:17] !log migrating nodes from ganeti2001 for kernel reboot [14:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:00] (03PS1) 10Dereckson: Test for throttle rules: parameters logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 
[14:30:47] (03CR) 10Dereckson: Test for throttle rules: detect keys typo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [14:31:11] (03CR) 10Ottomata: [C: 031] hue_server: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/316363 (owner: 10Muehlenhoff) [14:37:17] (03PS3) 10Giuseppe Lavagetto: docker::registry::web: allow using puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/318063 [14:39:53] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2748382 (10BBlack) >>! In T140128#2684764, @BBlack wrote: > We could perhaps enable apache logging of the X-Client-IP header to see through the cach... [14:40:19] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2748398 (10BBlack) At a glance, it seems like the bulk of the query traffic comes from GCE and AWS, and the bulk of it's still not HTTPS. [14:42:32] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2748403 (10BBlack) I've briefly reviewed the python code at https://github.com/wikimedia/mediawiki-services-rcstream/blob/master/rcstream/rcstream and I don't see whe... 
[14:43:28] RECOVERY - NTP on ganeti2006 is OK: NTP OK: Offset 0.000325948 secs [14:43:30] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry::web: allow using puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/318063 (owner: 10Giuseppe Lavagetto) [14:43:55] (03PS2) 10Giuseppe Lavagetto: docker::registry: allow passing configurations [puppet] - 10https://gerrit.wikimedia.org/r/318064 [14:44:02] (03CR) 10Hashar: [C: 032] Test for throttle rules: detect keys typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [14:44:30] (03Merged) 10jenkins-bot: Test for throttle rules: detect keys typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318149 (owner: 10Dereckson) [14:44:48] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] docker::registry: allow passing configurations [puppet] - 10https://gerrit.wikimedia.org/r/318064 (owner: 10Giuseppe Lavagetto) [14:47:48] (03PS2) 10Giuseppe Lavagetto: docker::registry: drop http host setting [puppet] - 10https://gerrit.wikimedia.org/r/318065 [14:47:48] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:48:18] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:48:33] (03CR) 10Filippo Giunchedi: [C: 032] role::logstash::elasticsearch: include base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/318205 (owner: 10Filippo Giunchedi) [14:48:56] (03PS3) 10Filippo Giunchedi: site: remove explicit role prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/318203 [14:48:58] RECOVERY - DPKG on rhodium is OK: All packages OK [14:50:30] (03CR) 10Hashar: [C: 031] "Excellent :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [14:50:38] (03CR) 10Filippo Giunchedi: [C: 032] site: remove explicit role prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/318203 (owner: 10Filippo Giunchedi) [14:50:53] (03PS3) 10Filippo Giunchedi: role::logstash::elasticsearch: include base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/318205 [14:51:48] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:52:30] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: drop http host setting [puppet] - 10https://gerrit.wikimedia.org/r/318065 (owner: 10Giuseppe Lavagetto) [14:52:40] (03PS3) 10Giuseppe Lavagetto: docker::registry: drop http host setting [puppet] - 10https://gerrit.wikimedia.org/r/318065 [14:52:48] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::registry: drop http host setting [puppet] - 10https://gerrit.wikimedia.org/r/318065 (owner: 10Giuseppe Lavagetto) [14:53:27] (03CR) 10Ema: [C: 032] cache_text varnishtest: frontend response headers [puppet] - 10https://gerrit.wikimedia.org/r/318276 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:53:35] (03PS3) 10Ema: cache_text varnishtest: frontend response headers [puppet] - 10https://gerrit.wikimedia.org/r/318276 (https://phabricator.wikimedia.org/T131503) [14:53:39] (03CR) 10Ema: [V: 032] cache_text varnishtest: frontend response headers [puppet] - 10https://gerrit.wikimedia.org/r/318276 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:53:52] (03CR) 10Ema: [C: 032] cache_text varnishtest: insecure POST forbidden [puppet] - 10https://gerrit.wikimedia.org/r/318287 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:53:59] (03PS2) 10Ema: cache_text varnishtest: insecure POST forbidden [puppet] - 10https://gerrit.wikimedia.org/r/318287 (https://phabricator.wikimedia.org/T131503) [14:54:00] (03CR) 10Ema: [V: 032] cache_text varnishtest: insecure POST forbidden [puppet] - 10https://gerrit.wikimedia.org/r/318287 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:54:19] (03PS7) 10Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) [14:56:28] (03CR) 10Giuseppe Lavagetto: [C: 031] swift: add lvs configuration for esams [puppet] - 10https://gerrit.wikimedia.org/r/318145 
(https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [14:57:22] (03CR) 10Giuseppe Lavagetto: swift: add lvs configuration for esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [14:57:33] <_joe_> godog: see my qwuestion in the patch [14:57:36] <_joe_> but lgtm [14:58:45] ack [14:58:55] (03PS1) 10Volans: wmf-auto-reimage: add option --new for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/318304 (https://phabricator.wikimedia.org/T148816) [14:59:14] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:57] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:58] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/salt-common],File[/usr/local/sbin/grain-ensure],File[/home/mholloway-shell],File[/home/midom] [15:00:00] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/ori],File[/home/rush],File[/home/mark] [15:01:53] cp4016's puppetfail above is caused by: Failed to generate additional resources using 'eval_generate': Connection refused - connect(2) for "puppet" port 8140 [15:02:12] yeah makes sense [15:02:18] I am about to reboot all the puppetmasters [15:02:32] I am disabling puppet across the fleet first [15:02:46] akosiaris: alright! [15:02:46] !log disable puppet across the fleet for puppetmaster kernel upgrades [15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:17] <_joe_> akosiaris: couldn't you use dns switching instead? [15:05:44] _joe_: I could.. 
I suppose I got lazy [15:05:47] :-( [15:06:30] mobrovac: file is /etc/init/cxserver.conf is obsolete as service is managed by service-runner, right? [15:06:44] mobrovac: scb* [15:07:01] kart_: that file is the upstart config file.. probably not you mean [15:07:17] not what* you mean [15:08:46] akosiaris: we have removed it from Puppet. [15:09:40] kart_: yes. service-runner uses service::node which ships a standard one [15:10:46] akosiaris: should those obsolete files removed (it is not working anyway :))? [15:11:19] removed from where ? [15:11:31] the host itself ? [15:11:54] it's irrelevant [15:12:01] OK. [15:12:17] we could stop shipping it indeed though [15:12:34] I 'll upload a patch.. should be easy enough [15:16:28] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748504 (10chasemp) @Marostegui I'm of the opinion at the moment that reworking the definer could be a more nuanced bit of work itself. Honestly, no idea... [15:18:12] (03PS1) 10Hashar: (WIP) ThrottleTest with PHPUnit data provider (WIP) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318307 [15:20:17] (03CR) 10Hashar: [C: 04-1] "Looks good overall. I have a few nitpicks :]" (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [15:21:12] (03Abandoned) 10Hashar: (WIP) ThrottleTest with PHPUnit data provider (WIP) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318307 (owner: 10Hashar) [15:21:50] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2748515 (10RobH) Items going back to spares/reclaim should be tagged with #hardware-requests as well. 
[15:24:01] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748518 (10jcrespo) > Then the path forward I think is to keep both the VIEWMASTER and MAINTAINVIEWS users with SUPER privs and to use the MAINTAINVEWS us... [15:24:28] PROBLEM - confd service on puppetmaster2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [15:25:10] _joe_: I suppose that's wrong ^ [15:25:17] Loaded: loaded (/lib/systemd/system/confd.service; disabled) [15:25:27] so, it's not set to start on boot (enabled) [15:25:28] PROBLEM - confd service on puppetmaster1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [15:25:39] ah on eqiad too [15:25:40] sigh [15:25:40] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748524 (10jcrespo) Also, there is a labsdbadmin user already existing, which was tasked to do maintenance (create users), maybe that should be the one us... 
[15:26:25] !log start confd on puppetmaster1001 and puppetmaster2001 [15:26:28] RECOVERY - confd service on puppetmaster1001 is OK: OK - confd is active [15:26:28] RECOVERY - confd service on puppetmaster2001 is OK: OK - confd is active [15:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:42] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2748530 (10Trizek-WMF) [15:28:41] akosiaris: I ran into this on some caches if they didn't run puppet agent at startup quickly (as puppet agent would start confd if it runs successfully) [15:29:22] akosiaris: but I thought I had already fixed it in https://gerrit.wikimedia.org/r/#/c/318107/ [15:29:47] (to make it start as a service on system boot on its own, without puppet agent having to start it) [15:30:26] maybe my fix was unsufficient? [15:30:30] bblack: I would think so too [15:30:38] I was looking at exactly that right now... [15:30:43] 06Operations, 10Cassandra, 13Patch-For-Review, 06Services (next): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#2748539 (10Eevans) [15:31:23] so, before I tried fixing it, I noticed on the affected hosts that "systemctl is-enabled confd" would report "static" [15:31:42] now, after that gerrit change above for wantedby, they say "disabled" [15:32:05] maybe the [install] wantedby doesn't really take proper effect just from systemctl daemon-reload? [15:33:14] <_joe_> akosiaris: yeah it's not great [15:34:21] _joe_: puppetdb has that too [15:34:24] it's good I checked [15:34:27] hmm [15:34:36] ok looking at systemd docs, I think the [install]wantedby=multi-user thing is correct, but that just *allows* it to be enabled [15:34:44] it still has to be enabled with 'systemctl enable confd' [15:34:53] maybe base::service_unit should be handling that? 
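Note on the exchange above: a unit whose [Install] section is correct can still report "disabled" from `systemctl is-enabled` until someone runs `systemctl enable`. A minimal sketch of a check-then-enable step follows. It assumes only the two systemctl subcommands quoted in the log; the helper names `unit_state` and `ensure_enabled` and the injectable `run` parameter are hypothetical, added so the logic can be exercised without touching a real host. This is not the code from any Puppet change mentioned here.

```python
import subprocess


def unit_state(unit, run=subprocess.run):
    """Return the state word printed by `systemctl is-enabled <unit>`.

    is-enabled exits non-zero for states such as "disabled", so the
    exit code is deliberately not checked here.
    """
    result = run(["systemctl", "is-enabled", unit],
                 stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.strip()


def ensure_enabled(unit, run=subprocess.run):
    """Run `systemctl enable <unit>` only when the unit reports "disabled".

    Returns True if an enable was attempted, False otherwise. States such
    as "static" (no usable [Install] section) are left alone.
    """
    if unit_state(unit, run) != "disabled":
        return False
    run(["systemctl", "enable", unit], check=True)
    return True
```

Calling `ensure_enabled("confd")` a second time should be a no-op once the unit reports "enabled", which is the idempotence a configuration-management run needs.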
[15:35:20] so systemd in later versions has a vendor preset and a normal preset [15:35:29] at least in stretch [15:35:35] Loaded: loaded (/lib/systemd/system/openvpn.service; enabled; vendor preset: enabled) [15:35:37] heh [15:35:39] <_joe_> bblack: that's what I was thinking too [15:35:58] I notice two other systemd units in our puppet have their own manual hacks to run "systemctl enable" [15:36:05] <_joe_> we should in general assume that you want a service to start at startup [15:36:08] modules/admin/data/data.yaml: 'ALL = NOPASSWD: /bin/systemctl enable wdqs-updater', [15:36:09] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748568 (10chasemp) >>! In T148560#2748518, @jcrespo wrote: >> Then the path forward I think is to keep both the VIEWMASTER and MAINTAINVIEWS users with S... [15:36:10] modules/role/manifests/cache/base.pp: command => '/bin/systemctl daemon-reload && /bin/systemctl enable traffic-pool', [15:36:28] PROBLEM - NTP on puppetmaster2001 is CRITICAL: NTP CRITICAL: Offset unknown [15:37:01] _joe_: I guess we'd have to poll the state to enforce it though, right? maybe a command that depends on unitfile installation + daemon-reload, where the command is "systemctl enable" and it's got an onlyif on is-enabled? [15:37:22] or if it's idempotent, I guess just run the enable command on every run [15:37:22] <_joe_> bblack: yes it's easy to do actually [15:37:33] <_joe_> give me 5 mins and i can take a stab at it [15:38:03] <_joe_> it's a creates stanza actually [15:39:03] ah [15:39:06] good point [15:39:13] well... [15:39:29] from the docs: [15:39:31] [Install] [15:39:31] WantedBy=multi-user.target [15:39:32] After running systemctl enable, a symlink /etc/systemd/system/multi-user.target.wants/foo.service linking to the actual unit will be created. It tells systemd to pull in the unit when starting multi-user.target. 
The inverse systemctl disable will remove that symlink again. [15:39:46] so you could do a creates on that file, but only if you parse the WantedBy first heh [15:40:38] and some of our unitfiles intentionally are WantedBy=some-other-service, not multi-user [15:40:53] which I guess is a different case: they'll get started on boot only if some-other-service is started on boot [15:41:14] <_joe_> no that is plainly wrong? [15:41:19] <_joe_> I'll check [15:41:25] I don't think it's wrong [15:41:39] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2748577 (10faidon) I think this is ready to proceed with procurement and setup — @RobH? [15:41:49] it's a valid deployment pattern, to say units x, y, and z are just wanted-by service foo, and service foo is wanted by multi-user [15:42:08] (03PS8) 10Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) [15:42:20] modules/graphite/templates/initscripts/local-relay.systemd.erb:WantedBy=carbon.service [15:42:31] <_joe_> bblack: I am not sure WantedBy in Install does what we want [15:42:43] <_joe_> it's not in the [install] stanza I guess [15:42:44] well, wantedby sets up the relationship [15:43:00] it is in the install stanza [15:43:14] "systemctl enable" does the actual symlinking to make a unit's relationships "active" [15:43:28] (why that's not done by daemon-reload is beyond me) [15:45:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:46:29] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2748611 (10RobH) a:03Cmjohnson These need to be reclaimed to spare (back to asset tag use only, no hostname) and then moved into use for 
T136340. Please upgrade the controllers from H3... [15:46:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3057994 keys, up 218 days 8 hours - replication_delay is 0 [15:47:11] 06Operations, 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2748617 (10Milimetric) a:03Ottomata [15:51:58] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py] [15:52:08] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 1 minute ago with 27 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/salt-common],File[/usr/local/sbin/grain-ensure],File[/home/mholloway-shell],File[/home/midom] [15:52:44] <_joe_> bblack: in fact I was thinking this should all be in the damn systemd provider for service [15:52:58] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): File[/home/ori],File[/home/rush],File[/home/mark] [15:53:21] _joe_: bblack: https://www.freedesktop.org/software/systemd/man/systemd.preset.html [15:53:36] looks like this is going to change in the future [15:53:41] albeit just slightly [15:54:11] although now that I look at it again, jessie does have that [15:54:13] hmmm [15:54:59] twentyafterfour: from my testing we are clear for .23 to group2 [15:55:01] Example 3. 
Administrator policy /etc/systemd/system-preset/00-lennart.preset: [15:55:01] enable httpd.service [15:55:01] enable sshd.service [15:55:01] enable postfix.service [15:55:01] disable * [15:55:14] which will pose the exact same interesting question [15:55:25] should we be shipping our presets under /etc or under /lib ? [15:55:28] sigh... [15:55:49] <_joe_> under /etc, I guess [15:56:18] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:56:38] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:56:52] sudo systemctl preset confd [15:56:53] Created symlink from /etc/systemd/system/multi-user.target.wants/confd.service to /lib/systemd/system/confd.service. [15:57:16] <_joe_> that does the exact same thing as enable? [15:57:36] no, it does what the [Install] section says I think [15:57:38] still reading [15:58:12] Reset the enable/disable status one or more unit files, as specified on the command line, to the defaults configured in the preset policy files. This has the same effect as disable or enable, depending how the unit is listed in the preset files. [15:58:25] so it can read extra policy files [15:58:36] /etc/systemd/system-preset/*.preset [15:58:37] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:58:49] for the life of me, I am not sure yet why it's so complicated [15:59:26] <_joe_> systemctl is-enabled hhvm.service [15:59:32] <_joe_> at least they have that [15:59:58] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:59] https://freedesktop.org/wiki/Software/systemd/Preset/ [16:00:01] ah [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. 
Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1600). [16:00:04] hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:08] ok he explains the reasoning on that one [16:00:26] (03PS1) 10RobH: reclaim aqs100[123] to spare pool [dns] - 10https://gerrit.wikimedia.org/r/318313 [16:01:04] and it's actually quite old. since systemd 32 [16:01:08] we just never had to use it [16:01:25] the only *new* thing seems to be the informational message on systemctl status [16:01:36] holding that "vendor preset:" stuff [16:02:41] aah [16:02:42] there we go [16:02:44] Preset policies are stored in .preset files in /usr/lib/systemd/system-preset/. If no policy exists the default implied policy of "enable everything" is enforced, i.e. in Debian style. [16:02:51] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2748810 (10RobH) [16:02:53] 06Operations, 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10Nuria) Turns out that sorting this is going to take a bit more than we though, we... [16:03:01] then.. why on earth do we need to enable it manually ? [16:04:26] _joe_: seems like we should be running systemctl preset after installing a unit [16:04:26] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2748816 (10RobH) 05Open>03stalled >>! In T136340#2748577, @faidon wrote: > I think this is ready to proceed with procureme... [16:05:12] o/ around for puppet SWAT if it happens? 
:] [16:05:57] hashar: yeah [16:06:24] from https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0October.C2.A027 [16:06:35] got a couples one for the shell script that autorebase the labs puppet.git repo [16:07:01] one is a mere clean up, the other is stop rebasing when nothing has changed in origin/production and skip early [16:07:09] is **to** stop [16:07:11] <_joe_> akosiaris: well, do we have presets on our systems? [16:07:24] (03PS1) 10Ema: varnishapi.py: import latest upstream version [puppet] - 10https://gerrit.wikimedia.org/r/318314 [16:07:28] (03PS2) 10Alexandros Kosiaris: contint: kvm groupadd is only for android testing [puppet] - 10https://gerrit.wikimedia.org/r/318272 (https://phabricator.wikimedia.org/T149294) (owner: 10Hashar) [16:07:32] (03CR) 10Alexandros Kosiaris: [C: 032] contint: kvm groupadd is only for android testing [puppet] - 10https://gerrit.wikimedia.org/r/318272 (https://phabricator.wikimedia.org/T149294) (owner: 10Hashar) [16:07:36] (03CR) 10Alexandros Kosiaris: [V: 032] contint: kvm groupadd is only for android testing [puppet] - 10https://gerrit.wikimedia.org/r/318272 (https://phabricator.wikimedia.org/T149294) (owner: 10Hashar) [16:10:00] hashar: I'm taking a look at your patches [16:10:07] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2748830 (10RobH) [16:10:55] both have been on integration and beta cluster for age and nothing suspicious happened :] [16:11:11] (one puppet master is trusty, the other jessie) [16:12:51] (03PS2) 10Filippo Giunchedi: puppetmaster: polish git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/312747 (owner: 10Hashar) [16:13:21] jouncebot: now [16:13:22] For the next 0 hour(s) and 46 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1600) [16:13:42] waiting for mr jenkins [16:14:15] ah, the CI thing is already merged [16:14:22] was about to do it, but 
even better :) [16:14:48] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: polish git-sync-upstream [puppet] - 10https://gerrit.wikimedia.org/r/312747 (owner: 10Hashar) [16:15:13] ah indeed, merging both [16:15:22] magic! [16:15:33] 06Operations, 06Analytics-Kanban, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2748867 (10Milimetric) [16:16:18] (03PS3) 10Ema: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) [16:16:49] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:17:45] (03CR) 10Filippo Giunchedi: "minor nit, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [16:17:54] hashar: see ^ [16:18:04] _joe_: we actually do [16:18:38] !log Restarted Jenkins Gearman client due to a deadlock with the beta cluster jobs [16:18:42] checking [16:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:49] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:18:56] some defaults under /lib/systemd/system-preset/90-systemd.preset [16:19:20] _joe_: I am though perplexed by the following: cxserver on scb1003 was enabled without anything extra on my part [16:19:24] so.. what gives ? 
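Note on the preset discussion above: the quoted docs describe preset files as lists of `enable`/`disable` directives with glob patterns, and an implied "enable everything" policy when nothing matches. A rough sketch of that lookup follows. It assumes the first matching directive wins and works on the lines of a single file only; real systemd merges several `*.preset` files across directories in name order, which is omitted here. The function name `preset_action` is hypothetical.

```python
from fnmatch import fnmatchcase


def preset_action(unit, preset_lines):
    """Return "enable" or "disable" for a unit, given preset directives.

    Simplified model: first matching directive wins; if nothing matches,
    fall back to the implied "enable everything" default.
    """
    for raw in preset_lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        parts = line.split()
        if len(parts) < 2 or parts[0] not in ("enable", "disable"):
            continue
        if fnmatchcase(unit, parts[1]):
            return parts[0]
    return "enable"


# The administrator-policy example pasted in the channel.
lennart_preset = [
    "enable httpd.service",
    "enable sshd.service",
    "enable postfix.service",
    "disable *",
]
```

With the pasted example, `preset_action("sshd.service", lennart_preset)` gives "enable" and any unit not listed falls through to `disable *`; with no directives at all the default "enable" applies, which fits the observation that `systemctl preset confd` created the multi-user.target symlink.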
[16:20:18] (03CR) 10Hashar: puppetmaster: git-sync-upstream early abort (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [16:20:38] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2748898 (10RobH) [16:20:40] <_joe_> akosiaris: http://m.memegen.com/2j9b8y.jpg [16:20:41] godog: solved :] [16:20:42] (03PS3) 10Hashar: puppetmaster: git-sync-upstream early abort [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) [16:20:58] (03PS1) 10Giuseppe Lavagetto: base::service_unit: enable/disable units under systemd [puppet] - 10https://gerrit.wikimedia.org/r/318315 [16:21:00] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2748901 (10RobH) Adding ops-eqiad tag since now its pending onsite actions. [16:21:59] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:22:04] (03CR) 10jenkins-bot: [V: 04-1] base::service_unit: enable/disable units under systemd [puppet] - 10https://gerrit.wikimedia.org/r/318315 (owner: 10Giuseppe Lavagetto) [16:22:14] (03PS4) 10Filippo Giunchedi: puppetmaster: git-sync-upstream early abort [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [16:23:09] <_joe_> for once, I welcome a linter's -1 [16:23:22] _joe_: I somehow have this feeling your change is not required [16:23:25] <_joe_> akosiaris: what do you think of the patch? [16:23:33] and brandon's actually fixed the issue for confd [16:23:40] I am honestly still confused [16:23:47] 06Operations, 10Cassandra, 06Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2748903 (10Eevans) >>! 
In T148654#2746764, @fgiunchedi wrote: > I _think_ https://gerrit.wikimedia.org/r/#/c/316906/ might have been related @fgiunchedi I think you might be right; Thanks! [16:23:49] your meme generator summed it up greatly for me [16:24:05] I want to do some testing [16:24:22] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2748907 (10ArielGlenn) >>! In T149228#2748319, @Ottomata wrote: > + 1/2 to this idea. I'm all for renaming these boxes, not sure if misc element names is the way to go, but it might be!... [16:24:36] <_joe_> akosiaris: oblivian@mw1241:~$ systemctl is-enabled hhvm.service [16:24:36] <_joe_> disabled [16:24:42] <_joe_> for instance [16:24:52] 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2748908 (10mark) @robh: please request quotes for this, thanks! [16:24:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [16:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:48] (03PS1) 10Papaul: Working on a script to be able to change the mgmt password and sshpass is needed to execute remote racadm and ilo commands from the hosts running the script . [puppet] - 10https://gerrit.wikimedia.org/r/318317 [16:26:53] akosiaris@scb1003:~$ sudo systemctl is-enabled cxserver [16:26:53] Failed to get unit file state for cxserver.service: No such file or directory [16:26:55] let's see now [16:27:11] <_joe_> akosiaris: you're re-running puppet now? [16:27:15] yup [16:27:55] akosiaris@scb1003:~$ sudo systemctl is-enabled cxserver [16:27:55] enabled [16:28:00] so.. not really required [16:28:07] <_joe_> so, the doubt remais [16:28:10] <_joe_> *remains [16:28:13] I think bblack actually fixed it for confd [16:28:15] <_joe_> why hhvm is not enabled? 
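Note on the enabled/disabled checks above: per the docs quoted earlier in the channel, `systemctl enable` creates a symlink under `/etc/systemd/system/<target>.wants/` for each `WantedBy=` target in the unit's [Install] section. A sketch of how one might compute the expected symlink paths from a unit file's text follows, for example to feed a `creates`-style check. The function names are hypothetical, and the parsing is deliberately simplified (no drop-in overrides, no handling of an empty `WantedBy=` reset, no `Alias=` or `RequiredBy=`).

```python
import posixpath


def wanted_by(unit_text):
    """Collect WantedBy= targets from the [Install] section of a unit file."""
    targets = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Install" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "WantedBy":
                # The value may list several space-separated targets.
                targets.extend(value.split())
    return targets


def enable_symlinks(unit_name, unit_text, root="/etc/systemd/system"):
    """Paths that `systemctl enable` would be expected to create."""
    return [posixpath.join(root, target + ".wants", unit_name)
            for target in wanted_by(unit_text)]
```

A unit with `WantedBy=carbon.service`, like the graphite local-relay template mentioned earlier, would yield a path under `carbon.service.wants/` rather than `multi-user.target.wants/`, which is why a fixed path cannot be hard-coded.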
[16:28:18] hmm [16:28:34] let me run the same test on mw1241 [16:28:41] <_joe_> depool it first :) [16:28:50] <_joe_> there is an additional issue there [16:28:59] <_joe_> hhvm is installed by debian [16:29:04] <_joe_> we just provide an override [16:29:27] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: git-sync-upstream early abort [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [16:29:46] hashar: {{done}} thanks! [16:31:04] (03PS1) 10RobH: decom aqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/318318 [16:31:44] <_joe_> akosiaris: doesn't make sense that it gets enabled when you install it though [16:31:45] (03CR) 10RobH: [C: 032] reclaim aqs100[123] to spare pool [dns] - 10https://gerrit.wikimedia.org/r/318313 (owner: 10RobH) [16:31:58] (03PS2) 10RobH: decom aqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/318318 [16:32:12] godog: all set and I have rebased properly. Thank you :] [16:32:18] I am off! [16:33:08] byte! [16:33:09] (03CR) 10RobH: [C: 032] decom aqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/318318 (owner: 10RobH) [16:33:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: mw1241.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=appserver', 'service=apache2']) [16:33:14] lol, "bye" [16:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:56] (03PS3) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985 [16:37:29] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2748977 (10Ottomata) - stat1001 - app/webserver, no analyst/ressearch access. - stat1003 - compute node, lots of storage, mostly used by researchers to connect to MySQL. - stat1002 - compu... 
[16:39:32] 07Puppet, 07Beta-Cluster-reproducible, 13Patch-For-Review: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2748997 (10hashar) My git-sync-upstream patch might have helped some c... [16:43:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1241.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=appserver', 'service=apache2']) [16:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:37] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [16:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:42] (03PS2) 10Giuseppe Lavagetto: base::service_unit: enable/disable the service if managed here [puppet] - 10https://gerrit.wikimedia.org/r/318315 [16:49:45] <_joe_> akosiaris: ^^ [16:50:11] <_joe_> I would not merge it now myself, though [16:50:31] yeah no rush [16:50:33] seems good enough [16:50:40] let's merge it tomorrow [16:50:53] <_joe_> around 7 PM :P [16:50:57] oh wait.. you merge it tomorrow. 
I got a national holiday :P [16:50:59] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:51:03] whatever suits you :P [16:51:05] <_joe_> ahahahh you smartass :P [16:51:17] <_joe_> let's merge on tuesday [16:51:36] (03CR) 10Alexandros Kosiaris: [C: 031] base::service_unit: enable/disable the service if managed here [puppet] - 10https://gerrit.wikimedia.org/r/318315 (owner: 10Giuseppe Lavagetto) [16:51:44] (03PS1) 10Ema: VCL: allow to load test versions of netmapper JSON files [puppet] - 10https://gerrit.wikimedia.org/r/318320 [16:53:07] (03CR) 10Alexandros Kosiaris: [C: 031] Force Content-type for files without extensions (noc.w.o) [puppet] - 10https://gerrit.wikimedia.org/r/318074 (https://phabricator.wikimedia.org/T146421) (owner: 10Elukey) [16:53:20] (03CR) 10Alexandros Kosiaris: "nice regexp btw, took me 5 secs to parse it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/318074 (https://phabricator.wikimedia.org/T146421) (owner: 10Elukey) [16:56:14] !log uploaded gerrit 2.12.5 to apt.wikimedia.org [16:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:18] akosiaris: I took inspiration from stack overflow :P [16:57:59] (03PS2) 10Elukey: Force Content-type for files without extensions (noc.w.o) [puppet] - 10https://gerrit.wikimedia.org/r/318074 (https://phabricator.wikimedia.org/T146421) [16:59:44] (03PS1) 10BBlack: upstream diffs 1.11.4 -> master (post-1.11.5) @2016-10-27 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318321 [16:59:46] (03PS1) 10BBlack: build: remove --with-ipv6 (removed upstream) [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318322 [16:59:48] (03PS1) 10BBlack: disable dynamic tls records patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318323 [16:59:50] (03PS1) 10BBlack: nginx (1.11.4-1+wmf6) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 
10https://gerrit.wikimedia.org/r/318324 [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1700). [17:00:12] (03CR) 10Elukey: [C: 032] Force Content-type for files without extensions (noc.w.o) [puppet] - 10https://gerrit.wikimedia.org/r/318074 (https://phabricator.wikimedia.org/T146421) (owner: 10Elukey) [17:00:30] i plan to depl kartotherian, if i can get it built properly :( [17:04:18] 06Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2749070 (10elukey) 05Open>03Resolved a:03elukey Just deployed, now https://noc.wikimedia.org/conf/fc-list loo... [17:06:47] 06Operations, 10hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#2749077 (10RobH) 05Open>03stalled That is correct, the misc spares do not tend to have SSD(sff) but tend to be LFF SATA disks for high storage capacity. The Dell systems cannot swap the ho... 
[17:07:19] RECOVERY - NTP on puppetmaster2001 is OK: NTP OK: Offset -0.0007071197033 secs [17:08:03] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#2749097 (10fgiunchedi) [17:09:15] (03PS2) 10BBlack: patch upstream -> master @2016-10-27 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318321 [17:09:17] (03PS2) 10BBlack: build: remove --with-ipv6 (removed upstream) [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318322 [17:09:19] (03PS2) 10BBlack: disable dynamic tls records patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318323 [17:09:21] (03PS2) 10BBlack: nginx (1.11.4-1+wmf6) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324 [17:09:23] (03PS1) 10BBlack: update stapling proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318327 [17:09:24] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2728868 (10fgiunchedi) thanks @Papaul ! @Cmjohnson would there be enough space in eqiad to do the above? (i.e. 12x new swift backends installed in parallel with old hardware or we'll need to do it... 
[17:20:01] (03PS3) 10BBlack: patch upstream -> master @2016-10-27 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318321 [17:20:04] (03PS3) 10BBlack: build: remove --with-ipv6 (removed upstream) [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318322 [17:20:06] (03PS3) 10BBlack: disable dynamic tls records patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318323 [17:20:07] (03PS3) 10BBlack: nginx (1.11.4-1+wmf6) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324 [17:20:10] (03PS2) 10BBlack: update stapling proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318327 [17:20:12] (03PS1) 10BBlack: fixup debian perl buildflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318328 [17:27:24] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2749185 (10AlexMonk-WMF) >>! In T149228#2748977, @Ottomata wrote: > - stat1002 - compute node, lots of storage, with private data and Analytics Cluster (Hadoop) access. You mention 'priva... [17:31:15] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2749206 (10Ottomata) Perhaps so, but that data is not stored on stat1003. Those MySQL dbs are theoretically accessible from anywhere in the prod network, if you have the proper MySQL user... [17:31:18] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2749209 (10Cmjohnson) @fgiunchedi In eqiad it will be a little tight in rows A and C. I should be able to accommodate 4 in each rack but will have to move a few decom'd/spare cp servers out of eac... 
[17:33:36] I'm having some problems with git review [17:33:49] I get a connection error when I try to do anything, but only in one repo [17:34:17] The following command failed with exit code 104 [17:34:17] "GET https://gerrit.wikimedia.org/changes/?q=project%3Ar%2Fp%2Fmediawiki%2Fextensions%2FVisualEditor+status%3Aopen" [17:34:29] The GET request is returning a 404 [17:35:34] tchanders: "The requested URL /changes/ was not found on this server." bad url? [17:36:45] tchanders: I think that https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/VisualEditor would be a valid request [17:38:11] That was the url that was generated when I typed git review -l [17:38:19] Any idea how I can change it to the correct one? [17:39:12] (Context: this is a new problem, has been working fine over the last few weeks) [17:39:44] Have you upgraded your git-review package recently? [17:39:53] * bd808 is trying to see if he can reproduce [17:39:58] No [17:40:24] 06Operations, 06Discovery, 10hardware-requests, 06Discovery-Search (Current work): Estimate hardware requirements for ordering new servers for Elasticsearch - https://phabricator.wikimedia.org/T148559#2749232 (10mark) [17:40:38] version 1.25.0 [17:41:05] It seems to be working fine for other people on the team with the same version [17:41:35] *nod* I have 1.25.0 and it seems to work for me on a fresh clone [17:42:18] tchanders: what is your origin url for that clone? Mine looks like -- ssh://bd808@gerrit.wikimedia.org:29418/mediawiki/extensions/VisualEditor [17:42:38] are you using ssh as the origin or https or ?? [17:43:20] https://gerrit.wikimedia.org/r/p/mediawiki/extensions/VisualEditor.git [17:44:27] but gerrit remote is ssh://tchanders@gerrit.wikimedia.org:29418/mediawiki/extensions/VisualEditor.git [17:44:28] ok. 
I get the same error when your origin is anon https [17:48:02] Ok, I disconnected that https origin and it's working now [17:48:35] glad you found a workaround [17:48:36] Odd, because other people seem to have the same origin, and I still have that same setup on other repos and it works [17:48:47] Thanks for the help! [17:49:23] bd808: i wonder if that is because of the certificate errors we had a few ?weeks? ago? [17:49:46] that doesn't seem likely to cause a 404 [17:50:25] T148045 [17:50:26] T148045: Windows 10 & MacOS Sierra Certificate errors due to GlobalSign - https://phabricator.wikimedia.org/T148045 [17:50:45] git-review has dark voodoo for how it decides to do things. I've always had inconsistent results with non-ssh remotes and our server [17:50:57] yeah, not cert related [17:51:00] ok [17:51:21] well it would make sense considering i have that same exact issue on my tool [17:51:42] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2749301 (10AlexMonk-WMF) Okay. What sort of private data (beyond credentials to external systems) is stored on stat1002? [17:53:08] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2749305 (10K4-713) [17:54:02] !log upgrading prometheus to 1.2.1 in codfw/eqiad [17:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:11] 06Operations, 10Analytics: Rename stat100x machines to have misc element names - https://phabricator.wikimedia.org/T149228#2749313 (10Ottomata) Most notably, (and historically), sampled webrequest logs in the udp2log format.
[17:55:39] 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T149212#2749315 (10Volans) 05Open>03Resolved a:03Volans @jcrespo @Marostegui @Papaul, FYI: This was triggered by T149099, looks like the check was not put in scheduled downtime on Icinga. There are 3 missing disks acc... [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1800). [18:00:04] Pchelolo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:04] gehel: Respected human, time to deploy Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1800). Please do the needful. [18:00:04] SMalyshev and Jonas_WMDE: A patch you scheduled for Wikidata query service is about to be deployed. Please be available during the process. [18:00:33] SMalyshev: should we delay this WDQS deployment a bit? [18:01:03] SMalyshev: Ok, no delay, let's deploy [18:01:07] gehel: as you feel necessary. it's ready from my side, so deploy when you prefer [18:01:19] hrm, looks like the only thing up for SWAT was already deployed? 
https://gerrit.wikimedia.org/r/#/c/315696/ [18:01:44] jouncebot: next [18:01:44] In 0 hour(s) and 28 minute(s): Reindexing Commonswiki File: namespace (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1830) [18:01:59] thcipriani: if there's nothing in swat, I have a backport to deploy [18:02:38] legoktm: it seems like SWAT is empty (I think this was deploy calendar creation problem) Pchelolo can confirm this if around [18:03:15] SMalyshev: deployment on wdqs-test completed, feel free to test [18:03:18] legoktm: in short, go for it :) [18:03:29] gehel: testing [18:03:30] thcipriani: you can also deploy https://gerrit.wikimedia.org/r/#/c/318293 and check if $wmgThrottlingExceptions stay the same before/after if you wish [18:03:54] gehel: looks ok [18:04:20] thcipriani: I don't know how it got to the swat list, it's already deployed, nothing to do here. Thank you [18:04:28] Pchelolo: cool, thanks. [18:05:10] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 911.59 seconds [18:05:20] legoktm: can I jump in and deploy Dereckson 's change before your backport is ready? [18:05:24] (03PS40) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [18:05:28] SMalyshev: the UI on https://wdqs-test.wmflabs.org tells me data is 6 days old, but wdqs-updater looks to be running correctly [18:05:39] thcipriani: go for it, mine is waiting on jenkins [18:05:40] https://gerrit.wikimedia.org/r/318338 [18:05:49] okie doke [18:06:06] I'll ping you when clear [18:06:16] (03CR) 1020after4: [C: 031] "fixed up pep8 issues and an error that occurred when ~/.netrc was missing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [18:06:29] thanks [18:06:29] gehel: it's ok, the test one is behind, it's fine [18:06:53] thcipriani: can I propose ^ for swat if there is time? 
[18:07:17] twentyafterfour: sure, throw it on the deployment page [18:07:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [18:08:04] (03Merged) 10jenkins-bot: Get rid of ip/IP tolerance for throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318293 (https://phabricator.wikimedia.org/T131469) (owner: 10Dereckson) [18:08:37] can someone kill esisiisis please? [18:08:44] (I get messages like "esisiisis is inviting you to join #Cobot") [18:08:59] is probably that Spanish irc/wikimedia vandal [18:09:12] Trijnstel: you want #wikimedia-ops for kick, bans or #freenode for klines [18:09:24] thanks... didn't know what to join [18:09:54] !log wdqs deployment of latest GUI [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:01] SMalyshev: deployment complete [18:10:20] Dereckson: change is live on mw1099, checking vars now, but if you can think of other things to check, please do [18:10:23] SMalyshev: tests are good, feel free to check ... [18:10:24] gehel: great! you want to write a note for Lydia? [18:10:37] SMalyshev: yep, I can do that [18:11:17] thcipriani: as long as we don't lost information from wmgThrottlingExceptions arrays that's fine I think [18:11:46] Dereckson: prefect, wmgThrottlingExceptions is the same before and after [18:11:58] Dereckson: going live everywhere [18:12:16] good [18:12:41] heads up swatters: I have a revert to perform, patchset coming shortly. 
I'll do it myself, will involve a scap [18:13:08] thcipriani: added [18:14:37] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:318293|Get rid of ip/IP tolerance for throttle rules (T131469)]] (duration: 00m 46s) [18:14:38] T131469: Allow wmf-config/throttle.php to be lenient on ip/IP typo - https://phabricator.wikimedia.org/T131469 [18:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:48] ^ Dereckson live everywhere [18:16:20] (03PS41) 10Thcipriani: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [18:21:00] (03PS42) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [18:21:52] thcipriani: is it okay for me to deploy now? [18:22:34] legoktm: yup, all clear, sorry for the delay [18:22:41] err, I assume there's no time to scap before the train window, thcipriani ? [18:23:09] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 382.21 seconds [18:23:18] MaxSem: a full scap would run into the train certainly, but twentyafterfour is running the train today, so up to him. [18:26:13] !log legoktm@tin Synchronized php-1.28.0-wmf.23/includes/parser/Parser.php: Remove tracking category stuff that accidentally slipped into 61adc1e14 - T149310 (duration: 00m 46s) [18:26:14] T149310: The ISBN magic word puts Commons pages containing it into a non-existent category. - https://phabricator.wikimedia.org/T149310 [18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:24] uhm, I don't mind too much if we have to delay the version bump [18:26:25] MaxSem, thcipriani: I'm all done [18:26:57] MaxSem: go for it I guess? just ping me when it finishes? [18:27:07] thanks! 
[18:29:59] !log maxsem@tin Started scap: https://gerrit.wikimedia.org/r/#/c/318343/ [18:30:04] smalyshev: Respected human, time to deploy Reindexing Commonswiki File: namespace (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1830). Please do the needful. [18:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:49] PROBLEM - YARN NodeManager Node-State on analytics1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:31:39] RECOVERY - YARN NodeManager Node-State on analytics1040 is OK: OK: YARN NodeManager analytics1040.eqiad.wmnet:8041 Node-State: RUNNING [18:36:49] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [18:37:30] jouncebot: refresh [18:37:33] I refreshed my knowledge about deployments. [18:45:37] (03PS43) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [18:46:42] (03CR) 1020after4: [C: 04-1] "depends on scap 3.4.0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [18:57:45] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2705418 (10Krinkle) >>! In T147845#2748403, @BBlack wrote: > I've briefly reviewed the python code at https://github.com/wikimedia/mediawiki-services-rcstream/blob/ma... [19:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T1900). Please do the needful. [19:00:10] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2749511 (10BBlack) Ok, I was only considering the websockets case. Still, since the python code is unaware of X-Client-IP... 
what is it tying the session to internal... [19:01:20] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [19:04:17] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10procurement: codfw/eqiad: (1 each) wdqs cluster expansion - https://phabricator.wikimedia.org/T149351#2749524 (10RobH) [19:04:43] ^ that's me, apparently scap does localisation at a wrong time :o [19:07:27] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2749547 (10BBlack) I guess really the answer to that doesn't matter either way. We'd still like to get this service to conform to the pattern of every other service... [19:07:37] !log maxsem@tin Finished scap: https://gerrit.wikimedia.org/r/#/c/318343/ (duration: 37m 38s) [19:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:35] twentyafterfour, I'm done [19:08:57] MaxSem: awesome, thanks [19:13:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:15:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:20:41] (03PS1) 10Gehel: elasticsearch - enable GC logs by default [puppet] - 10https://gerrit.wikimedia.org/r/318353 (https://phabricator.wikimedia.org/T134853) [19:22:13] (03CR) 10Gehel: [C: 04-1] "GC logs are already enabled on relforge. I want to wait a few days before pushing this on production servers as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/318353 (https://phabricator.wikimedia.org/T134853) (owner: 10Gehel) [19:25:55] !log Bumping all wikis to 1.28.0-wmf.23 refs T147517 [19:25:56] T147517: MW-1.28.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T147517 [19:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:02] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.23 [19:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:48] jynus, hi, any idea of when wmpt's wiki will be available? For us to plan the content move and improvements. phab: https://phabricator.wikimedia.org/T126832 [19:32:22] so this is new: " LoadBalancer::{closure}: found writes/callbacks pending." [19:32:52] twentyafterfour, yeah, that has been spamming db query logs for some time [19:33:03] alchimista, I honestly cannot say [19:33:56] the plan "check and delete everything I can" requires time, and right now I cannot even allocate when to start [19:34:23] but we're talking something like 1 month or two, right? [19:34:41] it would help creating a subtask or changing that to DBA [19:34:48] so it is on the queue [19:35:15] https://phabricator.wikimedia.org/tag/dba/ [19:36:06] the people that originally wanted to help with that, demon and Krenair, seem to have disappeared [19:36:21] so I have to take care of it, it seems [19:36:38] hi [19:36:49] I wanted to help with that? Sure it wasn't Krinkle?
[19:37:01] (03PS1) 10DCausse: Revert "Revert "[cirrus] Activate BM25 on top 10 wikis: Step 3"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 [19:37:10] Oh, this is ptwikimedia [19:37:29] thought you were talking about the LoadBalancer logs [19:37:45] I want to believe one of the 2 agreed to help with that in exchange for not choosing ptwikimedia2 [19:37:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 611 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3068038 keys, up 218 days 11 hours - replication_delay is 611 [19:37:58] which would allow that to proceed instantly [19:38:24] demon is on vacation this week [19:38:33] (03PS2) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 (https://phabricator.wikimedia.org/T147508) [19:38:36] even if you cannot do it, it would be helpful to lead as in ping me to get it done [19:39:00] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3058485 keys, up 218 days 11 hours - replication_delay is 0 [19:39:52] so the problem with this task was than we need to create a db called 'ptwikimedia', but an old one already exists with that name, and 'pt2wikimedia' would suck [19:39:58] was that* [19:40:08] * twentyafterfour created T149353 to track the loadbalancer logspam [19:40:35] I conceded on that, because, again, one of you offered to lead the effort to get it done [19:41:01] So what we need to do is make a list of all the things that need to be backed up and dropped? [19:41:16] Krenair, that would certainly help [19:41:25] okay... what am I missing then?
[19:41:27] as in investigate all servers with pending content [19:42:20] in fact, if that is converted into a mediawiki script it could even be reused [19:42:29] for the next time it is needed [19:43:51] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: upgrade to prometheus >= 1.1 - https://phabricator.wikimedia.org/T147207#2749692 (10fgiunchedi) [19:44:38] Krenair? [19:44:44] pending content? [19:44:52] yes? [19:44:57] what does that mean? [19:45:35] pages? revisions? users? interwikis? caches? [19:46:44] you told me that was easy to clean up, can you help with that? [19:46:45] those are 'pending'? [19:47:07] mutante: i never did catch up with you regarding tcpircbot acls and firewall rules [19:47:47] pages, revisions and users all live inside the db on the s* cluster and in the CentralAuth DB [19:47:56] s3 I think [19:48:05] we also need to deal with external store [19:48:16] yes, several things and places [19:48:32] I don't think memcached or redis will still have data from the old ptwikimedia, it was a long time ago [19:48:43] can you create a plan/script/etc. and I can review it [19:49:13] take into account the need to not create lag for the other wikis [19:49:25] bah [19:49:27] dbtree has broken [19:49:33] Uncaught ReferenceError: $ is not defined [19:49:34] no jquery? [19:49:50] ok, I happen to be logged into tendril [19:50:05] yeah this would all predate x1 [19:50:17] good, please document all that [19:50:22] etherpad, script [19:50:27] whatever helps [19:50:35] are you ok with that Krenair ?
[19:50:48] I'll write it on the ticket [19:50:52] perfect [19:50:57] that will speed things up [19:51:21] for alchimista and the ptwm community :-) [19:52:02] I have too much stuff going on now on my inbox [19:53:05] 21 in progress tickets for 2 people and lots of people pinging me to help them out [19:53:21] thanks jynus and Krenair, and doc'ing things might help for future cases [19:53:26] jynus: I don't envy you [19:53:34] whoops I just pinged [19:53:36] sorry [19:53:44] I am not ashamed of asking for help [19:53:51] as you can see, twentyafterfour [19:54:01] I've got more than that many tickets but I could hardly call them all 'in progress' [19:54:14] more like 3 or 4 in progress and a bunch of stalled ones [19:55:07] at some point Krenair and I disagreed on how we should go on this ticket, I conceded because that way he could help on his way :-) [19:55:29] twentyafterfour, for me it is different [19:55:39] I am working on more tickets than I have in progress [19:55:47] :) [19:56:13] most of the times it is "hey you have a minute for giving me a tip for X problem?" [19:56:27] well if you ever need help from #releng certainly don't hesitate to ask. we appreciate what you're doing and I for one don't mind returning the favor when I can [19:56:50] I do not think I really help much before, in fact ? [19:57:05] you've helped out with phabricator stuff [19:57:09] phabricator for example, it was pure db issues [19:57:10] and you keep everything running [19:57:13] and then I said [19:57:25] "you do the rest" [19:57:25] jynus, I'll create a ticket for dbtree too [19:57:31] Krenair, thanks [19:57:32] I think I know why it broke [19:57:35] great [19:57:45] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: upgrade to prometheus >= 1.1 - https://phabricator.wikimedia.org/T147207#2749782 (10fgiunchedi) All done except tools for which https://gerrit.wikimedia.org/r/318251 is pending merge to fix k8s discovery. Following the change k8s-related metr...
[19:57:54] 06Operations, 10DBA: dbtree broken - https://phabricator.wikimedia.org/T149357#2749783 (10Krenair) [19:58:13] oh, this is handled by terbium instead of neon/einsteinium? [19:58:34] yes, it is deployed at a random mw host [19:58:44] maybe it was upgraded or sth? [19:59:00] (03PS3) 10Filippo Giunchedi: prometheus::tools: fix k8s discovery after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318251 (https://phabricator.wikimedia.org/T147207) [19:59:14] from terbium:/etc/apache2/sites-enabled/50-dbtree-wikimedia-org.conf [19:59:20] [19:59:20] <IfVersion >= 2.4> [19:59:20] Require all denied [19:59:20] [19:59:44] yeah, some conf change [19:59:51] terbium has 2.4.7-1ubuntu4.9 [19:59:53] I think the jscript used to be downloaded [19:59:58] jquery [20:00:03] yes, sorry [20:00:14] and for obvious reasons was locally installed [20:00:22] but inc is probably a bad place [20:00:29] JScript was a Microsoft thing :) [20:01:20] it probably requires a patch + deploy [20:01:52] don't worry, I am not going to ask you for that [20:02:03] and if you have tendril access, you have the same tree there [20:02:31] I have a bigger better tree in tendril [20:02:42] But I have to log in for that usually which is a pain [20:02:49] it is not better, it is more confusing [20:03:07] well, luckily I know what most of it is, so [20:03:31] both tendril and dbtree have to die [20:03:41] but I am bad at graphic libraries [20:03:48] to generate graphs [20:04:01] our tree is no longer a tree [20:04:07] indeed :/ [20:05:08] Krenair: Saw your ping. Anything I can help with? [20:05:18] nope [20:05:21] I misunderstood something [20:05:22] I am going to focus on what is broken, dbstore1002 [20:06:19] there is a query stuck [20:06:23] mutante, hey [20:07:11] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2749843 (10fgiunchedi) >>! In T148647#2749209, @Cmjohnson wrote: > @fgiunchedi In eqiad it will be a little tight in rows A and C.
I should > be able to accommodate 4 in each rack but will have to... [20:07:14] does anyone know if there is any maintenance running regarding watchlists? [20:07:39] I do not see anything at https://wikitech.wikimedia.org/wiki/Deployments#Week_of_October_24th [20:07:47] but queries look like maintenance [20:08:08] hmm nothing I'm aware of [20:08:54] (03CR) 10Alex Monk: "See T149357" [software/dbtree] - 10https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: 10Reedy) [20:09:11] 06Operations, 10DBA: dbtree broken - https://phabricator.wikimedia.org/T149357#2749852 (10Krenair) was https://gerrit.wikimedia.org/r/#/c/239568/ [20:09:21] jynus, are they coming from the wikiadmin user? [20:09:46] it is replication [20:09:48] I cannot know [20:09:54] neither the user or the host [20:10:09] this is an issue specific for dbstore/tokudb [20:10:47] UPDATE /* User::{closure} */ `watchlist` SET wl_notificationtimestamp = NULL WHERE ... [20:12:37] I think it is missing a key [20:13:39] This would be the closure inside User::clearAllNotifications [20:14:12] it doesn't get stuck on the master where you'd be able to get the mysql user? [20:14:29] no, it doesn't get stuck on the master or on any slave [20:14:32] it is only dbstore [20:14:43] I am going to try to stop the slave [20:14:50] and add the missing index [20:15:27] the NULL is interesting [20:15:40] could it be related to https://phabricator.wikimedia.org/T149353 [20:15:43] it looks like the code specifically doesn't want NULLs [20:16:24] replication at least stops, unlike sanitarium, where it gets stuck and I have to kill the server [20:16:26] twentyafterfour, a refreshLinks job calling User::clearAllNotifications though? 
[20:16:27] it can be the index [20:16:46] or it can be tokudb creating special plans [20:16:50] it is not a production issue [20:16:53] so do not worry much [20:17:01] it is again, dbstore-only [20:17:06] ok [20:17:50] I appreciate your help, Krenair [20:18:19] but helping on separate tasks like the ptwm is where I really need help [20:18:43] where I cannot reach myself [20:19:26] or like the dbtree reporting [20:22:49] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:24:55] 06Operations, 06Analytics-Kanban, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2749952 (10RobH) @Milimetric, I can note that I put LVM on it. Overall most servers should use LVM for a little breathing room on the disks in the event of a failed logrotate o... [20:25:58] I think also swift might need stuff removing [20:25:59] possibly [20:26:18] not sure if the NFS -> Swift migration in prod was done before or after ptwikimedia deletion [20:26:49] Krenair, that is a good suggestion [20:27:14] I'm just going through the list of external services in our config [20:27:30] turns out a lot has been replaced/created in the past 4 years [20:29:10] (03PS4) 10BBlack: nginx (1.11.4-1+wmf7) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324 [20:29:12] (03PS1) 10BBlack: disable call to BIO_set_write_buffer_size() [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318368 [20:31:16] replication lag it is finally going down [20:39:54] it only needed some magic [20:50:59] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:54:05] (03CR) 10Filippo Giunchedi: [C: 032] prometheus::tools: fix k8s discovery after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318251 (https://phabricator.wikimedia.org/T147207) (owner: 10Filippo Giunchedi) 
[21:02:02] (03PS1) 10Filippo Giunchedi: prometheus::tools: fix tls_config(s) [puppet] - 10https://gerrit.wikimedia.org/r/318370 [21:03:08] (03CR) 10jenkins-bot: [V: 04-1] prometheus::tools: fix tls_config(s) [puppet] - 10https://gerrit.wikimedia.org/r/318370 (owner: 10Filippo Giunchedi) [21:04:17] (03PS2) 10Filippo Giunchedi: prometheus::tools: fix tls_config(s) [puppet] - 10https://gerrit.wikimedia.org/r/318370 [21:06:24] (03CR) 10Filippo Giunchedi: [C: 032] prometheus::tools: fix tls_config(s) [puppet] - 10https://gerrit.wikimedia.org/r/318370 (owner: 10Filippo Giunchedi) [21:09:44] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2750073 (10GWicke) [21:10:49] !log T133395: Altering mobileapps keyspaces to use time-windowed compaction [21:10:49] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [21:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:09] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:21:48] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: upgrade to prometheus >= 1.1 - https://phabricator.wikimedia.org/T147207#2750165 (10fgiunchedi) 05Open>03Resolved All done, running prometheus 1.2.1 everywhere. 
[21:25:02] (03PS5) 10BBlack: nginx (1.11.4-1+wmf8) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324 [21:25:04] (03PS1) 10BBlack: no SSL readahead or early buffer release [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431 [21:41:58] !log about to deploy kartotherian update [21:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:22] (03PS2) 10Filippo Giunchedi: site: add varnish_exporter to ulsfo/codfw maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/316742 [21:43:57] (03CR) 10Filippo Giunchedi: [C: 032] site: add varnish_exporter to ulsfo/codfw maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/316742 (owner: 10Filippo Giunchedi) [21:46:39] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 189.78 seconds [21:47:29] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:51:34] (03PS1) 10Dzahn: icinga: move files/icinga/ into module [puppet] - 10https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) [21:52:51] (03PS2) 10Dzahn: icinga: move files/icinga/ into module [puppet] - 10https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) [21:55:18] !log deployed kartotherian [21:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:57:52] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2750343 (10Dzahn) [21:57:53] 06Operations, 10Icinga, 10Shinken: shutdown neon (icinga) after it has been replaced with shinken - https://phabricator.wikimedia.org/T125023#2750342 (10Dzahn) 05stalled>03Open [21:58:56] 06Operations, 10Icinga, 10Shinken: shutdown neon (icinga) after it has been replaced with shinken - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn) neon has been 
replaced with einsteinium and tegmen, which runs a newer icinga on jessie. renaming this ticket again and using it as a decom ticket.... [21:59:25] 06Operations, 10Icinga, 10Shinken: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#2750361 (10Dzahn) [21:59:27] (03PS3) 10Filippo Giunchedi: graphite: change Cassandra '.count' metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) [21:59:47] 06Operations, 10Icinga, 10Shinken: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn) a:03Dzahn [21:59:55] (03CR) 10Filippo Giunchedi: graphite: change Cassandra '.count' metrics aggregation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) (owner: 10Filippo Giunchedi) [22:00:17] (03PS1) 10Dzahn: remove neon from puppet/netboot/dhcp [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [22:00:24] (03CR) 10Filippo Giunchedi: "Quick test:" [puppet] - 10https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) (owner: 10Filippo Giunchedi) [22:00:35] 06Operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1882881 (10Dzahn) I think this is resolved :) @akosiaris is it? [22:01:34] 06Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1882900 (10Dzahn) site.pp now has: 2733 # icinga based monitoring host in codfw 2734 node 'tegmen.wikimedia.org' { 2735 role(icinga, tendril, tcpircbot) 2736 } resolved? 
[22:01:34] (03PS4) 10Filippo Giunchedi: Enable simple-json-datasource on prod Grafana [puppet] - 10https://gerrit.wikimedia.org/r/314029 (https://phabricator.wikimedia.org/T147329) (owner: 10Addshore) [22:04:54] 06Operations, 10Ops-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2750375 (10JKatzWMF) @krenair and @Ottomata! [22:08:00] (03PS1) 10Dzahn: icinga: remove ServerAlias with hardcoded hostname [puppet] - 10https://gerrit.wikimedia.org/r/318439 (https://phabricator.wikimedia.org/T125023) [22:08:22] !log "cassandra" graphite machines LV at 90% used, add 300G via lvresize [22:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:51] 06Operations, 06ELiSo, 10RESTBase, 10VisualEditor, 07Esperanto-Sites: RESTBase thinks beta.wikiversity pages don't exist - https://phabricator.wikimedia.org/T148861#2750382 (10Jdforrester-WMF) 05Open>03Resolved a:03AlexMonk-WMF Since been fixed. Thanks, Services team. [22:10:48] (03CR) 10Filippo Giunchedi: [C: 032] Enable simple-json-datasource on prod Grafana [puppet] - 10https://gerrit.wikimedia.org/r/314029 (https://phabricator.wikimedia.org/T147329) (owner: 10Addshore) [22:12:04] (03PS2) 10Dzahn: remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [22:13:33] addshore: merged your change for json datasource, anything else to do to grafana? [22:15:26] Yes! 
:D [22:15:27] You need to add a new datasource pointing to the tool [22:15:27] Add data source -> GenericDatasource -> https://tools.wmflabs.org/grafana-json-datasource/ [22:15:51] On the labs grafana I called the datasource "Labs JSON tool" not sure if you think that is okay or have a better idea [22:16:18] (03PS1) 10Dzahn: icinga: remove pre-jessie conditional from monitoring::group [puppet] - 10https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) [22:17:19] (03PS6) 10Filippo Giunchedi: prometheus: generate varnish targets from get_clusters() [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [22:20:11] (03PS1) 10Dzahn: remove neon.wikimedia.org, keep neon.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318444 (https://phabricator.wikimedia.org/T125023) [22:21:35] (03PS1) 10Faidon Liambotis: Add user faidon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/318445 [22:22:06] (03CR) 10Faidon Liambotis: [C: 032] Add user faidon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/318445 (owner: 10Faidon Liambotis) [22:23:11] addshore: I see, heh we're trying to keep production and labs datasources segregated, I suppose the tool could run as well in production too? [22:23:28] addshore: I wanted to see an example, though https://grafana-labs.wikimedia.org/dashboard/db/revisionsliderenablesdisables?from=now-1h&to=now e.g. doesn't seem to work? [22:23:48] ahh, yes, there is no reason it couldn't run in production, the code is tiny and simple [22:24:15] godog: see https://grafana-labs-admin.wikimedia.org/dashboard/db/revisionsliderenablesdisables?from=now-90d&to=now [22:24:37] I filed a bug for that not working too, not sure exactly what it is https://phabricator.wikimedia.org/T148669 [22:24:44] CI troubles? 
[22:25:47] (03CR) 10Faidon Liambotis: [V: 032] Add user faidon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/318445 (owner: 10Faidon Liambotis) [22:26:41] greg-g: jenkins is lagging > 10 minutes for puppet [22:26:41] greg-g: e.g. https://gerrit.wikimedia.org/r/#/c/318442/ [22:27:58] Yep, should be going through the queue now [22:28:22] hashar ^^ [22:29:43] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, we could also consider doing LE for this" [puppet] - 10https://gerrit.wikimedia.org/r/318439 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [22:30:28] (03PS7) 10Filippo Giunchedi: prometheus: generate varnish targets from get_clusters() [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [22:30:30] paravoid: ugh, it's the apps team again slowing down zuul: https://integration.wikimedia.org/zuul/ [22:30:44] see the spikes in the graphs at the bottom [22:31:29] hashar: this is the second time this week that android changes (on the order of 40 or so) have slowed down zuul for unrelated repos (I have no idea how, but each time someone complains, I open zuul and see the huge backlog of android changes) [22:32:53] greg-g i thought android uses instances not nodepool? 
[22:33:04] which is why I didn't say nodepool [22:33:24] Oh [22:33:28] sorryu [22:33:29] sorry [22:33:36] 06Operations, 10Monitoring: Icinga stale resources, possible artifact of the Icinga upgrade - https://phabricator.wikimedia.org/T149376#2750437 (10faidon) [22:34:51] ACKNOWLEDGEMENT - HP RAID on db2052 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:4 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149377 [22:34:54] 06Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2750453 (10ops-monitoring-bot) [22:37:54] it seems the slowness of android comes when it depends on open patches (many open patches) [22:38:16] If it was just testing against a parent that was already merged, it would be faster. [22:38:18] (03PS1) 10Andrew Bogott: Openstack: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/318449 [22:38:24] Maybe this is a bug in zuul? [22:39:30] most likely, given it shouldn't have any effect on ops/puppet but it seems to [22:40:08] (03CR) 10Andrew Bogott: [C: 032] Openstack: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/318449 (owner: 10Andrew Bogott) [22:40:31] Yep, seems that zuul may be setting up tests underneath that it doesn't need to do [22:41:29] greg-g should i file a task on phabricator in zuul? 
[22:41:57] (03PS1) 10Dzahn: mv snaprotate.pl from files/backup/ to backup module [puppet] - 10https://gerrit.wikimedia.org/r/318450 [22:42:20] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:43:56] paladox: yeah, I will, just busy right now (I'll do it in a minute, don't worry) [22:44:08] Ok thanks :) [22:44:48] (03PS1) 10Dzahn: delete keys in files/ppa/ [puppet] - 10https://gerrit.wikimedia.org/r/318451 [22:46:52] (03CR) 10Dzahn: "3 of these GPG keys have been added with the initial commit of the public puppet repo, Sep 7 2011, and one on Dec 28 2011" [puppet] - 10https://gerrit.wikimedia.org/r/318451 (owner: 10Dzahn) [22:50:43] (03PS1) 10Dzahn: osm: move files/osm/tuning.conf to module [puppet] - 10https://gerrit.wikimedia.org/r/318453 [22:52:46] (03PS1) 10Dzahn: repeat hostname for AAAA (tegmen,einsteinium,uranium) [dns] - 10https://gerrit.wikimedia.org/r/318454 [22:57:39] (03PS1) 10Dzahn: delete virt1000.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/318455 [22:59:55] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on gallium to 2.5.0-8-gcbc7f62-wmf2precise1 - https://phabricator.wikimedia.org/T144088#2750516 (10Paladox) @hashar is this task resolved? [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161027T2300). [23:00:05] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:59] available! [23:03:26] here [23:04:18] funny, the bot didn't realize julian and i should also in that list [23:04:24] looks like my patch is blocked until next week [23:04:38] wat [23:05:36] who's doing the swat? 
[23:05:49] dereckson unilaterally -2'd it [23:06:32] (03CR) 10Kaldari: "> Does the group depend on any existing autoconfirmed flags? Or, if a non-autoconfirmed user is granted patroller, it's not dependent on o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [23:07:02] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on gallium to 2.5.0-8-gcbc7f62-wmf2precise1 - https://phabricator.wikimedia.org/T144088#2750543 (10hashar) 05Open>03Resolved Yeah they have all been uploaded / polished with @elukey via a different task. Thanks @Paladox ! Precis... [23:07:18] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on gallium to 2.5.0-8-gcbc7f62-wmf2precise1 - https://phabricator.wikimedia.org/T144088#2750547 (10hashar) [23:07:19] (03PS2) 10Dzahn: repeat hostname for AAAA (tegmen,einsteinium,uranium) [dns] - 10https://gerrit.wikimedia.org/r/318454 [23:07:37] (03CR) 10Dzahn: [C: 032] repeat hostname for AAAA (tegmen,einsteinium,uranium) [dns] - 10https://gerrit.wikimedia.org/r/318454 (owner: 10Dzahn) [23:09:44] musikanimal, that's generally how -2s work [23:09:56] (03CR) 10Andrew Bogott: [C: 031] delete virt1000.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/318455 (owner: 10Dzahn) [23:09:59] unless things are bad enough that multiple people feel the need to -2 [23:10:12] I know, but there was no reason to -2 [23:10:19] everyone was ready, and in fact we were counting on it [23:10:30] not a big deal, just misleading [23:10:37] (03PS2) 10Dzahn: delete virt1000.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/318455 [23:11:31] (03CR) 10Dzahn: [C: 032] delete virt1000.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/318455 (owner: 10Dzahn) [23:11:34] Hello. 
[23:11:36] (03CR) 10Filippo Giunchedi: [C: 032] "Going to merge this and then test https://gerrit.wikimedia.org/r/#/c/315098/4 without PCC :(" [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [23:11:40] musikanimal: I'd be more comfortable if we deploy this change next week [23:11:47] (03PS8) 10Filippo Giunchedi: prometheus: generate varnish targets from get_clusters() [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [23:11:55] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on gallium to 2.5.0-8-gcbc7f62-wmf2precise1 - https://phabricator.wikimedia.org/T144088#2750548 (10Paladox) your welcome :) [23:12:08] There are still questions, both in the Phabricator task and the Gerrit change, after kaldari's scheduling [23:12:47] I've read a comment it's a good idea to merge this now, so the English Wikipedia has the week-end to manage the groups. [23:13:07] Dereckson: I'm not sure there are any pertinent questions. I already answered the one in Gerrit. [23:13:11] My question on the gerrit was answered (this is Andy M. Wang) [23:13:11] yes [23:13:21] we were sort of hoping to have it over the weekend [23:13:37] But we only deploy from Monday to Thursday, and to do in the same deployment week the two changes (the second change hasn't been published on Gerrit yet) seems a better idea, to be able to plan and react correctly (and deploy a fix) if there is an issue [23:14:04] musikanimal, are you doing the SWAT? [23:14:14] there is another patch pending? [23:14:52] or are you saying that the swat itself is canceled? [23:15:00] yeah it was -2'd [23:15:14] yurik: no, just his patch [23:15:18] ah, ok [23:15:28] so is anyone swatting then? :) [23:15:31] yurik: do you wish to deploy yours, or do you wish I help there? 
[23:15:38] not my patch actually, we were sort of counting down the clock at en:WP:PERM [23:15:39] Dereckson: There's no rush on the 2nd patch. There might even be a couple weeks between them depending on how the en.wiki community wants to roll out the new group. [23:15:53] Dereckson, either way. I haven't done depls in a while, would be good to refresh my scap pull skilz [23:16:02] Probably won't be a big gap, but it's possible. [23:16:31] yurik: okay, you can do it in this case, that will allow to refresh it, ping me if you've an issue [23:16:37] heh [23:17:11] kaldari: you mean you don't have a provisional roadmap and calendar would be useful for the two patches? so why the rush patch 1 must be done today? [23:17:52] (03PS5) 10Filippo Giunchedi: [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [23:18:07] Dereckson: No rush. I didn't say it had to be today. I just don't see any reason to wait, but maybe giving it a few extra days is a good idea. We'll just have to re-coordinate with everyone. [23:19:25] * Dereckson nods. [23:19:43] Yeah doesn't have to be today, no biggie. But for the record everyone at enwiki is ready :) [23:20:34] (03PS6) 10Filippo Giunchedi: prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [23:21:12] Dereckson: Looks like Monday's a holiday, so we can shoot for Tuesday. Does that sound good to you musikanimal? [23:21:42] wait we're off Monday?! [23:22:02] musikanimal: I'm not sure actually :P [23:22:22] if there are ops people present Monday, deployments will be assured [23:22:22] let's see... [23:22:41] Oct 31st is Tim Starling day! 
[23:22:42] (03PS2) 10Dzahn: Introduce role/class to manage mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/318317 (owner: 10Papaul) [23:22:52] https://wikitech.wikimedia.org/wiki/Deployments#Week_of_October_31st doesn't note a holiday [23:23:17] Halloween deploy? sounds "scary" :p [23:23:23] musikanimal: I guess Halloween isn't a real holiday. I can't believe we get Columbus Day off and not Halloween. That's dumb :) [23:23:31] hehe [23:23:43] scary deploys are fun! [23:23:50] Halloween is the biggest holiday of the year [23:24:01] +1 [23:24:04] I would have to agree [23:24:17] https://en.wikipedia.org/wiki/Samhain [23:24:29] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [23:25:09] so what's the story with Tim Starling and Halloween? [23:25:58] https://en.wikipedia.org/wiki/Wikipedia:Tim_Starling_Day [23:25:59] Do he and Sam celebrate it on May 31st instead? [23:26:12] In countries that celebrate Halloween, children will first say "Trick or Treat" and then, when they get the candy, they will say "Secure and Split" and run away, in honor of Tim's work in this area. [23:26:49] nice [23:27:29] OK, so let's make it a Halloween deployment then [23:27:55] I'll schedule it for Monday [23:28:19] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:28:34] that's me ^ [23:29:26] (03PS3) 10Papaul: Introduce role/class to manage mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/318317 [23:30:20] (03PS4) 10Dzahn: Introduce role/class to manage mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/318317 (owner: 10Papaul) [23:30:50] (03CR) 10Dzahn: "compiled with https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/ result: http://puppet-compiler.wmflabs.org/449" [puppet] - 10https://gerrit.wikimedia.org/r/318317 (owner: 10Papaul) [23:31:07] (03CR) 10Dzahn: [C: 032] "installing sshpass on neodymium" [puppet] - 10https://gerrit.wikimedia.org/r/318317 (owner: 10Papaul) [23:39:46] (03PS3) 10Dzahn: Add Xdummy daemon [puppet] - 10https://gerrit.wikimedia.org/r/264303 (https://phabricator.wikimedia.org/T133183) (owner: 10Niedzielski) [23:39:58] !log yurik@tin Synchronized php-1.28.0-wmf.23/extensions/Kartographer/modules/dialog/dialog.js: https://gerrit.wikimedia.org/r/#/c/318457/ (duration: 00m 47s) [23:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:02] (03PS4) 10Niedzielski: contint: add module xdummy for Jenkins' Android emulation tests [puppet] - 10https://gerrit.wikimedia.org/r/264303 (https://phabricator.wikimedia.org/T133183) [23:41:48] (03CR) 10Dzahn: [C: 032] ""already applied on the Jenkins instance that is dedicated to running the Android tests"" [puppet] - 10https://gerrit.wikimedia.org/r/264303 (https://phabricator.wikimedia.org/T133183) (owner: 10Niedzielski) [23:44:59] (03PS1) 10Filippo Giunchedi: prometheus: quote hash keys in varnish_config template [puppet] - 10https://gerrit.wikimedia.org/r/318466 [23:47:03] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: quote hash keys in varnish_config template [puppet] - 10https://gerrit.wikimedia.org/r/318466 (owner: 10Filippo Giunchedi) [23:47:08] (03PS2) 10Filippo Giunchedi: prometheus: quote hash keys in 
varnish_config template [puppet] - 10https://gerrit.wikimedia.org/r/318466 [23:53:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:54:49] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:55:26] (03CR) 10Niedzielski: "Thanks @Dzahn and @Hashar! I hope the failure is unrelated to this change (it looks ok to my Android eyes)." [puppet] - 10https://gerrit.wikimedia.org/r/264303 (https://phabricator.wikimedia.org/T133183) (owner: 10Niedzielski) [23:56:09] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:59:12] (03PS7) 10Paladox: Gerrit: Enable concurrent collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [23:59:25] (03PS8) 10Paladox: Gerrit: Enable concurrent collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [23:59:52] (03PS1) 10Filippo Giunchedi: prometheus: swap cluster/site returned by get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/318470