[00:12:24] (CR) Dzahn: [C: 1] Add text/x-python to files.viewable-mime-types [puppet] - https://gerrit.wikimedia.org/r/307456 (owner: Paladox)
[00:55:00] (PS1) Paladox: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - https://gerrit.wikimedia.org/r/307462
[00:55:54] (PS2) Paladox: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - https://gerrit.wikimedia.org/r/307462
[00:56:41] (PS1) BryanDavis: webservice: Warn when using lighttpd-precise [software/tools-webservice] - https://gerrit.wikimedia.org/r/307463 (https://phabricator.wikimedia.org/T143282)
[00:58:17] (PS3) Paladox: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - https://gerrit.wikimedia.org/r/307462
[01:11:30] (CR) Dzahn: [C: -1] "this is not needed. just set the hostname of the active server in hiera, whether it is production or labs doesn't even matter" [puppet] - https://gerrit.wikimedia.org/r/307335 (https://phabricator.wikimedia.org/T144112) (owner: Paladox)
[01:12:06] (CR) Dzahn: "..we just did and it is past that error already" [puppet] - https://gerrit.wikimedia.org/r/307335 (https://phabricator.wikimedia.org/T144112) (owner: Paladox)
[01:13:05] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 439.43 seconds
[01:13:11] (CR) Dzahn: "the answer to the "how to" was the "phabricator_active_server: phab-03" line on https://wikitech.wikimedia.org/wiki/Hiera:Phabricator" [puppet] - https://gerrit.wikimedia.org/r/307335 (https://phabricator.wikimedia.org/T144112) (owner: Paladox)
[01:15:35] (PS4) Dzahn: installserver: put aptrepo role also on install2001 [puppet] - https://gerrit.wikimedia.org/r/306713
[01:23:46] Operations, MediaWiki-JobQueue, Regression: Restore 30 minutes delayed list update to no waiting, to stop killing sandbox functionality - https://phabricator.wikimedia.org/T139893#2593370 (Bawolff) Sorry, but I think the benefits of the change outweigh the drawbacks. I don't think we should revert....
[01:24:41] legoktm: is this class still used anywhere? role::labs::extdist https://gerrit.wikimedia.org/r/#/c/298906/
[01:31:21] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 55.12 seconds
[01:41:48] (PS2) Dzahn: cassandra: add ssl monitoring only for ssl-enabled hosts [puppet] - https://gerrit.wikimedia.org/r/307251 (https://phabricator.wikimedia.org/T120662) (owner: Filippo Giunchedi)
[01:42:30] (CR) Dzahn: [C: 2] "before: number of checks with "cassandra .. SSL ..7001" = 60" [puppet] - https://gerrit.wikimedia.org/r/307251 (https://phabricator.wikimedia.org/T120662) (owner: Filippo Giunchedi)
[01:54:14] (CR) Dzahn: "after: also 60. no-op confirmed on neon" [puppet] - https://gerrit.wikimedia.org/r/307251 (https://phabricator.wikimedia.org/T120662) (owner: Filippo Giunchedi)
[01:59:16] (CR) Dzahn: [C: 1] Phab: Remove config abstraction. Useless & confusing [puppet] - https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: Chad)
[02:00:31] mutante: yeah, it's used on labs
[02:01:29] (CR) Legoktm: "extdist-01 and extdist-02 in the extdist labs project should be using it...what do I need to do on the labs end to accommodate the rename?" [puppet] - https://gerrit.wikimedia.org/r/298906 (owner: Dzahn)
[02:08:14] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80513 MB (15% inode=99%)
[02:08:36] legoktm: oh, ok. i tried to check that with https://tools.wmflabs.org/watroles/role/role::labs::extdist
[02:09:01] legoktm: maybe i am using watroles wrongly or it has an issue
[02:09:39] mutante: I'm not really sure...Yuvi set up most of the puppet stuff for me
[02:10:37] legoktm: ok thanks, i wonder if there is another way to check, will ask
[02:11:00] but I'm fairly sure only those two instances have the role set up
[02:11:07] and if you tell me what I need to change, I can take care of that
[02:12:08] (not now though, I'm about to run for dinner, sorry)
[02:18:25] legoktm: the thing for me was just how to figure out the names of the instances using it. if that is known then what needs to be changed is just a checkbox in wikitech ui.. yep, dinner.. later and no rush at all
[02:29:16] (PS1) BBlack: Add local chapoly preference hack patch [debs/openssl] (wmf-1.1) - https://gerrit.wikimedia.org/r/307465
[02:29:18] (PS1) BBlack: openssl (1.1.0-1~wmf1) jessie-wikimedia; urgency=medium [debs/openssl] (wmf-1.1) - https://gerrit.wikimedia.org/r/307466
[02:29:30] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 11m 57s)
[02:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 30 02:36:06 UTC 2016 (duration 6m 36s)
[02:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:06:06] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[03:17:30] (Abandoned) Mattflaschen: Expire Flow caches after 1 day [mediawiki-config] - https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: Matthias Mullie)
[05:19:22] Operations, Labs, Patch-For-Review: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2593476 (madhuvishy) In order to backup scratch from labstore1001 to labstore1003 using rsync: ### snapshot `lvcreate -L1T -s -n backup-scratch /dev/labstore/scratch` ### mount `mount /...
[05:19:37] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/3877/iridium.eqiad.wmnet/ ?" [puppet] - https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: Chad)
[05:24:37] (CR) Dzahn: [C: 1] "no idea how you created that PHID but i believe you we need this to keep the custom WMF logo :p" [puppet] - https://gerrit.wikimedia.org/r/307462 (owner: Paladox)
[05:27:33] (CR) Dzahn: [C: -1] "don't really know but since all the things that hashar said, probably not.." [puppet] - https://gerrit.wikimedia.org/r/306851 (owner: 20after4)
[05:30:18] (CR) Dzahn: "that being said, added Ariel, i think he cares about py linting" [puppet] - https://gerrit.wikimedia.org/r/306851 (owner: 20after4)
[05:37:12] (CR) Dzahn: "reading the ticket it sounds like "probably reasonable" but could you add a TLDR to the commit message here" [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: Paladox)
[05:38:58] (CR) Dzahn: [C: 1] "i haven't tested myself, but looks to me like this is ready to go" [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox)
[05:39:25] Operations, Labs, Tracking: Migrate tools-project and others(Labs) data from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2593494 (madhuvishy)
[05:43:10] (CR) Dzahn: [C: 1] "i think it's ok to merge this even though it's likely there will be a follow-up (or 2) since there is really no real way to test Icinga ch" [puppet] - https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: Gehel)
[05:45:56] (CR) Dzahn: [C: -1] "i think it has been duplicated meanwhile" [puppet] - https://gerrit.wikimedia.org/r/244471 (https://phabricator.wikimedia.org/T114161) (owner: Rush)
[05:46:47] (CR) Dzahn: [C: 1] "i still like this but testing ALL redirects is kind of hard.. still need to generate a list of URLs to test.. hrmm..rhmm" [puppet] - https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: BBlack)
[05:48:43] (CR) Dzahn: "yea, this is kind of old by now, we should re-evaluate the situation now. can we do this redirect nowadays?" [puppet] - https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: Nemo bis)
[05:49:44] (CR) Dzahn: "same here, needs the "secure redirects"" [puppet] - https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) (owner: Microchip08)
[05:51:29] (CR) Dzahn: [C: 1] "after that typo is fixed it looks good to me, but needs +1 from jcrespo too" [puppet] - https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 20after4)
[05:52:14] (PS3) Dzahn: toollabs: install pdf2djvu [puppet] - https://gerrit.wikimedia.org/r/304788 (https://phabricator.wikimedia.org/T130138) (owner: Merlijn van Deen)
[05:53:10] (CR) Dzahn: [C: 2] "seems uncontroversial and i know tool/labs people are busy, so let me just do this to help out" [puppet] - https://gerrit.wikimedia.org/r/304788 (https://phabricator.wikimedia.org/T130138) (owner: Merlijn van Deen)
[06:02:56] (CR) Dzahn: "root@tools-exec-1216:~# dpkg -l | grep pdf" [puppet] - https://gerrit.wikimedia.org/r/304788 (https://phabricator.wikimedia.org/T130138) (owner: Merlijn van Deen)
[06:08:38] Operations, Cassandra, RESTBase-Cassandra, Patch-For-Review: Track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2593536 (Dzahn) merged and checked on neon. number of checks was 60 before and after / no-op
[06:09:46] Operations, Cassandra, RESTBase-Cassandra, Patch-For-Review: Track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2593538 (Dzahn) guess it's resolved now?
[06:10:05] (CR) Nemo bis: "Yes, the redirect is as needed as always" [puppet] - https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: Nemo bis)
[06:11:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[06:12:21] Puppet, Beta-Cluster-Infrastructure: puppet agent -tv fails to run on deployment-sca01 - https://phabricator.wikimedia.org/T144256#2593546 (KartikMistry)
[06:22:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/3/1: down - Transit: Telia (IC-308845) {#3861} [10Gbps]
[06:25:15] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 26 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[06:26:15] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms
[06:31:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[06:31:46] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[06:32:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:38:00] (PS1) Urbanecm: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254)
[06:41:55] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:15] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 2.949 second response time
[06:51:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[06:56:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[07:06:14] (PS2) Urbanecm: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254)
[07:07:00] !log reimaging mw209[345] to Debian Jessie
[07:08:22] !log reimaging mw2087/2089 to jessie
[07:17:40] Operations: mw2086 & mw2087 do not respond to IPMI commands - https://phabricator.wikimedia.org/T142726#2593663 (MoritzMuehlenhoff) Status update: mw2088/2089 (which are identical hardware) worked fine and I re-tried mw2087, but still to no avail. Maybe this is limited to a few hosts after all.
[07:18:12] (PS1) WMDE-leszek: Enable mention status notifications on mediawikiwiki and metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/307476 (https://phabricator.wikimedia.org/T144181)
[07:22:12] (PS4) WMDE-leszek: Enable mention status notifications everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/304608 (https://phabricator.wikimedia.org/T143101) (owner: Addshore)
[07:25:21] (CR) WMDE-leszek: "Yes, as deployment schedule of status notifications has changed, dewiki will have them enabled along with all other wikis in I6e69bff78061" [mediawiki-config] - https://gerrit.wikimedia.org/r/304607 (https://phabricator.wikimedia.org/T143100) (owner: Addshore)
[07:37:06] good morning
[07:42:23] (PS2) Muehlenhoff: Provide a systemd override unit for hhvm [puppet] - https://gerrit.wikimedia.org/r/307270 (https://phabricator.wikimedia.org/T143210)
[07:44:52] !log temporarily disabling puppet on mw1* host to merge hhvm-related puppet change
[07:45:11] (CR) Gehel: "Puppet compiler: https://puppet-compiler.wmflabs.org/3878/" [puppet] - https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844) (owner: Gehel)
[07:45:19] (PS3) Gehel: elasticsearch - check shards via the service, not via each individual node [puppet] - https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844)
[07:46:44] (CR) Jcrespo: [C: -1] "We cannot go ahead with this until we decide the logic of the current proxy, and how it interferes with the Master-Slave setup. Also, gran" [puppet] - https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 20after4)
[07:46:46] (CR) Gehel: [C: 2] elasticsearch - check shards via the service, not via each individual node [puppet] - https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844) (owner: Gehel)
[07:47:37] (CR) Muehlenhoff: [C: 2] Provide a systemd override unit for hhvm [puppet] - https://gerrit.wikimedia.org/r/307270 (https://phabricator.wikimedia.org/T143210) (owner: Muehlenhoff)
[07:47:42] (PS3) Muehlenhoff: Provide a systemd override unit for hhvm [puppet] - https://gerrit.wikimedia.org/r/307270 (https://phabricator.wikimedia.org/T143210)
[07:56:14] (CR) Jcrespo: "I think I would let Releng people to decide; in any case, scheduling this for deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/302223 (owner: Jcrespo)
[08:00:35] (PS2) Alexandros Kosiaris: site.pp: Remove $ganglia_aggregator node scope variables [puppet] - https://gerrit.wikimedia.org/r/307301
[08:00:41] (CR) Alexandros Kosiaris: [C: 2 V: 2] site.pp: Remove $ganglia_aggregator node scope variables [puppet] - https://gerrit.wikimedia.org/r/307301 (owner: Alexandros Kosiaris)
[08:01:15] (PS3) Alexandros Kosiaris: ganglia: Use ferm::service instead of ferm::rule [puppet] - https://gerrit.wikimedia.org/r/307302
[08:01:20] (CR) Alexandros Kosiaris: [C: 2 V: 2] ganglia: Use ferm::service instead of ferm::rule [puppet] - https://gerrit.wikimedia.org/r/307302 (owner: Alexandros Kosiaris)
[08:04:01] (PS3) Jcrespo: Sort s3.dblist in lexicographical order [mediawiki-config] - https://gerrit.wikimedia.org/r/302223
[08:04:51] (PS7) Gehel: elasticsearch - move relforge to its own role [puppet] - https://gerrit.wikimedia.org/r/304067
[08:05:15] (CR) jenkins-bot: [V: -1] elasticsearch - move relforge to its own role [puppet] - https://gerrit.wikimedia.org/r/304067 (owner: Gehel)
[08:06:47] (PS8) Gehel: elasticsearch - move relforge to its own role [puppet] - https://gerrit.wikimedia.org/r/304067
[08:07:03] (CR) Gehel: "rebase" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/304067 (owner: Gehel)
[08:09:03] (PS3) Jcrespo: Remove db1027 from internal dns entries [dns] - https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253)
[08:11:04] (CR) Gehel: [C: -1] "This requires some cleanup now that we have improved postgres role management. I will probably also split this in changes specific to each" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: Gehel)
[08:12:43] Operations, ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2593788 (MoritzMuehlenhoff)
[08:13:23] Puppet, Beta-Cluster-Infrastructure: puppet agent -tv fails to run on deployment-sca01 - https://phabricator.wikimedia.org/T144256#2593802 (Krenair)
[08:13:25] Puppet, Beta-Cluster-Infrastructure: deployment-sca0[12] puppet failure due to issues involving /srv/deployment directory - https://phabricator.wikimedia.org/T143065#2593805 (Krenair)
[08:20:54] (PS9) Gehel: elasticsearch - move relforge to its own role [puppet] - https://gerrit.wikimedia.org/r/304067
[08:24:28] (PS1) Jcrespo: prometheus mysqld exporter: add a bunch of selected slaves from core [puppet] - https://gerrit.wikimedia.org/r/307479 (https://phabricator.wikimedia.org/T126757)
[08:26:09] Operations, Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#2593819 (fgiunchedi)
[08:28:26] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593821 (MoritzMuehlenhoff)
[08:28:40] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593559 (fgiunchedi) the fact that ulsfo fails but not the others might be related to varnish 4, #traffic recently switched ulsfo cache_misc
[08:29:12] (CR) Jcrespo: "Double check for syntax." [puppet] - https://gerrit.wikimedia.org/r/307479 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[08:30:55] (CR) Filippo Giunchedi: [C: 1] prometheus mysqld exporter: add a bunch of selected slaves from core [puppet] - https://gerrit.wikimedia.org/r/307479 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[08:31:30] (CR) Jcrespo: [C: 2] prometheus mysqld exporter: add a bunch of selected slaves from core [puppet] - https://gerrit.wikimedia.org/r/307479 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[08:31:57] filipo what is an easy way to check for monitoring errors?
[08:32:32] sorry, I made a typo on your name
[08:33:19] haha that's fine jynus, you mean if e.g. mysqld_exporter can't talk to mysql?
[08:33:26] yes
[08:33:36] or can talk but receives 0s
[08:33:43] or any other issue
[08:34:12] basically, I believe we are skipping some hosts
[08:34:43] ah, usually there's a few metrics you can check for errors
[08:34:45] PROBLEM - Apache HTTP on mw2094 is CRITICAL: Connection timed out
[08:34:46] e.g. mysql_exporter_last_scrape_error
[08:34:57] PROBLEM - Apache HTTP on mw2093 is CRITICAL: Connection timed out
[08:35:11] using the web interface?
[08:35:15] PROBLEM - Apache HTTP on mw2092 is CRITICAL: Connection timed out
[08:35:35] yeah, grafana would work too
[08:35:46] PROBLEM - nutcracker port on mw2092 is CRITICAL: Timeout while attempting connection
[08:35:46] PROBLEM - nutcracker process on mw2093 is CRITICAL: Timeout while attempting connection
[08:35:46] PROBLEM - puppet last run on mw2094 is CRITICAL: Timeout while attempting connection
[08:36:08] PROBLEM - nutcracker process on mw2092 is CRITICAL: Timeout while attempting connection
[08:36:08] PROBLEM - salt-minion processes on mw2094 is CRITICAL: Timeout while attempting connection
[08:36:08] PROBLEM - puppet last run on mw2093 is CRITICAL: Timeout while attempting connection
[08:36:38] PROBLEM - puppet last run on mw2092 is CRITICAL: Timeout while attempting connection
[08:36:38] PROBLEM - salt-minion processes on mw2093 is CRITICAL: Timeout while attempting connection
[08:36:58] what the hell, I scheduled downtime
[08:37:02] these are mine
[08:37:05] PROBLEM - salt-minion processes on mw2092 is CRITICAL: Timeout while attempting connection
[08:37:05] :/
[08:37:25] PROBLEM - Check size of conntrack table on mw2094 is CRITICAL: Timeout while attempting connection
[08:37:46] PROBLEM - DPKG on mw2094 is CRITICAL: Timeout while attempting connection
[08:39:14] (PS1) Volans: Reimaging: add option to reboot after the reimage [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536)
[08:40:00] jynus: another metric is "up" which is zero or one depending on whether prometheus was able to fetch metrics from that particular target
[08:40:04] (CR) Volans: "@elukey: quick stub, need some proper testing. Hope it can help" [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: Volans)
[08:40:07] RECOVERY - DPKG on mw2094 is OK: All packages OK
[08:40:56] RECOVERY - salt-minion processes on mw2094 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:40:56] RECOVERY - nutcracker process on mw2092 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:41:27] RECOVERY - salt-minion processes on mw2093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:41:36] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593559 (ema) Yes, we finished upgrading cache_upload in ulsfo to Varnish 4 yesterday: T131502. I've banned the specific image from the frontends in ulsfo and I now get the right C...
[08:41:57] RECOVERY - Apache HTTP on mw2094 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.074 second response time
[08:42:05] RECOVERY - salt-minion processes on mw2092 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:42:16] RECOVERY - Check size of conntrack table on mw2094 is OK: OK: nf_conntrack is 0 % full
[08:42:16] RECOVERY - Apache HTTP on mw2093 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.086 second response time
[08:42:34] sorry for the spam
[08:42:37] RECOVERY - Apache HTTP on mw2092 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.075 second response time
[08:43:03] Operations, Icinga, Monitoring, Traffic, HTTPS: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#2593846 (fgiunchedi)
[08:43:05] Operations, Cassandra, RESTBase-Cassandra, Patch-For-Review: Track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2593844 (fgiunchedi) Open>Resolved yes! thanks @Dzahn, resolving
[08:43:06] RECOVERY - nutcracker process on mw2093 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:43:06] RECOVERY - nutcracker port on mw2092 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[08:48:07] yeah mysql_exporter_last_scrape_error==1 (7 errors)
[08:52:05] indeed, in eqiad
[08:52:11] (PS2) Filippo Giunchedi: monitoring: add check_prometheus define [puppet] - https://gerrit.wikimedia.org/r/307269
[08:52:21] elukey: I think these icinga alerts are somewhat unavoidable: wmf-reimage re-creates the puppet host entry and during that the previous downtime gets wiped along with the icinga host record
[08:53:00] moritzm, it is worse for me, where puppet doesn't fully put a db back to a working state
[08:53:13] and that pages
[08:53:41] (PS1) Elukey: Raise the Varnishkafka maximum timeout for incomplete records to 1500 [puppet] - https://gerrit.wikimedia.org/r/307483
[08:54:08] maybe wmf-reimage could automatically ack services ?
[08:54:22] moritzm: ahhh it makes sense now
[08:54:34] I think it should rather_
[08:54:36] I think it should rather:
[08:54:49] - query a previous downtime and store it
[08:55:14] "query a previous downtime", what do you mean?
[08:55:16] - when recreating the puppet entry, re-enable the icinga downtime
[08:55:25] maybe we should only avoid the "cleaning puppet facts cache for mw2093.codfw.wmnet" occurrences?
[08:55:34] query icinga whether a downtime was/is set for the host
[08:55:59] IIRC cleaning the cert does not remove the fact
[08:56:14] but re-enabling the downtime is also quite ugly, since the icinga entries are only created with the puppet runs on neon
[08:56:24] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593871 (ema) p:Triage>High
[08:56:42] moritzm, yes, although in some cases, new alerts are introduced (e.g. puppet change that is dependent on upgrade)
[08:56:44] elukey: yeah, I was about to try that with the next scaler I'm reimaging
[08:57:02] jynus: ah, indeed
[08:57:42] I did a bunch of those when updating mariadb 5.5 -> mariadb 10, I also updated coredb puppet class to mariadb::core
[08:57:50] and of course, there is new installs
[08:58:36] in those cases, a timed ack would work better
[09:00:22] godog, in some cases there is pending permissions; but in others the metrics works well from localhost
[09:00:40] godog, potential vlan issue?
[09:00:53] or firewall?
[09:02:36] jynus: since the metric is there prometheus was able to talk to mysqld_exporter so it should be reachable at least, what's an example of one that works from localhost?
[09:02:54] godog, labsdb1003
[09:03:36] I am fixing the others
[09:04:16] then there is db1069, which is a special case (7 separate mysql instances)
[09:05:58] jynus: I see labsdb 1008 and 1001 failed, but 1003 seems to work from the dashboards
[09:06:08] oh?
[09:06:13] that may be new
[09:07:54] I may have fixed it without knowing it?
[09:09:12] hehe even better, I see the metrics from this morning
[09:09:19] (CR) Alexandros Kosiaris: [C: -1] "comment inline. Should be relatively easy to amend the class a bit to parameterize the password." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: Gehel)
[09:09:58] so I think that leaves db1069 and labsdb1005 (which I commented out yesterday)
[09:10:07] ACKNOWLEDGEMENT - MD RAID on copper is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 1, Spare: 0 Filippo Giunchedi https://phabricator.wikimedia.org/T144261
[09:11:52] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593892 (ema) Unfortunately quite a few requests on all ulsfo upload frontends are affected, as confirmed with: varnishlog -n frontend -q 'RespHeader ~ "Content-Length: 0" and Re...
[09:12:00] ACKNOWLEDGEMENT - MD RAID on wtp2016 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 1, Spare: 0 Muehlenhoff T144260
[09:13:35] RECOVERY - DPKG on ms-be1023 is OK: All packages OK
[09:15:35] (PS3) Filippo Giunchedi: prometheus: add to LVS [puppet] - https://gerrit.wikimedia.org/r/306672 (https://phabricator.wikimedia.org/T126785)
[09:16:34] (CR) Filippo Giunchedi: "> conftool-data/nodes/{eqiad,codfw}.yaml will also need to be" [puppet] - https://gerrit.wikimedia.org/r/306672 (https://phabricator.wikimedia.org/T126785) (owner: Filippo Giunchedi)
[09:17:11] (CR) Jcrespo: [C: 1] "Looks good. Do we setup an alert for production testing?" [puppet] - https://gerrit.wikimedia.org/r/307269 (owner: Filippo Giunchedi)
[09:18:27] Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593942 (Samtar)
[09:19:14] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2593946 (MoritzMuehlenhoff)
[09:20:19] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (MoritzMuehlenhoff) This got mentioned as needing ops involvement in SoS, but in yesterday's Ops meeting we weren't sure what kind of help...
[09:20:21] Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593955 (Nurg)
[09:21:43] (CR) Filippo Giunchedi: "> Looks good. Do we setup an alert for production testing?" [puppet] - https://gerrit.wikimedia.org/r/307269 (owner: Filippo Giunchedi)
[09:22:43] ref T144263 (above), I added operations though I'm not entirely sure if there's a more suitable tag for connectivity issues
[09:23:40] myrcx: also #netops would be a tag to add
[09:23:51] Operations, netops: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593958 (MoritzMuehlenhoff)
[09:23:52] cheers godog
[09:24:06] myrcx: mark can probably have a look, I've added netops to the task
[09:28:55] Operations, netops: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593965 (Nurg) Connection is ok just at the moment. So here is a tracert when all is going well. 2 11 ms 11 ms 10 ms default-rdns.callplus.co.nz [101.98.0.131] 3 12 ms 12 ms 11 ms...
[09:30:26] Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2593966 (MoritzMuehlenhoff)
[09:32:04] Operations, ops-codfw, hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2593972 (mark)
[09:32:32] !log Banned empty objects with status 200 from cache_upload ulsfo frontends (T144257)
[09:34:29] Operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2593977 (mark)
[09:34:44] (CR) Filippo Giunchedi: "LGTM, just a comment on the help text" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: Volans)
[09:34:50] have we lost the log bot?
[09:35:09] Operations, hardware-requests: CODFW: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142219#2593981 (mark)
[09:35:48] ema: seems so, last entry from SAL is at 2 in the morning
[09:36:13] Operations, Traffic, media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593985 (ema) >>! In T144257#2593892, @ema wrote: > Also, ulsfo upload backends don't seem to be affected. A rolling restart of the frontends in ulsfo is probably the easiest way t...
[09:38:07] (PS2) Volans: Reimaging: add option to reboot after the reimage [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536)
[09:38:37] (CR) Volans: Reimaging: add option to reboot after the reimage (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: Volans)
[09:40:35] I am curious when it is needed to reboot after a reimage?
[09:40:59] (CR) Filippo Giunchedi: [C: 1] Reimaging: add option to reboot after the reimage [puppet] - https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: Volans)
[09:41:11] moritzm: any idea how to rescue him? :)
[09:41:45] jynus: asked by elukey, to check that everything is working properly and AFAIK also so far to fix some issues with a cgroup
[09:41:55] was writing that :)
[09:41:59] good to know
[09:43:07] ema: not sure actually, maybe running from labs?
[09:43:17] jynus: some cgroups mount points have issues after the first puppet run afaics, and a reboot fixes them. I didn't investigate too much how to fix it straight away since a reboot is good anyway but it might be something to do as follow up
[09:43:55] jynus: the cgroup needed by hhvm on jessie only gets created during boot (and is not yet available on the first boot)
[09:43:57] oh no, I was asking in case I would need to do it too
[09:44:22] ema: https://wikitech.wikimedia.org/wiki/Morebots ?
[09:44:38] or perhaps https://wikitech.wikimedia.org/wiki/Tool:Stashbot?
[09:45:33] morebots, stashbot is separate
[09:45:46] ema: from the log is morebots
[09:45:47] morebots: Lives in #wikimedia-operations and logs to wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:59] logmsgbot is also missing
[09:46:51] yesterday someone was talking about a ban issue with freenode, cannot remember who but you can check the logs
[09:46:58] it might be related
[09:46:59] jynus: I don't think you'll need it, this is pretty special to mediawiki, it's used to prevent excessive resource consumption of external commands spawned by mediawiki
[09:47:39] moritzm: I don't agree, also for DB we do a reboot at the end to ensure everything is starting properly automatically IIRC
[09:47:53] I do not do a reboot
[09:48:06] on new server?
[09:48:31] only of mysql if we upgrade it [09:48:41] of course not useful if you need to check stuff manually before an eventual reboot [09:49:14] jynus: this script is for reimaging [09:49:22] so you get a new OS usually :) [09:49:39] volans: I'm not questioning the usefulness of your patch :-) hhvm is a big enough use case of its own and it's useful to have around in general as well [09:49:53] (03CR) 10Paladox: "What is a TLDR?" [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [09:50:27] ema: https://wikitech.wikimedia.org/wiki/Morebots#Example:_restart_the_ops_channel_morebot !log reimaging mw209[567] with Debian Jessie [09:51:19] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2594001 (10MoritzMuehlenhoff) [09:52:40] (03PS4) 10Paladox: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - 10https://gerrit.wikimedia.org/r/307462 [09:52:59] (03CR) 10Paladox: [C: 031] "Only changed the word from Wikimedia to Phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/307462 (owner: 10Paladox) [09:53:05] (03PS5) 10Paladox: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - 10https://gerrit.wikimedia.org/r/307462 [09:53:23] moritzm: yes, I got that ;) [09:53:26] thanks [09:54:45] 06Operations, 10netops: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593896 (10mark) I just did a reverse traceroute and I'm also not seeing problems at the moment... 
[09:55:28] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.12 upstream [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/307292 (owner: 10Gilles) [09:56:46] p858snake: thanks, morebots seems to be back [09:57:06] !log restarted morebots [09:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:16] \o/ [09:58:29] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:41] Logmsgbot also needs to be done (but its wikipage doesn't document how) - It's the bot that echos messages from scripts etc [09:59:05] (03CR) 10Ema: [C: 031] Raise the Varnishkafka maximum timeout for incomplete records to 1500 [puppet] - 10https://gerrit.wikimedia.org/r/307483 (owner: 10Elukey) [10:00:16] (03CR) 10Elukey: [C: 032] Raise the Varnishkafka maximum timeout for incomplete records to 1500 [puppet] - 10https://gerrit.wikimedia.org/r/307483 (owner: 10Elukey) [10:12:39] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2594020 (10Volans) [10:12:57] hashar: thanks for the suggestion, added poll, mostly for the sake to try it :) [10:13:10] volans: hello :) yeah poll is quite fun [10:21:53] hashar: it's a pity polls don't show up in project's boards, given that they allow comments too one could just create a poll instead of a task for some stuff [10:23:23] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2594023 (10mark) >>! In T140257#2553490, @thcipriani wrote: >>>! In T140257#2491705, @faidon wrote: >> I've deliber... 
[10:24:15] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:37:31] 06Operations: wmf-reimage and handling of "-n" option - https://phabricator.wikimedia.org/T144264#2594026 (10MoritzMuehlenhoff) [10:41:28] 06Operations: wmf-reimage and handling of "-n" option - https://phabricator.wikimedia.org/T144264#2594042 (10Volans) a:03Volans [10:42:27] (03PS1) 10Jcrespo: labsdb: Add firewall to new labsdb databases [puppet] - 10https://gerrit.wikimedia.org/r/307489 [10:43:29] (03PS1) 10Volans: Reimaging: Fix infinite loops when -n is set [puppet] - 10https://gerrit.wikimedia.org/r/307490 (https://phabricator.wikimedia.org/T144264) [10:43:56] (03PS2) 10Jcrespo: labsdb: Add firewall to new labsdb databases [puppet] - 10https://gerrit.wikimedia.org/r/307489 [10:46:44] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/307489 (owner: 10Jcrespo) [10:46:52] 06Operations, 10Cassandra, 10procurement: SSDs for repurposed AMS nodes - https://phabricator.wikimedia.org/T143935#2594050 (10mark) @RobH: please request a quote for some Intels via Dell NL, we shouldn't get into the Samsung mess at esams as well, and instead stick to our standard models there. 
[10:47:34] (03CR) 10Jcrespo: [C: 032] labsdb: Add firewall to new labsdb databases [puppet] - 10https://gerrit.wikimedia.org/r/307489 (owner: 10Jcrespo) [10:53:27] !log Cutting MediaWiki branch 1.28.0-wmf.17 T142117 [10:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:44] heh no stashbot either [10:54:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/307490 (https://phabricator.wikimedia.org/T144264) (owner: 10Volans) [10:59:16] volans the fixer :D [10:59:42] rotfl [11:01:00] !log roll-restart xenon/cerium/praseodymium cassandra instances to pick up new certs [11:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:05] 06Operations, 10netops: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2594075 (10Nurg) No further problems for a while now, so it may have come right. Thanks everyone. [11:12:19] (03CR) 10Elukey: "It looks awesome, just left a comment about the first puppet run and the reboot. Thanks for working on this Riccardo!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [11:14:33] (03CR) 10Volans: Reimaging: add option to reboot after the reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [11:16:01] volans: you can include a poll with curly braces! On the task just: {V10} [11:16:16] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2594077 (10SindyM3) Please help me: Several instances of abuse on the Wordpress installation on domain 'wikilovesmonuments.org' (server 'schippers.wikimedia.... 
[11:16:19] the poll will be embedded in the task details [11:16:33] oh cool, let's try [11:17:02] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169#2594078 (10Volans) [11:17:26] hashar phabmaster :D [11:38:38] role::mediawiki::appserver::canary_api Could not retrieve dependency 'Service[hhvm]' of File[/etc/hhvm/server.ini] [11:38:41] damn puppet [11:38:45] that is never ending :D [11:42:02] !log upgrading remaining jessie-based mw systems to hhvm 3.12.7 (now that the systemd unit override is in place) [11:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:47:52] !log banning elastic10(44|45|46|47) from elasticsearch eqiad cluster - T143685 [11:47:53] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [11:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:27] (03PS1) 10Gehel: reclaim nobelium - remove hiera host configuration [puppet] - 10https://gerrit.wikimedia.org/r/307493 (https://phabricator.wikimedia.org/T142581) [12:04:26] Hi. 
[12:05:01] PROBLEM - Apache HTTP on mw2100 is CRITICAL: Connection timed out [12:05:47] PROBLEM - nutcracker port on mw2100 is CRITICAL: Timeout while attempting connection [12:05:47] The EU swat window is quiet, perhaps we should retry Math extension on Wikitech [12:06:08] PROBLEM - nutcracker process on mw2100 is CRITICAL: Timeout while attempting connection [12:06:12] the mw2100 is a new reimaged server [12:06:37] PROBLEM - puppet last run on mw2100 is CRITICAL: Timeout while attempting connection [12:07:05] scheduled downtime Cc morebots [12:07:13] argh, moritzm :) [12:07:48] yeah, that's me, once "-n" is fixed, we can avoid that :-) [12:09:58] RECOVERY - Apache HTTP on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.075 second response time [12:10:38] RECOVERY - nutcracker port on mw2100 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:10:45] (03PS10) 10Gehel: elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 [12:11:07] RECOVERY - nutcracker process on mw2100 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:11:29] moritzm: but then we'll need to clean certs/keys manually right? [12:15:17] elukey: yes, but that is likely more controllable/shorter since the initial puppet run would then be completed, we'll need to test this [12:15:37] labswiki, but labtestwiki, how coherent [12:26:17] (03CR) 10Gehel: "puppet compiler seems to mostly agree with the change (https://puppet-compiler.wmflabs.org/3882/). 
I'm still looking into the changes rela" [puppet] - 10https://gerrit.wikimedia.org/r/304067 (owner: 10Gehel) [12:30:18] (03PS1) 10Dereckson: Enable Math on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) [12:35:48] (03CR) 10Volans: "2 minor comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307269 (owner: 10Filippo Giunchedi) [12:41:28] (03PS1) 10BBlack: upload VCL: workaround borked client Range: headers [puppet] - 10https://gerrit.wikimedia.org/r/307496 [12:45:01] (03CR) 10Ema: [C: 031] upload VCL: workaround borked client Range: headers [puppet] - 10https://gerrit.wikimedia.org/r/307496 (owner: 10BBlack) [12:49:18] (03CR) 10BBlack: [C: 032] upload VCL: workaround borked client Range: headers [puppet] - 10https://gerrit.wikimedia.org/r/307496 (owner: 10BBlack) [12:49:30] (03PS1) 10Ottomata: Rsync pageviews to labs nfs hosts [puppet] - 10https://gerrit.wikimedia.org/r/307502 (https://phabricator.wikimedia.org/T142671) [12:50:25] hashar: ready for swat? :) [12:50:44] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2594235 (10AlexKrauseTUD) Sorry to bother, somehow my corresponding private key file got corrupted and is unread... [12:53:47] Hi zeljkof. If hashar is busy, I can SWAT with you. 
[12:54:01] (03PS2) 10Ottomata: Rsync pageviews to labs nfs hosts [puppet] - 10https://gerrit.wikimedia.org/r/307502 (https://phabricator.wikimedia.org/T142671) [12:54:47] (03CR) 10Jcrespo: monitoring: add check_prometheus define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307269 (owner: 10Filippo Giunchedi) [12:55:53] zeljkof: Dereckson yeah please handle the swat :] [12:55:56] poke me as needed [12:56:14] if you get a Hangouts, I can join to listen [12:56:25] (03CR) 10Ottomata: [C: 032] Rsync pageviews to labs nfs hosts [puppet] - 10https://gerrit.wikimedia.org/r/307502 (https://phabricator.wikimedia.org/T142671) (owner: 10Ottomata) [12:56:50] Dereckson: want to chat here or hangouts? [12:57:14] I will probably need help if I am doing swat, this would be my third time, so I do not have a lot of experience :) [12:58:34] zeljkof: okay let me five minutes and I can join Hangout [12:59:10] Dereckson: ok, will send you invitation [13:00:04] hashar, Dereckson, addshore, and aude: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1300). [13:00:04] jynus, dcausse, and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:20] o/ [13:00:39] (03PS11) 10Gehel: elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 [13:00:48] I can SWAT today! 
(Dereckson will help) [13:04:52] Dereckson: this is the hangout [13:04:53] https://hangouts.google.com/hangouts/_/wikimedia.org/euswat [13:05:03] but I don't have your e-mail :| [13:05:54] mine is easy, we just need to test it breaks nothing on canary; it is a noop [13:06:09] (03PS1) 10Jcrespo: prometheus mysqld exporter: add all pending database instances [puppet] - 10https://gerrit.wikimedia.org/r/307503 (https://phabricator.wikimedia.org/T126757) [13:06:23] Dereckson: found it on wikitech [13:06:45] !log removing unused openjdk 7 on maps1001 [13:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:56] technically I have deployment rights, so give the other people more preference [13:07:28] jynus: ok, should I leave your for the end? [13:07:39] yes, no problem with that [13:08:11] jynus: ok, looking at the second one https://gerrit.wikimedia.org/r/#/c/307484/ [13:08:46] hashar: the usual hangout [13:09:18] dcausse: around and ready? [13:09:25] zeljkof: yes [13:10:59] !log cleanup openjdk 7 on maps2002 - T142977 [13:11:00] T142977: Maps - remove multiple JVM versions from maps servers - https://phabricator.wikimedia.org/T142977 [13:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:01] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: Maps - remove multiple JVM versions from maps servers - https://phabricator.wikimedia.org/T142977#2594299 (10Gehel) maps1001 is clean, osmosis upgraded and openjdk 7 removed. maps2001 still has cassandra using openjdk 7. Clean up is as follow: 1. upd... 
[13:14:27] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 449.72 seconds [13:14:47] checking, low priority [13:14:58] (03Abandoned) 10Rush: admin: allow all active users to be applied [puppet] - 10https://gerrit.wikimedia.org/r/244471 (https://phabricator.wikimedia.org/T114161) (owner: 10Rush) [13:15:29] (03CR) 1020after4: [C: 031] Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - 10https://gerrit.wikimedia.org/r/307462 (owner: 10Paladox) [13:16:39] update collection script doesn't love db1047 [13:17:01] looking at https://gerrit.wikimedia.org/r/#/c/306595/ [13:18:21] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [13:18:31] (03PS3) 10Zfilipin: Allow bureaucrats to manage account creators group on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [13:20:45] (03CR) 10Zfilipin: Allow bureaucrats to manage account creators group on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [13:20:54] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [13:21:35] (03Merged) 10jenkins-bot: Allow bureaucrats to manage account creators group on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [13:22:53] (03CR) 10Alex Monk: "I thought local-multiwrite was also swift" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [13:24:55] PROBLEM - dhclient process on mw2101 is CRITICAL: Connection refused by host [13:24:56] PROBLEM - Check size of conntrack table on 
mw2103 is CRITICAL: Connection refused by host [13:25:04] PROBLEM - Check size of conntrack table on mw2101 is CRITICAL: Connection refused by host [13:25:04] PROBLEM - nutcracker process on mw2103 is CRITICAL: Connection refused by host [13:25:14] PROBLEM - nutcracker port on mw2101 is CRITICAL: Connection refused by host [13:25:15] PROBLEM - DPKG on mw2103 is CRITICAL: Connection refused by host [13:25:35] PROBLEM - Disk space on mw2101 is CRITICAL: Connection refused by host [13:25:36] PROBLEM - DPKG on mw2101 is CRITICAL: Connection refused by host [13:25:36] PROBLEM - puppet last run on mw2103 is CRITICAL: Connection refused by host [13:25:36] PROBLEM - Disk space on mw2103 is CRITICAL: Connection refused by host [13:25:36] PROBLEM - nutcracker process on mw2101 is CRITICAL: Connection refused by host [13:25:54] PROBLEM - HHVM processes on mw2103 is CRITICAL: Connection refused by host [13:25:54] PROBLEM - puppet last run on mw2101 is CRITICAL: Connection refused by host [13:25:55] PROBLEM - Apache HTTP on mw2101 is CRITICAL: Connection refused [13:25:55] PROBLEM - nutcracker port on mw2103 is CRITICAL: Connection refused by host [13:26:05] PROBLEM - HHVM rendering on mw2103 is CRITICAL: Connection refused [13:26:05] PROBLEM - salt-minion processes on mw2103 is CRITICAL: Connection refused by host [13:26:05] PROBLEM - HHVM processes on mw2101 is CRITICAL: Connection refused by host [13:26:15] PROBLEM - salt-minion processes on mw2101 is CRITICAL: Connection refused by host [13:26:21] ^ harmless, fixing [13:26:34] PROBLEM - HHVM rendering on mw2101 is CRITICAL: Connection refused [13:26:43] PROBLEM - NTP on mw2103 is CRITICAL: NTP CRITICAL: No response from NTP server [13:26:45] PROBLEM - NTP on mw2101 is CRITICAL: NTP CRITICAL: No response from NTP server [13:26:54] PROBLEM - configured eth on mw2103 is CRITICAL: Connection refused by host [13:27:17] PROBLEM - dhclient process on mw2103 is CRITICAL: Connection refused by host [13:29:19] (03CR) 10Gehel: "Puppet 
compiler is happy: https://puppet-compiler.wmflabs.org/3883/" [puppet] - 10https://gerrit.wikimedia.org/r/304067 (owner: 10Gehel) [13:30:00] (03PS2) 10Dereckson: Enable Math on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) [13:35:31] jynus, dcausse: sorry, looks like we will run out of time today and will not be able to deploy your patches [13:35:47] zeljkof: ok no problem [13:36:03] maybe I can deploy it myself? [13:36:07] this is my third deploy and I am really really slow and scap is taking forever [13:36:26] PROBLEM - salt-minion processes on mw2100 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:37:20] 06Operations, 06Discovery, 06Maps, 10Maps-data, 07Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2594379 (10Gehel) [13:37:22] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: Maps - remove multiple JVM versions from maps servers - https://phabricator.wikimedia.org/T142977#2594377 (10Gehel) 05Open>03Resolved openjdk 7 is removed from the maps clusters. [13:37:30] jynus: probably in another deploy window [13:41:07] (03PS2) 10Gehel: reclaim nobelium - remove hiera host configuration [puppet] - 10https://gerrit.wikimedia.org/r/307493 (https://phabricator.wikimedia.org/T142581) [13:41:16] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Puppet has 8 failures [13:42:02] (03CR) 10Aaron Schulz: Enable Math on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [13:42:31] (03CR) 10Gehel: [C: 032] reclaim nobelium - remove hiera host configuration [puppet] - 10https://gerrit.wikimedia.org/r/307493 (https://phabricator.wikimedia.org/T142581) (owner: 10Gehel) [13:44:02] but I do not see scap running? [13:44:45] jynus: hashar, Dereckson can you check? 
[13:44:53] maybe something is wrong on my end [13:45:10] it may be slow because backula is doing a backup right now [13:45:17] the last thing on my screen is [13:45:18] 13:30:05 Started sync-masters [13:45:23] sync-masters: 0% (ok: 0; fail: 0; left: 1) [13:45:56] jynus: how do I check if scap is running? [13:46:17] well, I checked the processes of the machine as root and saw nothing related running [13:46:31] jynus: ok, I just got this :( [13:46:32] packet_write_wait: Connection to 208.80.154.149: Broken pipe [13:46:40] packet_write_wait: Connection to UNKNOWN: Broken pipe [13:46:48] I would check the logs, but do not know where they are [13:46:52] should I repeat the last command? [13:46:58] yes, please [13:46:59] scap sync-file wmf-config/InitialiseSettings.php... [13:47:07] ok, will do [13:47:21] 06Operations, 10Traffic: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2594406 (10elukey) 05Open>03Resolved [13:47:22] yeah, that definitely died at least minutes ago [13:47:55] ok, looks like it is way faster this time [13:48:04] sync-masters: 100% (ok: 1; fail: 0; left: 0) [13:48:21] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:306595|Allow bureaucrats to manage account creators group on ar.wikipedia (T143844)]] (duration: 00m 50s) [13:48:22] T143844: "Accounts creator" permission in arwiki - https://phabricator.wikimedia.org/T143844 [13:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:05] hashar: can you join the hangout again, please [13:51:35] (03CR) 10Filippo Giunchedi: [C: 031] prometheus mysqld exporter: add all pending database instances [puppet] - 10https://gerrit.wikimedia.org/r/307503 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [13:51:56] RECOVERY - salt-minion processes on mw2100 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:53:26] !log ema@palladium conftool action 
: set/pooled=no; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=varnish-be']) [13:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:47] godog, I do not know about 307503, on one side, it may be too bold, on the other, it may already be useful for some issues [13:54:40] jynus: I think we should do it :D If something breaks it is easy enough to revert [13:55:00] zeljkof: the scap fails because mw2103.codfw.wmnet has probably been reinstalled [13:55:05] yes, in fact, the dangerous one is the installation [13:55:20] so the ssh known host key on tin.eqiad.wmnet is out of date. Puppet will update it eventually [13:55:54] zeljkof, on those cases, complain to an ops (comment it here), and they will give you guidance [13:56:16] hashar: ignore mw2103, it's being reimaged [13:56:22] it [13:56:23] see^? [13:56:59] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:57:09] zeljkof: mw2101 mw2103 and mw2098 see moritzm message :] [13:58:18] (03PS12) 10Gehel: elasticsearch - move relforge to its own role [puppet] - 10https://gerrit.wikimedia.org/r/304067 [13:58:24] (03PS4) 10Hashar: Sort s3.dblist in lexicographical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302223 (owner: 10Jcrespo) [13:58:51] (03CR) 10Hashar: [C: 032] Sort s3.dblist in lexicographical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302223 (owner: 10Jcrespo) [13:59:19] \o/ [13:59:25] (03Merged) 10jenkins-bot: Sort s3.dblist in lexicographical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302223 (owner: 10Jcrespo) [14:00:07] (03CR) 10Hashar: "I cant see a reason the files would had an intentional order. Definitely looks like weird copy pasted / legacy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/302223 (owner: 10Jcrespo) [14:00:14] jynus: merging and deploying https://gerrit.wikimedia.org/r/#/c/302223/ [14:00:20] thank you [14:00:55] we had a bit of a discussion if we should use binary or ut8 [14:01:07] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:31] dcausse: still around? [14:01:36] hashar: yes [14:01:41] I proferred utf8 so in the future we can have "💩wiki" [14:01:44] !log European SWAT extended [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:51] \o/ [14:02:16] dcausse: I have CR+2 your patch https://gerrit.wikimedia.org/r/#/c/307484/ [14:02:20] jynus: can you test the patch at mw1099? [14:02:21] zeljkof will deploy it [14:02:28] thanks! :) [14:02:30] there is nothing to test? [14:02:32] zeljkof, certainly [14:03:18] jynus: it is at mw1099, test at will :) [14:03:27] it may take a bit, as I will be querying all wikis once [14:03:35] (03CR) 10Alexandros Kosiaris: [C: 031] prometheus: add to LVS [puppet] - 10https://gerrit.wikimedia.org/r/306672 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [14:03:55] sounds safe [14:04:48] (03PS2) 10Mobrovac: Parsoid: Switch to Scap3 deployments [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) [14:05:04] jynus: take your time [14:05:09] yes, I double checked with Kren* and no list requires ordering [14:05:21] in fact, we already had a disorder in the first place [14:05:31] but it doesn't hurt double checking [14:06:56] 06Operations, 10MediaWiki-extensions-CentralNotice, 10Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2591355 (10Pcoombe) CentralNotice is reliant on JavaScript, and doesn't work on browsers where MediaWiki only offers [[https://www.mediawiki.org/wiki... 
[14:08:46] 06Operations, 10MediaWiki-extensions-CentralNotice, 10Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2594482 (10BBlack) Ah, that probably puts a fork in this for notifying truly-outdated clients this way, then. The only case we could capture with CN... [14:09:54] zeljkof: hashar: I've some network issues to join hangout, my SSH session is fine, but hangout says there is a network issue. [14:10:56] bacula was using 200% of the cpu, probably if didn't acount for morning deploys on tin [14:11:30] Dereckson: thanks, hashar is helping me [14:12:05] dcausse: still around? [14:12:12] zeljkof: yes [14:12:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice, thanks for that! comments inline!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [14:12:31] great, https://gerrit.wikimedia.org/r/#/c/307484/ is merged, deploying it [14:13:50] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 8 failures [14:14:02] !log Moved mediawiki-core-phpcs job back to Nodepool T143938 [14:14:03] T143938: Bring back jobs to Nodepool - https://phabricator.wikimedia.org/T143938 [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:02] zeljkof, I checked 300+ wikipedias, all continue returning 200 OK after the patch [14:15:38] jynus: ok, deploying [14:15:49] RECOVERY - dhclient process on mw2101 is OK: PROCS OK: 0 processes with command name dhclient [14:16:00] RECOVERY - nutcracker process on mw2101 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:16:29] RECOVERY - Check size of conntrack table on mw2101 is OK: OK: nf_conntrack is 0 % full [14:16:40] RECOVERY - salt-minion processes on mw2101 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:16:55] RECOVERY - Apache HTTP on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 
10975 bytes in 0.073 second response time [14:16:55] RECOVERY - nutcracker port on mw2101 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:16:58] some slow-parse infos only [14:17:13] RECOVERY - Disk space on mw2101 is OK: DISK OK [14:17:19] (I suppose some of those wikis are not very frequently visited) [14:17:52] do you know if X-debug forces ignoring existing cache at app level? [14:19:57] jynus: if that was a question for me, I do not know :) [14:20:07] in general, for the channel [14:20:14] scap stuck at: sync-apaches: 99% (ok: 334; fail: 1; left: 1) [14:20:39] !log zfilipin@tin Synchronized dblists/s3.dblist: SWAT: [[gerrit:302223|Sort s3.dblist in lexicographical order]] (duration: 02m 44s) [14:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:46] that last one could be the one down that has to timeout [14:21:08] jynus: ok, scap has finished [14:21:09] did it say which one, or did it succeeded? [14:21:21] sync-apaches: 100% (ok: 334; fail: 2; left: 0) [14:21:24] (03CR) 10Volans: "@akosiaris thanks for the review and the comments, although I'll probably abandon this change." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307482 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [14:21:37] ok, I will check the 2 that failed [14:21:41] mw2103.codfw.wmnet returned [255]: Host key verification failed. 
[14:21:49] mw2101.codfw.wmnet port 22: Connection timed out [14:21:50] hashar: yes I can test on mw1099 [14:21:56] RECOVERY - DPKG on mw2101 is OK: All packages OK [14:22:03] yes, those are currently down [14:22:06] do not worry [14:22:14] they will sync on start [14:23:00] dcausse: working on https://gerrit.wikimedia.org/r/#/c/307484/ [14:23:10] zeljkof: thanks :) [14:25:55] (03CR) 10Filippo Giunchedi: monitoring: add check_prometheus define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307269 (owner: 10Filippo Giunchedi) [14:25:58] dcausse: the change is at mw1099, test at will :) [14:26:09] (03PS3) 10Filippo Giunchedi: monitoring: add check_prometheus define [puppet] - 10https://gerrit.wikimedia.org/r/307269 [14:26:26] ;:D [14:26:29] (03PS4) 10Filippo Giunchedi: prometheus: add to LVS [puppet] - 10https://gerrit.wikimedia.org/r/306672 (https://phabricator.wikimedia.org/T126785) [14:27:22] PROBLEM - Apache HTTP on mw2098 is CRITICAL: Connection refused [14:28:02] zeljkof: it works, thanks! 
[14:28:07] \O/ [14:28:18] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add to LVS [puppet] - 10https://gerrit.wikimedia.org/r/306672 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [14:28:39] dcausse: ok, doing the scap then [14:28:41] 06Operations, 10Cassandra, 10procurement: SSDs for repurposed AMS nodes - https://phabricator.wikimedia.org/T143935#2583786 (10Peachey88) This task is currently sitting in #procurement and the main space (Visible to everyone) [14:29:12] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:31:39] (03PS2) 10Jcrespo: prometheus mysqld exporter: add all pending database instances [puppet] - 10https://gerrit.wikimedia.org/r/307503 (https://phabricator.wikimedia.org/T126757) [14:32:17] (03PS1) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307510 [14:32:44] 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2594580 (10Papaul) p:05Triage>03Normal [14:32:55] !log zfilipin@tin Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/CirrusSearch.php: SWAT: [[gerrit:307484|Initialize the UserTesting framework before creating a Connection]] (duration: 00m 49s) [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:48] (03CR) 10jenkins-bot: [V: 04-1] keyholder-proxy/agent: Convert to base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307510 (owner: 10Muehlenhoff) [14:34:09] zeljkof, hashar: thanks you! 
:) [14:34:23] !log filippo@palladium conftool action : set/pooled=yes; selector: prometheus2001.codfw.wmnet [14:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:37] !log filippo@palladium conftool action : set/pooled=yes; selector: prometheus1001.eqiad.wmnet [14:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:48] dcausse: scap is done, failed for mw2103 and mw2101 [14:34:57] Lord Prometheus is rising [14:34:59] :D [14:35:12] !log European SWAT is done! [14:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:17] (03PS2) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307510 [14:35:26] !log mw2101 running scap pull , it missed bunch of files [14:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:33] hahah let's see if I can catch icinga before it pages [14:36:04] elukey, do not worry, I am about to bring it down [14:36:09] (03CR) 10Jcrespo: [C: 032] prometheus mysqld exporter: add all pending database instances [puppet] - 10https://gerrit.wikimedia.org/r/307503 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:39:17] !log Applying security patches for 1.28.0-wmf.17 T142117 [14:39:18] T142117: MW-1.28.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T142117 [14:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:58] !log banning objects with status code 200 and content-length 0 from upload backends in ulsfo T144257 [14:39:59] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257 [14:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:08] (03CR) 10Alexandros Kosiaris: "Indeed. 
This is mostly due to labs shipping it's own hiera.yaml configuration and keeping very little stuff from production. There aren't " [puppet] - 10https://gerrit.wikimedia.org/r/302695 (owner: 10Alexandros Kosiaris) [14:40:55] (03PS3) 10Ottomata: Set up Zookeeper cluster for Druid [puppet] - 10https://gerrit.wikimedia.org/r/306196 (https://phabricator.wikimedia.org/T138263) [14:41:02] !log bounce pybal to pick up prometheus.svc [14:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:24] TIL: ProxyFetch wants exactly a 200, 204 doesn't seem to do it by default [14:42:36] :) [14:42:44] godog: "bound pybal" where? :) [14:42:47] err bounce [14:43:43] bblack: err, yeah [14:44:01] !log bounce pybal to pick up prometheus.svc on low-traffic in eqiad/codfw [14:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:13] godog: not all at once, right? [14:44:45] (03CR) 10Ottomata: [C: 032] Set up Zookeeper cluster for Druid [puppet] - 10https://gerrit.wikimedia.org/r/306196 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [14:45:17] bblack: nope, only lvs1006 so far, I was checking the logs and noticed the 200 vs 204 issue [14:45:26] godog: not that I care about the messaging being that pedantic, I just worry that maybe we've failed to really communicate about how risky pybal restarts are, especially simultaneous ones [14:45:30] ok [14:46:19] RECOVERY - Disk space on mw2103 is OK: DISK OK [14:46:54] !log banning objects with status code 200 and content-length 0 from upload frontends in ulsfo T144257 [14:46:55] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257 [14:46:58] RECOVERY - configured eth on mw2103 is OK: OK - interfaces up [14:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:00] but yeah I usually start with the standby and move from there [14:47:18] RECOVERY - dhclient process on 
mw2103 is OK: PROCS OK: 0 processes with command name dhclient [14:47:32] * godog shakes fist at icinga still not showing up prometheus.svc [14:47:40] RECOVERY - nutcracker port on mw2103 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:48:01] RECOVERY - nutcracker process on mw2103 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:48:28] RECOVERY - Check size of conntrack table on mw2103 is OK: OK: nf_conntrack is 0 % full [14:48:29] RECOVERY - salt-minion processes on mw2103 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:48:41] RECOVERY - DPKG on mw2103 is OK: All packages OK [14:48:57] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 2 failures [14:49:20] (03PS1) 10Hashar: Group0 to 1.28.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307517 (https://phabricator.wikimedia.org/T142117) [14:51:03] !log hashar@tin Started scap: testwiki to php-1.27.0-wmf.17 T142117 [14:51:04] T142117: MW-1.28.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T142117 [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:18] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: puppet fail [14:51:39] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:51:40] godog: sometimes that takes cycling through running agent on: service, puppetmaster, then neon [14:51:43] I think [14:51:57] maybe the puppetmaster run just wastes time to avoid another race condition :) [14:52:12] (03PS1) 10Ottomata: Move package zookeeper install into nodemanager.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/307520 (https://phabricator.wikimedia.org/T138263) [14:52:14] 06Operations, 10Domains, 10Traffic: Guapopedia - https://phabricator.wikimedia.org/T144276#2594645 (10Joaquinito01) [14:52:46] haha it is possible [14:53:00] (03CR) 10Ottomata: 
[C: 032] Move package zookeeper install into nodemanager.pp [puppet/cdh] - 10https://gerrit.wikimedia.org/r/307520 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [14:53:23] but yeah especially for monitoring the current way shows how inadeguate it is to base "reality" on whether or not puppet keys are signed [14:53:39] (03PS1) 10Ottomata: cdh submodule update with zookeeper package change [puppet] - 10https://gerrit.wikimedia.org/r/307521 (https://phabricator.wikimedia.org/T138263) [14:54:50] 06Operations, 10Domains, 10Traffic: Guapopedia - https://phabricator.wikimedia.org/T144276#2594645 (10BBlack) https://meta.wikimedia.org/wiki/Proposals_for_new_projects ? [14:55:41] (03CR) 10Ottomata: [C: 032] cdh submodule update with zookeeper package change [puppet] - 10https://gerrit.wikimedia.org/r/307521 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [14:58:32] (03PS1) 10Filippo Giunchedi: prometheus: return 200 for / [puppet] - 10https://gerrit.wikimedia.org/r/307522 [14:59:11] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:00:04] hoo: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1500). Please do the needful. [15:02:28] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: return 200 for / [puppet] - 10https://gerrit.wikimedia.org/r/307522 (owner: 10Filippo Giunchedi) [15:05:22] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2594683 (10Dzahn) @SindyM3 I think this should probably have a new ticket. Do you know who is admin of schippers.wikimedia.nl ? 
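Note on the 200 vs 204 exchange above (14:42:24, and the "prometheus: return 200 for /" patch at 14:58:32): the log only says that the ProxyFetch check wants exactly a 200 by default. The difference between a strict check and a more lenient "any 2xx" check can be sketched roughly as below. The function names are hypothetical and are not taken from PyBal; this is only an illustration of the two policies.

```shell
#!/bin/sh
# Hypothetical helpers, not PyBal code: they only illustrate the difference
# between "exactly 200" and "any 2xx" when judging a health-check response.

# Strict policy: only HTTP 200 counts as healthy (the behaviour described in the log).
status_ok_strict() {
    [ "$1" -eq 200 ]
}

# Lenient policy: any 2xx status counts as healthy.
status_ok_any_2xx() {
    [ "$1" -ge 200 ] && [ "$1" -lt 300 ]
}

# Possible usage against a real endpoint (the URL is a placeholder):
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://service.example/)
#   status_ok_strict "$code" && echo healthy || echo unhealthy
```

Under the strict policy a service answering 204 on / would be treated as down, which is consistent with the patch changing the response to 200 rather than changing the check.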
[15:07:32] 06Operations, 06Labs: Connect secondary nic for labstore1004 and labstore1005 - https://phabricator.wikimedia.org/T144183#2594689 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson @chasemp Second NIC cabled up for both...updated switch description and enabled port. Vlan will need to be updated. labstore100... [15:07:34] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2594692 (10Cmjohnson) [15:08:40] PROBLEM - Zookeeper Server on druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [15:09:00] PROBLEM - Zookeeper Server on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [15:09:09] PROBLEM - Zookeeper Server on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [15:09:24] ah that's me! 
[15:09:26] silencing [15:09:33] :) [15:09:34] Invalid config, exiting abnormally [15:09:34] that's a new cluster, not a main one [15:09:37] ah ok [15:09:43] good, I did not know much about druid [15:09:45] yeah, its busted, and then standup meeting happened [15:09:58] so i got paused [15:11:01] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:12:00] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:12:36] RECOVERY - Apache HTTP on mw2098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.743 second response time [15:13:26] so we are outputing 300K QPS right now on eqiad [15:16:20] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail [15:16:27] notbad.gif [15:20:46] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Make elasticsearch configuration more robust to loss of network connectivity - https://phabricator.wikimedia.org/T143552#2594725 (10Gehel) I actually think that the example in the description is a good start. This will detect a failing node in... [15:25:42] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: puppet fail [15:28:59] (03CR) 10Greg Grossmeier: [C: 031] "I have a 3 second wait in my irssi config for this reason, isn't that common-ish? Is there a better way?" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [15:33:45] hm [15:34:57] hm indeed [15:35:07] !log restart pybal on lvs primaries in codfw/eqiad [15:35:08] probably transient? 
[15:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:52] (03PS1) 10Ottomata: Make sure zookeeper package is installed on namenodes too [puppet/cdh] - 10https://gerrit.wikimedia.org/r/307524 (https://phabricator.wikimedia.org/T138263) [15:37:36] (03CR) 10Ottomata: [C: 032] Make sure zookeeper package is installed on namenodes too [puppet/cdh] - 10https://gerrit.wikimedia.org/r/307524 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [15:39:27] (03CR) 10BryanDavis: "The cleanest way to do this would be to listen for the ack of the identify command and then join channels. I was mostly too lazy to figure" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [15:39:41] (03PS1) 10Ottomata: Update cdh module with zookeeper package on namenodes [puppet] - 10https://gerrit.wikimedia.org/r/307525 [15:40:56] (03CR) 10Ottomata: [C: 032] Update cdh module with zookeeper package on namenodes [puppet] - 10https://gerrit.wikimedia.org/r/307525 (owner: 10Ottomata) [15:41:16] 06Operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#2594754 (10BBlack) In the most-recent discussions about this, I've been liking the idea of breaking this down into frontend and backend caching. Frontend is heavy-traffic and high-complexity (in VCL/vmod terms).... 
[15:44:14] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:44:32] !log hashar@tin Finished scap: testwiki to php-1.27.0-wmf.17 T142117 (duration: 53m 28s) [15:44:33] T142117: MW-1.28.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T142117 [15:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:13] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:37] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2594773 (10bd808) >>! In T136429#2593946, @MoritzMuehlenhoff wrote: > This got mentioned as needing ops involvement in SoS, but in yesterday's Ops me... [15:59:24] (03PS1) 10Rush: nodepool: bump up ready states, max, and rate [puppet] - 10https://gerrit.wikimedia.org/r/307526 (https://phabricator.wikimedia.org/T143938) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1600). Please do the needful. [16:00:04] hashar, Krenair, and mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:01:02] (03PS1) 10Hoo man: Enable allowDataAccessInUserLanguage on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307527 (https://phabricator.wikimedia.org/T122670) [16:02:37] (03CR) 10Hoo man: [C: 032] Enable allowDataAccessInUserLanguage on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307527 (https://phabricator.wikimedia.org/T122670) (owner: 10Hoo man) [16:03:03] (03Merged) 10jenkins-bot: Enable allowDataAccessInUserLanguage on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307527 (https://phabricator.wikimedia.org/T122670) (owner: 10Hoo man) [16:03:51] so puppet swat, Krenair here? [16:04:02] yep [16:04:29] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable allowDataAccessInUserLanguage on Wikidata (T122670) (duration: 00m 51s) [16:04:30] T122670: [Task] Enable allowDataAccessInUserLanguage on Wikidata - https://phabricator.wikimedia.org/T122670 [16:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:10] (03PS2) 10Filippo Giunchedi: logging: remove reference to deployment-fluoride [puppet] - 10https://gerrit.wikimedia.org/r/305660 (owner: 10Alex Monk) [16:10:15] (03PS2) 10Rush: nodepool: bump up ready states, max, and rate [puppet] - 10https://gerrit.wikimedia.org/r/307526 (https://phabricator.wikimedia.org/T143938) [16:10:39] (03CR) 10Filippo Giunchedi: [C: 032] logging: remove reference to deployment-fluoride [puppet] - 10https://gerrit.wikimedia.org/r/305660 (owner: 10Alex Monk) [16:11:54] godog, I think the first two should effectively change nothing in prod [16:12:20] third will add an extra line that does nothing in prod [16:13:06] hey [16:13:16] Krenair: ack, thanks yeah I'm looking at them now [16:13:21] looks like I forgot a term window that was doing "sync scap" [16:13:29] !log hashar@tin Synchronized php-1.28.0-wmf.17/includes/EditPage.php: (no message) (duration: 00m 48s) [16:13:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:35] hope nothing bad happened [16:14:01] godog: o/ [16:14:12] hashar: nope, I've skipped yours for now [16:15:17] I'm always on the fence when we start moving code to templates heh [16:15:42] godog, oh and the second fixes a puppet file header line [16:16:52] ok thanks Krenair [16:17:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "I'm a bit on the fence about moving code to templates just to expand variables, could it be done by reading the environment and defaulting" [puppet] - 10https://gerrit.wikimedia.org/r/305767 (owner: 10Alex Monk) [16:18:45] don't we already have a ton of code in templates just to expand variables? [16:18:48] (03PS3) 10Filippo Giunchedi: Change-Prop: Rerender summary on wikidata item update [puppet] - 10https://gerrit.wikimedia.org/r/306857 (owner: 10Ppchelko) [16:19:32] Krenair: it is possible, I haven't checked [16:20:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1001[2] - https://phabricator.wikimedia.org/T143900#2594913 (10Cmjohnson) [16:21:42] Krenair: to be clear, I'm not trying to add red tape on purpose but seeing if we can move puppet towards being less specific [16:22:16] IOW code-in-templates is usually smell, it couldn't/wouldn't happen if such code was shipped e.g. 
by debian packages [16:23:08] mobrovac: I'm going to merge https://gerrit.wikimedia.org/r/#/c/306857/3 [16:23:24] kk godog, thnx [16:23:39] (03CR) 10Filippo Giunchedi: [C: 032] Change-Prop: Rerender summary on wikidata item update [puppet] - 10https://gerrit.wikimedia.org/r/306857 (owner: 10Ppchelko) [16:23:45] godog: i'll run puppet and restart once that's merged [16:24:16] afk for a couple minutes or so, errands at home [16:24:25] mobrovac: kk, {{done}} [16:24:33] thnx [16:24:56] 06Operations, 10Domains, 10Traffic: Guapopedia - https://phabricator.wikimedia.org/T144276#2594923 (10Aklapper) 05Open>03stalled a:05Joaquinito01>03None Hi @Joaquinito01, thanks for taking the time to report this! Unfortunately this report lacks some information. If you have time and can still reprod... [16:26:00] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [16:26:32] (03PS5) 10Filippo Giunchedi: contint: bump pip 7.0.1 -> 8.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/289639 (owner: 10Hashar) [16:28:47] (03CR) 10Filippo Giunchedi: [C: 032] contint: bump pip 7.0.1 -> 8.1.2 [puppet] - 10https://gerrit.wikimedia.org/r/289639 (owner: 10Hashar) [16:28:52] \o/ [16:29:28] the next about hiera_lookup is fixing hiera_lookp based on my local testing and tests i did on labs months ago. Will get it covered one day with rspec [16:30:29] ok thanks I'll take a look [16:34:39] (03PS8) 10Filippo Giunchedi: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [16:35:04] (03PS2) 10Filippo Giunchedi: contint: stop including arcanist on Precise [puppet] - 10https://gerrit.wikimedia.org/r/307143 (owner: 10Hashar) [16:36:15] godog, I should warn you this will be harder to review [16:37:06] no relying on clear no-ops: this could cause issues with logs in prod if we get it wrong [16:37:35] Krenair: ok! 
no problem, perhaps add me to the code review so I can take a look before swat [16:39:28] (03CR) 10Filippo Giunchedi: [C: 032] hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [16:39:54] (03CR) 10Filippo Giunchedi: [C: 032] contint: stop including arcanist on Precise [puppet] - 10https://gerrit.wikimedia.org/r/307143 (owner: 10Hashar) [16:39:59] (03PS3) 10Filippo Giunchedi: contint: stop including arcanist on Precise [puppet] - 10https://gerrit.wikimedia.org/r/307143 (owner: 10Hashar) [16:42:29] godog, what do you think we should do about fatalmonitor? [16:42:38] 07Puppet, 05Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2594971 (10hashar) Last patch landed in puppet.git so that is definitely fixed. [16:42:48] it's not run automatically, it's a script to help shell users [16:43:14] last one of my serie is to get jenkins-debian-glue package updated for jessie ( https://phabricator.wikimedia.org/T141114 ) [16:43:46] (03PS3) 10Rush: nodepool: bump up ready states, max, and rate [puppet] - 10https://gerrit.wikimedia.org/r/307526 (https://phabricator.wikimedia.org/T143938) [16:44:22] Krenair: something like log_directory=${LOG_DIRECTORY:-/srv/mw-log} perhaps, you get the idea [16:44:38] ottomata elukey did you see the druid page? [16:44:47] godog, no, I'm not sure I do [16:45:01] godog, he mentiond it was a mistake [16:45:11] (otto) [16:45:19] because he was in a meeting [16:45:25] 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2594972 (10Papaul) @MoritzMuehlenhoff can you please tell me in which slot we have the broken disk? Thanks. 
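Note on godog's suggestion at 16:44:22: the `${LOG_DIRECTORY:-/srv/mw-log}` form is standard shell parameter expansion that falls back to a default when the variable is unset or empty. A minimal sketch follows, assuming a hypothetical helper name (`resolve_log_directory`); it is not the actual fatalmonitor script.

```shell
#!/bin/sh
# Hypothetical sketch, not the real fatalmonitor: pick the log directory from
# the environment, defaulting to /srv/mw-log as in the suggestion above.

resolve_log_directory() {
    # ${VAR:-default} expands to $VAR if it is set and non-empty,
    # otherwise to the default text after ":-".
    printf '%s\n' "${LOG_DIRECTORY:-/srv/mw-log}"
}

# A script would then use the resolved value, for example:
#   log_directory=$(resolve_log_directory)
#   tail -f "$log_directory/some.log"   # file name is a placeholder
```

With this pattern the default needs no template expansion, and a host with a different layout overrides it per invocation, e.g. `LOG_DIRECTORY=/a/mw-log fatalmonitor`.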
[16:45:38] That means instead of users running 'fatalmonitor' on fluorine.eqiad.wmnet, they'll have to run 'log_directory=/a/mw-log fatalmonitor' [16:45:38] mistake is too strong; a false positive, I mean [16:46:09] Krenair: you can get fatalmonitor to source a set of defaults from /etc/wikimedia/fatalmonitor.sh or something like [16:46:11] I mean, LOG_DIRECTORY=/a/mw-log fatalmonitor [16:46:26] then have just that etc file to be an erb.template [16:46:35] we don't actually have an /etc/wikimedia/ do we? [16:47:00] be bold ! :D [16:47:04] or just stick it in /etc/ [16:47:19] or wherever debian wants to get a default file for a script [16:47:20] godog ya sorry [16:47:23] hashar, No. I am not creating an /etc/wikimedia/. [16:47:25] i scheduled downtime for it [16:47:33] v consufsed about hiera atm [16:47:38] trying to fix [16:47:58] ottomata: ah nevermind I just got the sms but didn't check the timestamp -.- [16:48:10] We do use /etc/profile.d/mediawiki.sh... [16:49:30] role::kafka::main::broker should compile into a catalogue without dependency cycles [16:49:36] looks like rspec is happy! [16:50:20] Krenair: I see what you mean now for fluorine, on the specific I'm surprised there isn't a symlink in place for compatibility but that's beside the point [16:50:32] godog: do you have some hiera foo? i'd ask j o e but it seems he's not around [16:51:12] ottomata: yeah he's on vacation this week, but no I don't have any hiera mojo [16:51:16] I'm actually happy about it [16:51:17] aw maaan [16:51:41] godog: sorry gotta leave :/ jenkins-debian-glue is already manually installed on the CI slaves so T141114 will be a noop. Else we can catch up tomorrow [16:51:41] T141114: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114 [16:51:43] sorry (: [16:51:47] git review just asked me to login... [16:51:48] wtf? [16:51:59] hashar: np, we can take a look tomorrow or thurs [16:52:21] godog: AH HA! 
you know has hiera foo [16:52:22] elukey: [16:52:32] his tip: git commit your files [16:52:35] godog: or just publish it :] it will be fine! [16:53:02] hashar: fair enough, can you ping me tomorrow morning? [16:53:10] (03PS1) 10Ottomata: Add druid zk host specific hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/307532 (https://phabricator.wikimedia.org/T138263) [16:53:21] ottomata: haha I'm told that works sometimes yeah [16:53:41] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:53:46] hm! [16:54:49] (03CR) 10Ottomata: [C: 032] Add druid zk host specific hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/307532 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [16:55:24] godog: puppet merging your arcanist change, ok? [16:56:21] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [16:56:30] (03PS1) 10RobH: new star.wmfusercontent.org certificates [puppet] - 10https://gerrit.wikimedia.org/r/307534 (https://phabricator.wikimedia.org/T140649) [16:56:40] ottomata: whoop, yes thanks! [16:56:40] did it, hope that's ok godog [16:56:43] (03PS2) 10Jforrester: Enable VisualEditor by default for logged-in users on Indic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) [16:57:00] (03CR) 10Jforrester: [C: 031] "Now announced and good to go in an hour's time." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) (owner: 10Jforrester) [16:58:39] !log shutting down elasticsearch on elastic1047 to prepare moving server - T143685 [16:58:40] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [16:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:37] robh i think it is strange that it also happended to you and mutante possibly related to the upgrade to gerrit 2.12. [16:59:46] RECOVERY - Zookeeper Server on druid1001 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [16:59:53] COuld you report it here https://storyboard.openstack.org/#!/project/719 and https://bugs.chromium.org/p/gerrit/issues/list?cursor=gerrit%3A267 please? [16:59:59] Yeah, my .git/config contents seemed to reference https [17:00:01] Since im not sure which one is the cause of this [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1700). [17:00:12] once i changed it to ssh like similar to the other lines it worked [17:00:16] Yep [17:00:24] i plan to depl graphoid [17:00:25] no parsoid deploy [17:00:38] Deffitly a bug if it changed from ssh to http [17:00:51] just indeed very odd to suddenly happen to multiple folks across distros so I'd err to think its indeed a file contents change we pushed out. 
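Note on the git review login prompt discussed above: the fix described was changing an https remote in .git/config to ssh. A rough sketch of that rewrite is below. The helper name, the example host and user, and the assumptions that the https URL carries an `/r/` prefix and that SSH listens on port 29418 (Gerrit's default SSH port) are illustrative and not confirmed by the log.

```shell
#!/bin/sh
# Hypothetical helper: turn an https Gerrit remote URL into an ssh one.
# Assumes https URLs look like https://HOST/r/PROJECT and that SSH is
# served on port 29418; adjust both if a given setup differs.

to_ssh_url() {
    user=$1
    https_url=$2
    rest=${https_url#https://}   # HOST/r/PROJECT
    host=${rest%%/*}             # HOST
    path=${rest#*/}              # r/PROJECT
    path=${path#r/}              # PROJECT
    printf 'ssh://%s@%s:29418/%s\n' "$user" "$host" "$path"
}

# Possible usage inside a clone (remote name "origin" is an assumption):
#   git remote -v                       # inspect the current URLs
#   git remote set-url origin "$(to_ssh_url alice "$(git remote get-url origin)")"
```

Editing .git/config by hand, as described in the log, has the same effect as `git remote set-url`.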
[17:01:14] (03PS8) 10Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 [17:01:55] yep [17:02:34] (03CR) 10jenkins-bot: [V: 04-1] Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 (owner: 10Alex Monk) [17:02:42] RECOVERY - Zookeeper Server on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [17:03:12] RECOVERY - Zookeeper Server on druid1002 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg [17:03:28] (pages) [17:05:05] :( [17:05:08] those are good pages! [17:05:12] !log shutting down elasticsearch on elastic1044 to prepare moving server - T143685 [17:05:12] recoveries [17:05:14] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [17:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:34] actually, i would prefer if those didn't page all of ops... 
[17:06:23] fixing [17:08:31] !log shutting down elasticsearch on elastic1045 to prepare moving server - T143685 [17:08:33] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [17:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:18] (03PS9) 10Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 [17:12:07] (03PS1) 10Ottomata: Don't page if druid-eqiad zookeeper cluster has a server down [puppet] - 10https://gerrit.wikimedia.org/r/307541 (https://phabricator.wikimedia.org/T138263) [17:12:35] (03PS2) 10Ottomata: Don't page if druid-eqiad zookeeper cluster has a server down [puppet] - 10https://gerrit.wikimedia.org/r/307541 (https://phabricator.wikimedia.org/T138263) [17:15:25] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2595125 (10Multichill) >>! In T118388#2594683, @Dzahn wrote: > @SindyM3 I think this should probably have a new ticket. Do you know who is admin of schippers.... 
[17:16:33] (03CR) 10Ottomata: [C: 032] Don't page if druid-eqiad zookeeper cluster has a server down [puppet] - 10https://gerrit.wikimedia.org/r/307541 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [17:16:41] (03CR) 10Ottomata: "Looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/307541 (https://phabricator.wikimedia.org/T138263) (owner: 10Ottomata) [17:17:10] (03PS10) 10Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 [17:19:29] (03PS1) 10Mobrovac: Revert "Change-Prop: Rerender summary on wikidata item update" [puppet] - 10https://gerrit.wikimedia.org/r/307544 [17:21:03] (03CR) 10Ottomata: Remove the hard-coded /a/mw-log references scattered around everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305767 (owner: 10Alex Monk) [17:21:14] mobrovac: looking at the revert [17:21:22] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Change-Prop: Rerender summary on wikidata item update" [puppet] - 10https://gerrit.wikimedia.org/r/307544 (owner: 10Mobrovac) [17:23:23] (03CR) 10Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305767 (owner: 10Alex Monk) [17:30:15] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 37.02 seconds [17:34:02] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2595202 (10hashar) Too many patches for today puppet swat. @fgiunchedi and I will sync tomorrow. [17:34:36] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations, 10Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2595207 (10Cmjohnson) @AlexKrauseTUD Do you have a wikitech account? If not can you please create one. 
Thanks [17:36:45] 06Operations, 10Monitoring, 10Traffic, 07HTTPS: adjust ssl certificate montioring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2595236 (10RobH) [17:37:40] 06Operations, 10Monitoring, 10Traffic, 07HTTPS: adjust ssl certificate montioring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2595249 (10RobH) [17:41:04] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2595280 (10thcipriani) >>! In T140257#2594023, @mark wrote: > we can't open arbitrary firewall holes between labs i... [17:45:54] (03CR) 10Dzahn: [C: 032] "matching logo URL https://phab.wmfusercontent.org/file/data/qzfmum4xnhfoqpl7ws7r/PHID-FILE-rs3pf2brupiulr6zcnrg/Wikimedia-Phabricator-logo" [puppet] - 10https://gerrit.wikimedia.org/r/307462 (owner: 10Paladox) [17:46:00] (03PS6) 10Dzahn: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - 10https://gerrit.wikimedia.org/r/307462 (owner: 10Paladox) [17:47:45] (03PS7) 10Paladox: phabricator: Set logoImagePHID and wordmarkText in fixed_settings.yaml [puppet] - 10https://gerrit.wikimedia.org/r/307462 [17:48:04] Thats a bug ^^, mutante did that. [17:48:09] fixed in gerrit 2.12.4 [17:48:29] yep, i used the edit button in browser [17:48:31] !log shutting down elasticsearch on elastic1046 to prepare moving server - T143685 [17:48:32] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [17:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:40] twentyafterfour: paladox: applied on iridium [17:51:16] mutante thanks, it wont take effect until twentyafterfour does the next phabricator update [17:51:17] :) [17:51:35] paladox: yep, but i see it was needed to unblock that kind of. 
thanks for patch [17:51:50] Yep [17:51:53] your welcome [17:51:59] +1 [17:52:03] thanks [17:52:14] :) [17:52:14] yw [17:55:27] sadly, no graphoid depl today, too broken :( [17:58:37] mutante: can you log what you just did? I'm getting a new warning as a phab admin "You have 2 unresolved setup issues..." [17:58:50] andre__: ^ [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1800). [18:00:04] James_F and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:09] Heya. [18:01:05] Hiya! I can SWAT today. [18:01:44] So if I get it right https://gerrit.wikimedia.org/r/#/c/307462/ is in preparation for the next pull from upstream, and as long as that won't happen Phab admins will see the "2 issues" warning. Hmm, well, okay. [18:01:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) (owner: 10Jforrester) [18:02:13] (03Merged) 10jenkins-bot: Enable VisualEditor by default for logged-in users on Indic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304117 (https://phabricator.wikimedia.org/T142586) (owner: 10Jforrester) [18:02:56] (03PS1) 10Cmjohnson: Relocated elastic1044[5-7], updated dns changes to reflect to rack locations [dns] - 10https://gerrit.wikimedia.org/r/307546 [18:03:07] thcipriani: canceling my patches, i'll remove from wikitech now. In pre testing there is something different between prod and where we tuned our parameters, and we arn't sure what yet which makes the test wrong... 
[18:03:20] ebernhardson: ack, thanks for the heads up [18:03:47] (03CR) 10Cmjohnson: [C: 032] Remove db1027 from internal dns entries [dns] - 10https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [18:04:39] (03CR) 10Cmjohnson: [C: 032] Relocated elastic1044[5-7], updated dns changes to reflect to rack locations [dns] - 10https://gerrit.wikimedia.org/r/307546 (owner: 10Cmjohnson) [18:05:03] hasharAway: are you done with train prep? Can I revert wikiversions.json? [18:05:27] * thcipriani assumes the answer is probably yes here, but gives it a second anyway [18:05:43] kk reverting [18:06:35] * James_F hopes that doesn't break everything. ;-) [18:06:35] James_F: change is live on mw1099, check please [18:06:44] oh boy :) [18:07:31] thcipriani: Yup, LGTM. [18:08:16] James_F: cool, thanks for checking. Is order of sync important here? IS.php then dblists fine? [18:08:49] thcipriani: IS first. [18:08:50] !log change-prop deploying c793e4a2 [18:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:56] kk, doing [18:11:19] (03PS4) 10Rush: nodepool: bump up ready states, max, and rate [puppet] - 10https://gerrit.wikimedia.org/r/307526 (https://phabricator.wikimedia.org/T143938) [18:11:22] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2595369 (10Dzahn) ok, cool. 
thanks @Multichill [18:11:33] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:304117|Enable VisualEditor by default for logged-in users on Indic-script Wikipedias (T142586)]] PART I (duration: 00m 48s) [18:11:34] T142586: Enable VisualEditor by default for all users of all Indic script Wikipedias - https://phabricator.wikimedia.org/T142586 [18:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:29] Operations, Community-Tech, wikidiff2, Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2595372 (kaldari) @akosiaris, @MaxSem: Is this live now? [18:12:31] !log thcipriani@tin Synchronized dblists/visualeditor-nondefault.dblist: SWAT: [[gerrit:304117|Enable VisualEditor by default for logged-in users on Indic-script Wikipedias (T142586)]] PART II (duration: 00m 49s) [18:12:31] T142586: Enable VisualEditor by default for all users of all Indic script Wikipedias - https://phabricator.wikimedia.org/T142586 [18:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:38] ^ James_F live everywhere [18:12:56] Awesome. Thank you! [18:13:20] (CR) Rush: [C: 2] nodepool: bump up ready states, max, and rate [puppet] - https://gerrit.wikimedia.org/r/307526 (https://phabricator.wikimedia.org/T143938) (owner: Rush) [18:13:45] James_F: :) [18:17:53] !log restart nodepool [18:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:38] chasemp: I probably failed to document it, but changes to nodepool.yaml do not need a restart [18:21:58] hasharAway: yep I figured it out but thanks [18:22:19] chasemp: the daemon reread it on each iteration apparently. That is how last week I have hacked up the rate/min-server by just editing the file [18:22:39] iteration as in each rate tick? 
[18:22:50] I would guess yes [18:28:03] !log deploying latest version of wikidata query service [18:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:02] Operations, Community-Tech, wikidiff2, Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2595394 (MaxSem) Apparently not: ``` maxsem@mw1234:~$ dpkg -l '*wikidiff*' Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/... [18:30:58] SMalyshev: deployment completed, tests are OK [18:31:08] gehel: coolio! [18:31:46] gehel: thank you [18:31:56] SMalyshev: my pleasure! [18:37:12] is viewing https://upload.wikimedia.org/wikipedia/commons/c/c6/Path_to_Whitedell_Cottages-North_Wallington_-_geograph.org.uk_-_747873.jpg broken for anyone else? [18:37:51] (CR) Dzahn: "@Paladox a short summary of a long text. -> https://en.wiktionary.org/wiki/TLDR" [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: Paladox) [18:39:12] (PS3) Paladox: Add git.legoktm.com to system.gitconfig.erb for phabricator [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [18:39:16] (CR) Paladox: "Ok thanks." [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: Paladox) [18:39:21] (PS4) Paladox: Add git.legoktm.com to system.gitconfig.erb for phabricator [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [18:43:09] (Abandoned) BryanDavis: toollabs::static: Prune and gc git clone [puppet] - https://gerrit.wikimedia.org/r/307111 (owner: BryanDavis) [18:45:10] (CR) Dzahn: "eh yea, sorry, i had not seen your comment before and ran into the same thing. 
let's figure that out, will also look later" [puppet] - https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: Chad) [18:48:23] (CR) Merlijn van Deen: Set up a root password for Labs instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: Andrew Bogott) [18:50:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 729 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5035335 keys - replication_delay is 729 [18:52:09] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2595469 (Dzahn) @Faidon @Robh I don't have much of a personal preference here, Activating and deactivating the list are both easy and not much work. I also once thought it would be possible to kill... [18:59:42] (PS1) Cmjohnson: Updating dns for elastic1046 and 1047 [dns] - https://gerrit.wikimedia.org/r/307557 [18:59:46] o/ [18:59:48] jouncebot: ping [18:59:55] (CR) Ottomata: Remove the hard-coded /a/mw-log references scattered around everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:00:05] TBD: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T1900). [19:00:11] (PS2) Hashar: Group0 to 1.28.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/307517 (https://phabricator.wikimedia.org/T142117) [19:00:53] (PS1) DCausse: Fix CirrusSearch BM25 A/B test similiraty config [mediawiki-config] - https://gerrit.wikimedia.org/r/307558 [19:00:58] (CR) Cmjohnson: [C: 2] Updating dns for elastic1046 and 1047 [dns] - https://gerrit.wikimedia.org/r/307557 (owner: Cmjohnson) [19:01:24] roll the drums [19:01:31] Group0 to 1.28.0-wmf.17 ! 
[19:01:32] ah k great [19:01:51] (CR) Hashar: [C: 2] Group0 to 1.28.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/307517 (https://phabricator.wikimedia.org/T142117) (owner: Hashar) [19:01:53] (PS1) Dzahn: archiva: migration class to rsync data to new host [puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) [19:02:33] (Merged) jenkins-bot: Group0 to 1.28.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/307517 (https://phabricator.wikimedia.org/T142117) (owner: Hashar) [19:03:35] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.17 T142117 [19:03:36] T142117: MW-1.28.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T142117 [19:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:05] 201 Undefined index: mVersion in /srv/mediawiki/php-1.28.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php on line 494 [19:04:09] that is starting "well" [19:04:43] (PS1) BryanDavis: contint: fix resource conflict with service::deploy::common [puppet] - https://gerrit.wikimedia.org/r/307561 [19:05:28] roll backing [19:05:39] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: Rollback 1.28.0-wmf.17 T142117 [19:05:39] T142117: MW-1.28.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T142117 [19:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:21] daww [19:07:57] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/3888/meitnerium.wikimedia.org/change.meitnerium.wikimedia.org.err amending .." 
[puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) (owner: Dzahn) [19:08:01] (PS1) Hashar: Revert "Group0 to 1.28.0-wmf.17" [mediawiki-config] - https://gerrit.wikimedia.org/r/307562 (https://phabricator.wikimedia.org/T142117) [19:08:02] hashar caused by https://phabricator.wikimedia.org/rECAUa2f2440d3e204ba5bc8e30a7f78f36fc0828651b [19:08:21] (CR) Hashar: [C: 2] "Comes from tin.eqiad.wmnet. Already deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/307562 (https://phabricator.wikimedia.org/T142117) (owner: Hashar) [19:08:50] (Merged) jenkins-bot: Revert "Group0 to 1.28.0-wmf.17" [mediawiki-config] - https://gerrit.wikimedia.org/r/307562 (https://phabricator.wikimedia.org/T142117) (owner: Hashar) [19:08:53] paladox: mention it on the task please https://phabricator.wikimedia.org/T144307 . And also kudos for finding it so fast [19:08:54] (CR) BryanDavis: "I removed the cherry-pick of this that was on deployment-puppetmaster because it has conflicts with the head of the production branch as w" [puppet] - https://gerrit.wikimedia.org/r/306308 (owner: Ppchelko) [19:09:03] Your welcome and ok [19:10:15] Dereckson: good to know someone else is watching the logs as well :] [19:10:57] (PS2) Dzahn: archiva: migration class to rsync data to new host [puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) [19:11:01] merged mine into your task as you've already references to patch etc. 
[19:11:29] (PS2) BryanDavis: contint: fix resource conflict with service::deploy::common [puppet] - https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) [19:12:00] (CR) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:12:02] (CR) BryanDavis: "Cherry-picked to deployment-puppetmaster to fix broken puppet runs on deployment-sca0[12]." [puppet] - https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: BryanDavis) [19:12:23] hashar https://github.com/wikimedia/mediawiki-extensions-CentralAuth/search?utf8=%E2%9C%93&q=mVersion+ [19:13:43] (CR) jenkins-bot: [V: -1] archiva: migration class to rsync data to new host [puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) (owner: Dzahn) [19:14:29] hashar: ah I know. We have to do the same thing as with User. Simple fix. Hold on. [19:15:31] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2595632 (bd808) [19:15:59] AaronSchulz: follow up in #mediawiki-core :] [19:16:25] There s a mediawiki-core channel [19:16:26] LOL [19:16:36] What is mediawiki channel used for [19:16:41] Support requests [19:16:46] (CR) Dzahn: [C: 2] "works now. no-op on active server titanium. 
http://puppet-compiler.wmflabs.org/3889/" [puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) (owner: Dzahn) [19:17:37] (CR) Ottomata: Remove the hard-coded /a/mw-log references scattered around everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:18:24] paladox: #mediawiki is historically the channel for developers of mediawiki AND end users [19:18:43] paladox: though most devs also used #wikimedia-dev which is no an hell place thanks to bot and is less used [19:19:01] (CR) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:19:08] nowadays #mediawiki-core is more or less a channel for developers of mediawiki, with #mediawiki being for end users [19:19:39] Oh [19:19:44] thanks for explaning [19:19:44] getting rake failure [19:19:50] 19:13:42 Gem::RemoteFetcher::FetchError: Errno::ETIMEDOUT: Connection timed out - connect(2) for "rubygems.global.ssl.fastly.net" port 443 (https://rubygems.org/gems/rgen-0.6.6.gem) [19:19:58] fastly.. uhm [19:20:27] (PS11) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767 [19:20:36] is around the corner from office though [19:21:21] PROBLEM - Disk space on stat1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%) [19:21:44] mutante what do you mean by "is around the corner from office though"? [19:21:52] paladox: fastly the company [19:22:07] Oh never herd of it [19:22:33] it's a CDN and the rake check [19:22:47] wants to connect to rubygems.... 
[19:23:10] Oh [19:23:16] rubygems.org uses fastly [19:23:20] (CR) jenkins-bot: [V: -1] Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:23:32] and for a moment fastly wasnt working [19:24:08] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/307559 (https://phabricator.wikimedia.org/T123725) (owner: Dzahn) [19:24:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4988378 keys - replication_delay is 0 [19:24:54] (CR) Ottomata: "Do you need to override log_directory => '/a/mw-log' for prod somewhere?" [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:24:55] oh [19:24:59] paladox: https://www.fastly.com/why-fastly [19:25:24] Oh [19:25:26] thanks [19:25:43] faster page loads [19:25:59] But with an 80mbps connection most users wont notice page loading slow [19:26:01] (CR) Alex Monk: "I'm not creating log_directory new - it already existed, fluorine's node entry in manifests/site.pp sets it to /a/mw-log" [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:26:06] including those on 50mbps and 30mbps [19:26:12] yea, it works again. was a fluke [19:26:20] maybe maintenance on their side [19:26:59] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:27:10] oh [19:27:59] so Aaron got it sorted out and I will push wmf.17 again in a few [19:28:07] :) [19:28:46] hello Thehelpfulone [19:28:51] hi mutante! 
[19:28:57] hey Thehelpfulone [19:29:03] * Thehelpfulone waves [19:29:14] haven't seen you in ages [19:29:20] yea, welcome back [19:30:07] * hashar waves at Thehelpfulone [19:30:23] !log hashar@tin Synchronized php-1.28.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php: Remove verbose cache miss log that was making notices T144307 (duration: 00m 48s) [19:30:24] T144307: Undefined index: mVersion in /srv/mediawiki/php-1.28.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php on line 494 - https://phabricator.wikimedia.org/T144307 [19:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:06] (PS1) Hashar: Group0 to 1.28.0-wmf.17 (bis) [mediawiki-config] - https://gerrit.wikimedia.org/r/307565 (https://phabricator.wikimedia.org/T144307) [19:32:25] (CR) Hashar: [C: 2] Group0 to 1.28.0-wmf.17 (bis) [mediawiki-config] - https://gerrit.wikimedia.org/r/307565 (https://phabricator.wikimedia.org/T144307) (owner: Hashar) [19:32:41] (CR) Jforrester: "This is now due to go out next week (6 September)." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/304118 (https://phabricator.wikimedia.org/T142586) (owner: Jforrester) [19:32:48] (PS2) Jforrester: Enable VisualEditor by default for logged-out users on Indic-script Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/304118 (https://phabricator.wikimedia.org/T142586) [19:32:51] (Merged) jenkins-bot: Group0 to 1.28.0-wmf.17 (bis) [mediawiki-config] - https://gerrit.wikimedia.org/r/307565 (https://phabricator.wikimedia.org/T144307) (owner: Hashar) [19:34:01] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.17 (bis) T144307 [19:34:02] T144307: Undefined index: mVersion in /srv/mediawiki/php-1.28.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php on line 494 - https://phabricator.wikimedia.org/T144307 [19:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:05] (PS5) Paladox: phabricator: allow mirroring from git.legoktm.com into Diffusion [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [19:38:13] !log 1.28.0-wmf.17 successfully pushed to group0 [19:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:50] hashar: I'm feeling, that the save time (and the parsing of Parsoid, e.g. 
if I change a template in VE) is much higher than before the deployment :/ But this could be highly subjective [19:39:23] (and I'm not sure, if "last week" in https://grafana.wikimedia.org/dashboard/db/save-timing means the values of "last week", or anything else :/) [19:39:47] FlorianSW: "last week" should be the same datapoint with a 7 days delta [19:40:28] FlorianSW: that shows up nicely if you change the time span to "now-2w" [19:41:38] the VisualEditor API save time shows a slight bump from 6 sec to 8 sec [19:41:42] https://grafana.wikimedia.org/dashboard/db/visualeditor-load-save?panelId=9&fullscreen [19:41:58] but that seems to happen on each new version [19:42:28] ah, ok, then maybe I was just to fast in editing after the deployment :D [19:42:32] thanks for the info! [19:42:45] I guess some caches of some sort have to warm up [19:42:59] maybe your browser has to download / process / cache a bunch of new js/css due to some cache invalidation [19:43:07] honestly, I have absolutely no idea [19:43:23] but yeah the slowdown you have noticed seems to correlate to some graph [19:43:35] thank you to have reported it :] [19:43:53] np :P I'll test it again some hours later, just to make sure :D [19:45:40] (PS12) Ottomata: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:45:54] Krenair: lgtm: https://puppet-compiler.wmflabs.org/3890/fluorine.eqiad.wmnet/ [19:45:56] merging [19:46:47] ty [19:47:25] (CR) Ottomata: [C: 2] Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk) [19:47:42] !log rsyncing archiva data from titanium to meitnerium, runs in a screen [19:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:04] mutante: COoOoL [19:48:07] lemme know how that goes [19:48:14] will that be a transparent switchover? 
[19:48:56] ottomata: i dunno yet, moritz already made meitnerium and was on it, for now i was helping to copy the data over [19:49:08] Krenair: merged and puppet run on fluorine [19:49:09] ok mutante [19:49:19] ottomata: only pushing to new server so the old one is not changed [19:49:50] aye k [19:52:51] !log deployed https://gerrit.wikimedia.org/r/306710, moving 4 parsoid CI jobs from nodepool trusty to nodepool jessie [19:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:53] ottomata: wow, it's already done copying all those GBs.. so with my limited knowledge of archiva i figured we want to copy it all except not the config/archiva.xml. so i kept that as it was [19:57:14] (Abandoned) Alex Monk: mw-log-cleanup: remove wfDebug files in deployment-prep every week [puppet] - https://gerrit.wikimedia.org/r/305768 (owner: Alex Monk) [19:57:21] let's see if puppet is deploying that.. [19:58:36] mutante: aye, i think as long as config is correct for the new host, the datadirs can just be copied and it'll figure things out when it starts [19:58:52] ottomata: sounds good :) [20:06:02] (PS1) BBlack: depool upload in ulsfo [dns] - https://gerrit.wikimedia.org/r/307568 [20:06:45] (CR) BBlack: [C: 2] depool upload in ulsfo [dns] - https://gerrit.wikimedia.org/r/307568 (owner: BBlack) [20:07:24] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2595763 (RobH) * Using mailman: **pros: *** we already maintain mailman, any outage is noticed *** archive view sorted by date allows on clinic to easily add items to calendar as they were received... 
[20:09:10] !log cleanup /var/cache/iegreview for bd808 [20:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:24] thanks yuvipanda [20:13:17] (PS4) Yuvipanda: base: Use the standard location for puppet ca [puppet] - https://gerrit.wikimedia.org/r/257275 [20:16:23] !log restarting ferm on elasticsearch eqiad cluster after reinstall of elastic104[4567] - T143685 [20:16:24] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [20:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:40] Operations, ops-eqiad, Discovery, Elasticsearch, Discovery-Search (Current work): Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685#2595870 (Cmjohnson) elastic104[4-7] were moved to racks A6 and B6. elastic104[4-6] installed, pu... [20:39:21] (PS3) Andrew Bogott: Set up a root password for Labs instances [puppet] - https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) [20:41:58] (CR) Andrew Bogott: "Tested now, and it seems to exclude itself from puppet::self instances. Hard to test properly though." 
[puppet] - https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: Andrew Bogott) [20:48:22] (CR) Andrew Bogott: [C: 2] labsprojectfrommetadata: Pull project_id from new field [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [20:48:32] (PS2) Andrew Bogott: labsprojectfrommetadata: Pull project_id from new field [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [20:51:35] (CR) jenkins-bot: [V: -1] labsprojectfrommetadata: Pull project_id from new field [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [20:52:29] Operations, hardware-requests, Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2595926 (hashar) Regarding the use of a public IP: gallium had one for historical reasons and all uses have been... [20:52:37] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [20:58:58] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [20:59:10] (CR) Hashar: "Thanks Bryan that is encouraging. 
I am swamped in deployments this week, will revisit next week and probably formally announce the change" [mediawiki-config] - https://gerrit.wikimedia.org/r/301339 (https://phabricator.wikimedia.org/T129982) (owner: Hashar) [20:59:29] Operations, Wikimedia-Site-requests, Wikimedia-log-errors: Requests to localhost spam the 'localhost' and 'xff' log buckets - https://phabricator.wikimedia.org/T129982#2595934 (hashar) I am swamped in deployments this week, will revisit next week and probably formally announce the change then SWAT it. [21:12:08] (PS1) Madhuvishy: toollabs: Convert cdnjs pull cron command to one line [puppet] - https://gerrit.wikimedia.org/r/307621 (https://phabricator.wikimedia.org/T143637) [21:15:18] (PS2) Andrew Bogott: No longer set up config for our old project-id metadata creation [puppet] - https://gerrit.wikimedia.org/r/304750 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [21:15:59] (CR) Mobrovac: [C: 1] "Thank you Bryan!" 
[puppet] - https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: BryanDavis) [21:17:55] (PS2) Madhuvishy: toollabs: Convert cdnjs pull cron command to one line [puppet] - https://gerrit.wikimedia.org/r/307621 (https://phabricator.wikimedia.org/T143637) [21:18:18] (CR) Madhuvishy: [C: 2 V: 2] toollabs: Convert cdnjs pull cron command to one line [puppet] - https://gerrit.wikimedia.org/r/307621 (https://phabricator.wikimedia.org/T143637) (owner: Madhuvishy) [21:18:22] (CR) Andrew Bogott: [C: 2] No longer set up config for our old project-id metadata creation [puppet] - https://gerrit.wikimedia.org/r/304750 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [21:19:12] (PS3) Andrew Bogott: No longer set up config for our old project-id metadata creation [puppet] - https://gerrit.wikimedia.org/r/304750 (https://phabricator.wikimedia.org/T105891) (owner: Alex Monk) [21:29:08] (PS2) Andrew Bogott: Horizon puppet panel: Clean up config and defaults [puppet] - https://gerrit.wikimedia.org/r/307436 [21:31:11] (CR) Andrew Bogott: [C: 2] Horizon puppet panel: Clean up config and defaults [puppet] - https://gerrit.wikimedia.org/r/307436 (owner: Andrew Bogott) [21:51:01] Hi, could someone merge and deploy https://gerrit.wikimedia.org/r/#/c/306413/ please? 
[21:53:05] (PS1) Rush: labstore nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) [21:54:10] @Author of change thiemowmde / Author [21:54:10] Paladox [21:54:10] Aug 24, 2016 8:56 PM [21:54:15] that should be fixed then [21:54:27] Yep, all tested [21:54:34] by me and thiemowmde [21:54:42] On https://gerrit-test.wmflabs.org/ [21:55:44] (CR) jenkins-bot: [V: -1] labstore nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) (owner: Rush) [21:56:49] Operations, Analytics-Cluster, Patch-For-Review: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596103 (Dzahn) I have rsynced the entire /var/lib/archiva from titanium over to meitnerium, the new jessie server. One single file, the conf/archiv... [21:58:54] (PS1) Jdlrobson: Enable Wikidata description taglines on all projects... [mediawiki-config] - https://gerrit.wikimedia.org/r/307631 (https://phabricator.wikimedia.org/T143344) [21:59:00] cmjohnson1: Hi, would you mind https://gerrit.wikimedia.org/r/#/c/306413/ merging that and deploying please? 
[22:04:59] (PS1) Andrew Bogott: Typo-fix: puppet_config_backend should be an http url, not https [puppet] - https://gerrit.wikimedia.org/r/307635 [22:05:58] Oh he is not available [22:06:10] (CR) Andrew Bogott: [C: 2] Typo-fix: puppet_config_backend should be an http url, not https [puppet] - https://gerrit.wikimedia.org/r/307635 (owner: Andrew Bogott) [22:10:14] (PS1) Andrew Bogott: Forward horizon config to mitaka [puppet] - https://gerrit.wikimedia.org/r/307636 [22:11:41] (CR) Andrew Bogott: [C: 2] Forward horizon config to mitaka [puppet] - https://gerrit.wikimedia.org/r/307636 (owner: Andrew Bogott) [22:15:00] (CR) Dzahn: "needs review from releng" [puppet] - https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: Paladox) [22:16:17] (CR) Dzahn: "@Paladox can you confirm we already got past that issue (and abandon?)" [puppet] - https://gerrit.wikimedia.org/r/307335 (https://phabricator.wikimedia.org/T144112) (owner: Paladox) [22:16:49] (CR) Dzahn: "there is still a -1 from hashar on this" [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:17:50] (CR) Dzahn: "ignore last comment, wrong patch" [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:18:00] !log change-prop deploying a87a61d [22:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:47] Operations, netops: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2596223 (Peachey88) Open>Resolved >>! In T144263#2594075, @Nurg wrote: > No further problems for a while now, so it may have come right. Thanks everyone. Marking as closed, reopen or refile it... [22:21:02] ottomata: hmm, now we have "503 - service unavailable, but the archive service is running" [22:21:30] ottomata: on the new one.. 
nothing broken for users [22:26:51] (PS1) Mobrovac: Revert "Revert "Change-Prop: Rerender summary on wikidata item update"" [puppet] - https://gerrit.wikimedia.org/r/307641 [22:30:29] Operations, Analytics-Cluster, Patch-For-Review: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596235 (Dzahn) Now we have the data but still get an Error 503 - Service Unavailable from the new server, even though the archiva service is running... [22:33:18] Operations, Analytics-Cluster, Patch-For-Review: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596244 (Dzahn) the issue is caused by archiva user being a different UID on old and new server, which means permissions are messed up even when we p... [22:34:09] ostriches hi, if i ask mutante to merge https://gerrit.wikimedia.org/r/#/c/306413/ could you deploy it please? [22:34:37] Meh, if I have to. [22:34:43] ok [22:34:46] thanks [22:40:27] (PS11) Dzahn: Fix phabricator expanding links [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:41:30] (PS12) Paladox: Gerrit: Fix phabricator expanding links [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [22:41:58] howdy! SMalyshev and I have a question about https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups. The "analytics-users" group has access to Hadoop/Hive on stat1004, but "(NO PRIVATE DATA)". what is meant by that? can someone have access to hadoop/hive but not private data (e.g. a sanitized, PII-less subset of wmf.webrequest)? 
[22:42:00] (CR) Dzahn: [C: 2] Gerrit: Fix phabricator expanding links [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:42:11] (CR) Peachey88: [C: -1] "Minor -1:" [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:45:16] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: puppet fail [22:45:41] (PS13) Paladox: Gerrit: Fix phabricator expanding links [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [22:46:06] (CR) Paladox: "@Peachey88 done." [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:47:44] (PS3) EBernhardson: logging: Require acknowledgment of kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) [22:48:29] ostriches could you deploy ^^ it was merged [22:49:04] well it kinda deploys automatically :) [22:49:23] ostriches oh, but dosent gerrit need restarting to pick the config change [22:49:24] :) [22:50:52] Operations, Analytics-Cluster, Patch-For-Review: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596254 (Dzahn) fix running: root@meitnerium:/var/lib# find /var/lib/archiva/ -uid 108 -exec chown archiva:archiva {} \; [22:50:57] (CR) Paladox: "https://phabricator.wikimedia.org/T75997#1" [puppet] - https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: Paladox) [22:51:17] Operations, Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596255 (Dzahn) [22:52:40] paladox: Gerrit watches the config files and self-restarts if it detects changes :) [22:52:45] oh [22:52:49] :) :) [22:54:34] ottomata: fixed, permission issues because we use a 
different UID. fixed with find -exec, now i see the web ui on new server [22:56:02] Operations, Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2596268 (Dzahn) fixed, restarted service. i got the Archiva web UI now on meitnerium (when hacking my /etc/resolv.conf to point archiva.wm.org to it). [22:56:46] (PS1) Legoktm: releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649 [22:56:48] (PS1) Legoktm: releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650 [22:57:00] ostriches: ^ I have no idea if that's right. [22:57:08] * ostriches looks [22:57:59] (CR) jenkins-bot: [V: -1] releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm) [22:58:26] oh, I have to line up =>? ugh [22:58:33] (CR) jenkins-bot: [V: -1] releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm) [22:59:07] (CR) Chad: "Probably ok, see inline though." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm) [22:59:37] ostriches im wondering could you run puppet on lead to pickup the change for gerrit please? [23:00:02] ostriches: what's the current mode on the directory? [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160830T2300). [23:00:04] ebernhardson, jdlrobson, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:20] * RoanKattouw is here [23:00:31] \o [23:00:47] legoktm: 0775 [23:01:21] Gerrit restarting, warning for swatters. 
[23:01:26] (rather than it restarting halfway into swat)
[23:01:29] \o
[23:03:23] Eh, must not have made it to puppetmaster.
[23:03:52] ostriches: i'm merging on master
[23:03:58] done
[23:04:07] (CR) Legoktm: releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm)
[23:04:15] (PS2) Legoktm: releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649
[23:04:17] (PS2) Legoktm: releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650
[23:04:20] I guess I'll do the SWAT?
[23:05:31] (CR) jenkins-bot: [V: -1] releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm)
[23:06:22] Hmmmm
[23:06:25] Silly puppet....
[23:06:29] I missed a ,
[23:07:04] Post merge failed
[23:07:05] https://integration.wikimedia.org/ci/job/operations-puppet-doc/25962/console
[23:07:12] ostriches ^^ with something ruby
[23:07:15] (PS3) Legoktm: releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649
[23:07:17] (PS3) Legoktm: releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650
[23:07:17] paladox: it happens on every change recently
[23:07:22] paladox: there is a ticket
[23:07:23] oh
[23:07:30] thanks
[23:07:34] I don't speak ruby
[23:07:37] Or post-merge ;-)
[23:08:36] paladox: https://phabricator.wikimedia.org/T143233
[23:08:43] thanks
[23:08:48] Hmmm
[23:08:58] systemd is confused about gerrit right now
[23:09:03] But it's running.
[23:09:08] I'll figure it out after swat.
[23:09:17] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:09:23] alright
[23:09:28] paladox: general CI issue, not specific to this one
[23:10:03] oh
[23:10:04] (CR) Chad: [C: 1] releases: puppetize ownership of /srv/org/wikimedia/releases/mediawiki [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm)
[23:10:27] (CR) Chad: [C: 1] releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm)
[23:10:46] mutante: do you want to merge ^ those two releases.wm.o changes? :)
[23:11:13] mutante hashar says https://phabricator.wikimedia.org/T143233#2580487 so i guess we can update the job to rm the files
[23:11:51] RoanKattouw: are you doing it today?
[23:11:58] Ahm, crap, yes
[23:12:00] Got distracted, sorry
[23:12:06] I'll find power for my laptop and then start the SWAT
[23:12:12] sounds good :)
[23:12:36] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:14:42] (PS1) EBernhardson: Report partial result from mwgrep [puppet] - https://gerrit.wikimedia.org/r/307652 (https://phabricator.wikimedia.org/T127788)
[23:15:39] (CR) Dzahn: "how about just ensure_resource like in the lines right before that?" [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm)
[23:16:22] mutante: I couldn't find any documentation on how to pass group/owner/etc. using ensure_resource()
[23:17:30] paladox: i dont know
[23:17:41] legoktm: what's broken? is that caesium?
[23:17:42] Ok
[23:18:05] looks for permissions
[23:18:24] (CR) Catrope: [C: 2] Enable Wikidata description taglines on all projects... [mediawiki-config] - https://gerrit.wikimedia.org/r/307631 (https://phabricator.wikimedia.org/T143344) (owner: Jdlrobson)
[23:18:38] mutante: bromine.
Nothing is broken, ostriches and I were discussing how to add a new directory, and noticed that none of the existing ones were puppetized, so I decided to do that before adding a new one.
[23:18:50] (Merged) jenkins-bot: Enable Wikidata description taglines on all projects... [mediawiki-config] - https://gerrit.wikimedia.org/r/307631 (https://phabricator.wikimedia.org/T143344) (owner: Jdlrobson)
[23:20:40] jdlrobson: Your patch ---^^ is on mw1099, please test
[23:22:00] (PS1) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656
[23:23:26] (CR) jenkins-bot: [V: -1] wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656 (owner: Yuvipanda)
[23:23:44] RoanKattouw: on it
[23:24:00] (PS2) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656
[23:26:37] (PS3) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656
[23:26:40] RoanKattouw: follow up may be required. just debugging something.
[23:27:58] (CR) Dzahn: [C: 2] "compiled. checked bromine http://puppet-compiler.wmflabs.org/3891/" [puppet] - https://gerrit.wikimedia.org/r/307649 (owner: Legoktm)
[23:28:14] jdlrobson: OK. I want to proceed with other patches in the meantime; is it OK for your patch to be deployed to the cluster and be followed up on later, or is it not OK to be deployed in its current state?
[23:29:47] RoanKattouw: it might be that ive got Wikimedia debug setup incorrectly
[23:29:53] i'm seeing the wrong config variables in JS
[23:29:56] Hm
[23:30:04] (PS5) Yuvipanda: wmflib: Return default dir when role::puppet::self isn't used [puppet] - https://gerrit.wikimedia.org/r/307656
[23:30:08] jdlrobson: It may also be that I'm an idiot
[23:30:09] Stand by
[23:30:59] RoanKattouw: i'm testing on https://tr.m.wikipedia.org/wiki/Angela_Merkel and expecting to see taglines=>true in wgMFDisplayWikibaseDescriptions
[23:31:00] jdlrobson: OK, try now
[23:31:03] but i'm seeing taglines=>false
[23:31:09] I hadn't actually run git pull, and that turns out to be important
[23:31:10] BOOM
[23:31:14] RoanKattouw: works
[23:31:15] legoktm: first one done. wanted to make sure and compiled. this made a change though we lost the GUID bit
[23:31:19] lol that was fast
[23:31:25] OK, going to all servers with that change
[23:31:26] magic git pull
[23:31:26] :)
[23:31:37] legoktm: as in mode changed '2775' to '0775'
[23:32:00] so when releasers upload stuff it would be owned by them
[23:32:25] mutante: oh, we should probably switch to 2775? I don't have access to the server so I was relying on ostriches telling me it was 0775 :)
[23:32:45] Oh did I fucks up?
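A minimal sketch of the 2775 versus 0775 distinction in the exchange above, using only Python's standard `stat` module. The helper names `describe_mode` and `has_setgid` are hypothetical and nothing here touches the real releases directory; the two mode values are the only inputs taken from the log.

```python
import stat


def describe_mode(mode: int) -> str:
    """Render a directory's permission bits the way `ls -l` prints them."""
    return stat.filemode(stat.S_IFDIR | mode)


def has_setgid(mode: int) -> bool:
    """True when the setgid bit (the leading 2 in 2775) is set."""
    return bool(mode & stat.S_ISGID)


if __name__ == "__main__":
    # Compare the two modes mentioned in the conversation.
    for mode in (0o0775, 0o2775):
        print(f"{mode:04o} {describe_mode(mode)} setgid={has_setgid(mode)}")
```

With 0775 the listing reads drwxrwxr-x, which matches what was seen on the server; with 2775 the group execute slot shows `s` instead of `x`. On Linux, files created inside a setgid directory take the directory's group rather than the creating user's primary group.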
[23:32:48] i'll change that really quick, so it's just like before
[23:32:58] !log catrope@tin Synchronized dblists/nowikidatadescriptiontaglines.dblist: (no message) (duration: 00m 53s)
[23:32:59] I just saw drwxrwxr-x :)
[23:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:33:37] ahhh ok yeah
[23:33:41] 2 is best
[23:34:17] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:34:55] !log catrope@tin Synchronized wmf-config: Enable Wikidata description taglines on all projects except top 6 wikis (T143344) (duration: 00m 54s)
[23:34:56] T143344: Deploy Wikidata descriptions to mobile web Wikipedias stable channel 1st half - https://phabricator.wikimedia.org/T143344
[23:34:58] (PS4) Legoktm: releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650
[23:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:35:00] (PS1) Legoktm: releases: Set mediawiki directory to 2775 [puppet] - https://gerrit.wikimedia.org/r/307658
[23:35:05] mutante: ^
[23:35:18] oh, ok, i was also already making one
[23:35:36] oops
[23:35:45] (CR) Dzahn: [C: 2] "yea, this is how it was before when it wasnt puppetized yet" [puppet] - https://gerrit.wikimedia.org/r/307658 (owner: Legoktm)
[23:37:32] (CR) Dzahn: [V: 2] releases: Set mediawiki directory to 2775 [puppet] - https://gerrit.wikimedia.org/r/307658 (owner: Legoktm)
[23:37:52] (CR) jenkins-bot: [V: -1] releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm)
[23:38:54] that -1 is the random error again because of
[23:38:56] Gem::RemoteFetcher::FetchError: Errno::ETIMEDOUT: Connection timed out - connect(2) for "rubygems.global.ssl.fastly.net" port 443 (https://rubygems.org/gems/hiera-1.3.4.gem)
[23:39:02] because rubygems.org uses fastly..
and shrug
[23:39:13] nothing to do with the content of the change
[23:40:39] legoktm: should we make something like /releases/extensions or releases/mediawiki/extensions ?
[23:40:58] or straight into the root
[23:41:07] mutante: uhh, that's up to ostriches.
[23:41:10] mutante looks like others from elsewhere are getting the error
[23:41:14] the only other extension we might have in the future would be luasandbox
[23:41:25] https://phabricator.wikimedia.org/T144325
[23:42:15] legoktm: ok, it's fine this way
[23:42:47] (CR) Dzahn: [C: 2] releases: Add wikidiff2 directory [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm)
[23:43:11] (CR) Dzahn: [V: 2] "already verified before rebase" [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm)
[23:43:26] Operations, Labs, Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2596341 (yuvipanda) deployment-prep is done! \o/ None of the instances have duplicates in their puppet.conf either! \o/
[23:43:45] legoktm, mutante: It's not an extension.
[23:43:49] And it's not a MW release.
[23:43:53] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/307650 (owner: Legoktm)
[23:44:13] (or is MW even required, technically. although I dunno why you'd use it without MW)
[23:44:21] right, PHP* extension
[23:44:23] ostriches: oh, i was thinking https://www.mediawiki.org/wiki/Extension:Wikidiff2
[23:44:30] gotcha
[23:44:31] Bad page name :p
[23:45:14] hmm..waiting for a submit button
[23:45:41] there we go
[23:45:50] it actually did not let me submit with a jenkins-bot -1 on it
[23:46:40] Notice: /Stage[main]/Releases/File[/srv/org/wikimedia/releases/wikidiff2]/ensure: created
[23:46:44] legoktm: there it is
[23:46:59] woo, thanks :)
[23:47:02] np
[23:47:32] ostriches: would you like to upload a tarball now?
:)
[23:47:48] https://releases.wikimedia.org/wikidiff2/
[23:47:59] (it was cached didnt see in index)
[23:48:17] legoktm: Sure, toss it on your people.wm.o directory or something?
[23:50:32] (PS1) Alex Monk: labsprojectfrommetadata: use jq instead of trying to parse JSON with regex [puppet] - https://gerrit.wikimedia.org/r/307660
[23:52:53] (Abandoned) Paladox: Disable $phabricator_active_server in labs since it is uneeded in labs [puppet] - https://gerrit.wikimedia.org/r/307335 (https://phabricator.wikimedia.org/T144112) (owner: Paladox)
[23:53:18] ostriches did gerrit manage to run puppet?
[23:53:23] or did it fail?
[23:53:31] Yeah puppet ran just fine, but gerrit didn't restart
[23:53:35] Systemd is confused
[23:53:39] Waiting for swat to end
[23:53:59] Oh ok
[23:54:44] ostriches when is swat over?
[23:54:48] I dunno
[23:54:52] Ugh sorry
[23:54:54] ok
[23:54:55] I got distracted AGAIN
[23:55:07] (CR) Catrope: [C: 2] logging: Require acknowledgment of kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: EBernhardson)
[23:55:13] (PS4) Catrope: logging: Require acknowledgment of kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: EBernhardson)
[23:55:19] (CR) Catrope: [C: 2] logging: Require acknowledgment of kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: EBernhardson)
[23:55:43] (Merged) jenkins-bot: logging: Require acknowledgment of kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: EBernhardson)
[23:55:59] RoanKattouw: Protip: netflix during swat is distracting :p
[23:56:04] haha
[23:56:13] Sadly I'm getting distracted by work, not Netflix
[23:56:24] lol
[23:56:36] * paladox does not use netflix, i use amazon instant
video
[23:56:42] Hmm ebern[TAB] is not here
[23:56:42] work and chill
[23:56:49] I guess he left at :49 which is fair enough
[23:57:13] lol
[23:57:25] His patch looks safe enough though
[23:57:29] ostriches: https://people.wikimedia.org/~legoktm/wikidiff2-1.4.1.tar.gz I think I did that right
[23:57:29] Im currently watching tv, /me loves sky tv
[23:57:52] (PS2) Catrope: Fix CirrusSearch BM25 A/B test similiraty config [mediawiki-config] - https://gerrit.wikimedia.org/r/307558 (owner: DCausse)
[23:57:58] (CR) Catrope: [C: 2] Fix CirrusSearch BM25 A/B test similiraty config [mediawiki-config] - https://gerrit.wikimedia.org/r/307558 (owner: DCausse)
[23:58:08] !log catrope@tin Synchronized wmf-config/logging.php: Require ack of kafka logging (T135159) (duration: 00m 47s)
[23:58:09] T135159: Require kafka acknowledgment from mediawiki logging pipeline (CirrusSearchRequestSet, ApiAction channels) - https://phabricator.wikimedia.org/T135159
[23:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:58:26] (Merged) jenkins-bot: Fix CirrusSearch BM25 A/B test similiraty config [mediawiki-config] - https://gerrit.wikimedia.org/r/307558 (owner: DCausse)
[23:58:42] legoktm: You haz files: https://releases.wikimedia.org/wikidiff2/
[23:58:49] :D
[23:58:58] ostriches: any chance you wanna gpg sign it?
[23:58:59] !log catrope@tin Started scap: wmf-config/CirrusSearch-common.php Fix CirrusSearch BM25 A/B test similarity config
[23:59:01] !log catrope@tin scap aborted: wmf-config/CirrusSearch-common.php Fix CirrusSearch BM25 A/B test similarity config (duration: 00m 01s)
[23:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:59:19] !log Whoops, I meant scap sync-file, not scap sync
[23:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr.
Obvious
[23:59:39] legoktm: I suppose I can, lemme see if I know how :)
[23:59:53] (PS3) Dzahn: Phab: Remove config abstraction. Useless & confusing [puppet] - https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: Chad)
[23:59:53] !log catrope@tin Synchronized wmf-config/CirrusSearch-common.php: Fix CirrusSearch BM25 A/B test similarity config (duration: 00m 48s)
[23:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master