[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171201T0000).
[00:00:04] mooeypoo and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:30] (PS1) Rush: toolforge: bastion local throttling [puppet] - https://gerrit.wikimedia.org/r/394506
[00:01:03] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3801632 (Mholloway) p:Triage>High
[00:02:41] (CR) BryanDavis: [C: 1] toolforge: bastion local throttling [puppet] - https://gerrit.wikimedia.org/r/394506 (owner: Rush)
[00:02:49] addshore: Can we tick "Sort out the extension lists and localization stuff for deployment" as done on T173818 ?
[00:02:49] T173818: [Epic] Kill the Wikidata build step - https://phabricator.wikimedia.org/T173818
[00:03:06] I'll deploy
[00:03:30] James_F: yup
[00:03:47] addshore: Kk.
[00:04:08] How long ago was wikimania?
[00:04:18] Too long.
[00:04:43] 3.75 months.
[00:04:47] Ish.
[00:04:47] !log Killed all remaining Wikidata JSON/RDF dumpers, due to T181385. This means no dumps this week!
[00:04:50] I'm looking forward to closing the rest of the subtasks
[00:04:53] you should ask wikidata with wdql
[00:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:56] T181385: Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385
[00:05:08] wolfram wiki
[00:05:40] mooeypoo: yt?
[00:06:59] mutante: true
[00:07:11] Sorry, I'm here
[00:07:28] (PS2) MaxSem: Switch all wikis to HTML5 section IDs [mediawiki-config] - https://gerrit.wikimedia.org/r/394460 (https://phabricator.wikimedia.org/T152540)
[00:07:32] (CR) MaxSem: [C: 2] Switch all wikis to HTML5 section IDs [mediawiki-config] - https://gerrit.wikimedia.org/r/394460 (https://phabricator.wikimedia.org/T152540) (owner: MaxSem)
[00:08:51] MaxSem, I'm not sure how to test this, though, I don't have access to warning logs
[00:09:34] you should have access to logstash, mooeypoo
[00:09:38] (Merged) jenkins-bot: Switch all wikis to HTML5 section IDs [mediawiki-config] - https://gerrit.wikimedia.org/r/394460 (https://phabricator.wikimedia.org/T152540) (owner: MaxSem)
[00:09:55] (CR) jenkins-bot: Switch all wikis to HTML5 section IDs [mediawiki-config] - https://gerrit.wikimedia.org/r/394460 (https://phabricator.wikimedia.org/T152540) (owner: MaxSem)
[00:10:10] PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106)
[00:10:18] ... okay, I'm not sure how to access that.
[00:11:31] https://logstash.wikimedia.org/
[00:12:15] I'm sorry :\ we had a miscommunication and RoanKattouw is away.
[00:12:40] MaxSem: It's something you can monitor much more easily in fatalmonitor as you deploy it.
[00:13:42] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: HTML5 sections be upon us! (duration: 00m 45s)
[00:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:23] we discussed and decided no SWAT is needed
[00:22:03] okay, we're done here
[00:22:40] (CR) Dzahn: [C: 2] logstash: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394489 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:24:25] (PS1) Ayounsi: Initial deb packaging [debs/python-json-logger] - https://gerrit.wikimedia.org/r/394507
[00:47:26] (PS2) Ayounsi: Initial deb packaging [debs/python-json-logger] - https://gerrit.wikimedia.org/r/394507
[00:47:27] (CR) Ayounsi: "Requirement for anycast-helthchecker" [debs/python-json-logger] - https://gerrit.wikimedia.org/r/394507 (owner: Ayounsi)
[00:51:30] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:16:30] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[01:46:40] Operations, Cloud-Services, cloud-services-team (Kanban): Recover "Flominator" svn account for use as a modern developer account - https://phabricator.wikimedia.org/T180813#3801735 (bd808) a:bd808 @Flominator I have attached your existing LDAP account to Wikitech and set the account's email addre...
[01:46:52] !log Ran scap pull on mwdebug1001 after T181385 testing
[01:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:04] T181385: Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385
[02:03:39] (PS3) Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114)
[02:03:47] no_justification: thoughts on https://gerrit.wikimedia.org/r/#/c/379953/ ?
[02:06:57] I was thinking about that yesterday. It's the right thing, but should definitely move out of that directory. Scap3 doesn't allow for untracked files (it cleans them on deploy). Deploy and puppet would fight. Not strictly a problem yet but would be eventually
[02:08:17] Generally I'd like to move all error pages to puppet
[02:08:20] PROBLEM - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:08:31] PROBLEM - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:01] PROBLEM - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:01] PROBLEM - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:10] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:11] PROBLEM - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:11] PROBLEM - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:09:31] PROBLEM - IPMI Sensor Status on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:11:00] PROBLEM - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:11:13] That sounds unhappy
[02:13:10] PROBLEM - HP RAID on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:19:49] ACKNOWLEDGEMENT - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106) daniel_zahn https://phabricator.wikimedia.org/T181431
[02:20:30] PROBLEM - Check the NTP synchronisation status of timesyncd on labstore1006 is CRITICAL: Return code of 255 is out of bounds
[02:21:45] ACKNOWLEDGEMENT - Check size of conntrack table on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:45] ACKNOWLEDGEMENT - Check systemd state on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:45] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:45] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:45] ACKNOWLEDGEMENT - DPKG on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:46] ACKNOWLEDGEMENT - Disk space on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:46] ACKNOWLEDGEMENT - HP RAID on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:47] ACKNOWLEDGEMENT - IPMI Sensor Status on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:47] ACKNOWLEDGEMENT - Long running screen/tmux on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:48] ACKNOWLEDGEMENT - configured eth on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:48] ACKNOWLEDGEMENT - dhclient process on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:21:49] ACKNOWLEDGEMENT - puppet last run on labstore1006 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T181431
[02:36:20] (CR) Dzahn: [C: 2] "jobqueue dashboard in grafana is at https://grafana.wikimedia.org/dashboard/db/job-queue-health" [puppet] - https://gerrit.wikimedia.org/r/394490 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[02:36:25] (PS2) Dzahn: jobqueue_redis,restbase: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394490 (https://phabricator.wikimedia.org/T177225)
[02:39:09] (CR) Dzahn: "the redis metrics in ganglia are also already removed except one or 2 host exceptions" [puppet] - https://gerrit.wikimedia.org/r/394490 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[02:58:18] ACKNOWLEDGEMENT? That's new.
[03:14:29] Niharika: Nah, just most people don't bother to do it ;)
[03:15:14] Reedy_: Mm? Do what?
[03:15:27] Acknowledging icinga alerts
[03:15:41] Reedy_: I had a question you'd definitely know the answer to. What do I need to do to create a new release for an extension?
[03:15:49] Just creating one in Github would do it?
[03:15:54] Tim and I have replied in -core :)
[03:16:01] Oh.
[03:16:07] :D
[03:19:33] Niharika: the ACK has 2 advantages: others that check Icinga in web UI usually look at the un-acked section and can skip this. and, ACK means automatically "don't send notifications UNTIL the next status change", so if something comes back, notifications will start again, but until then it won't spam us. that is best of both worlds because you don't have to remember re-enabling something but
[03:19:39] also avoid more spam
[03:20:15] if people just disable notifications it stops showing up on IRC but in web UI you are still not sure it's handled and .. somebody has to remember to enable it again later.. which is often forgotten
[03:20:34] that's why i vote for ACK..
[03:20:57] Ah. :)
[03:22:23] of course one could also do both, and first disable notifications, then ACK (to avoid the ACK "spam" itself), and then enable them again to not have to remember it. but that's even more clicking
[03:26:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 884.48 seconds
[03:34:37] (PS1) Dzahn: mysql: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225)
[03:34:37] (PS1) Dzahn: swift: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394519 (https://phabricator.wikimedia.org/T177225)
[03:36:34] (PS2) Dzahn: mysql: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225)
[03:37:57] (PS2) Dzahn: swift: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394519 (https://phabricator.wikimedia.org/T177225)
[03:38:31] (PS3) Dzahn: swift: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394519 (https://phabricator.wikimedia.org/T177225)
[03:42:14] (CR) Dzahn: [C: 2] swift: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394519 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[03:43:02] (PS4) Dzahn: swift: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394519 (https://phabricator.wikimedia.org/T177225)
[04:16:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.20 seconds
[04:34:54] MaxSem: Looks like the wgFragmentMode change may not have been properly deployed to all servers.
[04:35:38] After it rolled out, https://gist.github.com/Krinkle/b3180d5562897ea3160d111908852b5d
[04:35:46] the same minute, the module changed, as it should.
[04:36:01] but then in hours after, it keeps alternating.
[04:36:25] as if the backend response is not consistent. I can only think of one thing that would cause that: an app server being pooled with a different state?
[04:36:37] (or maybe varnish backend cache whitewashing)
[04:39:03] Refreshing https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=mediawiki.util&_ with different bogus values shows indeed that when I bypass cache. Some servers respond with "mediawiki.util@14kc4ko" and "wgFragmentMode":["legacy","html5"]
[04:39:14] but most respond with ("mediawiki.util@000pyx0" and "wgFragmentMode":["html5","legacy"]} (new setting, html5 first)
[04:39:39] the faulty 14k comes from at least mw1220.eqiad.wmnet
[04:39:40] possibly others
[04:40:01] mw1327.eqiad.wmnet as well
[04:41:32] touch and resync?
[04:42:11] Capturing data first
[04:42:12] Hold on :)
[04:54:00] legoktm: looks like HHVM has a stale compilation https://phabricator.wikimedia.org/T181773
[04:54:11] Gotta go, but feel free to try a touch and re-sync if you want
[05:14:18] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 44s)
[05:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:10] Krinkle: it looks good on the app servers now
[06:13:20] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:18:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:20:07] Operations, ops-codfw, DBA: db2044: RAID disk with predictive failure - https://phabricator.wikimedia.org/T181775#3801889 (Marostegui)
[06:25:59] Operations, ops-codfw, DBA: db2044: RAID disk with predictive failure - https://phabricator.wikimedia.org/T181775#3801905 (Marostegui) p:Triage>Normal
[06:27:41] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:27:51] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:28:50] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.010 second response time
[06:29:31] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl]
[06:30:01] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:47:30] Puppet, Wikimedia-Language-setup, Patch-For-Review, User-MarcoAurelio, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3802025 (MarcoAurelio) Hello. I'd appreciate a reply on the question above so this does not get stalled. Thanks.
[06:57:51] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:11] PROBLEM - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:12 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Controller: OK - Battery/Capacitor: OK
[06:58:12] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:12 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T181779
[06:58:16] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3802070 (ops-monitoring-bot)
[06:58:17] Ah, it already failed...
[06:58:18] that was fast
[06:59:04] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3802076 (Marostegui) p:Triage>Normal a:Papaul Can we get this replaced @Papaul ? Thanks!
[06:59:30] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:35] Operations, ops-codfw, DBA: db2044: RAID disk with predictive failure - https://phabricator.wikimedia.org/T181775#3802080 (Marostegui) Open>Resolved And it finally failed: T181779 Let's follow up there
[07:00:01] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:15] (PS1) Marostegui: mariadb: Enable Barracuda on a few roles [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949)
[07:14:46] !log Logging retroactively for the record, restarting MySQL on db1039
[07:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:17] (Draft1) MarcoAurelio: Add Portal namespace for mwl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052)
[07:19:21] (PS2) MarcoAurelio: Add Portal namespace for mwl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052)
[07:24:31] (CR) Alexandros Kosiaris: [C: 2] apertium-crh: New upstream release [debs/contenttranslation/apertium-crh] - https://gerrit.wikimedia.org/r/393711 (https://phabricator.wikimedia.org/T181465) (owner: KartikMistry)
[07:24:33] (CR) Alexandros Kosiaris: [C: 2] apertium-tur: New upstream release [debs/contenttranslation/apertium-tur] - https://gerrit.wikimedia.org/r/393758 (https://phabricator.wikimedia.org/T181465) (owner: KartikMistry)
[07:27:10] (CR) Alexandros Kosiaris: [C: 2] diamond: skip DiskSpace for Docker containers [puppet] - https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: Hashar)
[07:27:15] (PS4) Alexandros Kosiaris: diamond: skip DiskSpace for Docker containers [puppet] - https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: Hashar)
[07:27:18] (CR) Alexandros Kosiaris: [V: 2 C: 2] diamond: skip DiskSpace for Docker containers [puppet] - https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: Hashar)
[07:48:29] (CR) Alexandros Kosiaris: [C: 2] user homes: Allow git to control +x for $HOME files [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis)
[07:48:37] (PS3) Alexandros Kosiaris: user homes: Allow git to control +x for $HOME files [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis)
[07:48:39] (CR) Alexandros Kosiaris: [V: 2 C: 2] user homes: Allow git to control +x for $HOME files [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis)
[07:51:12] !log upload apertium-tur_0.2.0~r83161-1+wmf1, apertium-crh_0.2.0~r83161-1+wmf1 to apt.wikimedia.org/jessie-wikimedia component main. T181465
[07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:23] T181465: Update crh-tur Apertium language pair - https://phabricator.wikimedia.org/T181465
[07:56:20] (CR) Alexandros Kosiaris: [C: 1] "I think I was this to be merged right before some SWAT window so we can witness potential problems ASAP. I 'll be monitoring Looking at ht" [puppet] - https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: Alexandros Kosiaris)
[08:06:16] (PS2) Muehlenhoff: Restrict access to ferm service on mwlog* hosts [puppet] - https://gerrit.wikimedia.org/r/393240
[08:09:10] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:30] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:40] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:42] ganeti down? akosiaris
[08:09:55] yeah, all ganeti hosts
[08:10:10] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:11] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:15] what ?
[08:10:21] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:30] ganeti1008
[08:10:30] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:30] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:31] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:32] pffff
[08:10:40] PROBLEM - SSH on ganeti1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:10:43] I am guessing the exact same issue as 1006, 1005
[08:11:04] * akosiaris logging to mgmt
[08:12:06] console is unresponsive. displays some [7412855.251918] INFO: task drbd_r_resource:9350 blocked for more than 120 seconds.
[08:12:06] [7412855.260173] Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
[08:12:06] [7412855.268819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[08:12:06] messages
[08:12:10] for various processes
[08:12:52] SEL is empty
[08:13:36] rack log is useless too
[08:13:40] RECOVERY - SSH on ganeti1008 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:13:40] RECOVERY - Host actinium is UP: PING WARNING - Packet loss = 93%, RTA = 108.67 ms
[08:13:50] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[08:13:50] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.99 ms
[08:13:50] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms
[08:13:50] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms
[08:13:50] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 3.12 ms
[08:14:00] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms
[08:14:10] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 3.21 ms
[08:14:42] [7413026.665475] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[08:14:42] [7413026.665475] bad because of flags: 0x14(referenced|dirty)
[08:14:43] [7413026.665542] BUG: Bad page state in process in:imklog pfn:747741
[08:14:55] damn
[08:15:41] ok so no boxes of that batch are trustable right now
[08:15:50] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[08:15:54] I am guessing 1007 will soon exhibit the same issue
[08:16:00] I'm still leaning towards a hardware issue
[08:16:25] yeah but https://phabricator.wikimedia.org/T181121#3796723 says otherwise
[08:16:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:16:35] now these tools are not that trustworthy
[08:16:45] yeah, I saw that, but maybe that tool simply tests a pattern which we don't exhibit
[08:16:47] question is what do we do
[08:17:40] (CR) Jcrespo: [C: 1] "I am ok with this, but can we deploy on codfw first, wait some time, just to be 100% sure for some strange reason?" [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[08:17:58] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006 - https://phabricator.wikimedia.org/T181121#3802150 (akosiaris) >>! In T181121#3796723, @Cmjohnson wrote: > The h/w tests are finished and no errors were found. assigning to @akosiaris I was afraid of that. That kind of leaves us...
[08:18:07] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3802151 (akosiaris)
[08:18:40] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5
[08:18:44] anyway, I 'll empty ganeti1008. If history is of any indication, last time ganeti1005 first did this and then died and required a reboot.
[08:18:53] I 'd rather not have to witness this
[08:19:02] that is with VMs on it
[08:19:43] if we'd see the same error from ganeti1001-1004 it's certainly a software issue, but given that this only affects 1005-1008 from the 2017 hw refresh...
[08:19:44] which means I have to repool ganeti1005, ganeti1006 unfortunately :-(
[08:20:11] !log repool ganeti1005, ganeti1006 to empty ganeti1008 T181121
[08:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:23] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121
[08:21:00] one alternative would be to run the memtest tool shipped in Debian (the one which can be started from grub), to see whether that shows anything
[08:21:19] but could as well be a motherboard related issue
[08:21:27] (PS1) Jcrespo: mysql: remove ganglia from codfw [puppet] - https://gerrit.wikimedia.org/r/394530 (https://phabricator.wikimedia.org/T177225)
[08:21:35] that's more probable btw. We have ECC memory on those
[08:21:52] if it was a memory DIMM at fault it would have probably been caught and logged
[08:21:55] (PS2) Jcrespo: mysql: remove ganglia from codfw [puppet] - https://gerrit.wikimedia.org/r/394530 (https://phabricator.wikimedia.org/T177225)
[08:22:04] or cpu
[08:22:04] yeah, true that
[08:22:15] or let Christ check if a more recent firmware update is available
[08:22:15] memory issues sometimes show as memory
[08:22:39] Chris, we're not yet in the problem space where Christ gets involved
[08:22:53] lol
[08:24:17] (CR) Jcrespo: "marostegui: ok to deploy before https://gerrit.wikimedia.org/r/394518 ?" [puppet] - https://gerrit.wikimedia.org/r/394530 (https://phabricator.wikimedia.org/T177225) (owner: Jcrespo)
[08:24:39] (CR) Marostegui: [C: 1] mysql: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[08:25:01] (CR) Marostegui: [C: 1] mysql: remove ganglia from codfw [puppet] - https://gerrit.wikimedia.org/r/394530 (https://phabricator.wikimedia.org/T177225) (owner: Jcrespo)
[08:25:52] (CR) Jcrespo: [C: 1] mariadb: Enable Barracuda on a few roles [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[08:26:03] (CR) Jcrespo: [C: 1] "We should have done this ages ago." [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[08:26:21] (CR) Jcrespo: [C: 2] mysql: remove ganglia from codfw [puppet] - https://gerrit.wikimedia.org/r/394530 (https://phabricator.wikimedia.org/T177225) (owner: Jcrespo)
[08:26:28] (PS2) Marostegui: mariadb: Enable Barracuda on a few roles [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949)
[08:27:25] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3802157 (MoritzMuehlenhoff) @Cmjohnson : This seems hardware-related, we have four Ganeti nodes with identical software which are older models (and none affected) and four boxes...
[08:27:57] (PS3) Marostegui: mariadb: Enable Barracuda on a few roles [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949)
[08:28:29] !log empty ganeti1008, move VMs to ganeti1006 T181121
[08:28:35] (CR) Marostegui: [C: 2] mariadb: Enable Barracuda on a few roles [puppet] - https://gerrit.wikimedia.org/r/394527 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[08:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:40] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121
[08:28:40] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5
[08:33:21] PROBLEM - DPKG on es2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:33:42] (PS1) Elukey: Set role::system::spare to db104[67] [puppet] - https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844)
[08:34:16] Operations, Cloud-Services, monitoring, Continuous-Integration-Infrastructure (shipyard), and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3802166 (hashar) Open>Resolved a:hashar
[08:35:21] PROBLEM - DPKG on db2029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:35:32] (PS2) Elukey: Set role::system::spare to db104[67] [puppet] - https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844)
[08:35:35] (CR) Jcrespo: [C: -1] mysql: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[08:35:53] (PS1) Jcrespo: Revert "mysql: remove ganglia from codfw" [puppet] - https://gerrit.wikimedia.org/r/394532
[08:36:00] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ganglia-monitor]
[08:36:01] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[ganglia-monitor]
[08:36:05] (PS2) Jcrespo: Revert "mysql: remove ganglia from codfw" [puppet] - https://gerrit.wikimedia.org/r/394532
[08:36:14] (CR) Jcrespo: [V: 2 C: 2] Revert "mysql: remove ganglia from codfw" [puppet] - https://gerrit.wikimedia.org/r/394532 (owner: Jcrespo)
[08:37:53] (CR) Jcrespo: [C: -2] "This patch doesn't work:" [puppet] - https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[08:39:21] RECOVERY - DPKG on es2004 is OK: All packages OK
[08:40:22] (CR) Marostegui: [C: 1] Set role::system::spare to db104[67] [puppet] - https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: Elukey)
[08:40:57] !log reboot the remaining analytics103* hadoop workers to pick up kernel+jvm updates - T179943
[08:41:01] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:02] marostegui: \o/
[08:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:07] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943
[08:41:24] !log Remove db1046 and db1047 from tendril - T156844
[08:41:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:31] T156844: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844
[08:42:53] Operations, monitoring, Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3674779 (jcrespo) The patch doesn't apply cleanly, no matter how many puppet runs are done: ```lines=10 Sleeping 14 for random splay Info: Retrieving pluginfacts Info: Retrieving p...
[08:44:31] marostegui: there is a db on db1047 that is called 'ops', do you want to check/save it?
[08:44:52] nope, no need to
[08:45:30] RECOVERY - DPKG on db2029 is OK: All packages OK
[08:46:00] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:47:55] elukey: sorry I wasn't clear, when I said leave it "as is", I meant- if db is to be down, just leave it to get loss, if it is already there on a new host, do not remove it
[08:48:05] *lost
[08:49:01] jynus: ah ok! Just wanted to be sure, I am triple checking the tables/dbs/etc..
[08:49:04] (CR) Jcrespo: "Let me check if there is some extra puppet code we can remove." [puppet] - https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: Elukey)
[08:49:14] (CR) Muehlenhoff: [C: 2] Restrict access to ferm service on mwlog* hosts [puppet] - https://gerrit.wikimedia.org/r/393240 (owner: Muehlenhoff)
[08:49:20] (PS3) Muehlenhoff: Restrict access to ferm service on mwlog* hosts [puppet] - https://gerrit.wikimedia.org/r/393240
[08:49:23] dbstore1002 again
[08:49:30] replication broken?
[08:49:42] I can take it
[08:49:49] (CR) Muehlenhoff: [V: 2 C: 2] Restrict access to ferm service on mwlog* hosts [puppet] - https://gerrit.wikimedia.org/r/393240 (owner: Muehlenhoff)
[08:49:50] thanks
[08:50:02] if you have the time, can you check what I was going to check about orphan puppet code?
[08:50:19] at https://gerrit.wikimedia.org/r/394531 [08:50:30] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Cant find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001555, end_log_pos 673700780 [08:50:31] I highly suspect there is [08:53:46] I had a quick look btw at kernel images. 1008 runs bpo.3 and 1005,1006 run bpo.4 so that can be ruled out [09:01:31] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:18] akosiaris: ack, but FWIW at the time 1005 and 1006 crashed, they were still using 4.9.30/bpo.3 as well (they pick up the bpo4 kernel with the reboots, since I haven't done a rolling restart of ganeti in eqiad due to the hw issues) [09:03:36] ah, indeed [09:03:37] the new kernel is running fine in codfw for two weeks or so, though [09:03:49] the old one was running fine for 85 days ... [09:03:57] well more.. it's 85 on ganeti1008 [09:04:21] it's 162 on ganeti1007 (and probably ganeti1005, ganeti1006) [09:04:39] * akosiaris remembered the 200 days bug [09:06:02] ganeti1008 is behaving erratically ... [09:06:07] puppet agent --disable is stuck [09:06:15] prometheus puppet stats are stuck as well [09:08:07] yeah it's having another hiccup.. 
good thing we moved the VMs out [09:10:03] (03PS14) 10TerraCodes: Add loginwiki and wikidata to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) [09:10:19] (03PS5) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [09:12:19] akosiaris: its load average looks like wants to take off [09:12:39] it's not responsive anymore I 'd say [09:12:53] a ton of commands are stuck in D state [09:16:10] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [09:16:26] this is me --^ [09:16:38] downtime expired because of long draining [09:21:40] (03PS3) 10Jcrespo: Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [09:22:09] I am gonna powercycle [09:23:30] !log powercycle ganeti1008 T181121, it's largely unresponsive [09:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:39] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [09:23:46] !log reboot analytics104* for kernel+jvm updates - T179943 [09:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:54] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [09:28:30] PROBLEM - NTP on sca1004 is CRITICAL: NTP CRITICAL: Offset -1.505220532 secs [09:33:01] godog: moritzm this is interesting https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=16&fullscreen&orgId=1&var-server=ganeti1008&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now [09:33:35] the temp increase precedes the very first logged error by 10 mins [09:33:46] 
and the icinga alert by 6 [09:34:24] that being said. the temperature there is ridiculously low [09:34:55] it's ~40 degrees before the autoshutdown threshold [09:35:58] the other fun things are the disk iops. they spiked to 13k writes right before the problems started [09:38:13] and of course there's a nice used memory plateau of 3.3GB at 8:31 which is after I've migrated everything out of the box (and mem usage should drop to ~1G). But the kernel was already not at its best by then so I am guessing it's expected [09:38:31] akosiaris: indeed, and a jump in / disk used, looks like something was legitimately (?) writing to the disk? too late for logrotate tho [09:39:02] that's the logs [09:39:15] if you look at the time it's after the disk IOPS and temp increase [09:39:30] it's all the stacktraces that finally made it to the fs [09:39:36] at least that's what I think [09:39:56] also network activity peaked right before that [09:40:30] 95Mbps transmitted at 08:02 [09:40:36] so something is triggering it [09:42:29] * akosiaris wants to blame piwik :P [09:42:37] but it doesn't seem it's that [09:43:55] heheh I wonder if we could break the network/disk stats further, the metrics should be all there [09:44:06] when zooming out to 2 days it shows that the host also had a huge network usage yesterday at 8h [09:44:31] but not on the previous day [09:46:24] now I would expect drbd to actually do that. resync stuff and it would explain it. that being said, that's always logged in kern.log [09:46:29] and I see nothing about that today [09:47:43] or yesterday [09:49:03] yeah I'm looking at syslog but nothing interesting so far except tons of errors from the kernel [09:50:11] (03CR) 10Muehlenhoff: [C: 031] Depend on jre-headless and its versioned names. [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/394322 (owner: 10Filippo Giunchedi) [09:51:45] akosiaris: did the kernel change recently? [09:52:16] just now with the reboot. 
otherwise it was up with .bpo.3 for 85 days [09:52:40] same goes for 1005, 1006, only those had even greater uptimes with .bpo.3 (~160 days) [09:53:24] I checked all the VMs 1-by-1 on grafana. None are responsible either for the IOPS increase or the network traffic [09:54:12] I am gonna run memtester in a screen... just for the fun of it [09:55:42] !log run memtester 61G on ganeti1008 T181121 [09:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121 [09:56:24] ack, part of the network traffic could be indeed logs too, it logged 460k "call trace" on lithium [09:56:47] ah, got timestamp for the very first one ? [09:57:31] (03PS2) 10Filippo Giunchedi: Depend on jre-headless and its versioned names. [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/394322 [09:57:51] Dec 1 08:13:31 ganeti1008 kernel: [7412671.294930] qemu-system-x86: page allocation stalls for 11124ms, order:0, mode:0x24000c0(GFP_KERNEL) [09:57:58] looks like [09:58:17] that does not coincide with the network traffic spikes start time [09:58:17] but that's fallout I guess [09:58:30] yeah I do too [09:58:30] RECOVERY - NTP on sca1004 is OK: NTP OK: Offset 0.01142069697 secs [09:58:32] similar what we saw when ganeti1005/1006 bailed [09:58:48] that accounts probably for the 2nd and the 3rd spike [09:59:05] well trapezoid, not spike but anyway [10:00:51] (03CR) 10Filippo Giunchedi: [C: 032] Depend on jre-headless and its versioned names. 
[debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/394322 (owner: 10Filippo Giunchedi) [10:09:43] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/9101/" [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:13:51] (03PS1) 10Jcrespo: maridb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) [10:14:34] (03CR) 10jerkins-bot: [V: 04-1] maridb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) (owner: 10Jcrespo) [10:17:24] (03PS2) 10Jcrespo: maridb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) [10:18:12] !log reimaging mw1259 (video scaler) to stretch, will be kept disabled initially (with some live tests starting next week) [10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:46] (03CR) 10Jcrespo: [C: 04-1] "Actually, spare should disable notifications automatically, let me recheck it." 
[puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:19:58] (03PS4) 10Jcrespo: Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:20:00] (03PS3) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) [10:21:01] !log initial purge of old table metrics from graphite2002 - T181689 [10:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:11] T181689: New RESTBase Cassandra cluster has legacy tables - https://phabricator.wikimedia.org/T181689 [10:22:11] (03CR) 10Jcrespo: [C: 04-1] "yes, see hieradata/role/common/spare/system.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:24:07] (03PS5) 10Jcrespo: Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:25:42] (03CR) 10Elukey: [C: 031] Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:25:45] (03PS6) 10Jcrespo: Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:26:17] (03CR) 10Jcrespo: [C: 031] "I am now happy." 
[puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:26:33] (03PS4) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) [10:27:55] (03CR) 10Elukey: [C: 032] Set role::system::spare to db104[67] [puppet] - 10https://gerrit.wikimedia.org/r/394531 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [10:30:12] (03PS1) 10Marostegui: mariadb: Enable barracuda in some more hosts [puppet] - 10https://gerrit.wikimedia.org/r/394542 (https://phabricator.wikimedia.org/T150949) [10:35:00] PROBLEM - eventlogging_sync processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [10:35:16] elukey: ^ that is what I was talking about :-) [10:35:51] buuuu [10:37:20] marostegui: damage is done, puppet ran and all seems good [10:38:07] Awesome! [10:39:56] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3802296 (10elukey) [10:40:45] (03PS1) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [10:44:02] (03PS1) 10Hashar: Remove Arcanist configuration files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394544 [10:46:42] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:50:18] gehel: I wanted to test jmx_exporter with elasticsearch, ok to try on deployment-logstash2 ? [10:53:48] godog: hello :) for when you get time, I could use some metrics to be purged from the labs Graphite. 
They are Diamond diskspace metrics for Docker mounts and that causes a bunch of alarms in Shinken :] https://phabricator.wikimedia.org/T181476 [10:54:04] my email box thanks you in advance ! [10:54:53] hashar: easy enough I'll do it now [10:55:43] (03CR) 10Marostegui: "The puppet compiler complains a bit: https://puppet-compiler.wmflabs.org/compiler02/9103/" [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T159412) (owner: 10Jcrespo) [10:56:55] !log delete docker diskspace metrics from labs - T181476 [10:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:05] T181476: Purge labs graphite metrics of Docker ephemeral partitions - https://phabricator.wikimedia.org/T181476 [10:57:21] !log reboot analytics1028 for kernel + jvm updates (Hadoop HDFS journalnode) - T179943 [10:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:30] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [10:59:06] (03PS2) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [10:59:11] (03PS5) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:01:22] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [11:01:27] (03PS6) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:02:48] (03PS7) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:04:11] !log Update MySQL on 
db1039 for testing [11:04:15] (03PS8) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:21] (03PS9) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:08:34] !log Stop MySQL on db2055 for testing [11:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:04] (03PS1) 10Marostegui: db-codfw.php: Depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394545 [11:10:23] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3802437 (10elukey) Opened https://phabricator.wikimedia.org/T181784 to fully decom db104[67] [11:11:43] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394545 (owner: 10Marostegui) [11:11:54] (03PS3) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [11:12:36] (03CR) 10Jcrespo: "This should be better, but we need more testing https://puppet-compiler.wmflabs.org/compiler02/9106/" [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [11:13:13] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394545 (owner: 10Marostegui) [11:13:44] (03PS10) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [11:14:03] (03CR) 10jenkins-bot: db-codfw.php: Depool db2055 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/394545 (owner: 10Marostegui) [11:14:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2055 (duration: 00m 46s) [11:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:49] (03PS4) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [11:18:06] (03PS1) 10Marostegui: db2055.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/394546 [11:21:12] (03CR) 10Marostegui: [C: 032] db2055.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/394546 (owner: 10Marostegui) [11:28:37] !log upgrading and restarting dbstore2001 [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:50] hashar: I'd need to add 'mtail' as a dependency of CI slaves to run tests in modules/mtail/files and then integrate that directory with tox so nose discovers the tests, how do I do that? [11:30:34] godog: sorry for the delay, but yes, have fun with logstash / elasticsearch! [11:31:18] gehel: sweet, thanks! 
[11:31:37] stopping dbstore2001 with so little available memory is slow [11:35:18] godog: we would need to update the docker container used for operations/puppet and get mtail installed there [11:35:38] godog: that is done using docker-pkg in integration/config.git:dockerfiles/operations-puppet there is a Dockerfile.template there [11:35:45] (03PS5) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [11:36:04] (03PS1) 10Filippo Giunchedi: mtail: rename exim tests [puppet] - 10https://gerrit.wikimedia.org/r/394548 [11:36:18] godog: though the container uses jessie for now and mtail is not included (though it is in jessie-backports https://packages.debian.org/search?keywords=mtail ) [11:36:20] (03CR) 10jerkins-bot: [V: 04-1] mtail: rename exim tests [puppet] - 10https://gerrit.wikimedia.org/r/394548 (owner: 10Filippo Giunchedi) [11:36:58] hashar: sweet! thanks, I'll open a task for that and cc you and send the review [11:37:13] hashar: is backports available already from inside the container? [11:37:14] godog: mtail sounds a lot like logstash isn't it ? [11:38:07] ah no [11:38:08] hmm [11:38:48] godog: I don't know how apt is configured. 
Maybe backports is available yes [11:39:08] they are related in the "log processing" sense, not a whole lot else perhaps [11:39:49] (03PS2) 10Filippo Giunchedi: mtail: rename exim tests [puppet] - 10https://gerrit.wikimedia.org/r/394548 [11:42:00] 10Operations, 10Goal, 10User-fgiunchedi: Have jenkins run mtail tests via tox/nose - https://phabricator.wikimedia.org/T181794#3802513 (10fgiunchedi) [11:44:03] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394549 [11:46:04] (03PS1) 10Giuseppe Lavagetto: Re-add nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/394550 [11:49:04] 10Operations, 10Electron-PDFs, 10Design, 10I18n, and 3 others: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200#3802538 (10mobrovac) [11:49:14] (03PS2) 10Giuseppe Lavagetto: Re-add nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/394550 (https://phabricator.wikimedia.org/T180524) [11:49:44] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Create an incident report - https://phabricator.wikimedia.org/T181795#3802542 (10akosiaris) [11:52:43] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802557 (10jcrespo) [11:53:30] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802542 (10jcrespo) I know this is just a todo for yourself, but the title appears in many places out of the original context, sorry for the edit. 
[11:54:23] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3802564 (10Gilles) We can lower the threshold of the slow log at some point and you won't need to hit such extreme cases for them to show up. But yes, for no... [11:54:36] (03PS1) 10Filippo Giunchedi: tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) [11:55:29] (03CR) 10jerkins-bot: [V: 04-1] tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [11:56:00] (03PS2) 10Filippo Giunchedi: tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) [11:56:33] (03CR) 10Filippo Giunchedi: [C: 032] mtail: rename exim tests [puppet] - 10https://gerrit.wikimedia.org/r/394548 (owner: 10Filippo Giunchedi) [11:56:54] (03CR) 10jerkins-bot: [V: 04-1] tox: run mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394552 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [11:57:54] hashar: https://gerrit.wikimedia.org/r/394551 [11:59:03] !log restarting and upgrading mysql on labsdb1004 [11:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:38] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394549 (owner: 10Marostegui) [12:10:15] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394549 (owner: 10Marostegui) [12:10:26] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394549 (owner: 10Marostegui) [12:11:25] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2055 (duration: 00m 45s) [12:11:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:59] (03PS1) 10Muehlenhoff: Grant prometheus user to run rec_control on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394554 (https://phabricator.wikimedia.org/T181620) [12:19:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Re-add nodejs-devel [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/394550 (https://phabricator.wikimedia.org/T180524) (owner: 10Giuseppe Lavagetto) [12:27:25] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3802701 (10Joe) A new version of the nodejs-devel image based on nodesource's own package... [12:27:35] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3802702 (10Joe) 05Open>03Resolved [12:28:10] (03PS1) 10Jdrewniak: Update portals submodule to master [puppet] - 10https://gerrit.wikimedia.org/r/394555 (https://phabricator.wikimedia.org/T181799) [12:28:59] (03PS2) 10Jdrewniak: Update portals submodule to master [puppet] - 10https://gerrit.wikimedia.org/r/394555 (https://phabricator.wikimedia.org/T181799) [12:32:53] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: 10KartikMistry) [12:44:08] !log reboot druid1001 for kernel+jvm updates - T179943 [12:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:18] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [12:44:54] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: switch to puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394556 
(https://phabricator.wikimedia.org/T177254) [12:47:34] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: switch to puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394556 (https://phabricator.wikimedia.org/T177254) (owner: 10Giuseppe Lavagetto) [12:48:09] (03PS1) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394557 [12:53:08] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/394559 [12:53:48] no impact on real time data during the druid1001 restart \o/ [12:54:06] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/394559 (owner: 10Giuseppe Lavagetto) [13:04:56] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394560 [13:05:19] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394560 (owner: 10Muehlenhoff) [13:06:08] (03PS1) 10Muehlenhoff: Add Debianisation [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394561 [13:08:56] (03PS2) 10Muehlenhoff: Add Debianisation [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394561 [13:09:28] (03PS3) 10Muehlenhoff: Add Debianisation [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394561 [13:10:19] moritz++ [13:10:34] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add Debianisation [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394561 (owner: 10Muehlenhoff) [13:11:22] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:12] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:13:12] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:52] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:14:02] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:14:13] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:14:35] some network/restart glitch? I ran puppet manually on ms-be1022 and that went fine [13:14:37] nitrogen is not happy [13:14:45] maybe an OOM [13:15:02] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:04] [Fri Dec 1 13:08:28 2017] Out of memory: Kill process 8171 (java) score 381 or sacrifice child [13:15:07] yep [13:15:32] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:38] elukey: you monster [13:15:44] puppetdb was killed and then it restarted [13:15:52] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:16:22] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:17:12] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:19:28] (03PS1) 10Muehlenhoff: Add Prometheus exporter to DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) [13:22:32] PROBLEM - grafana-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:23:15] (03PS1) 10Muehlenhoff: Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 (https://phabricator.wikimedia.org/T181620) [13:28:05] (03PS2) 10Marostegui: mariadb: Enable barracuda in some more hosts [puppet] - 10https://gerrit.wikimedia.org/r/394542 (https://phabricator.wikimedia.org/T150949) [13:28:23] (03PS2) 10Rush: toolforge: bastion local throttling [puppet] - 10https://gerrit.wikimedia.org/r/394506 [13:28:58] (03CR) 10Marostegui: [C: 032] mariadb: Enable barracuda in some more hosts [puppet] - 10https://gerrit.wikimedia.org/r/394542 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [13:29:01] (03CR) 10Rush: [C: 032] toolforge: bastion local throttling [puppet] - 10https://gerrit.wikimedia.org/r/394506 (owner: 10Rush) [13:29:13] (03PS3) 10Rush: toolforge: bastion local throttling [puppet] - 10https://gerrit.wikimedia.org/r/394506 [13:29:29] I was faster than you merging, chasemp :p [13:29:31] (03PS1) 10Filippo Giunchedi: Default to port 9406 as allocated [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394566 [13:29:45] marostegui: :) yes [13:30:09] (03PS1) 10Filippo Giunchedi: Release 0.5 [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394567 [13:30:32] RECOVERY - grafana-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 
200 OK - 14011 bytes in 0.027 second response time [13:30:33] (03CR) 10Filippo Giunchedi: [C: 032] Default to port 9406 as allocated [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394566 (owner: 10Filippo Giunchedi) [13:30:37] (03CR) 10Filippo Giunchedi: [C: 032] Release 0.5 [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394567 (owner: 10Filippo Giunchedi) [13:31:41] 10Operations, 10Traffic, 10media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3802812 (10Aklapper) (adding some related project tags to this task, but currently hard to say which triage ba... [13:32:28] (03PS1) 10Filippo Giunchedi: Update README.md [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394568 [13:32:54] (03CR) 10Filippo Giunchedi: [C: 032] Update README.md [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/394568 (owner: 10Filippo Giunchedi) [13:40:33] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:40:52] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:42:06] 10Operations, 10Phabricator, 10Traffic: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3802828 (10faidon) >>! In T180998#3800621, @demon wrote: > I was thinking hypothetically, if it were possible: would we actually gain anything from an http2 (or h2c even) connection inte... 
[13:43:12] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:43:52] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:44:02] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:44:12] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:45:02] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:45:25] (03CR) 10Reedy: "You've gotta run the script to turn the dat into an updated conf file too" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [13:55:33] 10Operations, 10Traffic, 10media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3802873 (10Sumitsurai) Hi @Aklapper, > Does this problem also happen on the desktop page with your laptop wh... [14:00:39] (03PS1) 10Herron: puppet: cut over all puppet service records to codfw puppet 4 masters [dns] - 10https://gerrit.wikimedia.org/r/394570 (https://phabricator.wikimedia.org/T177254) [14:04:58] 10Operations, 10Traffic, 10media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3802729 (10jcrespo) It is strange- I cannot see changes on traffic patterns of errors it could be a network/is... 
[14:08:32] (03CR) 10Herron: [C: 032] puppet: cut over all puppet service records to codfw puppet 4 masters [dns] - 10https://gerrit.wikimedia.org/r/394570 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:09:49] <_joe_> hey, fyi: herron just switched production in eqiad over to puppet 4 [14:10:05] !log cutting all puppet service records over to codfw puppet 4 masters [14:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:44] \o/ [14:13:07] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-crh-tur: New upstream release [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/393993 (https://phabricator.wikimedia.org/T181465) (owner: 10KartikMistry) [14:13:11] <_joe_> elukey: can you name me a couple analytics machines that are critical? [14:13:25] <_joe_> I think they're the only large thing missing in codfw [14:13:34] <_joe_> oh and akosiaris you might wanna check OTRS [14:13:48] (03PS1) 10Filippo Giunchedi: hieradata: enable nfs/mountstats collectors for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/394571 (https://phabricator.wikimedia.org/T177196) [14:15:32] -15,45 * * * * root /usr/local/sbin/puppet-run > /dev/null 2>&1 [14:15:32] +10,40 * * * * root /usr/local/sbin/puppet-run > /dev/null 2>&1 [14:15:40] this changed _joe_ ^ [14:15:43] any idea why ? [14:16:24] Notice: /Stage[main]/Base::Monitoring::Host/Monitoring::Host[mendelevium]/Nagios_host[mendelevium]/ensure: created [14:16:26] wait what ? [14:16:28] <_joe_> akosiaris: yeah they fixed fqdn_rand() [14:16:30] node name encoding changed in puppet 4 which causes random timings and things like monitoring storeconfigs to refresh [14:16:34] <_joe_> don't worry akosiaris [14:16:38] <_joe_> that's expected [14:16:45] why ? 
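Note on the cron diff above: the minutes moved because the per-host "random" schedule is derived from a host label, and the label changed. A minimal Python sketch of the idea follows. The hashing scheme, the function names and the example labels are illustrative assumptions, not Puppet's actual fqdn_rand() implementation.

```python
import hashlib


def host_offset(label, period):
    """Map a host label to a stable offset in the range [0, period)."""
    digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return int(digest, 16) % period


def cron_minutes(label, period=30):
    """Minutes of the hour at which a job fires, spaced `period` minutes apart."""
    first = host_offset(label, period)
    return [first + k * period for k in range(60 // period)]


if __name__ == "__main__":
    # Hypothetical labels: the same machine under two naming conventions.
    for label in ("host1", "host1.example.org"):
        print(label, cron_minutes(label))
```

The schedule is stable for a given label, so any change in how the label is built or encoded can shift every host's minutes at once, without any change to the cron definition itself.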
[14:16:47] <_joe_> the way puppet labels hosts changed [14:16:58] <_joe_> so it will add those definitions and the old ones disappear [14:17:13] <_joe_> it's funny how everyone freaks out the first time they see that :P [14:17:16] what definitions ? this is meant to be an exported resource [14:17:17] <_joe_> I did the same [14:17:21] <_joe_> yes [14:17:25] and instead gets created on the host ? [14:17:29] <_joe_> no [14:17:31] _joe_ analytics100[123] [14:17:38] <_joe_> it gets created as an exported resource [14:17:49] akosiaris which host are you looking at, I can send you a diff that might help clarify what changes [14:18:01] (03CR) 10Rush: [C: 031] hieradata: enable nfs/mountstats collectors for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/394571 (https://phabricator.wikimedia.org/T177196) (owner: 10Filippo Giunchedi) [14:18:04] _joe_: I beg to differ [14:18:07] akosiaris@mendelevium:~$ sudo cat /etc/nagios/nagios_host.cfg [14:18:11] herron: ^ [14:18:15] thx [14:18:17] file is there. 
it should NOT be there [14:18:28] it was created Dec 1 14:14 [14:18:59] <_joe_> oh ok, that's not great, I didn't find it on the mw* hosts before [14:19:18] <_joe_> what I can ensure you is that we still generate the correct icinga file [14:20:36] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable nfs/mountstats collectors for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/394571 (https://phabricator.wikimedia.org/T177196) (owner: 10Filippo Giunchedi) [14:20:40] (03PS2) 10Filippo Giunchedi: hieradata: enable nfs/mountstats collectors for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/394571 (https://phabricator.wikimedia.org/T177196) [14:21:11] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: enable nfs/mountstats collectors for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/394571 (https://phabricator.wikimedia.org/T177196) (owner: 10Filippo Giunchedi) [14:22:03] !log upload apertium-crh-tur_0.3.0~r83159-1+wmf1 to apt.wikimedia.org/jessie-wikimedia component main. T181465 [14:22:06] !log awight@tin Started deploy [ores/deploy@532bd0b]: (non-production) Update ORES on new cluster [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] T181465: Update crh-tur Apertium language pair - https://phabricator.wikimedia.org/T181465 [14:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:54] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3802917 (10chasemp) I talked @fgiunchedi into enabling the client collector on tools-bastion-03 in https://gerrit.w... 
[14:23:37] 10Operations, 10Electron-PDFs, 10Design, 10I18n, and 3 others: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200#3802924 (10mobrovac) I tried to change the font in both Beta and production to various values (`Bitstream Charter`, `DejaVu`, `DejaVu Sans`, etc.) and n... [14:24:04] <_joe_> akosiaris: I think I found the issue, btw. *sigh* [14:24:11] please do tell [14:24:11] !log awight@tin Finished deploy [ores/deploy@532bd0b]: (non-production) Update ORES on new cluster (duration: 02m 06s) [14:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:11] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3802930 (10Aklapper) [14:26:37] <_joe_> # This is a hack. We detect if we are running on the scope of an icinga [14:26:45] <_joe_> # host and avoid exporting the resource if yes [14:27:06] <_joe_> if defined(Class['icinga']) {$rtype = 'nagios_host' } else { $rtype = '@@nagios_host' } [14:27:17] (03CR) 10Muehlenhoff: [C: 031] mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [14:28:12] <_joe_> and then create_resources [14:28:24] yup that's me [14:28:29] sounds... 
roughly similar to the hacks I originally wrote for nagios [14:28:46] <_joe_> akosiaris: I'm pretty sure that's what tricks it, but lemme check a bit more [14:28:58] <_joe_> btw, as long as we don't lose alerts, it's ok [14:29:01] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) [14:29:10] I am not sure it's that btw [14:29:14] that's a very clear if [14:29:19] either export or not export [14:29:21] <_joe_> not the if [14:29:31] (03PS6) 10Ema: WIP: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [14:29:33] <_joe_> I think create_resources can have changed [14:29:37] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add reporter script [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:29:39] <_joe_> so I have to check that [14:29:41] ah [14:30:46] <_joe_> akosiaris: or in some form, our catalogs in v4 contain the class icinga somewhere, or refer to it, dunno [14:30:51] which version did we update to ? [14:30:52] 10Operations, 10ORES, 10Scoring-platform-team: Problem with Redis server configuration on new ORES cluster - https://phabricator.wikimedia.org/T181806#3802940 (10awight) [14:31:01] 4.8 ? [14:31:03] (03PS1) 10Jcrespo: mariadb: Undeploy db2092, use db1085 for s1 (remove s3 special slaves) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) [14:31:05] <_joe_> 4.8.2 [14:31:10] I just saw this https://tickets.puppetlabs.com/browse/PUP-6698 [14:31:13] 4.10.2 [14:31:29] <_joe_> I was about to point you to the same issue :P [14:31:41] <_joe_> so yeah, it's easy to solve. it just sucks [14:31:52] this was working just fine [14:32:00] what on earth did they do ? 
[14:32:10] <_joe_> https://i.imgur.com/iZcUNxH.mp4 [14:34:07] https://puppet.com/docs/puppet/4.5/release_notes.html [14:34:19] they still got a sense of humour at least [14:35:23] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:35:27] <_joe_> akosiaris: so this is interesting [14:35:29] Can somebody from puppet please have a look at T169450 and its related patch-for-review? Thanks. [14:35:29] T169450: Redirect several wikis - https://phabricator.wikimedia.org/T169450 [14:35:38] <_joe_> herron: can you check dataset1001? [14:36:03] <_joe_> akosiaris: the resources still result as exported, according to puppetdb [14:36:25] sure [14:36:30] <_joe_> so while that's correctly reported, somehow it ends up realizing the resources anyways? [14:37:23] 10Operations, 10monitoring, 10Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3802968 (10awight) One more little glitch, the `{cluster="scb"}` list doesn't include the codfw nodes. [14:38:08] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3802973 (10awight) [14:38:10] 10Operations, 10monitoring, 10Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3802972 (10awight) 05Resolved>03Open [14:39:01] <_joe_> akosiaris: what about we export everything? [14:39:21] <_joe_> after all, this distinction doesn't have a lot of use [14:39:34] it actually does [14:39:40] <_joe_> what exactly? 
[14:39:45] it's there because if both icinga hosts export all resources [14:39:50] you get duplicate definitions [14:39:55] <_joe_> oh right [14:40:04] (03PS1) 10Herron: puppet: fix rsyncd.conf.media template to parse under puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394577 (https://phabricator.wikimedia.org/T177254) [14:40:06] <_joe_> ok then, I'll need to duplicate data I guess [14:40:13] the other solution of course is killing the second icinga host [14:40:18] but that's a step backwards [14:40:23] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:40:24] <_joe_> nah [14:40:30] <_joe_> the second icinga host is useful [14:41:34] <_joe_> so I'd rather do one of the following: 1 - deduplicate at the level of our script for creating the nagios defs 2 - just copy/paste a lot of data :P [14:42:53] deduplicating on the script would work, but it would only hide the problem under a carpet [14:43:02] (03CR) 10Herron: [C: 032] puppet: fix rsyncd.conf.media template to parse under puppet 4 [puppet] - 10https://gerrit.wikimedia.org/r/394577 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:43:19] there would still be a number of exported resources in puppetdb that would be returned on every query [14:43:32] <_joe_> yeah I think I can make this as non-horrible as possible [14:44:23] on interesting this I 've noted in the past about create_resources [14:44:28] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3802980 (10hoo) Regarding the appsevers, the canary ones are indeed (mostly) enough. But for the Varnishes, having access to the actual ones would... 
[14:44:55] is it was inconsistent about how the first argument should/could be passed [14:45:20] as in both create_resources('resource') and create_resources(resource) was fine [14:45:40] but it was not accepting create_resources(@@resource) [14:45:43] when the string hack [14:45:46] hence* [14:46:02] <_joe_> akosiaris: or, we could just ignore this, keep those files around for now and they will be gone once we move to a version that contains the patch [14:46:15] <_joe_> I'm a bit unsure tbh [14:46:30] I am not even sure that issues is related to bh [14:46:44] it's a convoluted mess. it talks about puppetserver and puppet agent [14:46:56] and puppet aget 1.6.1 ... [14:47:06] <_joe_> I'm pretty sure it is [14:47:35] <_joe_> lemme try to ensure that's the case [14:49:20] the fun part is the guy that says that using the usual syntax works just fine [14:49:34] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3720398 (10mark) >>! In T179317#3788865, @ArielGlenn wrote: > After chat with @hoo in irc, here's the specific list of needs: > > - strace and t... [14:49:36] so we could fallback to a lot of copy/paste indeed [14:49:42] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3802985 (10awight) [14:50:28] <_joe_> akosiaris: just confirmed [14:51:05] (03CR) 10Marostegui: "Commit message says db1085:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [14:53:50] _joe_: I am guessing via https://tickets.puppetlabs.com/browse/PUP-6698?focusedCommentId=443162&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-443162 ? 
[14:54:14] <_joe_> no I'm testing the code [14:54:21] <_joe_> that I wrote [14:54:29] <_joe_> but yeah that 's the same [14:55:00] https://tickets.puppetlabs.com/browse/PUP-7541 [14:55:02] mama mia [14:55:30] (03PS2) 10Jcrespo: mariadb: Undeploy db2092, use db2085 for s1 (remove s3 special slaves) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) [14:56:02] (03CR) 10Muehlenhoff: [C: 031] Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) (owner: 10Legoktm) [14:56:04] <_joe_> akosiaris: https://phabricator.wikimedia.org/P6415 [14:56:23] <_joe_> akosiaris: waaat [14:57:01] (03CR) 10Marostegui: [C: 031] mariadb: Undeploy db2092, use db2085 for s1 (remove s3 special slaves) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [14:57:40] _joe_: it's been 3 years now that I am fearing we might be forced to move away from puppet [14:57:46] eventually that is [14:57:54] akosiaris: "I've been cheated by you since I don't kno-o-w when..." [14:57:59] <_joe_> akosiaris: basically their idea is that on the monitoring host you do a puppet query, then loop through all the data you gather, and apply the corresponding define [14:58:13] <_joe_> https://s-media-cache-ak0.pinimg.com/564x/2d/6e/e8/2d6ee8c39c1dd4f8af6f8278127e7f47.jpg again [14:58:30] "So I've made up my mind it must come to an end" [14:58:35] lol [15:00:21] <_joe_> akosiaris: maybe we should kindly offer our prespective as users? [15:00:36] when has that mattered for puppetlabs ? [15:00:45] <_joe_> well, one can try [15:00:53] sure.. 
go ahead :P [15:01:26] 10Operations, 10monitoring, 10Scoring-platform-team (Current): Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3803010 (10akosiaris) 05Open>03Resolved Yeah that's because our prometheus per machine stats is per DC, not global. Anyway, simples... [15:01:27] <_joe_> ok so, what do you think? copy/pasta? [15:01:29] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3803012 (10akosiaris) [15:01:47] you two should go to the ConfigMgmtCamp in Ghent ;) [15:03:08] (03CR) 10Jcrespo: [C: 032] mariadb: Undeploy db2092, use db2085 for s1 (remove s3 special slaves) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [15:03:24] (03CR) 10jenkins-bot: mariadb: Undeploy db2092, use db2085 for s1 (remove s3 special slaves) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394573 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [15:03:56] _joe_: damn it's a lot of copy/pasta lemme try something [15:04:58] <_joe_> akosiaris: akosiaris ahahahahah [15:05:03] <_joe_> I found the way to trick it [15:05:13] ? [15:06:24] <_joe_> https://phabricator.wikimedia.org/P6415#35884 [15:07:06] lol [15:07:33] <_joe_> so lemme see if I can reduce the amount of copy/pasta there [15:07:53] <_joe_> not sure tbh [15:08:30] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803018 (10Halfak) Well.. I've had my github account locked, so I'm working on experimenting with gitlab. I've completed the upload of LFS'd content... 
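Note on option 1 from the exchange above (deduplicating in the script that creates the nagios definitions): a minimal Python sketch, assuming each definition is a dict with a `host_name` key. The data shape and the function name are hypothetical; the real script is not shown in this log.

```python
def dedupe_host_definitions(definitions):
    """Keep the first definition seen for each host_name.

    Returns (unique_definitions, dropped_count). The input is left untouched,
    so the duplicates still exist in whatever store produced them.
    """
    seen = {}
    dropped = 0
    for definition in definitions:
        name = definition["host_name"]
        if name in seen:
            dropped += 1
            continue
        seen[name] = definition
    return list(seen.values()), dropped
```

As noted in the discussion, this only hides the problem: the duplicate exported resources would still be stored and returned on every query. Reporting `dropped` at least keeps that visible.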
[15:09:38] _joe_: the other way to solve this would be to have the "passive" server not export "local" definitions [15:09:45] and export everything only from the slave [15:09:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Undeploy db2092, use db2085 for s1 (duration: 00m 45s) [15:09:51] but this is full of races [15:09:57] <_joe_> akosiaris: yeah, let's not [15:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:02] !log installing libxcursor security updates on trusty [15:14:09] !log gehel@tin Started deploy [kartotherian/deploy@df7ebff]: testing new kartotherian packaging on maps-test2003 [15:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:12] !log jynus@tin Synchronized wmf-config/db-codfw.php: Undeploy db2092, use db2085 for s1 (duration: 00m 45s) [15:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:29] !log gehel@tin Finished deploy [kartotherian/deploy@df7ebff]: testing new kartotherian packaging on maps-test2003 (duration: 00m 20s) [15:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:11] !log installing ffmpeg security updates [15:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:16] !log installing nspr security updates on trusty [15:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:20] !log bounce uwsgi on labmon1001 - stuck [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:22] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.011 second response time [15:36:46] (03PS1) 10Alexandros Kosiaris: Revert "user homes: Allow git to control +x for $HOME files" [puppet] - 10https://gerrit.wikimedia.org/r/394585 [15:37:03] (03PS2) 10Alexandros 
Kosiaris: Revert "user homes: Allow git to control +x for $HOME files" [puppet] - 10https://gerrit.wikimedia.org/r/394585 [15:37:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "user homes: Allow git to control +x for $HOME files" [puppet] - 10https://gerrit.wikimedia.org/r/394585 (owner: 10Alexandros Kosiaris) [15:37:46] (03PS3) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [15:37:53] (03PS1) 10Alexandros Kosiaris: Revert "Revert "user homes: Allow git to control +x for $HOME files"" [puppet] - 10https://gerrit.wikimedia.org/r/394586 [15:39:29] (03CR) 10jerkins-bot: [V: 04-1] Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (owner: 10Muehlenhoff) [15:39:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "user homes: Allow git to control +x for $HOME files"" [puppet] - 10https://gerrit.wikimedia.org/r/394586 (owner: 10Alexandros Kosiaris) [15:40:10] (03PS1) 10Jcrespo: Set db2085 with s1 and s8; make db2092 pending to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/394588 (https://phabricator.wikimedia.org/T170662) [15:41:02] (03PS1) 10Jcrespo: maridb: Update 10.1 package to 10.1.29-1 [software] - 10https://gerrit.wikimedia.org/r/394589 [15:49:21] (03CR) 10Jcrespo: [V: 032 C: 032] "10.1 for jessie has not been uploaded, not sure if I want to spend time on that..." 
[software] - 10https://gerrit.wikimedia.org/r/394589 (owner: 10Jcrespo) [15:49:49] (03PS2) 10Jcrespo: Set db2085 with s1 and s8; make db2092 pending to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/394588 (https://phabricator.wikimedia.org/T170662) [15:51:03] (03CR) 10Jcrespo: [C: 032] Set db2085 with s1 and s8; make db2092 pending to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/394588 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [15:51:27] (03PS1) 10Giuseppe Lavagetto: monitoring: workaround puppet 4.x bug with created_resources [puppet] - 10https://gerrit.wikimedia.org/r/394590 [15:51:42] <_joe_> akosiaris: this is an attempt at solving the issue ^^ [15:51:46] (03PS1) 10Jcrespo: Remove db2092; setup db1085 with s1 and s8 [software] - 10https://gerrit.wikimedia.org/r/394591 (https://phabricator.wikimedia.org/T170662) [15:51:52] <_joe_> care to take a look? I'd merge it on mondays anyways [15:52:01] <_joe_> monday, just one [15:52:12] (03CR) 10jerkins-bot: [V: 04-1] monitoring: workaround puppet 4.x bug with created_resources [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [15:52:18] <_joe_> heh, see? [15:53:03] (03CR) 10Jcrespo: [C: 032] Remove db2092; setup db1085 with s1 and s8 [software] - 10https://gerrit.wikimedia.org/r/394591 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [15:53:10] <_joe_> anyways, that's about the concept, first of all [15:53:55] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T181028#3803161 (10Cmjohnson) @fgiunchedi The disk has been swapped, you will probably need to add it back. 
[15:54:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Premise looks good, minor nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [15:55:48] (03CR) 10Giuseppe Lavagetto: monitoring: workaround puppet 4.x bug with created_resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [15:56:32] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [15:56:59] (03CR) 10Jcrespo: "arg, this was db2085 :-(" [software] - 10https://gerrit.wikimedia.org/r/394591 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [15:58:21] _joe_: typo in commit message: create_respources ;) [16:04:31] <_joe_> volans: yeah it's not the only one [16:04:48] <_joe_> and I'm not gonna merge that change, that has the potential to be catastrophic, now [16:10:46] (03PS1) 10Jcrespo: mariadb: Set db2092 as spare explicity [puppet] - 10https://gerrit.wikimedia.org/r/394596 (https://phabricator.wikimedia.org/T170662) [16:11:54] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3803203 (10akosiaris) A quick look at graphs for various ores hosts[1] and tin [2] does not show anything network related. A look... 
[16:12:42] (03CR) 10Jcrespo: [C: 032] mariadb: Set db2092 as spare explicity [puppet] - 10https://gerrit.wikimedia.org/r/394596 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [16:12:54] (03PS7) 10Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [16:12:56] (03PS1) 10Ema: mtail: add varnishmtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) [16:13:13] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3803208 (10Dzahn) @jcrespo Did it happen on all hosts or just a few? Do you recall which host the paste above is from? I am surprised since this same patch got applied on a ton of oth... [16:13:25] (03CR) 10jerkins-bot: [V: 04-1] varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [16:14:22] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3803209 (10jcrespo) All that I let puppet run-- I reverted quickly. At least 3 or four of them failed. [16:18:03] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3803226 (10jcrespo) db2023 and db2029? Maybe it fails on ubuntu hosts? 
[16:18:20] (03PS8) 10Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [16:24:10] !log starting cassandra bootstrap, restbase1012-a -- T179422 [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:19] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [16:25:48] RECOVERY - Check systemd state on restbase1012 is OK: OK - running: The system is fully operational [16:26:08] RECOVERY - cassandra-a service on restbase1012 is OK: OK - cassandra-a is active [16:26:09] RECOVERY - cassandra-a SSL 10.64.32.202:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-a valid until 2018-08-17 16:11:12 +0000 (expires in 258 days) [16:26:59] (03CR) 10Zoranzoki21: [C: 031] Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [16:27:49] (03CR) 10Zoranzoki21: [C: 031] Remove Arcanist configuration files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394544 (owner: 10Hashar) [16:31:34] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission niobium - https://phabricator.wikimedia.org/T181763#3803303 (10mark) Go ahead. 
:) [16:34:53] !log stopping db2092 to clone s1 to db2085 [16:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:07] (03PS2) 10Zoranzoki21: Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad) [16:37:30] (03PS2) 10Giuseppe Lavagetto: monitoring: workaround puppet 4.x bug with created_resources [puppet] - 10https://gerrit.wikimedia.org/r/394590 [16:37:37] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3803312 (10fgiunchedi) I tried jmx_exporter on deployment-logstash2 with the results below. A few notes: the exporter config needs to be somewhere ac... [16:38:30] (03CR) 10jerkins-bot: [V: 04-1] Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad) [16:46:08] PROBLEM - NTP on tools-bastion-03 is CRITICAL: NTP CRITICAL: No response from NTP server [16:46:34] (03CR) 10Zoranzoki21: [C: 031] Add category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393762 (https://phabricator.wikimedia.org/T181503) (owner: 10Jon Harald Søby) [16:49:39] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#3803331 (10debt) 05Open>03Resolved a:03debt [16:51:18] PROBLEM - NTP on tools-bastion-02 is CRITICAL: NTP CRITICAL: No response from NTP server [16:52:32] jynus: ugh @ ganglia removal fail. did it affect all hosts or just a few? can i try on like a single host by myself.. i would pick it by hostname in hiera [16:52:47] it's weird because that same thing worked on a whole bunch of other roles just fine [16:53:06] there must be some extra include somewhere.. 
i will look [16:53:06] try the ones that I told you on ticket [16:53:25] almost sure those failed, not 10% sure others [16:53:25] oh, sorry, i just saw the ticket update right now, just got back online. thank you [16:53:28] will do [16:53:34] but I think all would fail [16:53:42] there were es hosts, which are jessie [16:53:47] ok [16:53:50] but I may have misremembered [16:53:55] anyway, no harm [16:54:05] i usually did it by role class, just in this case i thought the regex was the best.. it wasn't :) [16:54:08] and it was a good thing I tested codfw first :-) mutante [16:54:09] ok! cool [16:54:32] next time let's do that, too [16:55:19] yea, first one host, then one role, then one dc, then the other dc, or so :) [16:59:08] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3803350 (10Cmjohnson) - Powered down host - Removed disk shelves and rebooted - Server powered on and was able to ssh to server w/out issue - Powered off server - Powered server on w/disk shelves re...
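Note on the rollout order agreed above ("first one host, then one role, then one dc, then the other dc"): a minimal Python sketch of that staging. The inventory format, host names and function name are hypothetical, chosen only to illustrate the ordering.

```python
def rollout_stages(hosts, canary, first_dc):
    """Split an inventory into rollout stages.

    hosts: dict mapping host name -> (role, datacenter)
    canary: the single host that gets the change first
    first_dc: the datacenter to finish before touching any other
    """
    canary_role, _ = hosts[canary]
    done = {canary}
    stages = [[canary]]

    def take(predicate):
        # Select every remaining host matching the predicate, in a stable order.
        batch = sorted(
            name for name, (role, dc) in hosts.items()
            if name not in done and predicate(role, dc)
        )
        done.update(batch)
        return batch

    stages.append(take(lambda role, dc: role == canary_role and dc == first_dc))
    stages.append(take(lambda role, dc: dc == first_dc))
    stages.append(take(lambda role, dc: True))
    return stages
```

Each stage is a list of hosts to change and verify before moving on. A failure at the single-host stage stops the change before it reaches a whole role or datacenter.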
[16:59:32] !log awight@tin Locking from deployment [ores/deploy]: Don't deploy while we're messing with git-lfs (planned duration: 60m 00s) [16:59:40] !log awight@tin Unlocked for deployment [ores/deploy]: Don't deploy while we're messing with git-lfs (duration: 00m 07s) [16:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:55] PROBLEM - NTP on tools-bastion-05 is CRITICAL: NTP CRITICAL: No response from NTP server [17:00:03] !log awight@tin Locking from deployment [ores/deploy]: Don't deploy while we're messing with git-lfs (planned duration: -1m 59s) [17:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:12] !log awight@tin Locking from deployment [ores/deploy]: Don't deploy while we're messing with git-lfs (planned duration: 16666666666m 39s) [17:00:17] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/9112/ tells the tale that the only change is the addition of the new resources, which seems" [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [17:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:26] !log awight@tin Unlocked for deployment [ores/deploy]: Don't deploy while we're messing with git-lfs (duration: 00m 14s) [17:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:39] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803356 (10Halfak) I'm working on updating https://phabricator.wikimedia.org/source/editquality to pull from gitlab and I'm getting ``` Error updating w... 
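Note on the odd lock durations logged above ("-1m 59s" and "16666666666m 39s"): one reading consistent with all four logged values is that a number of seconds is split into minutes and seconds with floor division. The Python sketch below is a hypothetical reconstruction, not the deployment tool's code, and the inputs of -1 and 999999999999 seconds are guesses that happen to reproduce the log.

```python
def format_duration(total_seconds):
    """Format seconds as 'MMm SSs' using floor division (divmod)."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return "%02dm %02ds" % (minutes, seconds)


def format_duration_checked(total_seconds):
    """Same formatting, but reject negative durations up front."""
    if total_seconds < 0:
        raise ValueError("duration must not be negative")
    return format_duration(total_seconds)


if __name__ == "__main__":
    print(format_duration(3600))          # 60m 00s
    print(format_duration(7))             # 00m 07s
    print(format_duration(-1))            # -1m 59s
    print(format_duration(999999999999))  # 16666666666m 39s
```

Under floor division, -1 second becomes -1 minute plus 59 seconds, which is why the negative value reads so strangely. Validating the input before formatting, as in the second function, would avoid that output.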
[17:02:34] PROBLEM - SSH on tools-elastic-03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:02:34] PROBLEM - SSH on tools-k8s-etcd-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:44] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [17:04:20] ewwww [17:04:21] <_joe_> andrewbogott, herron can you clean up ^^ [17:04:25] <_joe_> please :) [17:04:28] PROBLEM - SSH on tools-proxy-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:39] I am, will take me a minute though [17:04:45] <_joe_> thanks [17:04:52] andrewbogott do you need a hand? [17:04:58] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: workaround puppet 4.x bug with created_resources [puppet] - 10https://gerrit.wikimedia.org/r/394590 (owner: 10Giuseppe Lavagetto) [17:05:19] (03PS1) 10Eevans: Enable Cassandra instance: restbase1012-b [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) [17:05:21] (03PS1) 10Eevans: Enable Cassandra instance: restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) [17:05:22] <_joe_> andrewbogott: I am disabling puppeet on einsteinium for the moment [17:05:44] ok [17:06:08] a cleanup script is running now, after it finishes a puppet run on einsteinium should tidy up everything [17:06:12] ugh, what a mess [17:06:35] <_joe_> andrewbogott: einsteinium has puppet disabled, please DO NOT enable it [17:07:38] (03PS2) 10Eevans: hieradata: enable Cassandra instance: restbase1012-b [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) [17:07:49] (03PS2) 10Eevans: hieradata: enable Cassandra instance: restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) [17:07:49] PROBLEM - SSH on tools-logs-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:24] (03CR) 10Eevans: [C: 04-1] "Do 
not yet merge; Waiting on the bootstrap of 1010-a" [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [17:08:55] okay [17:09:00] so windows didn't get better [17:09:21] _joe_: ok, I've cleaned up all those spurious entries so next time the icinga config is refreshed on einsteinium things should get cleaned up. Shall I leave that step to you? [17:09:34] <_joe_> yes please [17:09:38] PROBLEM - SSH on tools-elastic-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:41] (03CR) 10Eevans: [C: 04-1] "Do not yet merge; Waiting on the completion of 1010-a, then 1010-b (via r394602)." [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [17:10:53] (03CR) 10Zoranzoki21: [C: 031] Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [17:11:19] PROBLEM - SSH on tools-k8s-master-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:07] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3803394 (10Cmjohnson) I left the server up with both disk shelves fully attached. 
[17:14:56] (03CR) 10Zoranzoki21: [C: 031] robots.txt: Remove old and disabled archive.org_bot rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358171 (https://phabricator.wikimedia.org/T7582) (owner: 10Framawiki) [17:14:59] PROBLEM - SSH on tools-k8s-etcd-03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:01] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [17:18:28] PROBLEM - SSH on tools-k8s-etcd-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:28] PROBLEM - SSH on tools-package-builder-01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:05] (03PS1) 10Giuseppe Lavagetto: boron: reenable notifications [puppet] - 10https://gerrit.wikimedia.org/r/394604 [17:20:18] PROBLEM - SSH on tools-docker-builder-05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:18] PROBLEM - SSH on tools-proxy-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:29] PROBLEM - SSH on tools-elastic-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:44] (03CR) 10Giuseppe Lavagetto: [C: 032] boron: reenable notifications [puppet] - 10https://gerrit.wikimedia.org/r/394604 (owner: 10Giuseppe Lavagetto) [17:22:05] (03PS3) 10Zoranzoki21: Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [17:22:46] (03CR) 10jerkins-bot: [V: 04-1] Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [17:28:18] (03PS8) 10Muehlenhoff: mediawiki: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 [17:30:28] PROBLEM - NTP on tools-cron-01 is CRITICAL: NTP CRITICAL: No response from NTP server [17:30:29] PROBLEM - NTP on tools-exec-1407 is CRITICAL: NTP CRITICAL: No response from NTP server 
[17:30:29] PROBLEM - NTP on tools-exec-1418 is CRITICAL: NTP CRITICAL: No response from NTP server [17:30:29] PROBLEM - NTP on tools-exec-1429 is CRITICAL: NTP CRITICAL: No response from NTP server [17:30:29] PROBLEM - NTP on tools-exec-1440 is CRITICAL: NTP CRITICAL: No response from NTP server [17:30:29] PROBLEM - NTP on tools-webgrid-generic-1402 is CRITICAL: NTP CRITICAL: No response from NTP server [17:30:29] PROBLEM - NTP on tools-webgrid-lighttpd-1409 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:09] PROBLEM - NTP on tools-exec-1403 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:09] PROBLEM - NTP on tools-exec-1414 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:09] PROBLEM - NTP on tools-exec-1425 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:09] PROBLEM - NTP on tools-exec-1436 is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:09] PROBLEM - NTP on tools-grid-shadow is CRITICAL: NTP CRITICAL: No response from NTP server [17:32:18] PROBLEM - NTP on tools-webgrid-lighttpd-1405 is CRITICAL: NTP CRITICAL: No response from NTP server [17:33:24] !log stopped ircecho on einsteinium [17:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:10] !log phab2001 - restarted apache [17:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:09] (03PS1) 10Herron: puppet: revert puppet agents back to eqiad puppet 3 masters [dns] - 10https://gerrit.wikimedia.org/r/394613 (https://phabricator.wikimedia.org/T177254) [17:53:23] 10Operations, 10ops-eqiad, 10DC-Ops: decommission rcs1001/1002 - https://phabricator.wikimedia.org/T181825#3803505 (10Cmjohnson) [17:53:24] <_joe_> I'm running puppet on einsteinium [17:54:03] (03CR) 10Herron: [C: 032] puppet: revert puppet agents back to eqiad puppet 3 masters [dns] - 10https://gerrit.wikimedia.org/r/394613 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [17:56:26] Hey all - I've got 
a backport to do quickly for UploadWizard campaigns - should be really fast, just wanted to make sure nobody is surprised by it. no_justification has given the OK for the extraordinary circumstances. [17:58:47] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803540 (10demon) >>! In T181678#3798554, @Halfak wrote: > Trying start a gerrit review for wheels. Got this: > > ``` > Do you really want to submit t... [17:59:46] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803543 (10demon) >>! In T181678#3803356, @Halfak wrote: > I'm working on updating https://phabricator.wikimedia.org/source/editquality to pull from git... [18:01:30] 10Operations, 10Electron-PDFs, 10Design, 10I18n, and 3 others: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200#3803545 (10Jdlrobson) Is it possible that it's using the font's defined in the article css? I don't actually know how Electron decides which font to use... [18:05:02] (03PS1) 10Jcrespo: mariadb: Pool db2085:3311 (s1) after being moved from db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394615 (https://phabricator.wikimedia.org/T178359) [18:06:02] Oh no...I did a no_justification... [18:06:27] !log marktraceur@tin Synchronized php-1.31.0-wmf.10/extensions/UploadWizard/resources/controller/uw.controller.Deed.js: (no justification provided) (duration: 00m 46s) [18:06:32] Sigh [18:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:40] My shame is immense [18:06:43] Hehe [18:07:01] I wanna make that a list of a couple of shameful things [18:08:49] Anyway, it looks like UploadWizard is still running, and we didn't touch anything but JavaScript for UW, so I think we are done here. 
Thanks no_justification, sorry I let you down [18:09:51] don't let me don't let me don't let me down. Don't let me dowwwwwnnnnnn [18:10:16] (03CR) 10Jcrespo: [C: 031] "This can go any time, host is up, and replication catched up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394615 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [18:11:10] (03PS1) 10Volans: Created Django project [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394618 (https://phabricator.wikimedia.org/T167504) [18:11:12] (03PS1) 10Volans: Created Django apps [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394619 (https://phabricator.wikimedia.org/T167504) [18:11:14] (03PS1) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [18:11:16] (03PS1) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [18:18:01] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db2085 after s1 reimport [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) [18:19:09] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db2085 after s1 reimport [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) [18:20:50] 10Operations, 10Wikimedia-log-errors: "internal_api_error_MWException: [dbf916b7] Exception Caught: Could not acquire lock for" for some uploads (during upload with Pywikibot OAuth) - https://phabricator.wikimedia.org/T129621#3803606 (10demon) [18:21:12] <_joe_> !log restarting apache2 on the codfw puppetmasters [18:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:15] !log Phabricator: restarting Apache for php-curl update [18:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:06] phab is back, downtime about 10 seconds [18:24:56] (03PS2) 
10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [18:24:58] (03PS2) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [18:25:42] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:26:02] (03PS1) 10Dzahn: diadem/dysprosium: introduce skeleton role [puppet] - 10https://gerrit.wikimedia.org/r/394624 [18:26:04] (03PS1) 10Dzahn: wmcs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/394625 [18:28:01] (03PS4) 10Imarlier: webperf.py: Handle oversamples differently than regular samples [puppet] - 10https://gerrit.wikimedia.org/r/394375 (https://phabricator.wikimedia.org/T181413) [18:31:41] (03CR) 10Jcrespo: [C: 031] "This can go any time, before https://gerrit.wikimedia.org/r/394615 preferably." [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [18:31:43] (03CR) 10Imarlier: "Added unit tests for both the positive and the negative case of isOversample." 
[puppet] - 10https://gerrit.wikimedia.org/r/394375 (https://phabricator.wikimedia.org/T181413) (owner: 10Imarlier) [18:33:26] (03CR) 10Chad: [C: 032] Remove Arcanist configuration files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394544 (owner: 10Hashar) [18:34:53] (03Merged) 10jenkins-bot: Remove Arcanist configuration files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394544 (owner: 10Hashar) [18:35:53] (03PS2) 10Dzahn: diadem/dysprosium: introduce skeleton role [puppet] - 10https://gerrit.wikimedia.org/r/394624 (https://phabricator.wikimedia.org/T169566) [18:36:58] (03CR) 10jenkins-bot: Remove Arcanist configuration files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394544 (owner: 10Hashar) [18:38:22] (03PS2) 10Dzahn: wmcs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/394625 [18:39:20] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:39:32] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3803704 (10RobH) So to confirm you had the host powered up without shelves, and it was fine. Then adding them back also resulted it in detecting and requiring no further actions? If so, that is ideal.... [18:41:40] (03CR) 10Rush: "please run this through the puppet compiler yes, and we'll be at an offsite next week if you can hold off" [puppet] - 10https://gerrit.wikimedia.org/r/394625 (owner: 10Dzahn) [18:46:11] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:47:28] (03CR) 10Dzahn: "sure, will do and hold off, no worries" [puppet] - 10https://gerrit.wikimedia.org/r/394625 (owner: 10Dzahn) [18:49:35] (03PS2) 10Dzahn: mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 [18:50:58] (03CR) 10Krinkle: [C: 04-1] "xenon seem secondary to mwlog. Does it have another more appropriate role to tie this to instead? perf-team plans to move (part) of xenon " [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [18:51:11] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:51:42] (03CR) 10Krinkle: [C: 04-1] "the mediawiki/logging/udp2log role seems a better fit." [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [18:52:29] (03CR) 10Chad: [C: 032] Remove timeless inclusion in labs, prod has it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394503 (owner: 10Chad) [18:52:49] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803805 (10Halfak) @demon, right, I'm still not able to push the wheels LFS migration. Can you help us get gitlabs proxied? [18:52:51] (03CR) 10Dzahn: "xenon is the only role on mwlog hosts properly applied with the role keyword. 
there is class { 'role::logging::mediawiki::udp2log': thoug" [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [18:53:51] (03Merged) 10jenkins-bot: Remove timeless inclusion in labs, prod has it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394503 (owner: 10Chad) [18:54:20] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:54:47] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803812 (10bd808) Related: {T143969} [18:55:24] 10Operations, 10Diffusion, 10Gerrit, 10ORES, and 4 others: Add gitlab to proxies/whitelist for mirroring to phabricator - https://phabricator.wikimedia.org/T181835#3803814 (10Halfak) [18:55:40] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:57:10] (03CR) 10jenkins-bot: Remove timeless inclusion in labs, prod has it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394503 (owner: 10Chad) [18:58:03] 10Operations, 10Diffusion, 10Gerrit, 10ORES, and 4 others: Add gitlab to proxies/whitelist for mirroring to phabricator - https://phabricator.wikimedia.org/T181835#3803838 (10Halfak) See also {T143969} [18:59:44] !log demon@tin Synchronized wmf-config/CommonSettings-labs.php: no-op (duration: 00m 46s) [18:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:58] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803864 (10demon) >>! In T181678#3803805, @Halfak wrote: > @demon, right, I'm still not able to push the wheels LFS migration. Can you help us get gitl... 
[19:05:08] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3803891 (10Halfak) @demon, it seems this is a different conversation. We do want to use lfs internally on gerrit for our wheels repository. I've read... [19:05:34] (03CR) 10Krinkle: [C: 04-1] "Perhaps it should become a role, but the thing is, xenon is really just a secondary bundle of things perf-team uses on the mwlog server. T" [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [19:07:35] (03PS2) 10Dzahn: admins: Add yubikey nano key ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/393432 (owner: 10Aaron Schulz) [19:08:03] (03CR) 10Dzahn: [C: 032] "confirmed it's Aaron by him putting the key in his existing home dir on tin" [puppet] - 10https://gerrit.wikimedia.org/r/393432 (owner: 10Aaron Schulz) [19:08:17] (03PS3) 10Dzahn: admins: Add yubikey nano key ssh key for aaron [puppet] - 10https://gerrit.wikimedia.org/r/393432 (owner: 10Aaron Schulz) [19:10:32] (03CR) 10Chad: [C: 032] Remove AdvancedSearch inclusion in beta, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394504 (owner: 10Chad) [19:21:32] (03PS1) 10Chad: Add gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) [19:24:23] (03PS2) 10Addshore: Remove wikidatabuilder [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) [19:24:29] (03CR) 10Addshore: "This is now ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) (owner: 10Addshore) [19:24:33] (03CR) 10Addshore: [C: 031] Remove wikidatabuilder [puppet] - 10https://gerrit.wikimedia.org/r/394291 (https://phabricator.wikimedia.org/T181706) (owner: 10Addshore) [19:35:29] (03CR) 10Halfak: [C: 031] "Looked for typos in the base URL and it looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) (owner: 10Chad) [19:54:12] (03PS3) 10Chad: Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 [19:55:52] (03CR) 10MarcoAurelio: [C: 04-1] "Per Reedy. Working to improve this." [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [19:55:53] (03CR) 10jerkins-bot: [V: 04-1] Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad) [20:01:10] (03CR) 10MarcoAurelio: [C: 04-1] [WIP] puppet: redirect several wikis per LangCom decission (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [20:03:27] !log aaron@tin Synchronized php-1.31.0-wmf.10/includes/libs/objectcache/WANObjectCache.php: f096d0b465b75d - temp logging for statsd spam (duration: 00m 45s) [20:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:10] (03CR) 10Marostegui: [C: 031] mariadb: Reenable notifications on db2085 after s1 reimport [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [20:15:21] !log awight@tin Started deploy [ores/deploy@9afbf14]: (non-production) Test ORES deployment to ores1001 [20:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:27] (03PS4) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) [20:17:24] (03PS1) 10Dzahn: db2023: remove ganglia (test on single host why it failed before) [puppet] - 10https://gerrit.wikimedia.org/r/394647 [20:17:52] !log awight@tin Finished deploy [ores/deploy@9afbf14]: (non-production) Test ORES 
deployment to ores1001 (duration: 02m 31s) [20:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:32] !log awight@tin Started deploy [ores/deploy@9afbf14]: (non-production) Test ORES deployment to ores100* [20:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:42] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/9113/db2023.codfw.wmnet/ hmmm" [puppet] - 10https://gerrit.wikimedia.org/r/394647 (owner: 10Dzahn) [20:24:04] marostegui: i'm just testing the ganglia thing on db2023, that one host to see how it breaks (if it still does) [20:24:37] i see that host is already disabled notifications in icinga [20:24:56] or that is from the issue with exported resources [20:24:56] Reedy, I'd need some help with the puppet change I have pending for review, could you? [20:25:14] What help do you need? [20:25:14] I'm not able to run the script, well, I'm not sure which is for starters [20:25:49] https://github.com/wikimedia/puppet/tree/production/modules/mediawiki/files/apache/sites/redirects [20:25:51] also I'n not sure if I should use override, rewrite or funnel [20:25:52] compile_redirects.rb [20:26:11] That's documentedhttps://github.com/wikimedia/puppet/blob/production/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L4-L48 [20:26:13] $ ./compile_redirects.rb [20:26:13] ../../../../lib/puppet/parser/functions/compile_redirects.rb: line 30: require: command not found [20:26:13] ../../../../lib/puppet/parser/functions/compile_redirects.rb: line 32: module: command not found [20:26:13] ../../../../lib/puppet/parser/functions/compile_redirects.rb: line 33: class: command not found [20:26:14] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L4-L48 [20:26:15] ../../../../lib/puppet/parser/functions/compile_redirects.rb: line 34: attr_accessor: command not found [20:26:17] 
../../../../lib/puppet/parser/functions/compile_redirects.rb: line 36: syntax error near unexpected token `(' [20:26:23] Install ruby? [20:26:24] ../../../../lib/puppet/parser/functions/compile_redirects.rb: line 36: ` def initialize(source)' [20:28:06] sigh [20:28:10] another program [20:32:58] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3804344 (10awight) I just ran scap with `-l "ores1001.*" and deployment went smoothly. However, with scap running in parallel and... [20:34:30] (03CR) 10Dzahn: [C: 032] db2023: remove ganglia (test on single host why it failed before) [puppet] - 10https://gerrit.wikimedia.org/r/394647 (owner: 10Dzahn) [20:34:51] Reedy, and is this the only thing needed to do the wiki redirect, nothing else to do on other repos? [20:34:59] there's no documentations that I could find [20:35:28] I dunno [20:35:33] You're putting wikis into a weird state then [20:35:38] inaccessible via web [20:35:47] But they'll still be referenced in places like centralauth [20:35:59] yes, that's what LangCom wanted [20:36:09] Good for them [20:36:13] Doesn't take care of any technical isues [20:36:15] they're already in closed.dblist [20:36:15] *issues [20:36:35] you're still in time to raise the technical objections [20:36:57] Well, ops aren't just gonna deploy an apache change on a whim anyway [20:36:58] lol [20:37:00] if this is not desirable, please say so (and less work for me! 
:) ) [20:37:35] I think the point is, no one knows exactly how this is all gonna work [20:37:41] T169450 [20:37:41] T169450: Redirect several wikis - https://phabricator.wikimedia.org/T169450 [20:40:33] 10Puppet, 10Wikimedia-Language-setup, 10Patch-For-Review, 10User-MarcoAurelio, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3804370 (10MarcoAurelio) Chatting with @Reedy about this, it seems that this change might not be desirable at all at the ops/technical le... [20:49:44] (03Abandoned) 10MarcoAurelio: [WIP] puppet: redirect several wikis per LangCom decission [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [20:50:23] 10Puppet, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3804421 (10MarcoAurelio) a:05MarcoAurelio>03None [20:53:54] Reedy, I've abandoned the patch and let folks know that this needs a bit of input [20:54:02] s/input/discussion [20:54:07] or whatever [21:03:41] Hey folks. I want to install "git lfs" through puppet, but it's not in the standard repos. See https://github.com/git-lfs/git-lfs/blob/master/INSTALLING.md for discussion of including the "packagecloud" repositories to aid in installation. [21:04:02] How crazy is it to include these repositories along with jessie's standard apt repos? 
[21:04:03] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): Disconnect scoring repos to stop mirroring from GitHub - https://phabricator.wikimedia.org/T181851#3804438 (10awight) [21:04:16] 10Operations, 10ORES, 10Scap, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current): Disconnect scoring repos to stop mirroring from GitHub - https://phabricator.wikimedia.org/T181851#3804454 (10awight) [21:09:39] * halfak is looking into this for tin (prod) and deployment-tin (beta) [21:10:32] 10Operations, 10ORES, 10Scap, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current): Disconnect scoring repos to stop mirroring from GitHub - https://phabricator.wikimedia.org/T181851#3804492 (10Halfak) Done! https://phabricator.wikimedia.org/source/editquality/manage/uris/... [21:12:41] !log db2023 killed gmond (ganglia-monitor) process manually which was still running even though ganglia-monitor package was removed and caused puppet breakage (it seems only on trusty). after that puppet run is clean again and ganglia removed. 
(T177225) (https://gerrit.wikimedia.org/r/#/c/394647/1) [21:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:52] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [21:16:22] (03PS7) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [21:16:24] (03PS5) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [21:34:15] (03PS8) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [21:34:17] (03PS6) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [21:40:02] 10Puppet, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3804597 (10Strainu) Can we get some more details on why such changes would not be desirable? As time passes, chances are that the list of required closures will get bigger. [21:49:00] !log db2029 - removing ganglia-monitor, testing to kill gmond, running puppet to figure out how to cleanly remove it on trusty [21:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:47] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3804622 (10Dzahn) Running the exact same "/usr/bin/apt-get -y -q remove --purge ganglia-monitor" manually on a trusty host (db2029)... works! But when puppet ran it, it failed becaus... 
[22:01:32] 10Puppet, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3804636 (10MarcoAurelio) They told me that we're leaving wikis in a weird state. That is, innacesible via web but still existing and leaving traces at CentralAuth, CentralNotice, etc... [22:30:23] (03PS1) 10MaxSem: Revoke my key while I'm traveling [puppet] - 10https://gerrit.wikimedia.org/r/394718 [22:35:33] RECOVERY - cassandra-a CQL 10.64.32.202:9042 on restbase1012 is OK: TCP OK - 0.000 second response time on 10.64.32.202 port 9042 [22:39:02] (03CR) 10Ayounsi: [C: 031] Revoke my key while I'm traveling [puppet] - 10https://gerrit.wikimedia.org/r/394718 (owner: 10MaxSem) [22:40:04] 10Operations, 10ORES, 10Release-Engineering-Team (Kanban), 10Scoring-platform-team (Current), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3804694 (10mmodell) a:03mmodell [22:47:25] (03CR) 1020after4: [C: 031] Add gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) (owner: 10Chad) [22:49:06] (03PS3) 10Eevans: hieradata: enable Cassandra instance: restbase1012-b [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) [22:49:17] (03CR) 10Eevans: [C: 031] "Ready!" [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [22:49:47] mutante: if you're around, could you do the honors on ^^^ ? 
[23:05:29] urandom: yes, i am now, doing [23:05:42] i already got a heads-up about it :) [23:07:10] (03CR) 10Dzahn: [C: 032] "confirmed that IP resolves to restbase1012-b" [puppet] - 10https://gerrit.wikimedia.org/r/394602 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [23:08:18] (03PS2) 10Dzahn: Revoke my key while I'm traveling [puppet] - 10https://gerrit.wikimedia.org/r/394718 (owner: 10MaxSem) [23:08:43] PROBLEM - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:08:52] (03PS3) 10Dzahn: admins: Revoke Max' key while he's traveling [puppet] - 10https://gerrit.wikimedia.org/r/394718 (owner: 10MaxSem) [23:09:04] (03CR) 10Dzahn: [C: 032] admins: Revoke Max' key while he's traveling [puppet] - 10https://gerrit.wikimedia.org/r/394718 (owner: 10MaxSem) [23:09:13] danke, mutante [23:09:47] de rien, max [23:11:20] confirmed on bast1001/2001 it's removed already [23:11:31] good travels [23:11:34] RECOVERY - Recursive DNS on 208.80.153.51 is OK: DNS OK: 0.189 seconds response time. www.wikipedia.org returns 208.80.153.224 [23:11:53] .. wooh [23:13:52] (03CR) 10Dzahn: "-b has been merged. i know it has to wait, just the rebased needed anyways" [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [23:13:54] (03PS9) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [23:13:56] (03PS7) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [23:13:58] (03PS3) 10Dzahn: hieradata: enable Cassandra instance: restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [23:14:11] mutante: awesome; thank you! 
[23:14:37] :) np [23:15:21] !log starting cassandra bootstrap, restbase1012-b - T179422 [23:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:32] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [23:17:01] (03PS10) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [23:17:03] (03PS8) 10Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) [23:18:00] urandom: by coincidence i have another request. how about 'decom restbase1014' [23:18:10] decom? [23:18:21] just because i am removing ganglia and for me that host is a special case, puppet is disabled [23:18:27] (24510 minutes ago). Puppet is disabled. eevans: decommissioning (T179422) [23:18:45] oh, it's decommissioned, godog has reimaged it, and after 1012, it's next [23:19:11] would something undesirable happen if puppet runs now [23:19:21] umm... i'm not sure [23:19:27] well, then let's not risk it [23:19:36] i will manually kill ganglia stuff from it [23:19:42] (03PS11) 10Andrew Bogott: labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - 10https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) [23:19:50] mutante: looks like the answer is "yes" [23:20:04] heh, ok, i won't touch puppet :) [23:20:07] something undesirable would :) [23:20:38] actually, i'm thinking in this state, it wouldn't run to completion [23:20:39] it's just if i can kill the ganglia things then an entire cluster is gone .. 
and doesn't stay because 1 host is up
[23:20:46] (CR) Andrew Bogott: [C: 2] labsaliaser: handle requests for the simple hostname 'puppet' [puppet] - https://gerrit.wikimedia.org/r/393842 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[23:20:59] because there are systemd units that are masked
[23:21:14] and if, if you unmasked them, then something undesirable would happen :)
[23:21:39] mutante: i see, yeah, hrmm
[23:21:40] ok, *nod*
[23:22:39] go on with your bootstrap, this can wait
[23:22:46] i have enough others
[23:22:52] kk
[23:23:35] fyi, whenever icinga updates, it's going to provision checks for 1012-b that will go red immediately
[23:23:47] in case i don't catch it right away
[23:24:20] yea, no problem
[23:24:56] (since they are not paging ;)
[23:25:28] what determines whether they are paging?
[23:25:51] whether a monitoring::service class has "critical => true" or not
[23:25:52] also, what does "paging" mean specifically?
[23:25:57] sending SMS
[23:26:17] oh ok
[23:27:56] if in modules/restbase/manifests/monitoring.pp you would add "critical => true" to the monitoring::service then there would be SMS (but only to all the ops members)
[23:30:32] it's handled via the contact_group, true adds a special contactgroup sms and that has people in it who have their phone numbers setup in contacts
[23:32:03] (PS9) Andrew Bogott: nova-network dnsmasq: set a deployment-appropriate cname for 'puppet' [puppet] - https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375)
[23:32:05] (PS1) Andrew Bogott: wmcs recursor: allow labtest to override puppetmaster_hostname [puppet] - https://gerrit.wikimedia.org/r/394723
[23:33:17] (CR) Andrew Bogott: [C: 2] wmcs recursor: allow labtest to override puppetmaster_hostname [puppet] - https://gerrit.wikimedia.org/r/394723 (owner: Andrew Bogott)
[23:37:13] (PS1) Dzahn: etcd::networking: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394724
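(editor's note) The paging rule explained between 23:25 and 23:30 (a check pages only when it is marked "critical => true", which adds the special "sms" contact group) can be modelled roughly as below. This is only a sketch in Python, not the actual Puppet code: the function names and the "admins" default group are hypothetical and chosen purely for illustration.

```python
# Rough model of the paging rule described in the log; NOT the real Puppet define.
# The names contact_groups / is_paging and the "admins" group are hypothetical.

SMS_GROUP = "sms"  # the log calls this the special contact group that sends SMS


def contact_groups(base_groups, critical=False):
    """Return the contact groups for a check; critical=True adds the sms group."""
    groups = list(base_groups)  # copy so the caller's list is not modified
    if critical and SMS_GROUP not in groups:
        groups.append(SMS_GROUP)
    return groups


def is_paging(groups):
    """A check 'pages' (sends SMS) only if the sms contact group is attached."""
    return SMS_GROUP in groups


if __name__ == "__main__":
    print(is_paging(contact_groups(["admins"])))                 # non-critical check
    print(is_paging(contact_groups(["admins"], critical=True)))  # critical check
```

Under this model the new restbase1012-b checks mentioned at 23:23 would go red in Icinga without sending SMS, as long as they are not marked critical.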
[23:37:47] (PS2) Dzahn: etcd::networking: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394724
[23:38:22] (CR) Dzahn: [C: 2] etcd::networking: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394724 (owner: Dzahn)
[23:47:31] (PS1) Dzahn: db2028,db2029: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394725
[23:51:53] Operations, Puppet, Wikimedia-Apache-configuration, Mobile, Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#2898930 (EddieGP) > 2. https://wikipedia.org/...
[23:53:23] (PS2) Dzahn: db2028,db2029: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394725
[23:54:27] (CR) Dzahn: "per T177225#3803226 i just didn't want to "spam" the phab ticket more right now" [puppet] - https://gerrit.wikimedia.org/r/394725 (owner: Dzahn)
[23:54:43] (CR) Dzahn: [C: 2] db2028,db2029: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394725 (owner: Dzahn)
[23:57:59] ACKNOWLEDGEMENT - DPKG on db2028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn removing ganglia