[00:00:23] (CR) Dzahn: [C: 2] Gerrit: Remove java 7 package [puppet] - https://gerrit.wikimedia.org/r/327756 (owner: Paladox)
[00:00:29] (PS4) Dzahn: Gerrit: Remove java 7 package [puppet] - https://gerrit.wikimedia.org/r/327756 (owner: Paladox)
[00:04:20] (CR) Dzahn: "confirmed we are already using 8, both are installed but 8 has lower priority. I did _not_ do the manual cleanup part so far. It's unchang" [puppet] - https://gerrit.wikimedia.org/r/327756 (owner: Paladox)
[00:06:30] (CR) Chad: "I'll handle cleanup on cobalt later. I checked the dependency tree and we *should* be fine to prune, but I'll probably wait for now." [puppet] - https://gerrit.wikimedia.org/r/327756 (owner: Paladox)
[00:07:14] Operations, GlobalRename, MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#2888743 (Legoktm) Open>stalled First off, I don't speak Russian so MaxSem helped me translate the initial request, which says "Username too...
[00:08:10] RECOVERY - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.99 port 9042
[00:10:11] (CR) Dzahn: "cool, thanks" [puppet] - https://gerrit.wikimedia.org/r/327756 (owner: Paladox)
[00:21:00] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:24:24] (CR) Eevans: [C: 1] "Last one; Ready!" [puppet] - https://gerrit.wikimedia.org/r/328213 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:24:51] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:28:44] (PS2) Eevans: enable instance restbase1018-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/328213 (https://phabricator.wikimedia.org/T151086)
[00:33:00] (CR) Filippo Giunchedi: [C: 2] enable instance restbase1018-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/328213 (https://phabricator.wikimedia.org/T151086) (owner: Eevans)
[00:45:30] (PS1) Ppchelko: Graphoid: Install all fonts available to mediawiki. [puppet] - https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726)
[00:49:00] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[00:52:50] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[01:01:29] (CR) Yurik: [C: 1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: Ppchelko)
[01:02:22] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused
[01:03:03] got it ^^
[01:03:19] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused eevans Bootstrapping
[01:07:26] (PS2) Dzahn: remove wikisource.cz [dns] - https://gerrit.wikimedia.org/r/327279 (https://phabricator.wikimedia.org/T137105)
[01:08:21] (CR) Dzahn: [C: 2] remove wikisource.cz [dns] - https://gerrit.wikimedia.org/r/327279 (https://phabricator.wikimedia.org/T137105) (owner: Dzahn)
[01:18:04] (PS2) Dzahn: delete unused ppa keys in files/ppa/ [puppet] - https://gerrit.wikimedia.org/r/318451
[01:18:52] (PS3) Dzahn: delete unused ppa keys in files/ppa/ [puppet] - https://gerrit.wikimedia.org/r/318451
[01:24:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[01:34:45] jouncebot: next
[01:34:45] In 310 hour(s) and 25 minute(s): HOLIDAY (observed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170102T0000)
[01:34:57] :)
[01:35:25] 🎄
[01:35:30] !log mira - upgrading php5 packages
[01:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:21] !log tin - upgrading php, imagemagick,..
[01:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:03] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner]
[01:45:51] icinga-wm: check harder
[01:45:55] it's fine
[01:45:58] !log terbium - upgrade imagemagick | wasat - upgrade php5, imagemagick
[01:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:03] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:56:28] Puppet, Labs: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612#2888907 (scfc) @Multichill: Sorry for being too terse. `labs_debrepo` is used inter alia by WDQ in the form of `operations/puppet`'s [[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/module...
[01:56:42] PROBLEM - DPKG on krypton is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[01:57:14] ^ looking
[02:02:42] RECOVERY - DPKG on krypton is OK: All packages OK
[02:02:46] !log copper - upgraded php5* | krypton - installed all pending upgrades
[02:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:38] (PS7) Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597)
[02:07:03] (PS7) Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597)
[02:07:41] interesting how this only fails on some combinations
[02:11:46] !log auth (yubiauth), phab2001 - upgrade php5 | bohrium (piwik) - upgrade ssh,ssl,php5
[02:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:40] !log contint1002/2001, einsteinium/tegmen - upgrade php5 | bromine - upgrade php5, libs
[02:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:20] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 06m 41s)
[02:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:35] !log osmium - upgrade imagemagick, php5 packages | hafnium - upgrade php5
[02:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:09] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 20 02:26:08 UTC 2016 (duration 4m 48s)
[02:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:33] !log fermium (lists) - upgrade apache2 | uranium (ganglia-web) - upgrade apache2, openssh, openssl
[02:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:49]
[02:52:42] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:05:18] (PS1) Chad: Remove --dry-run option from updateBranchPointers [mediawiki-config] - https://gerrit.wikimedia.org/r/328329
[03:07:32] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1183.40 Read Requests/Sec=154.00 Write Requests/Sec=10.50 KBytes Read/Sec=19687.20 KBytes_Written/Sec=338.40
[03:10:12] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:42] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[03:20:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 718.98 seconds
[03:23:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 244.95 seconds
[03:26:32] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=230.70 Read Requests/Sec=254.00 Write Requests/Sec=8.00 KBytes Read/Sec=6835.20 KBytes_Written/Sec=427.20
[03:38:12] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[04:16:32] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3731.30 Read Requests/Sec=4850.30 Write Requests/Sec=38.80 KBytes Read/Sec=19492.00 KBytes_Written/Sec=1299.20
[04:22:32] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=0.00 Write Requests/Sec=7.80 KBytes Read/Sec=0.00 KBytes_Written/Sec=38.40
[04:34:52] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:40:12] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:53:52] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:52] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[05:08:12] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[05:22:52] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:42:16] Operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#2889091 (Legoktm)
[05:45:27] (CR) Tim Landscheidt: [C: -1] "If tools-clush-generator.py does not read the Labs project from /etc/wmflabs-project, but gets it passed as an option argument, this scrip" [puppet] - https://gerrit.wikimedia.org/r/328030 (owner: Tim Landscheidt)
[06:46:53] (Abandoned) Yuvipanda: statistics: Bring in R debs from upstream [puppet] - https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) (owner: Yuvipanda)
[06:49:12] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:49:22] RECOVERY - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.100 port 9042
[07:18:12] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:34:52] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2889231 (Joe) >>! In T147718#2887703, @Andrew wrote: > Thank you for all the rewrites -- I'm happy with the proposal as it stands today. If you have...
[07:40:22] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:51:12] PROBLEM - MegaRAID on db2011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[07:51:14] ACKNOWLEDGEMENT - MegaRAID on db2011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T153740
[07:51:21] Operations, ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2889258 (ops-monitoring-bot)
[07:55:51] Operations, ops-codfw, DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2889265 (Marostegui) p:Triage>Normal
[07:59:12] Operations, ops-codfw, DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2889258 (Marostegui) This is correct, please proceed and change the disk ``` Device Present ================ Virtual Drives : 1 Degraded : 1 Offline :...
[08:08:32] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:27:54] !log renamed some log files ($something.1.gz to $something.1a.gz) on cp1008 and rutherium to unblock logrotation and reduce cronspam - T132324
[08:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:57] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324
[08:31:12] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[08:31:19] (CR) Elukey: "/me hides in the corner of shame :/" [debs/prometheus-apache-exporter] - https://gerrit.wikimedia.org/r/328301 (owner: Filippo Giunchedi)
[08:49:47] (PS1) Urbanecm: [throttle] Rule for Shivaji University, India [mediawiki-config] - https://gerrit.wikimedia.org/r/328338 (https://phabricator.wikimedia.org/T153741)
[08:50:57] Hi all, can anybody with shell access deploy 328338 to the cluster? This is throttle rule for this Friday so it isn't possible to wait for January 2017.
[08:51:06] Or should I ask somebody for approval?
[08:53:38] Urbanecm: I'd say that if it wasn't already whitelisted before the freeze you'd need to get approval, yes
[08:54:15] from who?
[08:54:27] elukey
[08:54:47] probably RelEng, I am checking if hashar is online but I can't find him
[08:55:33] elukey: Heya.
[08:56:09] maybe greg-g, Urbanecm, let's see what he thinks about it
[08:56:12] Revent: o/
[08:56:34] The videoscalers seem to be not having the overloading problem, but they should probably be loaded a bit ‘more’ aggressively.
[08:57:01] They are only trying to run 7-10 tasks at a time.
[08:57:03] oh yes, definitely, joe was playing with some tunables this morning
[08:57:06] elukey, should I reach him somehow?
[08:57:14] Or is something needed from my side?
[08:57:29] I think that the ping on IRC should be enough!
[08:57:34] Okay.
[08:58:18] the freeze guidelines should be written somewhre
[08:58:41] ah there you go https://wikitech.wikimedia.org/wiki/Deployments#Week_of_December_19th
[08:58:46] If you need something deployed please contact Greg (greg-g on IRC or greg@wikimedia). :P
[08:58:53] I’ve been resetting some that had ‘messed up’ statuses on the DB, but I think they were all from me resetting them while running to keep the load down.
[08:59:12] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:59:13] Urbanecm: so I'd suggest to send an email to be sure!
[08:59:29] elukey, I always asked here if I needed out-of-schedule deployment. I'll send greg-g an email.
[09:00:02] Urbanecm: oh yes and you did the right thing, but in this case I believe that Greg will need to decide whether or not we deploy..
[09:00:24] Okay.
[09:00:43] my ":P" was meant to be "I've said the right thing by chance", not anything else :)
[09:01:15] ok
[09:01:23] elukey: https://quarry.wmflabs.org/query/14917 <- that’s a 42MB HD file, it should process fine.
[09:01:40] https://commons.wikimedia.org/wiki/File:Paters_Berchmanianum_verhuizen_naar_pand_in_Brakkenstein.webm
[09:04:12] Revent: asking because I am a bit ignorant - the transcodings are to create multiple files (different resolutions etc..) from the one that you posted?
[09:04:23] *are needed
[09:04:25] Yeah.
[09:05:21] User uploads a 1920x1080 video, we transcode it into lower-res ogv and WebM versions for usability.
[09:05:34] got it thanks!
[09:05:54] Basically just like thumbnailing.
[09:13:52] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:16:25] Operations, Ops-Access-Requests, Analytics-Kanban, Patch-For-Review, User-Elukey: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (elukey) Open>Resolved
[09:17:20] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2889349 (elukey) p:Triage>Normal
[09:28:15] Operations, Analytics-Cluster: stat1004 - sync snakebite version with repo - https://phabricator.wikimedia.org/T153493#2889361 (elukey) snakebite changelog on stat1004 ``` snakebite (2.11.0) unstable; urgency=low Support dfs.client.use.datanode.hostname config property -- Wouter de Bie Operations, Analytics-Cluster: stat1004 - sync snakebite version with repo - https://phabricator.wikimedia.org/T153493#2889365 (elukey) Ah there you go: https://phabricator.wikimedia.org/T152771 :)
[09:30:17] Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2889369 (elukey)
[09:30:19] Operations, Analytics-Cluster: stat1004 - sync snakebite version with repo - https://phabricator.wikimedia.org/T153493#2889371 (elukey)
[09:30:38] (PS3) TTO: Set valid content language for Norwegian wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: Nikerabbit)
[09:31:14] (CR) TTO: [C: 1] "Thiemo said this should be fine to merge" [mediawiki-config] - https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: Nikerabbit)
[09:31:33] Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2859574 (elukey) T153493 reports some issues with the new snakebite version naming :(
[09:34:03] Operations, Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2889377 (elukey) ``` elukey@stat1004:~$ dpkg -l snakebite Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Tri...
[09:41:02] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:43:02] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:44:22] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:44:48] elukey: FYI, the recent ‘increase’ in the queued transcode backlog is simply due to me throwing transcodes that had ‘broken’ statuses (like both a ‘success’ time and an ‘error’ time) back on the queue.
[09:45:01] ack thanks!
[09:47:03] They are (from what I see) all ones that I had previously kicked off the queue with the ‘reset while running’ bug.
[09:49:32] PROBLEM - Disk space on mw1169 is CRITICAL: DISK CRITICAL - free space: /tmp 699 MB (3% inode=99%)
[09:52:39] weird --^
[09:53:37] all transcode tmp files :(
[09:55:11] so on mw116[89] tmp is on a separate partion
[09:55:17] that is not super big
[09:57:06] _joe_ --^
[09:57:24] when I reimaged I did not check that
[09:57:32] PROBLEM - Disk space on mw1169 is CRITICAL: DISK CRITICAL - free space: /tmp 699 MB (3% inode=99%)
[09:58:19] <_joe_> uhm yeah that's a problem
[09:58:32] RECOVERY - Disk space on mw1169 is OK: DISK OK
[09:58:47] checked with lsof and removed some files..
[10:10:02] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[10:10:22] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[10:12:22] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[10:20:56] (PS1) Elukey: Move mw116[89] to a single partition recipe [puppet] - https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488)
[10:23:58] (PS2) Elukey: Move mw116[89] to a single partition recipe [puppet] - https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488)
[10:38:32] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[10:43:14] (Abandoned) Elukey: Add format.topic configuration [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) (owner: Ottomata)
[11:02:39] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2889527 (Gehel) Coming back to this after a long break... So it seems that we need to ad...
[11:05:09] elukey: Yeah, the logic of how transcodes are started definitely needs work, it’s only running 3 atm, but at least it’s not doing the stupid overloading them.
[11:05:42] only 3? wtf? o.O
[11:05:59] sounds quite broken
[11:14:05] zhuyifei1999_: At least small transcodes are not taking 7+ hours (lol)
[11:14:44] yeah that's the good part
[11:14:55] And yeah, with the new servers it should run about 100 at a time.
[11:15:07] ideally, one per cpu
[11:19:09] <_joe_> Revent: are you sure just 3 are really running?
[11:19:15] <_joe_> not what I see
[11:22:08] _joe_: https://commons.wikimedia.org/wiki/Special:TimedMediaHandler
[11:22:45] Which, admittedly, does not reflect actual server load, it’s just (from what I know) based on a DB query.
[11:23:21] <_joe_> well I can tell you we have about 20 jobs running in parallel right now
[11:24:12] <_joe_> jobs that didn't fail or didn't work otherwise
[11:26:06] Yeah, I have not been kicking ‘any’ running transcodes with the reset button, so if there are ones running that don’t show there that’s probably a bug.
[11:26:54] _joe_: Look at 8-25-14-_White_House_Press_Briefing.webm 360P webm.
[11:28:39] I had cleared out this query… https://quarry.wmflabs.org/query/14916 other than the ones that are ‘unfixable’… (there are 35 entries that are due to deleted or renamed files)
[11:28:55] More have been showing up today,
[11:30:16] That one in particular, shows up as both ‘completed successfully’ and ‘errored out’
[11:30:51] <_joe_> Revent: about the issues of the TMH extension, you might want to open a phab task
[11:31:07] <_joe_> I'm not an expert on that and I'm not sure how i could help
[11:31:20] <_joe_> from my POV, the scalers are now working "as expected"
[11:31:51] Yeah, frankly, I’m not technical enough to even attempt to try to say what the actual problems are.
[11:32:09] I can just comment on what I 'see'.
[11:32:50] If tasks are getting run, successfully, and marked as successful, then I am happy.
[11:33:10] <_joe_> I really appreciate all that you're doing, manually, to keep things "almost working"
[11:33:16] <_joe_> and I'm sorry you have to
[11:33:34] <_joe_> I am quite sure the system now works significantly better than yesterday
[11:33:40] Oh, yes.
[11:33:43] Hell yes.
[11:33:52] <_joe_> but the reporting can be screwed up for various reasons [11:34:18] <_joe_> I think there are several tickets already about this problem (how results are reported) [11:35:02] It’s my intent, as things get sorted out, to kick anything with a ‘messed up’ state in the DB back through. [11:37:26] But if they are actually ‘running’, being completed, and being correctly marked as completed, what the report page says is much less important. [11:37:39] <_joe_> also, logging is really unfortunate and not helping me at all [11:37:43] <_joe_> sigh [11:38:23] <_joe_> sorry, I have to follow other topics now [11:45:25] (03CR) 10Yuvipanda: "(1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [11:45:43] (03PS14) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [11:53:07] (03PS15) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [11:53:26] _joe_: If you want, I could easily dump another 1000 ancient broken transcodes on the queue. :P [11:54:03] kidding, but there is a huge backlog of transcodes that broke years ago. [12:03:02] (03PS1) 10Ema: grafana: Provision Varnish Client Status Code dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/328350 [12:04:21] (03PS2) 10Ema: grafana: Provision Varnish Client Status Code dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/328350 [12:04:29] !log running redact_sanitarium.sh on db1069 [12:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:45] (03PS1) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [12:11:42] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:21:02] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:40] (03CR) 10Jcrespo: [C: 031] mariadb: Enable gtid_domain_id - phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/326446 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [12:24:04] (03PS2) 10Ema: varnish: remove varnishprocessor [puppet] - 10https://gerrit.wikimedia.org/r/328180 (https://phabricator.wikimedia.org/T151643) [12:24:49] (03CR) 10Jcrespo: [C: 04-1] "We agreed we will explore other ways." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [12:25:29] (03PS1) 10Mobrovac: Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 [12:26:17] (03CR) 10jerkins-bot: [V: 04-1] Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 (owner: 10Mobrovac) [12:27:38] (03PS2) 10Mobrovac: Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 [12:31:54] (03CR) 10Mobrovac: [C: 04-1] "No bueno - https://puppet-compiler.wmflabs.org/4944/wtp2002.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/328355 (owner: 10Mobrovac) [12:33:04] (03PS3) 10Mobrovac: Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 [12:33:49] (03PS4) 10MarcoAurelio: Removing 'technican' user group from tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326354 (https://phabricator.wikimedia.org/T152911) [12:40:42] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:49:03] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:02:01] (03PS1) 10Ema: 
add new fake entries for digicert-2016-{ecdsa,rsa}-unified.key [labs/private] - 10https://gerrit.wikimedia.org/r/328359 [13:04:43] (03CR) 10Ema: [V: 032 C: 032] add new fake entries for digicert-2016-{ecdsa,rsa}-unified.key [labs/private] - 10https://gerrit.wikimedia.org/r/328359 (owner: 10Ema) [13:14:31] (03CR) 10Marostegui: [C: 04-2] "Wait for the freeze to finish" [puppet] - 10https://gerrit.wikimedia.org/r/326446 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [13:27:29] (03PS2) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [14:04:07] (03CR) 10Alexandros Kosiaris: [C: 032] kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 (owner: 10Alexandros Kosiaris) [14:05:01] (03CR) 10Alexandros Kosiaris: [C: 032] "cherry-picked in tools with yuvi, worked fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/326429 (owner: 10Alexandros Kosiaris) [14:05:08] (03PS7) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [14:05:10] (03CR) 10Giuseppe Lavagetto: [C: 031] Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 (owner: 10Mobrovac) [14:07:09] (03PS1) 10Ema: varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) [14:07:40] (03CR) 10Alexandros Kosiaris: [C: 032] k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 (owner: 10Alexandros Kosiaris) [14:07:45] (03CR) 10Alexandros Kosiaris: [C: 032] "cherry-picked in tools with yuvi, worked fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/326430 (owner: 10Alexandros Kosiaris) [14:07:53] (03PS7) 10Alexandros Kosiaris: k8s::controller: Amend to support more 
than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [14:07:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 (owner: 10Alexandros Kosiaris) [14:08:10] (03CR) 10Alexandros Kosiaris: [C: 032] "cherry-picked in tools with yuvi, worked fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/326441 (owner: 10Alexandros Kosiaris) [14:08:19] (03PS7) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [14:08:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 (owner: 10Alexandros Kosiaris) [14:15:26] (03CR) 10Ema: [C: 032] varnish: remove varnishprocessor [puppet] - 10https://gerrit.wikimedia.org/r/328180 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:15:34] (03PS3) 10Ema: varnish: remove varnishprocessor [puppet] - 10https://gerrit.wikimedia.org/r/328180 (https://phabricator.wikimedia.org/T151643) [14:15:34] jouncebot: next [14:15:34] In 297 hour(s) and 44 minute(s): HOLIDAY (observed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170102T0000) [14:15:40] (03CR) 10Ema: [V: 032 C: 032] varnish: remove varnishprocessor [puppet] - 10https://gerrit.wikimedia.org/r/328180 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:17:40] (03PS4) 10Giuseppe Lavagetto: Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 (owner: 10Mobrovac) [14:20:18] (03CR) 10Giuseppe Lavagetto: [C: 032] Parsoid: Define the mwapi_server and mwapi_proxy config variables [puppet] - 10https://gerrit.wikimedia.org/r/328355 (owner: 10Mobrovac) [14:25:03] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [14:25:43] grr smells like zotero [14:26:29] HTML5 Parser[13693]: segfault at 0 ip 00007fe14bc410e7 sp 00007fe1365fe7c0 error 6 in libmozalloc.so[7fe14bc40000+2000] [14:26:31] grrr [14:33:03] !log mobrovac@tin Starting deploy [parsoid/deploy@1d75b14]: Canary deploy of mwApiServer to wtp[12]001 [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] !log mobrovac@tin Finished deploy [parsoid/deploy@1d75b14]: Canary deploy of mwApiServer to wtp[12]001 (duration: 00m 32s) [14:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:32] PROBLEM - Disk space on mw1168 is CRITICAL: DISK CRITICAL - free space: /tmp 306 MB (1% inode=99%) [14:53:03] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:55:32] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:10] (03PS1) 10Mobrovac: Parsoid: provide the full URI in mwapi_server for codfw [puppet] - 10https://gerrit.wikimedia.org/r/328376 [15:02:09] taking care of mw1168 [15:04:32] RECOVERY - Disk space on mw1168 is OK: DISK OK [15:06:22] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:07:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:08:32] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4115255 keys, up 50 days 6 hours - replication_delay is 38 [15:17:59] (03CR) 10Daniel Kinzler: [C: 04-1] "(1 comment)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [15:23:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Parsoid: provide the full URI in mwapi_server for codfw [puppet] - 10https://gerrit.wikimedia.org/r/328376 (owner: 10Mobrovac) [15:23:33] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:26:10] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 3 others: Provide read-only access to OpenStack APIs from WMF IP space - https://phabricator.wikimedia.org/T150092#2890030 (10Andrew) 05Open>03Resolved I can now do 'openstack project list' on a labs Jessie machine with addition of openstack::c... [15:27:31] (03CR) 10Volans: "(1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [15:31:56] !log mobrovac@tin Starting deploy [parsoid/deploy@e8f0865]: Canary deploy of mwApiServer to wtp[12]001, take two [15:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:33] !log mobrovac@tin Finished deploy [parsoid/deploy@e8f0865]: Canary deploy of mwApiServer to wtp[12]001, take two (duration: 00m 37s) [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:13] legoktm: did the Gerrit's message published here from wikibugs change compared to the previous one? 
^^^ "(1 comment)" instead of the start of the actual comment [15:35:22] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:40:00] (03PS1) 10Elukey: Add another test to run Varnishkafka with Valgrind [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/328381 (https://phabricator.wikimedia.org/T147438) [15:41:22] (03PS1) 10Mobrovac: ChangeProp: Use RESTBase in the other DC to process async requests [puppet] - 10https://gerrit.wikimedia.org/r/328382 [15:42:32] !log mobrovac@tin Starting deploy [parsoid/deploy@e8f0865]: Deploy of mwApiServer across the wtp fleet [15:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:32] !log mobrovac@tin Finished deploy [parsoid/deploy@e8f0865]: Deploy of mwApiServer across the wtp fleet (duration: 05m 59s) [15:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:17] (03CR) 10Mobrovac: [C: 04-1] "Forgot to assign the variable to the config ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/328382 (owner: 10Mobrovac) [15:50:49] <_joe_> heh I was about to ask :P [15:50:53] (03PS2) 10Mobrovac: ChangeProp: Use RESTBase in the other DC to process async requests [puppet] - 10https://gerrit.wikimedia.org/r/328382 [15:53:09] (03CR) 10Mobrovac: [C: 031] "PCC ok - https://puppet-compiler.wmflabs.org/4949/" [puppet] - 10https://gerrit.wikimedia.org/r/328382 (owner: 10Mobrovac) [15:53:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] ChangeProp: Use RESTBase in the other DC to process async requests [puppet] - 10https://gerrit.wikimedia.org/r/328382 (owner: 10Mobrovac) [15:53:47] <_joe_> mobrovac: ^^ [15:53:50] yup yup [15:53:58] <_joe_> running puppet on scb1001 [15:54:24] <_joe_> once puppet-merge is done, though [15:54:48] gimme a minute to relocate to another room _joe_ and i will then restart cp just on scb1001 so that we can see if it's working the way we want it to [15:54:56] <_joe_> cool [15:54:59] <_joe_> take your time [15:55:07] <_joe_> I'll take a break too :) [15:58:08] back [15:59:07] !log restarting change-prop on scb1001 [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:55] <_joe_> I don't see traffic on the parsoids in codfw [16:01:57] <_joe_> uhm [16:03:09] <_joe_> mobrovac: and on restbase either [16:03:36] hm might be that cp on scb1001 is not the owner of any rules [16:03:50] <_joe_> now I see it [16:07:00] _joe_: i'll restart all of the CPs, ok? 
[16:07:15] <_joe_> yup [16:07:22] <_joe_> wait [16:07:29] <_joe_> I didn't run puppet everywhere [16:07:36] oh ok [16:07:59] <_joe_> doing now [16:08:37] <_joe_> and I can confirm I see some traffic to parsoid in codfw and some traffic from there to eqiad via https [16:08:55] <_joe_> mobrovac: for bonus points, we should add https termination to restbase too ^_^ [16:09:12] cool [16:09:12] (03CR) 10Volans: [C: 031] "LGTM although I don't like too much adding another recipe and having the /tmp in the same / partition for an application that uses the /tm" [puppet] - 10https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey) [16:09:14] yup yup [16:09:25] _joe_: puppet ran? [16:09:27] <_joe_> mobrovac: you can go on [16:09:29] kk [16:09:59] !log change-prop restarting everywhere [16:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:13] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2890187 (10zhuyifei1999) Going 10k+ now [16:13:21] _joe_: it's working! [16:13:30] i see reqs from CP on wtp2001 [16:14:34] <_joe_> yup [16:15:19] \o/ [16:16:01] <_joe_> https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report [16:16:04] <_joe_> lol [16:16:18] uuu [16:16:32] PROBLEM - Disk space on mw1168 is CRITICAL: DISK CRITICAL - free space: /tmp 410 MB (2% inode=99%) [16:16:58] (lol, more transcoder drama) [16:17:37] _joe_: ^^ If you have a moment [16:17:38] <_joe_> Revent: uh? 
[16:17:38] !log drop repl_records_partition events from db1069 [16:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:48] <_joe_> Revent: not really, but tell me :P [16:17:48] (03PS2) 10BBlack: TLS: reduce scope of stream.wm.o redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/328193 (https://phabricator.wikimedia.org/T143925) [16:17:50] (03PS21) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:17:52] (03PS21) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [16:17:54] (03PS22) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [16:17:56] (03PS6) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [16:17:57] _joe_: icinga-wm: PROBLEM - Disk space on mw1168 is CRITICAL: DISK CRITICAL - free space: /tmp 410 MB (2% inode=99%) [16:18:04] <_joe_> Revent: oh that's known [16:18:20] <_joe_> Revent: elukey is already working on the underlying problem [16:18:42] Yeah, I just didn’t know if going and cleaning it out would help. [16:18:46] <_joe_> we forgot to modify the partition scheme when adding the two hosts to this cluster [16:19:01] !log notebook1001/1002 - upgraded php packages [16:19:02] _joe_ if you have time (I know that you don't but I try anyway :P) https://gerrit.wikimedia.org/r/328345 [16:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:19] <_joe_> elukey: is that patch correct? 
if so go for it :P [16:19:34] (lol) [16:19:35] come ooooonnn :P [16:19:57] _joe_ just let me know if you like the idea of a single root partition [16:20:01] then I'll go :D [16:20:05] <_joe_> yes [16:20:07] <_joe_> but [16:20:22] <_joe_> it's ok, I was checking the partman recipe [16:20:39] I truncated the mw.cfg one, brutally [16:20:42] (03CR) 10Mobrovac: [C: 04-1] "(1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: 10Ppchelko) [16:20:48] (03PS3) 10Marostegui: [WIP] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [16:20:58] <_joe_> I guess if partman was written by the same people that created the sendmail config [16:21:06] (03CR) 10Giuseppe Lavagetto: [C: 031] Move mw116[89] to a single partition recipe [puppet] - 10https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey) [16:21:22] <_joe_> or by klingons, that would explain it too [16:22:14] in some convoluted mind partman might make sense at first read [16:22:46] Some transcodes are still failing, btw, with ‘timeout’ in transcode_error… it seems like most are succeeding, tho. [16:23:32] RECOVERY - Disk space on mw1168 is OK: DISK OK [16:24:34] (03PS2) 10Giuseppe Lavagetto: Conftool: Add restbase101[678] and restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/328059 (https://phabricator.wikimedia.org/T151086) (owner: 10Mobrovac) [16:24:53] partman is that thing that is not parsable by humans and doesn't generate a predictable result when parsed by machines ;) [16:25:16] (03PS3) 10Elukey: Move mw116[89] to a single partition recipe [puppet] - 10https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488) [16:25:33] Are you saying that it has an ‘infinite improbability drive’ somewhere inside? 
:P [16:25:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Conftool: Add restbase101[678] and restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/328059 (https://phabricator.wikimedia.org/T151086) (owner: 10Mobrovac) [16:26:44] !log Running optimize table on db1045 for the revision tables as we urgently need some space back on that host - T153739 [16:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:48] Revent: yeah, probably was written by Adams :D [16:26:48] T153739: Defragment db1015 - https://phabricator.wikimedia.org/T153739 [16:27:16] (03PS4) 10Elukey: Move mw116[89] to a single partition recipe [puppet] - 10https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488) [16:28:52] BTW, I’m still seeing a trickle of transcodes completing with ‘negative’ encode times, but I somewhat suspect they are just relics of my earlier crude attempts to keep the load down. [16:28:59] (03CR) 10Elukey: [C: 032] Move mw116[89] to a single partition recipe [puppet] - 10https://gerrit.wikimedia.org/r/328345 (https://phabricator.wikimedia.org/T153488) (owner: 10Elukey) [16:29:02] PROBLEM - DPKG on rcs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:29:16] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=restbase,dc=codfw,name=restbase201.* [16:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:01] !log upgrading tons of outdated OS packages on rcs100x [16:30:02] PROBLEM - DPKG on rcs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:45] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2890221 (10RobH) Please note that @papaul is currently away from the datacenter until after the holiday. 
Any hardware failures will either wait until his return, or will require a smart hands ticket with Cy... [16:31:04] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=restbase,dc=eqiad,name=restbase101[6-8].* [16:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:02] RECOVERY - DPKG on rcs1001 is OK: All packages OK [16:33:02] RECOVERY - DPKG on rcs1002 is OK: All packages OK [16:34:29] 06Operations, 10ChangeProp, 10Mobile-Content-Service, 06Parsing-Team, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2890239 (10Joe) [16:34:38] 06Operations, 10ChangeProp, 10Mobile-Content-Service, 06Parsing-Team, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2837406 (10Joe) a:03Joe [16:35:34] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2890256 (10Marostegui) Hey @RobH no, no need to worry about it. It can wait Thanks for keeping an eye on it! [16:35:48] 06Operations, 10ChangeProp, 10Mobile-Content-Service, 06Parsing-Team, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2837406 (10Joe) Thanks to the invaluable help from @mobrovac, we're now serving the traffic for... [16:39:17] (03CR) 10Dzahn: "to be fair this is testing production services so i wouldn't call that "any arbitrary Labs project"" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:40:58] Revent: I am cleaning up old /tmp files on mw116[89] (not referenced by lsof). 
I am not sure when/how they get cleaned up on the video scalers [16:41:24] (03PS2) 10Ema: varnish cachestats.py: add support for defaults and key_prefix [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) [16:42:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:42:37] (03PS3) 10BBlack: TLS: reduce scope of stream.wm.o redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/328193 (https://phabricator.wikimedia.org/T143925) [16:43:12] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [16:44:40] elukey: probably https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/master/WebVideoTranscode/WebVideoTranscodeJob.php#L68 [16:45:40] <_joe_> misc codfw? [16:45:44] <_joe_> what can it be? [16:46:55] rcs? [16:47:03] rcs is eqiad [16:47:12] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [16:47:41] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/328368 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:49:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [16:58:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [17:01:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [17:06:12] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:06:35] (03PS2) 10Reedy: [throttle] Rule for Shivaji University, India [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328338 (https://phabricator.wikimedia.org/T153741) (owner: 10Urbanecm) 
[17:06:47] (03CR) 10Reedy: "PS2 removes the old throttles at the same time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328338 (https://phabricator.wikimedia.org/T153741) (owner: 10Urbanecm) [17:07:13] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:08:04] (03CR) 10Reedy: [C: 032] [throttle] Rule for Shivaji University, India [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328338 (https://phabricator.wikimedia.org/T153741) (owner: 10Urbanecm) [17:08:47] (03CR) 10BryanDavis: [C: 031] "This +1 is slightly better than the "reviewer has working mouse" baseline. I have not executed the code but I've read through it several t" [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [17:09:18] (03Merged) 10jenkins-bot: [throttle] Rule for Shivaji University, India [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328338 (https://phabricator.wikimedia.org/T153741) (owner: 10Urbanecm) [17:09:32] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [17:24:47] !log reedy@tin Synchronized wmf-config/throttle.php: T153741 (duration: 00m 40s) [17:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:51] T153741: Request for a temporary lift of account creation cap on IP - https://phabricator.wikimedia.org/T153741 [17:28:33] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:31:08] (03PS1) 10RobH: renaming/reimaging sinistra as mwlog2001 [dns] - 10https://gerrit.wikimedia.org/r/328394 [17:31:16] (03PS3) 10Smalyshev: Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) [17:32:12] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 406.08 seconds [17:32:29] (03CR) 10Smalyshev: "(1 comment)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [17:33:24] (03PS4) 10Smalyshev: Add new units for the following: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) [17:34:12] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:36:23] db1047's lag looks like this query: 44334067 | Using where; Using temporary; Using filesort [17:36:51] how long has been executing? [17:37:12] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [17:37:12] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive [17:37:17] it is gone now [17:37:20] :| [17:37:31] I cannot see it on the activity panel, [17:37:42] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:37:57] (03Abandoned) 10ArielGlenn: pick up privatewikis fact from mediawiki config file [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/324482 (owner: 10ArielGlenn) [17:37:59] (03PS1) 10RobH: updating sinistra to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/328397 [17:38:02] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:38:06] system user | | NULL | Connect | 364 | closing tables | NULL [17:38:52] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [17:41:00] (03CR) 10RobH: [C: 032] updating sinistra to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/328397 (owner: 10RobH) [17:41:22] !log rcs1001.eqiad.wmnet: python-gevent downgraded from 1.1b6-1~trusty1 to 1.0-1ubuntu1 [17:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:44] (03CR) 10RobH: [C: 032] renaming/reimaging sinistra as mwlog2001 [dns] - 10https://gerrit.wikimedia.org/r/328394 (owner: 10RobH) [17:45:05] (03PS1) 1020after4: WIP: scap branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328399 [17:46:04] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2890440 (10RobH) p:05Triage>03Normal [17:49:42] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:32] !log rcs1002.eqiad.wmnet: python-gevent downgraded from 1.1b6-1~trusty1 to 1.0-1ubuntu1 [17:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:28] (03PS1) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [17:55:21] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [17:56:39] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:56:58] (03PS10) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [17:58:14] (03PS2) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 
10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [17:58:23] 06Operations, 10Wikimedia-Stream: Error on startup for the flash policy server - https://phabricator.wikimedia.org/T153770#2890496 (10Volans) [17:59:12] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [18:00:30] (03PS3) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [18:01:21] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [18:03:18] (03PS4) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [18:08:58] Revent: qq - how do you guys check the "backlog" of transcode jobs ? [18:09:10] (I am reviewing the task and didn't get it) [18:09:21] (just want to match what I am seeing on terbium) [18:09:29] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler is the easy way [18:09:36] nice [18:10:30] elukey: https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=webVideoTranscode [18:10:46] the Job Wait Time gives you an idea of how long jobs have been waiting [18:11:03] and "Activity per minute" the rate of that jobs processing [18:11:25] I don't think the jobrunner reports per wiki/ per job queue sizes [18:11:34] but there is a mediawiki maintenance script for that [18:11:57] $ mwscript showJobs.php --wiki=commonswiki --group --type=webVideoTranscode [18:11:58] webVideoTranscode: 13906 queued; 774 claimed (72 active, 702 abandoned); 0 delayed [18:12:38] I am off [18:12:43] hashar: That seems all quite inconsistent. 
:/ [18:13:04] :((( [18:13:34] there is a task about overhauling the async jobs [18:13:41] but that is a rather large effort [18:13:47] anyway gotta leave the place *wave* [18:13:51] The special page shows 10281 queued transcodes, not 16k [18:14:03] er, 13k [18:16:05] 06Operations, 10ops-codfw: update label/racktables visible label for mwlog2001 (was sinistra) - https://phabricator.wikimedia.org/T153771#2890525 (10RobH) [18:16:34] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2879007 (10RobH) [18:18:13] yeah I am trying to review all the numbers :D [18:18:39] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:27:30] 06Operations, 10Wikimedia-Stream: Upstream prematurely closed connection - https://phabricator.wikimedia.org/T153772#2890554 (10Volans) [18:32:58] (03CR) 10Dzahn: "(2 comments)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:36:37] (03CR) 10Dzahn: "(1 comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:38:29] (03PS3) 10Filippo Giunchedi: grafana: Provision Varnish Client Status Code dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/328350 (owner: 10Ema) [18:40:51] 06Operations, 10Wikimedia-Stream: Socket.io - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#2890586 (10Volans) [18:41:57] (03CR) 10Filippo Giunchedi: [C: 032] grafana: Provision Varnish Client Status Code dashboard as JSON [puppet] - 10https://gerrit.wikimedia.org/r/328350 (owner: 10Ema) [18:48:55] (03PS11) 10Dzahn: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:49:33] (03CR) 10Dzahn: "PS11: added gerrit_log_host variable in hiera" [puppet] - 
10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:52:54] (03PS3) 10Filippo Giunchedi: prometheus: add aggregation rules for varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327873 (https://phabricator.wikimedia.org/T147424) [18:55:19] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [18:58:43] PROBLEM - Disk space on mwlog2001 is CRITICAL: Return code of 255 is out of bounds [18:58:53] PROBLEM - MD RAID on mwlog2001 is CRITICAL: Return code of 255 is out of bounds [18:59:43] RECOVERY - Disk space on mwlog2001 is OK: DISK OK [18:59:53] RECOVERY - MD RAID on mwlog2001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [19:02:51] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add aggregation rules for varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327873 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [19:06:37] (03PS12) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [19:06:43] (03PS13) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [19:10:43] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:16:51] (03PS1) 10Filippo Giunchedi: otrs: use wikimedia_domains from role exim [puppet] - 10https://gerrit.wikimedia.org/r/328407 [19:17:38] (03PS2) 10Ppchelko: Graphoid: Install all fonts available to mediawiki. 
[puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) [19:18:03] (03CR) 10jerkins-bot: [V: 04-1] otrs: use wikimedia_domains from role exim [puppet] - 10https://gerrit.wikimedia.org/r/328407 (owner: 10Filippo Giunchedi) [19:20:15] (03PS2) 10Filippo Giunchedi: otrs: use wikimedia_domains from role exim [puppet] - 10https://gerrit.wikimedia.org/r/328407 [19:22:13] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:24:07] (03CR) 10Chad: [C: 031] "Untested, but there's no better way to test than just merging it :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328399 (owner: 1020after4) [19:26:03] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:26:53] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [19:37:53] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2890853 (10RobH) 05Open>03Resolved [19:40:44] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2890855 (10RobH) Please note that sinistra has been reimaged as 'mwlog2001'. This matches the new mwlog1001 in eqiad, that is replacing fluorine. 
[19:40:55] just so you guys know, enwiki received DDoS threats from a user who has vandalised; we are unsure if it's a true threat or just a ticked-off user, but I thought I'd give you guys a courtesy call [19:43:54] (03CR) 10Ppchelko: "And it's obviously a no-op: https://puppet-compiler.wmflabs.org/4950/" [puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: 10Ppchelko) [19:53:34] (03PS16) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [19:56:35] (03PS3) 10Zppix: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [19:57:11] (03CR) 1020after4: [C: 032] scap clean: provide l10n-only option for pruning stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327421 (owner: 10Chad) [19:57:44] (03Merged) 10jenkins-bot: scap clean: provide l10n-only option for pruning stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327421 (owner: 10Chad) [19:58:10] (03CR) 10Chad: [C: 032] scap patch: Remove unused print_function import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327384 (owner: 10Chad) [19:58:13] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:58:42] (03Merged) 10jenkins-bot: scap patch: Remove unused print_function import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327384 (owner: 10Chad) [19:59:09] (03PS17) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [20:00:59] (03CR) 10jerkins-bot: [V: 04-1] labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [20:01:58] (03PS44) 10Chad: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:03:30] (03CR) 1020after4: [C: 04-1] "this is gonna conflict with I30092565bf553d81198846a1fc2bfb66ee5681d5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:04:43] (03CR) 10Chad: "I'd say land this first, then rebase that on top so we'll catch any improvements to gerrit.py?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:05:05] (03PS1) 10Filippo Giunchedi: prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 [20:05:21] (03PS3) 10Chad: WIP: `scap scrape` plugin split out from change 306259 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312016 (owner: 1020after4) [20:05:23] (03CR) 1020after4: [C: 031] "ok works for me..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:06:00] (03CR) 10jerkins-bot: [V: 04-1] prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 (owner: 10Filippo Giunchedi) [20:06:09] (03PS1) 10Yuvipanda: Revert "puppetmaster: Install self signed CA into system store too" [puppet] - 10https://gerrit.wikimedia.org/r/328416 [20:07:29] (03CR) 10Chad: [C: 032] Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:07:48] (03CR) 10Yuvipanda: [C: 032] Revert "puppetmaster: Install self signed CA into system store too" [puppet] - 10https://gerrit.wikimedia.org/r/328416 (owner: 10Yuvipanda) [20:07:57] (03CR) 10Chad: "Bump." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298397 (owner: 10Alex Monk) [20:08:11] (03Merged) 10jenkins-bot: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [20:08:31] (03PS2) 10Filippo Giunchedi: prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 [20:09:13] (03PS18) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [20:10:01] (03PS2) 10Chad: Remind people not to enable wmgUseKartographer elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305748 (owner: 10Jforrester) [20:10:30] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2890932 (10Gehel) After a discussion with @Smalyshev, it seems that we usually upload indiv... 
[20:11:17] (03CR) 10Chad: [C: 032] Remind people not to enable wmgUseKartographer elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305748 (owner: 10Jforrester) [20:12:03] (03Merged) 10jenkins-bot: Remind people not to enable wmgUseKartographer elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305748 (owner: 10Jforrester) [20:12:38] ooh, are we doing deploys now? [20:13:14] Nope :) [20:13:27] I have two beta-only changes I want to get out.. [20:13:31] I'm doing a few no-op things :) [20:13:55] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: No-op comment-only (duration: 00m 52s) [20:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:11] (03PS2) 10Legoktm: beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327438 [20:14:11] beta stuff is no-op? :) [20:14:13] (03PS2) 10Legoktm: beta: Remove duplicate entry for wikidata in $wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327439 [20:14:33] bblack: yt? [20:18:35] (03CR) 10MaxSem: "Errrm, why did this comment that's not true anymore get deployed? :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305748 (owner: 10Jforrester) [20:19:03] (03PS3) 10Filippo Giunchedi: prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 [20:19:12] ostriches, ^ [20:19:15] !log demon@tin Synchronized requirements.txt: no-op scap stuff (duration: 00m 39s) [20:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:51] MaxSem: Then the change should've been abandoned :) [20:19:54] !log demon@tin Synchronized setup.py: no-op scap stuff (duration: 00m 39s) [20:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:58] How was I to know? 
It's a comment-only change :) [20:20:34] ostriches, comment not matching code should've been a hint :) [20:20:34] !log demon@tin Synchronized test-requirements.txt: no-op scap stuff (duration: 00m 39s) [20:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:51] Like I looked at the code :p [20:20:57] (03PS1) 10MaxSem: Revert "Remind people not to enable wmgUseKartographer elsewhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328419 [20:20:57] Comments are fun [20:21:14] !log demon@tin Synchronized tox.ini: no-op scap stuff (duration: 00m 39s) [20:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:54] !log demon@tin Synchronized scap/plugins: no-op scap stuff (duration: 00m 39s) [20:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:34] (03CR) 10Chad: [C: 032] Revert "Remind people not to enable wmgUseKartographer elsewhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328419 (owner: 10MaxSem) [20:23:44] nuria: yes [20:24:17] (03Merged) 10jenkins-bot: Revert "Remind people not to enable wmgUseKartographer elsewhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328419 (owner: 10MaxSem) [20:24:18] bblack: one more ping about last access cookie and global counts for unique devices [20:24:42] bblack: https://phabricator.wikimedia.org/T138027 [20:24:59] bblack: do you think this is something we can do upcoming quarter? 
[20:25:04] (03PS1) 10Yuvipanda: puppetmaster: Do not install default file before common package [puppet] - 10https://gerrit.wikimedia.org/r/328420 [20:25:40] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Remove dumb comment (duration: 00m 39s) [20:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:13] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:26:14] (03CR) 10Mobrovac: [C: 031] "Right, because we already have those packages installed for the pdfrender service." [puppet] - 10https://gerrit.wikimedia.org/r/328316 (https://phabricator.wikimedia.org/T153726) (owner: 10Ppchelko) [20:27:30] (03CR) 10Ppchelko: [C: 031] Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 (owner: 10Ottomata) [20:27:32] andrewbogott: ^ should fix it [20:27:43] andrewbogott: https://gerrit.wikimedia.org/r/328420 that is [20:27:54] ostriches, tsk tsk tsk https://gerrit.wikimedia.org/r/#/c/305748/ [20:28:42] puppetmaster-common and not just 'puppetmaster'? [20:29:22] yuvipanda: ^ ? [20:29:44] ostriches, were you deploying something today? [20:29:54] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 (owner: 10Filippo Giunchedi) [20:29:58] Nope, no deployments. 
[20:30:00] (03PS4) 10Filippo Giunchedi: prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 [20:30:01] andrewbogott: actually you're right, I was confused by before => Package['puppetmaster-common'], [20:30:02] Just some no-ops [20:30:18] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] prometheus: introduce prometheus::rule define [puppet] - 10https://gerrit.wikimedia.org/r/328415 (owner: 10Filippo Giunchedi) [20:30:35] nuria: yes, I'll note it on the task [20:30:48] ostriches, the config repo shows you +2 it today - https://gerrit.wikimedia.org/r/#/c/305748/ [20:30:56] i thought it has to be deployed right away [20:31:08] It was. But it was a comment no-op [20:31:12] So it's not a deployment ;-) [20:31:20] It was sync'd, there was no deployment [20:31:26] but you need to sync it, don't you? [20:31:31] * ostriches is playing with semantics to get a few things done [20:31:32] ;-) [20:31:52] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Do not install default file before common package [puppet] - 10https://gerrit.wikimedia.org/r/328420 (owner: 10Yuvipanda) [20:31:57] * yurik throws a tomatoe at ostriches :-P [20:32:00] (03PS2) 10Andrew Bogott: puppetmaster: Do not install default file before common package [puppet] - 10https://gerrit.wikimedia.org/r/328420 (owner: 10Yuvipanda) [20:32:27] bblack: ok, we will prioritize our work once we have the varnish code that sets cookie. We talked about setting it up in all domains versus only set it up on wikimedia.org [20:33:24] bblack: let's just do the *.wikipedia.org domain [20:34:14] mutante: can you take a look at https://gerrit.wikimedia.org/r/#/c/328407/ ? puppet is failing on mendelevium [20:34:48] nuria: do you actually want to do a single domain first? [20:34:50] godog: ugh, yea, that would be my fault [20:34:54] bblack: sorry, just saw your prior comment in that it was easier to do it for all domains at once [20:34:57] nuria: (or only)? 
it's actually easy to do them all [20:34:57] on it [20:35:00] ok [20:35:10] (03PS3) 10Dzahn: otrs: use wikimedia_domains from role exim [puppet] - 10https://gerrit.wikimedia.org/r/328407 (owner: 10Filippo Giunchedi) [20:35:27] (03CR) 10Mobrovac: [C: 031] Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 (owner: 10Ottomata) [20:35:49] bblack: we will compute just for once for starters, i imagine data would have issues that we need to vet but once those are ironed out computing for 1 domain or severals is about the same [20:35:50] (03CR) 10Dzahn: [C: 031] "yes, this is correct, sorry, caused by https://gerrit.wikimedia.org/r/#/c/327138/ and i only checked MX and list server, not OTRS" [puppet] - 10https://gerrit.wikimedia.org/r/328407 (owner: 10Filippo Giunchedi) [20:35:57] mutante: no worries, looks simple enough [20:36:06] (03CR) 10Dzahn: [C: 032] otrs: use wikimedia_domains from role exim [puppet] - 10https://gerrit.wikimedia.org/r/328407 (owner: 10Filippo Giunchedi) [20:36:59] bblack: thank you [20:37:41] should recover now [20:37:43] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:38:12] thanks, out for lunch then, bbiaw [20:52:07] (03PS2) 1020after4: scap branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328399 (https://phabricator.wikimedia.org/T140918) [20:54:31] !log mobrovac@tin Starting deploy [changeprop/deploy@77786a5]: Config change: respect the pages blacklisted by RESTBase [20:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:22] !log mobrovac@tin Finished deploy [changeprop/deploy@77786a5]: Config change: respect the pages blacklisted by RESTBase (duration: 00m 50s) [20:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:03] (03PS4) 10BBlack: TLS: reduce scope of stream.wm.o redirect exception [puppet] - 
10https://gerrit.wikimedia.org/r/328193 (https://phabricator.wikimedia.org/T143925) [21:00:05] (03PS22) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [21:00:07] (03PS22) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [21:00:09] (03PS23) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [21:00:11] (03PS7) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [21:13:50] (03CR) 10Daniel Kinzler: [C: 031] "I did not check every conversion factor, but i agree with the intent, and the configuration looks sane." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327907 (https://phabricator.wikimedia.org/T150881) (owner: 10Smalyshev) [21:15:43] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:19:21] (03PS1) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 [21:26:57] (03CR) 10Jcrespo: [C: 031] prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 (owner: 10Filippo Giunchedi) [21:41:00] (03CR) 10Chad: [C: 032] scap branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328399 (https://phabricator.wikimedia.org/T140918) (owner: 1020after4) [21:41:36] (03Merged) 10jenkins-bot: scap branch plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328399 (https://phabricator.wikimedia.org/T140918) (owner: 1020after4) [21:42:55] !log demon@tin Synchronized multiversion/submodules.json: no-op scap stuff (duration: 00m 39s) [21:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:35] !log demon@tin Synchronized scap/plugins: no-op scap stuff (duration: 00m 39s) [21:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:43] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:47:08] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1168.eqiad.wmnet [21:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:17] !log depooled mw1168 videoscaler for reinstall (part of T153488) [21:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:21] T153488: Commons video transcoders have over 6500 tasks in the backlog. 
- https://phabricator.wikimedia.org/T153488 [21:54:09] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891214 (10ssastry) [21:55:03] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1389 [21:59:26] (03PS1) 10Dzahn: dhcp: set installserver for mw1168 to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328427 [22:00:03] RECOVERY - check_mysql on lutetium is OK: Uptime: 1925884 Threads: 3 Questions: 523969177 Slow queries: 14687 Opens: 86478559 Flush tables: 2 Open tables: 64 Queries per second avg: 272.066 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2 [22:00:20] (03PS2) 10Dzahn: dhcp: set installserver for mw1168 to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328427 [22:00:58] (03PS3) 10Dzahn: dhcp: set installserver for mw1168 to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328427 [22:02:02] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891214 (10GWicke) Could this be 404s from replication lag, as handled in https://github.com/wikimedia/change-propagation/blob/9a87712... 
[22:03:55] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891281 (10GWicke) The 404 rate returned from Parsoid has not changed much over the last months: {F5112276} [22:05:31] (03PS2) 10Dzahn: install: add http & proxy roles on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757) [22:05:48] (03CR) 10jerkins-bot: [V: 04-1] install: add http & proxy roles on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:06:41] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891288 (10ssastry) >>! In T153797#2891281, @GWicke wrote: > The 404 rate returned from Parsoid has not changed much over the last mon... [22:07:42] (03PS3) 10Dzahn: install: add http & proxy roles on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757) [22:08:08] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891289 (10GWicke) @ssastry, my understanding is that Parsoid does not retry 404s from batch requests, so the two should be equal. Cou... [22:12:42] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891296 (10ssastry) >>! In T153797#2891289, @GWicke wrote: > @ssastry, my understanding is that Parsoid does not retry 404s from batch... 
[22:13:33] (03PS5) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [22:14:15] (03Abandoned) 10Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:14:42] (03Restored) 10Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:16:03] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [22:16:11] (03CR) 10Dzahn: [C: 04-1] "but i'll keep it as a reminder that the apt.wm.org config should also not be in installserver module, i guess in manifests/role/files/aptr" [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:16:35] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891321 (10GWicke) @ssastry, I haven't seen any code in Parsoid that would trigger retries on a 404 response. Could you point it out t... [22:16:51] (03PS6) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [22:18:27] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, 07HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2891325 (10abian) I've been using HTTPS Everywhere for a long time. This extension has [[ https://github.com/EFForg/https-everywhere/blob/master/src/ch... 
[22:20:25] (03PS7) 10Andrew Bogott: Keystone: Move api service to uwsgi/nginx [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) [22:21:22] (03CR) 10Dzahn: [C: 032] install: add http & proxy roles on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:21:23] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, 07HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2891340 (10bd808) I really think we could just flip the switch at the ingress proxy and then deal with the fallout. Mixed content warnings/errors are r... [22:22:18] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891347 (10ssastry) >>! In T153797#2891321, @GWicke wrote: > @ssastry, I haven't seen any code in Parsoid that would trigger retries o... [22:24:52] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, 07HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2891349 (10bd808) >>! In T102367#2891340, @bd808 wrote: > I really think we could just flip the switch at the ingress proxy and then deal with the fall... [22:25:00] (03CR) 10Andrew Bogott: "This seems to mostly work, although I'm clearly misunderstanding how logging should work" [puppet] - 10https://gerrit.wikimedia.org/r/328400 (https://phabricator.wikimedia.org/T150774) (owner: 10Andrew Bogott) [22:26:14] PROBLEM - puppet last run on install1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[acme-setup-acme-apt] [22:26:40] that's me and i'm currently thinking how to prevent that during a migration [22:37:53] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891214 (10Pchelolo) I really doubt the 404s come from the replication lag. There were no 404s like this before and also since we now... [22:39:38] 06Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#2891390 (10Dzahn) Meanwhile we have sodium.wikimedia.org which is mirrors.wikimedia.org and has all the files, while carbon is not it anymore (and has 14T instead of 21T or something in data left in mirrors). Did that reso... [22:41:14] (03PS1) 10Dzahn: install: add hiera override to skip Letsencrypt cert creation [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) [22:42:48] (03CR) 10Chad: install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:45:53] (03CR) 10Dzahn: install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:47:22] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891429 (10GWicke) Yeah, on IRC @ssastry clarified that in contrast to the primary revision source request, these batch requests would... 
[22:47:32] (03CR) 10Dzahn: install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:47:41] (03CR) 1020after4: [C: 031] Remove --dry-run option from updateBranchPointers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328329 (owner: 10Chad) [22:48:34] (03CR) 10Chad: [C: 031] install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:49:54] (03CR) 10Dzahn: install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:50:35] (03CR) 10Dzahn: [C: 032] install: add hiera override to skip Letsencrypt cert creation [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:50:41] (03PS2) 10Dzahn: install: add hiera override to skip Letsencrypt cert creation [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) [22:51:10] (03CR) 10Dzahn: [C: 032] install: add hiera override to skip Letsencrypt cert creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:52:33] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:52:48] (03PS2) 10Dzahn: install: add http & proxy roles on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/325743 (https://phabricator.wikimedia.org/T132757) [22:53:23] RECOVERY - puppet last run on install1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:53:33] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases 
(db0) with 4080328 keys, up 50 days 14 hours - replication_delay is 0 [22:53:57] (03PS14) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [22:54:15] (03CR) 10Dzahn: [C: 032] install: add http & proxy roles on install2001 [puppet] - 10https://gerrit.wikimedia.org/r/325743 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:57:48] interesting, 2 servers, identical roles, one works one fails [22:57:54] hmmmm [22:58:58] PROBLEM - Check systemd state on install2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:58:59] ah, i see why [22:59:20] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, 07HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (10Legoktm) We could do the same thing as what we did in prod. Allow HTTP POST for a while, then make a percentage of requests fail, and then f... [22:59:34] on one server it ran once and was then disabled, on the other it never ran [22:59:37] grmbl [23:00:26] legoktm on T102367 outbound requests from tools will still be sendable right (from the actual tools themselves meaning like bot apis and such) [23:00:26] T102367: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 [23:00:48] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [23:02:18] PROBLEM - HTTP on install2001 is CRITICAL: connect to address 208.80.153.4 and port 80: Connection refused [23:02:43] yes, that's me and it's not used [23:02:49] working on it [23:04:47] (03PS1) 10Dzahn: install/apt: set install2001 to active host temp. [puppet] - 10https://gerrit.wikimedia.org/r/328432 [23:05:14] (03PS2) 10Dzahn: install/apt: set install2001 to active host temp. 
[puppet] - 10https://gerrit.wikimedia.org/r/328432 [23:05:21] (03CR) 10Dzahn: [V: 032 C: 032] install/apt: set install2001 to active host temp. [puppet] - 10https://gerrit.wikimedia.org/r/328432 (owner: 10Dzahn) [23:05:29] (03PS1) 10Dzahn: Revert "install/apt: set install2001 to active host temp." [puppet] - 10https://gerrit.wikimedia.org/r/328433 [23:06:58] RECOVERY - Check systemd state on install2001 is OK: OK - running: The system is fully operational [23:07:05] (03CR) 10Dzahn: [V: 032 C: 032] "just needed one puppet run, then go back to before" [puppet] - 10https://gerrit.wikimedia.org/r/328433 (owner: 10Dzahn) [23:07:18] RECOVERY - HTTP on install2001 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.073 second response time [23:07:48] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 180 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 180, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 180, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 50 [23:08:12] ^ just me, ignore [23:08:21] thx [23:08:48] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 7, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 97.5069252078, active_shards: 352 [23:08:48] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:08:56] (03PS4) 10Dzahn: dhcp: set installserver for mw1168 to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328427 [23:09:26] !log install1001/2001 - upgrade python-requests 
python-urllib3 arcconf [23:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:33] (03CR) 10Dzahn: [C: 032] dhcp: set installserver for mw1168 to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328427 (owner: 10Dzahn) [23:18:22] !log rebooting mw1168 into PXE for reinstall [23:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:40] installing that using install1001, not carbon, as tftp [23:24:43] and it works [23:24:56] carbon needs to go .. precise [23:35:15] (03PS1) 10Dzahn: install/dhcp: switch all "next-server" from carbon to install1001 [puppet] - 10https://gerrit.wikimedia.org/r/328439 (https://phabricator.wikimedia.org/T132757) [23:36:26] (03CR) 10Dzahn: "install worked fine. next will be https://gerrit.wikimedia.org/r/#/c/328439/ so no more need for this special case. was just test" [puppet] - 10https://gerrit.wikimedia.org/r/328427 (owner: 10Dzahn) [23:36:57] (03PS1) 10Dzahn: Revert "dhcp: set installserver for mw1168 to install1001" [puppet] - 10https://gerrit.wikimedia.org/r/328440 [23:38:13] mutante: nice! [23:38:19] what else is left on carbon? [23:38:20] 06Operations, 10Parsoid, 06Services: Investigate 404s for batch request api calls after async processing moved from eqiad -> codfw - https://phabricator.wikimedia.org/T153797#2891580 (10Pchelolo) Another interesting type of 404s from parsoid started to appear after move to codfw: https://logstash.wikimedia.o... 
[23:38:30] (03PS2) 10Filippo Giunchedi: prometheus: add aggregation rules for MySQL [puppet] - 10https://gerrit.wikimedia.org/r/328425 [23:39:24] godog: next is a test to install while DHCP on carbon is also turned off [23:39:56] if there is something left on it that isnt on install* then it was unpuppetized [23:40:16] it worked out great that Luca wanted 2 servers reinstalled anyways [23:41:19] killing 2 birds with one stone [23:41:20] https://www.youtube.com/watch?v=QwadHzGfHDg [23:46:33] !log mw1168 - re-signing new puppet cert after install, initial puppet run [23:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:34] 07Puppet, 06Labs: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608#2891604 (10scfc) [23:47:45] (03CR) 10Dzahn: [C: 032] Revert "dhcp: set installserver for mw1168 to install1001" [puppet] - 10https://gerrit.wikimedia.org/r/328440 (owner: 10Dzahn) [23:52:29] ugh, right, the ganglia aggregator is still there.. [23:54:32] godog: fyi, on first puppet run on a fresh mw videoscaler: dpkg errors "errors were encountered" prometheus-apache-exporter [23:54:57] shows up once every time it installs a package [23:54:58] mutante: ah, does it persist afterwards too? [23:55:09] i'll know soon, still running