[00:01:16] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3532570 (tstarling) The status is just what I wrote in the task description of T171267, I haven't done any more work on it since then, except for merging another change int...
[00:04:10] (PS1) Krinkle: Enable jQuery 3 on mediawiki.org and test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/372485 (https://phabricator.wikimedia.org/T124742)
[00:18:48] !log ebernhardson@tin Synchronized php-1.30.0-wmf.14/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T171213: Increase sampling rate of cirrus satisfaction schema (duration: 00m 44s)
[00:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:03] T171213: Interleaved results A/B test: check that data is flowing the way we expect - https://phabricator.wikimedia.org/T171213
[00:48:42] !log ebernhardson@tin Synchronized php-1.30.0-wmf.14/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T171213: Increase sampling rate of cirrus satisfaction schema (again) to 1k per bucket per day (duration: 00m 44s)
[00:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:56] T171213: Interleaved results A/B test: check that data is flowing the way we expect - https://phabricator.wikimedia.org/T171213
[01:06:32] RECOVERY - Check Varnish expiry mailbox lag on cp1048 is OK: OK: expiry mailbox lag is 0
[01:07:43] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-pdf01 - https://phabricator.wikimedia.org/T173552#3532711 (Krenair)
[01:08:08] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-pdf01 - https://phabricator.wikimedia.org/T173552#3532726 (Krenair) See also T120165
[01:15:35] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-sentry01 - https://phabricator.wikimedia.org/T173554#3532743 (Krenair)
[01:15:52] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-sentry01 - https://phabricator.wikimedia.org/T173554#3532743 (Krenair) Caused by https://gerrit.wikimedia.org/r/#/c/371481/4
[01:18:54] (PS1) Alex Monk: Followup Ia5d07908: Fix sentry class's base::service_unit to require correct resource class [puppet] - https://gerrit.wikimedia.org/r/372495 (https://phabricator.wikimedia.org/T173554)
[01:19:17] (CR) jerkins-bot: [V: -1] Followup Ia5d07908: Fix sentry class's base::service_unit to require correct resource class [puppet] - https://gerrit.wikimedia.org/r/372495 (https://phabricator.wikimedia.org/T173554) (owner: Alex Monk)
[01:21:56] (PS2) Alex Monk: Followup Ia5d07908: Fix sentry's base::service_unit to require correct class [puppet] - https://gerrit.wikimedia.org/r/372495 (https://phabricator.wikimedia.org/T173554)
[03:26:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 829.61 seconds
[04:08:42] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 240.41 seconds
[05:38:13] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 46 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:43:09] !log upgrading and restarting all mariadb instances on dbstore2002
[05:43:13] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:22] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 45 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:24:07] !log stopping and upgrading db2033
[06:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:22] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:28:55] PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2033.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db2033.codfw.wmnet (111 Connection refused)
[06:38:16] I cannot upgrade db2033
[06:38:23] it gets stuck on apt
[06:39:07] I think install2002 may have issues
[06:39:22] you mean upgrade as in reimage?
[06:39:28] or dist-upgrade?
[06:39:31] no, use apt update
[06:39:54] happens on every codfw host I try
[06:41:40] Operations, Wikimedia-Site-requests, User-MarcoAurelio, Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3532903 (alanajjar) This task should be solved as fast as you can! many new requests accumulate!
[06:46:06] this is blocking me on upgrading db2033
[06:47:30] uranium seems to be taking quite some TIME_WAIT connections
[06:47:37] this isn't specific to codfw, I can also reproduce this on restbase1008
[06:50:25] should we restart nginx for debugging purposes on one of the servers?
[06:51:23] or maybe it is related to the mirror files?
[06:52:36] restarting nginx makes sense. what I'm wondering:
[06:52:57] we're running apt-get update via puppet, this should also be affected, right?
[06:53:12] yeah, that was my fear, but I saw no complaints yet
[06:54:47] maybe it's rather a caching problem, let's restart squid on install2002
[06:55:01] doing
[06:55:35] !log systemctl restart squid3.service on install2002, apt seem stuck on some servers
[06:55:43] that fixed it
[06:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:49] although
[06:56:03] I saw squid probably at the same than you
[06:56:06] *time
[06:56:28] I'm still getting "Waiting for headers" now
[06:56:37] yeah, me too
[06:57:42] also occurs on a stretch system, so not related to apt itself probably
[06:57:58] let me do nginx, too, just in case
[06:58:05] yeah
[06:58:10] !log and nginx
[06:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:44] no luck
[06:58:51] could be repo files, then?
[06:59:55] It just completed within 1-2 seconds on ms-be2022, but a second run is stuck again
[07:00:16] I am stuck on db2033
[07:00:56] it is ongoing now, at 60KB/s
[07:00:56] did you restart nginx on sodium or install2002?
[07:01:08] install2002 only
[07:01:36] works fine again on restbase2008
[07:01:39] is that the wrong server?
[07:02:15] it is still going very slow, like before it got stuck
[07:02:41] no, install2002 is fine
[07:03:17] (wrt which host to restart)
[07:03:28] bacula may be running? but I didn't see a huge spike on disk or network, just slightly higher than before
[07:04:08] not as much as to create so many issues
[07:04:25] I got unblocked at least, I will finish this upgrade
[07:04:35] and maybe research further later
[07:05:12] it's working reliably on all the codfw hosts I tested, but still stuck on restbase1008, will also restart nginx on install1002
[07:05:55] oh, is it a vm?
[07:06:06] maybe something else is causing issues?
[07:06:14] on the same physical host
[07:06:17] might be, those are in fact VMs
[07:06:31] Operations, Analytics, User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#3532917 (elukey)
[07:06:40] Operations, Analytics, User-Elukey: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493#3532918 (elukey)
[07:06:40] and looking at the instance graphs is misleading
[07:07:19] it's not network-related, if I run "apt-get download hhvm-dbg" it fetches with 40 MB/s
[07:07:30] interesting
[07:07:59] so what could make it slower on update/full-upgrade?
[07:08:10] (PS1) Marostegui: db-codfw.php: Depool db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372502 (https://phabricator.wikimedia.org/T168409)
[07:08:11] some file problems on the repo?
[07:08:11] the numbers shown for "apt-get update" are misleading (46.7 kb/s in my case), since that opens a lot of small connections
[07:08:56] now it also works fine on restbase1008 (and I haven't restarted nginx on install1002 yet)
[07:10:40] strange
[07:11:00] resource usage on install2002 is all on the average, nothing spiked out, so maybe this was in fact a slowdown caused by a spike on some other Ganeti VM
[07:13:01] I don't have a reasonable explanation, but need to work on something else now, let's simply keep an eye on it for now
[07:16:08] RECOVERY - MariaDB Slave IO: x1 on db1031 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:16:24] What's that?
[07:16:33] see log
[07:16:40] Ah!
[07:16:49] Hi jynus you are early :)
[07:17:01] apparently x1 return replication was active
[07:17:16] Yeah, I think we never stopped that one, only sX
[07:20:25] !log installing openjdk-7 security updates on trusty
[07:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:44] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372502 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[07:25:08] (Merged) jenkins-bot: db-codfw.php: Depool db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372502 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[07:26:08] (CR) jenkins-bot: db-codfw.php: Depool db2077 [mediawiki-config] - https://gerrit.wikimedia.org/r/372502 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[07:26:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Deoool db2077 - T168409 (duration: 00m 44s)
[07:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:32] T168409: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409
[07:28:33] !log Stop MySQL on db2077 to copy it to dbstore2001 - T168409
[07:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:07] jynus: you want me to restart the instances on dbstore2001? I saw you did it on dbstore2002 so i don't want to step over you if you are about to do it on 2001
[07:29:15] I can do those if you like
[07:29:36] I already restarted those yesterday
[07:30:21] Ah ok! great, thanks!
[07:49:06] (CR) Filippo Giunchedi: [C: 2] Use absolute paths for `data_file_directories` [puppet] - https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[07:49:13] (CR) Filippo Giunchedi: [C: 1] Use absolute paths for `data_file_directories` [puppet] - https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[07:50:41] (CR) Filippo Giunchedi: [C: 1] "LGTM - to be merged next week" [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: Mobrovac)
[08:07:40] (PS1) ArielGlenn: dump global block table from central auth [puppet] - https://gerrit.wikimedia.org/r/372507 (https://phabricator.wikimedia.org/T173468)
[08:08:02] (PS1) Marostegui: mysql-dbstore_codfw: Add dbstore2001 - s7 [puppet] - https://gerrit.wikimedia.org/r/372508 (https://phabricator.wikimedia.org/T168409)
[08:08:06] (CR) jerkins-bot: [V: -1] dump global block table from central auth [puppet] - https://gerrit.wikimedia.org/r/372507 (https://phabricator.wikimedia.org/T173468) (owner: ArielGlenn)
[08:08:20] (CR) Filippo Giunchedi: "LGTM, see comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[08:09:54] (PS2) ArielGlenn: dump global block table from central auth [puppet] - https://gerrit.wikimedia.org/r/372507 (https://phabricator.wikimedia.org/T173468)
[08:12:32] (PS2) Filippo Giunchedi: udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642
[08:12:58] (CR) jerkins-bot: [V: -1] udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642 (owner: Filippo Giunchedi)
[08:17:21] (PS3) Filippo Giunchedi: udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642
[08:20:20] (PS4) Filippo Giunchedi: udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642
[08:23:09] (CR) Filippo Giunchedi: [C: 1] "LGTM to my untrained eye, plus a nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) (owner: Gilles)
[08:23:38] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7530/" [puppet] - https://gerrit.wikimedia.org/r/372508 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[08:27:41] (CR) Gehel: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) (owner: Chad)
[08:30:32] (PS8) Gehel: Gerrit: Enable logstash by default for prod gerrit [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[08:30:43] (CR) Gehel: "rebased" [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[08:30:55] (CR) Paladox: "Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: Paladox)
[08:33:03] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is activating
[08:33:22] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3533003 (jcrespo) @thcipriani @aaron I know something was done yesterday, (thank you!), may I ask...
[08:34:03] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[08:35:50] (CR) Marostegui: [C: 2] mysql-dbstore_codfw: Add dbstore2001 - s7 [puppet] - https://gerrit.wikimedia.org/r/372508 (https://phabricator.wikimedia.org/T168409) (owner: Marostegui)
[08:36:47] Operations, vm-requests, Discovery-Search (Current work): refresh hardware for logstash100[123] - https://phabricator.wikimedia.org/T173298#3533006 (akosiaris) @gehel, yes it does. In fact it's a pretty good idea. Resource wise, these hosts look like perfect candidates (the mem requirement of 8GB is...
[08:39:24] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3533020 (akosiaris) >>! In T171167#3531227, @fgiunchedi wrote: > Both issues have been fixed upstream! Pending deployment of latest version of lib...
[08:42:53] (PS1) Marostegui: Revert "db-codfw.php: Depool db2077" [mediawiki-config] - https://gerrit.wikimedia.org/r/372512
[08:43:01] (CR) Marostegui: [C: -2] "Server still catching up" [mediawiki-config] - https://gerrit.wikimedia.org/r/372512 (owner: Marostegui)
[08:43:28] (PS11) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[08:43:30] (PS5) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516)
[08:43:32] (PS1) Jcrespo: mariadb: Disable semisync slave replication on masters [puppet] - https://gerrit.wikimedia.org/r/372513
[08:43:46] marostegui: do you think we should merge 6a4dc12b0f today (after some compiler checks)
[08:44:03] checking
[08:44:35] Ah that one
[08:44:54] Well, i don't see any major issue, but I don't see any big benefit of doing it today or Monday morning :)
[08:48:41] Operations, ops-eqiad, DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3533031 (Marostegui) @Cmjohnson you think the disk will arrive today? I wouldn't want to leave this host depooled for the weekend :-( We can always repool it even we a degraded array but not ideal I guess
[08:49:38] (CR) Marostegui: [C: 1] mariadb: Disable semisync slave replication on masters [puppet] - https://gerrit.wikimedia.org/r/372513 (owner: Jcrespo)
[08:54:55] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3533037 (MoritzMuehlenhoff) I built new jessie packages based on the 2.0.13 release from Github. They're rolled out in deployment-prep on deployment-mediawiki04, deployment...
[08:55:01] akosiaris: about the vm-request for logstash, do you need a new phab task? Or should I just repurpose the current one?
[08:56:38] Operations, Wikimedia-Site-requests, User-MarcoAurelio, Wikimedia-log-errors: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3533046 (Aklapper) @alanajjar: Please avoid adding "+1" / "me too" comments that do not bring a task closer to resolution and create...
[08:58:09] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3533047 (Ladsgroup) The fix is deployed and so I think this task should be closed, can you see an...
[09:01:48] (CR) Gehel: "Seeing the comments on https://gerrit.wikimedia.org/r/#/c/371950, it might make sense to move the entire logback configuration to puppet." [puppet] - https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: Gehel)
[09:01:58] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3533052 (Ladsgroup)
[09:11:42] (CR) Elukey: [C: 1] udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642 (owner: Filippo Giunchedi)
[09:11:54] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3533064 (fgiunchedi) Ping? Not granting thumbor access for newly created wikis it means files uploaded the...
[09:12:00] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3533065 (Marostegui) >>! In T164173#3533047, @Ladsgroup wrote: > The fix is deployed and so I thi...
[09:13:21] gehel: probably a new one is better
[09:14:20] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2077" [mediawiki-config] - https://gerrit.wikimedia.org/r/372512 (owner: Marostegui)
[09:15:46] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2077" [mediawiki-config] - https://gerrit.wikimedia.org/r/372512 (owner: Marostegui)
[09:16:10] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2077" [mediawiki-config] - https://gerrit.wikimedia.org/r/372512 (owner: Marostegui)
[09:16:45] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2077 - T168409 (duration: 00m 44s)
[09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:04] T168409: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409
[09:31:34] Operations, Performance-Team, monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3533103 (jcrespo) Resolved>Open I want to reopen this because I belive this script is causing DBConnection log spam, because it is trying to connect to aawiki on ho...
[09:36:03] (PS25) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[09:37:40] (PS2) Jcrespo: mariadb: Disable semisync slave replication on masters [puppet] - https://gerrit.wikimedia.org/r/372513
[09:38:23] (CR) Jcrespo: [C: 2] mariadb: Disable semisync slave replication on masters [puppet] - https://gerrit.wikimedia.org/r/372513 (owner: Jcrespo)
[09:42:22] !log installing libmspack security updates
[09:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:08] (PS2) Filippo Giunchedi: prometheus: use blackbox-exporter package from Debian [puppet] - https://gerrit.wikimedia.org/r/365239 (https://phabricator.wikimedia.org/T169860)
[09:44:10] (PS2) Filippo Giunchedi: prometheus: additional blackbox checks [puppet] - https://gerrit.wikimedia.org/r/365240 (https://phabricator.wikimedia.org/T169860)
[09:45:37] (PS3) Filippo Giunchedi: prometheus: use blackbox-exporter package from Debian [puppet] - https://gerrit.wikimedia.org/r/365239 (https://phabricator.wikimedia.org/T169860)
[09:46:58] !log ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=property (T171460)
[09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:09] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460
[09:47:12] (CR) Filippo Giunchedi: [C: 2] prometheus: use blackbox-exporter package from Debian [puppet] - https://gerrit.wikimedia.org/r/365239 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:47:41] (PS3) Filippo Giunchedi: prometheus: additional blackbox checks [puppet] - https://gerrit.wikimedia.org/r/365240 (https://phabricator.wikimedia.org/T169860)
[09:48:21] (CR) Filippo Giunchedi: [C: 2] prometheus: additional blackbox checks [puppet] - https://gerrit.wikimedia.org/r/365240 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:49:30] (PS26) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[09:55:33] !log one small pass of ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=entity (T171460)
[09:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:44] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460
[09:56:06] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler03/7535/" [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[09:56:48] (PS12) Jcrespo: mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903)
[09:58:16] (CR) Jcrespo: [C: 2] mariadb: Remove package hacks for MariaDB 10.1 on jessie [puppet] - https://gerrit.wikimedia.org/r/371450 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[09:58:48] WARNING: Revision range includes commits from multiple committers!
[09:58:58] ^godog
[09:59:14] jynus: ugh, sorry, yeah good to merge
[10:07:29] Operations, Wikimedia-Logstash, vm-requests, Discovery-Search (Current work): Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3533136 (Gehel)
[10:07:43] Operations, vm-requests, Discovery-Search (Current work): refresh hardware for logstash100[123] - https://phabricator.wikimedia.org/T173298#3523477 (Gehel) Open>declined Replaced by T173565.
[10:07:51] (PS1) Filippo Giunchedi: hieradata: bump ProxyFetch timeout for thumbor [puppet] - https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930)
[10:08:09] Operations, Wikimedia-Logstash, vm-requests, Discovery-Search (Current work): Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3533136 (Gehel)
[10:08:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1951 bytes in 0.159 second response time
[10:09:00] Amir1: ^
[10:09:10] Could that be related to your script?
[10:09:12] okay
[10:09:17] yeah, it probably is
[10:09:29] I am checking the master, to check if it is not the jobqueues thingy
[10:09:30] stopped
[10:10:04] There is no spike on updates, which is normally what we would see if it was it
[10:10:10] checking binlogs now to be sure
[10:10:50] No, it is not that (which is good!)
[10:13:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.142 second response time
[10:14:19] (CR) Elukey: Increase max kafka message size for changeprop and kafka main (2 comments) [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:15:29] (PS27) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:15:45] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-3h&to=now
[10:17:24] (PS28) Elukey: Increase max kafka message size for changeprop and kafka main [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:17:29] (PS1) Marostegui: sanitarium3: Prepare db1102 to run s4 instance [puppet] - https://gerrit.wikimedia.org/r/372518 (https://phabricator.wikimedia.org/T172996)
[10:19:47] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler03/7538/" [puppet] - https://gerrit.wikimedia.org/r/372518 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[10:21:43] (CR) Elukey: "Last pcc seems fine: https://puppet-compiler.wmflabs.org/compiler02/7537/" [puppet] - https://gerrit.wikimedia.org/r/372179 (owner: Ppchelko)
[10:25:34] dbstore2001 started to swap a bit
[10:26:08] we should probably lower the buffer pool even more
[10:26:17] :(
[10:27:09] Maybe s4 15G?
[10:27:16] And x1 maybe 3G instead of 5?
[10:29:04] (PS6) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516)
[10:29:06] (PS1) Jcrespo: dbstore2001: Lower memory pressure [puppet] - https://gerrit.wikimedia.org/r/372519 (https://phabricator.wikimedia.org/T168409)
[10:29:32] (CR) Marostegui: [C: 1] dbstore2001: Lower memory pressure [puppet] - https://gerrit.wikimedia.org/r/372519 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:29:42] comment there, I was mid-committing when you wrote me
[10:29:49] No, I think that makes sense
[10:29:53] I +1ed
[10:30:05] Remove a bit from every shard, instead of a "bunch" from just one
[10:30:08] we may need to lower it more
[10:30:20] because that is with only replication running
[10:30:58] (PS2) Jcrespo: dbstore2001: Lower memory pressure [puppet] - https://gerrit.wikimedia.org/r/372519 (https://phabricator.wikimedia.org/T168409)
[10:31:33] BTW, yesterday, s2 took a long time to stop
[10:31:47] much more than the other shards, including s1, s5 and s4
[10:32:21] it was flushing its buffer pool so nothing strange
[10:32:41] But nothing like those issues when it was running multi source no?
[10:32:49] except it should not have taken more than the others
[10:32:49] When it took like 30 minutes or so
[10:32:49] think like 5 minutes
[10:32:57] but the others took a mere 30 seconds at most
[10:33:01] \o/
[10:33:14] maybe s2 has something strange
[10:33:25] in any case, right now it only affects 1 host
[10:33:28] *instance
[10:33:34] indeed! :)
[10:33:54] (CR) Jcrespo: [C: 2] dbstore2001: Lower memory pressure [puppet] - https://gerrit.wikimedia.org/r/372519 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[10:34:44] !log restart mysql on dbstore1002 - attempt to reclaim space after big table drop (stop slaves and el_sync, check running queries, stop mysql, check process, start mysql)
[10:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:00] restart mysql?
[10:35:23] jynus: there is some poltergeist there
[10:35:42] doing it with Manuel's supervision
[10:35:46] you better upgrade the kernel at the same time
[10:35:52] it is due, I think
[10:36:17] and will only take 5 additional minutes
[10:36:41] I can definitely do it, but there is the question if the host will be back online or not :D
[10:37:13] better knowing it now than after it crashes during the midnight :-)
[10:37:56] jynus: all right, lemme ping moritzm to verify the kernel update status
[10:48:24] (PS7) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[10:48:29] (PS8) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[10:51:48] (PS9) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[10:52:59] Operations, Ops-Access-Requests, Patch-For-Review, User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3533185 (Addshore) >>! In T171958#3528114, @elukey wrote: > @Addshore if the data is not sensitive we could set up a rsync job in `s...
[10:54:01] (CR) MarcoAurelio: "> I need to manually rebase and add the wiki to the s3.dblist which I" [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: MarcoAurelio)
[10:54:48] elukey: which server are we talking about?
[10:54:56] dbstore1002 :)
[11:00:19] !log reboot dbstore1002 for kernel updates
[11:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:36] (PS2) MarcoAurelio: Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398)
[11:23:51] (CR) Alexandros Kosiaris: [C: 1] "This looks fine for now. I am thinking that we should in the future move this data (alongside the infrastructure data) in the network::sub" [puppet] - https://gerrit.wikimedia.org/r/371949 (owner: Ayounsi)
[11:28:02] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1683.12 seconds
[11:28:03] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1002.54 seconds
[11:28:03] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2121.56 seconds
[11:29:00] (PS4) Alexandros Kosiaris: base::service_unit: move template rendering to the caller [puppet] - https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[11:31:54] (PS1) Addshore: Add seperate ensure for testwiki in wikidata crons [puppet] - https://gerrit.wikimedia.org/r/372525 (https://phabricator.wikimedia.org/T173357)
[11:32:18] (CR) jerkins-bot: [V: -1] Add seperate ensure for testwiki in wikidata crons [puppet] - https://gerrit.wikimedia.org/r/372525 (https://phabricator.wikimedia.org/T173357) (owner: Addshore)
[11:33:32] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/371642 (owner: Filippo Giunchedi)
[11:34:03] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 247.40 seconds
[11:36:21] !log change topology of dbstore2001:x1 and dbstore2002:x1
[11:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:02] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 200.39 seconds
[12:13:54] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3533292 (fgiunchedi) >>! In T169939#3532268, @Eevans wrote: >>>! In T169939#3531892, @Eevans wrote: >> restbase2001.codfw.wmnet has been re-image...
[12:15:10] (PS5) Filippo Giunchedi: udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642
[12:16:13] (CR) Filippo Giunchedi: [C: 2] udev: new module [puppet] - https://gerrit.wikimedia.org/r/371642 (owner: Filippo Giunchedi)
[12:16:23] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 254.83 seconds
[12:18:33] sigh, that change had a dep cycle
[12:20:52] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:23:30] (PS1) Filippo Giunchedi: udev: fix dependency cycle [puppet] - https://gerrit.wikimedia.org/r/372529
[12:23:52] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:24:10] (CR) Filippo Giunchedi: [C: 2] udev: fix dependency cycle [puppet] - https://gerrit.wikimedia.org/r/372529 (owner: Filippo Giunchedi)
[12:24:53] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:25:03] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:25:13] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:26:19] fixed, will be recovering soon
[12:26:52] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[12:27:12] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:27:53] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[12:27:53] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[12:28:03] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[12:28:12] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[12:28:13] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[12:41:10] (CR) Platonides: "I don't know about the internals to determine if it's the right implementation, but it seems good." [puppet] - https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: Paladox)
[12:44:29] !log start of ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=item
[12:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:13] Operations, ops-eqiad, DBA: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3533347 (Marostegui)
[12:51:57] Operations, ops-eqiad, DBA: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3533347 (Marostegui)
[12:56:02] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1948 bytes in 0.174 second response time
[12:56:19] Amir1: ^
[12:56:36] yeah, it's my script, I think it's okay
[12:56:52] is it?
[12:57:27] in ordinary situations no, but I'm rewriting one of the heaviest tables in wikidata
[12:57:32] with 800M rows
[12:58:01] we added lots and lots of wait for replication
[12:58:20] ok
[12:58:34] I'm also monitoring https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-3h&to=now and will stop it if it gets really high
[12:59:36] ok, thanks
[13:04:21] *waves* Amir1 which bits did you rewrite? D:
[13:04:23] :D
[13:04:56] addshore: term_full_entity_id
[13:05:08] we are adding the column values
[13:05:33] it's putting lots of pressure on the dispatcher
[13:07:05] (PS6) Filippo Giunchedi: WIP: new prometheus instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[13:09:19] aaaah
[13:15:30] Operations, Ops-Access-Requests, Patch-For-Review, User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3533418 (elukey) >>! In T171958#3533185, @Addshore wrote: >>>!
In T171958#3528114, @elukey wrote: >> @Addshore if the data is not se... [13:16:08] stopped for now, to give dispatcher some chance to breathe [13:19:42] (03PS1) 10Urbanecm: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) [13:21:02] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1929 bytes in 0.161 second response time [13:22:17] (03PS7) 10Filippo Giunchedi: prometheus: new instance 'services' [puppet] - 10https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) [13:23:42] Hello, is there anyone who can run namespaceDupes.php on hiwikiversity for me? Amir1, addshore? [13:24:12] I can if Ops are okay [13:24:17] It's for T172977 [13:24:17] T172977: Create English namespace aliases for hiwikiversity - https://phabricator.wikimedia.org/T172977 [13:24:26] it's a fairly harmless script [13:24:35] Amir1, should I ask someone? Or will you do it for me? [13:25:42] Urbanecm: do you have an estimate of how long it would take? [13:25:54] I don't know the size of hiwikiversity [13:26:01] Hiwikiversity is a fairly small wiki, it was created recently. [13:26:19] 1Q00 pages in total [13:26:43] !log ladsgroup@terbium:~$ mwscript namespaceDupes.php --wiki=hiwikiversity (T172977) [13:26:45] done [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:39] Amir1, you should pass --fix to actually do anything, shouldn't you? Seems nothing has changed, at least from my POV.
[13:28:11] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7539/ says this is practically fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:28:35] (03PS5) 10Alexandros Kosiaris: base::service_unit: move template rendering to the caller [puppet] - 10https://gerrit.wikimedia.org/r/371076 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:28:59] Amir1, thank you! [13:29:33] Urbanecm: yw, is there anything else I need to do? [13:29:59] It doesn't seem so, the issue seems to be resolved. Thank you again! [13:32:24] !log another run of /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=item --from-id $(tail -100 /tmp/rebuildTermSqlIndex.log | grep -E "Processed up to page (\d+?)" | sed -E "s/Processed up to page //; s/ \(Q.+?//" | tail -1) >>/tmp/rebuildTermSqlIndex.log 2>&1 [13:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:13] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:34:19] (03PS7) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 [13:34:28] (03PS8) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 [13:35:04] (03CR) 10jerkins-bot: [V: 04-1] wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 (owner: 10Gehel) [13:37:11] (03PS9) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 (https://phabricator.wikimedia.org/T171704) [13:38:02] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.145 second response time [13:38:15] How can I ack it for a while? [13:38:34] define a while [13:38:53] I mean the icinga alarm [13:39:16] yeah I got that, what do you mean by "a while"? A few hours? A few days? [13:39:41] two or three hours [13:41:20] done.
I've scheduled downtime for 4 hours [13:41:31] Thanks [13:43:03] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.100 second response time [13:43:25] marostegui: jynus: One thing about the maintenance script is that it batches pages and works on them. Right now it's around Q80,000, where the items have lots of terms, making the whole thing expensive; it will get faster soon [13:44:59] Amir1: personally, I do not care if it is fast or slow, or whether it takes a long time or a short time. I am only concerned about 2 things: [13:45:25] 1) it doesn't affect other processes (schema changes, lagging, editing, other background jobs) [13:45:55] 2) it gets restarted from time to time to get the latest db configuration and mediawiki version [13:48:04] noted [13:48:11] thanks [13:49:52] (03PS2) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [13:55:02] for example, db1026 and db1045 seem to have been lagging heavily since 12:30 [13:55:25] and that blocks writes: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=now-24h&to=now [13:55:37] that, for me, would be an unbreak now [13:56:09] https://logstash.wikimedia.org/goto/0a28f2acd3bd604743761acd9804837b [13:59:36] jynus: why are they lagging, mediawiki or hardware issues?
[14:00:01] there are no hardware issues [14:03:23] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:06:42] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 53 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:07:34] I'm afk for the next several hours (bus ride), back on around 10 or 11 pm and available until Sun afternoon when I'll be doing the same bus ride back [14:08:00] (03PS3) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [14:11:04] (03PS4) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [14:16:42] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:23:02] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:23:06] (03CR) 10Herron: rsyslog: add support to receive syslog over TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [14:28:42] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 28 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:32:02] 10Operations, 10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3533534 (10Anomie) If there's a way to run the Scribunto "--group LuaSandbox" phpunit tests with the new version, that'd be a decent test of things generally working. As for...
[14:33:43] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:40:02] (03PS5) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [14:40:27] (03CR) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [14:52:14] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:06:22] 10Operations: silver: / partition low on space - https://phabricator.wikimedia.org/T151493#2819106 (10herron) In addition to the directories outlined in the description /usr/share/texlive appears to be large as well. Is texlive required for this system? 1.3G /usr/share/texlive [15:17:17] (03CR) 10Ayounsi: [C: 032] Define management networks and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371949 (owner: 10Ayounsi) [15:17:24] (03PS2) 10Ayounsi: Define management networks and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371949 [15:25:30] (03PS1) 10RobH: new ssh key for diego [puppet] - 10https://gerrit.wikimedia.org/r/372535 (https://phabricator.wikimedia.org/T172891) [15:26:05] .... [15:26:10] i got a rejected message on my end [15:27:33] (03CR) 10RobH: [C: 032] new ssh key for diego [puppet] - 10https://gerrit.wikimedia.org/r/372535 (https://phabricator.wikimedia.org/T172891) (owner: 10RobH) [15:28:56] git review timeout shennanigans [15:31:40] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3533606 (10RobH) 05Open>03Resolved New ssh key is pushed, and your access is restored. 
I ran on stat1003 and watched it put your access back in... [15:32:31] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#3533610 (10fgiunchedi) [15:36:04] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3533621 (10Eevans) >>! In T169939#3533292, @fgiunchedi wrote: >>>! In T169939#3532268, @Eevans wrote: >>>>! In T169939#3531892, @Eevans wrote: >>>... [15:49:02] (03PS1) 10Elukey: statistics::rsync::mediawiki: rsync wmde specific logs [puppet] - 10https://gerrit.wikimedia.org/r/372540 (https://phabricator.wikimedia.org/T171958) [15:55:09] (03CR) 10Elukey: [C: 032] statistics::rsync::mediawiki: rsync wmde specific logs [puppet] - 10https://gerrit.wikimedia.org/r/372540 (https://phabricator.wikimedia.org/T171958) (owner: 10Elukey) [15:58:51] XioNoX: --^ [15:59:26] elukey: what's up? [15:59:40] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3533679 (10elukey) Done! ``` elukey@stat1005:/srv/log/mw-log/archive/wmde$ ls -l total 700 -rw-r--r-- 1 stats stats 296 Jul 10 15:... [16:00:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3533680 (10ayounsi) 05Open>03Resolved a:05ayounsi>03elukey [16:00:28] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3533682 (10elukey) @Addshore confirmed over IRC that the logs are not containing super sensitive data (also spot checked and everythin... 
[16:17:35] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team, 10Thumbor: Ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3533693 (10Gilles) a:03Gilles [16:18:36] (03CR) 10Gilles: [C: 031] "Out of curiosity, what was the default?" [puppet] - 10https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: 10Filippo Giunchedi) [16:24:45] anyone know of a tool(s) that can be used to monitor file access? operations, bytes read/written? [16:25:25] (03PS3) 10Gilles: Serve a synth error page when error body is empty in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) [16:25:55] (03PS4) 10Gilles: Serve a synth error page when error body is empty in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) [16:27:34] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3533743 (10fgiunchedi) >>! In T169939#3533621, @Eevans wrote: > Yeah, sadly commitlogs are for all intents and purposes write-only (they're only re... [16:27:38] (03PS1) 10RobH: fixing diego's login [puppet] - 10https://gerrit.wikimedia.org/r/372542 [16:28:15] (03CR) 10Filippo Giunchedi: "> Out of curiosity, what was the default?" 
[puppet] - 10https://gerrit.wikimedia.org/r/372517 (https://phabricator.wikimedia.org/T172930) (owner: 10Filippo Giunchedi) [16:30:26] (03CR) 10RobH: [C: 032] fixing diego's login [puppet] - 10https://gerrit.wikimedia.org/r/372542 (owner: 10RobH) [16:34:00] (03PS1) 10Gilles: Add Prometheus lua script for nginx-full [puppet/nginx] - 10https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) [16:36:07] (03PS3) 10Gilles: Expose Thumbor Nginx metrics in Prometheus format [puppet] - 10https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554) [16:36:39] (03PS2) 10Gilles: Add Prometheus lua script for nginx-full [puppet/nginx] - 10https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) [16:37:10] (03PS4) 10Gilles: Expose Thumbor Nginx metrics in Prometheus format [puppet] - 10https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554) [16:38:10] (03PS3) 10Gilles: Add Prometheus lua script for nginx-extras [puppet/nginx] - 10https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) [16:38:48] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3533785 (10bmansurov) [16:42:38] (03PS3) 10BBlack: numa_networking: new mode "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/362438 [16:42:40] (03PS4) 10BBlack: NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 [16:42:42] (03PS1) 10BBlack: numa.rb: add nodes and inverted cpumask [puppet] - 10https://gerrit.wikimedia.org/r/372544 [16:43:17] (03CR) 10jerkins-bot: [V: 04-1] numa_networking: new mode "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/362438 (owner: 10BBlack) [16:44:13] PROBLEM - DPKG on restbase1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:45:13] RECOVERY - DPKG on restbase1007 is OK: 
All packages OK [16:47:17] (03Abandoned) 10Ayounsi: Add goransm to the mw-log-readers group [puppet] - 10https://gerrit.wikimedia.org/r/372165 (https://phabricator.wikimedia.org/T171958) (owner: 10Ayounsi) [16:51:18] (03PS4) 10BBlack: numa_networking: new mode "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/362438 [16:51:20] (03PS5) 10BBlack: NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 [16:52:39] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3533803 (10Eevans) >>! In T169939#3533743, @fgiunchedi wrote: >>>! In T169939#3533621, @Eevans wrote: >> Yeah, sadly commitlogs are for all intents... [17:01:49] (03CR) 10BBlack: [C: 032] numa.rb: add nodes and inverted cpumask [puppet] - 10https://gerrit.wikimedia.org/r/372544 (owner: 10BBlack) [17:06:10] (03CR) 10BBlack: [C: 032] numa_networking: new mode "isolate" [puppet] - 10https://gerrit.wikimedia.org/r/362438 (owner: 10BBlack) [17:19:08] (03CR) 10BBlack: [C: 032] NUMA binding for cache frontends under 'isolate' [puppet] - 10https://gerrit.wikimedia.org/r/362439 (owner: 10BBlack) [17:23:44] (03PS1) 10BBlack: numa: bugfix frontend isolation prefix [puppet] - 10https://gerrit.wikimedia.org/r/372547 [17:24:03] (03CR) 10BBlack: [V: 032 C: 032] numa: bugfix frontend isolation prefix [puppet] - 10https://gerrit.wikimedia.org/r/372547 (owner: 10BBlack) [17:28:25] (03CR) 10Ppchelko: "thank you @elukey, have a fun vacation" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [17:33:53] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (10ayounsi) asw2-d-eqiad:ge-6/0/3 (Description: labstore1007, MAC: 30:e1:71:5f:9d:94 ) has been flapping at a rate of ~40 up/downs per hour.... 
[17:52:44] (03PS1) 10BBlack: apply numa isolation to all new cp4 hosts [puppet] - 10https://gerrit.wikimedia.org/r/372549 [18:01:20] Hi, for ~ 2 hours we have been seeing increased latency and an increased number of connectionTimeout exceptions when calling the Query api with prop=extracts of en.wikipedia.org/w/api.php from the UK. Is this a known issue? [18:15:14] amzndev: no known issue AFAIK [18:16:15] (but there could well be one I don't know about!) [18:16:59] (03PS1) 10BBlack: Configure cp4021-28 into upload+text clusters [puppet] - 10https://gerrit.wikimedia.org/r/372554 (https://phabricator.wikimedia.org/T171967) [18:20:30] (03CR) 10BBlack: [C: 032] Configure cp4021-28 into upload+text clusters [puppet] - 10https://gerrit.wikimedia.org/r/372554 (https://phabricator.wikimedia.org/T171967) (owner: 10BBlack) [18:20:34] (03CR) 10BBlack: [C: 032] apply numa isolation to all new cp4 hosts [puppet] - 10https://gerrit.wikimedia.org/r/372549 (owner: 10BBlack) [18:21:39] !log puppeting cp4021-8 - expect possibility of ipsec alerts, etc... [18:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:21] (03PS1) 10MaxSem: Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 [18:29:13] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3534026 (10ayounsi) Those changes should land in the august release of LibreNMS (https://github.com/librenms/librenms/releases), most likely due nex... 
[18:32:52] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 60 connecting: cp4022_v4, cp4022_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:32:52] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:33:02] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:33:12] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:33:42] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:33:49] bleh [18:35:13] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:23] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:23] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:32] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:32] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:32] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:32] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 
connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:33] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:33] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:33] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:40] ignore the ipsec alerts, as mentioned earlier. somewhat hard to avoid when bringing up fresh nodes :P [18:35:42] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:43] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:53] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:53] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:53] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:53] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:53] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:35:53] PROBLEM - IPsec on 
cp1065 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:35:53] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:02] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:36:02] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:36:02] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:02] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:02] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:03] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:03] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:12] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:36:12] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 72 connecting: cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [18:36:22] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 44 connecting: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [18:36:43] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? 
- https://phabricator.wikimedia.org/T127550#3534031 (10Dzahn) Thanks Aklapper. I mailed Ariley and Winifred (blindly trying the old @wikimedia.org addresses). [18:36:53] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 60 ESP OK [18:36:53] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 60 ESP OK [18:36:53] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 48 ESP OK [18:36:53] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 48 ESP OK [18:37:02] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 48 ESP OK [18:37:02] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 48 ESP OK [18:37:02] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 60 ESP OK [18:37:03] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 60 ESP OK [18:37:03] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 60 ESP OK [18:37:03] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 60 ESP OK [18:37:10] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3534032 (10Dzahn) p:05Normal>03Low [18:37:22] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 48 ESP OK [18:37:32] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 48 ESP OK [18:37:33] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 48 ESP OK [18:37:43] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 60 ESP OK [18:37:43] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 60 ESP OK [18:37:52] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 48 ESP OK [18:39:03] wee new cp systems! [18:39:12] bblack: that mean next week i can kill more old cp systems? =] [18:40:00] (i may be getting ahead of things, i just wanna redo ulsfo badly!) [18:40:02] robh: probably not immediately, but soon-ish. We'll have to shift the traffic over gently first [18:41:33] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? 
- https://phabricator.wikimedia.org/T127550#3534033 (10Dzahn) @bcampbell or @bbogaert Could you check for me on the Google side if you have any group/alias/account called studentclubs@wikimedia.org nowadays? Thanks! [18:43:50] 10Operations, 10Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#3534035 (10Dzahn) @bbogaert maybe we can give it another try now that the Wikimania travel season is mostly over? I don't know why it failed but there must be something that makes these aliases different from... [18:44:45] (03CR) 10Chad: "Got a link to the original in upstream for the appropriate branch? Would make it easier to compare :)" [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:45:28] 10Operations, 10Mail, 10Office-IT, 10WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3534055 (10Dzahn) @bcampbell or @bbogaert Can we move legal-tm-vio@wikimedia.org to Google please? We did almost everything else but this is one of the few left. Cheers, Daniel [18:47:01] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3534057 (10bcampbell) @Dzahn Just checked. We don't. [18:47:22] wow, https://phabricator.wikimedia.org/T122144 proves that phab is bad at large task graphs [18:49:32] or maybe the 50 sub tasks of a single task is a bit much :P [18:50:32] dont see the problem. browser doesnt crash and graph is accurate :) [18:50:37] (03CR) 10Chad: "Also requires uncommenting the line from $java_opts or whatever it's called in jetty.pp (although ideally, we could move that array to hie" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [18:51:42] have you seen https://phabricator.wikimedia.org/T104681 ? :) [18:52:16] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? 
- https://phabricator.wikimedia.org/T127550#3534061 (10Dzahn) @bcampbell Thank you, i'll just remove it then (unless Winifred tells me otherwise). [18:52:44] That one at least it doesn't cut off the bug names [18:53:04] * bawolff looks to the task graph to get a list of subtasks, not to look at the pretty graph ;) [18:58:04] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [18:58:05] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [18:58:05] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [18:58:05] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [18:58:14] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [18:58:14] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [18:58:24] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [18:58:24] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [18:58:25] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [18:58:35] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [18:58:35] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [18:58:44] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [18:58:45] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [18:58:45] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [18:58:54] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [18:58:54] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [18:58:55] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [19:01:57] !log reboot cp4021-8 [19:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:56] !log cp1074 - varnish backend restart (mailbox lag) [19:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:47] (03CR) 10Mobrovac: [C: 031] Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [19:33:41] (03CR) 10Mobrovac: "Some 
minor comments in-lined" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: 10Filippo Giunchedi) [19:39:26] bblack, hey [19:40:57] (03CR) 10Mobrovac: "Why is restbase2001 not included?" [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [19:56:37] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3534229 (10mobrovac) When would JBOD be pushed out for, then? September of later? If the latter, then we would need to reformat/reimage the nodes,... [19:57:16] (03CR) 10Mobrovac: [C: 031] "> Why is restbase2001 not included?" [puppet] - 10https://gerrit.wikimedia.org/r/372469 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [20:01:37] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3534242 (10Eevans) >>! In T169939#3534229, @mobrovac wrote: > When would JBOD be pushed out for, then? September of later? If the latter, then we w... [20:02:42] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3534243 (10mobrovac) >>! In T159922#3532373, @GWicke wrote: > Assuming this was an i... [20:12:48] (03CR) 10Niharika29: [C: 031] "LGTM. Schedule for SWAT on Monday?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [20:24:48] 10Operations, 10Performance-Team, 10monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3534274 (10Krinkle) a:05Krinkle>03aaron Hm.. seems plausible indeed. `/maintenance/getLagTimes.php`: * `GetLagTimes::execute()` -> `MediaWikiServices::getDBLoadBalancerF... 
[20:26:52] (03PS1) 10Chad: Scap clean: ensure proper ordering of commands [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372564 [20:28:53] (03CR) 10Smalyshev: "Let's discuss it on Monday. I feel it's ripe for some cleanup :)" [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [20:42:47] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.9 [keeping static files] (duration: 01m 43s) [20:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] !log prior thing was a no-op, testing [20:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:41] (03PS2) 10Chad: Scap clean: ensure proper ordering of commands and better error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372564 [20:45:39] (03CR) 10Chad: [C: 032] Scap clean: ensure proper ordering of commands and better error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372564 (owner: 10Chad) [20:47:07] (03Merged) 10jenkins-bot: Scap clean: ensure proper ordering of commands and better error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372564 (owner: 10Chad) [20:47:17] (03CR) 10jenkins-bot: Scap clean: ensure proper ordering of commands and better error handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372564 (owner: 10Chad) [20:48:31] thcipriani: I made it so scap clean doesn't break on multiple runs behind itself [20:48:55] !log demon@tin Synchronized scap/plugins/clean.py: Completeness (duration: 00m 44s) [20:48:58] jouncebot: next [20:48:58] In 64 hour(s) and 11 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1300) [20:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:54] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): Simplify git-fat support for pulling from 
both production and labs - https://phabricator.wikimedia.org/T171758#3475312 (10ksmith) There was a request in the 2017-08-16 Scrum of Scrums from the Scoring... [20:51:03] (03CR) 10Dzahn: [C: 032] "nitpick, is it really a "limit" if it's called "initial Java heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) (owner: 10Chad) [20:52:01] (03PS2) 10Dzahn: Gerrit: Also set minimum heap size [puppet] - 10https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) (owner: 10Chad) [20:55:53] !log restarting gerrit on gerrit2001 [20:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] RainbowSprinkles: i tried the restart on gerrit2001 first.. it failed [20:56:53] did not do cobalt, puppet changed config though [20:57:56] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:58:16] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:58:55] PROBLEM - SSH access on gerrit2001 is CRITICAL: connect to address 208.80.153.106 and port 29418: Connection refused [20:59:17] that's not the active server .. 
but yea [20:59:24] that's why i tried it there first [21:01:02] (03PS1) 10Dzahn: Revert "Gerrit: Also set minimum heap size" [puppet] - 10https://gerrit.wikimedia.org/r/372569 [21:01:51] (03PS2) 10Dzahn: Revert "Gerrit: Also set minimum heap size" [puppet] - 10https://gerrit.wikimedia.org/r/372569 [21:02:21] (03CR) 10Dzahn: [C: 032] Revert "Gerrit: Also set minimum heap size" [puppet] - 10https://gerrit.wikimedia.org/r/372569 (owner: 10Dzahn) [21:04:25] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [21:04:55] RECOVERY - SSH access on gerrit2001 is OK: SSH OK - GerritCodeReview_2.13.8-11-gde96955fb2 (SSHD-CORE-1.2.0) (protocol 2.0) [21:05:00] !log gerrit2001 - restarted gerrit again after reverting gerrit:372426, using systemctl commands, not 'service' or init.d [21:05:05] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [21:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:56] (03CR) 10Kaldari: Reinforce LoginNotify settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [21:06:04] (03CR) 10Dzahn: "17:04 < icinga-wm> RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational" [puppet] - 10https://gerrit.wikimedia.org/r/372569 (owner: 10Dzahn) [21:07:06] (03CR) 10Dzahn: "had to revert it, service fails to start with this change, recovered after reverting and restarting again. 
only touched gerrit2001, not co" [puppet] - 10https://gerrit.wikimedia.org/r/372426 (https://phabricator.wikimedia.org/T148478) (owner: 10Chad) [21:08:22] (03PS2) 10MaxSem: Reinforce LoginNotify settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 [21:09:46] (03CR) 10MaxSem: Reinforce LoginNotify settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372555 (owner: 10MaxSem) [21:10:31] (03CR) 10Dzahn: [C: 032] Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:10:40] (03PS5) 10Dzahn: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:13:33] mutante: Can we rip out the old init.d yet if it's working via systemctl? [21:13:54] RainbowSprinkles: yea, we probably should [21:14:18] there was something about it being installed from package vs puppet anyways, afair [21:14:26] Oh yeah [21:14:33] * RainbowSprinkles will avoid that one today [21:14:49] RainbowSprinkles: should we try the https://gerrit.wikimedia.org/r/#/c/366910/ right now ?:) [21:15:21] Ehhhhh [21:15:30] It's either gonna work or not work immediately, but the rollback will be a pain [21:15:55] Disable puppet -> revert locally -> restart -> change it to revert in gerrit -> push out new puppet config -> enable puppet [21:16:10] is it possible to test on gerrit2001 without touching cobalt [21:16:46] Not easily. Not much in the way of user stuff goes on there. [21:16:54] I mean, you couldn't create a new account [21:17:16] iirc, certs are a little weird so logging in might be difficult [21:17:32] Oh yeah, I haven't landed the apache fixes for that yet either [21:17:35] any of the labs instances ? [21:17:44] i think paladox has LDAP backend too [21:18:03] a non-prod ldap.. 
i forget [21:20:21] well ok, it doesnt sound like good for Friday then :) [21:21:46] (03CR) 10Dzahn: [C: 031] "we'll land it another time and not on Friday afternoon :) also might need Apache change to go with it" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:23:51] (03CR) 10Dzahn: [C: 04-1] ">Oh, this is for my testing and not to be merged :)" [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 (owner: 10Paladox) [21:33:30] (03PS2) 10Herron: WIP: Add acl to warn of invalid/forged HELO messages on lists [puppet] - 10https://gerrit.wikimedia.org/r/372174 (https://phabricator.wikimedia.org/T173338) [21:38:13] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3534437 (10Halfak) [21:40:30] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3534441 (10Halfak) [21:42:44] (03PS9) 10Paladox: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [21:48:14] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3534450 (10Dzahn) This is where this has been used: https://outreach.wikimedia.org/wiki/Student_Organizations/Create_your_Clubhouse https://www.google.com/search?q=%22studentclubs%40wiki... [21:49:23] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#3534456 (10Dzahn) [21:57:12] 10Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#3534464 (10Dzahn) [21:57:14] 10Operations, 10Mail: status of studentgroups@ and studentclubs@ mail aliases? 
- https://phabricator.wikimedia.org/T127550#3534462 (10Dzahn) 05Open>03Resolved removed exim alias in private repo. ``` -# studentgroups RT-1609 -studentgroups: studentclubs - ``` it was just about studentgroups -> studen... [21:59:12] 10Operations, 10Gerrit, 10ORES, 10Scap, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3534478 (10demon) I thought about it yesterday. We should just bite the bullet and get git-lfs support for Gerrit. This is possible and is the... [21:59:40] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3534481 (10Dzahn) ^ Merged but had to revert it as well. After merging the service would fail to start (tested on gerrit2001 before doi... [22:00:35] (03PS1) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 [22:01:12] (03CR) 10jerkins-bot: [V: 04-1] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (owner: 10Krinkle) [22:02:03] (03PS2) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 [22:02:38] (03PS3) 10Krinkle: webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) [22:04:04] paladox: https://gerrit.wikimedia.org/r/#/c/307441/ and https://gerrit.wikimedia.org/r/#/c/307439/ are outdated because they use "-labs" in the channel name [22:04:13] labscloud [22:06:39] and https://gerrit.wikimedia.org/r/#/c/316289/ is outdated because analytics/wikistats the original Perl script by ezachte is not maintained anymore and currently replaced by new analytics-wikistats [22:08:10] thanks [22:08:53] on 
https://gerrit.wikimedia.org/r/#/c/317805/ you said you were planning to split it into smaller patches, so might not make sense to keep .. not sure [22:10:05] Krinkle: since you said polygerrit :) https://gerrit.wikimedia.org/r/#/c/368547/ [22:10:18] "Gerrit: Add wmf branding to PolyGerrit" heh [22:10:20] Ah yep [22:10:32] mutante :) [22:10:37] that's supported in gerrit 2.15 [22:11:17] that remind me, i need to create a change to get gerrit to write to notedb and reviewdb (will prevent having downtime and other problems in the future) :) [22:11:32] ok :) [22:22:45] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:43:00] on install,oh really [22:44:56] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [23:55:00] (03PS15) 10Krinkle: Gerrit: Add wmf branding to PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox)