[01:30:31] PROBLEM - High lag on wdqs1003 is CRITICAL: 3628 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:17:42] PROBLEM - High lag on wdqs1003 is CRITICAL: 3649 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:20:01] PROBLEM - High lag on wdqs1003 is CRITICAL: 3656 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:22:11] PROBLEM - High lag on wdqs1003 is CRITICAL: 3657 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:33:24] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 13m 36s) [02:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:43] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Jul 16 02:43:43 UTC 2018 (duration 10m 19s) [02:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:12] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1177 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:27:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 882.43 seconds [04:07:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.15 seconds [05:09:56] !log Deploy schema change on db2075 T144010 T51190 T199368 [05:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:03] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:10:03] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:10:04] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:24:33] !log Deploy schema change on db2059 T144010 T51190 T199368 [05:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:38] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:24:38] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:24:39] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [06:23:37] (03PS11) 10Zoranzoki21: Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) [06:49:01] (03PS1) 10Muehlenhoff: Add bpirkle to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/445945 [06:50:55] (03CR) 10Muehlenhoff: [C: 032] Add bpirkle to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/445945 (owner: 10Muehlenhoff) [07:04:37] (03PS3) 10Elukey: turnilo: adding acomputed measure of ratio of bot requests on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:05:32] (03PS4) 10Elukey: turnilo: add computed measure of bot requests' ratio on pageviews ds [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:05:40] (03PS5) 10Elukey: turnilo: add computed measure of bot requests' ratio on pageviews ds [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:05:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445949 (https://phabricator.wikimedia.org/T199368) [07:07:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445949 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:09:13] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445949 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:09:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445949 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [07:10:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1110 for alter table (duration: 00m 50s) [07:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:58] !log Deploy schema change on db1110 T144010 T51190 T199368 [07:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:03] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [07:11:03] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [07:11:03] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [07:13:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445951 [07:15:02] (03PS1) 10Zoranzoki21: Remove frwiki outdated entries in robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445952 (https://phabricator.wikimedia.org/T199496) [07:16:03] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445951 (owner: 10Marostegui) [07:17:11] (03CR) 10Framawiki: [C: 031] Remove frwiki outdated entries in robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445952 (https://phabricator.wikimedia.org/T199496) (owner: 10Zoranzoki21) [07:17:59] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445951 (owner: 10Marostegui) [07:18:21] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445951 (owner: 10Marostegui) [07:19:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 after alter table (duration: 00m 50s) [07:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:54] !log Deploy schema change on dbstore1002:s5 T144010 T51190 T199368 [07:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:00] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [07:20:00] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [07:20:01] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [07:23:35] (03PS1) 10Zoranzoki21: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) [07:24:54] (03CR) 10jerkins-bot: [V: 04-1] Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) (owner: 10Zoranzoki21) [07:26:24] (03PS2) 10Zoranzoki21: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) [07:33:22] !log Drop unused grants on codfw hosts [07:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:21] 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) 05Resolved>03Open This has happened again, same disk, disk #0, can we get another one? Please ping me before replacing it so I can manually put it offline ``` Enclosure Device ID: 32 Slot Number:... [07:53:52] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1069:9100 job=node site=eqiad Marostegui T199056 - The acknowledgement expires at: 2018-07-19 07:53:32. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [07:59:08] (03PS4) 10Filippo Giunchedi: grafana: Remove varnish-http-errors dashboard [puppet] - 10https://gerrit.wikimedia.org/r/445336 (owner: 10Krinkle) [07:59:21] (03CR) 10Joal: "Naming nits, except from that looks good." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:59:33] (03CR) 10Filippo Giunchedi: [C: 032] grafana: Remove varnish-http-errors dashboard [puppet] - 10https://gerrit.wikimedia.org/r/445336 (owner: 10Krinkle) [08:05:23] !log Drop unused grants on eqiad hosts [08:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:01] (03PS1) 10Muehlenhoff: Blacklist cdrom kernel module [puppet] - 10https://gerrit.wikimedia.org/r/445954 [08:19:27] !log run xfs_repair on filesystems reporting negative space available on ms-be1041 - T199198 [08:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:31] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:25:14] (03CR) 10Filippo Giunchedi: [C: 031] Blacklist cdrom kernel module [puppet] - 10https://gerrit.wikimedia.org/r/445954 (owner: 10Muehlenhoff) [08:26:18] (03PS2) 10Muehlenhoff: Enable microcode on restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/445573 (https://phabricator.wikimedia.org/T127825) [08:27:03] !log Drop unused grants on db1108 - this might get dbproxy1009 to complain [08:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:08] elukey: ^ [08:28:27] (03CR) 10Muehlenhoff: [C: 032] Enable microcode on restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/445573 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [08:29:12] !log put back ms-be1036 to full weight - T196873 [08:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:16] T196873: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 [08:35:08] (03PS2) 10Muehlenhoff: Enable microcode for Swift backend servers [puppet] - 10https://gerrit.wikimedia.org/r/445576 (https://phabricator.wikimedia.org/T127825) [08:36:00] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for Swift backend servers [puppet] - 10https://gerrit.wikimedia.org/r/445576 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [08:52:32] (03CR) 10Vgutierrez: [C: 031] Blacklist cdrom kernel module [puppet] - 10https://gerrit.wikimedia.org/r/445954 (owner: 10Muehlenhoff) [09:02:28] (03CR) 10Vgutierrez: [C: 032] set up parking dns zones for the top 10 of current NXDOMAIN responses [dns] - 10https://gerrit.wikimedia.org/r/445611 (https://phabricator.wikimedia.org/T199525) (owner: 10Vgutierrez) [09:02:34] (03PS2) 10Vgutierrez: set up parking dns zones for the top 10 of current NXDOMAIN responses [dns] - 10https://gerrit.wikimedia.org/r/445611 (https://phabricator.wikimedia.org/T199525) [09:14:28] (03PS10) 10Giuseppe Lavagetto: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [09:15:39] (03CR) 10jerkins-bot: [V: 04-1] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [09:32:41] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:34:56] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) >>! In T197242#4409363, @Jrbranaa wrote: > Hey @Mvolz, just wanted to check in on this task. It seems like we are waiting on Zotero te... [09:42:03] 10Operations: Integrate Stretch 9.5 point release - https://phabricator.wikimedia.org/T199670 (10MoritzMuehlenhoff) [09:43:20] 10Operations: Integrate Stretch 9.5 point release - https://phabricator.wikimedia.org/T199670 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:44:17] (03PS1) 10Filippo Giunchedi: base: alert on filesystem available greater than filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) [09:46:52] 10Operations, 10Traffic, 10Patch-For-Review: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) After merging change 445611, we have the following offenders at the top 10: ```$ tshark -r dns.pcap -Y "dns.flags == 0x8005" -Tfields -e dns.qry.na... [09:53:02] (03PS1) 10Vgutierrez: create parking zones for top nxdomain offenders: [dns] - 10https://gerrit.wikimedia.org/r/445965 [09:53:34] (03PS2) 10Vgutierrez: create parking zones for top nxdomain offenders [dns] - 10https://gerrit.wikimedia.org/r/445965 [09:54:57] (03CR) 10Vgutierrez: [C: 032] create parking zones for top nxdomain offenders [dns] - 10https://gerrit.wikimedia.org/r/445965 (owner: 10Vgutierrez) [09:59:03] (03PS2) 10Marostegui: db-eqiad.php: Set up s1 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445369 (https://phabricator.wikimedia.org/T197069) [10:01:11] (03CR) 10Volans: "Looks good. Some questions and minor post-merge comments and nitpicks inline." (0315 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [10:02:27] 15 comments... but it looks good /o\ [10:04:06] some are questions :-P [10:04:18] thx for the review BTW <3 [10:04:18] and some are for existing code :) [10:04:21] yw [10:05:46] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445966 (https://phabricator.wikimedia.org/T128546) [10:07:04] (03CR) 10Volans: "question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) (owner: 10Filippo Giunchedi) [10:08:27] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445966 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:46] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445966 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:09:59] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445966 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:12:42] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:445966|Bumping portals to master (T128546)]] (duration: 00m 50s) [10:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:46] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:13:32] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:445966|Bumping portals to master (T128546)]] (duration: 00m 50s) [10:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:52] 10Operations, 10Traffic, 10Patch-For-Review: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) refused queries dropped significantly after merging change 445611 as well. {F23795364} I guess that we should keep and eye on this recurrently [10:17:45] 10Operations, 10Traffic: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [10:20:30] (03CR) 10ArielGlenn: [C: 031] "Can't find any uses of this class on a jessie host (not in deployment-prep either) so, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/445580 (owner: 10Muehlenhoff) [10:20:42] (03PS1) 10Ladsgroup: Write to the new change tag backend in frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445969 (https://phabricator.wikimedia.org/T194165) [10:22:14] (03PS1) 10Giuseppe Lavagetto: conftool: switch to install python 3 version by default [puppet] - 10https://gerrit.wikimedia.org/r/445970 [10:22:31] <_joe_> volans: ^^ [10:22:48] * volans looking [10:24:20] can we just "switch"? there isn't software around using the py2 version that will need to be changed? [10:24:54] (03PS1) 10Muehlenhoff: Puppetise script to add firmware to netinst image [puppet] - 10https://gerrit.wikimedia.org/r/445972 (https://phabricator.wikimedia.org/T198327) [10:25:08] _joe_: ^^^ [10:25:39] <_joe_> volans: the only thing I can think of is switchdc [10:25:45] (03CR) 10ArielGlenn: [C: 031] "Heh, just noticed the wasat name was still active and then found this." [dns] - 10https://gerrit.wikimedia.org/r/445617 (https://phabricator.wikimedia.org/T193915) (owner: 10Muehlenhoff) [10:25:48] contint? [10:25:54] <_joe_> and wmf_auto_reimage [10:26:10] <_joe_> that would use it as a binary, and using the v3 version should not change anything [10:27:36] ok, same for etcd? [10:28:51] <_joe_> python-etcd only gets installed as a dependency of python-conftool AFAIR [10:29:03] <_joe_> but where that's not the case, v2 can coexist with v3 [10:29:05] it's specified in the same .pp [10:29:22] <_joe_> uhm, right, well that's wrong I think [10:30:04] <_joe_> you mean in CI? [10:30:28] yes modules/contint/manifests/packages/ops.pp [10:30:31] <_joe_> yeah that needs fixing [10:30:39] <_joe_> I'm not even sure we're using that class anymore btw [10:30:45] lol [10:30:48] <_joe_> the tests for conftool run in docker [10:31:02] (03PS11) 10Giuseppe Lavagetto: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [10:31:21] <_joe_> hashar: do we still use conting::packages::ops ? [10:32:02] (03PS2) 10Giuseppe Lavagetto: conftool: switch to install python 3 version by default [puppet] - 10https://gerrit.wikimedia.org/r/445970 [10:35:32] _joe_: yes for the operations-dns-lint job [10:35:56] the etcd / conftool part, that is probably no more needed [10:36:10] since iirc that is now running in the Docker container you crafted for the puppet patches [10:36:30] <_joe_> yeah [10:36:37] <_joe_> so let's just remove that part [10:36:40] <_joe_> doing it [10:50:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/445970 (owner: 10Giuseppe Lavagetto) [10:59:23] jouncebot: now [10:59:23] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [11:00:06] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T1100). Please do the needful. [11:00:06] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T1100). [11:00:06] Zoranzoki21, phuedx, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] present [11:00:18] I am here [11:00:27] I can SWAT today [11:00:37] Amir1: go ahead while I review other commits [11:00:42] sure [11:01:09] zeljkof, I'm also here - I can deploy phuedx patch [11:01:59] raynor: great, I think you can +2 it now since it will take some time to merge it [11:02:09] raynor: do you need a lot of time to test it? [11:02:55] nope, it shouldn't be a big think [11:03:08] the problem is that we cannot log into commons via mobile site right now [11:03:14] Amir1, raynor: I've done some updates to the docs, please take a look and update as needed https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [11:03:26] Thanks! [11:03:42] so most of the checks is just logging in via mobile site plus checking that session is preserved [11:04:36] zeljkof, thanks, those changes are great, you also included the backports - thats awesome [11:04:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445969|Write to the new change tag backend in frwiki (T194165)]] (duration: 00m 50s) [11:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:55] T194165: Start writing to change_tag_def in production - https://phabricator.wikimedia.org/T194165 [11:05:45] Deployed [11:05:48] ok zeljkof -> I'm merging the https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/445957/ [11:05:48] I'm done [11:05:54] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet [11:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:21] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [11:06:31] RECOVERY - Host cp5001 is UP: PING OK - Packet loss = 0%, RTA = 244.79 ms [11:06:31] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 80 ESP OK [11:06:32] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [11:06:41] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [11:06:41] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 80 ESP OK [11:06:42] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK [11:06:42] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [11:06:46] Amir1: ok, taking over then [11:06:51] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 80 ESP OK [11:06:51] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [11:06:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 80 ESP OK [11:06:51] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 80 ESP OK [11:06:52] !log start of ladsgroup@mwmaint1001:~$ mwscript populateChangeTagDef.php --wiki=frwiki [11:06:52] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 80 ESP OK [11:06:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [11:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:01] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 80 ESP OK [11:07:02] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 80 ESP OK [11:07:02] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [11:07:11] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [11:07:12] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [11:07:12] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [11:07:12] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 80 ESP OK [11:07:12] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 80 ESP OK [11:07:13] raynor: I'll deploy a few config changes while your patch gets merged, let me know when it's merged [11:08:39] sure, will do [11:11:50] Zoranzoki21: reviewing 445952 [11:12:01] ok zeljkof [11:12:39] !log mobrovac@deploy1001 Started deploy [changeprop/deploy@ab8f7e9]: Bug fix: Remove anchors in blacklisting URIs and decode event URIs - T198386 [11:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:43] T198386: Move static rerender blacklist from RESTBase to ChangeProp - https://phabricator.wikimedia.org/T198386 [11:14:12] !log mobrovac@deploy1001 Finished deploy [changeprop/deploy@ab8f7e9]: Bug fix: Remove anchors in blacklisting URIs and decode event URIs - T198386 (duration: 01m 33s) [11:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:05] Zoranzoki21: is there anything to tests for 445952? [11:15:11] or should I just deploy it? [11:15:26] You can just deploy it [11:15:33] ok [11:15:36] This is not related to interface [11:15:46] deploying [11:16:54] zeljkof: ok [11:18:19] !log zfilipin@deploy1001 Synchronized robots.txt: SWAT: [[gerrit:445952|Remove frwiki outdated entries in robots.txt (T199496)]] (duration: 00m 49s) [11:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:23] T199496: Remove frwiki outdated entries in robots.txt - https://phabricator.wikimedia.org/T199496 [11:19:23] thanks [11:19:41] zeljkof, the core change got merged, let me know when I can pull the changes [11:19:59] raynor: I'm done in a minute, will ping you [11:20:07] kk, take your time [11:20:18] !log zfilipin@deploy1001 Synchronized robots.txt: SWAT: [[gerrit:445952|Remove frwiki outdated entries in robots.txt (T199496)]] (duration: 00m 50s) [11:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] Zoranzoki21: 445952 is deployed [11:21:00] raynor: SWAT is all yours! [11:21:09] zeljkof: Ok. Thank you! Other patches I will tomorrow [11:21:21] thank you zeljkof [11:21:44] Zoranzoki21: see you tomorrow! :D [11:22:17] zeljkof, just to confirm [11:22:27] On branch wmf/1.32.0-wmf.12 [11:22:28] Your branch is ahead of 'origin/wmf/1.32.0-wmf.12' by 2 commits. [11:22:51] thats normal, there are 2 security patches on top of the code - Do not allow botpassword login and make newbie limit [11:22:51] sounds good [11:23:03] why don't we merge those two things back to master? :) [11:23:06] raynor: don't mention the security patches :) [11:23:25] I don't know how security releases work :/ [11:23:30] ah, sorry, didn't know that [11:30:19] 10Operations, 10ops-esams, 10Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (10Vgutierrez) ```root@cp3033:/var/log# ethtool -i eth0 driver: bnx2x version: 1.712.30-0 firmware-version: FFV7.10.17 bc 7.10.11 bus-info: 0000:01:00.0 supports-statistics: yes sup... [11:30:47] (03PS2) 10Muehlenhoff: Update DNS config for wasat rename [dns] - 10https://gerrit.wikimedia.org/r/445617 (https://phabricator.wikimedia.org/T193915) [11:30:49] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/445957/ tested on mwdebug1002 -> login logic works properly, deploying to production [11:32:32] (03CR) 10Muehlenhoff: [C: 032] Update DNS config for wasat rename [dns] - 10https://gerrit.wikimedia.org/r/445617 (https://phabricator.wikimedia.org/T193915) (owner: 10Muehlenhoff) [11:32:42] (03PS1) 10Elukey: deployment-prep: deploy the analytics key in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/445984 [11:32:51] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:32:52] PROBLEM - swift-container-server on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:32:52] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:01] PROBLEM - DPKG on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:01] PROBLEM - swift-container-updater on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:01] PROBLEM - swift-account-reaper on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:02] PROBLEM - Disk space on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:12] PROBLEM - swift-account-server on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:12] PROBLEM - swift-account-replicator on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:21] PROBLEM - swift-container-auditor on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:21] PROBLEM - very high load average likely xfs on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:21] PROBLEM - MD RAID on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:22] PROBLEM - swift-account-auditor on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:27] I feel that ms-be1041 is not happy right now [11:33:31] PROBLEM - swift-container-replicator on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:31] PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:31] PROBLEM - swift-object-server on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:32] PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:32] PROBLEM - Check size of conntrack table on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:32] PROBLEM - configured eth on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:42] PROBLEM - dhclient process on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:33:42] PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:34:37] godog: ^^^ [11:35:00] I can ssh and puppet is disable [11:35:12] RECOVERY - Host cp3033 is UP: PING WARNING - Packet loss = 28%, RTA = 83.65 ms [11:35:12] PROBLEM - MariaDB Slave SQL: s6 on db2095 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table frwiki.change_tag: Duplicate entry 113280313-visualeditor for key ct_rc_id, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db2076-bin.000854, end_log_pos 350101254 [11:35:18] ah wait https://phabricator.wikimedia.org/T199198 [11:35:21] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 54 ESP OK [11:35:21] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 66 ESP OK [11:35:22] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 134 ESP OK [11:35:22] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 54 ESP OK [11:35:31] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 134 ESP OK [11:35:31] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 54 ESP OK [11:35:32] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 66 ESP OK [11:35:32] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 54 ESP OK [11:35:41] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 66 ESP OK [11:35:41] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 134 ESP OK [11:35:41] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 134 ESP OK [11:35:41] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 54 ESP OK [11:35:42] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 134 ESP OK [11:35:42] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 66 ESP OK [11:35:52] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 54 ESP OK [11:36:00] elukey: so just slow IO during the repair? is it still running? [11:36:01] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 54 ESP OK [11:36:01] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 134 ESP OK [11:36:02] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 54 ESP OK [11:36:11] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 66 ESP OK [11:36:11] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 66 ESP OK [11:36:11] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 66 ESP OK [11:36:11] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 66 ESP OK [11:36:31] PROBLEM - puppet last run on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:36:43] !log pmiazga@deploy1001 Synchronized php-1.32.0-wmf.12/includes/WebResponse.php: SWAT: [[gerrit:445957|WebReponse: Use values altered in WebResponseSetCookie hook (T198525)]] (duration: 00m 54s) [11:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:47] T198525: Can't log into mobile on Commons - https://phabricator.wikimedia.org/T198525 [11:37:24] volans: seems so, but let's wait for godog [11:37:47] Hello everybody, is it still possible to add a small patch to the current swat window ? :) https://gerrit.wikimedia.org/r/445929 [11:37:57] zeljkof, ^ [11:38:16] I was going to close the SWAT, I'm done [11:38:24] framawiki: if raynor is done, I can deploy [11:38:40] yes, zeljkof I'm done, please proceed [11:38:44] !log power cycle cp3033 - T199677 [11:38:44] thanks, i'll add it to the deploy page [11:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:47] T199677: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 [11:39:01] thank you for allowing me to deploy my patches, and sorry for the mistakes I made [11:39:22] PROBLEM - MegaRAID on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:44:11] PROBLEM - IPMI Sensor Status on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:44:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:46:02] framawiki: can you test 445929 at mwdebug1002? [11:46:30] (03PS3) 10Giuseppe Lavagetto: conftool: switch to install python 3 version by default [puppet] - 10https://gerrit.wikimedia.org/r/445970 [11:46:31] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1041 is CRITICAL: Return code of 255 is out of bounds [11:47:33] 10Operations, 10HHVM, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393 (10MoritzMuehlenhoff) [11:47:35] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10MoritzMuehlenhoff) [11:47:37] 10Operations, 10Patch-For-Review: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10MoritzMuehlenhoff) 05Open>03Resolved wasat has been reimaged with stretch and during the process renamed to mwmaint2001 (for consistency with mwmaint1001). [11:48:14] zeljkof: i can't get NS 110 at https://fr.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion=2 :( [11:48:35] framawiki: sorry, not there yet, asked in general [11:48:40] will be there in a minute [11:49:52] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10Vgutierrez) both kernel and server event log shows issues on DIMM B4: ``` 3 | 07/14/2018 | 17:49:17 | Memory ECC Uncorr Err | Uncorrectable ECC (UnCorrectable ECC | DIMMB4) | A... [11:49:55] (03CR) 10Zfilipin: Create Reconstruction NS at frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:50:03] (03PS3) 10Zfilipin: Create Reconstruction NS at frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:50:21] (03CR) 10Zfilipin: [C: 032] Create Reconstruction NS at frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:50:24] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: switch to install python 3 version by default [puppet] - 10https://gerrit.wikimedia.org/r/445970 (owner: 10Giuseppe Lavagetto) [11:50:51] framawiki: argh, did not notice merge conflict, it did not merge after +2, rebased now and merging [11:51:39] (03Merged) 10jenkins-bot: Create Reconstruction NS at frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:51:53] oh sorry, missed it too [11:51:54] (03CR) 10jenkins-bot: Create Reconstruction NS at frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445929 (https://phabricator.wikimedia.org/T199631) (owner: 10Framawiki) [11:52:29] 10Operations, 10ops-esams, 10Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (10Vgutierrez) After a power cycle the server it's behaving properly. Since it was already depooled I'm not repooling it [11:53:35] framawiki: ok, it's at mwdebug1002 [11:53:52] zeljkof: it's ok! [11:54:00] framawiki: ok, deploying [11:54:19] raynor, zeljkof: sorry about that. my young one had trapped wind and i had to get him down for a nap [11:54:45] trapped wind? [11:55:03] ah, google knows [11:55:05] :) [11:55:05] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445929|Create Reconstruction NS at frwikt (T199631)]] (duration: 00m 49s) [11:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:09] T199631: Request for Reconstruction namespace at frwikt - https://phabricator.wikimedia.org/T199631 [11:55:30] framawiki: it's deployed, please check and thanks for deploying with #releng :) [11:55:40] phuedx: no problem, raynor did the deployment [11:55:46] 👍 [11:55:57] zeljkof: 👍 too ! [11:56:40] !log EU SWAT finished [11:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:05] !log rebooting multatuli for microcode tests [12:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:25:59] hi, am I at the right place to ask for removing a MW installation for a deprecated server? Corresponding task: https://phabricator.wikimedia.org/T166012 [12:26:49] Is it hosted at labs? If so, #wikimedia-cloud [12:27:33] Might be easier to just tag the task.. But you can probably delete instances yourself [12:31:17] Q: Which is location for cron script to put in our Puppet code? ie for ContentTranslation. [12:31:37] !log rebooting mw2136 for microcode tests [12:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:38] thanks Reedy [12:47:02] (03PS1) 10Muehlenhoff: Disable Diamond on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/445988 [12:48:06] (03PS3) 10Muehlenhoff: Puppetise script to add firmware to netinst image [puppet] - 10https://gerrit.wikimedia.org/r/445972 (https://phabricator.wikimedia.org/T198327) [12:52:33] (03CR) 10Muehlenhoff: [C: 032] Puppetise script to add firmware to netinst image [puppet] - 10https://gerrit.wikimedia.org/r/445972 (https://phabricator.wikimedia.org/T198327) (owner: 10Muehlenhoff) [12:52:41] (03PS2) 10Muehlenhoff: Disable Diamond on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/445988 [12:56:51] ah yeah thanks volans elukey, it was an expired downtime [12:57:14] ack, np [13:12:11] (03CR) 10Filippo Giunchedi: base: alert on filesystem available greater than filesystem size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) (owner: 10Filippo Giunchedi) [13:14:51] (03PS2) 10Filippo Giunchedi: base: alert on filesystem available greater than filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) [13:16:35] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:18:45] 10Operations, 10Cloud-Services: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402 (10Andrew) 05Open>03Resolved a:03Andrew We have, unfortunately, demonstrated that we can live for hours without this box without suffering anything serious. [13:19:23] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10MoritzMuehlenhoff) He needs to (digitally) sign the NDA, please write to Rachel Stallman and she'll prepare... [13:25:18] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) (owner: 10Filippo Giunchedi) [13:32:51] RECOVERY - swift-account-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:33:01] RECOVERY - swift-container-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:33:01] RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:33:01] RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:33:01] RECOVERY - Check size of conntrack table on ms-be1041 is OK: OK: nf_conntrack is 0 % full [13:33:02] RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:33:02] RECOVERY - configured eth on ms-be1041 is OK: OK - interfaces up [13:33:11] RECOVERY - dhclient process on ms-be1041 is OK: PROCS OK: 0 processes with command name dhclient [13:33:11] RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [13:33:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1041 is OK: OK ferm input default policy is set [13:33:31] RECOVERY - swift-container-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:33:31] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational [13:33:31] RECOVERY - DPKG on ms-be1041 is OK: All packages OK [13:33:32] RECOVERY - swift-container-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:33:32] RECOVERY - swift-account-reaper on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:33:41] RECOVERY - puppet last run on ms-be1041 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:33:41] RECOVERY - Disk space on ms-be1041 is OK: DISK OK [13:33:51] RECOVERY - swift-account-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:33:51] RECOVERY - swift-account-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:33:52] RECOVERY - very high load average likely xfs on ms-be1041 is OK: OK - load average: 13.66, 4.85, 3.03 [13:33:52] RECOVERY - swift-container-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:33:52] RECOVERY - MD RAID on ms-be1041 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:34:13] spammy icinga is spammy [13:34:31] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on db2095 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table frwiki.change_tag: Duplicate entry 113280313-visualeditor for key ct_rc_id, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db2076-bin.000854, end_log_pos 350101254 Marostegui checking [13:35:49] 10Operations: Use firmware-enriched Debian installation images - https://phabricator.wikimedia.org/T182699 (10MoritzMuehlenhoff) 05Open>03Invalid a:03MoritzMuehlenhoff I looked into this last week. Unfortunately that's not a viable option, only firmware-enriched ISO images are provided at this point. There... [13:40:22] RECOVERY - MegaRAID on ms-be1041 is OK: OK: optimal, 14 logical, 14 physical [13:44:31] RECOVERY - IPMI Sensor Status on ms-be1041 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [13:45:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445997 [13:46:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be1041 is OK: OK: synced at Mon 2018-07-16 13:46:46 UTC. [13:48:11] 10Operations: Integrate Stretch 9.5 point release - https://phabricator.wikimedia.org/T199670 (10MoritzMuehlenhoff) I've verified that none of the packages removed in 9.5 are present in our environment. [13:48:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445997 (owner: 10Marostegui) [13:50:03] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445997 (owner: 10Marostegui) [13:50:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445997 (owner: 10Marostegui) [13:51:28] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2076 for maintenance (duration: 00m 50s) [13:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:41] RECOVERY - MariaDB Slave SQL: s6 on db2095 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:56:19] 10Operations, 10ops-esams, 10Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (10Vgutierrez) p:05Triage>03Normal [13:59:26] !log Stop replication on db2076 (db2095's master) [13:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:02] win 10 [14:04:05] argh [14:04:11] win! [14:06:20] (03PS1) 10Andrew Bogott: labtestn: install designate on labtestservices2002 [puppet] - 10https://gerrit.wikimedia.org/r/446004 [14:08:50] (03PS1) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [14:09:03] (03PS2) 10Andrew Bogott: labtestn: install designate on labtestservices2002 [puppet] - 10https://gerrit.wikimedia.org/r/446004 [14:09:48] (03PS2) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [14:12:57] (03PS2) 10ArielGlenn: do 8 jobs in parallel for wikidata weeklies [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) [14:14:02] (03CR) 10Hoo man: [C: 031] do 8 jobs in parallel for wikidata weeklies [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) (owner: 10ArielGlenn) [14:15:27] (03PS1) 10Vgutierrez: get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 [14:16:13] (03CR) 10jerkins-bot: [V: 04-1] get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 (owner: 10Vgutierrez) [14:16:33] (03CR) 10Muehlenhoff: [C: 031] base: alert on filesystem available greater than filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) (owner: 10Filippo Giunchedi) [14:19:11] (03PS5) 10Anomie: wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [14:19:25] (03CR) 10Anomie: [C: 032] "Deploying configuration change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [14:19:27] (03CR) 10Marostegui: "What's the reason to get it increase?" [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) (owner: 10ArielGlenn) [14:20:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db2076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446010 [14:20:48] (03Merged) 10jenkins-bot: wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [14:21:00] (03CR) 10jenkins-bot: wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [14:22:44] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Explicitly set wgMultiContentRevisionSchemaMigrationStage to current default (T174044) (duration: 00m 50s) [14:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:48] T174044: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044 [14:23:17] anomie: are you done with the deploy? I'd like to deploy db-codfw [14:23:40] marostegui: I have to sync InitialiseSettings-labs.php, then I'm done. [14:23:59] Sorry for getting in your way [14:24:01] anomie: great, ping me when done :) [14:24:04] no, not urgent at all [14:24:12] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Sync labs config file, no prod impact (duration: 00m 49s) [14:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] marostegui: done [14:24:28] thanks! [14:24:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446010 (owner: 10Marostegui) [14:25:55] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446010 (owner: 10Marostegui) [14:27:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2076 (duration: 00m 49s) [14:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] (03CR) 10Filippo Giunchedi: [C: 032] base: alert on filesystem available greater than filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) (owner: 10Filippo Giunchedi) [14:29:29] (03PS3) 10Filippo Giunchedi: base: alert on filesystem available greater than filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/445964 (https://phabricator.wikimedia.org/T199436) [14:30:07] (03CR) 10Hoo man: [C: 031] "> What's the reason to get it increase?" [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) (owner: 10ArielGlenn) [14:30:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446010 (owner: 10Marostegui) [14:33:20] (03CR) 10Muehlenhoff: dbtree: move dbtree outside of mwmaint hosts (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [14:35:27] (03CR) 10Marostegui: [C: 031] "let's keep an eye on the server though for the upcoming days just in case: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc" [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) (owner: 10ArielGlenn) [14:35:31] (03PS2) 10Vgutierrez: get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 [14:36:12] (03CR) 10jerkins-bot: [V: 04-1] get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 (owner: 10Vgutierrez) [14:37:21] !log Remove unused grants from es2019 [14:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:25] !log Remove unused grants from labsdb1004 and labsdb1005 [14:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:16] (03PS3) 10Vgutierrez: get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 [14:43:26] (03CR) 10ArielGlenn: "Then I'll merge this at the end of this week's run (Wed or Thurs)." [puppet] - 10https://gerrit.wikimedia.org/r/432368 (https://phabricator.wikimedia.org/T181936) (owner: 10ArielGlenn) [14:47:45] !log Remove unused grants from db1073 [14:47:50] (03PS3) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [14:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] !log Drop unused ceilometer database from db1073 [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:31] !log Change expire_log_days on db1067 - https://phabricator.wikimedia.org/T197069 [14:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:47] !log installing openssh updates from stretch point release [14:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:04] (03CR) 10Giuseppe Lavagetto: mediawiki: add vhost define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:00:05] (03PS3) 10Andrew Bogott: labtestn: install designate on labtestservices2002 [puppet] - 10https://gerrit.wikimedia.org/r/446004 [15:00:53] (03CR) 10Andrew Bogott: [C: 032] labtestn: install designate on labtestservices2002 [puppet] - 10https://gerrit.wikimedia.org/r/446004 (owner: 10Andrew Bogott) [15:03:56] (03PS1) 10Muehlenhoff: Remove expiry date for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/446035 [15:05:11] (03PS1) 10Andrew Bogott: passwords: remove dummy passwords for openstack ceilometer [labs/private] - 10https://gerrit.wikimedia.org/r/446037 [15:06:04] (03CR) 10Muehlenhoff: [C: 032] Remove expiry date for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/446035 (owner: 10Muehlenhoff) [15:06:53] (03PS1) 10Andrew Bogott: backups: don't dump the ceilometer database [puppet] - 10https://gerrit.wikimedia.org/r/446040 (https://phabricator.wikimedia.org/T199114) [15:07:14] (03CR) 10Andrew Bogott: [V: 032 C: 032] passwords: remove dummy passwords for openstack ceilometer [labs/private] - 10https://gerrit.wikimedia.org/r/446037 (owner: 10Andrew Bogott) [15:07:29] (03CR) 10Marostegui: [C: 031] backups: don't dump the ceilometer database [puppet] - 10https://gerrit.wikimedia.org/r/446040 (https://phabricator.wikimedia.org/T199114) (owner: 10Andrew Bogott) [15:08:35] (03CR) 10Andrew Bogott: [C: 032] backups: don't dump the ceilometer database [puppet] - 10https://gerrit.wikimedia.org/r/446040 (https://phabricator.wikimedia.org/T199114) (owner: 10Andrew Bogott) [15:11:16] (03PS4) 10Marostegui: mariadb: Promote db1067 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069) [15:11:25] (03PS3) 10Marostegui: db-eqiad.php: Set up s1 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445369 (https://phabricator.wikimedia.org/T197069) [15:11:36] (03PS2) 10Marostegui: db-eqiad.php: Promote db1067 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445371 (https://phabricator.wikimedia.org/T197069) [15:15:05] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:15:05] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=eqiad Filippo Giunchedi known T199198 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [15:18:20] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be1040 is CRITICAL: cluster=swift device={/dev/sde1,/dev/sdh1,/dev/sdi1,/dev/sdj1,/dev/sdl1,/dev/sdn1} fstype=xfs instance=ms-be1040:9100 job=node mountpoint={/srv/swift-storage/sde1,/srv/swift-storage/sdh1,/srv/swift-storage/sdi1,/srv/swift-storage/sdj1,/srv/swift-storage/sdl1,/srv/swift-storage/sdn1} site=eqiad Filippo Giunchedi known T199198 https:/ [15:18:20] org/dashboard/db/host-overview?orgId=1&var-server=ms-be1040&var-datasource=eqiad%2520prometheus%252Fops [15:18:20] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be1042 is CRITICAL: cluster=swift device={/dev/sdg1,/dev/sdj1,/dev/sdk1,/dev/sdl1} fstype=xfs instance=ms-be1042:9100 job=node mountpoint={/srv/swift-storage/sdg1,/srv/swift-storage/sdj1,/srv/swift-storage/sdk1,/srv/swift-storage/sdl1} site=eqiad Filippo Giunchedi known T199198 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server [15:18:20] source=eqiad%2520prometheus%252Fops [15:24:46] Hi, images are not working on https://en.wikipedia.org/wiki/Eschatology (was given the link in #wikipedia-en). the image either loads half (and other half is grey) or it just shows the missing image icon. [15:25:05] (03PS1) 10Andrew Bogott: labtestn: don't include 'cloudrepo' on services box [puppet] - 10https://gerrit.wikimedia.org/r/446048 [15:25:17] godog: --^ [15:25:53] paladox: which image exactly? [15:26:10] the one on the right (at the top) [15:26:11] (of the article) [15:26:13] fwiw the images on that page load for me [15:26:20] i see this: https://phabricator.wikimedia.org/F23797963 [15:26:24] Ditto, WFM [15:26:30] but when i click on the actual image it works [15:26:34] Refresh the page? [15:26:44] (03CR) 10Andrew Bogott: [C: 032] labtestn: don't include 'cloudrepo' on services box [puppet] - 10https://gerrit.wikimedia.org/r/446048 (owner: 10Andrew Bogott) [15:27:13] and https://phabricator.wikimedia.org/F23797973 [15:27:16] Reedy i did [15:27:22] a couple of times but didn't work. [15:27:28] Purge it? Null edit it? [15:27:32] Check your browser console? [15:28:35] hmm works now, strange. I refereshed it a few times before reporting it to make sure. But as soon as you said try refreshing the page it worked. [15:29:00] Reedy is a magician :-) [15:29:24] jokes aside, I have seen that happening when local disk cache corrupts on some browsers [15:30:05] Certainly many things can cause problems [15:30:20] Usually worth a purge/null edit before reporting problems like this [15:30:39] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10ayounsi) [15:30:41] 10Operations, 10ops-eqiad, 10Traffic: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ayounsi) [15:33:42] 10Operations, 10Traffic, 10Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Vgutierrez) [15:33:57] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10BBlack) p:05Normal>03High Turning priority to "high" for this and the 5006 ticket, as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes. [15:34:24] 10Operations, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Vgutierrez) [15:34:28] 10Operations, 10Traffic, 10Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Vgutierrez) [15:35:28] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10BBlack) p:05Normal>03High Turning priority to "high" for this and the 5001 ticket ( T199675 ), as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes. [15:36:07] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) I put in the self dispatch last week, but have not gotten a reply on it. I'll fall back to simply calling into technical support daily until this gets a resolution. [15:44:09] 10Operations, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Vgutierrez) p:05Triage>03Normal [15:45:29] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10Marostegui) @ayounsi regarding databases All these are passive, so **no** special care is needed dbproxy1004 dbproxy1005 dbproxy1006 db1072 -> m3 m... [15:49:16] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be1043 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be1043:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=eqiad Filippo Giunchedi known T199198 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [15:54:54] 10Operations, 10Traffic: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10BBlack) [15:54:56] (03PS1) 10Ladsgroup: labs: disable UI of ORES in enwiki to test it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446057 (https://phabricator.wikimedia.org/T198358) [15:55:29] 10Operations, 10Traffic: Deploy initial ATS test clusters in core DCs - https://phabricator.wikimedia.org/T199720 (10BBlack) [15:55:31] 10Operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853 (10BBlack) [15:55:46] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) [15:55:48] 10Operations, 10monitoring, 10Patch-For-Review: Alert on negative disk space available - https://phabricator.wikimedia.org/T199436 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is done and working, in case the parent task's issue comes up again. [15:56:39] (03CR) 10Ladsgroup: [C: 032] labs: disable UI of ORES in enwiki to test it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446057 (https://phabricator.wikimedia.org/T198358) (owner: 10Ladsgroup) [15:58:26] (03Merged) 10jenkins-bot: labs: disable UI of ORES in enwiki to test it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446057 (https://phabricator.wikimedia.org/T198358) (owner: 10Ladsgroup) [15:58:39] (03CR) 10jenkins-bot: labs: disable UI of ORES in enwiki to test it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446057 (https://phabricator.wikimedia.org/T198358) (owner: 10Ladsgroup) [15:59:18] ^ rebased on deploy1001 [16:04:23] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) We first need to make Citoid and v2 of the translation server work together locally, then in Beta, and only then can we talk about... [16:05:32] 10Operations, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Vgutierrez) From https://letsencrypt.org/docs/client-options/, another interesting option could be free_tls_certificates library. It's a high-level library based on python3-acme, on an initia... [16:09:48] (03PS6) 10Nuria: Adding acomputed measure of ratio of bot requests on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 [16:15:35] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: disable labtestnet2001 and replace it with labtestnet2003 [puppet] - 10https://gerrit.wikimedia.org/r/446059 (https://phabricator.wikimedia.org/T196752) [16:49:38] 10Operations, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Krenair) Need to ensure that whatever we pick has the ability to be extended in terms of how challenges are done. I.e. we'll want to be able to have http-01 write to files, and dns-01 either... [17:00:04] gehel: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T1700). [17:00:53] 10Operations, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Krenair) >>! In T199717#4428126, @Vgutierrez wrote: > From https://letsencrypt.org/docs/client-options/, another interesting option could be free_tls_certificates library. It's a high-level l... [17:07:50] (03CR) 10Andrew Bogott: [C: 031] "Switching 2002 to active and 2003 to standby seems good to me -- it makes the numbering slightly less confusing :)" [puppet] - 10https://gerrit.wikimedia.org/r/446059 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [17:09:29] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: disable labtestnet2001 and replace it with labtestnet2003 [puppet] - 10https://gerrit.wikimedia.org/r/446059 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [17:19:02] (03PS1) 10RobH: adding wikimedia.org validation txt entry for globalsign [dns] - 10https://gerrit.wikimedia.org/r/446066 (https://phabricator.wikimedia.org/T197840) [17:21:15] (03CR) 10BBlack: [C: 031] adding wikimedia.org validation txt entry for globalsign [dns] - 10https://gerrit.wikimedia.org/r/446066 (https://phabricator.wikimedia.org/T197840) (owner: 10RobH) [17:21:35] (03CR) 10RobH: [C: 032] adding wikimedia.org validation txt entry for globalsign [dns] - 10https://gerrit.wikimedia.org/r/446066 (https://phabricator.wikimedia.org/T197840) (owner: 10RobH) [17:26:21] (03PS3) 10Anomie: MCR Enable MCR write-both mode on commons beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442918 (https://phabricator.wikimedia.org/T197818) (owner: 10Daniel Kinzler) [17:34:22] (03PS1) 10Elukey: Add global git http[s].proxy config for thorium and an1003 [puppet] - 10https://gerrit.wikimedia.org/r/446067 (https://phabricator.wikimedia.org/T198623) [17:34:24] (03CR) 10Smalyshev: [C: 031] Enable fetching constraints for Updater [puppet] - 10https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:35:48] (03PS1) 10Ladsgroup: Revert "labs: disable UI of ORES in enwiki to test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446068 [17:35:55] (03CR) 10Ladsgroup: [C: 032] Revert "labs: disable UI of ORES in enwiki to test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446068 (owner: 10Ladsgroup) [17:37:11] (03Merged) 10jenkins-bot: Revert "labs: disable UI of ORES in enwiki to test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446068 (owner: 10Ladsgroup) [17:37:13] (03CR) 10Elukey: [C: 032] Add global git http[s].proxy config for thorium and an1003 [puppet] - 10https://gerrit.wikimedia.org/r/446067 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [17:38:01] (03CR) 10jenkins-bot: Revert "labs: disable UI of ORES in enwiki to test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446068 (owner: 10Ladsgroup) [17:45:15] PROBLEM - swift-object-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:45:15] PROBLEM - swift-container-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:45:15] PROBLEM - swift-object-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:45:15] PROBLEM - swift-object-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:45:35] PROBLEM - swift-object-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:45:36] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:45:45] PROBLEM - swift-container-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:45:55] PROBLEM - swift-account-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:45:56] PROBLEM - swift-account-reaper on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:45:56] PROBLEM - swift-container-updater on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:45:56] PROBLEM - swift-account-server on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:46:06] PROBLEM - swift-account-replicator on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:46:06] PROBLEM - swift-container-auditor on ms-be1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:50:12] hmm [17:52:36] herron: see T199198 [17:52:37] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [17:52:47] although I was not expecting swift to die honestly [17:52:57] filippo was running an xfs repair there [17:52:57] puppet is disabled on ms-be1041 which makes me think it’s depooled [17:52:59] ah ok [17:53:12] thanks [17:54:09] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: labtest: missing allowed connection [puppet] - 10https://gerrit.wikimedia.org/r/446069 (https://phabricator.wikimedia.org/T196752) [17:54:49] (03PS2) 10Arturo Borrero Gonzalez: cloud vps: labtest: missing allowed connection [puppet] - 10https://gerrit.wikimedia.org/r/446069 (https://phabricator.wikimedia.org/T196752) [17:55:20] herron: is the repair still running? (just in case you were having a look) [17:55:34] I have a shell open let’s see [17:55:41] yes still running [17:55:51] and swift failed? [17:55:55] 10Puppet, 10Cloud-Services, 10Toolforge, 10Goal: Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Bstorm) I believe T199276#4420812 was possibly due to NFS mount happening after package installation (as long as the setup from the package runs when puppet installs it, which I haven'... [17:56:20] because before we had nrpe check failing because of io, not swift explicitely failing IIRC [17:58:38] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11798/labtestnet2002.codfw.wmnet/ catalog compiler happy" [puppet] - 10https://gerrit.wikimedia.org/r/446069 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [17:58:49] not sure if it died or was stopped, but I see SIGTERM in the log [17:58:58] for example Jul 16 15:45:31 ms-be1041 object-server: SIGTERM received [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T1800). [18:00:04] stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:31] Hi [18:02:26] Hi. I can SWAT your patch stephanebisson. [18:02:31] (03PS1) 10Jgreen: add frmon1001 and frmon service alias, remove tellurium [dns] - 10https://gerrit.wikimedia.org/r/446070 [18:03:01] Hi Niharika, that would be great [18:03:17] (03PS3) 10Niharika29: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [18:03:28] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [18:05:13] (03Merged) 10jenkins-bot: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [18:05:46] herron: ack, so might actually just be expired downtime if swift was stopped a while ago [18:05:53] I can check the icinga logs in a bit [18:06:12] alright, yeah puppet is stopped as well so I assumed as much [18:06:18] stephanebisson: It's on mwdebug1002. [18:06:32] Niharika: ok, testing now [18:06:45] (03CR) 10Jgreen: [C: 032] add frmon1001 and frmon service alias, remove tellurium [dns] - 10https://gerrit.wikimedia.org/r/446070 (owner: 10Jgreen) [18:06:46] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [18:08:02] (03CR) 10jenkins-bot: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [18:08:11] Niharika: works as expected [18:08:24] Alrighty. Syncing it out. [18:10:20] !log niharika29@deploy1001 Synchronized wmf-config/: Rollout Watchlist Structured Filters to all wikis T181193 (duration: 00m 51s) [18:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:24] T181193: [EPIC] Graduate the New Filters UX on Watchlist out of beta on all wikis - https://phabricator.wikimedia.org/T181193 [18:10:34] stephanebisson: Synced. Going to run script now. [18:10:50] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10Jgreen) DNS is done! ;; ANSWER SECTION: frmon.wikimedia.org. 3600 IN CNAME frmon-eqiad.wikimedia.org. frmon-eqiad.wikimedia.org. 3600 IN A 208.80.155.9 [18:10:55] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [18:11:25] https://www.irccloud.com/pastebin/ztX25ZOR/uh-oh [18:11:37] stephanebisson: I don't believe that's the expected output. ^ [18:13:17] Niharika: certainly not... [18:13:22] RoanKattouw: ^ [18:13:40] stephanebisson: What's the script supposed to do? [18:14:23] Niharika: set a user preference based on the value of another one [18:14:50] It looks like it's looping on some random string instead of the dblist file [18:14:53] stephanebisson: Okay. Any way to check if that successfully worked? [18:15:16] Looking [18:15:34] It did not work [18:16:23] Hmm I guess foreachwikiindblist doesn't allow absolute paths [18:16:57] You may have to give it a path relative to /srv/mediawiki/dblists, or copy my file to that directory [18:17:04] ( Niharika ) [18:17:24] RoanKattouw: Alright. [18:18:58] (03PS1) 10RobH: Revert "adding wikimedia.org validation txt entry for globalsign" [dns] - 10https://gerrit.wikimedia.org/r/446072 [18:19:08] (03PS2) 10RobH: Revert "adding wikimedia.org validation txt entry for globalsign" [dns] - 10https://gerrit.wikimedia.org/r/446072 [18:19:24] (03CR) 10RobH: [C: 032] Revert "adding wikimedia.org validation txt entry for globalsign" [dns] - 10https://gerrit.wikimedia.org/r/446072 (owner: 10RobH) [18:22:18] 10Puppet, 10Toolforge, 10Goal: Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Bstorm) [18:25:53] RoanKattouw: I don't have permission to create or edit files in that directory and the relative path doesn't work. Same error. [18:26:06] stephanebisson: ^ [18:26:35] Niharika: sudo -u mwdeploy cp [18:26:44] Sorry forgot I had to do that [18:28:11] RoanKattouw: That worked. It's running now. [18:33:53] stephanebisson: It's finished running now. [18:35:26] Niharika: great, thank you [18:39:29] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={create_container,run_podsandbox,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:40:30] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:42:50] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:45:00] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:55:08] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10herron) Sure, we can repurpose this task. But let's update the title and description to reflect the desired outcome. [19:27:15] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481 (10thcipriani) [19:32:35] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10ayounsi) NAT created: ```lang=diff [edit security nat static rule-set static-nat] rule frbast1001 { ... } + rule frmon1001 { + match { + de... [19:34:52] 10Operations, 10Traffic: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (10Varnent) @BBlack - excellent - thank you!! [19:57:11] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10thcipriani) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T2000). [20:02:33] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@fcae441]: Update mobileapps to bed7b29 (T174809) [20:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:36] T174809: Add swagger spec for content-html - https://phabricator.wikimedia.org/T174809 [20:11:34] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@fcae441]: Update mobileapps to bed7b29 (T174809) (duration: 09m 01s) [20:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:38] T174809: Add swagger spec for content-html - https://phabricator.wikimedia.org/T174809 [20:29:45] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi) >>! In T184293#4415745, @mark wrote: > # On asw2-d-eqiad, xe-2/0/4 is part of the "access-ports" group which sets a high MTU, whereas it doesn't seem to be on t... [21:00:04] bawolff and Reedy: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T2100). [21:44:59] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out be [21:44:59] received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [21:44:59] PROBLEM - apertium apy on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:19] PROBLEM - eventstreams on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:29] PROBLEM - Check systemd state on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:45:30] PROBLEM - SSH on scb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:59] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [21:46:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [21:46:19] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.097 second response time [21:46:29] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [21:46:30] RECOVERY - SSH on scb2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [22:05:05] 10Operations, 10ChangeProp, 10Services (designing), 10Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (10Nuria) Let's please make sure this happens this quarter cc-ing @mobrovac and @Fjalapeno for visibility [22:37:29] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:46:39] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180716T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:55:02] (03CR) 10Alex Monk: [C: 032] "Sorry, would have waited for you to take a look if I knew you were going to get involved." (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [23:57:19] (03CR) 10Alex Monk: [C: 032] "might be overkill but sure if you want" [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 (owner: 10Vgutierrez) [23:58:10] (03Merged) 10jenkins-bot: get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 (owner: 10Vgutierrez) [23:58:58] (03CR) 10jenkins-bot: get rid of /etc/certcentral being hardcoded everywhere [software/certcentral] - 10https://gerrit.wikimedia.org/r/446009 (owner: 10Vgutierrez)