[01:12:19] (03PS1) 10Hoo man: Add checksums for Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/423353 (https://phabricator.wikimedia.org/T190457) [02:50:46] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.26) (duration: 12m 53s) [02:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:00] (03PS1) 10Madhuvishy: dumps: Add labstore1007 to list of hosts for rolling rsync [puppet] - 10https://gerrit.wikimedia.org/r/423354 (https://phabricator.wikimedia.org/T171541) [03:30:43] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 855.77 seconds [03:43:22] (03PS1) 10Madhuvishy: labstore monitoring: Move throughput thresholds to class params [puppet] - 10https://gerrit.wikimedia.org/r/423355 [03:47:21] (03CR) 10Madhuvishy: [C: 032] labstore monitoring: Move throughput thresholds to class params [puppet] - 10https://gerrit.wikimedia.org/r/423355 (owner: 10Madhuvishy) [03:55:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.09 seconds [03:58:08] (03PS1) 10Madhuvishy: dumps: Adjust network saturation thresholds for 10G interface [puppet] - 10https://gerrit.wikimedia.org/r/423359 (https://phabricator.wikimedia.org/T168486) [04:00:56] (03CR) 10Madhuvishy: [C: 032] dumps: Adjust network saturation thresholds for 10G interface [puppet] - 10https://gerrit.wikimedia.org/r/423359 (https://phabricator.wikimedia.org/T168486) (owner: 10Madhuvishy) [04:23:55] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:06] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:16] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:26] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled [04:25:35] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1001.eqiad.wmnet are marked down but pooled [04:27:48] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:47] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.016 second response time [04:52:17] (03PS1) 10Madhuvishy: Revert "dumps: Set xmldumps server as localhost for testing" [puppet] - 10https://gerrit.wikimedia.org/r/423362 [04:52:23] (03PS2) 10Madhuvishy: Revert "dumps: Set xmldumps server as localhost for testing" [puppet] - 10https://gerrit.wikimedia.org/r/423362 [04:54:07] (03PS3) 10Madhuvishy: Revert "dumps: Set xmldumps server as localhost for testing" [puppet] - 10https://gerrit.wikimedia.org/r/423362 (https://phabricator.wikimedia.org/T188641) [04:55:47] (03CR) 10Madhuvishy: [C: 032] Revert "dumps: Set xmldumps server as localhost for testing" [puppet] - 10https://gerrit.wikimedia.org/r/423362 (https://phabricator.wikimedia.org/T188641) (owner: 10Madhuvishy) [05:18:03] 10Operations, 10Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096384 (10ArielGlenn) Well. The ls errors seem to be consistent, probably emitted at the conclusion of some pipeline in the bash script. They happen even when nothing else is going on, on the host, so... [05:31:56] (03CR) 10Marostegui: "> Is thjs still relevant or should it be abandoned?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/399792 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [05:47:46] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:48:45] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.028 second response time [05:49:53] 10Operations, 10Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096402 (10ArielGlenn) ``` root@dumpsdata1001:/data/xmldatadumps/private/wikidatawiki/20180401# zcat /data/xmldatadumps/public/wikidatawiki/20180401/wikidatawiki-20180401-stub-meta-history27.xml.gz | cat... [06:06:31] !log Drop localisation table from the hosts where it still existed - T119811 [06:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:38] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811 [06:09:06] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:10:57] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [06:29:46] PROBLEM - MariaDB Slave Lag: s7 on db1069 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 404.65 seconds [06:30:04] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 399.38 seconds [06:30:04] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 400.33 seconds [06:32:23] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 433.64 seconds [06:32:24] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 458.39 seconds [06:32:33] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 460.75 seconds [06:32:36] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:32:43] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 451.06 seconds [06:32:53] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.35 seconds [06:32:53] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.75 seconds [06:34:23] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 744.28 seconds [06:34:26] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [06:37:29] db1069 is vslow, why do we get paged for that? 
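A quick manual spot check for the kind of replication lag reported in the alerts above — a minimal sketch only; the host name (db1069) comes from the log, and the production Icinga check may compute lag differently (e.g. from a heartbeat table) rather than from SHOW SLAVE STATUS:
```
# Minimal sketch: confirm replication lag by hand on a replica named in the alerts above.
mysql -h db1069.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'

# Re-run periodically to watch the replica catch up once the offending
# statement (here, a table drop) has finished replicating.
watch -n 30 "mysql -h db1069.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master"
```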
[06:38:53] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 40.37 seconds [06:39:03] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 36.47 seconds [06:39:03] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 36.19 seconds [06:39:13] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 16.03 seconds [06:39:14] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 12.29 seconds [06:39:33] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:39:33] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.12 seconds [06:39:43] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:39:56] RECOVERY - MariaDB Slave Lag: s7 on db1069 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:44:46] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:45:22] That is very weird it was caused by a drop table probably, but the lag only affected codfw [06:45:26] and it was an empty table [06:45:37] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [06:45:57] well huh, 1069 did whine at the same time as the rest [06:46:41] yeah, I guess because it is the slower one (old hw) [06:46:51] and the rest of the slaves are SSDs ones [06:47:00] ic [06:47:50] Is pdfrender okay? I'm being paged repeatedly. [06:47:56] <_joe_> has anyone looked at pdfrenderer? it farted twice this morning already (and I'm supposedly off and unavaliable) [06:48:04] <_joe_> madhuvishy: I'd assume not [06:48:06] <_joe_> :) [06:48:36] <_joe_> icinga says it's down on scb1001/2/4 [06:49:25] I am not getting any pages at all :| [06:55:35] I'm on those hosts looking, it claims to be running though [06:56:06] they've only been running 20 hours at most [06:56:12] might restart one anyways and see what it does [06:56:20] it is swapping heavily, though [06:57:41] I dislike the 'restart to fix' option, especially if it's been running for less than a day [06:58:24] while the service claims to be up, there are no jobs logged on scb1001 since two hours, though [06:58:42] welp. 
I'll do that one and wait a minute [06:58:46] on /srv/log/pdfrender/syslog.log the last entry is from 04:22 [06:59:33] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [06:59:34] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [06:59:43] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [06:59:54] now jobs are being logged again [07:00:02] doing scb1002 [07:00:20] wonder if it's worth looking in detail at scb1004 [07:00:44] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time [07:01:26] !log restarted pdfrender on scb1001,2, service paged and no jobs were being processed [07:01:29] (03PS1) 10Marostegui: reimport_from_master.sh: Add --skip-ssl [software] - 10https://gerrit.wikimedia.org/r/423367 (https://phabricator.wikimedia.org/T191186) [07:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:25] 1001,1002 and 1004 all stopped processing around 04:20 +/- a few minutes [07:02:38] (03CR) 10Jcrespo: [C: 04-1] "It should work as is, if full domains are used." [software] - 10https://gerrit.wikimedia.org/r/423367 (https://phabricator.wikimedia.org/T191186) (owner: 10Marostegui) [07:03:15] (03Abandoned) 10Marostegui: reimport_from_master.sh: Add --skip-ssl [software] - 10https://gerrit.wikimedia.org/r/423367 (https://phabricator.wikimedia.org/T191186) (owner: 10Marostegui) [07:03:28] but 1003 worked fine, maybe they choked on a rendering job [07:03:41] logs have nothing useful, it logs every job to syslog? seriously? sigh [07:03:49] anyways nothing helpful there [07:04:11] yeah, the logging is unimpressive [07:04:47] 90% usage for / [07:05:58] sure no easy way to tell by looking at the processes [07:06:56] I am going to clear the apt cache there [07:07:04] which host? [07:07:12] scb1001 [07:07:24] oh, I'm on 1004 and it's ok there [07:07:27] but sure [07:07:28] 9GB for /, 2 used for downloaded packages [07:07:32] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094875 (10Marostegui) >>! In T191129#4094901, @Volans wrote: > I've agreed with @RobH on IRC that this is not UBN for now for the #dba part. > > Although assessing the situation... [07:07:33] root is just 9G and it has lots of unused kernels [07:08:20] but the services all operate/log on /srv which has plenty of space, I think that's not a reason for concern [07:09:15] well, it was at 90% usage, it will get full soon and page [07:10:24] well some things go to syslog too from pdfrender but not as much [07:10:59] you mean on other hosts? [07:12:52] I mean there's slightly more entries in /srv/log/pdfrender/syslog.log than in /var/log/syslog from pdfrender (on, say, scb1004) [07:12:58] but they do go to both places [07:13:00] mostly [07:14:05] probably that should be a ticket. but in the meantime, ... I'm inclined to buy the 'stuck on a bad render' story because what else breaks 3/4 of them at the same time [07:14:21] I just wish we knew how to time out a bad render [07:15:03] no bright ideas left except to restart on 1004, if someone's looking around there I'll wait though [07:16:12] moritzm: ^^ ?
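A rough sketch of the triage steps described above: check when pdfrender last logged a job and whether the unit still claims to be running, then bounce it. The unit name, log path and port are taken from the conversation and alerts (pdfrender_5252); the exact commands are illustrative, not the team's documented procedure:
```
tail -n 1 /srv/log/pdfrender/syslog.log        # last processed job (04:22 in this incident)
systemctl status pdfrender --no-pager          # may report active even while wedged

sudo systemctl restart pdfrender
curl -sI http://localhost:5252/ | head -n 1    # 5252 per the pdfrender_5252 PyBal pool above
```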
[07:19:52] yeah, I'd say let's restart 1004 and make a ticket for investigation by Services which render job broke Electron so that this can be fixed [07:20:44] clearly needs some better restart detection, I also don't get why this didn't page earlier given that 3/4 servers were apparently defunct for several hrs until it paged [07:21:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [07:21:57] !log restarted pdfrender on scb1004 after poking around there a bit [07:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:11] I'll open a couple tickets then [07:22:45] https://phabricator.wikimedia.org/T174916 or not [07:22:57] since the 31, our pages seem to be 3 times slower [07:23:16] https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts?orgId=1&from=1522453494015&to=1522499092370 [07:27:14] I think the pages were sent on time [07:27:46] I was probably asleep for the first round, I was awakened very early this morning by church bells, tried to do some work and then get a bit more sleep [07:28:33] 10Operations, 10Electron-PDFs, 10Readers-Web-Backlog (Tracking), 10Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916#3576939 (10ArielGlenn) Restarted today on scb1001,2,4, best guess is a bad render because they all went out within the same few minutes, i.e. around 4:... [07:32:10] 10Operations, 10Electron-PDFs, 10Services: pdfrender logs to /var/log/syslog as well as to /srv/log/pdfrender - https://phabricator.wikimedia.org/T191191#4096520 (10ArielGlenn) p:05Triage>03Normal [07:32:15] (03Abandoned) 10Jcrespo: mariadb: Decommissioning proposal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399792 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [07:32:43] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 103.77 seconds [07:35:31] (03PS2) 10ArielGlenn: dumps: Add labstore1007 to list of hosts for rolling rsync [puppet] - 10https://gerrit.wikimedia.org/r/423354 (https://phabricator.wikimedia.org/T171541) (owner: 10Madhuvishy) [07:36:39] (03CR) 10ArielGlenn: [C: 032] dumps: Add labstore1007 to list of hosts for rolling rsync [puppet] - 10https://gerrit.wikimedia.org/r/423354 (https://phabricator.wikimedia.org/T171541) (owner: 10Madhuvishy) [07:50:14] (03CR) 10ArielGlenn: Add checksums for Wikidata entity dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423353 (https://phabricator.wikimedia.org/T190457) (owner: 10Hoo man) [07:59:06] 10Operations, 10Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096564 (10ArielGlenn) Grafana shows network usage pretty high, we have two interfaces so we could bring up the second one, but can the arrays/controller handle it? 
{F16581153} {F16581157} [08:02:26] (03PS1) 10Jcrespo: labsdb: depool labsdb1011 to copy it to labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/423448 (https://phabricator.wikimedia.org/T191149) [08:02:58] (03CR) 10Marostegui: [C: 031] labsdb: depool labsdb1011 to copy it to labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/423448 (https://phabricator.wikimedia.org/T191149) (owner: 10Jcrespo) [08:03:37] (03CR) 10Jcrespo: [C: 032] labsdb: depool labsdb1011 to copy it to labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/423448 (https://phabricator.wikimedia.org/T191149) (owner: 10Jcrespo) [08:04:07] (03PS1) 10Marostegui: db-codfw.php: Specifying current m5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423449 [08:05:47] (03PS1) 10Jcrespo: labsdb: Depool labsdb1009 to copy it from labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/423450 (https://phabricator.wikimedia.org/T191149) [08:07:29] (03CR) 10Jcrespo: "This is ok, of course, as it is just a comment, but we have pending a discussion with cloud about the labswiki failover strategy, to avoid" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423449 (owner: 10Marostegui) [08:07:46] (03CR) 10Jcrespo: [C: 032] labsdb: Depool labsdb1009 to copy it from labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/423450 (https://phabricator.wikimedia.org/T191149) (owner: 10Jcrespo) [08:08:16] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (10Marostegui) [08:08:55] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4096600 (10Marostegui) I have created: T191193 to track the masters movement [08:09:22] (03CR) 10Marostegui: [C: 032] db-codfw.php: Specifying current m5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423449 (owner: 10Marostegui) [08:10:37] (03Merged) 10jenkins-bot: db-codfw.php: Specifying current m5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423449 (owner: 10Marostegui) [08:10:51] (03CR) 10jenkins-bot: db-codfw.php: Specifying current m5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423449 (owner: 10Marostegui) [08:11:48] 10Operations, 10Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096602 (10ArielGlenn) We could consider QoS, prioritizing NFS packets, or alternatively, deprioritizing rsync packets, to see what impact that has, remaining with the single interface active. It's fine... [08:11:52] !log depool labsdb1011 from web wikirreplicas [08:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Specify current m5 codfw master (duration: 01m 17s) [08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:21] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096614 (10Peachey88) [08:21:36] !log stop mariadb at labsdb1009 and labsdb1010 [08:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:42] 10Operations, 10Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096617 (10ArielGlenn) from /data/xmldatadumps/private on dumpsdata1001: ``` list=*wik* for wikiname in $list; do echo -n "$wikiname: "; zcat "/data/xmldatadumps/public/${wikiname}/20180401/${wikiname}-... 
[08:26:26] 10Operations, 10Dumps-Generation: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4096618 (10ArielGlenn) [08:27:57] (03PS3) 10Jcrespo: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [08:50:24] !log Deploy schema change on s3 codfw master db2043 (this will generate lag on codfw) - T187089 T185128 T153182 [08:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:32] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [08:50:32] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [08:50:32] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [08:52:35] (03CR) 10Marostegui: [C: 031] Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [08:53:15] (03CR) 10Jcrespo: [C: 032] Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [08:54:28] (03Merged) 10jenkins-bot: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [08:57:59] (03CR) 10jenkins-bot: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [09:12:05] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove references to virt1000 (duration: 01m 16s) [09:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove references to virt1000 (duration: 01m 16s) [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:25] 10Operations, 10Dumps-Generation: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4096658 (10ArielGlenn) Other things we could do, just to get the dumps to run smoothly: - move enwikidata or wikidatawiki run to write to dumpsdata1002. T... [09:53:17] 10Operations, 10Commons, 10MediaWiki-Database, 10Multimedia, and 4 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#4096694 (10jcrespo) @Aaron there seems to be a surge on WikiPage::doDeleteArticleReal errors possibly related to this ticket... 
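The truncated one-liner quoted from T191177 above is scanning the 20180401 stub dumps for binary corruption; a self-contained sketch of the same kind of check follows (the file path is copied from the comment quoted earlier, the byte offset to skip past the XML header is illustrative):
```
# Sketch: look for NUL bytes (binary corruption) in a gzipped stub dump,
# skipping the start of the stream where the XML header lives.
file=/data/xmldatadumps/public/wikidatawiki/20180401/wikidatawiki-20180401-stub-meta-history27.xml.gz
nulls=$(zcat "$file" | tail -c +1000 | tr -dc '\000' | wc -c)
if [ "$nulls" -gt 0 ]; then
    echo "$file: found $nulls NUL bytes after the header"
else
    echo "$file: clean"
fi
```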
[10:01:27] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423456 (https://phabricator.wikimedia.org/T128546) [10:03:07] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423456 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:04:16] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423456 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:08:01] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423456 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:15:36] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:423456|Bumping portals to master (T128546)]] (duration: 01m 16s) [10:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:44] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:16:53] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:423456|Bumping portals to master (T128546)]] (duration: 01m 16s) [10:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:01:42] oh daylight savings time... [11:21:51] (03PS1) 10ArielGlenn: check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) [11:22:11] (03CR) 10jerkins-bot: [V: 04-1] check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [11:23:33] (03PS2) 10ArielGlenn: check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) [11:23:50] (03CR) 10jerkins-bot: [V: 04-1] check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [11:28:01] (03PS3) 10ArielGlenn: check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) [11:44:41] !log depool mediawiki canary servers for hhvm upgrade [11:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:21] (03CR) 10Hoo man: Add checksums for Wikidata entity dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423353 (https://phabricator.wikimedia.org/T190457) (owner: 10Hoo man) [11:46:53] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:04] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:13] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:47:43] PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:44] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:53] RECOVERY - 
DPKG on mw1261 is OK: All packages OK [11:48:03] PROBLEM - HHVM rendering on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:48:04] RECOVERY - DPKG on mw1262 is OK: All packages OK [11:48:13] RECOVERY - DPKG on mw1263 is OK: All packages OK [11:48:33] RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.113 second response time [11:48:43] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 80124 bytes in 0.263 second response time [11:48:54] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 80124 bytes in 0.167 second response time [11:51:20] !log repool mediawiki canary servers after hhvm upgrade [11:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:15] (03CR) 10D3r1ck01: "> Patch Set 2:" (035 comments) [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/421011 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:06:38] !log Deploy schema change on dbstore1002 - s3 - T187089 T185128 T153182 [12:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:49] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [12:06:49] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [12:06:50] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [12:07:03] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:35] 10Operations, 10DBA, 10Goal: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4096834 (10Marostegui) @jcrespo and myself have done an initial discussion about HW and to which extend (pros and cons) we can achieve redundancy for... [12:15:59] (03PS2) 10ArielGlenn: Add checksums for Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/423353 (https://phabricator.wikimedia.org/T190457) (owner: 10Hoo man) [12:16:48] (03CR) 10ArielGlenn: [C: 032] Add checksums for Wikidata entity dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423353 (https://phabricator.wikimedia.org/T190457) (owner: 10Hoo man) [12:20:53] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:35:59] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4096845 (10chasemp) @robh figured out what he believed is the eth0 issue described, unless a screenshot was captured I don't think there are logs but the message he pasted in irc from console was... 
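The depool/repool !log entries above bracket the HHVM package upgrade on the canary appservers, and the brief DPKG/HHVM rendering criticals are the checks firing mid-install. A hedged sketch of how one might verify a canary such as mw1261 before repooling (the curl check is only an approximation of the Icinga "HHVM rendering" probe, not its exact command):
```
dpkg -l hhvm hhvm-dbg | awk '/^ii/ {print $2, $3}'    # confirm the new package versions are installed
curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: en.wikipedia.org' \
     http://mw1261.eqiad.wmnet/wiki/Main_Page          # expect 200 once HHVM renders pages again
```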
[12:40:20] jouncebot, next [12:40:20] In 0 hour(s) and 19 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1300) [12:47:53] PROBLEM - DPKG on mw1244 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:48:03] PROBLEM - DPKG on mw1241 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:48:43] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:53] RECOVERY - DPKG on mw1244 is OK: All packages OK [12:49:03] RECOVERY - DPKG on mw1241 is OK: All packages OK [12:49:34] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.160 second response time [12:49:40] !log upgrade mediawiki servers for hhvm upgrade [12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:03] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [12:53:33] PROBLEM - HHVM rendering on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:24] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.347 second response time [12:54:24] PROBLEM - DPKG on mw1327 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:54:36] (03PS1) 10Urbanecm: New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) [12:55:03] PROBLEM - HHVM rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:24] RECOVERY - DPKG on mw1327 is OK: All packages OK [12:55:53] RECOVERY - HHVM rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.147 second response time [12:57:03] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:57:13] PROBLEM - DPKG on mw1269 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:57:54] PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [12:58:03] PROBLEM - HHVM rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:13] RECOVERY - DPKG on mw1269 is OK: All packages OK [12:58:53] PROBLEM - DPKG on mw1250 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:58:54] RECOVERY - HHVM rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.145 second response time [12:59:53] RECOVERY - DPKG on mw1250 is OK: All packages OK [13:00:03] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] I'M here [13:01:23] PROBLEM - DPKG on mw1324 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:01:46] please delay swat for another 10 mins or so (I 'll inform when I am done). I am upgrading HHVM in eqiad [13:02:23] RECOVERY - DPKG on mw1324 is OK: All packages OK [13:02:25] akosiaris, well, no problem actually, I have no deploy privs and no member of SWAT team is watching :D [13:03:00] :-) [13:03:13] PROBLEM - DPKG on mw1268 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:03:23] PROBLEM - DPKG on mw1320 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:04:14] RECOVERY - DPKG on mw1268 is OK: All packages OK [13:04:23] RECOVERY - DPKG on mw1320 is OK: All packages OK [13:05:09] akosiaris: do you have any idea why scb* services are so unstable on the last hours? [13:05:41] pdfrender seems it failed again on scb1003 [13:06:22] and mobileapps doesn't seem too happy, not sure if related [13:07:51] pdfrender is always unstable. The entire idea of proton is so that we can get rid of pdfrender [13:08:07] ah! [13:08:14] I thought pdfrender was the new one [13:08:25] I got confused [13:08:26] it's the new new one [13:08:30] there is a new new new one [13:08:34] PROBLEM - HHVM rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:34] :D [13:08:35] lol [13:09:15] ok, if I click "download as a pdf", right now, which one do I get? [13:09:24] RECOVERY - HHVM rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.160 second response time [13:09:28] to the best of your knowledge? [13:09:28] ocg = new one, pdfrender = 2x new one, proton = 3x new one [13:09:33] pdfrender [13:09:41] ok, so it IS in production [13:09:41] it's the only one currently in production [13:09:59] now I am slightly less confused [13:10:03] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [13:10:05] but it's plagued by issues, and unmaintained essentially [13:10:25] I see, thanks [13:13:46] for mobileapps it does seem it wasn't just mobileapps. OOM killer showed up in scb1001 on 10:23 this morning [13:14:03] oh [Sun Apr 1 10:23:58 2018] nodejs: page allocation stalls for 11936ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) [13:14:06] interesting [13:14:16] first time we see this on a non VM node [13:15:02] oh, is that similar to the ganeti nodes or something else? [13:15:23] yeah the error message is very very similar [13:15:29] scary [13:16:11] could that happen when trying to allocate a very large block of memory on a machine without enough available, it stalls while swapping a lot of stuff out? [13:16:22] it did show up on only scb1001, scb1002, no scb1003, scb1004 [13:16:32] I can't think of any other reason that'd happen [13:17:14] akosiaris: the problem is that systemd apparently restarted pdfrender, but the service didn't continue working until manually restarted by ariel [13:17:21] or it wasn't killed appropiately [13:17:34] jynus: yes that's one of the issues. We got a task for it, I 'll look it up [13:17:44] ok, if it is filed, no problem [13:17:53] was just puting you up to date, in case you missed that [13:18:19] because for me was a new behaviour [13:18:35] twentyafterfour: I thought so too when I first saw it, but there was never any big swap usage on any of the boxes we 've seen that. 
Even for https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=4&fullscreen&orgId=1&var-server=scb1001&var-datasource=eqiad%20prometheus%2Fops&from=now-12h&to=now-1m [13:18:46] there's minimal swap changes [13:18:52] 30MBs is nothing [13:18:54] hmm interesting [13:19:11] also https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=6&fullscreen&orgId=1&var-server=scb1001&var-datasource=eqiad%20prometheus%2Fops&from=now-12h&to=now-1m [13:19:15] this is minimal as well [13:19:30] so it doesn't seem like it is related to IO this time around [13:19:51] on the ganeti boxes it was definitely the case (it was IO related for sure) [13:20:03] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:20:08] Is there any chance to get somebody to do 2 patches in "current" SWAT window? [13:20:55] Urbanecm: sure I can do it [13:20:55] (akosiaris, jynus ^^) [13:20:58] Great :) [13:21:16] Thank you. But akosiaris said we should delay for 10 minutes, I'm not sure if everything is ok in the cluster [13:21:24] it would be nice to have swap activity levels, while I agree with you, the amount of swap in use can be misleading with how much it is being written/read [13:21:35] it's fine, you can go ahead. Upgrades on part of the cluster just finished [13:21:40] Great, thank you! [13:21:56] akosiaris: did the install failed? [13:22:05] Urbanecm: which patches? are they already on the wiki page? [13:22:12] "Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]" [13:22:14] twentyafterfour, yes [13:22:25] https://gerrit.wikimedia.org/r/#/c/423340/ [13:22:31] https://gerrit.wikimedia.org/r/#/c/423471/ [13:22:36] See https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1300 :) [13:22:58] or was just puppet that failed? [13:22:58] jynus: it's taking a long time to install and if puppet runs in the meantime it will get a "there's lock on dpkg right now" and will complain about it [13:23:03] ok, then [13:23:13] those hhvm packages are huge [13:23:13] (03CR) 1020after4: [C: 032] Throttle rule for 2018-04-04, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423340 (https://phabricator.wikimedia.org/T191168) (owner: 10Urbanecm) [13:23:16] sorry I am bothering you rather than helping 0:-) [13:24:14] I 'll file a task for the message on scb100{1,2} but this is weird [13:24:14] I will restart pdfrender on scb1003 if you are ok with that [13:24:29] (03Merged) 10jenkins-bot: Throttle rule for 2018-04-04, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423340 (https://phabricator.wikimedia.org/T191168) (owner: 10Urbanecm) [13:24:59] jynus: sure, go ahead. Also the task is https://phabricator.wikimedia.org/T174916 [13:25:17] the fix is "restart enough times" [13:25:21] :-( [13:25:22] akosiaris: would you mind merging https://gerrit.wikimedia.org/r/#/c/423062/ ? [13:25:37] akosiaris: no systemd-specific task, just that one? [13:25:52] jynus: it doesn't have anything to do with systemd [13:25:56] oh [13:25:59] even manually restarting it causes that [13:26:03] ok [13:26:06] manually running* it [13:26:16] there's a race somewhere in xpra or something [13:26:40] (03PS2) 10Alexandros Kosiaris: New ssh key for twentyafterfour [puppet] - 10https://gerrit.wikimedia.org/r/423062 (owner: 1020after4) [13:26:53] thanks! 
:) [13:27:02] (03CR) 10Alexandros Kosiaris: [C: 032] New ssh key for twentyafterfour [puppet] - 10https://gerrit.wikimedia.org/r/423062 (owner: 1020after4) [13:27:03] !log restarting pdfrender on scd1003 (Socket timeout) [13:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] s/d/b/ [13:27:54] RECOVERY - puppet last run on mw1327 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:28:02] I see it now, ugh [13:28:03] twentyafterfour: it's making it's way through the fleet [13:28:09] (03CR) 10jenkins-bot: Throttle rule for 2018-04-04, clean obsolete rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423340 (https://phabricator.wikimedia.org/T191168) (owner: 10Urbanecm) [13:28:13] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [13:29:20] the check says ok, but it complained exactly about what you said [13:29:43] * twentyafterfour used the old key for now [13:29:55] (03CR) 1020after4: [C: 032] New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:29:57] so now they have all been restarted today [13:29:59] so it goes [13:30:01] and now some softs on recommendation_api [13:30:03] !log twentyafterfour@tin Synchronized wmf-config/throttle.php: SWAT: Sync throttle rules for T191168 (duration: 01m 16s) [13:30:04] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:30:04] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] T191168: Throttle rule for 2018-04-04 - Senior Citizens Write Wikipedia course - https://phabricator.wikimedia.org/T191168 [13:30:14] Urbanecm: first one's done [13:30:22] twentyafterfour, thank you! 
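For context, the "Synchronized wmf-config/throttle.php" entries above correspond to a config-only deploy run from the deployment host after the mediawiki-config change is merged. A minimal sketch of that flow (typical invocation, not copied from this session):
```
cd /srv/mediawiki-staging
git pull                      # pick up the merged throttle.php change
scap sync-file wmf-config/throttle.php 'SWAT: Sync throttle rules for T191168'
```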
[13:30:24] PROBLEM - DPKG on mw1223 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:30:34] PROBLEM - DPKG on mw1229 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:30:36] I hope there won't be any problem with the second one as well ;) [13:30:40] (03PS2) 1020after4: New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:31:23] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:24] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:24] RECOVERY - DPKG on mw1223 is OK: All packages OK [13:31:34] RECOVERY - DPKG on mw1229 is OK: All packages OK [13:32:03] Urbanecm: the second one has a merge conflict [13:32:14] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.058 second response time [13:32:14] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 80156 bytes in 0.154 second response time [13:32:25] (03CR) 1020after4: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:32:37] twentyafterfour, will rebase, wait a sec [13:32:58] Urbanecm: I rebased it on gerrit, maybe it's going to work [13:33:08] yeah looks like it'll merge now [13:33:43] Great [13:33:52] (03Merged) 10jenkins-bot: New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:34:09] There shouldn't be a merge conflict if it is working... But I'm not going to touch anything working :D [13:34:48] it looks like it's ok, it just had trouble with the first merge before a rebase [13:35:03] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:35:26] Urbanecm: syncing [13:35:30] ack [13:35:33] PROBLEM - DPKG on mw1288 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:36:29] !log twentyafterfour@tin Synchronized wmf-config/throttle.php: SWAT: Sync throttle rules for T191187 (duration: 01m 15s) [13:36:33] RECOVERY - DPKG on mw1288 is OK: All packages OK [13:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] T191187: IP exception for Ahmedabad University Wikipedia Workshop - https://phabricator.wikimedia.org/T191187 [13:37:17] thcipriani, can you sync one patch more, please? Sorry for not noticing it before :( [13:37:32] twentyafterfour :D [13:37:42] Wrong tabbing starting from t, sorry again :( [13:37:54] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [13:38:16] (03CR) 10jenkins-bot: New throttle rule for Ahmedabad University Wikipedia Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423471 (https://phabricator.wikimedia.org/T191187) (owner: 10Urbanecm) [13:39:38] It's https://gerrit.wikimedia.org/r/421480 twentyafterfour, if you can... 
[13:39:43] PROBLEM - DPKG on mw1340 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:40:13] PROBLEM - DPKG on mw1282 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:40:23] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:24] Urbanecm: ok [13:40:28] Thank you! [13:40:43] RECOVERY - DPKG on mw1340 is OK: All packages OK [13:40:44] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:44] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:54] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:04] (03CR) 1020after4: [C: 032] Allow to import from the French Wiktionary to Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421480 (https://phabricator.wikimedia.org/T190445) (owner: 10Urbanecm) [13:41:10] (03PS2) 1020after4: Allow to import from the French Wiktionary to Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421480 (https://phabricator.wikimedia.org/T190445) (owner: 10Urbanecm) [13:41:13] RECOVERY - DPKG on mw1282 is OK: All packages OK [13:41:13] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 80100 bytes in 0.156 second response time [13:41:33] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [13:41:34] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 80100 bytes in 0.154 second response time [13:41:34] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 80100 bytes in 0.154 second response time [13:41:44] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 80100 bytes in 0.155 second response time [13:44:33] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:53] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg] [13:45:04] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:23] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.164 second response time [13:45:23] PROBLEM - HHVM rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:54] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.139 second response time [13:46:13] RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 80122 bytes in 0.149 second response time [13:46:23] PROBLEM - DPKG on mw1289 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:47:01] Urbanecm: syncing [13:47:04] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:05] ack [13:47:23] RECOVERY - DPKG on mw1289 is OK: All packages OK [13:47:53] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time [13:48:05] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Sync initializesettings for T190445 (duration: 01m 16s) [13:48:11] (03CR) 10jenkins-bot: Allow to import from the French Wiktionary to Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421480 (https://phabricator.wikimedia.org/T190445) (owner: 10Urbanecm) [13:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:12] T190445: Allow to import from the French Wiktionary to Incubator - https://phabricator.wikimedia.org/T190445 [13:51:04] 10Operations, 10Services: Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199#4097023 (10akosiaris) [13:53:32] (03CR) 10ArielGlenn: [C: 032] check stub files after production to see if they have binary crap [dumps] - 10https://gerrit.wikimedia.org/r/423465 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [13:54:28] PROBLEM - LVS HTTPS IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.21 and port 443: Connection refused [13:54:47] !log ariel@tin Started deploy [dumps/dumps@0363d50]: add check that xml files don't have binary corruption (nulls) after the header [13:54:51] !log ariel@tin Finished deploy [dumps/dumps@0363d50]: add check that xml files don't have binary corruption (nulls) after the header (duration: 00m 04s) [13:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:57] hmm interesting, that is probably me [13:55:00] looking [13:55:28] RECOVERY - LVS HTTPS IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16710 bytes in 0.097 second response time [13:55:42] yup me, sorry about that [13:55:45] ok [14:00:03] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:00:03] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:00:04] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:01:03] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.009 second response time [14:01:03] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time 
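Following up on the scb1001/scb1002 memory discussion earlier in the log (kernel "page allocation stalls" messages despite barely-moving swap usage), a short sketch of checks that distinguish swap usage from swap activity; run on the affected host, commands are generic:
```
dmesg -T | grep -i 'page allocation stalls'   # e.g. "nodejs: page allocation stalls for 11936ms ..."
vmstat 5 5                                    # si/so columns = swap-in/out per second
free -m                                       # current memory and swap totals
```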
[14:01:04] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:02:23] PROBLEM - Long running screen/tmux on rhodium is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 19905, 2677457s 1728000s). [14:04:33] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:04:33] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:04:34] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:04:44] PROBLEM - DPKG on mw1311 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:05:33] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [14:05:33] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:05:34] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:05:44] RECOVERY - DPKG on mw1311 is OK: All packages OK [14:07:54] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:08:02] 10Operations, 10Services: Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199#4097123 (10akosiaris) p:05Triage>03High [14:09:03] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:09:13] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time [14:09:13] PROBLEM - HHVM jobrunner on mw1334 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:09:24] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:09:25] !log akosiaris@puppetmaster1001 conftool action : set/weight=8; selector: dc=eqiad,cluster=scb,name=scb1001.* [14:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:39] !log akosiaris@puppetmaster1001 conftool action : set/weight=8; selector: dc=eqiad,cluster=scb,name=scb1002.* [14:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [14:10:03] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [14:10:13] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:10:14] RECOVERY - HHVM jobrunner on mw1334 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:10:15] !log lower weight for scb1001, scb1002 from 10 to 8 for all services. T191199. 
scb1003, scb1004 have a weight of 15 already [14:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:21] T191199: Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 [14:10:23] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:11:33] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:22] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: dc=eqiad,service=recommendation-api,cluster=scb,name=scb1004.* [14:12:26] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: dc=eqiad,service=recommendation-api,cluster=scb,name=scb1003.* [14:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:50] 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4097133 (10Ottomata) Ah ya, sorry! Lots of mirror maker stuff moving around. Will fix today. [14:13:33] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:14:33] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [14:14:54] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:16:05] (03CR) 10Ottomata: [C: 031] cdh::hadoop: add the config support for HDFS Trash [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423156 (https://phabricator.wikimedia.org/T189051) (owner: 10Elukey) [14:19:06] (03Abandoned) 10Alexandros Kosiaris: nrpe: Pass ensure from monitor_service to nrpe::check [puppet] - 10https://gerrit.wikimedia.org/r/422106 (owner: 10Alexandros Kosiaris) [14:21:11] (03PS7) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) [14:21:39] (03CR) 10jerkins-bot: [V: 04-1] nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [14:22:19] !log Drop contest* tables from s3 - T186867 [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:24] T186867: Drop contest* tables from mediawikiwiki - https://phabricator.wikimedia.org/T186867 [14:28:55] !log Disabling puppet across VPS instances with dumps mounted (https://phabricator.wikimedia.org/P6921) T188643 [14:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:02] T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - https://phabricator.wikimedia.org/T188643 [14:32:02] (03CR) 10Madhuvishy: [V: 032 C: 032] nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [14:32:34] (03CR) 10BBlack: [C: 032] eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [14:33:48] (03PS9) 10Andrew Bogott: wiki replicas: record grants and set user for index maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/422199 
(https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [14:35:03] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:35:46] (03CR) 10Andrew Bogott: [C: 032] wiki replicas: record grants and set user for index maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [14:37:00] (03PS5) 10Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [14:37:46] (03CR) 10Andrew Bogott: [C: 032] keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) (owner: 10Andrew Bogott) [14:40:08] !log disabling puppet on decom host db1020 [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:45] (03PS1) 10Cmjohnson: Removing db1020 site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/423477 (https://phabricator.wikimedia.org/T189773) [14:42:15] (03PS2) 10Cmjohnson: Removing db1020 site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/423477 (https://phabricator.wikimedia.org/T189773) [14:43:28] (03CR) 10Cmjohnson: [C: 032] Removing db1020 site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/423477 (https://phabricator.wikimedia.org/T189773) (owner: 10Cmjohnson) [14:46:33] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.23 ms [14:47:09] 10Operations: Update & standardize Platform-specific_documentation for HP servers - https://phabricator.wikimedia.org/T138866#4097207 (10RobH) p:05Triage>03Normal [14:47:13] RECOVERY - Host db2044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.59 ms [14:47:13] RECOVERY - Host db2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.08 ms [14:47:23] RECOVERY - Host db2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.42 ms [14:47:23] RECOVERY - Host db2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.95 ms [14:47:43] RECOVERY - Host db2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.38 ms [14:47:44] RECOVERY - Host db2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.08 ms [14:47:44] RECOVERY - Host db2045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.56 ms [14:47:44] RECOVERY - Host db2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 49.82 ms [14:47:44] RECOVERY - Host db2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 49.72 ms [14:47:44] RECOVERY - Host db2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 49.57 ms [14:47:45] RECOVERY - Host db2049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.38 ms [14:47:45] RECOVERY - Host db2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.82 ms [14:47:45] RECOVERY - Host db2038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.87 ms [14:47:45] robh: looks like papaul is doing his magic ^ [14:47:46] RECOVERY - Host db2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.52 ms [14:47:46] RECOVERY - Host db2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 51.86 ms [14:47:47] RECOVERY - Host db2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 48.24 ms [14:47:56] heh, yep [14:48:13] RECOVERY - Host dbstore2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.55 ms [14:48:47] RECOVERY - Host db2073.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.32 ms [14:49:06] RECOVERY - Host db2083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.26 ms [14:49:19] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 
10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097210 (10Papaul) Removed power for 2 minutes and plugged back. Leaving this task open for now to monitor the switch. [14:49:46] RECOVERY - Host ms-be2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [14:50:35] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097212 (10Papaul) p:05High>03Low [14:50:36] RECOVERY - Host db2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.19 ms [14:51:04] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097213 (10Marostegui) The servers are reporting the recoveries already :-) Thanks! [14:52:03] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4097214 (10RobH) Bad switch state is the easiest recovery, so that is nice. [14:52:30] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097215 (10Papaul) a:05Papaul>03Marostegui @Marostegui confirm [14:54:14] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4097217 (10RobH) p:05Triage>03High [14:54:28] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097218 (10Marostegui) Thanks @Papaul - we can schedule one movement per day if that works for you! In order to minimize downtime I would need the future IP of each server before we shut it down so I... [14:55:20] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097222 (10Marostegui) a:05Marostegui>03Papaul [14:57:59] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097226 (10Papaul) @Marostegui let me know which one you want to start with. [14:59:32] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097227 (10Marostegui) [15:00:28] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4096579 (10Marostegui) Let's go for db2035 if that works for you!
[15:06:37] !log Reenabled puppet and rolled out mounting new dumps NFS shares from labstore1006|7 on VPS instances T188643 [15:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:44] T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - https://phabricator.wikimedia.org/T188643 [15:08:46] (03PS1) 10Cmjohnson: Removing db1020 dns entries [dns] - 10https://gerrit.wikimedia.org/r/423480 (https://phabricator.wikimedia.org/T189773) [15:10:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4097254 (10Cmjohnson) Removed db1020 switch port ge-1/0/4 [15:15:10] (03PS3) 10Andrew Bogott: nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) [15:15:12] (03PS3) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (https://phabricator.wikimedia.org/T187954) [15:15:56] (03CR) 10Andrew Bogott: [C: 032] nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) (owner: 10Andrew Bogott) [15:18:21] (03CR) 10Cmjohnson: [C: 032] Removing db1020 dns entries [dns] - 10https://gerrit.wikimedia.org/r/423480 (https://phabricator.wikimedia.org/T189773) (owner: 10Cmjohnson) [15:19:10] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1020 - https://phabricator.wikimedia.org/T189773#4097273 (10Cmjohnson) [15:19:25] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4097274 (10Pchelolo) While resolving the cirrus search issues the next bulk of jobs can be switched. Here's what I propose: - `rece... [15:21:57] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4097281 (10Papaul) a:05Papaul>03RobH Switch port confirmation asw-b5-codfw restbase-test2001 ge-5/0/19 restbase-test2002 ge-5/0/16 restbase-test2003 ge-5/0/18 [15:24:47] (03CR) 10Marostegui: [C: 031] phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: 10Jcrespo) [15:25:22] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097289 (10Papaul) new IP address 10.192.16.73 [15:25:52] 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4097292 (10Ottomata) Hm actually, I don't seem to have access to the deployment-prep project in Horizon anymore. I don't see it in the project d... [15:26:03] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097293 (10Marostegui) Thanks! 
I will post here as soon as the server is off [15:26:29] 10Operations, 10ops-codfw, 10DBA: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097294 (10Papaul) new switch port information asw-b1-codfw ge-1/0/15 [15:28:14] !log Stop MySQL and power off db2035 (s2 codfw master - this will stop replication on s2 codfw slaves) for rack change - T191193 [15:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:20] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [15:29:42] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4097300 (10Cmjohnson) This is still ongoing with HP...they wanted me to do a few things. The status is the same -- broke [15:35:14] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Change db2035 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423484 (https://phabricator.wikimedia.org/T191193) [15:35:59] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097394 (10Marostegui) @Papaul db2035 is now off! [15:37:45] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Change db2035 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423484 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:38:16] (03PS1) 10Papaul: DNS: Add db2035 to private1-b-codfw was in private1-c-codfw [dns] - 10https://gerrit.wikimedia.org/r/423485 (https://phabricator.wikimedia.org/T191193) [15:38:42] 10Operations, 10Electron-PDFs, 10Services (doing): pdfrender logs to /var/log/syslog as well as to /srv/log/pdfrender - https://phabricator.wikimedia.org/T191191#4097406 (10mobrovac) a:03mobrovac [15:38:53] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2035 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423484 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:39:08] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change db2035 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423484 (https://phabricator.wikimedia.org/T191193) (owner: 10Marostegui) [15:39:39] (03CR) 10Marostegui: [C: 031] DNS: Add db2035 to private1-b-codfw was in private1-c-codfw [dns] - 10https://gerrit.wikimedia.org/r/423485 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [15:40:48] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2035 IP - T191193 (duration: 01m 15s) [15:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:55] T191193: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193 [15:41:15] (03CR) 10Marostegui: [C: 032] DNS: Add db2035 to private1-b-codfw was in private1-c-codfw [dns] - 10https://gerrit.wikimedia.org/r/423485 (https://phabricator.wikimedia.org/T191193) (owner: 10Papaul) [15:42:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097430 (10Papaul) old switch information asw-c6-codfw ge-6/0/2 [15:42:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2035 IP - T191193 (duration: 01m 15s) [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:44] (03PS1) 10Ppchelko: Switch remaining high traffic jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423486 (https://phabricator.wikimedia.org/T175210) [15:42:47] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097434 (10Marostegui) mediawiki config files changed network/interfaces changed dns merged and deployed [15:43:26] (03PS2) 10Madhuvishy: dumps: Absent /public/dumps mount served from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/422848 (https://phabricator.wikimedia.org/T188643) [15:45:46] PROBLEM - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:00] (03CR) 10Madhuvishy: [C: 032] dumps: Absent /public/dumps mount served from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/422848 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [15:46:07] ^ that is expected, we are moving db2035 to another rack [15:47:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097458 (10Papaul) @robh if switch configuration is not done yet can you please change it from new switch port information asw-b1-codfw ge-1/0/15 to new switch port informat... [15:49:02] 10Operations, 10Commons, 10MediaWiki-Database, 10Multimedia, and 4 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#4097472 (10Anomie) I suppose the query might be failing to lock the `comment` table rows somehow or other (which aren't goin... [15:50:22] 10Operations, 10Analytics, 10Traffic, 10User-Elukey, 10Varnish: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#4097473 (10Milimetric) p:05Normal>03Triage [15:50:56] RECOVERY - Host db2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [15:51:05] 10Operations, 10Analytics, 10Traffic, 10User-Elukey, 10Varnish: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2072088 (10Milimetric) p:05Triage>03Low [15:55:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097505 (10Papaul) db2035 was on asw-c6-codfw ge-6/0/2 and now will be on asw-b1-codfw ge-1/0/4 [15:56:00] (03PS1) 10Subramanya Sastry: Enable RemexHtml on all wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 [15:56:02] (03PS1) 10Subramanya Sastry: Enable RemexHtml on all wikiquotes except frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) [15:56:04] !log restart elasticsearch on elastic1024, been stuck at 100% cpu for 3+ hours [15:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:34] (03PS2) 10Subramanya Sastry: Enable RemexHtml on all wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423491 (https://phabricator.wikimedia.org/T188881) [15:57:36] (03PS2) 10Subramanya Sastry: Enable RemexHtml on all wikiquotes except frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423492 (https://phabricator.wikimedia.org/T190726) [15:59:49] !log Absenting /public/dumps mount from labstore1003 across the VPS fleet T188643 [15:59:53] (03CR) 10BBlack: [C: 032] eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [15:59:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:55] T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - https://phabricator.wikimedia.org/T188643 [15:59:57] (03PS4) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [16:03:22] (03PS2) 10EddieGP: apache, wwwportals: De-duplicate vhost code [puppet] - 10https://gerrit.wikimedia.org/r/397770 [16:03:50] 10Operations, 10Analytics, 10EventBus, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#4097549 (10Milimetric) p:05Low>03Triage [16:04:05] 10Operations, 10Analytics, 10EventBus, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3452124 (10Milimetric) p:05Triage>03Low [16:06:17] (03CR) 10EddieGP: [C: 04-1] apache, wwwportals: De-duplicate vhost code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [16:10:37] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4097558 (10BBlack) [16:11:21] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (10BBlack) [16:14:43] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4097563 (10RobH) So there is an issue where trusty expects the os to be on eth0, and its on eth3. However, after discussion in IRC, @ayounsi pointed out the new switch in this rack is 10G. So la... 
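The db2035 move above works because MediaWiki addresses database servers by name: once the host is powered down, only the name-to-IP map in mediawiki-config (plus DNS and the switch port) has to follow it into the new subnet. A minimal sketch of that kind of wmf-config edit is below, using the 10.192.16.73 address quoted in the task; the key name and array layout are assumptions from memory, the authoritative change is the db-eqiad,db-codfw.php patch (gerrit 423484).
```php
<?php
// Illustrative sketch only, not the actual contents of wmf-config/db-codfw.php.
// The load balancer config keeps a host-name => IP map, so moving db2035 from
// private1-c-codfw to private1-b-codfw amounts to a one-line change here.
$hostsByName = [
	// ...
	'db2035' => '10.192.16.73', // new address in private1-b-codfw (T191193)
	// ...
];
```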
[16:15:48] (03PS1) 10Jcrespo: labsdb: Reduce the sleep timeouts of clients to prevent connection hogging [puppet] - 10https://gerrit.wikimedia.org/r/423494 [16:15:50] (03PS1) 10Jcrespo: dbproxy: Move analytics wikireplica service to labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/423495 (https://phabricator.wikimedia.org/T191149) [16:15:59] (03PS1) 10Subramanya Sastry: Enable RemexHtml on wikis with <50 issues in high priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423496 (https://phabricator.wikimedia.org/T190731) [16:16:40] (03PS1) 10Ladsgroup: Add wordmark for Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) [16:17:40] (03PS2) 10Madhuvishy: dumps: Set up symlinks on instances under /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/422867 (https://phabricator.wikimedia.org/T188643) [16:18:10] (03CR) 10jerkins-bot: [V: 04-1] dumps: Set up symlinks on instances under /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/422867 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [16:20:57] (03CR) 10Madhuvishy: [V: 032 C: 032] dumps: Set up symlinks on instances under /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/422867 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [16:20:57] PROBLEM - Host elastic2031 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:48] !log Rolling out new symlinks to /public/dumps for labstore1006 dumps nfs mount T188643 [16:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:54] T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 - https://phabricator.wikimedia.org/T188643 [16:39:14] ebernhardson: i broke elastic2031 on switch config [16:39:17] trying to fix now [16:39:23] that is 100% my fault [16:40:52] robh: its ok, the cluster is pretty resilient as long as you dont kill significantly more :) [16:41:14] ebernhardson: good, because im over here freaking out that i broke your stuff when we werent even supposed to be touching it =P [16:41:25] im working to fix now =] [16:41:32] ebernhardson: can they be rebooted without issue? [16:41:41] What happened was i accidentally disabled its port [16:42:33] ok, i found the vlan issue that i caused and im fixing it [16:42:36] it should come back up now [16:42:40] (i didnt have to reboot it) [16:42:41] robh: yes they can be rebooted fairly arbitrarily. It goes faster with prep work but it's fine without [16:42:53] (the cluster recovery goes faster i mean) [16:42:56] RECOVERY - Host elastic2031 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:43:27] ebernhardson: would you mind service level check? [16:43:51] not sure if it needs a kick after network problems [16:43:51] jynus: whats that mean? [16:43:57] ebernhardson: ok, its back online, it lost connectivity to network approximately 7 minutes ago [16:44:00] jynus: oh just login and see if its working? [16:44:08] yeah, its state [16:44:12] so im not sure how that loss of connectivity is rebuilt [16:44:13] I assume it desyncs [16:44:29] not sure if it resyncs 100% atomatically, etc. [16:44:30] yes, it's going to delete 500G off disk and copy it back from other nodes [16:44:36] :-/ [16:44:37] is that automatic? [16:44:39] yes [16:44:42] nice! [16:44:47] (auto recovery is neat) [16:45:09] next elasticsearch upgrade includes transaction sequence ids, which in theory means it can stop deleting all the data incase it was updated. 
hopefully :) [16:45:27] we've come such a long way from the first search cluster ;D [16:45:55] ebernhardson: so again though, that network connectivity loss was 100% my fault. sorry about that. [16:46:02] ebernhardson: I ask because other high available service we, by default, kick them out of the cluster and require manual aproval to join [16:46:06] i just had too many tabs open =P [16:46:25] node looks fine. joined cluster, it only held onto 20G out of 500G and cluster is rebalancing the rest out [16:46:40] we do that for databases to avoid flaps up and down [16:46:49] jynus: ahh, interesting. i don't think elasticsearch even has an option for that [16:47:18] e.g. for when it is not a hard failure, but an intermitent ones [16:47:50] (03PS1) 10Bstorm: wiki replicas: refactor views to be a profile [puppet] - 10https://gerrit.wikimedia.org/r/423503 [16:48:09] i'm not sure what an intermittent failure would do in elasticsearch, havn't had one yet. I could imagine it causing issues for the whole service though as requests get routed and failed. now more to think about :P [16:48:47] ebernhardson: I guess split brain is not a huge issue there [16:49:05] if the latest updates are not available, the index is still useful [16:49:05] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097692 (10Papaul) moved db2035 in racktables from C6 to B1 [16:49:41] data is a different use case- the latest updates are somtimes the most important ones, every system as its own model [16:49:53] jynus: right, split brain isn't really a problem for search results (as long as it can resolve eventually) [16:50:28] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097693 (10Marostegui) [16:50:34] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097694 (10Papaul) [16:51:11] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097696 (10Marostegui) db2035's mysql is back and slaves are reconnecting. I would suggest next server to be db2039. [16:57:36] (03PS2) 10Bstorm: wiki replicas: refactor views to be a profile [puppet] - 10https://gerrit.wikimedia.org/r/423503 [17:00:04] gehel: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1700). Please do the needful. [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:02:55] (03CR) 10Ottomata: [C: 031] statistics: Mount dumps share from labstore1006|7 on stat1005|6 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [17:03:10] (03CR) 10Ottomata: [C: 031] statistics: Absent existing dumps mount at /mnt/data [puppet] - 10https://gerrit.wikimedia.org/r/422892 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [17:03:22] (03CR) 10Ottomata: [C: 031] statistics: Symlink /mnt/data to nfs share from active server [puppet] - 10https://gerrit.wikimedia.org/r/422896 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [17:05:21] (03CR) 10Andrew Bogott: [C: 031] "Looks good to me! As far as I'm concerned this can be merged as soon as the needed passwords are moved in the private repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/423503 (owner: 10Bstorm) [17:07:36] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4097768 (10RobH) >>! In T190540#4093023, @Papaul wrote: > cp2022 SEL after test > > "Normal","Thu Mar 29 2018 20:11:58","Log cleared." > "Warning",... [17:10:04] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4097772 (10RobH) Next steps: * reboot cp2022 in serial console a dozen times and watch it ** if it has the error, it shows during post AND pushes t... [17:11:33] !log smalyshev@tin Started deploy [wdqs/wdqs@49f4eed]: GUI update [17:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:35] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:00] thats me [17:14:11] im doing expected reboot testing, but ill coem up and down a bunch in the next few. [17:17:35] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:36] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:36] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:45] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:17:45] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:17:55] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:17:55] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:17:55] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:17:55] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:55] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:56] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:56] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:57] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:57] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:17:58] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:05] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:05] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:05] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:05] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:05] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:15] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:15] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:15] 
PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:16] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:25] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:25] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:25] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:25] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:25] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:26] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:26] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:27] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:27] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:28] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:18:28] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:29] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:29] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:30] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:18:35] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:21:23] !log smalyshev@tin Finished deploy [wdqs/wdqs@49f4eed]: GUI update (duration: 09m 49s) [17:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:15] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:24:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4097801 (10Papaul) switch port information when ready to move db2039. This i just a note for when we are ready to do the move. db2039 was on asw-c6-codfw ge-6/0/6 and now will... [17:26:56] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4097817 (10madhuvishy) From comparing the 2 kernels for NFSd and raw disk performance, I can see that there's a small loss in performance on both reads and writes in the new spec... 
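The elastic2031 exchange earlier (16:41-16:50) describes how an Elasticsearch node that drops off the network rejoins on its own: it discards most of its local shard copies (~500G in that case) and re-replicates them from the rest of the cluster, keeping only what it can prove is current. One way to watch that happen is to poll the standard cluster-health and recovery APIs; the sketch below is an illustration under assumed host/port, not a WMF tool.
```php
<?php
// Rough sketch of watching a rejoined node re-replicate. _cluster/health and
// _cat/recovery are standard Elasticsearch APIs; the endpoint is assumed.
$base = 'http://localhost:9200';

$health = json_decode( file_get_contents( "$base/_cluster/health" ), true );
printf(
	"status=%s initializing=%d relocating=%d unassigned=%d\n",
	$health['status'],
	$health['initializing_shards'],
	$health['relocating_shards'],
	$health['unassigned_shards']
);

// Only the shard copies currently being rebuilt (the data being copied back):
echo file_get_contents( "$base/_cat/recovery?active_only=true&v" );
```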
[17:27:25] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [17:27:25] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [17:27:26] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [17:27:26] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [17:27:26] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK [17:27:26] (03PS1) 10Jdlrobson: Update Jon Robson's public key [puppet] - 10https://gerrit.wikimedia.org/r/423507 [17:27:26] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK [17:27:26] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [17:27:27] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [17:27:27] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [17:27:35] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [17:27:36] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [17:27:36] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [17:27:37] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [17:27:37] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [17:27:38] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [17:27:55] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [17:27:55] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [17:27:56] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [17:28:05] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [17:28:05] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [17:28:05] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [17:28:05] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [17:28:05] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK [17:28:05] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [17:28:06] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [17:28:06] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [17:28:06] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [17:28:07] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [17:28:07] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [17:28:08] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [17:28:10] (03CR) 10Bstorm: "They are already duplicated there. I'll double check before I merge, though. Once merged, I'll remove the unnecessary classes." 
[puppet] - 10https://gerrit.wikimedia.org/r/423503 (owner: 10Bstorm) [17:28:15] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [17:28:15] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK [17:28:15] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [17:28:16] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [17:28:16] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [17:28:25] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [17:28:25] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [17:30:19] (03PS1) 10EBernhardson: Shift enwiki search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423508 [17:30:41] (03CR) 10EBernhardson: [C: 032] Shift enwiki search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423508 (owner: 10EBernhardson) [17:31:55] (03Merged) 10jenkins-bot: Shift enwiki search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423508 (owner: 10EBernhardson) [17:32:13] (03CR) 10jenkins-bot: Shift enwiki search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423508 (owner: 10EBernhardson) [17:33:05] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [17:35:16] those are all going to alert strongswan again shortly [17:35:23] i let cp2022 back online during testing [17:35:25] and didnt mean to [17:35:38] (03PS2) 10Ppchelko: Switch remaining high traffic jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423486 (https://phabricator.wikimedia.org/T190327) [17:36:38] (03PS4) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (https://phabricator.wikimedia.org/T187954) [17:36:53] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Shift serach traffic for enwiki to codfw (duration: 01m 17s) [17:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:15] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:15] (03CR) 10Andrew Bogott: [C: 032] nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (https://phabricator.wikimedia.org/T187954) (owner: 10Andrew Bogott) [17:40:29] taking over mw-config on tin for the next 10 mins or so [17:40:46] (03CR) 10Mobrovac: [C: 032] Switch remaining high traffic jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423486 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [17:41:15] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:15] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:16] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:16] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:25] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:25] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:25] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:25] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:25] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:41:26] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:26] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:41:35] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:41:35] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:41:36] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:41:45] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:45] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:45] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:45] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:45] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:46] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:46] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:47] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:47] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:48] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:48] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:49] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:49] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:50] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:41:52] (03CR) 10Huji: [C: 04-1] Add wordmark for Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) (owner: 10Ladsgroup) [17:42:05] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: 
cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:42:05] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6 [17:42:06] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:42:06] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6 [17:42:54] (03PS3) 10Mobrovac: Switch remaining high traffic jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423486 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [17:43:47] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4097889 (10RobH) So I've rebooted cp2022 12 times, attempting to re-create the memory error that @ema experienced on this machine (and is demonstrat... [17:44:32] (03CR) 10Bstorm: [C: 032] wiki replicas: refactor views to be a profile [puppet] - 10https://gerrit.wikimedia.org/r/423503 (owner: 10Bstorm) [17:44:39] (03PS3) 10Bstorm: wiki replicas: refactor views to be a profile [puppet] - 10https://gerrit.wikimedia.org/r/423503 [17:47:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:47:26] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch the remaining high-traffic jobs to EventBus, test wikis only, file 1/2 - T190327 (duration: 01m 16s) [17:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:33] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [17:47:45] !log ppchelko@tin Started deploy [cpjobqueue/deploy@9e1b203]: Switch remaining high traffic jobs for test wikis. T190327 [17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] (03CR) 10jenkins-bot: Switch remaining high traffic jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423486 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [17:48:28] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@9e1b203]: Switch remaining high traffic jobs for test wikis. 
T190327 (duration: 00m 43s) [17:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:08] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch the remaining high-traffic jobs to EventBus, test wikis only, file 2/2 - T190327 (duration: 01m 15s) [17:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:35] ok, done [17:56:18] (03PS2) 10Jcrespo: dbproxy: Move analytics wikireplica service to labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/423495 (https://phabricator.wikimedia.org/T191149) [17:57:28] (03CR) 10Jcrespo: [C: 032] dbproxy: Move analytics wikireplica service to labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/423495 (https://phabricator.wikimedia.org/T191149) (owner: 10Jcrespo) [17:57:30] 10Operations, 10ops-codfw, 10Traffic: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4097930 (10RobH) p:05Triage>03High [17:57:33] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4097944 (10RobH) p:05Triage>03Normal [17:57:36] 10Operations, 10ops-codfw, 10Traffic: cp2010 memory replacement - https://phabricator.wikimedia.org/T191225#4097958 (10RobH) p:05Triage>03Normal [17:57:48] 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4097972 (10RobH) p:05Triage>03Normal [17:57:56] 10Operations, 10ops-codfw, 10Traffic: cp2017 memory replacement - https://phabricator.wikimedia.org/T191227#4097986 (10RobH) p:05Triage>03Normal [17:58:00] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4098000 (10RobH) p:05Triage>03Normal [17:58:11] 10Operations, 10ops-codfw, 10Traffic: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4098015 (10RobH) p:05Triage>03Normal [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T1800). Please do the needful. [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:04:35] (03PS1) 10Ppchelko: Switch high traffic jobs to kafka for all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/423512 (https://phabricator.wikimedia.org/T190327) [18:04:49] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4098045 (10mobrovac) [18:06:00] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199#4098052 (10mobrovac) [18:08:31] 10Operations, 10ops-codfw, 10Traffic: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4098059 (10RobH) [18:11:53] (03PS5) 10Madhuvishy: statistics: Mount dumps share from labstore1006|7 on stat1005|6 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) [18:12:40] (03CR) 10Madhuvishy: [C: 032] statistics: Mount dumps share from labstore1006|7 on stat1005|6 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [18:16:20] (03PS1) 10Madhuvishy: statistics: Add missed README for dumps nfs mount [puppet] - 10https://gerrit.wikimedia.org/r/423515 (https://phabricator.wikimedia.org/T188644) [18:16:51] (03CR) 10Madhuvishy: [C: 032] statistics: Add missed README for dumps nfs mount [puppet] - 10https://gerrit.wikimedia.org/r/423515 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [18:20:27] 10Operations, 10ops-codfw, 10Traffic: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4098091 (10RobH) [18:20:56] (03PS2) 10Madhuvishy: statistics: Absent existing dumps mount at /mnt/data [puppet] - 10https://gerrit.wikimedia.org/r/422892 (https://phabricator.wikimedia.org/T188644) [18:21:23] 10Operations, 10ops-codfw, 10Traffic: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4097930 (10RobH) [18:21:31] (03CR) 10Madhuvishy: [C: 032] statistics: Absent existing dumps mount at /mnt/data [puppet] - 10https://gerrit.wikimedia.org/r/422892 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [18:24:40] (03PS2) 10Madhuvishy: statistics: Symlink /mnt/data to nfs share from active server [puppet] - 10https://gerrit.wikimedia.org/r/422896 (https://phabricator.wikimedia.org/T188644) [18:25:23] (03CR) 10Madhuvishy: [C: 032] statistics: Symlink /mnt/data to nfs share from active server [puppet] - 10https://gerrit.wikimedia.org/r/422896 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [18:26:00] mutante: did you get any chance to rebuild that mcrouter package in apt.wikimedia.org? 
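The job-queue work above (wmf-config/jobqueue.php sync at 17:49, then gerrit 423512 for all wikis) switches selected job types from the Redis queue to the Kafka/EventBus path. In MediaWiki terms that is a per-job-type override in $wgJobTypeConf; the sketch below only shows the shape of such an override, with hypothetical job names — the real list and options live in the gerrit changes, and JobQueueEventBus is the queue class shipped by the EventBus extension.
```php
<?php
// Hedged illustration of routing individual job types to EventBus/Kafka.
// Job names here are examples only; see gerrit 423486 / 423512 for the
// authoritative configuration.
$wmgEventBusJobs = [ 'htmlCacheUpdate', 'refreshLinks' ]; // hypothetical subset

foreach ( $wmgEventBusJobs as $jobType ) {
	$wgJobTypeConf[$jobType] = [
		'class' => 'JobQueueEventBus', // provided by the EventBus extension
	];
}
```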
[18:32:57] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4098122 (10RobH) p:05Normal>03High [18:38:24] (03PS1) 10Arlolra: Update ssh public key for arlolra [puppet] - 10https://gerrit.wikimedia.org/r/423516 [18:38:50] 10Operations, 10ops-esams, 10Traffic: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4098146 (10RobH) a:03RobH [18:40:20] (03PS2) 10Jcrespo: labsdb: Reduce the sleep timeouts of clients to prevent connection hogging [puppet] - 10https://gerrit.wikimedia.org/r/423494 [18:40:22] (03PS1) 10Jcrespo: dbproxy: Repool labsdb1009 as part of the analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/423517 (https://phabricator.wikimedia.org/T191149) [18:41:40] (03CR) 10Jcrespo: [C: 032] dbproxy: Repool labsdb1009 as part of the analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/423517 (https://phabricator.wikimedia.org/T191149) (owner: 10Jcrespo) [18:41:45] (03PS2) 10Jcrespo: dbproxy: Repool labsdb1009 as part of the analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/423517 (https://phabricator.wikimedia.org/T191149) [18:43:09] 10Operations, 10ops-codfw, 10Traffic: cp2010 memory replacement - https://phabricator.wikimedia.org/T191225#4098168 (10RobH) p:05Normal>03High [18:43:28] 10Operations, 10ops-codfw, 10Traffic: cp2010 memory replacement - https://phabricator.wikimedia.org/T191225#4097958 (10RobH) [18:43:29] 11 [18:49:50] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#4098187 (10Pchelolo) [18:51:13] 10Operations, 10ops-codfw, 10Traffic: cp2011 memory replacement - https://phabricator.wikimedia.org/T191226#4098198 (10RobH) p:05Normal>03High [18:57:07] AaronSchulz: no, sorry. i had created https://phabricator.wikimedia.org/T190979 but didn't happen yet. 
i will look this afternoon [19:04:08] (03CR) 10Ladsgroup: "This -1 is invalid IMO" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) (owner: 10Ladsgroup) [19:04:17] !log Getting the train back on track: deploying 1.31.0-wmf.27 to Group0 [19:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:34] 10Operations, 10ops-codfw, 10Traffic: cp2017 memory replacement - https://phabricator.wikimedia.org/T191227#4098223 (10RobH) p:05Normal>03High [19:06:49] !log sync rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots Bug: T190960 Change-Id: I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd [19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:55] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [19:09:48] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/: sync I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd refs T183966 T190960 (duration: 01m 19s) [19:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:55] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [19:13:01] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423520 [19:13:03] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423520 (owner: 1020after4) [19:14:19] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423520 (owner: 1020after4) [19:15:12] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4098243 (10RobH) [19:15:22] 10Operations, 10ops-codfw, 10Traffic: cp2018 memory replacement - https://phabricator.wikimedia.org/T191228#4098000 (10RobH) p:05Normal>03High [19:16:25] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 [19:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:42] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 15s) [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:51] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4098249 (10RobH) I've reviewed the next steps with @bblack via IRC. @Papaul ran memtest on all of the machines reporting failed memory, and they al... [19:18:39] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423520 (owner: 1020after4) [19:20:35] mutante: ok, thanks [19:25:26] AaronSchulz: with the latest patch, 1.31.0-wmf.27 looks better, now I just see a bunch of "Could not wait for replica DBs to catch up" [19:25:31] still the error rate jumped [19:26:02] twentyafterfour: would be nice to have https://gerrit.wikimedia.org/r/#/c/423149/ [19:26:19] AaronSchulz: ok I'll cherry pick it [19:26:21] I want to rule that case out [19:27:15] Cool, I'll get that deployed shortly [19:30:53] twentyafterfour: I wonder if 24353a60d2c860cd24593d721b45291782a8489f will remove the last "x seconds of lag" warnings. 
I'm only seeing them for wmf6" [19:31:29] AaronSchulz: I think so [19:34:02] AaronSchulz: "Using cached lag value for 10.64.16.102 due to active transaction" is creating a lot of logspam noise [19:34:28] it's only a warning level message so I can live with it but it increased the load on logstash quite a bit [19:34:49] no trace :/ [19:35:44] yeah I'm not sure what affects that? [19:38:03] actually, I know. They are noise that happens on any begin()...that log message is triggered wrongly. [19:41:41] AaronSchulz: ok the extra logging is syncing [19:41:53] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync 779e7fd3de08101ab626136c07045c98ea162c5b refs T190960 (duration: 01m 16s) [19:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:00] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [19:44:17] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4098317 (10Gilles) thumb_handler.php is for private wikis, though, afaik, public wikis hit thumb.php [19:46:34] !log temporarily disabling puppet agents for puppetdb postgres security update [19:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:36] AaronSchulz: "No GTIDs with the same domain between master" didn't show up in logstash yet [19:47:43] so that's eliminated I guess? [19:48:22] (03PS1) 10Mobrovac: pdfrender: Tell SystemD to log directly into a file [puppet] - 10https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) [19:49:36] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61882 MB (12% inode=99%) [19:49:49] shit [19:50:22] is that ^ caused by the extra logging? I'm not seeing THAT big of a volume of log messages [19:51:03] twentyafterfour: nope, different cluster. [19:53:56] AaronSchulz: still, a pretty big jump in logging: https://logstash.wikimedia.org/goto/d79480c0f37930145a28656a07957dcb [19:54:09] I definitely can't go to group2 like this [19:54:11] (03CR) 10Mobrovac: "PCC: https://puppet-compiler.wmflabs.org/compiler02/10760/" [puppet] - 10https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) (owner: 10Mobrovac) [19:54:37] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 62227 MB (12% inode=99%) [19:55:19] BadMethodCallException from line 160 of /srv/mediawiki/php-1.31.0-wmf.26/extensions/CirrusSearch/includes/BuildDocument/RedirectsAndIncomingLinks.php: Call to a member function getTotalHits() on a non-object (null) [19:55:32] hmm and that's not even the new branch [19:56:45] !log puppetdb postgres update complete — puppet agents re-enabled [19:56:45] twentyafterfour: https://gerrit.wikimedia.org/r/423527 should fix the pointless logging [19:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:25] AaronSchulz: thanks, cherry-picked and +2'd [19:59:37] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61395 MB (12% inode=99%) [20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:38] arlo or scott might deploy later. i forgot i have a meeting to attend, so cannot deploy now. [20:02:29] (03PS1) 10Ottomata: Multi process MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423529 (https://phabricator.wikimedia.org/T189464) [20:03:40] doing masterPosWait() on all enwiki slaves works fine in shell.php. hmm [20:05:56] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773297 (10gh87) I was using the Apple Safari 11.0.3 (13604.5.6). Then I enabled the "Develop" tab in the Menu, then right-clicked to open... [20:06:15] ah, my meeting was cancelled, so can do the deploy after all. [20:06:18] (03PS2) 10Ottomata: Multi process MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423529 (https://phabricator.wikimedia.org/T189464) [20:06:25] i'm going to prep and do the parsoid deploy in a bit. [20:08:37] RECOVERY - Disk space on elastic1019 is OK: DISK OK [20:08:54] (03CR) 10Ottomata: [C: 032] "Les do it: https://puppet-compiler.wmflabs.org/compiler02/10762/kafka1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/423529 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:11:41] (03PS1) 10EBernhardson: Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) [20:12:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4091369 (10Tgr) `thumb.php` is for URLs which contain the filename and size as query parameters (... [20:13:02] (03CR) 10jerkins-bot: [V: 04-1] Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [20:13:18] (03PS1) 10Madhuvishy: statistics: Create /srv/dumps directory to host dumps datasets [puppet] - 10https://gerrit.wikimedia.org/r/423533 (https://phabricator.wikimedia.org/T189283) [20:14:14] why can't i log onto wmnet? i am getting asked for a password. is this related to updating bast .. ? [20:14:26] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4098456 (10ayounsi) ge-4/0/18 - description labvirt1021:eth0 - vlan-cloud-hosts1-b-eqiad xe-4/0/35 - description labvirt1021:eth1 - cloud-instance-ports [20:15:34] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4098458 (10ayounsi) >>! In T183585#4097254, @Cmjohnson wrote: > Removed db1020 switch port ge-1/0/4 asw2-b updated. [20:16:46] PROBLEM - Check systemd state on kafka1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
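The wmf.27 lag debugging above ends with masterPosWait() behaving fine when tried by hand in shell.php, which is what makes the remaining "Replication wait failed" fatals puzzling. For context, here is a minimal sketch of that kind of interactive check against the 1.31-era Rdbms API; it is an approximation of the idea, not the exact session that was run.
```php
<?php
// Sketch of re-checking replica catch-up from shell.php
// (mwscript shell.php --wiki=enwiki); an approximation, not the exact commands.
use MediaWiki\MediaWikiServices;

$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
$masterPos = $lb->getMasterPos();

for ( $i = 1; $i < $lb->getServerCount(); $i++ ) {
	$db = $lb->openConnection( $i );
	if ( !$db ) {
		continue;
	}
	// masterPosWait() blocks until the replica reaches $masterPos or the
	// timeout (seconds) expires; a failed wait is what surfaces as the
	// "Replication wait failed" / "Could not wait for replica DBs" errors.
	$waited = $db->masterPosWait( $masterPos, 10 );
	printf( "%s: wait=%s lag=%s\n",
		$lb->getServerName( $i ), var_export( $waited, true ), $db->getLag() );
}
```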
[20:16:46] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [20:16:49] me [20:16:50] sorry [20:16:51] acking [20:17:01] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4098467 (10Tgr) Apparently it doesn't implement fallbacks as per spec (the last fallback value is `origin`). I don't think there's anythin... [20:17:02] !log mholloway-shell@tin Started deploy [mobileapps/deploy@940bd48]: Update mobileapps to 58a0a88 [20:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:25] (03PS2) 10EBernhardson: Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) [20:17:32] twentyafterfour: does "recheck" work on wmf branches, e.g. https://gerrit.wikimedia.org/r/#/c/423528/ ? [20:17:45] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4098474 (10ayounsi) [20:17:47] ACKNOWLEDGEMENT - Check systemd state on kafka1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. ottomata new puppet - The acknowledgement expires at: 2018-04-03 23:17:26. [20:17:47] ACKNOWLEDGEMENT - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties ottomata new puppet - The acknowledgement expires at: 2018-04-03 23:17:26. [20:18:04] (03CR) 10Ottomata: [C: 031] statistics: Create /srv/dumps directory to host dumps datasets [puppet] - 10https://gerrit.wikimedia.org/r/423533 (https://phabricator.wikimedia.org/T189283) (owner: 10Madhuvishy) [20:18:30] (03CR) 10Madhuvishy: [C: 032] statistics: Create /srv/dumps directory to host dumps datasets [puppet] - 10https://gerrit.wikimedia.org/r/423533 (https://phabricator.wikimedia.org/T189283) (owner: 10Madhuvishy) [20:18:58] (03Restored) 10Dereckson: Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [20:19:08] (03CR) 10Dereckson: [C: 031] Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [20:20:08] (03CR) 10Dereckson: [C: 031] "In January, we were unsure for this change, but this has now been cleared with a Wikimedia Commons discussion. 
Those domains are considere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [20:20:36] (03CR) 10Huji: [C: 04-1] Add wordmark for Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) (owner: 10Ladsgroup) [20:20:44] (03PS1) 10Ottomata: Allow '@' characters in client-id prometheus jmx matching for MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423534 (https://phabricator.wikimedia.org/T189464) [20:21:26] (03CR) 10Ottomata: [C: 032] Allow '@' characters in client-id prometheus jmx matching for MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423534 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:21:29] (03PS2) 10Ottomata: Allow '@' characters in client-id prometheus jmx matching for MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423534 (https://phabricator.wikimedia.org/T189464) [20:21:31] (03CR) 10Ottomata: [V: 032 C: 032] Allow '@' characters in client-id prometheus jmx matching for MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423534 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:22:54] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@940bd48]: Update mobileapps to 58a0a88 (duration: 05m 52s) [20:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:50] (03CR) 10ArielGlenn: pdfrender: Tell SystemD to log directly into a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) (owner: 10Mobrovac) [20:25:19] AaronSchulz: I think recheck should work [20:27:27] (03PS1) 10Ottomata: Use proper value of mirror_name prometheus label [puppet] - 10https://gerrit.wikimedia.org/r/423536 (https://phabricator.wikimedia.org/T177855) [20:29:47] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10763/kafka1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/423536 (https://phabricator.wikimedia.org/T177855) (owner: 10Ottomata) [20:29:56] twentyafterfour: btw, I just added you time some patches about unrelated log spam [20:34:23] (03PS1) 10Ottomata: Quote process_number prometheus label [puppet] - 10https://gerrit.wikimedia.org/r/423537 (https://phabricator.wikimedia.org/T177855) [20:35:34] (03CR) 10Ottomata: [C: 032] Quote process_number prometheus label [puppet] - 10https://gerrit.wikimedia.org/r/423537 (https://phabricator.wikimedia.org/T177855) (owner: 10Ottomata) [20:38:11] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync 36c52351487aa70a651dbc6157e0a5882dfd9e7f refs T190960 (duration: 01m 16s) [20:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:18] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [20:40:35] (03PS1) 10Ottomata: Increase the number of main->jumbo MirrorMaker process to 4 per host [puppet] - 10https://gerrit.wikimedia.org/r/423538 (https://phabricator.wikimedia.org/T189464) [20:41:19] (03CR) 10Ottomata: [C: 032] Increase the number of main->jumbo MirrorMaker process to 4 per host [puppet] - 10https://gerrit.wikimedia.org/r/423538 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:41:34] (03PS1) 10Madhuvishy: dumps: Add rsync fetch jobs for datasets in stat1005 [puppet] - 
10https://gerrit.wikimedia.org/r/423539 (https://phabricator.wikimedia.org/T189283) [20:46:48] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4098592 (10Varnent) Just to be clear around expectations - Foundation Wiki is managed by the Foundation and is... [20:47:25] AaronSchulz: rather than less logging, that resulted in more! [20:47:39] https://logstash.wikimedia.org/goto/76d03b198328617d23976a1899db485d [20:48:16] (03CR) 10Ladsgroup: Add wordmark for Persian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) (owner: 10Ladsgroup) [20:48:52] twentyafterfour: "+channel:DBQuery" went down [20:49:31] (03PS1) 10Madhuvishy: statistics: Create dumps rsync module to allow read from labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/423540 (https://phabricator.wikimedia.org/T189283) [20:49:43] or (+channel:DBQuery +"cached") to be specific [20:50:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:50:59] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [20:51:05] shhh [20:51:08] PROBLEM - Check systemd state on kafka1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:51:45] AaronSchulz: ah ok [20:52:03] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen\ [20:52:13] AaronSchulz: ^ not good [20:52:39] I think I better roll this one back again :( [20:53:57] PROBLEM - Check systemd state on kafka1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:54:04] 10Operations, 10Dumps-Generation, 10Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4098602 (10ArielGlenn) I've rerun the one bad stub, set the dump scheduler on the appropriate host to start up wikidata dumps again... 
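On the logging volume twentyafterfour and AaronSchulz are comparing above (overall volume up, the "Using cached lag value" DBQuery warnings down), logstash is the canonical view, but the same spot-check can be done on the mwlog host. The log path below is the conventional location for per-channel MediaWiki logs, an assumption rather than something taken from this log:

    # count recent occurrences of the warning in question
    tail -n 200000 /srv/mw-log/DBQuery.log | grep -c 'Using cached lag value'

A count that drops after the 20:38 sync and again after the later revert would match the picture in the logstash links above.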
[20:54:48] (03Abandoned) 10Madhuvishy: statistics: Create dumps rsync module to allow read from labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/423540 (https://phabricator.wikimedia.org/T189283) (owner: 10Madhuvishy) [20:57:28] (03PS2) 10Andrew Bogott: bootstrap_vz: remove lots of resolve.conf magic from firstboot script [puppet] - 10https://gerrit.wikimedia.org/r/401636 (https://phabricator.wikimedia.org/T181375) [20:58:06] (03PS1) 1020after4: all wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423543 [20:58:08] (03CR) 1020after4: [C: 032] all wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423543 (owner: 1020after4) [20:58:12] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: remove lots of resolve.conf magic from firstboot script [puppet] - 10https://gerrit.wikimedia.org/r/401636 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [20:59:15] !log MediaWiki Train: rolling back to 1.31.0-wmf.26 refs T183966, T190960 [20:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:23] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [20:59:23] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [20:59:25] (03Merged) 10jenkins-bot: all wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423543 (owner: 1020after4) [20:59:39] (03CR) 10jenkins-bot: all wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423543 (owner: 1020after4) [21:00:04] bawolff and Reedy: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. 
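The rollback just logged (the 423543 config change, followed by the wikiversions sync and the branch revert in the next lines) follows the standard train procedure. In outline only; the staging path and the exact scap invocations below are the usual ones, not quoted from this log:

    # on the deployment host, after the mediawiki-config change is merged
    cd /srv/mediawiki-staging
    git pull                                    # picks up "all wikis to 1.31.0-wmf.26"
    scap sync-wikiversions 'all wikis to 1.31.0-wmf.26 refs T183966 T190960'
    # targeted fixes on a deployed branch go out as file syncs, as in the earlier 19:41 and 20:38 entries:
    scap sync-file php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php 'revert refs T190960'

The "rebuilt and synchronized wikiversions files" SAL entries further down are what scap sync-wikiversions records when it completes.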
[21:04:39] twentyafterfour: gah, let's just try https://gerrit.wikimedia.org/r/423546 in branch [21:06:43] maybe it's the change in gtid_ sql variables...better to figure out this out async [21:08:17] (03PS2) 10Madhuvishy: dumps: Add rsync fetch jobs for datasets in stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/423539 (https://phabricator.wikimedia.org/T189283) [21:09:55] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: all wikis to 1.31.0-wmf.26 [21:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:12] AaronSchulz: https://gerrit.wikimedia.org/r/c/423548 cherry picked [21:16:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [21:22:17] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/: Revert ceb7d61ee7ef3edc6705abd41ec86b3afcd9c491 refs T183966 T190960 (duration: 00m 59s) [21:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:24] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [21:22:24] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [21:24:24] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423553 [21:24:26] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423553 (owner: 1020after4) [21:26:00] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423553 (owner: 1020after4) [21:26:50] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 [21:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:04] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 14s) [21:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:25] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423553 (owner: 1020after4) [21:32:59] @herron our office IP address was blacklisted by http://dnsbl.spfbl.net can you check for a blacklist de-list that was sent to postmaster@wikimedia.org ? [21:37:19] (03PS1) 10Dzahn: cache::misc: add apache-fast-test script [puppet] - 10https://gerrit.wikimedia.org/r/423557 [21:45:45] (03CR) 10Dzahn: "Can't locate Net/DNS/Resolver.pm in @INC (..." 
[puppet] - 10https://gerrit.wikimedia.org/r/423557 (owner: 10Dzahn) [21:48:38] twentyafterfour: I don't see anything crazy atm [21:50:01] AaronSchulz: yeah looks better after the revert [21:53:48] PROBLEM - Check whether ferm is active by checking the default input chain on bromine is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [21:55:30] (03PS1) 10Odder: Update logo for the Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423562 (https://phabricator.wikimedia.org/T191174) [21:59:49] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/libs/CSSMin.php: (no justification provided) (duration: 01m 16s) [21:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:51] (03PS1) 1020after4: all wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423566 [22:05:53] (03CR) 1020after4: [C: 032] all wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423566 (owner: 1020after4) [22:07:10] (03Merged) 10jenkins-bot: all wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423566 (owner: 1020after4) [22:08:21] (03CR) 10jenkins-bot: all wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423566 (owner: 1020after4) [22:08:42] (03PS3) 10EddieGP: apache, wwwportals: De-duplicate vhost code [puppet] - 10https://gerrit.wikimedia.org/r/397770 [22:11:11] (03CR) 10EddieGP: "Besides the initial testing of whether the general setup works, I've now done it more throughoutly and fixed some stuff related to casing " [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [22:11:56] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: all wikis to 1.31.0-wmf.27 [22:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:43] (03CR) 10MusikAnimal: [C: 031] Make a note about the loading order of GlobalPreferences and Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422642 (https://phabricator.wikimedia.org/T190353) (owner: 10Samwilson) [22:14:48] (03CR) 10Dzahn: "would also need to install perl module to do DNS lookups though..." [puppet] - 10https://gerrit.wikimedia.org/r/423557 (owner: 10Dzahn) [22:20:45] (03PS1) 10Ottomata: Use consistent client.id for mirrormaker producer and consumer [puppet] - 10https://gerrit.wikimedia.org/r/423570 (https://phabricator.wikimedia.org/T189464) [22:21:01] (03PS2) 10Ottomata: Use consistent client.id for mirrormaker producer and consumer [puppet] - 10https://gerrit.wikimedia.org/r/423570 (https://phabricator.wikimedia.org/T189464) [22:21:03] (03PS2) 10Dzahn: cache::misc: add codfw backend for webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/423080 (https://phabricator.wikimedia.org/T188163) [22:21:36] (03CR) 10Dzahn: [C: 032] "tested new backend from tin with apache-fast-test while disabling ferm rule for a few seconds." 
[puppet] - 10https://gerrit.wikimedia.org/r/423080 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [22:22:36] (03CR) 10Ottomata: [C: 032] Use consistent client.id for mirrormaker producer and consumer [puppet] - 10https://gerrit.wikimedia.org/r/423570 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:22:41] (03PS3) 10Ottomata: Use consistent client.id for mirrormaker producer and consumer [puppet] - 10https://gerrit.wikimedia.org/r/423570 (https://phabricator.wikimedia.org/T189464) [22:22:43] (03CR) 10Ottomata: [V: 032 C: 032] Use consistent client.id for mirrormaker producer and consumer [puppet] - 10https://gerrit.wikimedia.org/r/423570 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:23:58] (03PS1) 10Bstorm: wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 [22:24:46] (03PS1) 10Ottomata: Fix client_id with ${::hostname} MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423573 (https://phabricator.wikimedia.org/T189464) [22:25:04] (03PS2) 10Bstorm: wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 [22:25:54] (03CR) 10Ottomata: [C: 032] Fix client_id with ${::hostname} MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/423573 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:27:09] (03PS1) 10Ottomata: Capture client_id from prometheus with : in the name [puppet] - 10https://gerrit.wikimedia.org/r/423575 (https://phabricator.wikimedia.org/T189464) [22:27:11] (03PS3) 10Bstorm: wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 [22:27:23] (03CR) 10Ottomata: [V: 032 C: 032] Capture client_id from prometheus with : in the name [puppet] - 10https://gerrit.wikimedia.org/r/423575 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:28:33] (03CR) 10Bstorm: [C: 032] wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 (owner: 10Bstorm) [22:28:49] (03PS4) 10Bstorm: wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 [22:28:55] (03CR) 10Bstorm: [V: 032 C: 032] wiki replicas: tighten permissions on some configs [puppet] - 10https://gerrit.wikimedia.org/r/423572 (owner: 10Bstorm) [22:31:28] (03PS1) 10Ottomata: Can't use ':' in client.id [puppet] - 10https://gerrit.wikimedia.org/r/423578 (https://phabricator.wikimedia.org/T189464) [22:31:46] (03CR) 10Ottomata: [V: 032 C: 032] Can't use ':' in client.id [puppet] - 10https://gerrit.wikimedia.org/r/423578 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:31:50] (03PS2) 10Ottomata: Can't use ':' in client.id [puppet] - 10https://gerrit.wikimedia.org/r/423578 (https://phabricator.wikimedia.org/T189464) [22:31:52] (03CR) 10Ottomata: [V: 032 C: 032] Can't use ':' in client.id [puppet] - 10https://gerrit.wikimedia.org/r/423578 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [22:34:57] (03CR) 10Mobrovac: [C: 04-1] pdfrender: Tell SystemD to log directly into a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) (owner: 10Mobrovac) [22:40:05] 10Operations, 10Availability, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#4098900 (10Dzahn) [22:41:03] !log twentyafterfour@tin Synchronized 
php-1.31.0-wmf.27/includes/libs/CSSMin.php: sync https://gerrit.wikimedia.org/r/423574 (duration: 00m 58s) [22:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:24] 10Operations, 10Availability, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#3998458 (10Dzahn) - tested codfw backend with apache-fast-test - added codfw backend to make service active/active - have... [22:42:07] (03CR) 10Mobrovac: [C: 04-1] "Hm, looking at the options we have in our systemd version, we're not going to get anywhere in this way." [puppet] - 10https://gerrit.wikimedia.org/r/423525 (https://phabricator.wikimedia.org/T191191) (owner: 10Mobrovac) [22:44:23] (03PS1) 10Dzahn: misc_static_sites: temp disable bromine backend for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180402T2300). [23:00:04] amir1 and odder: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:14] o/ [23:03:07] Is there anyone that can check for mail from postmaster@wikimedia.org? [23:03:52] o/ [23:04:03] Sorry I'm late [23:08:01] No worries Amir1, looks like things might be a little bit delayed, no one has spoken yet ;) [23:08:38] I can SWAT if no one shows up until :12 [23:10:13] byron: mail sent from that address via our servers or mail to that address? Either way a root probably can. herron is on clinic duty, but probably off for the day. If its not urgent then I'd suggest a phab task. [23:12:53] Okay, I can SWAT [23:13:06] bf808: It's pressing. Our outgoing office IP (198.73.209.241) changed, and was blacklisted becuase there was not a reverse pointer. I have since fixed the issue. We're having connectivity issues to some sites, and need someone to click a link in the email sent to postmaster@wikimedia.org [23:13:55] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423562 (https://phabricator.wikimedia.org/T191174) (owner: 10Odder) [23:14:21] bd808: It's pressing. Our outgoing office IP (198.73.209.241) changed, and was blacklisted becuase there was not a reverse pointer. I have since fixed the issue. We're having connectivity issues to some sites, and need someone to click a link in the email sent to postmaster@wikimedia.org [23:14:36] Amir1: I'm having a look at your patch, I guess MF simply scales the file down to the requested size? [23:14:56] I doubt that [23:15:09] Amir1: i.e. the SVG is 155 x 57 px, and in the config it's 50 x 18 [23:15:12] byron: ok. let me see if I can find somebody who can find that for you. 
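On byron's de-listing problem: both halves of it can be checked from any host with standard DNS tooling. The IP and the DNSBL zone below are the ones quoted earlier in the log (21:32, 23:13); everything else is generic:

    dig +short -x 198.73.209.241              # the PTR record the list complained was missing
    host 198.73.209.241                       # same lookup, different tool
    # most DNSBLs, presumably including this one, answer listing queries on the
    # reversed octets under their zone; NXDOMAIN generally means "not listed"
    dig +short 241.209.73.198.dnsbl.spfbl.net A

The de-listing confirmation mail itself still needs someone with access to the postmaster@wikimedia.org alias, which is what byron is asking for here.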
[23:15:22] (03Merged) 10jenkins-bot: Update logo for the Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423562 (https://phabricator.wikimedia.org/T191174) (owner: 10Odder) [23:15:25] odder: yeah, tha's correct [23:15:47] i've got another config change to drop in swat: https://gerrit.wikimedia.org/r/#/c/423532/ [23:16:53] (03PS4) 10Dereckson: Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:17:03] Amir1: OK, just wondering as the other wordmarks are 133 x 21, but I guess it should be fine it it scales it down [23:17:42] (03CR) 10Dereckson: [C: 031] "PS4: rebased, fixing domain names (prepend www where needed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:17:42] yeah, the discussion in the patch is about whether the given scale is right or not, what do you think of that? [23:18:19] odder: your patch is in wmdebug1002 [23:18:20] (03CR) 10jenkins-bot: Update logo for the Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423562 (https://phabricator.wikimedia.org/T191174) (owner: 10Odder) [23:18:50] Dear ops people, right now we should deploy from tin, right? [23:19:02] Amir1: Looks OK to me [23:19:35] bd808: thanks [23:19:42] cool [23:20:06] let me just double check deployment server with ops before moving forward. [23:20:53] Amir1: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers should be tin [23:21:53] Amir1: use deployment.eqiad.wmnet as the hostname, it will always send you the right place (even codfw, if needed) [23:22:25] yeah, ops sent an email that this server is being decommissioned in favour of deploy1001 but that got rolled back, I haven't checked up the thread after that [23:22:37] as per https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Purging the logos need to be purged, otherwise people will still be seeing the old logo for a while [23:23:10] ebernhardson: yeah, you're right. The reason I avoid it is that every time it gives me fingerprint warning and I need to double check [23:23:32] Amir1: it only gives the warning if it's changed, which is perhaps good to know :) [23:23:39] yup [23:23:48] tin it is, let's move forward [23:26:55] !log ladsgroup@tin Synchronized static/images/project-logos: [[gerrit:423562|Update logo for the Persian Wikipedia (T191174)]] (duration: 00m 59s) [23:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:01] T191174: Update fawiki logo - https://phabricator.wikimedia.org/T191174 [23:29:04] !log Persian Wikipedia logos have been purged using purgeList.php on terbium [23:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:14] odder: Can you double check if everything looks fine? [23:29:25] Hello [23:29:44] Amir1: do you have still time for an extra config change? [23:29:53] Dereckson: yup [23:30:14] Okay, preparing that and editing Deployments page [23:30:17] thanks [23:31:10] (03PS3) 10Ladsgroup: Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:31:26] ebernhardson: you're next :) [23:31:31] sweet [23:31:35] any order in deployment of the files? 
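The purge step Amir1 brings up (per the SWAT deployers page linked above) is done with purgeList.php, which accepts page titles or full URLs on stdin and issues the corresponding CDN purges. A sketch of what the 23:29 purge in the next lines plausibly looked like; mwscript is the standard multiversion wrapper on the maintenance hosts, and the exact URL list is not in the log, so the file names and variants are illustrative only:

    # on the maintenance host
    mwscript purgeList.php --wiki=fawiki <<'EOF'
    https://fa.wikipedia.org/static/images/project-logos/fawiki.png
    https://fa.wikipedia.org/static/images/project-logos/fawiki-1.5x.png
    https://fa.wikipedia.org/static/images/project-logos/fawiki-2x.png
    EOF

Getting the host part of each URL right matters: purging the same paths under a different domain leaves the fa.wikipedia.org copies cached, which is exactly the hiccup that shows up later in this window.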
[23:31:56] Amir1: test file is a noop, so doesn't matter [23:32:14] (noop with respect to deployment at least) [23:32:15] okay, ebernhardson. I guess it's not testable [23:32:27] in mwdebug1002 [23:32:34] it is, i can double check a few [23:33:08] cool [23:33:31] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:33:33] Amir1: all looks happy [23:34:02] I will get it to mwdebug1002 in one minute [23:34:44] (03Merged) 10jenkins-bot: Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:35:21] Amir1: Sorry about the delay, I was checking those logos, took me a while [23:35:34] it's okay [23:35:42] Amir1: fawiki.png looks exactly the same to me, so not good [23:35:53] Amir1: you purged it? [23:35:58] odder: even with ?debug=true ? [23:36:05] Dereckson: yup [23:36:13] I guess it might be behind varnish [23:36:15] ottomata: Is there a recommended way to delete a consumer group in kafka? Would like to clean up some of my coal experiment data (tmp-krinkle etc.) given it is exposed on Grafana currently. [23:36:21] I checked on mwdebug1002 and there's the new version there [23:36:36] Amir1: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Purging [23:36:46] Dereckson: Nah, it looks oK with ?debug=true [23:36:57] Krinkle: I did that [23:37:11] I even logged it [23:38:05] ebernhardson: your patch is live in mwdebug1002 [23:38:09] That sometimes fail. [23:38:23] (03CR) 10jenkins-bot: Shift all search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423532 (https://phabricator.wikimedia.org/T191236) (owner: 10EBernhardson) [23:38:53] I re-done all again [23:39:04] ok looking [23:41:10] odder: Krinkle Dereckson It's fixed now, I pruged fa.wikipedia.org instead of en.wikipedia.org [23:41:17] Amir1: still looks happy [23:41:22] I thought it works but it seems it's not [23:41:33] ebernhardson: cool, moving forward [23:42:00] Amir1: Yay, it does work OK now, many thanks [23:42:15] sorry for the mistake :/ [23:42:17] I'll have a question on PM for you regarding something else, I guess I'll wait a couple of minutes [23:44:07] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:423532|Shift all search traffic to codfw (T191236)]] (duration: 00m 59s) [23:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:13] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236 [23:45:01] odder: sure thing [23:45:08] ebernhardson: it's live everywhere [23:45:15] Amir1: the change is https://gerrit.wikimedia.org/r/405550 Add several domains of Ukraine government to wgCopyUploadsDomains [23:45:30] ebernhardson: please test and let me know [23:45:36] !log ladsgroup@tin Synchronized tests/cirrusTest.php: [[gerrit:423532|Shift all search traffic to codfw, part II (T191236)]] (duration: 00m 58s) [23:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:23] Amir1: yup i can see all the traffic moving. thanks. 
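Whether a purge actually took effect at the edge can be checked from anywhere, which is essentially what odder does above by reloading the logo. A curl version of the same check; the URL is the canonical project-logo path and the header names are the ones the Wikimedia caches commonly expose, both assumptions rather than quotes from this log:

    curl -sI 'https://fa.wikipedia.org/static/images/project-logos/fawiki.png' \
      | grep -iE '^(age|x-cache|last-modified|content-length):'

A small Age plus a changed Content-Length or Last-Modified means the new file is being served; a large Age means a cached copy survived the purge, which is what the en.wikipedia.org/fa.wikipedia.org mix-up produced until the second purge.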
[23:46:23] Dereckson: on it [23:46:31] ebernhardson: \o/ [23:46:39] (03PS5) 10Ladsgroup: Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:50:05] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:50:13] Dereckson: testable? [23:50:16] yes [23:50:21] cool [23:51:18] (03Merged) 10jenkins-bot: Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:51:32] (03CR) 10jenkins-bot: Add several domains of Ukraine government to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405550 (https://phabricator.wikimedia.org/T185399) (owner: 10Urbanecm) [23:52:24] Dereckson: it's live in mwdebug1002 [23:52:37] testing [23:52:53] works [23:53:04] yay [23:53:06] let's go [23:56:04] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:405550|Add several domains of Ukraine government to wgCopyUploadsDomains (T185399)]] (duration: 00m 59s) [23:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:10] T185399: Please add numerous *.gov.ua domains to the wgCopyUploadsDomains whitelist - https://phabricator.wikimedia.org/T185399 [23:56:11] Thanks for the deploy. [23:56:47] cool [23:56:51] :) [23:56:59] now let's move forward to my patch [23:57:28] (03CR) 10Ladsgroup: [C: 032] "SWAT. I see this -1 as invalid." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423497 (https://phabricator.wikimedia.org/T191176) (owner: 10Ladsgroup)
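On the wordmark change that gets its +2 at the end here: the earlier back-and-forth was about whether the configured 50 x 18 size is right for a 155 x 57 px source SVG. Scaling the source to a 50px width gives roughly an 18.4px height, so the configured pair preserves the aspect ratio to within rounding; a throwaway check (bc chosen only because it is handy):

    echo 'scale=2; 57 * 50 / 155' | bc        # 18.38

That only settles the geometry; as Amir1 notes above, the open question in the review is whether that rendered scale is the right choice, which is what the +2 over the earlier -1 decides.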